# Scraping an HTML table into a Pandas dataframe
With basically two statements!

In [None]:
import pandas as pd
import requests # a user-friendly web package

Step 1: Download a web page.  (The example below is a good one because the raw data is potentially very useful for further computation, but the page is cluttered with graphics, ads, and other extras, so simple copy/paste is unlikely to work here.)

In [None]:
page = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")

The response is an object that can be examined.  A status code of 200 means OK.

In [None]:
type(page)

In [None]:
page.status_code

In [None]:
page.content
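
Rather than eyeballing the status code, the check can be automated. The sketch below wraps the download in a hypothetical helper, `get_page` (not part of the original notebook), that sets a timeout and calls `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses. The 10-second timeout is an arbitrary assumption.

```python
import requests


def get_page(url, timeout=10):
    """Download a URL and return the Response, raising on HTTP errors.

    `get_page` and the 10-second default are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx status codes
    return response
```

A call such as `page = get_page(url)` could then stand in for the bare `requests.get(...)` call above, and the rest of the notebook would work unchanged.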

Step 2: Have Pandas scan for tables and return a list of auto-constructed dataframes, one per table.  `read_html` has lots of options; check the documentation.  Caution: tables in the web page don't necessarily correspond to what you visually recognize as tables.  You need to look at the results, locate the dataframe you're after, then clean it up.

In [None]:
frames = pd.read_html(page.content)

In [None]:
len(frames)

In [None]:
frames[0]

In [None]:
frames[1]
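
When a page yields many dataframes, displaying them one at a time gets tedious. As a rough sketch, a small hypothetical helper (`summarize_frames`, not from the original notebook) can list the position, shape, and column names of each dataframe so the interesting one stands out. The two dataframes below are made-up stand-ins for what `pd.read_html` might return.

```python
import pandas as pd


def summarize_frames(frames):
    """Return (position, shape, column names) for each dataframe in a list."""
    return [(i, f.shape, list(f.columns)) for i, f in enumerate(frames)]


# Hypothetical stand-ins for the list that pd.read_html returns.
example_frames = [
    pd.DataFrame({"COUNTRY": ["A", "B"], "AMOUNT": ["1 million", "500"]}),
    pd.DataFrame({"LINK": ["Home", "About", "Contact"]}),
]

for position, shape, columns in summarize_frames(example_frames):
    print(position, shape, columns)
```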

In [None]:
# this is the dataframe we want
df = frames[0]
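
One of the `read_html` options worth knowing is `match`, which keeps only tables containing text that matches a string or regular expression. The sketch below uses a made-up two-table page (the HTML and country names are assumptions for illustration) and wraps the string in `StringIO`. Note that `read_html` needs an HTML parser library such as `lxml` to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with a navigation table and a data table.
html = """
<table>
  <tr><th>MENU</th></tr>
  <tr><td>Home</td></tr>
</table>
<table>
  <tr><th>COUNTRY</th><th>AMOUNT</th></tr>
  <tr><td>Freedonia</td><td>12 million</td></tr>
  <tr><td>Sylvania</td><td>98600</td></tr>
</table>
"""

all_frames = pd.read_html(StringIO(html))                 # one dataframe per <table>
matched = pd.read_html(StringIO(html), match="COUNTRY")   # only tables containing "COUNTRY"

print(len(all_frames), len(matched))
print(matched[0])
```

Even with `match`, the result is still a list, so the dataframe has to be pulled out by position.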

Start cleaning the data.

In [None]:
df = df.set_index("COUNTRY")

In [None]:
df

The `AMOUNT` column is text because some column values contain the word "million" (note that Pandas made the entire column text, even those values that don't contain "million").  Let's convert this to numeric.

In [None]:
df["AMOUNT"]

First, identify and select just those values needing conversion.  Then we'll incrementally build up a transformation.

In [None]:
subset = df.loc[ df["AMOUNT"].str.contains("million") , "AMOUNT" ]

In [None]:
subset

In [None]:
subset.str.split(" ")

In [None]:
subset.str.split(" ").str.get(0)

In [None]:
revised_subset = subset.str.split(" ").str.get(0).astype(float)*1e6

The following won't do what we want.  It only rebinds the name `subset` to the new values; `df` is left unchanged.

In [None]:
# won't work!!!
# subset = revised_subset

But this does.

In [None]:
df.loc[df["AMOUNT"].str.contains("million"),"AMOUNT"] = revised_subset
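
The difference between rebinding a name and assigning through `.loc` is easy to see on a toy dataframe. This is a minimal sketch with made-up values; `dtype=object` is used as an assumption to mimic the text column from the scraped table.

```python
import pandas as pd

# Toy data (hypothetical values); dtype=object mimics the scraped text column.
toy = pd.DataFrame(
    {"AMOUNT": ["2 million", "750"]},
    index=["Freedonia", "Sylvania"],
    dtype=object,
)

mask = toy["AMOUNT"].str.contains("million")
subset = toy.loc[mask, "AMOUNT"]
revised = subset.str.split(" ").str.get(0).astype(float) * 1e6

# Rebinding the name leaves the dataframe untouched.
subset = revised
unchanged = toy.loc["Freedonia", "AMOUNT"]  # still the original text

# Assigning through .loc writes into the dataframe itself.
toy.loc[mask, "AMOUNT"] = revised
updated = toy.loc["Freedonia", "AMOUNT"]  # now a float

print(repr(unchanged), repr(updated))
```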

In [None]:
df["AMOUNT"]

Almost done.  But the values in the `AMOUNT` column we *didn't* replace are still text (due to the way Pandas originally constructed the column).  It's simplest to just convert the entire column to float.

In [None]:
df["AMOUNT"] = df["AMOUNT"].astype(float)

In [None]:
df["AMOUNT"]
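
As an alternative to the select-and-replace approach above, the whole conversion can be folded into one function applied to every value. The sketch below is not the notebook's method: `parse_amount` is a hypothetical helper, and it assumes each value is either a plain number or a number followed by the word "million", with no thousands separators.

```python
import pandas as pd


def parse_amount(text):
    """Convert text like '12 million' or '98600' to a float.

    Assumes the only unit word that appears is "million".
    """
    number, _, unit = str(text).partition(" ")
    scale = 1e6 if unit == "million" else 1.0
    return float(number) * scale


# Made-up sample values standing in for the AMOUNT column.
amounts = pd.Series(["12 million", "98600", "1.5 million"], dtype=object)
numeric = amounts.map(parse_amount)

print(numeric)
```

With the real data this would be applied as `df["AMOUNT"].map(parse_amount)`, which converts the matching and non-matching values in a single pass.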

In [None]:
df.describe()

## Using BeautifulSoup for more control
You can use BeautifulSoup, an HTML parser, for greater control in selecting which table to pass to Pandas.

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(page.content, "html.parser")  # name the parser explicitly to avoid a warning

This returns a `BeautifulSoup` object.

In [None]:
type(soup)

HTML elements can be found in various ways.

In [None]:
soup.title

In [None]:
tables = soup.find_all("table")

In [None]:
len(tables)

Each HTML element is actually an object that must be converted to a string before passing to Pandas.

In [None]:
type(tables[0])

Note that Pandas will still return a list even if there's only one dataframe.

In [None]:
from io import StringIO  # newer Pandas versions deprecate passing a raw HTML string

frames = pd.read_html(StringIO(str(tables[0])))
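
Picking a table by position is fragile if the page layout changes. Where a table carries an identifying attribute, BeautifulSoup can select it directly. The sketch below uses a made-up page in which the data table has `id="users"`; that id, the HTML, and the values are assumptions for illustration, and the real page may not offer such an attribute.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical page: a navigation table and a data table with an id.
html = """
<table id="nav"><tr><td>Home</td></tr></table>
<table id="users">
  <tr><th>COUNTRY</th><th>AMOUNT</th></tr>
  <tr><td>Freedonia</td><td>12 million</td></tr>
</table>
"""

example_soup = BeautifulSoup(html, "html.parser")
table = example_soup.find("table", id="users")  # a Tag, or None if nothing matches

# Convert the Tag to a string, wrap it, and let Pandas parse just that table.
selected_frames = pd.read_html(StringIO(str(table)))
selected = selected_frames[0]

print(selected)
```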

Now proceed as before.