Simple Web Scraping Example
===
A simple web scraping example using the Python modules `requests`, `BeautifulSoup`, `lxml`, and `pandas`.

We want to get data from the following web page, which contains a list of countries ordered by number of Internet users:<br>
http://www.nationmaster.com/country-info/stats/Media/Internet-users<br>

1) Start by importing the tools necessary for downloading and parsing web pages.<br>
We may first need to make sure the tools are installed by running this on the command line:<br>
``pip install pandas requests bs4 lxml``

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


2) Download the entire web page into memory using `requests.get()`.<br>
If the request is successful, we now have the content of the web page in memory in `res.content`.

In [2]:
res = requests.get("http://www.nationmaster.com/country-info/stats/Media/Internet-users")
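
The condition "if the request is successful" can be checked explicitly. The following is a minimal sketch, not part of the original notebook: `fetch_page` and its `timeout` default are hypothetical names chosen for illustration, while `raise_for_status()` is the standard `requests` method that raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests


def fetch_page(url, timeout=10):
    """Download a page and fail loudly if the server reports an error.

    fetch_page and the 10-second timeout are illustrative assumptions,
    not something the original notebook defines.
    """
    res = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes; does nothing otherwise.
    res.raise_for_status()
    return res
```

Calling `fetch_page()` with the URL above would then stand in for the bare `requests.get()` call, and the page content would still be available as `res.content`.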

3) Parse the entire web page using `BeautifulSoup` and `lxml`.<br>
If the parsing is successful, we can now access the tree structure of the web page via the `BeautifulSoup` `soup` object.

In [3]:
soup = BeautifulSoup(res.content, 'lxml')

4) Get the table containing the data we are interested in, as a `BeautifulSoup` `Tag` object.<br>
It's the first table on the page.

In [4]:
table = soup.find_all('table')[0]
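
To see what `find_all('table')[0]` selects without downloading anything, the same calls can be tried on a small inline document. This is only a sketch under stated assumptions: `SAMPLE_HTML` and its `id` attributes are made up for illustration, and the built-in `html.parser` is swapped in for `lxml` so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two tables, the first
# one holding the data of interest.
SAMPLE_HTML = """
<html><body>
  <table id="stats">
    <tr><th>COUNTRY</th><th>AMOUNT</th></tr>
    <tr><td>China</td><td>389 million</td></tr>
  </table>
  <table id="other"><tr><td>unrelated</td></tr></table>
</body></html>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

tables = soup.find_all("table")  # every <table> in document order
first = tables[0]                # same selection as in the notebook
also_first = soup.find("table")  # find() returns only the first match

# Text of the header cells in the selected table.
header_cells = [th.get_text(strip=True) for th in first.find_all("th")]
print(first["id"], header_cells)
```

Indexing with `[0]` raises an `IndexError` when the page has no table at all, whereas `soup.find('table')` returns `None` in that case.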

5) Convert the HTML code of the table into a `pandas` `DataFrame`.<br>
`pd.read_html()` returns a list of `DataFrame` objects, one per table it finds, and we want the first one.


In [5]:
from io import StringIO  # recent pandas versions deprecate passing a literal HTML string
df = pd.read_html(StringIO(str(table)))[0]
df

Unnamed: 0,#,COUNTRY,AMOUNT,DATE,GRAPH,HISTORY
0,1,China,389 million,2009,,
1,2,United States,245 million,2009,,
2,3,Japan,99.18 million,2009,,
3,,Group of 7 countries (G7) average (profile),80.32 million,2009,,
4,4,Brazil,75.98 million,2009,,
5,5,Germany,65.12 million,2010,,
6,6,India,61.34 million,2009,,
7,7,Russia,59.7 million,2010,,
8,,Non-religious countries average (profile),51.56 million,2009,,
9,8,United Kingdom,51.44 million,2009,,
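
In the output above, rows with an empty `#` are aggregates (for example the G7 average) rather than countries, and `AMOUNT` is text such as `389 million`. One possible cleanup is sketched below on a hand-typed copy of the first few rows. The names `parse_amount`, `MULTIPLIERS`, and `USERS` are assumptions introduced here, and only the unit `million` actually appears in the output shown.

```python
from decimal import Decimal

import pandas as pd

# Hand-typed copy of the first rows shown above (not scraped here).
df = pd.DataFrame({
    "#": [1, 2, 3, None, 4],
    "COUNTRY": ["China", "United States", "Japan",
                "Group of 7 countries (G7) average (profile)", "Brazil"],
    "AMOUNT": ["389 million", "245 million", "99.18 million",
               "80.32 million", "75.98 million"],
})

# Only "million" occurs in the output shown; the other units are assumptions.
MULTIPLIERS = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def parse_amount(text):
    """Turn a string like '99.18 million' into an integer count."""
    number, unit = text.split()
    # Decimal avoids floating-point rounding on values such as 99.18.
    return int(Decimal(number) * MULTIPLIERS[unit])


# Keep only ranked rows; aggregate rows have no rank in the '#' column.
countries = df[df["#"].notna()].copy()
countries["USERS"] = countries["AMOUNT"].map(parse_amount)
print(countries[["COUNTRY", "USERS"]])
```

`parse_amount` raises a `KeyError` or `ValueError` on any cell that does not follow the `<number> <unit>` pattern, which makes unexpected values easy to spot.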


6) The dataframe is easily converted to other formats, like JSON or, in this case, CSV.
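
Both conversions can be tried on a small stand-in frame first. This is a sketch only: `sample` is a hypothetical two-row frame typed by hand, while `index=False` and `orient="records"` are standard `pandas` options. By default `to_csv()` also writes the row index as a leading unnamed column; `index=False` leaves it out.

```python
import pandas as pd

# Hypothetical two-row stand-in for the scraped DataFrame.
sample = pd.DataFrame({
    "COUNTRY": ["China", "United States"],
    "AMOUNT": ["389 million", "245 million"],
})

# CSV as a string, without the row index column.
csv_text = sample.to_csv(index=False)

# JSON as a string, one object per row.
json_text = sample.to_json(orient="records")

print(csv_text)
print(json_text)
```

Passing a file path as the first argument, as in `sample.to_csv("users.csv", index=False)`, writes to disk instead of returning a string; the file name here is just an example.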

In [6]:
print(df.to_csv())


,#,COUNTRY,AMOUNT,DATE,GRAPH,HISTORY
0,1,China,389 million,2009,,
1,2,United States,245 million,2009,,
2,3,Japan,99.18 million,2009,,
3,,Group of 7 countries (G7) average (profile),80.32 million,2009,,
4,4,Brazil,75.98 million,2009,,
5,5,Germany,65.12 million,2010,,
6,6,India,61.34 million,2009,,
7,7,Russia,59.7 million,2010,,
8,,Non-religious countries average (profile),51.56 million,2009,,
9,8,United Kingdom,51.44 million,2009,,
10,9,France,44.63 million,2010,,
11,10,Nigeria,43.99 million,2009,,
12,11,South Korea,39.4 million,2009,,
13,12,Turkey,35 million,2010,,
14,,Emerging markets average (profile),32.99 million,2009,,
15,13,Mexico,31.02 million,2009,,
16,14,Italy,30.03 million,2010,,
17,15,Spain,29.09 million,2010,,
18,16,Canada,26.96 million,2009,,
19,,East Asia and Pacific average (profile),25.43 million,2009,,
20,,High income OECD countries average (profile),24.75 million,2009,,
21,,Cold countries average (profile),23.69 million,2009,,
22,17,Vietnam,23.3