# Web scraping 

The following content heavily draws on [Web Scraping with Python](https://proquest.safaribooksonline.com/book/programming/python/9781491985564) (2018) by Ryan Mitchell.

## Definition

Web scraping collects data from the Web by means other than an API. You do this by writing a simple program that queries a web server, requests data, and parses the returned HTML to extract the information you need.


In most cases, collecting data through an API is more convenient and legally safer. When no API exists, you have to scrape the web within *technical*, *legal*, and *ethical* boundaries. The issues around web scraping are complex because they are tied to Internet security, intellectual property, and knowledge as a commons.


## Request and respond

In this tutorial, we use the Wikipedia entry [List of countries ranked by ethnic and cultural diversity level](https://en.wikipedia.org/wiki/List_of_countries_ranked_by_ethnic_and_cultural_diversity_level). The main idea behind web scraping is to write code that mimics what a web browser does. So, we start by learning how to make a request to a website.

![Diversity Index](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e5/List_of_countries_ranked_by_ethnic_and_cultural_diversity_level%2C_List_based_on_Fearon%27s_analysis.png/330px-List_of_countries_ranked_by_ethnic_and_cultural_diversity_level%2C_List_based_on_Fearon%27s_analysis.png)
List of countries ranked by ethnic and cultural diversity level. List based on Fearon's analysis

Also, before doing any web scraping, check the website's terms of service and its [robots.txt](https://en.wikipedia.org/robots.txt). If you want to use Python code to extract information from robots.txt, look at [this code](https://stackoverflow.com/questions/43085744/parsing-robots-txt-in-python).
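
Python's standard library also includes `urllib.robotparser` for this kind of check. The sketch below parses a made-up robots.txt so that it runs offline; the rules and the `example.org` URLs are assumptions for illustration, not Wikipedia's actual rules. For a live site, `set_url()` followed by `read()` would fetch the real file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /wiki/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules permit the request.
allowed = parser.can_fetch("*", "https://example.org/wiki/Some_page")
blocked = parser.can_fetch("*", "https://example.org/private/data")
print(allowed, blocked)  # True False
```

A scraper would typically call `can_fetch` once per URL before requesting it and skip anything that returns `False`.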


In [1]:
# urllib is a standard Python library and contains functions for requesting data across the web
from urllib.request import urlopen 
from urllib.error import HTTPError
from urllib.error import URLError

try:
    page = urlopen('https://en.wikipedia.org/wiki/List_of_countries_ranked_by_ethnic_and_cultural_diversity_level')
except HTTPError as e:
    print(e) # HTTP error, e.g. "404 Not Found" (you messed up) or "500 Internal Server Error" (the server messed up)
except URLError as e:
    print("The server is broken") # No server could be reached 
else:
    print("The site is working") 

The site is working


You can read the requested document with `page.read()`. The result is raw HTML, which is not very informative, especially for those who are less familiar with HTML and CSS. The `requests` library offers a more convenient interface for the same task.

In [2]:
import requests 

page = requests.get('https://en.wikipedia.org/wiki/List_of_countries_ranked_by_ethnic_and_cultural_diversity_level')

print(page.status_code) # check whether the download was successful; 200 means OK

200
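
To make sense of status codes other than 200, the standard library's `http.HTTPStatus` can translate a code into its official phrase. The helper `describe_status` below is a hypothetical name introduced for illustration; it is a minimal sketch, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a short, human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {phrase} ({outcome})"

for code in (200, 404, 500):
    print(describe_status(code))
# 200 OK (success)
# 404 Not Found (client error)
# 500 Internal Server Error (server error)
```

With a real response, you would call it as `describe_status(page.status_code)`.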


## Parse

Python is a great tool for web scraping because its [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) library makes parsing HTML easy. You can install Beautiful Soup in several ways.

- 1. Debian/Ubuntu Linux: type `sudo apt-get install python3-bs4` (the Python 3 package) in a terminal. The same works on Windows, though you should run it in bash.
- 2. Mac: `sudo easy_install pip` (in case you haven't installed pip already), then `pip install beautifulsoup4`


### HTML parser

The most commonly used parser is `html.parser`, which ships with Python. For malformed HTML documents, the `lxml` and `html5lib` parsers work better.

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, "html.parser")

### Parsing HTML

You can inspect the document using `print(soup.prettify())`. After exploring the web site of interest, you can extract parts of the document by identifying specific HTML/CSS tags or attributes. 

In [4]:
print(soup) # Not very informative

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of countries ranked by ethnic and cultural diversity level - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_countries_ranked_by_ethnic_and_cultural_diversity_level","wgTitle":"List of countries ranked by ethnic and cultural diversity level","wgCurRevisionId":867538756,"wgRevisionId":867538756,"wgArticleId":36686681,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with too few wikilinks from March 2017","All articles with too few wikilinks","Articles covered by WikiProject Wikify from March 2017","All articles covered by WikiPr

In [5]:
print(soup.prettify()) # Better

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of countries ranked by ethnic and cultural diversity level - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>
   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_countries_ranked_by_ethnic_and_cultural_diversity_level","wgTitle":"List of countries ranked by ethnic and cultural diversity level","wgCurRevisionId":867538756,"wgRevisionId":867538756,"wgArticleId":36686681,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with too few wikilinks from March 2017","All articles with too few wikilinks","Articles covered by WikiProject Wikify from March 2017","All 

### Extracting a table

In [6]:
wiki_table = soup.find('table', # find element  
                       {'class':'wikitable sortable'}) # find class attribute
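
One detail worth knowing: passing a multi-word string such as `'wikitable sortable'` matches the class attribute as an exact string, so the order of the class names matters. The sketch below shows this on a small made-up HTML snippet (the snippet and the `sample_` names are assumptions for illustration, not the Wikipedia page).

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a page with two tables.
SAMPLE_HTML = """
<table class="infobox"><tr><th>Ignored</th></tr></table>
<table class="sortable wikitable">
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td>Papua New Guinea</td></tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact string match: the classes appear in a different order here, so nothing is found.
exact = sample_soup.find("table", {"class": "wikitable sortable"})

# Matching one class name works no matter which other classes are present.
by_class = sample_soup.find("table", class_="wikitable")

headers = [th.get_text(strip=True) for th in by_class.find_all("th")]
print(exact is None)
print(headers)
```

Searching by a single class name is a little more robust if the page's markup changes slightly.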

#### Specific solution

Now, let's learn how to save the country information from the table using a particular attribute.

In [7]:
country_list = wiki_table('a') # calling a tag is shorthand for find_all('a'): every hyperlink in the table

In [8]:
country_list[10]

<a href="/wiki/Madagascar" title="Madagascar">Madagascar</a>

In [9]:
countries = [] # list container (placeholder)

for country in country_list:
    countries.append(country.get('title')) # get('title') returns only the title attribute,
    # not the whole tag object; it returns None when a link has no title
    
print(countries)

[None, 'Papua New Guinea', 'Tanzania', 'Democratic Republic of the Congo', 'Uganda', 'Liberia', 'Cameroon', 'Togo', 'South Africa', 'Republic of the Congo', 'Madagascar', 'Gabon', 'Kenya', 'Ghana', 'Malawi', 'Guinea-Bissau', 'Somalia', 'India', 'Nigeria', 'Socialist Federal Republic of Yugoslavia', 'Central African Republic', 'Ivory Coast', 'Lebanon', 'Chad', 'Indonesia', 'Mozambique', 'The Gambia', 'Sierra Leone', 'Ethiopia', 'Angola', 'Mali', 'Afghanistan', 'Bolivia', 'United Arab Emirates', 'Senegal', 'Zambia', 'Namibia', 'Soviet Union', 'Sudan', 'Kuwait', 'Burkina Faso', 'Bosnia and Herzegovina', 'Kyrgyzstan', 'Nepal', 'Iran', 'Guinea', 'Kazakhstan', 'Colombia', 'Ecuador', 'Eritrea', 'Trinidad and Tobago', 'Peru', 'Niger', 'Mauritius', 'Mauritania', 'Benin', 'Guyana', 'Djibouti', 'Bhutan', 'Malaysia', 'Canada', 'Latvia', 'Syria', 'Switzerland', 'Kingdom of Yugoslavia', 'Belgium', 'Fiji', 'Saudi Arabia', 'Bahrain', 'Iraq', 'Brazil', 'Mexico', 'Republic of Macedonia', 'Pakistan', 'Is
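
The leading `None` in the list appears because `get('title')` returns `None` for a link without a `title` attribute. A minimal sketch of one way to skip such links, using a made-up table (the footnote-style link is an assumption about what an untitled link might look like):

```python
from bs4 import BeautifulSoup

# Made-up table: the first link has no title attribute.
SAMPLE_HTML = """
<table>
  <tr><td><a href="#cite_note-1">[1]</a></td></tr>
  <tr><td><a href="/wiki/Tanzania" title="Tanzania">Tanzania</a></td></tr>
  <tr><td><a href="/wiki/Uganda" title="Uganda">Uganda</a></td></tr>
</table>
"""
sample_table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")

# title=True keeps only <a> tags that actually have a title attribute.
titled_links = sample_table.find_all("a", title=True)
sample_countries = [link["title"] for link in titled_links]
print(sample_countries)  # ['Tanzania', 'Uganda']
```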

#### General solution

We can scrape the entire table with a loop. We also use regular expressions to tell strings apart from numbers (among other tasks).
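
As a small illustration of telling numbers from strings, the sketch below uses `re.fullmatch`, which requires the whole cell text to look like a number. The pattern and the `classify` helper are assumptions chosen for this example, not the only reasonable choice.

```python
import re

# An integer or a decimal such as "7" or "0.953".
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")

def classify(cell_text: str) -> str:
    """Label a table cell as 'number' or 'string' after trimming whitespace."""
    text = cell_text.strip()
    if NUMBER_PATTERN.fullmatch(text):
        return "number"
    return "string"

for value in ["1", "0.953", "Papua New Guinea", "Rank\n"]:
    print(repr(value), "->", classify(value))
```

Note that a pattern like `[0-9]+` used as a search matches any text that merely contains a digit, which is looser than the full match shown here.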

In [10]:
wiki_table.find_all('th') # heading 
#wiki_table.find_all('tr')[1].find_all('td') # to get some ideas about how looping would work 
#len(wiki_table.find_all('tr')[1].find_all('td'))

[<th>Rank
 </th>, <th>Country
 </th>, <th>Ethnic Fractionalization Index
 </th>, <th>Cultural Diversity Index
 </th>]
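
Before looping over the real table, the row-and-cell pattern hinted at in the commented-out lines can be tried on a small inline table. This is a sketch under stated assumptions: the HTML is made up, with two rows whose values are copied from the Wikipedia table shown later in this tutorial.

```python
from bs4 import BeautifulSoup

# Made-up table with the same four columns as the Wikipedia table.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Country</th><th>Ethnic Fractionalization Index</th><th>Cultural Diversity Index</th></tr>
  <tr><td>2</td><td><a title="Tanzania">Tanzania</a></td><td>0.953</td><td>0.564</td></tr>
  <tr><td>4</td><td><a title="Uganda">Uganda</a></td><td>0.930</td><td>0.647</td></tr>
</table>
"""
sample_table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")

rows = []
for tr in sample_table.find_all("tr"):
    cells = tr.find_all("td")  # the header row has only <th> cells, so this is empty for it
    if not cells:
        continue  # skip rows without data cells
    # get_text(strip=True) returns the cell's text without surrounding whitespace
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
# [['2', 'Tanzania', '0.953', '0.564'], ['4', 'Uganda', '0.930', '0.647']]
```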

In [11]:
import re # regular expression -- string matching method 

# create empty lists
rank = []  
country = []
frac_index = []
div_index = []  

for row in wiki_table.find_all('tr'): # loop over the table rows
    cells = row.find_all(['th','td']) # collect the header and data cells in each row
    if len(cells) == 4: # keep only rows with exactly four cells
        rank.append(cells[0].find(text=True)) # first text node in the cell
        country.append(cells[1].find_all(text=re.compile('[A-Z]+'))) # text nodes containing capital letters
        frac_index.append(cells[2].find(text=re.compile('[0-9]+'))) # first text node containing digits
        div_index.append(cells[3].find(text=re.compile('[0-9]+')))
    else:
        print("something is wrong") # for debugging

**Questions**

The above code told us that something is wrong. Can you guess what caused the problem?

## Turn into a data frame

Combine these lists as parts of the same data frame.

In [37]:
import pandas as pd # convention

div_pd = pd.DataFrame() # create a data frame
 
div_pd['rank'] = rank
div_pd['country'] = country
div_pd['ethnic_fractionalization_index'] = frac_index
div_pd['cultural_diversity_index'] = div_index

In [38]:
div_pd # Oops, the first row is the table header, not data

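|   | rank | country | ethnic_fractionalization_index | cultural_diversity_index |
|---|------|---------|--------------------------------|--------------------------|
| 0 | Rank | [Country ] | | |
| 1 | 1 | [Papua New Guinea] | 1.000 | |
| 2 | 2 | [Tanzania] | 0.953 | 0.564 |
| 3 | 3 | [Democratic Republic of Congo] | 0.933 | 0.628 |
| 4 | 4 | [Uganda] | 0.930 | 0.647 |
| 5 | 5 | [Liberia] | 0.899 | 0.644 |
| 6 | 6 | [Cameroon] | 0.887 | 0.733 |
| 7 | 7 | [Togo] | 0.883 | 0.602 |
| 8 | 8 | [South Africa] | 0.880 | 0.530 |
| 9 | 9 | [Congo] | 0.878 | 0.562 |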


In [40]:
div_pd = div_pd.drop(0) # drop the first row
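
Two further cleanup steps may be useful: the country cells are still lists, and the scraped numbers are still text. The sketch below works on a few sample rows copied from the output above; it uses plain Python strings as a stand-in for the scraped values, which is an assumption made so that the example runs on its own.

```python
import pandas as pd

# Sample rows copied from the output above; row 0 is the scraped header.
sample = pd.DataFrame({
    "rank": ["Rank", "2", "3", "4"],
    "country": [["Country "], ["Tanzania"], ["Democratic Republic of Congo"], ["Uganda"]],
    "ethnic_fractionalization_index": [None, "0.953", "0.933", "0.930"],
    "cultural_diversity_index": [None, "0.564", "0.628", "0.647"],
})

# Drop the header row and renumber the index from 0.
clean = sample.drop(0).reset_index(drop=True)

# Each country cell is a list of strings; join the pieces into one string.
clean["country"] = clean["country"].apply(lambda parts: "".join(parts).strip())

# Convert the text columns to numbers; unparseable values become NaN.
numeric_cols = ["rank", "ethnic_fractionalization_index", "cultural_diversity_index"]
clean[numeric_cols] = clean[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(clean.dtypes)
print(clean)
```

With numeric columns in place, operations such as sorting or `clean.describe()` behave as expected.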

## Export the file

In [36]:
div_pd.to_csv("div_index.csv")
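
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. A minimal sketch of the difference, using an in-memory buffer and two sample rows instead of a file on disk (both are assumptions made to keep the example self-contained):

```python
import io
import pandas as pd

sample = pd.DataFrame({
    "rank": [2, 3],
    "country": ["Tanzania", "Democratic Republic of Congo"],
})

with_index = io.StringIO()
sample.to_csv(with_index)  # default: the index becomes an unnamed first column

without_index = io.StringIO()
sample.to_csv(without_index, index=False)  # write only the data columns

print(with_index.getvalue().splitlines()[0])     # ,rank,country
print(without_index.getvalue().splitlines()[0])  # rank,country

# Reading the first version back exposes the extra column.
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(reloaded.columns.tolist())  # ['Unnamed: 0', 'rank', 'country']
```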