# Web scraping 

The following content draws heavily on [Web Scraping with Python](https://proquest.safaribooksonline.com/book/programming/python/9781491985564) (2018) by Ryan Mitchell.

## Definition

Web scraping collects data from the Web without using an API. You do this by writing a program that queries a web server, requests data, and parses the returned HTML to extract the information you need.


In most cases, collecting data through an API is more convenient and legally safer. When no API exists, however, you have to scrape within *technical*, *legal*, and *ethical* boundaries. The issues around web scraping are complex because they touch on Internet security, intellectual property, and knowledge as a commons.


## Request and response

In this tutorial, we use the Wikipedia entry for the [Democracy Index](https://en.wikipedia.org/wiki/Democracy_Index). The main idea behind web scraping is to write code that mimics what a web browser does, so we start by learning how to make a request to a website.

Before doing any web scraping, check the website's terms of service and its [robots.txt](https://en.wikipedia.org/robots.txt). If you want to use Python to extract information from robots.txt, see [this code](https://stackoverflow.com/questions/43085744/parsing-robots-txt-in-python).
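As one possible approach, the standard library's `urllib.robotparser` module can interpret robots.txt rules. The sketch below parses a small, made-up rule set (the `SAMPLE_ROBOTS_TXT` text and the `example.org` URLs are assumptions for illustration, not Wikipedia's actual rules) so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /wiki/
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules allow the request.
allowed_wiki = parser.can_fetch("*", "https://example.org/wiki/Democracy_Index")
allowed_private = parser.can_fetch("*", "https://example.org/private/data")

print(allowed_wiki, allowed_private)
```

To check a live site instead, you could call `set_url()` with the address of its robots.txt and then `read()` before using `can_fetch()`.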


In [2]:
# urllib is part of the Python standard library and contains functions for requesting data across the web
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    page = urlopen('https://en.wikipedia.org/wiki/Democracy_Index')
except HTTPError as e:
    print(e) # The server answered with an HTTP error, e.g. "404 Not Found" or "500 Internal Server Error"
except URLError as e:
    print("The server could not be reached") # No server could be reached
else:
    print("The site is working")

The site is working


You can read the requested document with `page.read()`. The raw output is not very informative, though, especially if you are less familiar with HTML and CSS.
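A minimal sketch of reading and decoding a response is shown below. `read()` returns bytes, so the text has to be decoded before it can be treated as a string. The `fetch_html` helper is a hypothetical name, UTF-8 is assumed as the encoding, and a temporary local file stands in for the remote page so the example runs offline.

```python
import tempfile
from pathlib import Path
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url):
    """Return the decoded body of `url`, or None if the request fails."""
    try:
        with urlopen(url) as response:
            raw_bytes = response.read()  # read() returns bytes, not str
    except HTTPError as e:
        print("HTTP error:", e)
        return None
    except URLError as e:
        print("Could not reach the server:", e.reason)
        return None
    return raw_bytes.decode("utf-8")  # assumes the page is UTF-8 encoded


# A local file stands in for the remote page so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.html"
    sample_path.write_text(
        "<html><head><title>Democracy Index</title></head></html>",
        encoding="utf-8",
    )
    html_text = fetch_html(sample_path.as_uri())

print(html_text[:40])
```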

In [3]:
import requests # third-party library; an alternative to urllib for making the same request

page = requests.get('https://en.wikipedia.org/wiki/Democracy_Index')

print(page.status_code) # to check whether the download was successful (200 means OK)

200


## Parse

Python is a great tool for web scraping because its [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) library makes parsing HTML easy. You can install Beautiful Soup in several ways.

1. Linux (Debian/Ubuntu): type `sudo apt-get install python-bs4` in a terminal (the package is named `python3-bs4` for Python 3). The same applies on Windows, as long as you run it in Bash.
2. Mac: `sudo easy_install pip` (in case you haven't installed pip already), then `pip install beautifulsoup4`.


### HTML parser

A common choice is `html.parser`, which ships with Python and needs no extra installation. For malformed HTML documents, the `lxml` and `html5lib` parsers work better.
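To make concrete what a parser does, here is a minimal sketch that uses Python's standard-library `html.parser` module directly, which is the module that Beautiful Soup's `"html.parser"` option builds on. The `HeadingCollector` class and the HTML snippet are illustrative assumptions, not part of the tutorial's own code.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text inside every <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag the parser encounters.
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Called for the text between tags.
        if self.in_h1:
            self.headings[-1] += data


collector = HeadingCollector()
collector.feed(
    '<html><body><h1 id="firstHeading">Democracy Index</h1><p>Text</p></body></html>'
)
collector.close()

print(collector.headings)
```

Writing handlers like these by hand quickly becomes tedious, which is why the rest of this tutorial lets Beautiful Soup build a searchable tree instead.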

In [5]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, "html.parser")

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Democracy Index - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Democracy_Index","wgTitle":"Democracy Index","wgCurRevisionId":886913445,"wgRevisionId":886913445,"wgArticleId":8775637,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: Multiple names: authors list","All articles with specifically marked weasel-worded phrases","Articles with specifically marked weasel-worded phrases from November 2015","All articles with unsourced statements","Articles with unsourced statements from April 2018","Commons category link is on Wikidata","Democr

<h1 class="firstHeading" id="firstHeading" lang="en">Democracy Index</h1>

### Parsing HTML

You can inspect the document using `print(soup.prettify())`. After exploring the website of interest, you can extract parts of the document by identifying specific HTML tags or attributes, such as CSS classes.

In [None]:
print(soup) # Not very informative

In [None]:
print(soup.prettify()) # Better
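Once the structure is clear, individual elements can be pulled out by tag name or attribute. The following sketch runs on a hypothetical, heavily shortened stand-in for the downloaded page (`sample_html` is an assumption; only the `<title>` and `<h1>` lines mirror the real output shown earlier).

```python
from bs4 import BeautifulSoup

# A hypothetical, heavily shortened stand-in for the downloaded page.
sample_html = """
<html>
<head><title>Democracy Index - Wikipedia</title></head>
<body>
<h1 class="firstHeading" id="firstHeading" lang="en">Democracy Index</h1>
<p>See the <a href="/wiki/Norway" title="Norway">first entry</a>.</p>
</body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.get_text()  # text inside <title>
heading = sample_soup.find("h1", id="firstHeading").get_text()  # look up by id
first_link = sample_soup.find("a")  # first <a> tag in the document
link_target = first_link["href"]  # attribute access by key
link_title = first_link.get("title")  # .get() returns None if the attribute is missing

print(page_title, heading, link_target, link_title, sep=" | ")
```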

### Extracting a table

In [4]:
wiki_table = soup.find('table',{'class':'wikitable sortable'})

# the same code can be written in multiple ways
# soup.find('table', class_='wikitable sortable')
# also try sortable instead of wikitable sortable. Does it work?
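The different ways of matching on the `class` attribute can be compared on a small example. This is a sketch on made-up HTML (`sample_html` and its two tables are assumptions), showing how Beautiful Soup treats a full class string, a single class, a reordered class string, and a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables; only the second is the one we want.
sample_html = """
<table class="infobox"><tr><td>not the table we want</td></tr></table>
<table class="wikitable sortable"><tr><td>Norway</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact string value of the class attribute, written two ways.
exact = sample_soup.find("table", {"class": "wikitable sortable"})
keyword = sample_soup.find("table", class_="wikitable sortable")

# A single class matches any table that carries it.
single = sample_soup.find("table", class_="sortable")

# The same classes in a different order do not match as a string.
reordered = sample_soup.find("table", class_="sortable wikitable")

# A CSS selector matches both classes regardless of their order.
css = sample_soup.select_one("table.wikitable.sortable")

print(exact is keyword, single is exact, reordered, css is exact)
```

The CSS selector form is usually the more robust choice when a page may list its classes in a different order.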

#### Specific solution

Now, let's learn how to save the country information from the table using a particular attribute.

In [5]:
country_list = wiki_table('a') # shorthand for wiki_table.find_all('a'): every hyperlink in the table

In [6]:
country_list[10]

<a href="/wiki/Netherlands" title="Netherlands">Netherlands</a>

In [7]:
countries = []

for country in country_list:
    countries.append(country.get('title')) # get('title') keeps only the title attribute,
    # not the rest of the Beautiful Soup tag object
    
print(countries)

['Norway', 'Iceland', 'Sweden', 'New Zealand', 'Denmark', 'Republic of Ireland', 'Canada', 'Australia', 'Finland', 'Switzerland', 'Netherlands', 'Luxembourg', 'Germany', 'United Kingdom', 'Austria', 'Mauritius', 'Malta', 'Uruguay', 'Spain', 'South Korea', 'United States', 'Italy', 'Japan', 'Cape Verde', 'Costa Rica', 'Chile', 'Portugal', 'Botswana', 'France', 'Estonia', 'Israel', 'Belgium', 'Taiwan', 'Taiwan', 'Czech Republic', 'Cyprus', 'Slovenia', 'Lithuania', 'Greece', 'Jamaica', 'Latvia', 'South Africa', 'India', 'East Timor', 'Slovakia', 'Panama', 'Trinidad and Tobago', 'Bulgaria', 'Argentina', 'Brazil', 'Suriname', 'Philippines', 'Ghana', 'Poland', 'Colombia', 'Dominican Republic', 'Lesotho', 'Hungary', 'Croatia', 'Malaysia', 'Mongolia', 'Peru', 'Sri Lanka', 'Guyana', 'Romania', 'El Salvador', 'Serbia', 'Mexico', 'Indonesia', 'Tunisia', 'Singapore', 'Hong Kong', 'Namibia', 'Paraguay', 'Senegal', 'Papua New Guinea', 'Ecuador', 'Albania', 'Moldova', 'Georgia (country)', 'Guatemala'
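The list above contains a repeated entry ('Taiwan'), and `.get('title')` returns `None` for any link without a title attribute. A small clean-up step, sketched here on a short hypothetical slice of the titles (`raw_titles` is an assumption), removes both while keeping the original order.

```python
# A short, hypothetical slice of the scraped titles, including a missing
# title (None) and a repeated entry like the one in the output above.
raw_titles = ["Norway", "Iceland", None, "Taiwan", "Taiwan", "Georgia (country)"]

# Drop links that had no title attribute.
titled = [title for title in raw_titles if title is not None]

# dict.fromkeys keeps the first occurrence of each key and preserves order.
unique_titles = list(dict.fromkeys(titled))

print(unique_titles)
```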

#### General solution

We can scrape the entire table with a loop. We also use regular expressions to tell numbers apart from other strings (among other tasks).
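As a rough illustration of how a regular expression separates numeric text from other text, the sketch below applies a slightly stricter pattern than `[0-9]+` to a few cell values of the kind found in this table. The `extract_number` helper and the sample strings are assumptions made for the example.

```python
import re

# One or more digits, optionally followed by a decimal part.
NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+)?")


def extract_number(text):
    """Return the first number found in `text` as a float, or None."""
    match = NUMBER.search(text)  # search() looks anywhere in the string
    return float(match.group()) if match else None


samples = ["9.58\n", "=6", "Full democracy", "11"]
parsed = [extract_number(sample) for sample in samples]

print(parsed)
```

Because `search()` matches anywhere in the string, a tied rank such as `=6` still counts as numeric text; the leading `=` is simply skipped when the number is extracted.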

In [8]:
wiki_table.find_all('th') # heading 
#wiki_table.find_all('tr')[1].find_all('td') # to get some ideas about how looping would work 
#len(wiki_table.find_all('tr')[1].find_all('td'))

[<th data-sort-type="number">Rank
 </th>, <th data-sort-type="text">Country
 </th>, <th data-sort-type="number">Score
 </th>, <th data-sort-type="number" style="line-height: 1em;">Electoral process<br/>and pluralism
 </th>, <th data-sort-type="number" style="line-height: 1em;">Functioning of<br/>government
 </th>, <th data-sort-type="number" style="line-height: 1em;">Political<br/>participation
 </th>, <th data-sort-type="number" style="line-height: 1em;">Political<br/>culture
 </th>, <th data-sort-type="number" style="line-height: 1em;">Civil<br/>liberties
 </th>, <th data-sort-type="number">Category
 </th>, <th data-sort-type="number">Rank
 </th>, <th data-sort-type="text">Country
 </th>, <th data-sort-type="number">Score
 </th>, <th data-sort-type="number" style="line-height: 1em;">Electoral process<br/>and pluralism
 </th>, <th data-sort-type="number" style="line-height: 1em;">Functioning of<br/>government
 </th>, <th data-sort-type="number" style="line-height: 1em;">Political<br/>

In [9]:
import re

number = re.compile('[0-9]+') # matches any text that contains at least one digit

# create empty lists
rank = []
country = []
score = []
electoral = []
government = []
participation = []
culture = []
liberties = []
category = []

for row in wiki_table.find_all('tr'): # loop over the table rows
    cells = row.find_all('td') # the data cells of this row
    if len(cells) == 9: # a complete data row has 9 cells
        rank.append(cells[0].find(string=number)) # string= was named text= before Beautiful Soup 4.4
        country.append(cells[1].find_all(string=True))
        score.append(cells[2].find(string=number))
        electoral.append(cells[3].find(string=number))
        government.append(cells[4].find(string=number))
        participation.append(cells[5].find(string=number))
        culture.append(cells[6].find(string=number))
        liberties.append(cells[7].find(string=number))
        category.append(cells[8].find(string=True))
    else:
        print("Something is wrong") # for debugging

Something is wrong
Something is wrong


**Questions**

The above code told us something is wrong. Can you find what caused the problem?

## Turn into a data frame

Combine these lists into columns of a single data frame.

In [10]:
import pandas as pd # convention

demo_pd = pd.DataFrame() # create a data frame
 
demo_pd['rank'] = rank
demo_pd['country'] = country
demo_pd['score'] = score
demo_pd['electoral'] = electoral
demo_pd['government'] = government
demo_pd['participation'] = participation
demo_pd['culture'] = culture
demo_pd['liberties'] = liberties
demo_pd['category'] = category

In [11]:
demo_pd[1:20]

|  | rank | country | score | electoral | government | participation | culture | liberties | category |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | [ , Iceland] | 9.58 | 10.0 | 9.29 | 8.89 | 10.0 | 9.71 | Full democracy |
| 2 | 3 | [ , Sweden] | 9.39 | 9.58 | 9.64 | 8.33 | 10.0 | 9.41 | Full democracy |
| 3 | 4 | [ , New Zealand] | 9.26 | 10.0 | 9.29 | 8.89 | 8.13 | 10.0 | Full democracy |
| 4 | 5 | [ , Denmark] | 9.22 | 10.0 | 9.29 | 8.33 | 9.38 | 9.12 | Full democracy |
| 5 | =6 | [ , Ireland] | 9.15 | 9.58 | 7.86 | 8.33 | 10.0 | 10.0 | Full democracy |
| 6 | =6 | [ , Canada] | 9.15 | 9.58 | 9.64 | 7.78 | 8.75 | 10.0 | Full democracy |
| 7 | 8 | [ , Australia] | 9.09 | 10.0 | 8.93 | 7.78 | 8.75 | 10.0 | Full democracy |
| 8 | =9 | [ , Finland] | 9.03 | 10.0 | 8.93 | 7.78 | 8.75 | 9.71 | Full democracy |
| 9 | =9 | [ , Switzerland] | 9.03 | 9.58 | 9.29 | 7.78 | 9.38 | 9.12 | Full democracy |
| 10 | 11 | [ , Netherlands] | 8.89 | 9.58 | 9.29 | 8.33 | 8.13 | 9.12 | Full democracy |


**Challenges**

But the values in the country column look odd. What's going on, and how can you fix it?

In [12]:
type(demo_pd['country'][1])

bs4.element.ResultSet

In fact, the solution was already suggested when we learned how to extract a column using a particular HTML attribute. In the end, exploring both ways of scraping a table was not a waste of time.

In [13]:
wiki_table.find_all('tr')[1].find_all('td')[1].find('a').get('title') # the title attribute of the link in the second cell holds the clean country name


'Norway'
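An alternative to looking up the link's title is to join the text fragments that were already collected for each country and strip the surrounding whitespace. The sketch below works on plain lists that imitate the scraped values (`clean_country` and the `scraped` data are assumptions for illustration); a `ResultSet` behaves like a list, so the same function should apply to it.

```python
def clean_country(parts):
    """Join a list of text fragments and strip surrounding whitespace."""
    return "".join(parts).strip()


# Hypothetical fragments imitating the scraped country cells.
scraped = [[" ", "Iceland"], [" ", "New Zealand"], ["\xa0", "Sweden", "\n"]]
cleaned = [clean_country(parts) for parts in scraped]

print(cleaned)
```

On the data frame itself, something like `demo_pd['country'].apply(clean_country)` would apply the same clean-up to every row.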

## Export the file

In [14]:
# demo_pd.to_csv("type the file path where you want to save the data frame")
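As a sketch of what the export step might look like, the example below writes a small data frame to a CSV file in a temporary directory and reads it back to confirm the round trip. The two-row `sample_df` (values copied from the table above) and the file name `democracy_index.csv` are assumptions standing in for `demo_pd` and your own path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A two-row stand-in for demo_pd, built from a dict of columns.
sample_df = pd.DataFrame({
    "rank": ["2", "3"],
    "country": ["Iceland", "Sweden"],
    "score": ["9.58", "9.39"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "democracy_index.csv"  # hypothetical file name
    sample_df.to_csv(csv_path, index=False)  # index=False omits the row numbers
    round_trip = pd.read_csv(csv_path)  # read the file back to check it

print(round_trip)
```

Passing `index=False` keeps the row index out of the file, which avoids an extra unnamed column when the CSV is loaded again.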