# Web Scraping using Python
By Shuhei Kitamura

- A great deal of data is available online.
    - Many online sources do not look like data at first glance, but they can become useful data once you know how you want to use them.
- You can download such data automatically using Python.
- For example, go to BBC News [https://www.bbc.com/news](https://www.bbc.com/news).
- Right-click (for Windows users) anywhere on the page and choose "Inspect Element" (called "Inspect" in some browsers).
    - This shows the source code of the page.
- Web scraping means extracting information from a page using its source code.

### Outline<a id='top'></a>
1. [Website Basics](#sec1)
2. [Web Scraping](#sec2)
    1. [No API + No additional operations](#sec2_1)
    2. [No API + Additional operations](#sec2_2)
    3. [API](#sec2_3)

## 1. Website Basics<a id='sec1'></a>
- A web page is typically built with the following three core technologies.
    - HTML (HyperText Markup Language) for structure
    - CSS (Cascading Style Sheets) for style
    - JavaScript ($\neq$ Java) for interactive web pages
    
[back to top](#top)

#### HTML
- HTML defines the structure of a web page.
- A web page is made of many HTML tags. Typical tags are:
    - `<html></html>`: Define a page
    - `<head></head>`: Container for all the head elements
        - `<title></title>`: Title
    - `<body></body>`: Document's body
        - `<h1></h1>`, `<h2></h2>`...: Headers
        - `<div></div>`: A section
        - `<p></p>`: A paragraph
        - `<br>`: Line break
        - `<b></b>`: Bold
- For other tags, see, e.g., [this page](https://www.w3schools.com/html/default.asp).
- HTML elements in `<body></body>` are categorized as block and inline elements.
    - Block elements: `<div>`, `<h1>`, etc.
    - Inline elements: `<br>`, `<b>`, etc.

#### HTML Example:
```html
<!DOCTYPE html>
<html>
    <head>
        <title>Christmas songs</title>
    </head>
    <body>
        <h1>Christmas songs</h1>
        <div>
            <p class="title"><b>Jingle Bells</b></p>
            <p class="lyrics">Jingle bells, jingle bells, jingle all the way.
                <a href="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQCBU3bFIE3cd-v4mNvJJOyuU5EQzGZ5NGiW6jZcZPaiOaMuO3l" id="link1">Santa Claus</a>
            </p>
        </div>
    </body>
</html>
```
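
- To see how the tags nest, the page can be walked tag by tag. The following is a minimal sketch using `html.parser` from Python's standard library. The page text is a shortened variant of the example above with a placeholder link, and `TagLister` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# A shortened variant of the example page (the link URL is a placeholder).
PAGE = """<!DOCTYPE html>
<html>
    <head>
        <title>Christmas songs</title>
    </head>
    <body>
        <h1>Christmas songs</h1>
        <div>
            <p class="title"><b>Jingle Bells</b></p>
            <p class="lyrics">Jingle bells, jingle bells, jingle all the way.
                <a href="https://example.com/santa" id="link1">Santa Claus</a>
            </p>
        </div>
    </body>
</html>"""


class TagLister(HTMLParser):
    """Collect every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # list of (depth, tag name, attributes)

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


lister = TagLister()
lister.feed(PAGE)

# Print one tag per line, indented by its depth in the document tree.
for depth, tag, attrs in lister.tags:
    print("    " * depth + tag, attrs)
```

- The indentation of the printed tags mirrors the structure of the page: `title` sits inside `head`, and `p` sits inside `div`, which sits inside `body`.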

#### CSS
- CSS defines the style of a web page.
- Typical CSS properties are:
    - `background-color`: Background color
    - `color`: Color of texts 
    - `font-family`: Font
    - `font-size`: Font size
- See, e.g., [this page](https://www.w3schools.com/css/default.asp).

#### HTML + CSS Example:
```html
<!DOCTYPE html>
<html>
    <head>
    <title>Christmas songs</title>
    <style>
    body {
        background-color: green;
    }

    h1 {
        color: red;
        text-align: center;
        font-family: Comic Sans MS;
    }

    p {
        font-family: Comic Sans MS;
        font-size: medium;
    }
    </style>
    </head>
    <body>
        <h1>Christmas songs</h1>
        <div>
            <p class="title"><b>Jingle Bells</b></p>
            <p class="lyrics">Jingle bells, jingle bells, jingle all the way.
                <a href="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQCBU3bFIE3cd-v4mNvJJOyuU5EQzGZ5NGiW6jZcZPaiOaMuO3l" id="link1">Santa Claus</a>
            </p>
        </div>
    </body>
</html>
```

#### JavaScript
- JavaScript makes the page more interactive.
- See, e.g., [this page](https://www.w3schools.com/js/default.asp).

#### HTML + CSS + JavaScript Example:
```html
<!DOCTYPE html>
<html>
    <head>
    <title>Christmas songs</title>
    <style>
    body {
        background-color: green;
    }

    h1 {
        color: red;
        text-align: center;
        font-family: Comic Sans MS;
    }

    p {
        font-family: Comic Sans MS;
        font-size: medium;
    }
    </style>
    </head>
    <body>
        <h1>Christmas songs</h1>
        <div>
            <p class="title"><b>Jingle Bells</b></p>
            <p class="lyrics">Jingle bells, jingle bells, jingle all the way.
                <a href="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQCBU3bFIE3cd-v4mNvJJOyuU5EQzGZ5NGiW6jZcZPaiOaMuO3l" id="link1">Santa Claus</a>
            </p>
            <button type="button"
                onclick="document.getElementById('time').innerHTML = Date()">
                Click here to check the date and time</button>
            <p id="time"></p>
        </div>
    </body>
</html>
```

## 2. Web Scraping<a id='sec2'></a>
- There are several ways to scrape data from the web.
    1. Using a web API (Application Programming Interface)
    2. Without using an API
        - (a) Additional operations are needed
            - E.g., entering a password, clicking a button...
        - (b) Additional operations are not needed
- The following modules/packages are often used for web scraping:
    - `BeautifulSoup4` + `requests`
    - `Selenium`
            
[back to top](#top)    

###  A. No API + No additional operations<a id='sec2_1'></a>
- Your task: Get a table of GDP per capita (PPP adjusted) and country names from a Wikipedia page.
- Go to [https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita).
            
[back to top](#top)

In [None]:
import re
import pandas as pd
import requests
from bs4 import BeautifulSoup

- First, download the page at the URL using `requests`.
- Then, parse the source code using `BeautifulSoup`.
- `BeautifulSoup` supports several parsers; `"lxml"` is a common choice (it must be installed separately).
    - Other parsers: `"html.parser"`, `"xml"`, and `"html5lib"` (see [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)).

In [None]:
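response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita') # download the page
response.raise_for_status() # stop here if the request failed (e.g., 403 or 404)
soup = BeautifulSoup(response.content, 'lxml') # parse the HTML
#print(soup.prettify())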
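
- The parsing step can also be tried offline. The following sketch parses a made-up HTML string instead of `response.content` and uses the built-in `"html.parser"`, so that neither a network connection nor `lxml` is assumed.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for `response.content`.
html = """
<html>
    <head><title>Christmas songs</title></head>
    <body>
        <p class="title"><b>Jingle Bells</b></p>
        <a href="https://example.com/santa" id="link1">Santa Claus</a>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install

print(soup.title.string)            # text inside <title>
print(soup.find("p").get("class"))  # 'class' is returned as a list of class names
print(soup.find("a")["href"])       # attributes can be read like dictionary keys
```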

- Next, access the relevant table, whose `'class'` is `'wikitable sortable'`.
    - The `'table'` tag defines a table (see, e.g., [this page](https://www.w3schools.com/html/html_tables.asp)).
    - `'class'` names a group of elements. A similar attribute is `'id'`. The same `'class'` can be used many times, while an `'id'` should be used only once in a page.

In [None]:
wiki_table = soup.find_all('table', {'class':'wikitable sortable'}) # tables whose class is 'wikitable sortable'
print(wiki_table[0])
#print(wiki_table[0].text)
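
- Passing `'wikitable sortable'` as the class matches the full attribute string, so a table whose classes are listed in a different order, or with extra classes, is not found. The following sketch compares three ways of selecting by class on made-up tables (the `id` values are only labels for this example).

```python
from bs4 import BeautifulSoup

# Four made-up tables with different class attributes.
html = """
<table class="wikitable sortable" id="t1"><tr><td>A</td></tr></table>
<table class="sortable wikitable" id="t2"><tr><td>B</td></tr></table>
<table class="wikitable sortable extra" id="t3"><tr><td>C</td></tr></table>
<table class="plain" id="t4"><tr><td>D</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Match on the full attribute string.
exact = soup.find_all("table", {"class": "wikitable sortable"})

# Match on a single class: every table that has 'wikitable' among its classes.
single = soup.find_all("table", class_="wikitable")

# CSS selector: tables that have both classes, in any order.
both = soup.select("table.wikitable.sortable")

print([t["id"] for t in exact])
print([t["id"] for t in single])
print([t["id"] for t in both])
```

- If the page adds or reorders classes in the future, the CSS selector is the more robust choice.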

- Next, make a list of country names.
- To do so, we first find all links in the table (because the links also carry the country names).
    - `'a'` is the anchor tag, which defines a link.

In [None]:
links = wiki_table[0].find_all('a') # or use `select`, which takes a CSS selector
#links = wiki_table[0].find_all('a', string=re.compile("J")) # get all links whose text contains "J"
print(len(links))
print(links)
#import pprint as pp # print data in a pretty way
#pp.pprint(links)

- Save the country names as a list using the `'title'` attribute of each link.

In [None]:
name = [] # make a list
for link in links:
    name.append(link.get('title'))
print(name)
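
- Not every link necessarily has a `'title'` (for example, a footnote link may not), and `link.get('title')` returns `None` in that case. The following sketch, on a made-up table, shows one way to keep only links that have a title.

```python
from bs4 import BeautifulSoup

# A made-up table: two country links and one footnote link without a title.
html = """
<table>
  <tr><td><a href="/wiki/Japan" title="Japan">Japan</a></td></tr>
  <tr><td><a href="/wiki/Norway" title="Norway">Norway</a><a href="#cite_note-1">[1]</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# .get() returns None for links that have no 'title' attribute.
all_titles = [link.get("title") for link in links]

# title=True keeps only the links that have a 'title' attribute.
names = [link["title"] for link in soup.find_all("a", title=True)]

print(all_titles)
print(names)
```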

- Next, let's make a list of GDP per capita (PPP).
- To do so, find all `'td'`s whose text contains `'\n'`.
    - HTML: `'td'` is a cell in a table.
    - `'\n'` means a line break.
- Note: If your computer's default language is not Japanese, you may see a backslash instead of a yen mark. That's fine.

In [None]:
tds = wiki_table[0].find_all('td')
#print(tds)
print([item.text for item in tds if re.search('\n', item.text)])
gdppc = [item.text.strip() for item in tds if re.search('\n', item.text)] # remove '\n'
print(gdppc)
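
- The scraped values are still strings such as `'131,782'`. The following sketch shows one way to turn such strings into numbers. The cell texts are made up and `to_number` is a hypothetical helper name.

```python
import re

# Made-up cell texts of the kind `item.text` may return.
raw_cells = ["131,782\n", " 98,283 \n", "n/a\n", "2022\n"]


def to_number(text):
    """Return the cell as an int, or None if it does not look like a whole number."""
    cleaned = text.strip().replace(",", "")  # drop whitespace and thousands separators
    if re.fullmatch(r"\d+", cleaned):
        return int(cleaned)
    return None


values = [to_number(cell) for cell in raw_cells]
print(values)
```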

- Finally, combine the two lists into a DataFrame.

In [None]:
data = pd.DataFrame({'name':name, 'gdppc':gdppc})

- Print `data`.
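
- `pd.DataFrame` raises an error if the two lists differ in length, so it may help to check the lengths first. The following sketch uses made-up values in place of the scraped lists and also converts the strings to numbers.

```python
import pandas as pd

# Made-up values; in the notebook these would be the scraped `name` and `gdppc` lists.
name = ["Country A", "Country B", "Country C"]
gdppc = ["131,782", "98,283", "n/a"]

# The lists must have the same length to form the columns of one table.
assert len(name) == len(gdppc), "lists must have the same length"

data = pd.DataFrame({"name": name, "gdppc": gdppc})

# Convert strings such as '131,782' to numbers; cells that cannot be parsed become NaN.
data["gdppc"] = pd.to_numeric(
    data["gdppc"].str.replace(",", "", regex=False), errors="coerce"
)
print(data)
```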

### B. No API + Additional operations<a id='sec2_2'></a>
- In some cases, you may also need to enter a search keyword or password and/or press a button before reaching the page that you want to scrape.
- An easy way to automate the entire process is to use the `selenium` package.
- To use the package, you need a driver for Firefox or Google Chrome, or you need to change a setting for Safari. (Selenium 4.6 and later can usually download a matching driver automatically.)
    - [Firefox](https://github.com/mozilla/geckodriver/releases)
    - [Google Chrome](https://sites.google.com/a/chromium.org/chromedriver/downloads)
    - [Safari](https://developer.apple.com/documentation/webkit/testing_with_webdriver_in_safari)
- Notes:
    - To check the version of your browser, click "Help" and then "About Firefox/Google Chrome".
    - Once you download a driver, you need to unzip it.
    - For Mac users using a driver: If you get an error while using a driver, change the system setting so that Mac can access the driver (see, e.g. https://qiita.com/apukasukabian/items/77832dd42e85ab7aa568 (in Japanese)).   
- The following example will use Google Chrome.
            
[back to top](#top)

- Your task: Download "Adjusted net national income per capita (constant 2010 US$)" from the World Bank's database.
- Go to [https://databank.worldbank.org/source/world-development-indicators](https://databank.worldbank.org/source/world-development-indicators).

In [None]:
import numpy as np
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time as t

- Set the location of a driver and open a web page using the driver.

In [None]:
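from selenium.webdriver.chrome.service import Service # Selenium 4 takes the driver path via a Service object
driver = webdriver.Chrome(service=Service('C:/Users/shu/Desktop/chromedriver')) # set the path to the driver. launch a web browser.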
driver.get('https://databank.worldbank.org/source/world-development-indicators') # open a page
t.sleep(10) # wait 10 seconds (this can be any duration). alternatively, use driver.implicitly_wait()

- A useful way of selecting an item in a web page is to use XPath (XML Path Language).
- To get an XPath of an item:
    - Right-click (for Windows users) on that item and select "Inspect Element" to access the source code of the page.
    - Right-click (for Windows users) on the relevant section of the source code, click "Copy", and then "Copy XPath" or "Copy full XPath".
    - The following example uses "full XPath".
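
- To get a feel for how such paths are read, the following sketch applies path expressions to a tiny made-up document using `xml.etree.ElementTree` from the standard library. Note that this module supports only a subset of XPath and is used here only for illustration; it is not what `selenium` uses.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed document (made up for illustration).
doc = """<html>
  <body>
    <div><a>first link</a></div>
    <div>
      <a>second link</a>
      <a>third link</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(doc)  # root is the <html> element

# Full XPath in a browser would be: /html/body/div[2]/a[1]
# ElementTree paths are relative to the element, so the leading '/html' is dropped.
# Indices start at 1: div[2] is the second <div>, a[1] is its first <a>.
second = root.find("body/div[2]/a[1]")
print(second.text)

# './/a' selects every <a> below the root, at any depth.
all_links = [a.text for a in root.findall(".//a")]
print(all_links)
```

- A full XPath depends on the exact layout of the page, so it stops working as soon as the site changes its structure.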

- First, select all countries by clicking a single button.

In [None]:
path_part = '/html/body/form/div[8]/div[2]/div[1]/div[3]/div[2]/div[1]/div'
driver.find_element(By.XPATH, path_part+'/div[2]/div[3]/div/div[2]/div/div/div/div/div[1]/div[3]/div[1]/div[1]/div/a[1]/span[1]').click() # find_element_by_xpath was removed in Selenium 4
t.sleep(5)
# alternatively, use WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, path_part+'/div[2]/div[3]/div/div[2]/div/div/div/div/div[1]/div[3]/div[1]/div[1]/div/a[1]/span[1]'))).click()
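
- A fixed `t.sleep(5)` always waits the full five seconds and fails if the page needs longer. `WebDriverWait` instead checks a condition repeatedly (every 0.5 seconds by default) until it holds or the timeout expires. The idea can be sketched in plain Python as follows. `wait_until` and `element_ready` are hypothetical names for this illustration and are not part of `selenium`.

```python
import time


def wait_until(condition, timeout=30.0, poll=0.5):
    """Call `condition()` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # pause briefly before checking again


# Stand-in for a page element that becomes available after 0.3 seconds.
start = time.monotonic()


def element_ready():
    return "button" if time.monotonic() - start > 0.3 else None


found = wait_until(element_ready, timeout=5, poll=0.1)
elapsed = time.monotonic() - start
print(found, round(elapsed, 1))
```

- The call returns as soon as the condition holds, so the script does not wait longer than necessary.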

- Click the "Series" tab.

In [None]:
driver.find_element(By.XPATH, path_part+'/div[3]/div[1]/h4/a').click()
t.sleep(5) 

- Type "income" in the search box and search.

In [None]:
search_box = path_part+'/div[3]/div[3]/div/div[2]/div/div/div/div/div[1]/div[3]/div[1]/div[2]/div[3]' # container of the search box
driver.find_element(By.XPATH, search_box+'/input').clear() # empty the search box
driver.find_element(By.XPATH, search_box+'/input').send_keys("income") # type the keyword
driver.find_element(By.XPATH, search_box+'/a/span').click() # press the search button
t.sleep(5)
# alternatively, use WebDriverWait(driver, 30).until(EC.visibility_of_element_located((By.XPATH, path_part+'/div[3]/div[3]/div/div[2]/div/div/div/div/div[1]/div[3]/div[1]/div[2]/div[3]/input'))).clear()

- Tick the checkbox of the relevant series.

In [None]:
driver.find_element(By.XPATH, path_part+'/div[3]/div[3]/div/div[2]/div/div/div/div/div[1]/div[5]/ul[1]/li[5]/input').click()
t.sleep(5)
# alternatively, use WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, path_part+'/div[3]/div[3]/div/div[2]/div/div/div/div/div[1]/div[5]/ul[1]/li[5]/input'))).click()

- Click the "Time" tab.

In [None]:
driver.find_element(By.XPATH, path_part+'/div[4]/div[1]/h4/a').click()
t.sleep(5)

- Select all years.

In [None]:
driver.find_element(By.XPATH, path_part+'/div[4]/div[3]/div/div/div/div/div/div/div[2]/div[3]/div[1]/div[1]/div/a[1]/span[1]').click()
t.sleep(10)
# alternatively, use WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, path_part+'/div[4]/div[3]/div/div/div/div/div/div/div[2]/div[3]/div[1]/div[1]/div/a[1]/span[1]'))).click()

- Click "Apply Changes".

In [None]:
driver.find_element(By.XPATH, '/html/body/form/div[8]/div[2]/div[3]/div[7]/div[1]/div[3]/div/a').click()
t.sleep(10) 

- Click "Download options".

In [None]:
driver.find_element(By.XPATH, '/html/body/form/div[7]/div[3]/div/ul/li[5]/a').click()
t.sleep(10)
# alternatively, use WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '/html/body/form/div[7]/div[3]/div/ul/li[5]/a'))).click()

- Choose "CSV" format.
- The download then starts automatically.

In [None]:
driver.find_element(By.XPATH, '/html/body/form/div[7]/div[3]/div/ul/li[5]/ul/li[2]/a').click()
# alternatively, use WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '/html/body/form/div[7]/div[3]/div/ul/li[5]/ul/li[2]/a'))).click()

### C. API<a id='sec2_3'></a>
- Some websites offer an API that accepts your request and returns the data you are looking for.
    - Examples: Twitter, Google Maps, weather forecast services.
    - You should use an API if one is available (it is much faster and avoids imposing a high load on the server).
- For example, the World Bank offers several APIs.
- To use the World Bank's APIs, use the `world_bank_data` package.
    - Alternatively, use the `wbdata` package.
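
- Under the hood, such packages send a request to a URL and parse the structured response. The following sketch builds a URL in the style of the World Bank API (v2) and parses a hand-written sample. The URL layout and the response shape are assumptions based on the public API, `build_url` is a hypothetical helper, and the numbers in the sample are placeholders, not real data.

```python
import json
from urllib.parse import urlencode


def build_url(indicator, countries, **params):
    """Build a request URL in the style of the World Bank API (v2)."""
    base = "https://api.worldbank.org/v2"
    query = urlencode({"format": "json", **params})
    return f"{base}/country/{';'.join(countries)}/indicator/{indicator}?{query}"


url = build_url("NY.ADJ.NNTY.PC.KD", ["GB", "JP", "US"], per_page=500)
print(url)

# Hand-written sample in the assumed response shape: [metadata, list of records].
# The values are placeholders, not real data.
sample = """[
  {"page": 1, "pages": 1, "total": 2},
  [
    {"country": {"id": "JP", "value": "Japan"}, "date": "2001", "value": 2.0},
    {"country": {"id": "JP", "value": "Japan"}, "date": "2000", "value": 1.0}
  ]
]"""

meta, records = json.loads(sample)

# Keep the country code, the year, and the value of each record.
rows = sorted((r["country"]["id"], int(r["date"]), r["value"]) for r in records)
print(rows)
```

- To send the request for real, the URL could be passed to `requests.get(url).json()`.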
            
[back to top](#top)    

In [None]:
import re
import world_bank_data as wb
import matplotlib.pyplot as plt

- First, list all available data sources.

In [None]:
wb.get_sources() 
#wb.get_countries() # get all countries

- Select "World Development Indicators".

In [None]:
wdi = wb.get_indicators(source=2) # select world development indicators
print(type(wdi))
print(wdi)

- Keep time-series data "Adjusted net national income per capita" for Japan, the United Kingdom, and the United States.
    - `'NY.ADJ.NNTY.PC.KD'` means "Adjusted net national income per capita (constant 2010 USD)".

In [None]:
df_wdi = wb.get_series('NY.ADJ.NNTY.PC.KD', country=['GB', 'JP', 'US'], simplify_index=True).to_frame() # to_frame() converts the Series to a DataFrame
print(df_wdi)

- Finally, plot the data.

In [None]:
df_wdi_unstack = df_wdi.unstack(level=0) # unstack data for plot
print(df_wdi_unstack)
df_wdi_unstack.plot()
plt.legend(loc='best')
plt.show()
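
- To see what `unstack(level=0)` does without calling the API, the following sketch applies it to a small table with placeholder values. The (Country, Year) index is an assumption about the layout of `df_wdi`.

```python
import pandas as pd

# Placeholder values in the assumed layout of `df_wdi`: a (Country, Year) MultiIndex.
index = pd.MultiIndex.from_product(
    [["Japan", "United States"], ["2000", "2001"]], names=["Country", "Year"]
)
df = pd.DataFrame({"NY.ADJ.NNTY.PC.KD": [1.0, 2.0, 3.0, 4.0]}, index=index)
print(df)

# Move the Country level from the rows to the columns: one column per country.
wide = df.unstack(level=0)
print(wide)
```

- After unstacking, each country is a separate column indexed by year, which is why `plot()` draws one line per country.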