# 09-Web scraping

This notebook gives an introduction to web scraping in Python using `pandas`.

Web sites often contain large amounts of data.

Many web sites have developed *APIs* in order to grant people access to their data. To use an API, we make a request to their web server and if the request is approved, the API returns the requested data. 
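
As a rough illustration of what working with an API response can look like: many APIs return their data as JSON text. The sketch below skips the actual request and uses a made-up payload instead (the field names `city` and `temperature` are hypothetical and not taken from any real API) to show how such a response can be turned into a `DataFrame`.

```python
import json
import pandas as pd

# hypothetical JSON text, standing in for the body of an API response
response_text = '[{"city": "Bergen", "temperature": 7.5}, {"city": "Oslo", "temperature": 4.0}]'

# parse the JSON text into a list of dictionaries
records = json.loads(response_text)

# each dictionary becomes one row in the DataFrame
df_api = pd.DataFrame(records)
print(df_api)
```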

All APIs are different, and in order to know how to use a web site's API, we must read their API documentation (or find an online tutorial!). See e.g. this [tutorial](https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a) for the Twitter API.

However, a lot of data online is not available through an API. In that case, if we want to extract the data, we must do so through *web scraping*. In web scraping, we write programs that extract information directly from the web site.

Unfortunately, there is no single way of doing web scraping. What type of information we can extract from a web site, and how to extract it, varies from web site to web site. In fact, many web sites do not want people to scrape their content, and therefore make it difficult (in some cases, it might even be illegal to scrape the content from their web sites).

We will look at two different ways that we can use `pandas` to scrape content off the web:
- Import data from URLs
- Read HTML tables

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
plt.style.use('ggplot')

## Import data from URLs

We have seen how to use `read_csv` to import CSV files. However, notice that `read_csv` can also import CSV files directly from a URL (see the [function documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)).

**Example: titanic data**

We have used the `titanic` data throughout this course. This is a common data set to work on when learning how to do data science in Python and R. I collected the data set from [this](https://github.com/datasciencedojo/datasets/blob/master/titanic.csv) user on GitHub.

Instead of downloading the data file to our computer and then importing it using `pandas`, we can import the file directly into our Python program by using the URL.

Let us store the URL in a variable.

In [None]:
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

print(url)

We then pass the URL directly to `read_csv`. 

In [None]:
titanic = pd.read_csv(url)

In [None]:
titanic.head()

Notice that the `titanic` data is a static data set (i.e. it is not likely to change over time). That means that we only need to download it once to our computer. The benefit of importing the data directly from the URL, as opposed to downloading it to our computer first, is therefore small.
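
If we do want to download a static data set only once, one option is to keep a local copy and only use the URL when that copy is missing. The sketch below is one possible way to do this. The function name `read_csv_cached` is our own choice for illustration and is not part of `pandas`.

```python
from pathlib import Path
import pandas as pd

def read_csv_cached(url, path):
    """Read a CSV file from `path` if it exists, otherwise from `url` (and save a copy)."""
    path = Path(path)

    if path.exists():
        # local copy found: no need to contact the web server
        return pd.read_csv(path)

    # no local copy: import from the URL, then store the file for next time
    df = pd.read_csv(url)
    path.parent.mkdir(parents = True, exist_ok = True)
    df.to_csv(path, index = False)
    return df
```

For example, `titanic = read_csv_cached(url, 'data/titanic.csv')` (with a file path of our own choosing) would only contact the web server the first time it is run.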

However, when the data set is dynamic (i.e. more information is being added over time), there can be large gains from importing the data through the URL.

**Example: covid deaths**

The Center for Systems Science and Engineering at Johns Hopkins University has an online repository where they publish data related to covid. The file [time_series_covid19_deaths_global.csv](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv) contains global time series on covid deaths, and it is being updated on a daily basis.


Let us write a program that imports the data, extracts the time series for a specific country and plots the data.

In [None]:
# define url
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

# import file from url
df_full = pd.read_csv(url)

df_full.head()

Let us extract the time series for Norway.

In [None]:
# drop columns
df_full.drop(['Province/State', 'Lat', 'Long'], axis = 1, inplace = True)

# extract country
country = 'Norway'
df_subset = df_full[df_full['Country/Region'] == country].copy()

df_subset.head()

Notice that this is *wide* data. We use `melt` to convert the data to a *long* format.

In [None]:
# melt df
df_subset = df_subset.melt(id_vars = ['Country/Region'], var_name = 'date', value_name = 'deaths')

# convert to datetime
df_subset['date'] = pd.to_datetime(df_subset['date'], format = '%m/%d/%y')

# rename
df_subset.columns = ['country', 'date', 'total']

df_subset
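
To see what `melt` does in isolation, here is a minimal sketch on a made-up wide table (the numbers are invented for illustration and are not taken from the repository).

```python
import pandas as pd

# toy wide data: one row per country, one column per date
df_wide = pd.DataFrame({
    'Country/Region' : ['A', 'B'],
    '1/22/20' : [0, 1],
    '1/23/20' : [2, 3],
})

# long data: one row per country-date pair
df_long = df_wide.melt(id_vars = ['Country/Region'], var_name = 'date', value_name = 'deaths')
print(df_long)
```

The two date columns are stacked on top of each other, so the long table has four rows (two countries times two dates) and three columns.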

We can now use the data to plot the cumulative sum of covid deaths in Norway over time. 

In [None]:
fig, ax = plt.subplots(figsize = (10, 3))

ax.plot(df_subset['date'],
        df_subset['total'])

# set xrange
ax.set_xlim(df_subset['date'].min(), df_subset['date'].max())

# set title
ax.set_title(country + ' (total covid deaths)')


plt.show()

Notice that the `total` column contains the cumulative sum of the deaths over time. If we instead want the number of daily deaths, we can use `diff` to calculate the difference in the number of deaths from the day before.

In [None]:
df_subset['new'] = df_subset['total'].diff()

df_subset
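
A small sketch on invented numbers shows how `diff` relates the two series: the first difference is missing (there is no previous day to compare with), and taking the cumulative sum of the daily values recovers the original series.

```python
import pandas as pd

total = pd.Series([0, 2, 5, 5, 9])   # made-up cumulative counts
new = total.diff()                   # change from the previous row
print(new.tolist())

# cumulative sum of the daily changes, shifted by the initial value
recovered = new.fillna(0).cumsum() + total.iloc[0]
print(recovered.tolist())
```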

In [None]:
fig, ax = plt.subplots(figsize = (10, 3))

ax.plot(df_subset['date'],
        df_subset['new'])

# set xrange
ax.set_xlim(df_subset['date'].min(), df_subset['date'].max())

# set title
ax.set_title(country + ' (new covid deaths)')


plt.show()

The online repository is updated daily, so we can simply re-run the program every time we want the newest numbers.

However, let us improve our program by placing it into two functions.

1. `get_deaths` takes the name of a country, extracts the data for that country and wrangles it into a suitable format. It returns a tidy `DataFrame` containing daily time series for total and new covid deaths.

In [None]:
def get_deaths(country):
    
    # define url
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

    # import file from url
    df_full = pd.read_csv(url)

    # drop columns
    df_full.drop(['Province/State', 'Lat', 'Long'], axis = 1, inplace = True)

    # extract country
    df_subset = df_full[df_full['Country/Region'] == country].copy()
    
    # melt df
    df_subset = df_subset.melt(id_vars = ['Country/Region'], var_name = 'date', value_name = 'deaths')

    # convert to datetime
    df_subset['date'] = pd.to_datetime(df_subset['date'], format = '%m/%d/%y')

    # rename
    df_subset.columns = ['country', 'date', 'total']
    
    # take difference
    df_subset['new'] = df_subset['total'].diff()

    
    return df_subset

2. `plot_deaths` takes a string indicating whether we want to plot total or new covid deaths (`'total'` or `'new'`), the name of the country, and a `DataFrame` with the country-specific time series. It displays a plot of total or new deaths over time.

In [None]:
def plot_deaths(ylabel, country, df):
        
    fig, ax = plt.subplots(figsize = (10, 3))

    ax.plot(df['date'],
            df[ylabel])

    # set xrange
    ax.set_xlim(df['date'].min(), df['date'].max())

    # set title
    ax.set_title(country + ' (' + ylabel + ' covid deaths)')

    plt.show()

We can now use the functions to extract and plot covid deaths for any country in the online data file.

In [None]:
country = 'Norway'
#country = 'Sweden'
#country = 'Denmark'
df_subset = get_deaths(country)
df_subset

In [None]:
plot_deaths('new', country, df_subset)

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Notice that <code>plot_deaths</code> returns strange plots for some countries, e.g. Denmark. Inspect the output of <code>get_deaths</code> and the online data set and see if you can figure out what is causing this. Fix <code>get_deaths</code> so that we get correct plots for all countries.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python

# Notice that some countries are split over multiple rows in the online data. That is because they report covid deaths 
# separately for the different regions/states in that country. 
# We can fix our program by adding a line of code in get_deaths that sums all of the deaths across the 
# regions/states for each country.


def get_deaths(country):
    
    # define url
    url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

    # import file from url
    df_full = pd.read_csv(url)

    # drop columns
    df_full.drop(['Province/State', 'Lat', 'Long'], axis = 1, inplace = True)
    
    # sum all provinces/states
    df_full = df_full.groupby('Country/Region').sum().reset_index()

    # extract country
    df_subset = df_full[df_full['Country/Region'] == country].copy()
    
    # melt df
    df_subset = df_subset.melt(id_vars = ['Country/Region'], var_name = 'date', value_name = 'deaths')

    # convert to datetime
    df_subset['date'] = pd.to_datetime(df_subset['date'], format = '%m/%d/%y')

    # rename
    df_subset.columns = ['country', 'date', 'total']
    
    # take difference
    df_subset['new'] = df_subset['total'].diff()

    
    return df_subset


def plot_deaths(ylabel, country, df):
        
    fig, ax = plt.subplots(figsize = (10, 3))

    ax.plot(df['date'],
            df[ylabel])

    # set xrange
    ax.set_xlim(df['date'].min(), df['date'].max())

    # set title
    ax.set_title(country + ' (' + ylabel + ' covid deaths)')

    plt.show()
    
    
country = 'Denmark'
df_subset = get_deaths(country)
plot_deaths('new', country, df_subset)
```

</p>
</details> 

**Example: Yahoo Finance**

Yahoo Finance contains historical data on price and trading volume for many different stocks. Yahoo Finance used to have an official API, but it was shut down in 2017. However, we can download historical data by scraping it directly off the web site.

Let us extract the historical data for [Apple](https://finance.yahoo.com/quote/AAPL/history?p=AAPL).

We import the data directly from the URL (we get the URL by right-clicking the "download" button and pressing "save as...").

In [None]:
url = 'https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1603411200&period2=1619136000&interval=1d&events=history&includeAdjustedClose=true'
print(url)

In [None]:
apple = pd.read_csv(url)

apple

Notice that the URL contains several "parameters". Let us store these parameters in variables and then concatenate the URL back together.

In [None]:
ticker = 'AAPL'       # ticker name
period1 = 1603411200  # start period
period2 = 1619136000  # end period

url = 'https://query1.finance.yahoo.com/v7/finance/download/' + ticker + '?period1=' + str(period1) + '&period2=' + str(period2) + '&interval=1d&events=history&includeAdjustedClose=true'
print(url)
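
As an alternative to concatenating the strings by hand, the query part of a URL can be assembled with `urlencode` from Python's standard library. The sketch below wraps this in a helper function. The name `build_history_url` is our own (hypothetical) choice, and the parameters are simply the ones visible in the download URL.

```python
from urllib.parse import urlencode

def build_history_url(ticker, period1, period2):
    """Assemble the download URL from its parts."""
    base = 'https://query1.finance.yahoo.com/v7/finance/download/' + ticker
    params = {
        'period1' : period1,
        'period2' : period2,
        'interval' : '1d',
        'events' : 'history',
        'includeAdjustedClose' : 'true',
    }
    # urlencode joins the pairs as key=value&key=value
    return base + '?' + urlencode(params)

print(build_history_url('AAPL', 1603411200, 1619136000))
```

With a helper like this, switching to another ticker or time period only requires changing the arguments.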

Let us download historical data for Apple for every weekday in 2021.

Notice that the time periods are measured in Unix time, i.e. the number of seconds that have elapsed since midnight (UTC) on January 1, 1970. This point of reference is known as the Unix epoch. It is common for computer systems to use Unix time.

We can convert between Unix time and a date by using the `datetime` module.

In [None]:
import datetime as dt

The `datetime` module has a class, also called `datetime`, that we can use to create objects representing a specific date and time.

In [None]:
dt.datetime(2021, 1, 1, 23, 59)

We can then apply the method `timestamp` to convert the `datetime` object to the number of seconds between that point in time and the Unix epoch.

In [None]:
dt.datetime(2021, 1, 1, 23, 59).timestamp()

In [None]:
# define periods
period1 = int(dt.datetime(2021, 1, 1, 23, 59).timestamp())
period2 = int(dt.datetime(2021, 12, 31, 23, 59).timestamp())

print(period1)
print(period2)
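
One caveat: when a `datetime` object carries no time zone information, `timestamp` interprets it in the local time zone of the computer running the code, so the resulting numbers can differ between machines. Below is a minimal sketch of a variant that fixes the time zone to UTC, together with the reverse conversion using `fromtimestamp`.

```python
import datetime as dt

# midnight on January 1, 2021, explicitly in UTC
start = dt.datetime(2021, 1, 1, tzinfo = dt.timezone.utc)
period1_utc = int(start.timestamp())
print(period1_utc)   # 1609459200

# and back again: from Unix time to a date
start_date = dt.datetime.fromtimestamp(1603411200, tz = dt.timezone.utc)
print(start_date)    # 2020-10-23 00:00:00+00:00
```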

In [None]:
# define ticker
ticker = 'AAPL'

# define url
url = 'https://query1.finance.yahoo.com/v7/finance/download/' + ticker + '?period1=' + str(period1) + '&period2=' + str(period2) + '&interval=1d&events=history&includeAdjustedClose=true'

# import data
apple = pd.read_csv(url)

apple

Let us instead extract historical data for Amazon. The ticker for Amazon is `AMZN`.

In [None]:
# define ticker
ticker = 'AMZN'

# define url
url = 'https://query1.finance.yahoo.com/v7/finance/download/' + ticker + '?period1=' + str(period1) + '&period2=' + str(period2) + '&interval=1d&events=history&includeAdjustedClose=true'

print(url)

In [None]:
# import data
amazon = pd.read_csv(url)

amazon

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> The file <code>closing_prices.csv</code> contains the daily closing price in 2020 for ten different companies. Import the file and create a list of the tickers. Use this list of tickers to extract the daily opening price for the ten companies from Yahoo Finance. Store the daily opening prices in a file called <code>opening_prices.csv</code> on your computer. 
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python

# import file
df_close = pd.read_csv('data/closing_prices.csv')

# extract tickers
tickers = df_close['Stock'].unique()

# define periods
period1 = int(dt.datetime(2020, 1, 1, 23, 59).timestamp())
period2 = int(dt.datetime(2020, 12, 31, 23, 59).timestamp())

# list for storing one DataFrame per ticker
df_lst = []

for ticker in tickers:
    
    # define url
    url = 'https://query1.finance.yahoo.com/v7/finance/download/' + ticker + '?period1=' + str(period1) + '&period2=' + str(period2) + '&interval=1d&events=history&includeAdjustedClose=true'
    
    # extract data
    temp_df = pd.read_csv(url)
    
    # keep only open price and add stock name
    temp_df = temp_df[['Date', 'Open']].copy()
    temp_df['Stock'] = ticker

    # append to list
    df_lst.append(temp_df)

# concat to single df (and reset index)
df_open = pd.concat(df_lst).reset_index(drop = True)

df_open.to_csv('data/opening_prices.csv', index = False)
```

</p>
</details> 

## Scrape HTML tables

Many web sites display data in the form of HTML tables.

`pandas` has a function, `read_html`, that takes a URL as a parameter, and returns all HTML tables found on that web page as a list of `DataFrame`s.

**Example: Wikipedia**

[This](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)) Wikipedia page contains a table with countries and their estimated GDP by IMF, World Bank and United Nations. We want to scrape the information in this table off the web page.

In [None]:
# define url
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

# scrape web page for html tables
tables = pd.read_html(url)
print(len(tables))

Notice that `read_html` ended up scraping more than the one table that we are after... 

In [None]:
tables[0]

By inspecting the page source of the URL (search for "table class"), we can see that the table that we are after belongs to the table class `wikitable sortable static-row-numbers plainrowheaders srn-white-background`.

We can narrow down the number of tables being scraped by giving the parameter `attrs` a dictionary where we specify that we only want the tables that have `class` equal to `wikitable sortable static-row-numbers plainrowheaders srn-white-background`.

In addition, we set the parameter `header` equal to `0` in order to make sure that the first row in each table is used as the column labels.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'

tables = pd.read_html(
    url, 
    header = 0, 
    attrs = {'class' : 'wikitable sortable static-row-numbers plainrowheaders srn-white-background'}
)

print(len(tables))

In [None]:
tables[0].head()
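
To see the effect of `attrs` without depending on a live web page, here is a minimal sketch that runs `read_html` on a made-up HTML snippet (the snippet and the class name `data` are invented for illustration). It assumes that an HTML parser library such as `lxml` is installed, which `read_html` needs in any case.

```python
from io import StringIO
import pandas as pd

# made-up page with two tables; only the second one has a class attribute
html = """
<table>
  <tr><th>Menu</th></tr>
  <tr><td>Home</td></tr>
</table>
<table class="data">
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>250</td></tr>
</table>
"""

# without attrs: every table on the page is returned
all_tables = pd.read_html(StringIO(html), header = 0)
print(len(all_tables))

# with attrs: only tables whose class attribute matches
data_tables = pd.read_html(StringIO(html), header = 0, attrs = {'class' : 'data'})
print(len(data_tables))
print(data_tables[0])
```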

Let us extract the GDP data from the United Nations.

In [None]:
# extract table 
df_gdp = tables[0]

# rename and extract columns
df_gdp.rename(columns = {'Country/Territory' : 'Country', 'United Nations[12]' : 'GDP'}, inplace = True)
df_gdp = df_gdp[['Country', 'GDP']].copy()

df_gdp.head()

In [None]:
# drop first row
df_gdp.drop(0, inplace = True)

# convert gdp to float
df_gdp['GDP'] = df_gdp['GDP'].astype(float)

# drop missing
df_gdp.dropna(inplace = True)

print('Number of countries: ' + str(df_gdp['Country'].nunique()))
df_gdp.head()
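
Scraped columns sometimes contain entries that are not numbers (e.g. a placeholder for missing data), in which case `astype(float)` raises an error. One possible alternative, sketched here on made-up values, is `pd.to_numeric` with `errors = 'coerce'`, which turns such entries into missing values that can then be dropped.

```python
import pandas as pd

# made-up column with a placeholder for a missing value
gdp_raw = pd.Series(['100', '250', 'N/A'])

# entries that cannot be parsed become NaN instead of raising an error
gdp = pd.to_numeric(gdp_raw, errors = 'coerce')
print(gdp.tolist())

# drop the missing entries
gdp_clean = gdp.dropna()
print(gdp_clean.tolist())
```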

In [None]:
#df_gdp['Country'].unique()

[This](https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)) Wikipedia page contains information on countries and their estimated population by the United Nations.

In [None]:
# define url
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

# scrape html tables of web page
tables = pd.read_html(url)
print(len(tables))

In [None]:
tables[0]

In [None]:
# extract table
df_pop = tables[0]

# rename and extract columns
df_pop.rename(columns = {'Country/Area' : 'Country', 'Population(1 July 2019)' : 'pop'}, inplace = True)
df_pop = df_pop[['Country', 'pop']].copy()

df_pop.head()

In [None]:
# remove parentheses and square brackets (and any trailing whitespace) from country names
df_pop['Country'] = df_pop['Country'].str.split('(', expand = True)[0]
df_pop['Country'] = df_pop['Country'].str.split('[', expand = True)[0].str.strip()

# drop world
df_pop = df_pop[df_pop['Country'] != 'World'].copy()

print('Number of countries: ' + str(df_pop['Country'].nunique()))
df_pop.head()
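
An alternative sketch of the same clean-up uses a single regular expression, which also removes any whitespace left in front of the bracket. The names below are made up to resemble the kinds of suffixes found in Wikipedia tables.

```python
import pandas as pd

# made-up names with footnote markers and parenthetical remarks
names = pd.Series(['China[a]', 'France (metropolitan)', 'Norway'])

# remove everything from the first ( or [ onwards, including whitespace before it
clean = names.str.replace(r'\s*[\(\[].*$', '', regex = True)
print(clean.tolist())
```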

In [None]:
#df_pop['Country'].unique()

Merge the `DataFrame`s in order to estimate countries' GDP per capita.

In [None]:
# inner join
df = df_gdp.merge(df_pop, on = 'Country', how = 'inner')

# calculate GDP per capita (multiply by 1,000,000 since GDP is measured in million $)
df['GDP_pc'] = df['GDP']*1000000 / df['pop']

print('Number of countries: ' + str(df['Country'].nunique()))
df.head()
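
An inner join silently drops every country whose name is not spelled identically in both tables. One way to find out which rows are lost is an outer join with `indicator = True`, sketched here on made-up tables.

```python
import pandas as pd

# made-up tables where one country is spelled differently
left = pd.DataFrame({'Country' : ['A', 'B', 'C'], 'GDP' : [1.0, 2.0, 3.0]})
right = pd.DataFrame({'Country' : ['A', 'B', 'Republic of C'], 'pop' : [10, 20, 30]})

# outer join keeps all rows; the _merge column tells us where each row came from
check = left.merge(right, on = 'Country', how = 'outer', indicator = True)

# rows that would be lost in an inner join
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Country', '_merge']])
```

Inspecting the unmatched rows makes it easier to decide which country names need to be harmonized before merging.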

In [None]:
#df['Country'].unique()

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 4))

ax[0].scatter(df['GDP_pc'], df['pop'] / 1000000)
ax[0].set_ylabel('Population (in millions)')
ax[0].set_xlabel('GDP per capita (in dollars)')

ax[1].hist(df['GDP_pc'], bins = 30)
ax[1].set_ylabel('Number of countries')
ax[1].set_xlabel('GDP per capita (in dollars)')

plt.show()

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> <a href="https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy">This</a> Wikipedia site contains information on life expectancy for countries in the world. Notice that the site contains many different tables. Choose the table with the estimated life expectancies from UNDP 2019. 
        
Use this data to create a scatter plot of life expectancy against GDP per capita, with each marker weighted by country population.
</div>

**Solution**

<details>
    
<summary> Click to expand!</summary>
<p> 

```python
### Tidy data ###

# scrape tables
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy'
tables = pd.read_html(
    url,       
    header = 0,
    attrs = {'class' : 'wikitable sortable static-row-numbers plainrowheaders srn-white-background'}
)
print(len(tables))

# extract and rename data
df_life = tables[1]
df_life.rename(columns = {'Countries and regions' : 'Country', 'Life expectancy at birth' : 'life_exp'}, inplace = True)
df_life = df_life[['Country', 'life_exp']].copy()

# drop first row
df_life.drop(0, inplace = True)

# convert to float
df_life['life_exp'] = df_life['life_exp'].astype(float)

print('Number of countries: ' + str(df_life['Country'].nunique()))

# merge life expectancy data with the GDP and population data (inner join)
df2 = df_life.merge(df, on = 'Country', how = 'inner')

print('Number of countries: ' + str(df2['Country'].nunique()))


### Plot ###

fig, ax = plt.subplots(figsize = (10, 4))

ax.scatter(
    df2['GDP_pc'],             # x-values
    df2['life_exp'],           # y-values
    s = df2['pop'] / 1000000   # population weights (must divide by 1000000 to make markers visible)
)

ax.set_ylabel('Life expectancy (in yrs)')
ax.set_xlabel('GDP per capita (in dollars)')

ax.set_title('Life expectancy vs GDP per capita, United Nations (2019)')

plt.show()
```

</p>
</details> 

## Summary

`pandas` is super useful for web scraping! As you have seen, we have scraped many different web sites using just `pandas`.

However, there are some limitations to what we can use `pandas` for:
- `read_csv` can only be used to extract data that is in the form of CSV files.
- `read_html` can only be used to extract data that is in the form of HTML tables. 

If you want to extract data from a web site that is not in the form of a CSV file or HTML table, you need to use something more powerful, e.g. `requests` and `BeautifulSoup`. See [this](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/) or [this](https://betterprogramming.pub/how-to-use-pandas-for-web-scraping-not-enough-try-beautiful-soup-98d0362d5bb1) tutorial for how to use these packages to scrape additional HTML data.
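
To give a flavour of what this looks like, the sketch below uses `BeautifulSoup` (from the `bs4` package, which must be installed separately) to pull the links out of a made-up HTML snippet. In practice, the HTML text would come from a request to a web page, e.g. via `requests`. The snippet and file names here are invented for illustration.

```python
from bs4 import BeautifulSoup

# made-up HTML snippet, standing in for the source of a web page
html = """
<ul>
  <li><a href="/report-2020.pdf">Report 2020</a></li>
  <li><a href="/report-2021.pdf">Report 2021</a></li>
</ul>
"""

# parse the HTML with Python's built-in parser
soup = BeautifulSoup(html, 'html.parser')

# collect the text and the target of every link
links = [(a.get_text(strip = True), a['href']) for a in soup.find_all('a')]
print(links)
```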