[![Binder](http://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/EconomicsObservatory/courses/HEAD?labpath=5%2Fs5_Scraping_Bonus.ipynb)

<a href="https://colab.research.google.com/github/EconomicsObservatory/courses/blob/main/5/s5_Scraping_Bonus.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scraping
Scraping is the automated extraction of data from websites.

This bonus notebook demonstrates scraping HTML tables, using a Wikipedia table of G7 summits as an example.

# Data Scraping


For this example, we will scrape the list of G7 summits from Wikipedia, <a href="https://en.wikipedia.org/wiki/G7#List_of_summits">here</a>. It's a good target because it is:

- Available online
- Not available as a clean download (Excel, CSV)

<img
height=500 src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/g7_wiki.png"> </img>



Before starting, we have to install and import pandas, which we'll use for scraping and data manipulation. The pd.read_html function also needs an HTML parser such as lxml, so we install that too.

In [None]:
%pip install pandas lxml

In [2]:
import pandas as pd # used for data scraping, and manipulation

First we define the URL of the web page we want to scrape, then read its tables.


In [3]:
url = "https://en.wikipedia.org/wiki/G7" # The URL of the webpage
tables_from_webpage = pd.read_html(url) # Read the tables from the webpage into a list of DataFrames

pd.read_html returns a list of DataFrames, one for each table on the web page.
We can look through this list one table at a time by trying tables_from_webpage[0], tables_from_webpage[1], and so on.
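Rather than opening each table in turn, we can print a one-line summary of every table in the list. The sketch below is only an illustration: the `html` string is a small made-up page (not the real Wikipedia article) so that it runs without a network connection, and it assumes lxml is installed. With the real data you would loop over tables_from_webpage instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page standing in for the real article
html = """
<table>
  <tr><th>Member</th><th>Joined</th></tr>
  <tr><td>France</td><td>1975</td></tr>
</table>
<table>
  <tr><th>#</th><th>Host</th><th>Location held</th></tr>
  <tr><td>1st</td><td>France</td><td>Château de Rambouillet, Yvelines</td></tr>
  <tr><td>2nd</td><td>United States</td><td>Dorado, Puerto Rico</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))  # one DataFrame per <table>

# Collect the position, shape and column names of each table
summary = [(i, t.shape, list(t.columns)) for i, t in enumerate(tables)]
for i, shape, columns in summary:
    print(i, shape, columns)
```

The table whose columns include 'Location held' is the one we are after.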

In [4]:
tables_from_webpage[2].head(10)

| | # | Date | Host | Host leader | Location held | Notes |
|---|---|---|---|---|---|---|
| 0 | 1st | 15–17 November 1975 | France | Valéry Giscard d'Estaing | Château de Rambouillet, Yvelines | The first and last G6 summit. |
| 1 | 2nd | 27–28 June 1976 | United States | Gerald R. Ford | Dorado, Puerto Rico[74] | Also called "Rambouillet II". Canada joined th... |
| 2 | 3rd | 7–8 May 1977 | United Kingdom | James Callaghan | London, England | The President of the European Commission was i... |
| 3 | 4th | 16–17 July 1978 | West Germany | Helmut Schmidt | Bonn, North Rhine-Westphalia | |
| 4 | 5th | 28–29 June 1979 | Japan | Masayoshi Ōhira | Tokyo | |
| 5 | 6th | 22–23 June 1980 | Italy | Francesco Cossiga | Venice, Veneto | Prime Minister Ōhira died in office on 12 June... |
| 6 | 7th | 20–21 July 1981 | Canada | Pierre E. Trudeau | Montebello, Québec | |
| 7 | 8th | 4–6 June 1982 | France | François Mitterrand | Versailles, Yvelines | |
| 8 | 9th | 28–30 May 1983 | United States | Ronald Reagan | Williamsburg, Virginia | |
| 9 | 10th | 7–9 June 1984 | United Kingdom | Margaret Thatcher | London, England | |


Trying tables_from_webpage[0], tables_from_webpage[1] and tables_from_webpage[2], we can see that tables_from_webpage[2] is the table we need.
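Picking a table by its position can break if the page later gains or loses a table. As a hedged alternative, pd.read_html accepts a `match` argument and returns only the tables whose text matches it. The sketch below again uses a made-up `html` string so it runs offline, and it assumes lxml is installed and that the phrase 'Location held' appears only in the table we want; if several tables contain the phrase, all of them are returned.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page standing in for the real article
html = """
<table>
  <tr><th>Member</th><th>Joined</th></tr>
  <tr><td>France</td><td>1975</td></tr>
</table>
<table>
  <tr><th>#</th><th>Host</th><th>Location held</th></tr>
  <tr><td>1st</td><td>France</td><td>Château de Rambouillet, Yvelines</td></tr>
</table>
"""

# Keep only the tables that contain the text "Location held"
matching = pd.read_html(StringIO(html), match="Location held")
summits = matching[0]
print(len(matching), list(summits.columns))
```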

Let's set the variable df equal to this table for ease of use, and take a look at the first few rows with df.head().

In [5]:
df = tables_from_webpage[2]
df.head()

| | # | Date | Host | Host leader | Location held | Notes |
|---|---|---|---|---|---|---|
| 0 | 1st | 15–17 November 1975 | France | Valéry Giscard d'Estaing | Château de Rambouillet, Yvelines | The first and last G6 summit. |
| 1 | 2nd | 27–28 June 1976 | United States | Gerald R. Ford | Dorado, Puerto Rico[74] | Also called "Rambouillet II". Canada joined th... |
| 2 | 3rd | 7–8 May 1977 | United Kingdom | James Callaghan | London, England | The President of the European Commission was i... |
| 3 | 4th | 16–17 July 1978 | West Germany | Helmut Schmidt | Bonn, North Rhine-Westphalia | |
| 4 | 5th | 28–29 June 1979 | Japan | Masayoshi Ōhira | Tokyo | |
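Notice that one location in the output above carries a Wikipedia footnote marker ("Dorado, Puerto Rico[74]"). Left in place, such markers would make the same place look like two different values when we group. A minimal sketch of one way to strip them, using a small hand-typed sample (the `sample` frame is illustrative, not the full scraped table):

```python
import pandas as pd

# Toy rows copied from the output above
sample = pd.DataFrame({
    "Host": ["United States", "United Kingdom", "Japan"],
    "Location held": ["Dorado, Puerto Rico[74]", "London, England", "Tokyo"],
})

# Remove bracketed footnote numbers such as "[74]"
sample["Location held"] = sample["Location held"].str.replace(
    r"\[\d+\]", "", regex=True
)
print(sample["Location held"].tolist())
```

The same line could be applied to df['Location held'] before grouping.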


## Manipulating the Data

We can now manipulate the data. Let's prepare the data for a chart of the number of G7 meetings held at each location.

First, let's group by the column 'Location held', count the rows in each group, and keep only the locations that appear more than once.

In [6]:
df = tables_from_webpage[2]
df = df.groupby('Location held').aggregate({'Host': 'count'}) # Group the data by the 'Location held' column and count the number of occurrences
df = df.sort_values(by='Host', ascending=False) # Sort the data by the number of occurrences in descending order
df = df[df['Host'] > 1] # Keep only the rows where the number of occurrences is greater than 1
df = df.rename(columns={'Host': 'Count'}) # Rename the 'Host' column to 'Count'

df

| Location held | Count |
|---|---|
| Tokyo | 3 |
| London, England | 3 |
| Bonn, North Rhine-Westphalia | 2 |
| Venice, Veneto | 2 |

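The group, count, sort and filter steps above can also be written more compactly with value_counts, which counts and sorts in one call. This is an alternative sketch, not the notebook's own method, and it runs on a small hand-typed sample rather than the scraped table:

```python
import pandas as pd

# Toy column in the shape of df['Location held']
locations = pd.Series(
    ["London, England", "Tokyo", "London, England", "Tokyo", "Montebello, Québec"],
    name="Location held",
)

# Count occurrences (sorted in descending order by default)
counts = locations.value_counts().rename("Count")

# Keep only locations that appear more than once
repeat_locations = counts[counts > 1]
print(repeat_locations)
```

Note that this produces a Series rather than a DataFrame; calling repeat_locations.to_frame() would convert it if a DataFrame is needed.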

## Uploading the Data

Now let's save our table as a CSV file, ready to upload to GitHub and use in Vega-Lite.

In [7]:
df.to_csv('g7_summits.csv') # Save the data to a CSV file
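Before uploading, it can be worth reading the file back to check what was written. The sketch below rebuilds the small table of counts by hand and writes it to a temporary folder (the temporary path is only for illustration). It shows that to_csv writes the index as the first column, so 'Location held' comes back as an ordinary column, and that location names containing commas survive the round trip.

```python
import os
import tempfile

import pandas as pd

# The table of counts from above, rebuilt by hand
counts = pd.DataFrame(
    {"Count": [3, 3, 2, 2]},
    index=pd.Index(
        ["Tokyo", "London, England", "Bonn, North Rhine-Westphalia", "Venice, Veneto"],
        name="Location held",
    ),
)

# Write to a temporary folder so nothing in the working directory is overwritten
path = os.path.join(tempfile.mkdtemp(), "g7_summits.csv")
counts.to_csv(path)

# Read the file back and inspect it
check = pd.read_csv(path)
print(check.columns.tolist())
print(check)
```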

Next we have to upload our output (e.g. "g7_summits.csv") to GitHub. Go to your own repository and click 'Add file':

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/uploading_to_github.png"> </img>

Then find the file and click 'Raw':

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/getting_raw.png"> </img>

Finally, copy the URL to use in Vega-Lite:

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/getting_url.png"> </img>
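To show where the copied URL ends up, here is a rough sketch of a Vega-Lite bar chart specification built as a Python dictionary and printed as JSON. The `csv_url` value is a placeholder (the user, repository and branch parts are hypothetical and must be replaced with your own raw URL), and the chart design is just one reasonable choice. The field names match the columns of the CSV we saved.

```python
import json

# Placeholder: replace with the raw URL copied from your own repository
csv_url = "https://raw.githubusercontent.com/<user>/<repo>/main/g7_summits.csv"

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"url": csv_url},
    "mark": "bar",
    "encoding": {
        # Bar length: number of summits held at the location
        "x": {"field": "Count", "type": "quantitative"},
        # One bar per location, longest bar first
        "y": {"field": "Location held", "type": "nominal", "sort": "-x"},
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The printed JSON can be pasted into the Vega-Lite editor once the placeholder URL has been replaced.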



