# Web scraping to determine which region a country lies in

The aim of this code is to find out which region each country in the world belongs to, using a <a href="https://meta.wikimedia.org/wiki/List_of_countries_by_regional_classification" target="_blank">Wikimedia Meta-Wiki page</a>. We then save the data as a CSV file for use with the COVID data on my Tableau dashboard.

In [4]:
#import useful packages
from __future__ import print_function
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Reading the webpage
First, we need to read the webpage into Python. This is done using the requests library, which fetches the HTML source of the webpage. The HTML source of any webpage can be seen by right-clicking the page and viewing the 'source code'. It is this source that is read into Python using the command below.

In [2]:
#read the region data into script
r = requests.get("https://meta.wikimedia.org/wiki/List_of_countries_by_regional_classification")
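
The request above assumes that the page is reachable and that the server answers successfully. As a minimal sketch, assuming we would rather fail loudly on a network or HTTP error, the call could be wrapped as follows. The function name `fetch_html` and its `session` parameter are illustrative and not part of the original notebook; `timeout` and `raise_for_status()` are standard requests features.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the HTML text at `url`, raising an exception on HTTP errors.

    `session` is a hypothetical hook: any object with a requests-style
    `.get(url, timeout=...)` method can be passed in, e.g. a requests.Session.
    """
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

With this helper, `r.text` in the cells below could be replaced by the string returned from `fetch_html("https://meta.wikimedia.org/wiki/List_of_countries_by_regional_classification")`.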

In [3]:
r.text[0:500]

'\n<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of countries by regional classification - Meta</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"710acccd-a520-4'

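### Parsing the HTML
At this point, `r` is a requests `Response` object and `r.text` holds the page's HTML as a plain string. So that Python can navigate the structure of the HTML rather than treat it as raw text, we parse the string with Python's built-in `html.parser` using the following Beautiful Soup command.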

In [6]:
soup = BeautifulSoup(r.text, 'html.parser')
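
To illustrate what parsing gives us, here is a small sketch on a hand-written HTML string rather than the live page. The names `sample_html` and `sample_soup` are made up for this example; the title text is copied from the `r.text` output above.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page (assumption: not the actual page source)
sample_html = """<!DOCTYPE html>
<html><head><title>List of countries by regional classification - Meta</title></head>
<body><table><tr><th>Country</th></tr></table></body></html>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# After parsing, tags can be reached by name instead of by string position
page_title = sample_soup.title.get_text()
first_header = sample_soup.find("th").get_text()

print(page_title)
print(first_header)
```

The same attribute-style access works on the real `soup` object created above.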

## Retrieving relevant data
We need to pick out the data relevant to us: the table listing each country with its region. If we look at the page source, each row of this table has the following format:
```
<tr>
<td>COUNTRY
</td>
<td>REGION
</td>
<td>HEMISPHERE
</td></tr> 
```
Therefore, we use the method `.find_all('TAG', attrs={'ATTRIBUTE': 'ATTR VALUE'})`, which returns a result set listing every instance of that tag in the HTML. Here the tag name alone is enough, so we do not pass any attributes.

In [14]:
#the tag that encompasses the data we want is <tr> in html or 'tr' in python
table = soup.find_all('tr')
#save the data as results; the first instance of the tag holds the column headers and isn't needed
results = table[1:]

In [15]:
len(results) 

248
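
Calling `find_all('tr')` on the whole page picks up every table row, which works here but would also include rows from any other table on the page. As a hedged sketch of the `attrs` form described above, the search can be limited to one table first. The class names `wikitable` and `navbox` are assumptions made for this hand-written sample and have not been checked against the real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample with two tables (assumption: not the actual page source)
sample_html = """
<table class="wikitable">
<tr><th>Country</th><th>Region</th></tr>
<tr><td>Andorra</td><td>Europe</td></tr>
</table>
<table class="navbox">
<tr><td>Unrelated link</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Every <tr> in the document, regardless of which table it belongs to
all_rows = sample_soup.find_all("tr")

# Only the <tr> tags inside the table whose class attribute is "wikitable"
data_table = sample_soup.find("table", attrs={"class": "wikitable"})
table_rows = data_table.find_all("tr")

print(len(all_rows), len(table_rows))
```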

In [23]:
first_result = results[0]

In [49]:
first_result

<tr>
<td>Andorra
</td>
<td>Europe
</td>
<td>Global North
</td></tr>

### Slicing HTML tags

In [37]:
#returns the first country
first_result.contents[1].contents[0][:-1]

'Andorra'

In [41]:
#returns the first region
first_result.contents[3].contents[0][:-1]

'Europe'

In [48]:
#returns the Hemisphere
first_result.contents[5].contents[0][7:-1] + 'ern Hemisphere' 

'Northern Hemisphere'

This slices the column to return either 'Northern Hemisphere' or 'Southern Hemisphere'. However, looking at the source page and the data, it is clear that Global North does not equal Northern Hemisphere. Global North/South describe how well developed a country is and are therefore less useful for describing geographical locations, so we keep the original labels rather than relabelling them as hemispheres.
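
The indices 1, 3 and 5 used in the cells above may look odd. They arise because the newline between each pair of tags is itself a child of the row, so text nodes and `<td>` tags alternate in `.contents`. The sketch below shows this on a copy of the first row; `row_html`, `kinds` and `cells` are names made up for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Same layout as the first result shown above, including the newlines
row_html = "<tr>\n<td>Andorra\n</td>\n<td>Europe\n</td>\n<td>Global North\n</td></tr>"
row = BeautifulSoup(row_html, "html.parser").tr

# Label each child of the row as either a tag or a bare text node
kinds = ["tag" if isinstance(child, Tag) else "text" for child in row.contents]
print(kinds)

# Keeping only the tags removes the need for the 1, 3, 5 indexing,
# and get_text(strip=True) removes the need for the [:-1] slice
cells = [child for child in row.contents if isinstance(child, Tag)]
country = cells[0].get_text(strip=True)
print(country)
```

Both approaches give the same strings on this row; the positional one depends on the exact whitespace in the page source.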

Now that we have managed to return a string with the information we want from the first data point in the result set (`results`), we can iterate through the entire set and build a list of tuples representing the data points.

In [66]:
#iterate through the entire results list to find the country, region and Global North/South label
records = []
for result in results:
    country = result.contents[1].contents[0][:-1]
    region = result.contents[3].contents[0][:-1]
    north_south = result.contents[5].contents[0][:-1]  #kept as-is, not converted to a hemisphere
    records.append((country, region, north_south))

In [67]:
records[-3:]

[('Jersey', 'Europe', 'Global North'),
 ('Saint Barthelemy', 'South/Latin America', 'Global South'),
 ('Saint Martin', 'South/Latin America', 'Global South')]
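
The loop above relies on every row having exactly the same whitespace layout. As an alternative sketch, assuming we want to tolerate header rows and malformed rows, the cells can be collected with `find_all('td')` and cleaned with `get_text(strip=True)`. The function name `extract_records` and the sample HTML are illustrative only.

```python
from bs4 import BeautifulSoup


def extract_records(soup):
    """Return a list of (country, region, north_south) tuples from all <tr> rows."""
    records = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) != 3:
            continue  # skip header rows (which use <th>) and incomplete rows
        records.append(tuple(cells))
    return records


# Hand-written sample (assumption: not the actual page source)
sample_html = """<table>
<tr><th>Country</th><th>Region</th><th>Global South</th></tr>
<tr>
<td>Andorra
</td>
<td>Europe
</td>
<td>Global North
</td></tr>
<tr><td>Incomplete row</td></tr>
</table>"""

sample_records = extract_records(BeautifulSoup(sample_html, "html.parser"))
print(sample_records)
```

Because rows without three `<td>` cells are skipped, this version does not need the `table[1:]` slice to drop the header row.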

## Saving to a pandas DataFrame
Having checked that the list of tuples holds the data we wanted to retrieve, we now load it into a pandas DataFrame.

In [71]:
#create dataframe in pandas
df4 = pd.DataFrame(records, columns=['Country', 'Region', 'Global North/South'])

In [70]:
df4.head()

|   | Country | Region | Global North/South |
|---|---------|--------|--------------------|
| 0 | Andorra | Europe | Global North |
| 1 | United Arab Emirates | Middle east | Global South |
| 2 | Afghanistan | Asia & Pacific | Global South |
| 3 | Antigua and Barbuda | South/Latin America | Global South |
| 4 | Anguilla | South/Latin America | Global South |
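
Before saving, it may be worth running a few quick sanity checks on the DataFrame. The sketch below uses the five rows shown above as stand-in data; `sample_df` and the check names are made up for this example, and the same calls could be applied to `df4`.

```python
import pandas as pd

# The five rows from df4.head(), used here as stand-in data
sample_records = [
    ("Andorra", "Europe", "Global North"),
    ("United Arab Emirates", "Middle east", "Global South"),
    ("Afghanistan", "Asia & Pacific", "Global South"),
    ("Antigua and Barbuda", "South/Latin America", "Global South"),
    ("Anguilla", "South/Latin America", "Global South"),
]
sample_df = pd.DataFrame(sample_records, columns=["Country", "Region", "Global North/South"])

# Rows whose country name has already appeared earlier in the table
duplicate_countries = sample_df[sample_df["Country"].duplicated()]

# Total number of empty cells across all columns
missing_values = int(sample_df.isna().sum().sum())

# Number of countries per region, useful for spotting misspelt region names
region_counts = sample_df["Region"].value_counts()

print(len(duplicate_countries), missing_values)
print(region_counts)
```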


In [72]:
df4.to_csv('Regional Data.csv', index=False)
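
The `index=False` argument matters when the file is read back in. As a small sketch, using an in-memory buffer instead of a file on disk, we can compare the columns that come back with and without it. The variable names here are illustrative.

```python
import io

import pandas as pd

sample_df = pd.DataFrame(
    [("Andorra", "Europe", "Global North")],
    columns=["Country", "Region", "Global North/South"],
)

# Written without the index: the columns round-trip unchanged
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

# Written with the default index=True: read_csv sees an extra unnamed first column
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

print(list(reloaded.columns))
print(columns_with_index)
```

Leaving the index out keeps the CSV limited to the three data columns, which is simpler to work with in Tableau.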