# Web Scraping to find the GDP of countries

The aim of this code is to retrieve the GDP of different countries from the <a href="https://www.worldometers.info/gdp/gdp-by-country/" target="_blank">worldometers.info</a> website. The data is then saved as a CSV file for use alongside the COVID data on my Tableau dashboard.

In [1]:
#import useful packages
from __future__ import print_function
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

## Reading the webpage
Firstly, we need to read the webpage into Python. This is done using the requests library, which downloads the HTML that defines the webpage. The HTML of any webpage can be seen by right-clicking on it and choosing to view the page source. It is this HTML that is read into Python using the command below.

In [2]:
#read the GDP page into the script
r = requests.get("https://www.worldometers.info/gdp/gdp-by-country/")
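
A bare `requests.get` call has no timeout by default and does not raise on an error page, so a failed download can go unnoticed until the parsing step. The following is a minimal sketch of a more defensive fetch. The helper name `fetch_html`, the 10-second timeout and the injectable `get` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests

URL = "https://www.worldometers.info/gdp/gdp-by-country/"


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page's HTML as text, raising on HTTP error codes.

    `get` is injectable (an assumed design choice) so the function can be
    exercised without network access.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx replies
    return response.text


# Usage (requires network access):
# html = fetch_html(URL)
```

The returned string can be passed to Beautiful Soup in the same way as `r.text`.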

In [3]:
r.text[0:500]

'\n<!DOCTYPE html><!--[if IE 8]> <html lang="en" class="ie8"> <![endif]--><!--[if IE 9]> <html lang="en" class="ie9"> <![endif]--><!--[if !IE]><!--> <html lang="en"> <!--<![endif]--> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <title>GDP by Country - Worldometer</title><meta name="description" content="Countries in the world ranked by Gross Domestic Product (GDP). List and ranking of GDP gr'

### Parsing the html
At this point, `r` is a response object whose `text` attribute holds the page's HTML as one long string. So that Python can navigate the structure of the HTML, we parse it with the following Beautiful Soup command.

In [4]:
soup = BeautifulSoup(r.text, 'html.parser')

## Retrieving relevant data
We need to pick out the data relevant to us. This is the tabular data listing GDP by country. If we look at the page source, each row of this table has the following format:
```
<tr> <td style="text-align:center">RANK</td> <td style="font-weight: bold; font-size:17px; text-align:left; padding-left:5px; padding-top:10px; padding-bottom:10px"><a href="COUNTRY HYPERLINK">COUNTRY</a></td> <td style="font-weight: bold; text-align:right;">GDP</td> <td style="font-weight: bold; text-align:right;">GDP ABBREV</td> <td style="font-weight: bold; text-align:right;">GDP % GROWTH</td> <td style="font-weight: bold; text-align:right;">POPULATION</td> <td style="font-weight: bold; text-align:right;">GDP PER CAPITA</td> <td style="font-weight: bold; text-align:right;">WORLD SHARE</td> </tr>
```
Therefore, we use the command `.find_all('TAG', attrs={'ATTRIBUTE': 'ATTR VALUE'})`, which returns a result set listing every instance of that tag in the HTML.

In [5]:
table = soup.find_all('tr') #note that the tag that encompasses the data we want is <tr> in html or 'tr' in python
results = table[1:] #save the data as results. Note that the first instance of the tag is the column headers and isn't needed.

In [6]:
len(results) 

189
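
Calling `soup.find_all('tr')` collects every row on the page, so it only works as intended while the GDP table is the only table present. The sketch below shows one way to scope the search to a single table and to use the `attrs` filter. The sample HTML and the id `gdp-table` are made up for illustration; the real page's ids and layout may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the GDP values are taken from this notebook's outputs.
SAMPLE_HTML = """
<table id="other"><tr><td>unrelated</td></tr></table>
<table id="gdp-table">
  <thead><tr><th>#</th><th>Country</th><th>GDP</th></tr></thead>
  <tbody>
    <tr><td>1</td><td><a href="/gdp/us-gdp/">United States</a></td><td>$19,485,394,000,000</td></tr>
    <tr><td>189</td><td>Tuvalu</td><td>$39,731,317</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first match (or None); attrs narrows by attribute value.
table = soup.find("table", attrs={"id": "gdp-table"})

# Searching inside the table ignores rows that belong to other tables.
rows = table.find_all("tr")
data_rows = rows[1:]  # drop the header row

print(len(soup.find_all("tr")), len(data_rows))  # 4 2
```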

In [7]:
first_result = results[0]

In [8]:
first_result

<tr> <td style="text-align:center">1</td> <td style="font-weight: bold; font-size:17px; text-align:left; padding-left:5px; padding-top:10px; padding-bottom:10px"><a href="/gdp/us-gdp/">United States</a></td> <td style="font-weight: bold; text-align:right;">$19,485,394,000,000</td> <td style="font-weight: bold; text-align:right;">$19.485 trillion</td> <td style="font-weight: bold; text-align:right;">2.27%</td> <td style="font-weight: bold; text-align:right;">325,084,756</td> <td style="font-weight: bold; text-align:right;">$59,939</td> <td style="font-weight: bold; text-align:right;">24.08%</td> </tr>

### Slicing html tags

In [10]:
#returns the first country
first_result.find_all('td')[1].text

'United States'

In [14]:
#returns the first GDP
first_result.find_all('td')[2].text[1:].replace(',','')

'19485394000000'
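
The slice-and-replace step can be wrapped in a small helper so that the conversion is easy to test on its own. This is a sketch only: the name `parse_dollars` and the choice to return `None` for non-numeric text such as 'N.A.' are assumptions, not something the notebook specifies.

```python
def parse_dollars(text):
    """Convert a string such as '$19,485,394,000,000' to an int.

    Returns None when the cleaned text is not a whole number (for example
    'N.A.'); that convention is an assumption made for this sketch.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    return int(cleaned) if cleaned.isdecimal() else None


print(parse_dollars("$19,485,394,000,000"))  # 19485394000000
print(parse_dollars("N.A."))                 # None
```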

Now that we have managed to return the information we want from the first data point in the result set (`results`), we can iterate through the entire set and build a list of tuples, one per country.

In [16]:
#iterate through the entire results list to find the country and GDP
records = []
for result in results:
    cells = result.find_all('td') #look up the row's cells once
    country = cells[1].text
    GDP = int(cells[2].text[1:].replace(',', '')) #drop the leading '$' and the thousands separators
    records.append((country, GDP))

In [19]:
records[-3:]

[('Marshall Islands', 204173430),
 ('Kiribati', 185572502),
 ('Tuvalu', 39731317)]
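
The loop assumes that every row has at least three cells. If the page ever included a row with a different shape, indexing the cells would raise an `IndexError`. The sketch below shows the same extraction with a guard for such rows. The function name `extract_records` and the sample HTML (including its malformed row) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up rows in the same shape as the page's table; the GDP values are
# taken from this notebook's outputs.
SAMPLE_ROWS = """
<table>
  <tr><td>1</td><td><a href="/gdp/us-gdp/">United States</a></td><td>$19,485,394,000,000</td></tr>
  <tr><td colspan="3">a malformed row</td></tr>
  <tr><td>189</td><td>Tuvalu</td><td>$39,731,317</td></tr>
</table>
"""


def extract_records(rows):
    """Return (country, gdp) tuples, skipping rows with fewer than 3 cells."""
    records = []
    for row in rows:
        cells = row.find_all("td")  # look the cells up once per row
        if len(cells) < 3:
            continue  # not a data row
        country = cells[1].text.strip()
        gdp = int(cells[2].text.strip().lstrip("$").replace(",", ""))
        records.append((country, gdp))
    return records


rows = BeautifulSoup(SAMPLE_ROWS, "html.parser").find_all("tr")
records = extract_records(rows)
print(records)  # [('United States', 19485394000000), ('Tuvalu', 39731317)]
```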

## Saving to a pandas DataFrame
Having checked that the list of tuples contains the data we wanted to retrieve, we now load it into a pandas DataFrame.

In [22]:
#create dataframe in pandas
df4 = pd.DataFrame(records, columns=['Country/Region', 'GDP ($)'])

In [None]:
#replace any placeholder strings for missing values with NaN
df4 = df4.replace('N.A.', np.nan)
df4 = df4.replace('N.', np.nan)

In [24]:
df4.tail()

| | Country/Region | GDP ($) |
|---|---|---|
| 184 | Sao Tome & Principe | 392570293 |
| 185 | Palau | 289823500 |
| 186 | Marshall Islands | 204173430 |
| 187 | Kiribati | 185572502 |
| 188 | Tuvalu | 39731317 |


In [25]:
df4.dtypes

Country/Region    object
GDP ($)            int64
dtype: object
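
The column is `int64` here because every GDP value parsed cleanly. If a cell ever held a placeholder such as 'N.A.', the `int(...)` call in the loop would raise a `ValueError` before the DataFrame was built. One alternative, sketched below, is to keep the cleaned strings and let `pd.to_numeric` convert them, turning anything non-numeric into `NaN`. The records used here, including the 'Example Land' entry, are made up for illustration.

```python
import pandas as pd

# Hypothetical raw records in which one GDP value is missing ('N.A.').
raw_records = [
    ("United States", "19485394000000"),
    ("Tuvalu", "39731317"),
    ("Example Land", "N.A."),  # made-up entry for illustration
]

df = pd.DataFrame(raw_records, columns=["Country/Region", "GDP ($)"])

# errors="coerce" turns anything that is not a number into NaN.
df["GDP ($)"] = pd.to_numeric(df["GDP ($)"], errors="coerce")

print(df["GDP ($)"].dtype)               # float64, because NaN forces a float column
print(int(df["GDP ($)"].isna().sum()))   # 1
```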

In [29]:
df4.to_csv('GDP Data.csv', index=False, encoding='utf-8')
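
Before using the file elsewhere, it can be worth reading it back to confirm that the columns and values survived the round trip. The following sketch does this with a small stand-in DataFrame and a temporary directory, both chosen for illustration so that the example leaves no file behind.

```python
import os
import tempfile

import pandas as pd

# Stand-in data; the values are taken from this notebook's outputs.
df = pd.DataFrame(
    [("Kiribati", 185572502), ("Tuvalu", 39731317)],
    columns=["Country/Region", "GDP ($)"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "GDP Data.csv")
    df.to_csv(path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(path, encoding="utf-8")

# index=False means no extra index column appears when the file is reloaded.
print(list(reloaded.columns))
print(reloaded["GDP ($)"].tolist())
```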