## The intent of this exercise is to show you the Big Data Pipeline

We will obtain data from the web, clean it, visualize it, and then store it in CSV format, which can later be loaded into cloud storage or a database.

Our intent is to obtain the current GDP of each country together with its population statistics. Both are available from www.worldometers.info, but as two separate data sets. We are going to read both sets, clean them, and combine them.

#### Install required packages

In [1]:
!pip install requests
!pip install beautifulsoup4



#### Import the packages required for this exercise

In [2]:
import pandas as pd
from IPython.display import HTML
import requests
from bs4 import BeautifulSoup
import bs4
import json
from datetime import date

#### From the remote URL https://www.worldometers.info/gdp/gdp-per-capita/ read the GDP data we require. requests.get(URL) returns a response object whose content attribute holds the body of the page. Visit the URL in your browser to familiarise yourself with the data.

In [3]:

URL="https://www.worldometers.info/gdp/gdp-per-capita/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
stats_tbl = soup.find("table")
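The cell above assumes that the request succeeds and that the page contains a table. A minimal sketch of a more defensive fetch follows; the helper name `fetch_first_table`, the 10-second timeout, and the injectable `get` parameter are illustrative assumptions and not part of the original exercise.

```python
import requests
from bs4 import BeautifulSoup


def fetch_first_table(url, get=requests.get, timeout=10):
    """Download `url` and return its first <table> element (hypothetical helper)."""
    # A timeout stops the notebook from hanging on an unresponsive server.
    response = get(url, timeout=timeout)
    # Raise an error for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError(f"No <table> element found at {url}")
    return table
```

With this helper, the cell above could be reduced to `stats_tbl = fetch_first_table(URL)`.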


#### Parse the string data to only extract the information we require

In [4]:

def parse_table(tbl,cols):
    rows = []
    trows = tbl.find_all("tr")
    for tr in trows[1:]:
        row = []
        for td in tr.children:
            if isinstance(td,bs4.element.Tag):
                for data in td.children:
                    if isinstance(data,bs4.element.Tag):
                        for innerHTML in data.children:
                            row.append(innerHTML)
                    else:
                        row.append(data)
        rows.append(row)
    
    return pd.DataFrame(rows,columns=cols)

df = parse_table(stats_tbl,["Ranking","Country","GDP (PPP) per capita","GDP (nominal) per capita","vs. World PPP GDP per capita"])
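`parse_table` walks the children of every cell by hand, so it depends on exactly how deeply each cell is nested. As a sketch of a simpler alternative, BeautifulSoup's `get_text` can flatten a cell regardless of nesting. The HTML snippet and the helper name `table_to_rows` below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>#</th><th>Country</th><th>GDP (PPP) per capita</th></tr>
  <tr><td>1</td><td><a href="/qatar">Qatar</a></td><td>$128,647</td></tr>
  <tr><td>2</td><td><a href="/macao">Macao</a></td><td>$115,367</td></tr>
</table>
"""


def table_to_rows(tbl):
    """Return the text of every data cell, one list per table row (hypothetical helper)."""
    rows = []
    for tr in tbl.find_all("tr")[1:]:  # skip the header row, as parse_table does
        # get_text flattens nested tags such as <a>; strip=True trims whitespace.
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return rows


sample_tbl = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
sample_rows = table_to_rows(sample_tbl)
print(sample_rows)
```

The resulting list of rows can be passed to `pd.DataFrame(rows, columns=cols)` in the same way as in `parse_table`.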


#### Print the head of the dataframe object to see what the data collected looks like

In [5]:
df.head()

| | Ranking | Country | GDP (PPP) per capita | GDP (nominal) per capita | vs. World PPP GDP per capita |
|---|---|---|---|---|---|
| 0 | 1 | Qatar | $128,647 | $61,264 | 752% |
| 1 | 2 | Macao | $115,367 | $80,890 | 675% |
| 2 | 3 | Luxembourg | $107,641 | $105,280 | 629% |
| 3 | 4 | Singapore | $94,105 | $56,746 | 550% |
| 4 | 5 | Brunei | $79,003 | $28,572 | 462% |


#### Clean and Process the data to suit our purpose

In [6]:
# The values carry a "$" sign and "," separators, so they must be stripped before converting to numbers.
# regex=False makes "$" and "N.A." match literally instead of being read as regular expressions.
df['GDP (PPP) per capita'] = df['GDP (PPP) per capita'].str.replace("$","",regex=False)
df['GDP (PPP) per capita'] = df['GDP (PPP) per capita'].str.replace(",","",regex=False)
df['GDP (PPP) per capita'] = df['GDP (PPP) per capita'].str.replace("N.A.","0",regex=False)
df['GDP (PPP) per capita'] = df['GDP (PPP) per capita'].astype(int)





In [7]:
# Same cleaning for the nominal GDP column; regex=False keeps "$" and "N.A." literal.
df['GDP (nominal) per capita'] = df['GDP (nominal) per capita'].str.replace("$","",regex=False)
df['GDP (nominal) per capita'] = df['GDP (nominal) per capita'].str.replace(",","",regex=False)
df['GDP (nominal) per capita'] = df['GDP (nominal) per capita'].str.replace("N.A.","0",regex=False)
df['GDP (nominal) per capita'] = df['GDP (nominal) per capita'].astype(int)
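The two cells above apply the same three replacements to different columns. One possible way to avoid the repetition is a small helper, sketched here under the assumption that every value is either a dollar amount or the string "N.A."; the name `currency_to_int` and the sample series are illustrative only.

```python
import pandas as pd


def currency_to_int(series):
    """Convert strings such as "$128,647" or "N.A." to integers (hypothetical helper)."""
    cleaned = (
        series.str.replace("$", "", regex=False)       # literal dollar sign
              .str.replace(",", "", regex=False)       # thousands separators
              .str.replace("N.A.", "0", regex=False)   # missing values become 0, as above
    )
    return cleaned.astype(int)


sample = pd.Series(["$128,647", "$61,264", "N.A."])
converted = currency_to_int(sample)
print(converted.tolist())
```

Each GDP column could then be cleaned in one line, for example `df['GDP (PPP) per capita'] = currency_to_int(df['GDP (PPP) per capita'])`.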



#### Visualize the data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

gdps = df['GDP (PPP) per capita'][0:20]
countries = df['Country'][0:20]

plt.figure(figsize=(10,5))
plot = sns.barplot(x=countries, y=gdps)  # recent seaborn versions require keyword arguments

plot.set_xticklabels(countries, rotation=90)

plt.show()

#### Obtain the information on population. Just as we did earlier, we obtain the information from http://worldometers.info/world-population. Feel free to visit the URL and familiarise yourself with the data.

In [None]:
URL="http://worldometers.info/world-population"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
stats_tbl = soup.find(id="popbycountry")


#### Parse the data and generate a dataframe from the data available in the link

In [None]:
df2 = parse_table(stats_tbl,["Pop Rank","Country","Population 2020","Yearly Change","Net Change", "People per KMSq","Land Area","Migrants","Fertility Rate","Median Age","Urban Population","World Share"])


#### See the newly created dataframe

In [None]:
df2.head(10)

#### Clean and process the data in desired format

In [None]:
df2['Population 2020'] = df2['Population 2020'].str.replace(",","")
df2['Population 2020'] = df2['Population 2020'].astype(int)

df2['Net Change'] = df2['Net Change'].str.replace(",","")
df2['Net Change'] = df2['Net Change'].astype(int)


df2['Land Area'] = df2['Land Area'].str.replace(",","")
df2['Land Area'] = df2['Land Area'].astype(int)

df2['Migrants'] = df2['Migrants'].str.replace(",","")
df2['Migrants'] = df2['Migrants'].replace(" ", "0")
df2['Migrants'] = df2['Migrants'].astype(int)

df2['Median Age'] = df2['Median Age'].replace("N.A.", "0")

df2['Median Age'] = df2['Median Age'].astype(int)

df2['Pop Rank'] = df2['Pop Rank'].astype(int)


#### Merge the two dataframes based on the column which has the same value - "Country"

In [None]:
df_new = df.merge(df2,on="Country")
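By default, `merge` performs an inner join, so a country that is missing from one table, or spelled differently in the two tables, is dropped without any warning. The sketch below shows one way to list such rows using the `how="outer"` and `indicator=True` options of `merge`. The two small frames and their numbers are placeholders, not real data.

```python
import pandas as pd

# Placeholder frames standing in for df (GDP) and df2 (population).
gdp = pd.DataFrame({"Country": ["Qatar", "Macao", "Brunei"],
                    "GDP (PPP) per capita": [100, 200, 300]})
pop = pd.DataFrame({"Country": ["Qatar", "Brunei", "Tuvalu"],
                    "Pop Rank": [1, 2, 3]})

# An outer join keeps every country; the _merge column records where each row came from.
check = gdp.merge(pop, on="Country", how="outer", indicator=True)

# Rows marked left_only or right_only would be lost by the default inner join.
unmatched = check.loc[check["_merge"] != "both", ["Country", "_merge"]]
print(unmatched)
```

Running the same check on `df` and `df2` before the merge shows which country names would need to be reconciled.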

#### Check the data that is present in the newly formed dataframe, which has merged the GDP data and the population data

In [None]:
df_new.head(2).transpose()

#### Visualize the GDP data just as before, but this time in the order of population ranking. 

In [None]:
df_temp = df_new.sort_values(["Pop Rank"], axis=0, 
                 ascending=True) 
gdps = df_temp['GDP (PPP) per capita'][0:20]
countries = df_temp['Country'][0:20]

plt.figure(figsize=(10,5))
plot = sns.barplot(x=countries, y=gdps)  # recent seaborn versions require keyword arguments

plot.set_xticklabels(countries, rotation=90)

plt.show()

#### Save the data to a CSV file with today's date in the file name.

In [None]:
timestamp = date.today().strftime("%d_%m_%Y")
df_new.to_csv("Details"+timestamp+".csv")
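Two optional refinements to the save step are sketched below: an ISO-formatted date, so that file names sort chronologically, and `index=False`, so that the row index is not written as an extra unnamed column. The sample frame and the name `csv_name` are illustrative assumptions, and an in-memory buffer is used in place of a file so that the example leaves nothing on disk.

```python
import io
from datetime import date

import pandas as pd

# Placeholder frame standing in for df_new.
sample = pd.DataFrame({"Country": ["Qatar", "Macao"],
                       "GDP (PPP) per capita": [128647, 115367]})

# ISO 8601 dates (YYYY-MM-DD) sort chronologically when file names are listed.
csv_name = "Details_" + date.today().isoformat() + ".csv"

# index=False leaves out the row index, so reloading gives back the same columns.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(csv_name, list(reloaded.columns))
```

To write to disk instead, pass `csv_name` to `to_csv` in place of `buffer`.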