# Practice Project -- Countries & GDP

In this practice project, you will apply the skills acquired in the course to build a complete ETL pipeline that extracts data from a website and processes it to meet the requirements below.

## Project Scenario

An international firm looking to expand its business into different countries across the world has hired you as a junior Data Engineer. Your task is to create an automated script that extracts the list of all countries in order of their GDP in billions of USD (rounded to 2 decimal places), as logged by the International Monetary Fund (IMF). Since the IMF releases this evaluation twice a year, the organization will rerun this code to extract the information as it is updated.

You can find the required data on this [webpage](https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29).

The required information must be made available as a JSON file '__Countries_by_GDP.json__' and as a table '__Countries_by_GDP__' in a database file '__World_Economies.db__', with attributes '`Country`' and '`GDP_USD_billion`'.

Your boss wants you to demonstrate the success of this code by running a query on the database table that displays only the entries with an economy of more than 100 billion USD. Also, log the entire execution process in a file named '__etl_project_log.txt__'.

### Imports

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import requests
import re
from datetime import datetime
from bs4 import BeautifulSoup

### Constants & Initializations

In [2]:
URL = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"
JSON_OUTPUT_FILE = "Countries_by_GDP.json"
LOG_FILE = "etl_project_log.txt"
DATABASE_NAME = "World_Economies.db"
DATABASE_TABLE_NAME = "Countries_by_GDP"
TABLE_COLUMNS = ["Country", "GDP_USD_billion"]
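
The scenario asks for the run to be logged to `etl_project_log.txt`. A minimal sketch of one way to do that is shown below. The helper name `log_progress` and the timestamp format are assumptions for illustration, not part of the original notebook.

```python
from datetime import datetime

LOG_FILE = "etl_project_log.txt"  # same file name as the constant defined above


def log_progress(message, log_file=LOG_FILE):
    """Append a timestamped message to the log file (hypothetical helper)."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_file, "a", encoding="utf-8") as f:
        f.write(f"{timestamp} : {message}\n")
```

A call such as `log_progress("Extraction started")` could then be placed before and after each ETL stage.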

### Load Table from Webpage

In [3]:
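response = requests.get(URL, timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
html_page = response.text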
data = BeautifulSoup(html_page, "html.parser")

### Load Rows & Headers/Columns

In [4]:
countries_gdp_dict = {
    "Country": [],
    "GDP_USD_billion": []
}

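try:
    # class_ is matched against the full class string, so this breaks if the page markup changes
    tables = data.find_all("table", class_="wikitable sortable static-row-numbers plainrowheaders srn-white-background")
    countries_gdp_table = tables[0]

    rows = countries_gdp_table.find_all("tr")
    countries_gdp_rows = rows[0]  # first row holds the column headers

except IndexError as exc:
    # stop here: the code below cannot run without the table
    raise RuntimeError("The table was not found or the incorrect table was selected.") from exc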



### HEADERS
headers_tags = countries_gdp_rows.find_all("th")
headers = []

for th in headers_tags:
    header_text = th.text
    
    # exclude any bracketed numeric annotations from header text (e.g. [1])
    header_text = re.sub(r"\[\d+\]", "", header_text)

    # remove new line escape character    
    header_text = header_text.replace("\n", "")
    
    headers.append(header_text)

### ADD ROWS TO DICTIONARY
for row in rows:
    # exclude header rows
    class_list = row.get_attribute_list("class")
    if len(class_list) > 0 and class_list[0] == "static-row-header":
        continue

    row_data = row.find_all("td")
    # skip rows that lack the country and IMF estimate cells
    if len(row_data) < 3:
        continue

    country_name = row_data[0].text.strip()
    gdp_imf = row_data[2].text.strip()
    
    countries_gdp_dict["Country"].append(country_name)
    countries_gdp_dict["GDP_USD_billion"].append(gdp_imf)


for k, v in countries_gdp_dict.items():
    print(f"{k} : {v}")

Country : ['United States', 'China', 'Japan', 'Germany', 'India', 'United Kingdom', 'France', 'Italy', 'Canada', 'Brazil', 'Russia', 'South Korea', 'Australia', 'Mexico', 'Spain', 'Indonesia', 'Netherlands', 'Saudi Arabia', 'Turkey', 'Switzerland', 'Taiwan', 'Poland', 'Argentina', 'Belgium', 'Sweden', 'Ireland', 'Thailand', 'Norway', 'Israel', 'Singapore', 'Austria', 'Nigeria', 'United Arab Emirates', 'Vietnam', 'Malaysia', 'Philippines', 'Bangladesh', 'Denmark', 'South Africa', 'Hong Kong', 'Egypt', 'Pakistan', 'Iran', 'Chile', 'Romania', 'Colombia', 'Czech Republic', 'Finland', 'Peru', 'Iraq', 'Portugal', 'New Zealand', 'Kazakhstan', 'Greece', 'Qatar', 'Algeria', 'Hungary', 'Kuwait', 'Ethiopia', 'Ukraine', 'Morocco', 'Slovakia', 'Ecuador', 'Dominican Republic', 'Puerto Rico', 'Kenya', 'Angola', 'Cuba', 'Oman', 'Guatemala', 'Bulgaria', 'Venezuela', 'Uzbekistan', 'Luxembourg', 'Tanzania', 'Turkmenistan', 'Croatia', 'Lithuania', 'Costa Rica', 'Uruguay', 'Panama', 'Ivory Coast', 'Sri Lan
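
The bracketed-annotation cleanup used in the header loop above can be tried in isolation. The sketch below assumes the annotations are purely numeric footnote markers such as `[1]`. The helper name `clean_header` and the sample strings in the usage note are illustrative, not taken from the page.

```python
import re

# Matches numeric footnote markers such as [1] or [13].
ANNOTATION_PATTERN = re.compile(r"\[\d+\]")


def clean_header(text):
    """Remove footnote markers and surrounding whitespace from a header cell."""
    return ANNOTATION_PATTERN.sub("", text).strip()
```

For example, `clean_header("IMF[1][13]\n")` returns `"IMF"`, while a header without annotations is returned unchanged apart from stripped whitespace.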

### Load Dictionary To DataFrame

In [5]:
countries_gdp_df = pd.DataFrame(countries_gdp_dict)
countries_gdp_df

|     | Country | GDP_USD_billion |
| --- | --- | --- |
| 0 | United States | 26854599 |
| 1 | China | 19373586 |
| 2 | Japan | 4409738 |
| 3 | Germany | 4308854 |
| 4 | India | 3736882 |
| ... | ... | ... |
| 208 | Anguilla | — |
| 209 | Kiribati | 248 |
| 210 | Nauru | 151 |
| 211 | Montserrat | — |


### Transform DataFrame
* Remove commas from GDP number
* Convert GDP from millions to billions of USD
* Round GDP to 2 decimal places
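
The three steps above can be sketched for a single value in plain Python before applying them to the whole column. The function name `to_billions` is hypothetical, and returning `None` for cells without a number (such as the `—` placeholder) is an assumption about how missing values should be handled.

```python
def to_billions(raw_gdp_millions):
    """Convert a GDP string in millions of USD to billions, rounded to 2 places.

    Returns None when the cell holds no number (e.g. the "—" placeholder).
    """
    digits = raw_gdp_millions.replace(",", "").strip()
    try:
        return round(float(digits) / 1000, 2)
    except ValueError:
        return None
```

For example, `to_billions("26,854,599")` gives `26854.6` and `to_billions("248")` gives `0.25`.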

*Remove commas*

In [6]:
countries_gdp_df["GDP_USD_billion"] = countries_gdp_df["GDP_USD_billion"].str.replace(",", "")

countries_gdp_df

|     | Country | GDP_USD_billion |
| --- | --- | --- |
| 0 | United States | 26854599 |
| 1 | China | 19373586 |
| 2 | Japan | 4409738 |
| 3 | Germany | 4308854 |
| 4 | India | 3736882 |
| ... | ... | ... |
| 208 | Anguilla | — |
| 209 | Kiribati | 248 |
| 210 | Nauru | 151 |
| 211 | Montserrat | — |


*Temporarily replace the missing values (`—`) with 0*

In [7]:
countries_gdp_df["GDP_USD_billion"] = countries_gdp_df["GDP_USD_billion"].str.replace("—", "0")

countries_gdp_df

|     | Country | GDP_USD_billion |
| --- | --- | --- |
| 0 | United States | 26854599 |
| 1 | China | 19373586 |
| 2 | Japan | 4409738 |
| 3 | Germany | 4308854 |
| 4 | India | 3736882 |
| ... | ... | ... |
| 208 | Anguilla | 0 |
| 209 | Kiribati | 248 |
| 210 | Nauru | 151 |
| 211 | Montserrat | 0 |


*Convert GDP_USD_billion column to `float`*

In [8]:
countries_gdp_df["GDP_USD_billion"] = countries_gdp_df["GDP_USD_billion"].astype(float)

countries_gdp_df

|     | Country | GDP_USD_billion |
| --- | --- | --- |
| 0 | United States | 26854599.0 |
| 1 | China | 19373586.0 |
| 2 | Japan | 4409738.0 |
| 3 | Germany | 4308854.0 |
| 4 | India | 3736882.0 |
| ... | ... | ... |
| 208 | Anguilla | 0.0 |
| 209 | Kiribati | 248.0 |
| 210 | Nauru | 151.0 |
| 211 | Montserrat | 0.0 |


*Convert GDP_USD_billion column from millions to billions*

In [9]:
countries_gdp_df["GDP_USD_billion"] = countries_gdp_df["GDP_USD_billion"] * 0.001

countries_gdp_df

|     | Country | GDP_USD_billion |
| --- | --- | --- |
| 0 | United States | 26854.599 |
| 1 | China | 19373.586 |
| 2 | Japan | 4409.738 |
| 3 | Germany | 4308.854 |
| 4 | India | 3736.882 |
| ... | ... | ... |
| 208 | Anguilla | 0.000 |
| 209 | Kiribati | 0.248 |
| 210 | Nauru | 0.151 |
| 211 | Montserrat | 0.000 |


*Round GDP_USD_billion column to 2 decimal places*

In [10]:
countries_gdp_df["GDP_USD_billion"] = np.round(countries_gdp_df["GDP_USD_billion"], 2)

countries_gdp_df

|     | Country | GDP_USD_billion |
| --- | --- | --- |
| 0 | United States | 26854.60 |
| 1 | China | 19373.59 |
| 2 | Japan | 4409.74 |
| 3 | Germany | 4308.85 |
| 4 | India | 3736.88 |
| ... | ... | ... |
| 208 | Anguilla | 0.00 |
| 209 | Kiribati | 0.25 |
| 210 | Nauru | 0.15 |
| 211 | Montserrat | 0.00 |


*Re-instate the missing values as a `-` placeholder (this turns the column into a mix of `float` and `str` values)*

In [11]:
countries_gdp_df.loc[countries_gdp_df["GDP_USD_billion"] == 0, "GDP_USD_billion"] = "-"

countries_gdp_df

|     | Country | GDP_USD_billion |
| --- | --- | --- |
| 0 | United States | 26854.6 |
| 1 | China | 19373.59 |
| 2 | Japan | 4409.74 |
| 3 | Germany | 4308.85 |
| 4 | India | 3736.88 |
| ... | ... | ... |
| 208 | Anguilla | - |
| 209 | Kiribati | 0.25 |
| 210 | Nauru | 0.15 |
| 211 | Montserrat | - |
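
Replacing missing values with 0 and later with a `-` string leaves the column holding mixed types. One possible alternative, sketched here on a small sample Series rather than the real DataFrame, is to let pandas coerce non-numeric cells to `NaN` so the column stays numeric. The variable names and sample values are illustrative assumptions.

```python
import pandas as pd

# Sample raw values in millions of USD, including the "—" placeholder.
raw = pd.Series(["26,854,599", "—", "248"])

# Strip thousands separators, then coerce anything non-numeric to NaN.
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Convert millions to billions and round to 2 decimal places.
gdp_billion = (cleaned / 1000).round(2)
```

With this approach the missing entries remain `NaN` throughout, so no temporary zero or placeholder string is needed.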


### Load to JSON

In [12]:
countries_gdp_df.to_json(JSON_OUTPUT_FILE, indent=4, orient="records")
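
With `orient="records"`, the JSON file holds a list with one object per row. The sketch below reproduces that shape with the standard `json` module on two sample rows, which are illustrative and typed in by hand rather than read from the DataFrame.

```python
import json

# Two hand-written sample rows in the "records" layout: one object per row.
records = [
    {"Country": "United States", "GDP_USD_billion": 26854.6},
    {"Country": "Anguilla", "GDP_USD_billion": "-"},
]

json_text = json.dumps(records, indent=4)
round_tripped = json.loads(json_text)
```

Reading the file back this way is a quick check that every entry carries both the `Country` and `GDP_USD_billion` keys.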

### Load to SQLite Database

In [13]:
conn = sqlite3.connect(DATABASE_NAME)

countries_gdp_df.to_sql(DATABASE_TABLE_NAME, conn, if_exists="replace", index=False)

213
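
As a point of comparison, the same load step can be sketched with the standard `sqlite3` module alone, using an explicit `REAL` column and `NULL` for missing GDP values. This is an illustrative alternative on an in-memory database with two sample rows, not the notebook's method.

```python
import sqlite3
from contextlib import closing

# Sample rows; None is stored as NULL in SQLite.
records = [("United States", 26854.6), ("Anguilla", None)]

with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billion REAL)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO Countries_by_GDP VALUES (?, ?)", records)
    stored = conn.execute(
        "SELECT Country, GDP_USD_billion FROM Countries_by_GDP ORDER BY rowid"
    ).fetchall()
```

Declaring the column as `REAL` means later queries can compare it to a number directly, without a `CAST`.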

### Query Countries with > 100 Billion GDP

In [15]:
query = f"SELECT * FROM {DATABASE_TABLE_NAME} WHERE CAST(GDP_USD_billion AS decimal) > 100"
output = pd.read_sql(query, conn)

print(f"QUERY EXECUTED: {query}\nOUTPUT:\n{output}")

conn.close()

QUERY EXECUTED: SELECT * FROM Countries_by_GDP WHERE CAST(GDP_USD_billion AS decimal) > 100
OUTPUT:
          Country GDP_USD_billion
0   United States         26854.6
1           China        19373.59
2           Japan         4409.74
3         Germany         4308.85
4           India         3736.88
..            ...             ...
64          Kenya          118.13
65         Angola          117.88
66           Oman           104.9
67      Guatemala          102.31
68       Bulgaria          100.64

[69 rows x 2 columns]
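
The query relies on `CAST` because the column holds both numbers and the `-` placeholder. The sketch below, on an in-memory table with three sample rows, shows that SQLite casts the placeholder to 0 (so it is filtered out) and passes the threshold as a bound parameter. The `threshold` variable is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Countries_by_GDP (Country TEXT, GDP_USD_billion TEXT)")
conn.executemany(
    "INSERT INTO Countries_by_GDP VALUES (?, ?)",
    [("Bulgaria", "100.64"), ("Kiribati", "0.25"), ("Anguilla", "-")],
)

threshold = 100  # billions of USD
rows = conn.execute(
    "SELECT Country, CAST(GDP_USD_billion AS REAL) FROM Countries_by_GDP "
    "WHERE CAST(GDP_USD_billion AS REAL) > ? ORDER BY Country",
    (threshold,),
).fetchall()

# Non-numeric text casts to 0.0 in SQLite.
placeholder_value = conn.execute("SELECT CAST('-' AS REAL)").fetchone()[0]
conn.close()
```

Only the Bulgaria row passes the filter here, and the placeholder evaluates to `0.0`.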
