# Final Project -- World's Largest Banks

In this project, you will put the skills acquired throughout the course, along with your knowledge of basic Python, to the test. You will work with real-world data and perform Extraction, Transformation, and Loading (ETL) operations as required.

## Project Scenario

You have been hired as a data engineer by a research organization. Your boss has asked you to write code that compiles the list of the top 10 largest banks in the world, ranked by market capitalization in billion USD. The data must also be transformed and stored in GBP, EUR, and INR, according to the exchange rate information made available to you as a CSV file. The processed table is to be saved locally as a CSV file and as a database table.

Your job is to create an automated system that generates this information, so that it can be run every financial quarter to prepare the report.

The particulars of the code are given below.


| Parameter | Value |
| --------- | ----- |
| Code name | `banks_project.py` |
| Data URL | `https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks` |
| Exchange rate CSV path | `https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-PY0221EN-Coursera/labs/v2/exchange_rate.csv` |
| Table Attributes (upon Extraction only) | `Name`, `MC_USD_Billion` |
| Table Attributes (final) | `Name`, `MC_USD_Billion`, `MC_GBP_Billion`, `MC_EUR_Billion`, `MC_INR_Billion` |
| Output CSV Path | `./Largest_banks_data.csv` |
| Database name | `Banks.db` |
| Table name | `Largest_banks` |
| Log file | `code_log.txt` |


## Project Tasks

### Task 1:
Write a function `log_progress()` to log the progress of the code at different stages in a file `code_log.txt`. Use the list of log points provided to create log entries at every stage of the code.

### Task 2:
Extract the tabular information from the given URL under the heading 'By market capitalization' and save it to a dataframe.

a. Inspect the webpage and identify the position and pattern of the tabular information in the HTML code

b. Write the code for a function `extract()` to perform the required data extraction.

c. Execute a function call to `extract()` to verify the output.

### Task 3:
Transform the dataframe by adding columns for Market Capitalization in GBP, EUR and INR, rounded to 2 decimal places, based on the exchange rate information shared as a CSV file.

a. Write the code for a function `transform()` to perform the said task.

b. Execute a function call to `transform()` and verify the output.

### Task 4:
Load the transformed dataframe to an output CSV file. Write a function `load_to_csv()`, execute a function call and verify the output.

### Task 5:
Load the transformed dataframe to an SQL database server as a table. Write a function `load_to_db()`, execute a function call and verify the output.

### Task 6:
Run queries on the database table. Write a function `run_query()`, execute a given set of queries and verify the output.

### Task 7:
Verify that the log entries have been completed at all stages by checking the contents of the file `code_log.txt`.


<hr>

### IMPORTS & INITIALIZATIONS

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import os

In [2]:
### TO BE PUT IN main()

URL = "https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks"
DATABASE_NAME = "Banks.db"
DATABASE_TABLE_NAME = "Largest_banks"
INITIAL_TABLE_ATTRIBUTES = ["Name", "MC_USD_Billion"]
ADDITIONAL_TABLE_ATTRIBUTES = ["MC_GBP_Billion", "MC_EUR_Billion", "MC_INR_Billion"]
EXCHANGE_RATE_CSV_FILE = "exchange_rate.csv"
OUTPUT_CSV_FILE = "Largest_banks_data.csv"
LOG_FILE = "code_log.txt"

# log_progress("Preliminaries complete. Initiating ETL process")

## TASK 1:

<i>Write a function `log_progress()` to log the progress of the code at different stages in a file `code_log.txt`. Use the list of log points provided to create log entries at every stage of the code.</i>

Log messages:

| Task | Log message on completion |
| ---- | ------------------------- |
| Declaring known values | *Preliminaries complete. Initiating ETL process* |
| Call `extract()` function | *Data extraction complete. Initiating Transformation process* |
| Call `transform()` function | *Data transformation complete. Initiating Loading process* |
| Call `load_to_csv()` | *Data saved to CSV file* |
| Initiate SQLite3 connection | *SQL Connection initiated* |
| Call `load_to_db()` | *Data loaded to Database as a table, Executing queries* |
| Call `run_query()` | *Process Complete* |
| Close SQLite3 connection | *Server Connection closed* |

In [3]:
def log_progress(message):
    """ This function logs the mentioned message of a given stage of the
    code execution to a log file. Function returns nothing. """
    
    now = datetime.now()
    timestamp = now.strftime("%m-%d-%Y %H:%M:%S")

    full_log_message = f"{timestamp} -- {message}"

    with open(LOG_FILE, "a") as f:
        f.write(f"{full_log_message}\n")

    print(full_log_message)

## TASK 2:
<i>

Extract the tabular information from the given URL under the heading 'By market capitalization' and save it to a dataframe.

a. Inspect the webpage and identify the position and pattern of the tabular information in the HTML code

b. Write the code for a function `extract()` to perform the required data extraction.

c. Execute a function call to `extract()` to verify the output.

</i>

In [4]:
def extract(url, table_attribs):
    """ This function aims to extract the required
    information from the website and save it to a data frame. The
    function returns the data frame for further processing. """

    # set up dictionary with table column names (attributes) and initialize to empty list
    top_banks_dict = dict()

    for attr in table_attribs:
        top_banks_dict[attr] = []

    # extract html page & table
    html_page = requests.get(url).text
    soup = BeautifulSoup(html_page, "html.parser")

    tables = soup.find_all("table", {"class": "wikitable sortable mw-collapsible"})
    top_banks_table = tables[0]
    
    # extract rows & data
    rows = top_banks_table.find_all("tr")
    
    for row in rows:
        cell_data = row.find_all("td")

        # header rows hold only <th> cells, so they yield no <td> and are skipped
        if len(cell_data) > 0:
            for i, current_column_name in enumerate(table_attribs):
                current_cell_data_index = i + 1  # skip first column (Rank)

                # extract bank name
                if current_cell_data_index == 1:
                    current_cell_data = cell_data[1].find_all("a")[1].text
                else:
                    current_cell_data = cell_data[current_cell_data_index].contents[0]

                # slightly clean text (remove new line escape character, strip)
                current_cell_data = str(current_cell_data).strip()
                current_cell_data = current_cell_data.replace("\n", "")

                top_banks_dict[current_column_name].append(current_cell_data)
    
    df = pd.DataFrame(top_banks_dict)

    return df

df8411 = extract(URL, INITIAL_TABLE_ATTRIBUTES)
df8411

|   | Name | MC_USD_Billion |
| - | ---- | -------------- |
| 0 | JPMorgan Chase | 432.92 |
| 1 | Bank of America | 231.52 |
| 2 | Industrial and Commercial Bank of China | 194.56 |
| 3 | Agricultural Bank of China | 160.68 |
| 4 | HDFC Bank | 157.91 |
| 5 | Wells Fargo | 155.87 |
| 6 | HSBC Holdings PLC | 148.9 |
| 7 | Morgan Stanley | 140.83 |
| 8 | China Construction Bank | 139.82 |
| 9 | Bank of China | 136.81 |
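
Step (c) can also be carried out programmatically rather than by eye. The sketch below is not part of the original notebook: `check_extracted` is a hypothetical helper, and the small sample frame stands in for the result of `extract()`, whose cells are still strings at this stage.

```python
import pandas as pd


def check_extracted(df, expected_columns, expected_rows=10):
    """Hypothetical helper: return a list of problems found in the extracted frame."""
    problems = []
    if list(df.columns) != list(expected_columns):
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    # extract() stores every cell as text, so confirm the market cap parses as a number
    if "MC_USD_Billion" in df.columns:
        parsed = pd.to_numeric(df["MC_USD_Billion"], errors="coerce")
        if parsed.isna().any():
            problems.append("non-numeric market cap values found")
    return problems


# Stand-in for the first rows returned by extract(); values are strings at this stage
sample = pd.DataFrame({
    "Name": ["JPMorgan Chase", "Bank of America"],
    "MC_USD_Billion": ["432.92", "231.52"],
})
print(check_extracted(sample, ["Name", "MC_USD_Billion"], expected_rows=2))
```

An empty list means the frame has the expected columns, the expected number of rows, and market cap values that can safely be cast to `float` in the transformation step.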


## TASK 3:

<i>

Transform the dataframe by adding columns for Market Capitalization in GBP, EUR and INR, rounded to 2 decimal places, based on the exchange rate information shared as a CSV file.

a. Write the code for a function `transform()` to perform the said task.

b. Execute a function call to `transform()` and verify the output.

</i>

In [5]:
def transform(df, csv_path, additional_columns):
    """ This function accesses the CSV file for exchange rate
    information, and adds three columns to the data frame, each
    containing the transformed version of Market Cap column to
    respective currencies. """

    usd_market_cap_column_name = "MC_USD_Billion"

    # update data type of MC_USD_Billion column from string to float
    df[usd_market_cap_column_name] = df[usd_market_cap_column_name].astype(float)

    # extract exchange rate csv to DataFrame
    df_exchange_rates = pd.read_csv(csv_path)
    gbp_rate = get_exchange_rate(df_exchange_rates, "GBP")
    eur_rate = get_exchange_rate(df_exchange_rates, "EUR")
    inr_rate = get_exchange_rate(df_exchange_rates, "INR")
    
    # add 3 new columns for GBP, EUR, and INR currencies
    for col in additional_columns:
        if "GBP" in col:
            df[col] = np.round(df[usd_market_cap_column_name] * gbp_rate, 2)
        elif "EUR" in col:
            df[col] = np.round(df[usd_market_cap_column_name] * eur_rate, 2)
        elif "INR" in col:
            df[col] = np.round(df[usd_market_cap_column_name] * inr_rate, 2)

    return df


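def get_exchange_rate(df: pd.DataFrame, currency: str):
    """ This function returns the exchange rate for the given
    currency code from the exchange rate data frame. """
    return df.loc[df["Currency"] == currency, "Rate"].iloc[0]

# os.path.join picks the right path separator on every operating system
full_exchange_rates_csv_path = os.path.join(os.getcwd(), EXCHANGE_RATE_CSV_FILE)
df8411 = transform(df8411, full_exchange_rates_csv_path, ADDITIONAL_TABLE_ATTRIBUTES)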
df8411

|   | Name | MC_USD_Billion | MC_GBP_Billion | MC_EUR_Billion | MC_INR_Billion |
| - | ---- | -------------- | -------------- | -------------- | -------------- |
| 0 | JPMorgan Chase | 432.92 | 346.34 | 402.62 | 35910.71 |
| 1 | Bank of America | 231.52 | 185.22 | 215.31 | 19204.58 |
| 2 | Industrial and Commercial Bank of China | 194.56 | 155.65 | 180.94 | 16138.75 |
| 3 | Agricultural Bank of China | 160.68 | 128.54 | 149.43 | 13328.41 |
| 4 | HDFC Bank | 157.91 | 126.33 | 146.86 | 13098.63 |
| 5 | Wells Fargo | 155.87 | 124.7 | 144.96 | 12929.42 |
| 6 | HSBC Holdings PLC | 148.9 | 119.12 | 138.48 | 12351.26 |
| 7 | Morgan Stanley | 140.83 | 112.66 | 130.97 | 11681.85 |
| 8 | China Construction Bank | 139.82 | 111.86 | 130.03 | 11598.07 |
| 9 | Bank of China | 136.81 | 109.45 | 127.23 | 11348.39 |
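
Each new column is the USD figure multiplied by the currency's rate and rounded to 2 decimal places. The sketch below reproduces that arithmetic for a single row, as one possible variant of `transform()` that derives the column names from the currency codes. The helper `add_currency_columns` is hypothetical, and the rates in `RATES` are an assumption inferred from the output above, not values read from the exchange rate CSV.

```python
import pandas as pd

# Assumed rates per 1 USD, inferred from the output above rather than read from the CSV
RATES = {"GBP": 0.8, "EUR": 0.93, "INR": 82.95}


def add_currency_columns(df, rates, source_column="MC_USD_Billion"):
    """Hypothetical variant of transform(): add one rounded column per currency code."""
    result = df.copy()  # leave the caller's frame untouched
    for code, rate in rates.items():
        result[f"MC_{code}_Billion"] = (result[source_column] * rate).round(2)
    return result


banks = pd.DataFrame({"Name": ["JPMorgan Chase"], "MC_USD_Billion": [432.92]})
converted = add_currency_columns(banks, RATES)
print(converted.iloc[0].to_dict())
```

Keeping the rates in a dictionary means a further currency only needs a new entry, whereas the `if`/`elif` chain in `transform()` needs a new branch.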


## TASK 4:

<i>Load the transformed dataframe to an output CSV file. Write a function `load_to_csv()`, execute a function call and verify the output.</i>

In [9]:
def load_to_csv(df: pd.DataFrame, output_path):
    """ This function saves the final data frame as a CSV file in
    the provided path. Function returns nothing. """
    # os.path.join avoids the Windows-only "\\" separator
    df.to_csv(os.path.join(os.getcwd(), output_path), index=False)

load_to_csv(df8411, OUTPUT_CSV_FILE)
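
One way to verify the CSV output is to read the file back and compare it with the frame that was written. The sketch below is an illustration under assumptions, not part of the original notebook: it writes a two-row sample to a temporary directory instead of the project's output path.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({
    "Name": ["JPMorgan Chase", "Bank of America"],
    "MC_USD_Billion": [432.92, 231.52],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "Largest_banks_data.csv")
    frame.to_csv(csv_path, index=False)  # same call that load_to_csv() makes
    reloaded = pd.read_csv(csv_path)

# With index=False the file holds only the data columns, so the frames should match;
# assert_frame_equal raises an AssertionError if they do not
pd.testing.assert_frame_equal(reloaded, frame)
print(list(reloaded.columns), len(reloaded))
```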

## TASK 5:

<i>Load the transformed dataframe to an SQL database server as a table. Write a function `load_to_db()`, execute a function call and verify the output.</i>

In [10]:
def load_to_db(df: pd.DataFrame, sql_connection, table_name):
    """ This function saves the final data frame to a database
    table with the provided name. Function returns nothing. """
    df.to_sql(table_name, sql_connection, if_exists="replace", index=False)

conn = sqlite3.connect(DATABASE_NAME)
load_to_db(df8411, conn, DATABASE_TABLE_NAME)

## TASK 6:

<i>Run queries on the database table. Write a function `run_query()`, execute a given set of queries and verify the output.</i>

In [12]:
def run_query(query_statement, sql_connection):
    """ This function runs the query on the database table and
    prints the output on the terminal. Function returns nothing. """
    output = pd.read_sql(query_statement, sql_connection)

    print(f"QUERY EXECUTED: {query_statement}")
    print(f"OUTPUT:\n{output}")

run_query(f"SELECT * FROM {DATABASE_TABLE_NAME}", conn)
conn.close()

QUERY EXECUTED: SELECT * FROM Largest_banks
OUTPUT:
                                      Name  MC_USD_Billion  MC_GBP_Billion  \
0                           JPMorgan Chase          432.92          346.34   
1                          Bank of America          231.52          185.22   
2  Industrial and Commercial Bank of China          194.56          155.65   
3               Agricultural Bank of China          160.68          128.54   
4                                HDFC Bank          157.91          126.33   
5                              Wells Fargo          155.87          124.70   
6                        HSBC Holdings PLC          148.90          119.12   
7                           Morgan Stanley          140.83          112.66   
8                  China Construction Bank          139.82          111.86   
9                            Bank of China          136.81          109.45   

   MC_EUR_Billion  MC_INR_Billion  
0          402.62        35910.71  
1          215.31
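
The same pattern extends to more than one query. The sketch below runs a small list of queries against an in-memory SQLite database, so it leaves no `Banks.db` file behind. The two queries and the three-row sample are hypothetical stand-ins; substitute the set of queries supplied with the project.

```python
import sqlite3

import pandas as pd

table_name = "Largest_banks"
frame = pd.DataFrame({
    "Name": ["JPMorgan Chase", "Bank of America", "HDFC Bank"],
    "MC_GBP_Billion": [346.34, 185.22, 126.33],
})

# In-memory database: nothing is written to disk
conn = sqlite3.connect(":memory:")
frame.to_sql(table_name, conn, if_exists="replace", index=False)

# Hypothetical example queries, not the set provided with the project
queries = [
    f"SELECT Name FROM {table_name} LIMIT 2",
    f"SELECT AVG(MC_GBP_Billion) AS avg_gbp FROM {table_name}",
]
results = [pd.read_sql(query, conn) for query in queries]
conn.close()

for query, output in zip(queries, results):
    print(f"QUERY EXECUTED: {query}")
    print(f"OUTPUT:\n{output}")
```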

## TASK 7:

<i>Verify that the log entries have been completed at all stages by checking the contents of the file `code_log.txt`. Print out contents of log file.</i>
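
A minimal sketch of this check follows, assuming the log uses the `timestamp -- message` line format written by `log_progress()`. The helper `missing_log_messages` is hypothetical. It prints the log and reports which of the expected messages from the Task 1 table are absent. To keep the example self-contained, it runs against a throwaway log holding only the first two entries; in the project, pass the path of `code_log.txt` instead.

```python
import os
import tempfile
from datetime import datetime

# Messages from the Task 1 table, in the order the stages run
EXPECTED_MESSAGES = [
    "Preliminaries complete. Initiating ETL process",
    "Data extraction complete. Initiating Transformation process",
    "Data transformation complete. Initiating Loading process",
    "Data saved to CSV file",
    "SQL Connection initiated",
    "Data loaded to Database as a table, Executing queries",
    "Process Complete",
    "Server Connection closed",
]


def missing_log_messages(log_path, expected):
    """Hypothetical helper: print the log and return expected messages not found in it."""
    with open(log_path, encoding="utf-8") as f:
        contents = f.read()
    print(contents, end="")  # Task 7 asks for the contents to be printed
    return [message for message in expected if message not in contents]


# Demonstration on a throwaway log with only the first two entries written
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_log = os.path.join(tmp_dir, "code_log.txt")
    with open(demo_log, "a", encoding="utf-8") as f:
        for message in EXPECTED_MESSAGES[:2]:
            stamp = datetime.now().strftime("%m-%d-%Y %H:%M:%S")
            f.write(f"{stamp} -- {message}\n")
    missing = missing_log_messages(demo_log, EXPECTED_MESSAGES)

print(f"{len(missing)} expected entries missing")
```

An empty `missing` list indicates that every stage wrote its entry.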