# Hands-on Lab: Acquiring and Processing Information on the World's Largest Banks

Estimated Time: 60 mins
In this project, you will put the skills acquired throughout the course, along with your knowledge of basic Python, to the test. You will work with real-world data and perform the Extraction, Transformation, and Loading (ETL) operations as required.

Disclaimer:

Cloud IDE is not a persistent platform, and you will lose your progress every time you restart this lab. We recommend saving a copy of your file on your local machine as a protective measure against data loss.

Project Scenario:
You have been hired as a data engineer by a research organization. Your boss has asked you to write code that compiles the list of the top 10 largest banks in the world, ranked by market capitalization in billion USD. The data also needs to be transformed and stored in GBP, EUR and INR, using the exchange rate information made available to you as a CSV file. The processed table is to be saved locally as a CSV file and as a database table.

Your job is to create an automated system that generates this information, so that it can be run every financial quarter to prepare the report.

The particulars of the code are given below.

| Parameter | Value |
| --- | --- |
| Code name | banks_project.py |
| Data URL | https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks |
| Exchange rate CSV path | https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-PY0221EN-Coursera/labs/v2/exchange_rate.csv |
| Table Attributes (upon Extraction only) | Name, MC_USD_Billion |
| Table Attributes (final) | Name, MC_USD_Billion, MC_GBP_Billion, MC_EUR_Billion, MC_INR_Billion |
| Output CSV Path | ./Largest_banks_data.csv |
| Database name | Banks.db |
| Table name | Largest_banks |
| Log file | code_log.txt |

Project tasks
Task 1:
Write a function log_progress() to log the progress of the code at different stages in the file code_log.txt. Use the list of log points provided to create a log entry at every stage of the code.

Task 2:
Extract the tabular information from the given URL under the heading 'By market capitalization' and save it to a dataframe.
a. Inspect the webpage and identify the position and pattern of the tabular information in the HTML code
b. Write the code for a function extract() to perform the required data extraction.
c. Execute a function call to extract() to verify the output.
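
For step (a), the pattern to look for is a `<table>` whose first `<tr>` holds `<th>` header cells and whose remaining rows hold `<td>` data cells. A minimal sketch on a hand-written HTML fragment shows how that pattern maps to BeautifulSoup calls. The fragment and the name `sample_html` are illustrative assumptions, not a copy of the Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the header-row / data-row pattern of an HTML table
sample_html = """
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
  <tr><td>1</td><td>JPMorgan Chase</td><td>432.92</td></tr>
  <tr><td>2</td><td>Bank of America</td><td>231.52</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find_all("table")[0]  # position: first table in the document

# Header titles live in <th> cells
headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Data values live in <td> cells, one <tr> per record
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows.append(cells)

print(headers)
print(rows)
```

On the real page, the same two questions apply: which table index holds the data, and which cell in each row holds the name and the market cap.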

Task 3:
Transform the dataframe by adding columns for Market Capitalization in GBP, EUR and INR, rounded to 2 decimal places, based on the exchange rate information shared as a CSV file.
a. Write the code for a function transform() to perform the said task.
b. Execute a function call to transform() and verify the output.
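
The conversion itself is a multiplication by the exchange rate followed by rounding. The sketch below reads the rates into a dictionary keyed by currency code, so the result does not depend on the row order of the CSV. The column names `Currency` and `Rate`, the helper `convert`, and the rate values are assumptions for illustration (the values were picked to agree with the output shown later in this notebook). The real ones come from exchange_rate.csv.

```python
import csv
import io

# Assumed layout and values, for illustration only
sample_csv = "Currency,Rate\nEUR,0.93\nGBP,0.8\nINR,82.95\n"

# Map each currency code to its rate
rates = {row["Currency"]: float(row["Rate"])
         for row in csv.DictReader(io.StringIO(sample_csv))}

def convert(mc_usd_billion, currency):
    """Convert a USD market cap to `currency`, rounded to 2 decimal places."""
    return round(mc_usd_billion * rates[currency], 2)

for code in ("GBP", "EUR", "INR"):
    print(code, convert(432.92, code))
```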

Task 4:
Load the transformed dataframe to an output CSV file. Write a function load_to_csv(), execute a function call and verify the output.

Task 5:
Load the transformed dataframe to an SQL database server as a table. Write a function load_to_db(), execute a function call and verify the output.

Task 6:
Run queries on the database table. Write a function run_query(), execute a given set of queries and verify the output.
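
Queries of this kind can be tried on a throwaway in-memory database before pointing them at Banks.db. The sketch below uses only the standard-library sqlite3 module and three rows taken from the output shown later in this notebook. The column names follow the final attribute list given above, and the three queries are assumed examples of the kind the task asks for.

```python
import sqlite3

# In-memory database: nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Largest_banks (Name TEXT, MC_USD_Billion REAL, MC_GBP_Billion REAL)"
)
conn.executemany(
    "INSERT INTO Largest_banks VALUES (?, ?, ?)",
    [
        ("JPMorgan Chase", 432.92, 346.34),
        ("Bank of America", 231.52, 185.22),
        ("Industrial and Commercial Bank of China", 194.56, 155.65),
    ],
)

# Assumed example queries: full table, an aggregate, and a limited projection
queries = [
    "SELECT * FROM Largest_banks",
    "SELECT AVG(MC_GBP_Billion) FROM Largest_banks",
    "SELECT Name FROM Largest_banks LIMIT 2",
]

results = {}
for query in queries:
    results[query] = conn.execute(query).fetchall()
    print(query)
    print(results[query])

conn.close()
```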

Task 7:
Verify that the log entries have been completed at all stages by checking the contents of the file code_log.txt.
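
One way to make this check repeatable is to compare the log file against the messages you expect to find in it. The sketch below is self-contained: it writes a small sample log to a temporary directory and then checks it. The helper name `missing_log_entries`, the sample lines, and their timestamps are hypothetical. In the lab you would point the helper at code_log.txt and pass the log points you were given.

```python
import os
import tempfile

def missing_log_entries(log_path, expected_messages):
    """Return the expected messages that do not appear in the log file."""
    with open(log_path) as f:
        log_text = f.read()
    return [msg for msg in expected_messages if msg not in log_text]

# Build a sample log (illustrative content, not real lab output)
log_path = os.path.join(tempfile.mkdtemp(), "code_log.txt")
with open(log_path, "w") as f:
    f.write("2023-Sep-08-09:16:35 : Preliminaries complete. Initiating ETL process\n")
    f.write("2023-Sep-08-09:16:40 : Process Complete.\n")

expected = [
    "Preliminaries complete. Initiating ETL process",
    "Data saved to CSV file",
    "Process Complete.",
]
missing = missing_log_entries(log_path, expected)
print(missing)  # any message listed here was never logged
```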

CSV for currency

wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-PY0221EN-Coursera/labs/v2/exchange_rate.csv

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import sqlite3
from datetime import datetime

def log_progress(message):
    ''' This function logs the mentioned message at a given stage of the 
    code execution to a log file. Function returns nothing.'''

    timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second
    now = datetime.now() # get current timestamp
    timestamp = now.strftime(timestamp_format)
    # Append to the log file named in the project particulars
    with open("./code_log.txt","a") as f:
        f.write(timestamp + ' : ' + message + '\n')

def extract(url):
    ''' This function extracts the first table on the page at the given
    URL (the one under the heading 'By market capitalization') and
    returns it as a dataframe. '''

    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    table = soup.find_all('table')[0]

    # Column titles come from the <th> cells; the third title (the market
    # cap header) is replaced with the required attribute name
    titles = [title.text.strip() for title in table.find_all('th')]
    titles[2] = 'MC_USD_Billion'

    # Skip the header row, then append one dataframe row per table row
    df = pd.DataFrame(columns=titles)
    for row in table.find_all('tr')[1:]:
        individual_row_data = [data.text.strip() for data in row.find_all('td')]
        df.loc[len(df)] = individual_row_data
    df = df.drop(columns=['Rank'])

    # Data type conversions
    df['MC_USD_Billion'] = df['MC_USD_Billion'].astype(float)
    df['Bank name'] = df['Bank name'].astype(str)

    return df

def transform(df, exchange_rate_csv_path):
    ''' This function reads the exchange rate CSV file and adds three
    columns to the dataframe, holding the market capitalization converted
    to GBP, EUR and INR, rounded to 2 decimal places. Function returns
    the transformed dataframe. '''

    exc_rate_df = pd.read_csv(exchange_rate_csv_path)
    # Rates are looked up by row position in the CSV: 0 = EUR, 1 = GBP, 2 = INR
    df['MC_GBP_Billion'] = np.round(df['MC_USD_Billion'] * exc_rate_df['Rate'][1], 2)
    df['MC_EUR_Billion'] = np.round(df['MC_USD_Billion'] * exc_rate_df['Rate'][0], 2)
    df['MC_INR_Billion'] = np.round(df['MC_USD_Billion'] * exc_rate_df['Rate'][2], 2)

    return df

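def load_to_csv(df, csv_path):
    ''' This function saves the final dataframe as a CSV file at the
    provided path. Function returns nothing.'''

    df.to_csv(csv_path)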

def load_to_db(df, sql_connection, table_name):
    ''' This function saves the final dataframe to as a database table
    with the provided name. Function returns nothing.'''

    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)

def run_query(query_statement, sql_connection):
    ''' This function runs the stated query on the database table and
    prints the output on the terminal. Function returns nothing. '''

    print(query_statement)
    query_output = pd.read_sql(query_statement, sql_connection)
    print(query_output)




In [2]:

table_name="Largest_banks"

url="https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks"

csv_path="./Largest_banks_data.csv"
exchange_rate_csv_path="./exchange_rate.csv"
log_progress('Preliminaries complete. Initiating ETL process')

df = extract(url)

log_progress('Data extraction complete. Initiating Transformation process')

df = transform(df,exchange_rate_csv_path)
log_progress('Data transformation complete. Initiating loading process')

load_to_csv(df, csv_path)

log_progress('Data saved to CSV file')

sql_connection = sqlite3.connect('Banks.db')

log_progress('SQL Connection initiated.')

load_to_db(df, sql_connection, table_name)

log_progress('Data loaded to Database as table. Running the query')

all_query_statement = f"SELECT * FROM {table_name}"
log_progress('Running the query for all rows')
run_query(all_query_statement, sql_connection)

av_query_statement = f"SELECT AVG(MC_GBP_Billion) FROM {table_name}"
log_progress('Running the average query')
run_query(av_query_statement, sql_connection)

name_query_statement = f"SELECT `Bank name` FROM {table_name} LIMIT 5"
log_progress('Running the name query')
run_query(name_query_statement, sql_connection)

log_progress('Process Complete.')

sql_connection.close()



SELECT * FROM Largest_banks
                                 Bank name  MC_USD_Billion  MC_GBP_Billion  \
0                           JPMorgan Chase          432.92          346.34   
1                          Bank of America          231.52          185.22   
2  Industrial and Commercial Bank of China          194.56          155.65   
3               Agricultural Bank of China          160.68          128.54   
4                                HDFC Bank          157.91          126.33   
5                              Wells Fargo          155.87          124.70   
6                        HSBC Holdings PLC          148.90          119.12   
7                           Morgan Stanley          140.83          112.66   
8                  China Construction Bank          139.82          111.86   
9                            Bank of China          136.81          109.45   

   MC_EUR_Billion  MC_INR_Billion  
0          402.62        35910.71  
1          215.31        19204.58  
2    

# Answers to questions

In [3]:
df['MC_EUR_Billion'][4]

146.86

In [4]:
df

Unnamed: 0,Bank name,MC_USD_Billion,MC_GBP_Billion,MC_EUR_Billion,MC_INR_Billion
0,JPMorgan Chase,432.92,346.34,402.62,35910.71
1,Bank of America,231.52,185.22,215.31,19204.58
2,Industrial and Commercial Bank of China,194.56,155.65,180.94,16138.75
3,Agricultural Bank of China,160.68,128.54,149.43,13328.41
4,HDFC Bank,157.91,126.33,146.86,13098.63
5,Wells Fargo,155.87,124.7,144.96,12929.42
6,HSBC Holdings PLC,148.9,119.12,138.48,12351.26
7,Morgan Stanley,140.83,112.66,130.97,11681.85
8,China Construction Bank,139.82,111.86,130.03,11598.07
9,Bank of China,136.81,109.45,127.23,11348.39
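
The `Unnamed: 0` column in the table above is most likely the DataFrame index: `to_csv()` writes the index by default, and reading the file back turns that unlabeled first column into `Unnamed: 0`. A small sketch of the effect, using an in-memory buffer in place of a file and two rows from the table (the name `df_demo` is illustrative):

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"Bank name": ["JPMorgan Chase", "Bank of America"],
                        "MC_USD_Billion": [432.92, 231.52]})

# Default: the index is written as an extra, unlabeled first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

If the extra column is not wanted in Largest_banks_data.csv, passing `index=False` to `to_csv()` in `load_to_csv()` would leave it out.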


In [5]:


import requests
from bs4 import BeautifulSoup

# Fetch the page; 'url' is the archived Wikipedia URL defined in cell In [2]
page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')

# Take the first data row of the first table (rows[0] is the header row)
rows = soup.find_all('table')[0].find_all('tr')
table_row = rows[1] if len(rows) > 1 else None

if table_row:
    # Extract data from table cells (<td>)
    cells = table_row.find_all('td')

    # Extract the rank
    rank = cells[0].text.strip()

    # Extract the company name (cell text, as in extract())
    company_name = cells[1].text.strip()

    # Extract the market cap value
    market_cap = cells[2].text.strip()

    # Print the extracted data
    print(f"Rank: {rank}")
    print(f"Company Name: {company_name}")
    print(f"Market Cap: {market_cap}")
else:
    print("Table row not found.")