

# Acquiring and Processing Information on the World's Largest Banks

Estimated time needed: **60** minutes


## Project Scenario:

You have been hired as a data engineer by a research organization. Your boss has asked you to write code that compiles the list of the top 10 largest banks in the world, ranked by market capitalization in billion USD. The data must also be transformed and stored in GBP, EUR and INR, using the exchange rate information that has been made available to you as a CSV file. The processed table is to be saved locally as a CSV file and as a database table.

Your job is to create an automated system to generate this information so that the same can be executed in every financial quarter to prepare the report.


## Imports

Import libraries needed here.


In [1]:
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime

## Project tasks

### **Task 1:**

Write a function <code>log_progress()</code> to log the progress of the code at different stages in a file <code>code_log.txt</code>. Use the list of log points provided to create a log entry at every stage of the code.

In [2]:
def log_progress(message):
    ''' This function logs the mentioned message of a given stage of the
    code execution to a log file.'''

    timestamp_format = '%Y-%b-%d-%H:%M:%S' # Year-Monthname-Day-Hour:Minute:Second (%b is the documented abbreviated month name)
    now = datetime.now() # get current timestamp
    timestamp = now.strftime(timestamp_format)
    with open("code_log.txt","a") as f:
        f.write(timestamp + ',' + message + '\n')

    print(timestamp + ', ' + message + '\n')

### **Task 2:**
Extract the tabular information from the given URL under the heading 'By market capitalization' and save it to a dataframe.

1. Inspect the webpage and identify the position and pattern of the tabular information in the HTML code.

The URL and the column attributes are as follows:


In [3]:
url = 'https://web.archive.org/web/20230908091635/https://en.wikipedia.org/wiki/List_of_largest_banks'
# Raw strings keep the Windows backslashes literal and avoid invalid escape sequences
ex_path = r'C:\Users\kyoss\Desktop\COURSERA\IBM DATA ENGINEERING PROFESSIONAL\Course 03 - Python Project for Data Engineering\exchange_rate.csv'
output_path = r'C:\Users\kyoss\Desktop\COURSERA\IBM DATA ENGINEERING PROFESSIONAL\Course 03 - Python Project for Data Engineering\Largest_banks_data.csv'
db_name = 'Banks.db'
table_name = 'Largest_banks'
conn = sqlite3.connect(db_name)
table_attribs = ["Name","MC_USD_Billion"]
df_final = pd.DataFrame(columns=table_attribs)

2. Write the code for a function <code>extract()</code> to perform the required data extraction.

In [4]:
def extract(url, table_attribs):
    ''' This function aims to extract the required
    information from the website and save it to a data frame. The
    function returns the data frame for further processing. '''

    # Loading the webpage for webscraping
    html_page = requests.get(url).text
    data = BeautifulSoup(html_page, 'html.parser')

    # Scraping for required info
    tables = data.find_all('tbody')
    rows = tables[0].find_all('tr')

    df = pd.DataFrame(columns=table_attribs)

    for row in rows:
        col = row.find_all('td')
        if len(col)!=0:
            data_dict = {table_attribs[0]:col[1].text.strip(),
                         table_attribs[1]:float(col[2].text.strip())}
            df1 = pd.DataFrame(data_dict, index = [0])
            if df.empty:
                df=df1                
            else:
                df = pd.concat([df,df1], ignore_index = True)

    return df
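
The row-scanning pattern used by `extract()` can be tried offline on a small snippet. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without a network call, the `SAMPLE_HTML` markup is an assumed stand-in for the real page structure, and `RowCollector` is a hypothetical helper name. It shows why the header row (which has only `<th>` cells) is skipped by the `len(col) != 0` check and why the name and market cap sit at indices 1 and 2.

```python
from html.parser import HTMLParser

# Assumed sample markup: the structure is a guess at the page layout,
# the two data rows reuse values from the output shown in this notebook.
SAMPLE_HTML = """
<table><tbody>
<tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
<tr><td>1</td><td><a href="#">JPMorgan Chase</a></td><td>432.92
</td></tr>
<tr><td>2</td><td><a href="#">Bank of America</a></td><td>231.52
</td></tr>
</tbody></table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell strings per <tr>
        self._in_td = False  # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_td = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data


parser = RowCollector()
parser.feed(SAMPLE_HTML)

# The header row has no <td> cells, so its list is empty and is skipped,
# mirroring the len(col) != 0 check in extract().
records = [
    {"Name": cells[1].strip(), "MC_USD_Billion": float(cells[2].strip())}
    for cells in parser.rows
    if len(cells) != 0
]
print(records)
```

The `strip()` calls matter here because the last cell of each row carries a trailing newline, which is also the reason `extract()` strips the text before calling `float()`.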

3. Execute a function call to <code>extract()</code> to verify the output.

In [5]:
df = extract(url, table_attribs)
df

| | Name | MC_USD_Billion |
|---|---|---|
| 0 | JPMorgan Chase | 432.92 |
| 1 | Bank of America | 231.52 |
| 2 | Industrial and Commercial Bank of China | 194.56 |
| 3 | Agricultural Bank of China | 160.68 |
| 4 | HDFC Bank | 157.91 |
| 5 | Wells Fargo | 155.87 |
| 6 | HSBC Holdings PLC | 148.9 |
| 7 | Morgan Stanley | 140.83 |
| 8 | China Construction Bank | 139.82 |
| 9 | Bank of China | 136.81 |


### **Task 3:**

Transform the dataframe by adding columns for Market Capitalization in GBP, EUR and INR, rounded to 2 decimal places, based on the exchange rate information shared as a CSV file.
1. Write the code for a function <code>transform()</code> to perform the said task.

In [6]:
def transform(df, csv_path):
    ''' This function accesses the CSV file for exchange rate
    information, and adds three columns to the data frame, each
    containing the transformed version of Market Cap column to
    respective currencies'''

    # Read the exchange rates from the path passed in, indexed by currency code
    ex_df = pd.read_csv(csv_path, index_col=0)

    # Create new columns MC_GBP_Billion, MC_EUR_Billion, MC_INR_Billion
    df['MC_GBP_Billion'] = round(df['MC_USD_Billion'] * ex_df.loc['GBP','Rate'],2)
    df['MC_EUR_Billion'] = round(df['MC_USD_Billion'] * ex_df.loc['EUR','Rate'],2)
    df['MC_INR_Billion'] = round(df['MC_USD_Billion'] * ex_df.loc['INR','Rate'],2)
    return df
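
The conversion arithmetic can be checked by hand for a single bank. The sketch below is a standard-library illustration, not the notebook's method: the `Rate` column name comes from the code above, while the `Currency` header and the three rates (0.8, 0.93, 82.95) are assumptions inferred from the output displayed in this notebook rather than read from `exchange_rate.csv`. The helper names `read_rates` and `convert` are hypothetical.

```python
import csv
import io

# Assumed file contents: only the 'Rate' column name is taken from the
# notebook code; the 'Currency' header and the rates are inferred values.
EXCHANGE_CSV = """Currency,Rate
EUR,0.93
GBP,0.8
INR,82.95
"""


def read_rates(csv_text):
    """Return a {currency_code: rate} mapping from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Currency"]: float(row["Rate"]) for row in reader}


def convert(mc_usd_billion, rates):
    """Convert one USD market cap to GBP, EUR and INR, rounded to 2 places."""
    return {
        f"MC_{code}_Billion": round(mc_usd_billion * rates[code], 2)
        for code in ("GBP", "EUR", "INR")
    }


rates = read_rates(EXCHANGE_CSV)
converted = convert(432.92, rates)  # market cap of JPMorgan Chase
print(converted)
# {'MC_GBP_Billion': 346.34, 'MC_EUR_Billion': 402.62, 'MC_INR_Billion': 35910.71}
```

Under those assumed rates, 432.92 × 0.8 = 346.336, which rounds to 346.34, the GBP figure reported for JPMorgan Chase.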



2. Execute a function call to <code>transform()</code> and verify the output.

In [7]:
df1 = transform(df, ex_path)
df1

| | Name | MC_USD_Billion | MC_GBP_Billion | MC_EUR_Billion | MC_INR_Billion |
|---|---|---|---|---|---|
| 0 | JPMorgan Chase | 432.92 | 346.34 | 402.62 | 35910.71 |
| 1 | Bank of America | 231.52 | 185.22 | 215.31 | 19204.58 |
| 2 | Industrial and Commercial Bank of China | 194.56 | 155.65 | 180.94 | 16138.75 |
| 3 | Agricultural Bank of China | 160.68 | 128.54 | 149.43 | 13328.41 |
| 4 | HDFC Bank | 157.91 | 126.33 | 146.86 | 13098.63 |
| 5 | Wells Fargo | 155.87 | 124.7 | 144.96 | 12929.42 |
| 6 | HSBC Holdings PLC | 148.9 | 119.12 | 138.48 | 12351.26 |
| 7 | Morgan Stanley | 140.83 | 112.66 | 130.97 | 11681.85 |
| 8 | China Construction Bank | 139.82 | 111.86 | 130.03 | 11598.07 |
| 9 | Bank of China | 136.81 | 109.45 | 127.23 | 11348.39 |


### **Task 4:**
Load the transformed dataframe to an output CSV file. Write a function <code>load_to_csv()</code>, execute a function call and verify the output.

In [8]:
def load_to_csv(df, output_path):
    ''' This function saves the final data frame as a CSV file in
    the provided path. Function returns nothing.'''

    df.to_csv(output_path)


In [9]:
load_to_csv(df1, output_path)

### **Task 5:**
Load the transformed dataframe to an SQL database server as a table. Write a function <code>load_to_db()</code>, execute a function call and verify the output.

In [10]:
def load_to_db(df, sql_connection, table_name):
    ''' This function saves the final data frame to a database
    table with the provided name. Function returns nothing.'''

    
    df.to_sql(table_name, sql_connection, if_exists='replace', index=False)
    sql_connection.close()
    

In [11]:
conn = sqlite3.connect(db_name)
load_to_db(df1,conn, table_name)

### **Task 6:**
Run queries on the database table. Write a function <code>run_query()</code>, execute a given set of queries and verify the output.

In [12]:
def run_query(query_statement):
    ''' This function runs the query on the database table and
    prints the output on the terminal. Function returns nothing. '''

    conn = sqlite3.connect(db_name)
    # Let pandas take the column names from the query result itself, so the
    # function works for any SELECT, not only for a fixed five-column result
    result = pd.read_sql_query(query_statement, conn)
    conn.close()
    print(result)
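
Different queries return different numbers of columns, so any hard-coded column list only fits one query shape. As a rough sketch of how column names can be taken from the result instead, the example below uses only the standard `sqlite3` module on an in-memory database. The three-row table, the three queries and the helper name `fetch_with_columns` are illustrative assumptions, not the query set prescribed by the task.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Largest_banks "
    "(Name TEXT, MC_USD_Billion REAL, MC_GBP_Billion REAL)"
)
# Rows reuse values from the output shown in this notebook.
conn.executemany(
    "INSERT INTO Largest_banks VALUES (?, ?, ?)",
    [
        ("JPMorgan Chase", 432.92, 346.34),
        ("Bank of America", 231.52, 185.22),
        ("Industrial and Commercial Bank of China", 194.56, 155.65),
    ],
)


def fetch_with_columns(connection, query_statement):
    """Run a query and return (column_names, rows)."""
    cursor = connection.execute(query_statement)
    # cursor.description has one 7-tuple per result column; item 0 is its name.
    columns = [item[0] for item in cursor.description]
    return columns, cursor.fetchall()


results = {}
for query in (
    "SELECT * FROM Largest_banks",
    "SELECT Name FROM Largest_banks LIMIT 2",
    "SELECT COUNT(*) AS bank_count FROM Largest_banks",
):
    results[query] = fetch_with_columns(conn, query)
    print(query)
    print(results[query])

conn.close()
```

Each query reports its own column names (three, one and one columns respectively), which is the property a general-purpose query runner needs.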


### **Task 7:**
Verify that the log entries have been completed at all stages by checking the contents of the file <code>code_log.txt</code>.

In [13]:
''' Here, you define the required entities and call the relevant
functions in the correct order to complete the project. Note that this
portion is not inside any function.'''

log_progress("ETL Job Started")

# Extracting the data
log_progress("Extract phase Started")
extracted_data = extract(url,table_attribs)
log_progress("Extract phase Ended")

# Transforming the data
log_progress("Transform phase Started")
transformed_data = transform(extracted_data, ex_path)
log_progress("Transform phase Ended")

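# Loading the data to CSV and to the database
log_progress("Load phase Started")
load_to_csv(transformed_data, output_path)
conn = sqlite3.connect(db_name)  # fresh connection: load_to_db() closes the one it receives
load_to_db(transformed_data, conn, table_name)
log_progress("Load phase Ended")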

# Running SQL query
log_progress("Access SQL Program")

# Keep prompting until the user answers "n"
while True:
    answer = input("Do you want to run an SQL query? (y/n): ").strip().lower()
    if answer == "y":
        query = input("Please enter the query statement: ")
        run_query(query)
    elif answer == "n":
        break
    else:
        print("Wrong input! Please answer y/n")

log_progress("Close SQL Program")

log_progress("ETL Job Ended")

2024-May-18-14:10:19, ETL Job Started

2024-May-18-14:10:19, Extract phase Started

2024-May-18-14:10:20, Extract phase Ended

2024-May-18-14:10:20, Transform phase Started

2024-May-18-14:10:20, Transform phase Ended

2024-May-18-14:10:20, Load phase Started

2024-May-18-14:10:20, Load phase Ended

2024-May-18-14:10:20, Access SQL Program
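
One way to check that every stage was logged is to compare the log text against the list of expected messages. The sketch below is a minimal illustration: `missing_stages` is a hypothetical helper, the expected list is copied from the `log_progress()` calls above, and `SAMPLE_LOG` is the printed output shown in this notebook rather than the contents of `code_log.txt`.

```python
EXPECTED_STAGES = [
    "ETL Job Started",
    "Extract phase Started",
    "Extract phase Ended",
    "Transform phase Started",
    "Transform phase Ended",
    "Load phase Started",
    "Load phase Ended",
    "Access SQL Program",
    "Close SQL Program",
    "ETL Job Ended",
]


def missing_stages(log_text, expected=EXPECTED_STAGES):
    """Return the expected messages that never appear in the log text."""
    logged = set()
    for line in log_text.splitlines():
        if "," not in line:
            continue  # skip blank or malformed lines
        _timestamp, message = line.split(",", 1)
        logged.add(message.strip())
    return [stage for stage in expected if stage not in logged]


# Sample text copied from the printed output in this notebook.
SAMPLE_LOG = """2024-May-18-14:10:19, ETL Job Started
2024-May-18-14:10:19, Extract phase Started
2024-May-18-14:10:20, Extract phase Ended
2024-May-18-14:10:20, Transform phase Started
2024-May-18-14:10:20, Transform phase Ended
2024-May-18-14:10:20, Load phase Started
2024-May-18-14:10:20, Load phase Ended
2024-May-18-14:10:20, Access SQL Program
"""

still_missing = missing_stages(SAMPLE_LOG)
print(still_missing)
# ['Close SQL Program', 'ETL Job Ended']
```

For the output shown here, the last two stages are absent because the query prompt had not yet been answered with `n` when the output was captured. The same function can be applied to the text read from `code_log.txt`; since that file is opened in append mode, it accumulates entries from every run, so a presence check alone does not distinguish one run from another.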

