# Peer Review Assignment - Data Engineer - Webscraping


## Objectives

In this part we will:

*   Use webscraping to get bank information


For this exercise, we will use Python and several Python libraries. Some of these libraries may already be installed in your environment; others you may need to install yourself. Uncomment and run the lines in the cell below to install them.


In [8]:
# !pip install requests==2.26.0
# !pip install bs4
# !pip install html5lib

## Imports

Import any additional libraries you may need here.


In [4]:
from bs4 import BeautifulSoup
import html5lib
import requests
import pandas as pd
import warnings
import sys
import os
warnings.filterwarnings("ignore")
sys.path.append(os.path.abspath(os.path.join('../scripts')))

## Extract Data Using Web Scraping


The Wikipedia page [https://en.wikipedia.org/wiki/List_of_largest_banks](https://en.wikipedia.org/wiki/List_of_largest_banks?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-2022-01-01) lists the largest banks in the world by various parameters. Scrape the data from the table 'By market capitalization' and store it in a JSON file.


### Webpage Contents

Gather the contents of the webpage in text format using the `requests` library and assign it to the variable <code>html_data</code>.


In [69]:
# Write your code here
html_data = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
# html_data
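
A slightly more defensive fetch can help when a request fails or is blocked. The sketch below is not part of the assignment: the helper name `fetch_html`, the `User-Agent` string, and the timeout value are illustrative assumptions.

```python
import requests

URL = "https://en.wikipedia.org/wiki/List_of_largest_banks"


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on HTTP errors (hypothetical helper)."""
    # Assumption: a descriptive User-Agent; some sites reject requests without one.
    headers = {"User-Agent": "bank-scraper-exercise/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# html_data = fetch_html(URL)  # uncomment to fetch the live page
```

Calling `raise_for_status()` surfaces a failed request immediately instead of letting an error page flow into the parser.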

<b>Question 1</b> Print out the output of the following line, and remember it as it will be a quiz question:


In [70]:
html_data[101:124]

'List of largest banks -'

### Scraping the Data

<b>Question 2</b> Using the page contents and Beautiful Soup, load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Display the first five rows using `head`.


Using BeautifulSoup, parse the contents of the webpage.


In [71]:
# Parse the downloaded HTML with the html5lib parser
soup = BeautifulSoup(html_data, "html5lib")
# soup.prettify()
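
Selecting a table by position (for example the third `tbody`) depends on the page layout staying the same. As a possible alternative, the sketch below locates a table by the heading that precedes it. It runs on a small hand-written HTML snippet, not the live page, and assumes the heading text sits directly inside an `<h2>` tag; the real page markup may differ. It uses the built-in `html.parser` only to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the bank names and values are not real data.
sample_html = """
<h2>By total assets</h2>
<table><tbody><tr><td>1</td><td>Bank A</td><td>100.0</td></tr></tbody></table>
<h2>By market capitalization</h2>
<table><tbody><tr><td>1</td><td>Bank B</td><td>200.0</td></tr></tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Find the heading by its exact text, then take the first table that follows it.
heading = sample_soup.find("h2", string="By market capitalization")
table = heading.find_next("table")

first_row_cells = [td.text for td in table.find("tr").find_all("td")]
print(first_row_cells)
```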

Load the data from the `By market capitalization` table into a pandas dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Using the empty dataframe `data` and the given loop, extract the necessary data from each row and append it to the empty dataframe.


In [186]:
data = pd.DataFrame(columns=["Name", "Market Cap (US$ Billion)"])
data

Unnamed: 0,Name,Market Cap (US$ Billion)


In [187]:

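rows = []  # collect one dict per bank, then build the dataframe once
for row in soup.find_all('tbody')[2].find_all('tr'):
    col = row.find_all('td')
    if not col:
        continue  # rows without <td> cells (e.g. the header) carry no data
    try:
        # The second link in the name cell is used as the bank name
        bank_name = col[1].find_all('a')[1].text
        # Drop footnote markers such as "[1]" and thousands separators
        market_cap = float(col[2].text.split('[')[0].replace(',', ''))
    except IndexError:
        continue  # skip rows that do not match the expected layout
    print(bank_name)
    print(market_cap)
    rows.append({"Name": bank_name, "Market Cap (US$ Billion)": market_cap})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list
data = pd.DataFrame(rows, columns=data.columns)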
        

Industrial and Commercial Bank of China Limited
5866.0
China Construction Bank
4532.05
Agricultural Bank of China
4354.56
Bank of China
4113.36
JPMorgan Chase
3773.88
Mitsubishi UFJ Financial Group
3737.31
HSBC
2958.15
Bank of America
2434.08
BNP Paribas
2429.26
Crédit Agricole
2256.72
Japan Post Bank
1984.62
SMBC Group
1954.78
Citigroup Inc.
1951.16
Wells Fargo
1927.56
Mizuho Financial Group
1874.89
Banco Santander
1702.61
Société Générale
1522.05
Barclays
1510.14
Groupe BPCE
1501.59
Postal Savings Bank of China
1467.31
Deutsche Bank
1456.26
Bank of Communications
1422.63
Goldman Sachs
1200.0
Royal Bank of Canada
1116.31
Lloyds Banking Group
1104.42
Toronto-Dominion Bank
1102.04
China Merchants Bank
1065.25
Intesa Sanpaolo
1057.82
Norinchukin Bank
1011.14
ING Group
1000.72
Industrial Bank (China)
976.79
Crédit Mutuel
976.14
UBS
972.18
UniCredit
960.21
China Minsheng Bank
959.63
NatWest Group
957.6
Shanghai Pudong Development Bank
950.01
China CITIC Bank
904.02
Morgan Stanley
895.43
Sc
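
The row-extraction pattern used above can be tried offline on a small hand-written table. Everything in the sketch below is made up for illustration (the HTML, the bank names, the values); it only mirrors the assumed cell layout of rank, name cell with two links, and value.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table: the first link in each name cell stands in for a flag icon link.
sample_html = """
<table><tbody>
<tr><th>Rank</th><th>Bank name</th><th>Value</th></tr>
<tr><td>1</td><td><a href="#">flag</a> <a href="#">Example Bank</a></td><td>1,234.50[1]</td></tr>
<tr><td>2</td><td><a href="#">flag</a> <a href="#">Sample Bank</a></td><td>987.00</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in sample_soup.find("tbody").find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        continue  # the header row has <th> cells only
    name = cells[1].find_all("a")[1].text
    # Strip the footnote marker and the thousands separator before converting
    value = float(cells[2].text.split("[")[0].replace(",", ""))
    rows.append({"Name": name, "Market Cap (US$ Billion)": value})

# Build the dataframe once from the collected rows
sample_df = pd.DataFrame(rows, columns=["Name", "Market Cap (US$ Billion)"])
print(sample_df)
```

Collecting dictionaries in a list and constructing the dataframe at the end avoids growing a dataframe row by row.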

In [183]:
# dollar = '5,866.00\n'
# float(dollar.split()[0].replace(',', ""))

# Check the cleaning expression on a value without a footnote marker
dollar2 = '3773.88'
float(dollar2.split('[')[0].replace(',', ""))


3773.88
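
The cleaning steps tested in the cell above can be wrapped in a small function. This is a minimal sketch; the name `parse_market_cap` is hypothetical and the input strings are examples of the cell formats seen in this notebook.

```python
def parse_market_cap(text):
    """Convert a raw table cell to a float (hypothetical helper)."""
    cleaned = text.split("[")[0]                # drop a trailing footnote marker like "[2]"
    cleaned = cleaned.replace(",", "").strip()  # remove thousands separators and whitespace
    return float(cleaned)


print(parse_market_cap("5,866.00\n"))   # 5866.0
print(parse_market_cap("3773.88[2]"))   # 3773.88
```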

**Question 3** Display the first five rows using the `head` function.


In [188]:
#Write your code here
data.head()

Unnamed: 0,Name,Market Cap (US$ Billion)
0,Industrial and Commercial Bank of China Limited,5866.0
1,China Construction Bank,4532.05
2,Agricultural Bank of China,4354.56
3,Bank of China,4113.36
4,JPMorgan Chase,3773.88


In [189]:
data.dtypes

Name                         object
Market Cap (US$ Billion)    float64
dtype: object

### Loading the Data

Usually you would load the `pandas` dataframe created above into a JSON file named `bank_market_cap.json` using the `to_json()` function. This time, however, the data will be sent to another team, who will split the data file into two files and inspect it. If you save the data, it will interfere with the next part of the assignment.


In [192]:
# Write your code here
data.to_csv('../data/df_bank.csv')
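
For reference, the `to_json()` route mentioned above could look roughly like the sketch below. It uses made-up rows and writes no file (when no path is given, `to_json()` returns the JSON as a string), so it does not interfere with the next part of the assignment. The `orient="records"` choice is an assumption, not something the assignment specifies.

```python
import json

import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "Name": ["Example Bank", "Sample Bank"],
    "Market Cap (US$ Billion)": [1234.5, 987.0],
})

# With no path argument, to_json returns a string instead of writing a file
json_text = sample.to_json(orient="records")
records = json.loads(json_text)
print(records[0]["Name"])

# sample.to_json("bank_market_cap.json")  # would write the file to disk
```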

## Authors


Ramesh Sannareddy, Joseph Santarcangelo and Azim Hirjani


### Other Contributors


Rav Ahuja


## Change Log


| Date (YYYY-MM-DD) | Version | Changed By          | Change Description                 |
| ----------------- | ------- | ------------------- | ---------------------------------- |
| 2022-07-12        | 0.2     | Appalabhaktula Hema | Corrected the code and markdown    |
| 2020-11-25        | 0.1     | Ramesh Sannareddy   | Created initial version of the lab |


Copyright © 2020 IBM Corporation.
