# Data Engineer - Webscraping


## Imports

In [1]:
from bs4 import BeautifulSoup
import html5lib
import requests
import pandas as pd

## Extracting Data Using Web Scraping


The Wikipedia page https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks lists the largest banks in the world by various measures.
Scrape the data from the 'By market capitalization' table and store it in a JSON file.


### Webpage Contents

Fetch the contents of the webpage as text using the `requests` library and assign them to the variable `html_data`.


In [2]:
url = "https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks"
response = requests.get(url)

if response.status_code == 200:
    html_data = response.text
else:
    print(f"Error fetching the data. Status code: {response.status_code}")
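
If the request fails, the cell above only prints a message, so `html_data` is never defined and the later cells fail with a `NameError`. A minimal sketch of a stricter variant follows. The helper name `fetch_html`, its `timeout` default, and the injectable `get` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=30, get=requests.get):
    """Return the page body as text, raising on HTTP errors.

    `get` is a hypothetical hook that defaults to requests.get; it exists
    only so the function can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so the caller never continues without page contents.
    response.raise_for_status()
    return response.text
```

In the notebook this could be used as `html_data = fetch_html(url)`, so a failed download stops execution at the cell that caused it.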

### Scraping the Data

Using the page contents and `BeautifulSoup`, load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Display the first five rows using `head`.


Parse the contents of the webpage using `BeautifulSoup`.


In [3]:
soup = BeautifulSoup(html_data, "html5lib")
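
Once the page is parsed, the table still has to be located. Selecting it by position is brittle if the page layout changes, so the sketch below looks it up through its section heading instead. It runs on a tiny stand-in document: the `mw-headline` markup, the `By_market_capitalization` id, and the example bank names are assumptions for illustration, not content copied from the archived page.

```python
from bs4 import BeautifulSoup

# Stand-in markup (assumed structure, hypothetical data).
sample_html = """
<h2><span class="mw-headline" id="By_total_assets">By total assets</span></h2>
<table><tbody>
<tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr>
<tr><td>1</td><td>Example Bank A</td><td>100.0</td></tr>
</tbody></table>
<h2><span class="mw-headline" id="By_market_capitalization">By market capitalization</span></h2>
<table><tbody>
<tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
<tr><td>1</td><td>Example Bank B</td><td>50.0</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Find the heading element by id, then take the first table that follows it.
heading = sample_soup.find(id="By_market_capitalization")
table = heading.find_next("table")

# Keep only rows that contain <td> cells, which skips the header row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")
]
print(rows)
```

Whether the same id exists on the real page should be checked in the page source before relying on it.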

Load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Using the empty dataframe `data` and the given loop, extract the necessary data from each row and append it to the dataframe.


In [4]:
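data = pd.DataFrame(columns=["Name", "Market Cap (US$ Billion)"])

bank_names = []
market_caps = []

# The third <tbody> in the parsed page is the 'By market capitalization' table.
for row in soup.find_all('tbody')[2].find_all('tr'):
    col = row.find_all('td')

    # Header rows have no <td> cells. Data rows need at least three,
    # because col[2] is read below.
    if len(col) >= 3:
        bank_name = col[1].text.strip()
        market_cap = col[2].text.strip()

        bank_names.append(bank_name)
        market_caps.append(market_cap)

# Build the dataframe once from the collected lists.
data = {'Bank Name': bank_names, 'Market Cap (US$ Billion)': market_caps}
df = pd.DataFrame(data)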

Display the first five rows using the `head` function.


In [5]:
print(df.head())

                                 Bank Name Market Cap (US$ Billion)
0                           JPMorgan Chase                  390.934
1  Industrial and Commercial Bank of China                  345.214
2                          Bank of America                  325.331
3                              Wells Fargo                  308.013
4                  China Construction Bank                  257.399
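
The market cap values above come from `.text.strip()`, so the column holds strings rather than numbers. A minimal sketch of converting such a column to floats with `pd.to_numeric` follows. It uses a small stand-in dataframe, `df_sample` (a name introduced here), built from two of the rows shown above.

```python
import pandas as pd

# Stand-in for the scraped dataframe, using two rows from the output above.
df_sample = pd.DataFrame({
    "Bank Name": ["JPMorgan Chase", "Industrial and Commercial Bank of China"],
    "Market Cap (US$ Billion)": ["390.934", "345.214"],
})

# errors="coerce" turns any value that cannot be parsed into NaN
# instead of raising an exception.
df_sample["Market Cap (US$ Billion)"] = pd.to_numeric(
    df_sample["Market Cap (US$ Billion)"], errors="coerce"
)
print(df_sample.dtypes)
```

With a numeric column, sorting and aggregation behave as expected, and the values are written to JSON as numbers rather than quoted strings.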



### Loading the Data

Write the `pandas` dataframe created above to a JSON file named `bank_market_cap.json` using the `to_json()` function.


In [6]:
json_file_name = "bank_market_cap.json"

df.to_json(json_file_name, orient="records")
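
With `orient="records"`, the file contains a JSON array with one object per dataframe row. The sketch below checks this by writing a small stand-in dataframe to a temporary directory and reading it back with the standard `json` module. `df_sample` and the temporary path are assumptions made so the example does not overwrite the real output file.

```python
import json
import os
import tempfile

import pandas as pd

# Stand-in for the scraped dataframe, using two rows from the output above.
df_sample = pd.DataFrame({
    "Bank Name": ["JPMorgan Chase", "Bank of America"],
    "Market Cap (US$ Billion)": ["390.934", "325.331"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bank_market_cap.json")
    df_sample.to_json(path, orient="records")

    # Read the file back to confirm its structure.
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)

# Each record is a dict keyed by column name.
print(records[0])
```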