# Data Engineering - Webscraping


## Objectives

In this part, I use web scraping to collect bank information (bank name and market capitalization) from a Wikipedia page.


In [1]:
#!pip install pandas
!pip install bs4
#!pip install requests

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.10.0-py3-none-any.whl (97 kB)
Collecting soupsieve>1.2
  Downloading soupsieve-2.3.1-py3-none-any.whl (37 kB)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1271 sha256=81962c227954e567c42e16be4fde8f4c91cd517aac78648bb59cf76977779d30
  Stored in directory: /home/jupyterlab/.cache/pip/wheels/0a/9e/ba/20e5bbc1afef3a491f0b3bb74d508f99403aabe76eda2167ca
Successfully built bs4
Installing collected packages: soupsieve, beautifulsoup4, bs4
Successfully installed beautifulsoup4-4.10.0 bs4-0.0.1 soupsieve-2.3.1


## Imports

Import any additional libraries we may need here.


In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Extract Data Using Web Scraping


The Wikipedia page [https://en.wikipedia.org/wiki/List_of_largest_banks](https://en.wikipedia.org/wiki/List_of_largest_banks?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-2021-01-01) lists the largest banks in the world by several measures. Scrape the data from the 'By market capitalization' table and store it in a JSON file.


### Webpage Contents

Gather the contents of the webpage in text format using the `requests` library and assign them to the variable `html_data`.


In [16]:
url = "https://en.wikipedia.org/wiki/List_of_largest_banks"
response = requests.get(url)
html_data = response.text
#print(html_data)

In [17]:
html_data[101:124]

'List of largest banks -'
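The request above assumes the server answers successfully. A slightly more defensive variant is sketched below: it sets a timeout, sends an identifying `User-Agent` header, and raises on HTTP error codes. The helper name `fetch_html`, the `session` parameter, and the header value are all hypothetical and not part of the original notebook.

```python
import requests

# Hypothetical identifying header; replace the contact details with your own.
HEADERS = {"User-Agent": "bank-scraper-example/0.1 (contact: you@example.com)"}


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, raising an exception on HTTP errors."""
    # Fall back to the module-level requests.get when no session is supplied.
    client = session or requests
    response = client.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page.
    response.raise_for_status()
    return response.text


# Usage (performs a real network call, so it is left commented out):
# html_data = fetch_html("https://en.wikipedia.org/wiki/List_of_largest_banks")
```

Failing early on a bad status code avoids parsing an error page as if it were the article.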

### Scraping the Data

Using the page contents and `BeautifulSoup`, load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe will have the bank `Name` and `Market Cap (US$ Billion)` as column names.


Using BeautifulSoup, parse the contents of the webpage.


In [19]:
soup = BeautifulSoup(html_data, 'html.parser')

Load the data from the `By market capitalization` table into a pandas dataframe. Loop over the table rows, extract the bank name and market capitalization from each one, collect them in a list, and build the dataframe `data` from that list.

In [30]:
rows = []

# Collect one record per table row.
# Index 3 assumes the market-cap table is the fourth <tbody> on the page.
for row in soup.find_all('tbody')[3].find_all('tr'):
    # Data cells of this row; header rows contain only <th>, so this is empty for them
    col = row.find_all('td')

    if col:
        name = col[1].text.strip()
        mar_cap = col[2].text.strip()
        rows.append({'Name': name, 'Market Cap (US$ Billion)': mar_cap})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
data = pd.DataFrame(rows, columns=["Name", "Market Cap (US$ Billion)"])
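The loop above picks the table by position (`[3]`), which breaks if the page gains or loses a table. One possible alternative, sketched here, is to locate the section heading by its `id` and take the first table that follows it. The sample HTML, the bank names and figures in it, the assumed `id` value `By_market_capitalization`, and both helper functions are illustrative assumptions, not taken from the live page.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tiny made-up stand-in for the real page; ids and layout are assumptions.
SAMPLE_HTML = """
<h2 id="By_total_assets">By total assets</h2>
<table><tbody>
<tr><th>Rank</th><th>Bank name</th><th>Total assets</th></tr>
<tr><td>1</td><td>Example Bank A</td><td>1000.5</td></tr>
</tbody></table>
<h2 id="By_market_capitalization">By market capitalization</h2>
<table><tbody>
<tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
<tr><td>1</td><td>Example Bank B</td><td>200.25</td></tr>
<tr><td>2</td><td>Example Bank C</td><td>150.0</td></tr>
</tbody></table>
"""

COLUMNS = ["Name", "Market Cap (US$ Billion)"]


def table_after_heading(soup, heading_id):
    """Return the first <table> that follows the element with the given id."""
    heading = soup.find(id=heading_id)
    if heading is None:
        raise ValueError(f"no element with id {heading_id!r}")
    table = heading.find_next("table")
    if table is None:
        raise ValueError(f"no table after element {heading_id!r}")
    return table


def rows_to_frame(table):
    """Build a dataframe from the second and third cell of each data row."""
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) >= 3:  # skips header rows, which contain only <th>
            records.append({
                COLUMNS[0]: cells[1].text.strip(),
                COLUMNS[1]: cells[2].text.strip(),
            })
    return pd.DataFrame(records, columns=COLUMNS)


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_df = rows_to_frame(
    table_after_heading(sample_soup, "By_market_capitalization")
)
```

Whether the real page exposes that `id` should be checked in the page source before relying on it.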

In [34]:
data.head()

| | Name | Market Cap (US$ Billion) |
|---|---|---|
| 0 | JPMorgan Chase | 488.47 |
| 1 | Bank of America | 379.25 |
| 2 | Industrial and Commercial Bank of China | 246.5 |
| 3 | Wells Fargo | 308.013 |
| 4 | China Construction Bank | 257.399 |
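Because the values are read with `.text.strip()`, the market cap column holds strings rather than numbers. If numeric work (sorting, sums) is needed later, a conversion along these lines may help. This is a minimal sketch on a small sample copied from the output above; the comma stripping is an assumption about how larger figures might be formatted.

```python
import pandas as pd

col = "Market Cap (US$ Billion)"

# Sample rows copied from the output above, kept as strings as scraped.
sample = pd.DataFrame({
    "Name": ["JPMorgan Chase", "Bank of America"],
    col: ["488.47", "379.25"],
})

# Remove thousands separators (assumed format), then convert to floats.
# errors="coerce" turns anything unparseable into NaN instead of raising.
cleaned = sample[col].str.replace(",", "", regex=False)
sample[col] = pd.to_numeric(cleaned, errors="coerce")
```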


### Loading the Data

The `pandas` dataframe created above can be written to a JSON file with the `to_json()` function. The call is left commented out in this notebook.


In [None]:
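# Note: index=False raises a ValueError with the default orient,
# so orient='records' is used here to leave the index out of the output.
#data.to_json('bank_market_cap.json', orient='records')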
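As a rough illustration of the export step, the sketch below writes a two-row sample (values copied from the output above) to a temporary file with `orient="records"` and reads it back with the standard `json` module. The temporary directory and the variable names are assumptions made for the example.

```python
import json
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({
    "Name": ["JPMorgan Chase", "Bank of America"],
    "Market Cap (US$ Billion)": [488.47, 379.25],
})

# Write to a throwaway directory so the example leaves no file behind
# in the working directory.
out_path = os.path.join(tempfile.mkdtemp(), "bank_market_cap.json")

# orient="records" produces a list of row objects without the index.
sample.to_json(out_path, orient="records")

# Read the file back to confirm its structure.
with open(out_path) as fh:
    loaded = json.load(fh)
```

With `orient="records"` each row becomes one JSON object, which tends to be the easiest shape for downstream consumers to read.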