# Extract Data Using Web Scraping

## Scrape the data from a website
The Wikipedia page https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks lists the world's largest banks by several measures. Scrape the data from the 'By market capitalization' table and store it in a JSON file.


To scrape the 'By market capitalization' table from this Wikipedia page, we can use Python and the following libraries:

- requests: to make an HTTP request to the page and retrieve its HTML content.
- BeautifulSoup: to parse the HTML content and extract the relevant data from it.
- pandas: to hold the extracted rows in a DataFrame and, in the second method, write them to a JSON file.
- json: an alternative way to write a list of dictionaries to a JSON file.

The Python code below scrapes the table. It works as follows:

- We first specify the URL of the page and use the requests library to retrieve its HTML content. We then use BeautifulSoup to parse the HTML, find the heading whose id is 'By_market_capitalization', and take the next table with the class 'wikitable' along with its rows.

- We then loop through the rows after the header and extract each bank's name and market capitalization. We store the extracted data in a list of dictionaries called 'my_data'.

- Finally, we load 'my_data' into a pandas DataFrame and print its first and last rows. To store the result as JSON, the same list can be written to a file such as 'largest_banks.json' with the json library's 'dump' method, specifying an indentation of 4 spaces for readability. The cell itself stops at the DataFrame.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks'

html_data = requests.get(url).text
# Sanity check: print the index where the page title first appears (-1 would mean it is missing)
print(html_data.find("List of largest banks"))

required_class_id = "By_market_capitalization"

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_data, 'html.parser')
span = soup.find('span', {'id': required_class_id})
#print(span)

table = span.find_next(class_="wikitable")
#print(table)

rows = table.find_all('tr')
headers = [header.text.strip() for header in rows[0].find_all('th')]

my_data = []
# Loop through the data rows (skipping the header row) and collect each bank's name and market cap
for row in rows[1:]:
    cells = row.find_all('td')
    name = cells[1].text.strip()
    market_cap = cells[2].text.strip()
    my_data.append({'Name': name, 'Market Cap (US$ Billion)': market_cap})

df = pd.DataFrame(my_data)

print(df.head())
print("------")
print(df.tail())



760
                                      Name Market Cap (US$ Billion)
0                           JPMorgan Chase                  390.934
1  Industrial and Commercial Bank of China                  345.214
2                          Bank of America                  325.331
3                              Wells Fargo                  308.013
4                  China Construction Bank                  257.399
------
                    Name Market Cap (US$ Billion)
65          Ping An Bank                   37.993
66    Standard Chartered                   37.319
67  United Overseas Bank                   35.128
68             QNB Group                   33.560
69           Bank Rakyat                   33.081
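
The cell above stops at a DataFrame, while the task asks for a JSON file. A minimal sketch of the write step with the standard json module follows. It assumes a two-row sample (`sample_data`, copied from the output above) in place of the full `my_data` list, and reuses the file name 'largest_banks.json' from the description.

```python
import json

# Two-row sample standing in for the full my_data list built by the scraper
sample_data = [
    {'Name': 'JPMorgan Chase', 'Market Cap (US$ Billion)': '390.934'},
    {'Name': 'Industrial and Commercial Bank of China', 'Market Cap (US$ Billion)': '345.214'},
]

# Write the list of dictionaries as a JSON array, indented by 4 spaces for readability
with open('largest_banks.json', 'w', encoding='utf-8') as f:
    json.dump(sample_data, f, indent=4)

# Read the file back to confirm the round trip preserves the records
with open('largest_banks.json', encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded == sample_data)
```

In the notebook, `my_data` could be passed to `json.dump` in place of `sample_data`.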


### Troubleshooting
If you get a TypeError when trying to read the HTML text

The error `TypeError: 'Response' object is not subscriptable` occurs when you index or slice the object returned by requests.get(url) directly, for example with html_data[760:783]. A Response object does not support square bracket notation, so its contents cannot be accessed that way.

To get the body of the HTTP response as a string, use the '.text' property of the Response object, as in requests.get(url).text. The resulting string can then be sliced normally.
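
The difference can be illustrated without a network request. The sketch below uses a hypothetical stand-in class, `FakeResponse`, rather than a real `requests.Response`; the assumption is only that, like the real class, it keeps the body in `.text` and defines no `__getitem__`.

```python
class FakeResponse:
    """Hypothetical stand-in for requests.Response: body in .text, no __getitem__."""

    def __init__(self, text):
        self.text = text


response = FakeResponse('<html><title>List of largest banks</title></html>')

# Slicing the response object itself fails, because it is not a string
try:
    response[0:6]
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Slicing the .text string works as expected
html_data = response.text
print(html_data[0:6])
```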

## Second method recommended by IBM Data Engineer Lab


In [2]:
from bs4 import BeautifulSoup
import html5lib
import requests
import pandas as pd

In [3]:
# Specify the URL of the webpage to be scraped
url = 'https://web.archive.org/web/20200318083015/https://en.wikipedia.org/wiki/List_of_largest_banks'

# Make an HTTP request to the webpage and get its HTML content
html_data = requests.get(url).text
#html_data[760:783]

In [4]:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_data, 'html.parser')

In [None]:
rows = []

# This method indexes tbody[2] directly, so it depends on this page's exact layout and is not general.
for row in soup.find_all('tbody')[2].find_all('tr'):
    col = row.find_all('td')
    if col:  # the header row contains only <th> cells, so skip it
        name = col[1].text.strip()
        market_cap = col[2].text.strip()
        rows.append({'Name': name, 'Market Cap (US$ Billion)': market_cap})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
data = pd.DataFrame(rows, columns=["Name", "Market Cap (US$ Billion)"])

In [6]:
data.head(5)

                                      Name Market Cap (US$ Billion)
0                           JPMorgan Chase                  390.934
1  Industrial and Commercial Bank of China                  345.214
2                          Bank of America                  325.331
3                              Wells Fargo                  308.013
4                  China Construction Bank                  257.399


In [7]:
# Write the DataFrame to a JSON file. The commented lines show alternative layouts.
# data.to_json("bank_market_cap.json")
# data.to_json("bank_market_cap1.json", orient="records")
# data.to_json("bank_market_cap2.json", orient="split", index=False)
data.to_json("bank_market_cap3.json", orient="table", index=False)
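
The `orient` argument controls the layout of the JSON that `to_json` produces. The sketch below assumes pandas is installed and uses a two-row sample (`sample`) rather than the scraped table. It serializes to strings instead of files so that the three layouts named in the cell above can be compared side by side.

```python
import json
import pandas as pd

# Two-row sample standing in for the scraped DataFrame
sample = pd.DataFrame({
    'Name': ['JPMorgan Chase', 'Bank of America'],
    'Market Cap (US$ Billion)': ['390.934', '325.331'],
})

# With no path argument, to_json returns the JSON text as a string
records_json = sample.to_json(orient='records')
split_json = sample.to_json(orient='split', index=False)
table_json = sample.to_json(orient='table', index=False)

# records: a list with one object per row
print(json.loads(records_json))
# split: column names and row values kept in separate lists
print(sorted(json.loads(split_json)))
# table: the rows plus a schema describing each column
print(sorted(json.loads(table_json)))
```

The 'records' layout is often the easiest to consume from other tools, while 'table' carries column type information that `pd.read_json(..., orient="table")` can use when loading the file back.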