# Web Scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

Sometimes the data needed exists in structured form on the web, but can't be accessed via an API or a database connection. A good example of this would be a table on a website. 

Web pages are built using text-based markup languages (HTML and XHTML) and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end users, not for automated use. Because of this, toolkits that scrape web content were created. A web scraper is a program or library that extracts data from a website.

Newer forms of web scraping involve listening to data feeds from web servers. For example, JSON is commonly used as a data-interchange format between the client and the web server. Some websites use methods to prevent web scraping, such as detecting bots and disallowing them from crawling (viewing) their pages. In response, some web scraping systems use DOM parsing, computer vision, and natural language processing to simulate human browsing and gather web page content for offline parsing.
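
As a minimal sketch of the JSON-feed idea above: when a page loads its data from a JSON endpoint, the response body can be decoded directly instead of parsing HTML. The payload below is made up for illustration; the field names `banks`, `name`, and `market_cap` are assumptions, not a real feed.

```python
import json

# Hypothetical response body, as a site might return from a JSON data feed.
payload = (
    '{"banks": ['
    '{"name": "Bank A", "market_cap": 100.5}, '
    '{"name": "Bank B", "market_cap": 42.0}'
    ']}'
)

# Decode the JSON text into ordinary Python dictionaries and lists.
feed = json.loads(payload)

# Pull out just the fields of interest.
names = [bank["name"] for bank in feed["banks"]]
total_cap = sum(bank["market_cap"] for bank in feed["banks"])

print(names)
print(total_cap)
```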

<b>Web scraping:</b> the act of automatically downloading a web page’s data and extracting very specific information from it. The extracted information can be stored pretty much anywhere (database, file, etc.).

## Ethical considerations / Checking robots.txt
Most websites publish a robots.txt file at the root of their domain to tell crawlers which parts of the site they may or may not crawl. Although these rules are only suggestions, it is good practice to follow them for ethical reasons. Checking robots.txt before crawling also reduces the chance of being blocked and can reveal clues about the website's structure.
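
One way to check these rules from Python is the standard library's `urllib.robotparser`. The sketch below parses a hand-written set of rules so that it runs without network access; the rules, the user agent name `my-scraper`, and the `example.com` URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example runs offline.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(useragent, url) reports whether the rules permit crawling the URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/wiki/Some_page")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed_public, allowed_private)
```

For a real site, you would typically point the parser at the live file with `set_url()` and `read()` instead of calling `parse()` on inline text.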



<b>Beautiful Soup</b> is a library that parses a web page and provides an interface for navigating its content. To install this library, run `pip install beautifulsoup4`.
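
To give a feel for that interface, here is a small sketch that parses a hand-written page rather than a downloaded one. The HTML content and the bank names in it are placeholders.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used here instead of a downloaded one.
html = """
<html><body>
  <h1>Banks</h1>
  <table>
    <tr><th>Rank</th><th>Name</th></tr>
    <tr><td>1</td><td>Bank A</td></tr>
    <tr><td>2</td><td>Bank B</td></tr>
  </table>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag; find_all() returns every match.
title = demo_soup.find("h1").text
cells = [td.text for td in demo_soup.find_all("td")]

print(title)
print(cells)
```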

## Objectives

In this part you will:

- Use web scraping to get bank information


For this lab, we are going to be using Python and several Python libraries. Some of these libraries might be installed in your lab environment or in SN Labs. Others may need to be installed by you. The cells below will install these libraries when executed.


In [1]:
#!mamba install pandas==1.3.3 -y
#!mamba install requests==2.26.0 -y
!mamba install bs4==4.10.0 -y
!mamba install html5lib==1.1 -y

'mamba' is not recognized as an internal or external command,
operable program or batch file.
'mamba' is not recognized as an internal or external command,
operable program or batch file.


## Imports

Import any additional libraries you may need here.


In [2]:
!pip install bs4



In [1]:
from bs4 import BeautifulSoup
import html5lib
import requests
import pandas as pd

## Extract Data Using Web Scraping


The Wikipedia page https://en.wikipedia.org/wiki/List_of_largest_banks provides information about the largest banks in the world by various parameters. Scrape the data from the table 'By market capitalization' and store it in a JSON file.


### Webpage Contents

Gather the contents of the webpage in text format using the `requests` library and assign it to the variable <code>html_data</code>


In [2]:
#Write your code here
url = "https://en.wikipedia.org/wiki/List_of_largest_banks"
html_data = requests.get(url).text
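
The one-line download above assumes the request succeeds. A slightly more defensive sketch is shown below; the helper name `fetch_html`, the timeout value, the User-Agent string, and the injectable `get` parameter are assumptions for illustration, not part of the lab.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Identify the scraper; the User-Agent value here is a placeholder.
    headers = {"User-Agent": "bank-scraper-lab/0.1 (educational use)"}
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could be written as `html_data = fetch_html(url)`. The `get` parameter exists only so the function can be exercised with a stand-in during testing.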

<b>Question 1</b> Print out the output of the following line, and remember it as it will be a quiz question:


In [3]:
html_data[483:506]

're-client-prefs-pinned-'

### Scraping the Data

<b> Question 2</b> Using the contents and Beautiful Soup, load the data from the `By market capitalization` table into a `pandas` dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Display the first five rows using `head`.


Using BeautifulSoup, parse the contents of the webpage.


In [4]:
# Parse the downloaded HTML with Python's built-in parser
soup = BeautifulSoup(html_data,"html.parser")
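
Selecting a table by position breaks if the page layout changes. One alternative, sketched below on a hand-written snippet, is to locate the section heading and take the first table that follows it. The `id` value `By_market_capitalization` is an assumption about how the heading is marked up, and the other section and all cell values are placeholders; inspect the real page to confirm the markup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a page with two sections, each holding a table.
sample = """
<h2 id="Some_other_section">Some other section</h2>
<table><tbody><tr><td>1</td><td>Bank X</td><td>5000</td></tr></tbody></table>
<h2 id="By_market_capitalization">By market capitalization</h2>
<table><tbody><tr><td>1</td><td>Bank A</td><td>100.5</td></tr></tbody></table>
"""

sample_soup = BeautifulSoup(sample, "html.parser")

# Find the heading by its id, then the first <table> that follows it.
heading = sample_soup.find(id="By_market_capitalization")
table = heading.find_next("table")

first_row = [td.text for td in table.find_all("td")]
print(first_row)
```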

Load the data from the `By market capitalization` table into a pandas dataframe. The dataframe should have the bank `Name` and `Market Cap (US$ Billion)` as column names. Using the empty dataframe `data` and the given loop, extract the necessary data from each row and append it to the empty dataframe.


In [9]:
data = pd.DataFrame(columns=["Name", "Market Cap (US$ Billion)"])

for row in soup.find_all('tbody')[0].find_all('tr'):
    col = row.find_all('td')
    # Header rows contain only <th> cells, so skip rows without <td> cells
    if col:
        name = col[1].text
        market_cap = col[2].text
        # DataFrame.append was removed in pandas 2.0; add the row by label instead
        data.loc[len(data)] = [name, market_cap]

data



| | Name | Market Cap (US$ Billion) |
|---|---|---|
| 0 | JPMorgan Chase\n | 491.76\n |
| 1 | Bank of America\n | 266.45\n |
| 2 | Industrial and Commercial Bank of China\n | 219.45\n |
| 3 | Wells Fargo\n | 178.74\n |
| 4 | Agricultural Bank of China\n | 175.69\n |
| 5 | HDFC Bank\n | 169.84\n |
| 6 | HSBC Holdings PLC\n | 156.13\n |
| 7 | Morgan Stanley\n | 153.05\n |
| 8 | China Construction Bank\n | 151.97\n |
| 9 | Bank of China\n | 150.39\n |
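
The scraped values above still carry trailing newline characters and are stored as text. A possible clean-up step, shown here on a two-row copy of the output, strips the whitespace and converts the market cap to a number. The variable names `raw` and `clean` are illustrative.

```python
import pandas as pd

# Two rows copied from the scraped output, newlines included.
raw = pd.DataFrame({
    "Name": ["JPMorgan Chase\n", "Bank of America\n"],
    "Market Cap (US$ Billion)": ["491.76\n", "266.45\n"],
})

clean = raw.copy()
# Remove the surrounding whitespace, including the trailing newline.
clean["Name"] = clean["Name"].str.strip()
# Strip the text, then convert it to floating-point numbers.
clean["Market Cap (US$ Billion)"] = (
    clean["Market Cap (US$ Billion)"].str.strip().astype(float)
)

print(clean)
```

Converting to a numeric type makes later sorting or arithmetic on the market cap column behave as expected.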


**Question 3** Display the first five rows using the `head` function.


In [8]:
# Display the first five rows
data.head()

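| | Name | Market Cap (US$ Billion) |
|---|---|---|
| 0 | JPMorgan Chase\n | 491.76\n |
| 1 | Bank of America\n | 266.45\n |
| 2 | Industrial and Commercial Bank of China\n | 219.45\n |
| 3 | Wells Fargo\n | 178.74\n |
| 4 | Agricultural Bank of China\n | 175.69\n |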



### Loading the Data

Save the `pandas` dataframe created above to a JSON file named `bank_market_cap.json` using the `to_json()` function.


In [None]:
#Write your code here
data.to_json("bank_market_cap.json", index = True)
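
To check what `to_json()` wrote, the file can be read back with the standard `json` module. The sketch below uses a small stand-in dataframe and a temporary directory so it does not overwrite the lab's output. With the default settings, the file maps each column name to an object keyed by row label.

```python
import json
import os
import tempfile

import pandas as pd

# Stand-in data; the real lab would use the scraped dataframe instead.
sample = pd.DataFrame({
    "Name": ["JPMorgan Chase", "Bank of America"],
    "Market Cap (US$ Billion)": [491.76, 266.45],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bank_market_cap.json")
    sample.to_json(path)
    with open(path) as f:
        loaded = json.load(f)

# Row labels become string keys in the JSON output.
print(loaded["Name"]["0"])
```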

## Author


João Neto
