<h1> Web Scraping using BeautifulSoup and its Equivalent in rvest </h1>

*Submitted By - Malaika Gupta and Shreya Verma*

### Motivation

*Many tutorials cover BeautifulSoup or rvest on their own, but translating scraping code from one language to the other (for instance, from Python to R) can be challenging because the two libraries use different function names for the same tasks. Most of our classmates have more experience with Python than with R, so they have likely done web scraping in Python rather than in R. This tutorial aims to help them map each BeautifulSoup step directly onto its rvest counterpart.*

<h3>Introduction to Web Scraping</h3>

Web scraping is the process of collecting unstructured web data in an automated fashion. It’s also called web data extraction. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, market research etc.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to analyze data and make smarter decisions.

<h5>There are mainly two ways to extract data from a website:</h5>

- Using the API of the website (if it exists). For example, Facebook provides the Graph API for retrieving data posted on Facebook, and Google provides the Google Maps API for retrieving results from that application.
- Accessing the HTML of the webpage and extracting useful information from it. This technique is called web scraping, web harvesting, or web data extraction.

<h5>Steps involved in web scraping:</h5>

1. Finding the URL of the webpage that we want to scrape
2. Selecting the particular elements by inspecting the page's HTML
3. Writing the code to get the content of the selected elements
4. Storing the data in the required format

<h3>Introduction to BeautifulSoup</h3>

Beautiful Soup is a Python package that allows us to pull data out of HTML and XML documents. It builds a parse tree from the document so that data can be located and extracted easily.
It is popular because of the many functions it provides for extracting data from HTML.

<h5> A Few Popular Functions Used with Beautiful Soup </h5>
<h6> 1. get() </h6>

This function is essential because it retrieves the web page we want to scrape. Note that get() belongs to the Python requests library, not to Beautiful Soup itself.
The first step in scraping a web page is to download the page. We download pages using the Python requests library. requests.get() sends a GET request to the desired URL with the defined headers and returns a response containing the HTML contents of the web page.

After that, we create an object instance by calling BeautifulSoup(markup, parser), for example BeautifulSoup(response.content, 'lxml'), which creates a data structure representing the parsed HTML or XML document.

<h6> 2. get_text() </h6>

The get_text() method extracts the text of an element, including the text of everything nested inside it, from the HTML document.

Example :
~~~
soup.get_text()
~~~
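As a small offline illustration (the HTML string is made up for this example rather than taken from a real page), get_text() can be called on the whole document or on a single tag. This is only a sketch using Python's built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = "<html><body><h1>Sectors</h1><p>Health <b>Care</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Text of the whole document: tags are dropped, text nodes are concatenated
all_text = soup.get_text()

# Text of a single element, including the text of its children
paragraph_text = soup.find("p").get_text()

print(all_text)        # SectorsHealth Care
print(paragraph_text)  # Health Care
```

Note that no separator is inserted between elements by default, which is why the heading and the paragraph run together in the first result.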

<h6> 3. find_all() </h6>
The find_all() method is one of the most commonly used methods in BeautifulSoup; it lets us search for any element in the web page. It looks through a tag’s descendants and retrieves all descendants that match your filters.

The find_all() method takes an HTML tag name as a string argument and returns a list of the elements that match it. For example, soup.find_all("p") returns all p tags in the document. A second positional string argument, as in soup.find_all("p", "title"), restricts the match to elements with that CSS class.

Examples:
~~~
soup.find_all("title")
soup.find_all("p", "title")
soup.find_all("a")
~~~

'find_all()' also accepts a compiled regular expression (created with re.compile()) instead of a string.
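The following sketch shows the three kinds of filter side by side: a tag name with a class, an attribute matched by a regular expression, and a regular expression in place of the tag name. The HTML and the link paths are invented for the example; only the "heading" class name mirrors the Fidelity example later in this tutorial.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html = """
<div class="heading"><a href="/sector?id=50">Communication Services</a></div>
<div class="heading"><a href="/sector?id=25">Consumer Discretionary</a></div>
<div class="footer"><a href="/help">Help</a></div>
<h2>Markets</h2><h3>Sectors</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and CSS class (class_ avoids the Python keyword "class")
headings = soup.find_all("div", class_="heading")
sector_names = [div.get_text() for div in headings]

# Filter by attribute value using a compiled regular expression
sector_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/sector"))]

# A regular expression in place of the tag name matches h1 through h6
header_tags = [tag.name for tag in soup.find_all(re.compile(r"^h[1-6]$"))]

print(sector_names)  # ['Communication Services', 'Consumer Discretionary']
print(sector_links)  # ['/sector?id=50', '/sector?id=25']
print(header_tags)   # ['h2', 'h3']
```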

<h6> 4. strip() </h6>
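strip() is a built-in Python string method rather than a BeautifulSoup function. It returns a copy of the string with leading and trailing whitespace removed (or, if an argument is given, with the leading and trailing characters listed in that argument removed). It is very handy for cleaning up text extracted from an HTML document.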
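A minimal sketch of the typical clean-up after get_text(); the raw strings are illustrative values, not actual scraped output:

```python
# Strings as they might come out of get_text(); the values are illustrative
raw_name = "\n   Consumer Discretionary  \n"
raw_weight = "12.49%"

clean_name = raw_name.strip()          # removes leading/trailing whitespace
clean_weight = raw_weight.rstrip("%")  # removes trailing "%" characters only
as_number = float(clean_weight)        # now safe to convert to a number

print(repr(clean_name))  # 'Consumer Discretionary'
print(as_number)         # 12.49
```

The related methods lstrip() and rstrip() trim only one side, which is useful when a unit symbol sits at a known end of the string.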


<h6> 5. split() </h6>
The split() method, also a built-in Python string method, splits a string into a list of parts around a separator, and we can then pick out the parts we need. If a specific piece of text extracted from the HTML has to be divided into several parts, we use this function.
It behaves exactly as it does in general-purpose Python code.

Example:
~~~
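# 'tag' is assumed to hold the HTML markup of an element that contains an <img>
filename = BeautifulSoup(tag, "html.parser").find_all("img")[0]["src"]
filetype = filename.split(".")[-1]  # part after the last dot, e.g. "png"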
~~~
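Indexing the result of split(".") by position depends on how many dots the string contains, so it is worth checking which part is actually wanted. The sketch below uses a hypothetical file path to compare a positional index, rsplit(), and the standard library's os.path.splitext():

```python
import os

# Hypothetical image path with more than one dot in it
filename = "images/logo.v2.png"

parts = filename.split(".")              # ['images/logo', 'v2', 'png']
second_part = parts[1]                   # 'v2', which is not the file type
extension = filename.rsplit(".", 1)[-1]  # 'png': split once, from the right

# os.path.splitext splits on the last dot and keeps the dot in the extension
root, ext = os.path.splitext(filename)   # ('images/logo.v2', '.png')

print(parts, second_part, extension, root, ext)
```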

<h3>Introduction to rvest</h3>

rvest is an R package for extracting (harvesting) data from web pages. Inspired by BeautifulSoup, rvest assists with common web scraping tasks.

<h4>Scraping a webpage with R</h4>

The first step in scraping a web page is to read the page. rvest offers two functions for this: 'read_html()' and 'session()'.

'read_html()' parses an HTML file or a URL into an XML document.

'session()' creates a browsing session and accepts httr methods such as status_code().

Example :
~~~
url <- "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
results_page <- read_html(url)
~~~
OR
~~~
url <- "https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25"
response <- session(url)
# content(..., as = "text") returns a plain string, so parse it before selecting nodes
results_page <- read_html(httr::content(response$response, as = "text"))
~~~

<h5> Extracting with rvest Examples: rvest div class and beyond </h5>

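The html_nodes() function in rvest is analogous to the find_all() method in BeautifulSoup, while html_node() is the counterpart of find(). The function is essential because it gets us to the content we want. When scraping a document, look for nodes, attributes, and/or text. In rvest 1.0.0 and later, html_nodes() is also available under the newer name html_elements().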

Let's look at the following target element examples:

- Target by class appears as 
~~~
<div class='target'></div>
~~~
You target this as: 
~~~
html_nodes("div.target")
~~~
To match any element with this class, not only div nodes, one could write it as 
~~~
html_nodes(".target")
~~~
Note: There are multiple ways to target a class, but the above two are among the more commonly used.
Selectors can also be nested (descendant selectors), for example:
~~~
html_nodes("#titleCast .itemprop span")
~~~

- Target by element ID appears as 
~~~
<div id='target'></div>
~~~ 
You target this as: 
~~~
html_nodes("#target")
~~~

- Target by HTML tag type appears as 
~~~
<table></table>
~~~

You target this as: 
~~~
html_nodes("table")
~~~

- Target child of another tag appears as 
~~~
<ol class='sources'><li></li></ol>
~~~

You target this as 
~~~
html_nodes("ol.sources > li")
~~~
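For readers coming from Python, the closest counterpart to passing CSS selectors to html_nodes() is BeautifulSoup's select() method, which accepts CSS selector syntax as well (select_one() returns only the first match, much like html_node()). A small sketch on made-up HTML containing each kind of target:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet containing each kind of target
html = """
<div class="target">by class</div>
<div id="target">by id</div>
<table><tr><td>cell</td></tr></table>
<ol class="sources"><li>first</li><li>second</li></ol>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("div.target")       # div elements with class "target"
by_id = soup.select("#target")             # the element whose id is "target"
by_tag = soup.select("table")              # every <table> element
children = soup.select("ol.sources > li")  # direct <li> children of the list

print([el.get_text() for el in by_class])  # ['by class']
print([el.get_text() for el in by_id])     # ['by id']
print(len(by_tag))                         # 1
print([el.get_text() for el in children])  # ['first', 'second']
```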

To extract the text content of the selected nodes, one can use 'html_text()' or 'html_text2()'. 

The main difference is how they handle white space. In HTML, white space is largely ignored, and it is the structure of the elements that defines how text is laid out. 'html_text2()' does its best to follow the same rules, giving you something similar to what you’d see in the browser (and what you would like to see). 'html_text()' is faster than 'html_text2()', so it is useful for simple applications where performance is important. 
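BeautifulSoup has no exact equivalent of this pair, but get_text() offers a similar choice between raw and tidied text through its separator and strip arguments. The sketch below is only a rough analogy on invented HTML, not a reimplementation of html_text2():

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with line breaks and indentation inside the markup
html = "<div>\n  <span> +0.43% </span>\n  <span> $16.47T </span>\n</div>"
div = BeautifulSoup(html, "html.parser").find("div")

raw = div.get_text()                  # keeps the whitespace from the source
tidy = div.get_text(" ", strip=True)  # strips each piece, joins with a space
pieces = list(div.stripped_strings)   # the same stripped pieces as a list

print(repr(raw))
print(tidy)    # +0.43% $16.47T
print(pieces)  # ['+0.43%', '$16.47T']
```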

<h3>Example : Scraping Fidelity.com</h3>

In this example, we will scrape data from fidelity.com. The goal of the project is to get the latest sector performance data for the US markets, along with the total market capitalization of each sector. We write all of the code in Python and, alongside it, show what the nearly equivalent snippets in R would look like. 


In [1]:
~~~
import requests
import pandas as pd
from bs4 import BeautifulSoup
~~~

### **Extracting the HTML**

### Step 1

We can get the HTML content of this page using requests.

We first import the requests library and then download the page using the requests.get method. A response with status code 200 means the request succeeded.

In [2]:
~~~
url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response=requests.get(url)
response
~~~

~~~
<Response [200]>
~~~

#### R equivalent
~~~
library(rvest)  # provides read_html(), session(), html_nodes() and the %>% pipe
url <- "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
results_page <- read_html(url)
~~~
OR
~~~
url <- "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
response <- session(url) 
~~~

### Step 2

Extracting content from the HTML

To extract our data from the HTML received in the response, we need to identify which tags contain what we need.

If you skim through the HTML, you’ll find that every sector name sits inside a div with the class "heading".

In [3]:
~~~
results_page = BeautifulSoup(response.content,'lxml')
sectors = results_page.find_all('div',class_="heading")
print(sectors)
~~~

~~~
[<div class="heading">
<h2>U.S. Markets Performance <a href="https://www.fidelity.com//webcontent/ap010098-etf-content/21.01.0/help/research/learn_er_markets_sectors_news_analysis.shtml#whatusfuturesmarketsdataisavailable" onclick="javascript:openPopup('https://www.fidelity.com//webcontent/ap010098-etf-content/21.01.0/help/research/learn_er_markets_sectors_news_analysis.shtml#canigetcurrentandhistoricalmarketperformanceinformationforfutures',420,450);return false;"><img alt="Help for U.S. Futures Markets" border="0" src="https://www.fidelity.com/webcontent/ap101883-markets_sectors-content/21.01.0/images/question_mark.png" title="Help for U.S. Markets Performance"/></a></h2>
<br/>
<div class="usMarketAnchor" id="usMarketAnchor"><span class="byline" id="futureMarketDatetime"></span></div>
</div>, <div class="heading"><a class="heading1" href="/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&amp;sector=50" id="50"><strong>Communication Services</strong></a></div>, <div
~~~

#### R Equivalent

~~~
# content(..., as = "text") returns a plain string, so parse it with read_html() first
results_page <- read_html(httr::content(response$response, as = "text"))
sectors <- results_page %>% html_nodes("div.heading") %>% html_text2()
~~~

### Step 3

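We loop through the list of sectors and call the function get_sector_change_and_market_cap(sector_page_link) for each sector.

It returns the name, the change, the market capitalization, and the market weight of each sector. To see what the function has to work with, we first download and parse a single sector page.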

In [4]:
~~~
sector_page_link = "https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25"
response = requests.get(sector_page_link)
results_page = BeautifulSoup(response.content,'lxml')
results_page
~~~

~~~
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--[if lt IE 7 ]><html class="ie6" xml:lang="en" lang="en"> <![endif]--><!--[if IE 7 ]><html class="ie7" xml:lang="en" lang="en"> <![endif]--><!--[if IE 8 ]><html class="ie8" xml:lang="en" lang="en"> <![endif]--><!--[if (gte IE 9)|!(IE)]><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><![endif]--><html><head>
<script type="text/javascript">
    	if (top.location != location) {
        	top.location.href = document.location.href;
     	}
	</script>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="" name="keywords"/>
<meta content="Research the performance of U.S. Consumer Discretionary sector." name="description"/>
<meta content="639042" name="ereview"/>
<link href="https://www.fidelity.com/sector-investing/consumer-discretionary/overview" rel="canonical"/>
<link href="https://www.fidelity.com/webcontent/ap101883
~~~

#### R Equivalent
~~~
sector_page_link <- "https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25"
response <- session(sector_page_link) 
results_page <- read_html(httr::content(response$response, as = "text"))
~~~

### Step 4

We see that every sector page has a *table* with the class "snapshot-data-tbl", which contains all the data we need.
The *div* with the class "page-title" contains the sector name.
The rest of the information is contained in *span* elements inside the table.
We use *get_text()* to collect the text of each span in order.


In [5]:
~~~
def get_sector_change_and_market_cap(sector_page_link):
    response = requests.get(sector_page_link)
    x = list()
    if not response.status_code == 200:
        return None
    try:
        results_page = BeautifulSoup(response.content,'lxml')
        sector_information = results_page.find('table',class_="snapshot-data-tbl")
        st = sector_information.find_all('span')
        for s in st:
            x.append(s.get_text())

        # Positions 0, 2 and 4 hold the change, market cap and market weight
        sector_change = x[0]
        sector_market_cap = x[2]
        sector_market_weight = x[4].rstrip('%')
        sector_name = results_page.find('div',class_="page-title").get_text().strip()

    except (AttributeError, IndexError):
        # A missing table or div, or too few spans, means the layout is not as expected
        return None

    return sector_name,sector_change,sector_market_cap,sector_market_weight
~~~

#### R Equivalent

1. Checking for status code
~~~
if(httr::status_code(response) != 200){
        return(NULL)
  }
~~~
2. Narrowing down on content
~~~
st <- results_page %>% html_nodes("table.snapshot-data-tbl") %>% html_nodes("span") %>%  html_text2()
~~~
and
~~~
sector_name <- results_page %>% html_nodes("div.page-title") %>% html_text(trim = TRUE)
~~~
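The parsing logic can be tried without any network access by running it on a small HTML string. The sketch below is an assumption-heavy stand-in: the markup and the label text are invented and only mimic the class names "snapshot-data-tbl" and "page-title", so the real page's layout may differ. The helper name parse_sector is hypothetical, and explicit None checks are used here in place of a try/except block.

```python
from bs4 import BeautifulSoup

# Made-up markup that only mimics the class names used in this step;
# the real page's layout may differ
html = """
<div class="page-title"> Consumer Discretionary </div>
<table class="snapshot-data-tbl">
  <tr><td><span>-0.02%</span><span>Last change</span></td></tr>
  <tr><td><span>$9.59T</span><span>Market cap</span></td></tr>
  <tr><td><span>12.49%</span><span>Market weight</span></td></tr>
</table>
"""

def parse_sector(html_text):
    """Return (name, change, market_cap, weight), or None if markup is missing."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table", class_="snapshot-data-tbl")
    title = soup.find("div", class_="page-title")
    if table is None or title is None:
        return None
    spans = [s.get_text(strip=True) for s in table.find_all("span")]
    if len(spans) < 5:
        return None
    # Same positions as in the function from this step: 0, 2 and 4
    return title.get_text(strip=True), spans[0], spans[2], spans[4].rstrip("%")

print(parse_sector(html))
print(parse_sector("<p>no table here</p>"))  # None
```

Separating the download from the parsing in this way makes the parsing step easy to test against saved copies of a page.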

### Step 6
#### (Intermediate step)

Output for a particular sector

In [6]:
~~~
link = "https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25"
get_sector_change_and_market_cap(link)
~~~

~~~
('Consumer Discretionary', '-0.02%', '$9.59T', '12.49')
~~~

#### R Equivalent
~~~
link <- "https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25"
get_sector_change_and_market_cap(link)
~~~

### Step 7

In the function *get_us_sector_performance()*, we get the list of sectors and the links to the sector detail pages from the landing page, fetch the details of each sector, and add them to output_list.

Each tuple corresponds to a sector and contains the following data:
- the sector name
- the amount the sector has moved
- the market capitalization of the sector
- the market weight of the sector
- a link to the fidelity page for that sector

In [7]:
~~~
def get_us_sector_performance():
    output_list = list()
    url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
    response=requests.get(url)
    sector_link_list = list()
    if not response.status_code == 200:
         return None

    results_page = BeautifulSoup(response.content,'lxml')
    sectors = results_page.find_all('div',class_="heading")
    for sector in sectors:
        sector_link = "https://eresearch.fidelity.com" + sector.find('a').get('href')
        sector_link_list.append(sector_link)

    # The first "heading" div is the "U.S. Markets Performance" title, not a sector
    sector_link_list.pop(0)
    for link in sector_link_list:
        sector_data = get_sector_change_and_market_cap(link)
        if sector_data is None:
            # Skip sectors whose page could not be fetched or parsed
            continue
        output_list.append(sector_data + (link,))

    # Sort by market weight, largest first
    output_list.sort(key=lambda k: float(k[3]), reverse = True)

    return (output_list)
~~~

#### R Equivalent

1. Checking for status code
~~~
if(httr::status_code(response) != 200){
        return(NULL)
    }
~~~

2. Finding and appending sectors
~~~
    results_page <- read_html(httr::content(response$response, as = "text"))
    sectors <- results_page %>% html_nodes("div.heading")
    # html_element() and html_attr() are vectorised, so no explicit loop is needed
    x <- sectors %>% html_element("a") %>% html_attr("href")
    # paste0() concatenates strings in R; "+" does not work on character vectors
    sector_link_list <- paste0("https://eresearch.fidelity.com", x)
~~~
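Two details of this step are easy to get wrong: building absolute URLs from href values, and sorting by a numeric field that is stored as text. The sketch below illustrates both with the standard library only. The first href is a hypothetical absolute link; the rows reuse a few values from this tutorial's output.

```python
from urllib.parse import urljoin

BASE = "https://eresearch.fidelity.com"

# hrefs as they might appear in the headings: one absolute, one relative
hrefs = [
    "https://www.fidelity.com/help",  # hypothetical absolute link
    "/eresearch/markets_sectors/sectors/sectors_in_market.jhtml?tab=learn&sector=25",
]

concatenated = [BASE + h for h in hrefs]    # mangles the absolute link
joined = [urljoin(BASE, h) for h in hrefs]  # leaves absolute links untouched

# (name, change, market cap, market weight) with the weight stored as text
rows = [
    ("Health Care", "+0.95%", "$8.42T", "12.98"),
    ("Information Technology", "+0.43%", "$16.47T", "27.76"),
    ("Financials", "-0.43%", "$9.25T", "11.67"),
]
rows.sort(key=lambda r: float(r[3]), reverse=True)  # numeric, largest first

# Sorting the raw strings compares character by character instead
as_text = sorted(["9.5", "27.76"], reverse=True)                # ['9.5', '27.76']
as_numbers = sorted([float(v) for v in as_text], reverse=True)  # [27.76, 9.5]

print(joined)
print([r[0] for r in rows])
```

With plain concatenation, any absolute link ends up with the base URL glued in front of it, which is one reason to drop or special-case non-sector headings.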

### Step 8
Printing the list of necessary content

In [8]:
~~~
f = get_us_sector_performance()
df = pd.DataFrame(f, columns =['Industry_Sector', 'Sector_change', 'Sector_market_cap', 'Sector_market_weight', 'URL'])
print(df.head(5))
~~~

~~~
          Industry_Sector Sector_change Sector_market_cap  \
0  Information Technology        +0.43%           $16.47T   
1             Health Care        +0.95%            $8.42T   
2  Consumer Discretionary        -0.02%            $9.59T   
3              Financials        -0.43%            $9.25T   
4  Communication Services        +0.83%            $6.68T   

  Sector_market_weight                                                URL  
0                27.76  https://eresearch.fidelity.com/eresearch/marke...  
1                12.98  https://eresearch.fidelity.com/eresearch/marke...  
2                12.49  https://eresearch.fidelity.com/eresearch/marke...  
3                11.67  https://eresearch.fidelity.com/eresearch/marke...  
4                10.76  https://eresearch.fidelity.com/eresearch/marke...  
~~~


#### R equivalent
~~~
f <- get_us_sector_performance()
df <- as.data.frame(f, col.names = c('Industry_Sector', 'Sector_change', 'Sector_market_cap', 'Sector_market_weight', 'URL'))
~~~
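All the scraped values are still strings at this point, so a natural next step is converting them to numbers before any analysis. The helpers below are a hypothetical sketch (parse_percent and parse_market_cap are names invented here). The meaning of the suffixes K, M, and B is an assumption, since only "T" appears in the scraped values.

```python
def parse_percent(text):
    """Convert a string such as '+0.43%' to the float 0.43."""
    return float(text.strip().rstrip("%"))

# Assumed suffix meanings; only "T" appears in the scraped values
_SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

def parse_market_cap(text):
    """Convert a string such as '$9.59T' to a dollar amount as a float."""
    value = text.strip().lstrip("$")
    multiplier = 1.0
    if value and value[-1].upper() in _SUFFIXES:
        multiplier = _SUFFIXES[value[-1].upper()]
        value = value[:-1]
    # Remove thousands separators before converting
    return float(value.replace(",", "")) * multiplier

print(parse_percent("+0.43%"))     # 0.43
print(parse_percent("-0.02%"))     # -0.02
print(parse_market_cap("$9.59T"))
```

With pandas, functions like these could be applied to a column with Series.map(), for example on the 'Sector_change' column.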

### Step 9

Saving data to Disk

In [9]:
~~~
df.to_csv('my_data.csv',index=False)
~~~

#### R Equivalent

~~~
write.csv(df, file = "my_data.csv", row.names = FALSE)  # row.names = FALSE matches index=False
~~~
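When pandas is not available, the same kind of file can be produced with Python's built-in csv module. The sketch below writes two of the rows from this tutorial to an in-memory buffer and reads them back. It is one possible approach, not the one used in the steps of this tutorial.

```python
import csv
import io

columns = ["Industry_Sector", "Sector_change", "Sector_market_cap", "Sector_market_weight"]
rows = [
    ("Information Technology", "+0.43%", "$16.47T", "27.76"),
    ("Health Care", "+0.95%", "$8.42T", "12.98"),
]

# An in-memory buffer stands in for a file here;
# open("my_data.csv", "w", newline="") could be used in the same way
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)  # header row
writer.writerows(rows)    # one line per sector

# Read the data back to confirm that nothing was lost
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["Industry_Sector"])  # Information Technology
```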