# Web Scraping Workshop 2 - HTML Parsing with BeautifulSoup
Prepared by: Nickolas K. Freeman, Ph.D.

In this notebook, we will see how we can use Python to harvest data from HTML. In particular, we will be using the Python `requests` and `beautifulsoup` libraries to request and parse financial data from Yahoo Finance.

The following code block imports some libraries that we will use.

In [1]:
import bs4
import datetime
import pandas as pd
import requests
import time

## Yahoo Finance

We will be harvesting data from `https://finance.yahoo.com`. Before starting, let's first look at the `robots.txt` file published for the site to see if there is any activity that is specifically *disallowed* or if any additional guidelines for access are given. The following code block makes a **GET** request for the site's `robots.txt` file and prints the returned text.

In [2]:
r = requests.get('https://finance.yahoo.com/robots.txt')
print(r.text)

User-agent: *
Sitemap: https://finance.yahoo.com/sitemap_en-us_desktop_index.xml
Sitemap: https://finance.yahoo.com/sitemaps/finance-sitemap_index_US_en-US.xml.gz
Sitemap: https://finance.yahoo.com/sitemaps/finance-sitemap_googlenewsindex_US_en-US.xml.gz
Disallow: /m/
Disallow: /r/
Disallow: /__rapidworker-1.2.js
Disallow: /__blank
Disallow: /_td_api
Disallow: /_remote



As you can see, the `robots.txt` file specifies several paths that are disallowed for all user agents, which is indicated by the `User-agent: *` line. Note, however, that the file also lists several sitemaps that help programs determine the layout of the site. We will use one of these sitemaps to help us access stock information for companies operating in specific sectors. Before doing this, let's store a list of prohibited paths so that we can ensure we do not violate the `robots.txt` directives.

In [3]:
prohibited_paths = []
for statement in r.text.split('\n'):
    # Keep only Disallow directives. Note that str.strip('Disallow: ') would
    # remove a set of characters from both ends, not the prefix, so we split on ':'.
    if statement.startswith('Disallow:'):
        current_path = statement.split(':', 1)[1].strip()
        prohibited_paths.append('https://finance.yahoo.com' + current_path)
prohibited_paths

['https://finance.yahoo.com/m/',
 'https://finance.yahoo.com/r/',
 'https://finance.yahoo.com/__rapidworker-1.2.js',
 'https://finance.yahoo.com/__blank',
 'https://finance.yahoo.com/_td_api',
 'https://finance.yahoo.com/_remote']
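
As an alternative to parsing the directives by hand, Python's standard library includes `urllib.robotparser`, which can answer "may this user agent fetch this URL?" directly. The following is a minimal sketch, assuming a shortened copy of the directives printed above (the live file may differ); the two example URLs are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt directives shown above (assumption:
# the live file may contain different or additional rules).
robots_txt = """\
User-agent: *
Disallow: /m/
Disallow: /r/
Disallow: /_td_api
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse from text, no network request

# Illustrative URLs: one under a disallowed path, one under /sector/
m_allowed = parser.can_fetch('*', 'https://finance.yahoo.com/m/example')
sector_allowed = parser.can_fetch('*', 'https://finance.yahoo.com/sector/ms_energy')

print(f'/m/example allowed: {m_allowed}')
print(f'/sector/ms_energy allowed: {sector_allowed}')
```

To work against the live file instead, `RobotFileParser` also offers `set_url(...)` followed by `read()`, which downloads and parses the file in one step.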

The following code block makes a **GET** request for the first sitemap. The response from this request is stored in the variable `r`. We use the `content` attribute of this response to get a *bytes* representation of the response and use the *BeautifulSoup* library to convert this data into a `soup` object. We then print the first 1,000 characters of a *prettified* version of this object.

In [4]:
r = requests.get('https://finance.yahoo.com/sitemap_en-us_desktop_index.xml')

soup = bs4.BeautifulSoup(r.content)

print(soup.prettify()[:1000])

<?xml version="1.0"?>
<html>
 <body>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
    <loc>
     https://finance.yahoo.com/blogs/author/andy-serwer/
    </loc>
   </url>
   <url>
    <loc>
     https://finance.yahoo.com/blogs/author/brittany-jones-cooper/
    </loc>
   </url>
   <url>
    <loc>
     https://finance.yahoo.com/blogs/author/daniel-howley-20160419/
    </loc>
   </url>
   <url>
    <loc>
     https://finance.yahoo.com/blogs/author/daniel-roberts/
    </loc>
   </url>
   <url>
    <loc>
     https://finance.yahoo.com/blogs/author/julia-la-roche/
    </loc>
   </url>
   <url>
    <loc>
     https://finance.yahoo.com/blogs/author/mandi-woodruff/
    </loc>
   </url>
   <url>
    <loc>
     https://finance.yahoo.com/blogs/author/melody-hahm-20151026/
    </loc>
   </url>
   <url>
    <loc>
     https://finance.yahoo.com/blogs/author/nicole-sinclair/
    </loc>
   </url>
   <url>
    <loc>
     https://finance.yahoo.com/blogs/author/rick-newman/
    <

If we were to print more characters, you would see that there are URLs of the form `https://finance.yahoo.com/sector/...`. These pages report stock information for top companies operating in the associated sector. The following code block shows how we can use BeautifulSoup to extract all of the `loc` elements that include the word `sector` in the associated text and store them in a list called `sector_urls`. Note how we ensure that the returned URLs do not contain any of the strings stored in our list of `prohibited_paths`.

In [5]:
sector_urls = []
for loc_tag in soup.find_all('loc'):
    if 'sector' in loc_tag.text:
        prohibited = False
        for path in prohibited_paths:
            if path in loc_tag.text:
                print('Path prohibited!')
                prohibited = True
        if not prohibited:
            print(f'{loc_tag.text} not prohibited')
            sector_urls.append(loc_tag.text)

https://finance.yahoo.com/sector/ms_basic_materials not prohibited
https://finance.yahoo.com/sector/ms_communication_services not prohibited
https://finance.yahoo.com/sector/ms_consumer_cyclical not prohibited
https://finance.yahoo.com/sector/ms_consumer_defensive not prohibited
https://finance.yahoo.com/sector/ms_energy not prohibited
https://finance.yahoo.com/sector/ms_financial_services not prohibited
https://finance.yahoo.com/sector/ms_healthcare not prohibited
https://finance.yahoo.com/sector/ms_industrials not prohibited
https://finance.yahoo.com/sector/ms_real_estate not prohibited
https://finance.yahoo.com/sector/ms_technology not prohibited
https://finance.yahoo.com/sector/ms_utilities not prohibited
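
The check above uses a substring test (`path in loc_tag.text`). Because `robots.txt` rules are matched from the start of the path, a prefix test is a slightly more precise way to express the same idea. Below is a small sketch, assuming a hypothetical helper named `is_prohibited` and a two-item sample of the prohibited prefixes.

```python
def is_prohibited(url, prohibited_prefixes):
    """Return True if the URL starts with any of the prohibited prefixes."""
    return any(url.startswith(prefix) for prefix in prohibited_prefixes)


# Two of the prohibited prefixes built earlier, repeated here so the
# example is self-contained.
sample_prefixes = ['https://finance.yahoo.com/m/',
                   'https://finance.yahoo.com/_td_api']

sector_blocked = is_prohibited('https://finance.yahoo.com/sector/ms_energy', sample_prefixes)
m_blocked = is_prohibited('https://finance.yahoo.com/m/anything', sample_prefixes)

print(sector_blocked, m_blocked)
```

Using `any(...)` also stops at the first matching prefix, so the helper does not need a separate flag variable.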


The following code block prints the first URL in our `sector_urls` list. If you visit the site with your web browser, you will see that the page displays a table of related stocks and associated information.

In [6]:
current_sector_loc = sector_urls[0]
print(sector_urls[0])

https://finance.yahoo.com/sector/ms_basic_materials


Inspecting the HTML of the previous URL, you will see that the table is enclosed in a set of `<table>` tags. The following code block 
1. requests the HTML associated with the URL, 
2. converts the response to a `soup` object, 
3. identifies all elements enclosed in `<table>` tags and stores the associated HTML in a list named `tables`, 
4. prints the number of items in the `tables` object, and
5. prints the first 5,000 characters of the first item in the `tables` list.

Note how the HTML for the table uses the `<tr>` (row), `<th>` (header cell), and `<td>` (data cell) tags to achieve the desired structure.

In [7]:
current_sector_loc = sector_urls[0]

r = requests.get(current_sector_loc)
soup = bs4.BeautifulSoup(r.content)

tables = soup.find_all('table')

print(f'The tables object has {len(tables)} item(s).\n')

print(f'The first item in the tables object is:\n{"-"*70}\n{tables[0].prettify()[:5000]}')

The tables object has 1 item(s).

The first item in the tables object is:
----------------------------------------------------------------------
<table class="W(100%)" data-reactid="32">
 <thead data-reactid="33">
  <tr class="C($c-fuji-grey-j) BdB Bdbc($finLightGrayAlt)" data-reactid="34">
   <th class="Ta(start) Pstart(6px) Pend(10px) Miw(90px) Bgc(white) Fz(xs) Va(m) Py(5px)! Fw(400)! Ta(start) Start(0) Pend(10px) Ta(start)!" data-reactid="35">
    <span data-reactid="36">
     Symbol
    </span>
   </th>
   <th class="Ta(start) Px(10px) Bgc(white) Fz(xs) Va(m) Py(5px)! Fw(400)!" data-reactid="37">
    <span data-reactid="38">
     Name
    </span>
   </th>
   <th class="Ta(end) Pstart(20px) Bgc(white) Fz(xs) Va(m) Py(5px)! Fw(400)!" data-reactid="39">
    <span data-reactid="40">
     Price (Intraday)
    </span>
   </th>
   <th class="Ta(end) Pstart(20px) Bgc(white) Fz(xs) Va(m) Py(5px)! Fw(400)!" data-reactid="41">
    <span data-reactid="42">
     Change
    </span>
   </th>
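
To make the role of the `<tr>`, `<th>`, and `<td>` tags concrete, the following sketch parses a tiny, made-up table rather than the live page. The symbols and company names are invented for illustration, and the `'html.parser'` argument simply selects Python's built-in parser so that no extra dependency is needed.

```python
import bs4

# A small, invented table that mimics the structure shown above.
html = """
<table>
  <thead>
    <tr><th>Symbol</th><th>Name</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>AAA</td><td>Example Corp</td><td>10.50</td></tr>
    <tr><td>BBB</td><td>Sample Inc</td><td>20.25</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')
table = soup.find('table')

# Header cells live in <th> tags
header = [th.get_text(strip=True) for th in table.find_all('th')]

# Each <tr> is a row; rows without <td> cells (the header row) are skipped
rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)

print(header)
print(rows)
```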
   

The following code block shows how we can use a loop to iterate over all of the sectors, collecting the data stored in the table and using it to construct a pandas `DataFrame`.

In [8]:
data_list = []
for current_sector_loc in sector_urls:
    r = requests.get(current_sector_loc)
    soup = bs4.BeautifulSoup(r.content)

    for current_row in soup.find_all('tr'):
        row_list = []
        for current_data in current_row.find_all('td'):
            row_list.append(current_data.text)
        if row_list:
            row_list.append(current_sector_loc.split('/')[-1])
            row_list.append(datetime.datetime.now().strftime('%m-%d-%Y %H:%M:%S'))
            data_list.append(row_list)
    
columns = ['Symbol', 'Name', 'Price', 'Change', '%Change', 
           'Volume', 'Avg Volume (3 month)', 'Market Cap', 
           'PE Ratio', '52 Week Range', 'Sector', 'Scrape Time']

all_data = pd.DataFrame(data_list, columns = columns)
all_data.head()

| | Symbol | Name | Price | Change | %Change | Volume | Avg Volume (3 month) | Market Cap | PE Ratio | 52 Week Range | Sector | Scrape Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BHP | BHP Group | 45.78 | 1.84 | +4.19% | 2.461M | 2.068M | 117.261B | 12.33 | | ms_basic_materials | 03-04-2020 16:38:53 |
| 1 | LIN | Linde plc | 205.53 | 10.14 | +5.19% | 2.606M | 1.752M | 109.539B | 49.09 | | ms_basic_materials | 03-04-2020 16:38:53 |
| 2 | BBL | BHP Group | 38.97 | 1.65 | +4.42% | 2.089M | 1.332M | 98.605B | 10.5 | | ms_basic_materials | 03-04-2020 16:38:53 |
| 3 | CTA-PB | E. I. du Pont de Nemours and Company | 108.0 | -1.37 | -1.25% | 366 | 965 | 93.723B | | | ms_basic_materials | 03-04-2020 16:38:53 |
| 4 | RIO | Rio Tinto Group | 51.48 | 2.72 | +5.58% | 3.212M | 1.835M | 86.055B | 10.55 | | ms_basic_materials | 03-04-2020 16:38:53 |
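
Every value collected this way is a string, including abbreviated figures such as `2.461M` and `117.261B`. If numeric analysis is the goal, a conversion step is needed. The following is one possible sketch, assuming a hypothetical helper named `parse_abbreviated_number`; the `K` and `T` suffixes are assumptions, since only `M` and `B` appear in the data shown above.

```python
# Multipliers for abbreviated figures. 'K' and 'T' are assumptions; only
# 'M' and 'B' appear in the sample output above.
SUFFIX_MULTIPLIERS = {'K': 1e3, 'M': 1e6, 'B': 1e9, 'T': 1e12}


def parse_abbreviated_number(text):
    """Convert strings such as '2.461M', '366', or '+4.19%' to a float.

    Returns None for empty or unparseable input. Percentages are returned
    as plain numbers ('+4.19%' becomes 4.19), not as fractions.
    """
    cleaned = text.strip().replace(',', '').rstrip('%')
    if not cleaned:
        return None
    multiplier = 1.0
    suffix = cleaned[-1].upper()
    if suffix in SUFFIX_MULTIPLIERS:
        multiplier = SUFFIX_MULTIPLIERS[suffix]
        cleaned = cleaned[:-1]
    try:
        return float(cleaned) * multiplier
    except ValueError:
        return None


for raw in ['2.461M', '117.261B', '366', '+4.19%', '']:
    print(repr(raw), '->', parse_abbreviated_number(raw))
```

A function like this could be applied to a column with `all_data['Volume'].map(parse_abbreviated_number)`, leaving the original strings untouched in case the conversion needs to be revisited.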


## Wikipedia

Just to demonstrate a (slightly) more complex harvesting task, let's write some code to collect headlines from Wikipedia pages. As before, let's first check the `robots.txt` file, which is much more extensive.

In [9]:
r = requests.get('https://en.wikipedia.org/robots.txt')
print(r.text)

﻿# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: 

The following code block creates a list of prohibited paths for a general web scraping application and displays the first 30 items in the list.

In [10]:
prohibited_paths = []
current_user_agent = None
for line in r.text.split('\n'):
    line = line.strip()
    # str.strip('Disallow: ') removes a set of characters from both ends,
    # not a prefix, so we split on the first ':' instead.
    if line.startswith('User-agent:'):
        current_user_agent = line.split(':', 1)[1].strip()
    elif line.startswith('Disallow:') and current_user_agent == '*':
        path = line.split(':', 1)[1].strip()
        if path:
            prohibited_paths.append(path)

prohibited_paths[:30]

['/w/',
 '/api/',
 '/trap/',
 '/wiki/Special:',
 '/wiki/Spezial:',
 '/wiki/Spesial:',
 '/wiki/Special%3A',
 '/wiki/Spezial%3A',
 '/wiki/Spesial%3A',
 '/wiki/%D8%AE%D8%A7%D8%B5:Search',
 '/wiki/%D8%AE%D8%A7%D8%B5%3ASearch',
 '/wiki/Wikipedia:L%C3%B6schkandidaten/',
 '/wiki/Wikipedia:Löschkandidaten/',
 '/wiki/Wikipedia:Vandalensperrung/',
 '/wiki/Wikipedia:Benutzersperrung/',
 '/wiki/Wikipedia:Vermittlungsausschuss/',
 '/wiki/Wikipedia:Administratoren/Probleme/',
 '/wiki/Wikipedia:Adminkandidaturen/',
 '/wiki/Wikipedia:Qualitätssicherung/',
 '/wiki/Wikipedia:Qualit%C3%A4tssicherung/',
 '/wiki/Wikipedia:Vandalismusmeldung/',
 '/wiki/Wikipedia:Gesperrte_Lemmata/',
 '/wiki/Wikipedia:Löschprüfung/',
 '/wiki/Wikipedia:L%C3%B6schprüfung/',
 '/wiki/Wikipedia:Administratoren/Notizen/',
 '/wiki/Wikipedia:Schiedsgericht/Anfragen/',
 '/wiki/Wikipedia:L%C3%B6schpr%C3%BCfung/',
 '/wiki/Wikipedia:Checkuser/',
 '/wiki/Wikipedia_Diskussion:Checkuser/',
 '/wiki/Wikipedia_Diskussion:Adminkandidaturen/']
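
The same logic can be wrapped in a reusable function that works for any user agent. The sketch below is a simplified reading of the `robots.txt` format, not a complete implementation: the function name `disallowed_paths_for` is hypothetical, and the sample text is a shortened, partly invented stand-in for the real file.

```python
def disallowed_paths_for(robots_text, user_agent='*'):
    """Return the Disallow paths from groups that name `user_agent`."""
    paths = []
    group_agents = []   # user agents named by the current group
    in_rules = False    # True once the current group has at least one rule
    for raw_line in robots_text.splitlines():
        line = raw_line.split('#', 1)[0].strip()  # drop comments and whitespace
        if ':' not in line:
            continue
        field, value = (part.strip() for part in line.split(':', 1))
        field = field.lower()
        if field == 'user-agent':
            if in_rules:  # a rule was already seen, so a new group starts here
                group_agents = []
                in_rules = False
            group_agents.append(value)
        elif field in ('allow', 'disallow'):
            in_rules = True
            if field == 'disallow' and value and user_agent in group_agents:
                paths.append(value)
    return paths


# Shortened sample for illustration (assumption: not the full live file)
sample = """\
# robots.txt sample
User-agent: MJ12bot
Disallow: /

User-agent: IsraBot
Disallow:

User-agent: *
Disallow: /w/
Disallow: /wiki/Special:
"""

print(disallowed_paths_for(sample))
print(disallowed_paths_for(sample, 'MJ12bot'))
```

Note that an empty `Disallow:` value (as for `IsraBot`) means nothing is disallowed, which is why empty values are skipped.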

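If you browse through a few Wikipedia pages, you will notice that the links use a very consistent naming pattern: the base URL followed by the article title, with underscores in place of spaces. Suppose we want to look up the pages for the phrases stored in the `words_to_try` object.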

In [11]:
words_to_try = ['Stochastic_optimization', 
                'Robust_optimization',
                'Nonparametric_regression',
                'Data_warehouse',
                'tHis_wIll_nOt_wErk',
                'Integer_programming',
                'Cluster_analysis']

base_url = 'https://en.wikipedia.org/wiki/'
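
If the phrases come from user input rather than a hand-written list, they may contain spaces or non-ASCII characters. The following sketch shows one way to build the URL in that case, assuming a hypothetical helper named `wikipedia_url`; it relies on `urllib.parse.quote` from the standard library to percent-encode characters that are not URL-safe.

```python
from urllib.parse import quote

BASE_URL = 'https://en.wikipedia.org/wiki/'


def wikipedia_url(title):
    """Build an article URL from a human-readable title (hypothetical helper)."""
    # Wikipedia uses underscores in place of spaces; quote() percent-encodes
    # any remaining characters that are not URL-safe.
    return BASE_URL + quote(title.strip().replace(' ', '_'))


for title in ['Integer programming', 'Itô calculus']:
    print(wikipedia_url(title))
```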

The following code attempts to access the pages defined by our `words_to_try` object and returns a list of the headlines along with the associated word.

In [12]:
headlines = []

for current_word in words_to_try:
    current_url = base_url + current_word
    
    try:
        r = requests.get(current_url)
        soup = bs4.BeautifulSoup(r.content)

        for headline in soup.find_all('span', attrs = {'class':'mw-headline'}):
            if 'See also' in headline.text:
                break
            else:
                headlines.append([headline.text, current_word])
    except Exception as e:
        print(f'{type(e).__name__} on {current_url}')
    
    time.sleep(1)
    
for row in headlines:
    print(row)

['Methods for stochastic functions', 'Stochastic_optimization']
['Randomized search methods', 'Stochastic_optimization']
['History', 'Robust_optimization']
['Example 1', 'Robust_optimization']
['Classification', 'Robust_optimization']
['Local robustness', 'Robust_optimization']
['Global robustness', 'Robust_optimization']
['Example 2', 'Robust_optimization']
['Example 3', 'Robust_optimization']
['Non-probabilistic robust optimization models', 'Robust_optimization']
['Probabilistically robust optimization models', 'Robust_optimization']
['Robust counterpart', 'Robust_optimization']
['Gaussian process regression or Kriging', 'Nonparametric_regression']
['Kernel regression', 'Nonparametric_regression']
['Regression trees', 'Nonparametric_regression']
['ETL based Data warehousing', 'Data_warehouse']
['ELT based Data warehousing', 'Data_warehouse']
['Benefits', 'Data_warehouse']
['Generic', 'Data_warehouse']
['Related systems (data mart, OLAP, OLTP, predictive analytics)', 'Data_warehouse']