# Chapter 7 - Web Scraping

Disclaimer: This notebook is provided for educational and academic purposes only, and does not substitute professional legal advice. 


Last edited: 05/16/2021

In this chapter, we use Federal Open Market Committee (FOMC) statements as an example to learn web scraping more generally, so that you can apply it to any task that involves gathering information from the internet. When scraping, be ethical and gentle: respect each site's rules and avoid sending requests in rapid succession.

Before scraping, check the `www.websitename.com/robots.txt` page to see which paths the website allows crawlers to access.<br> To learn more about robots.txt, read the resource [here](https://www.robotstxt.org/robotstxt.html).
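The robots.txt check can also be done in code. The sketch below uses the standard library's `urllib.robotparser`; the robots.txt content and the two URLs are made up for illustration, so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download
sample_robots = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots.splitlines())

# Ask whether a generic crawler ("*") may fetch each (made-up) URL
allowed = parser.can_fetch("*", "https://www.websitename.com/newsevents/page.htm")
blocked = parser.can_fetch("*", "https://www.websitename.com/private/page.htm")
print(allowed, blocked)
```

For a live site, `parser.set_url("https://www.websitename.com/robots.txt")` followed by `parser.read()` downloads and parses the real file before you call `can_fetch`.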

**Contents of this Notebook:**

- [Section 1. Single page web-scraping – Beautiful Soup](#Section-1.-Single-page-web-scraping-–-Beautiful-Soup)
- [Section 2. Find all FOMC statement associated links ](#Section-2.-Find-all-FOMC-statement-associated-links)
- [Section 3. Loop over the links, pull the text and save the DataFrame into csv file](#Section-3.-Loop-over-the-links,-pull-the-text-and-save-the-DataFrame-into-csv-file)

We need the `beautifulsoup4` and `requests` packages. Please install them from your terminal:

    pip install beautifulsoup4
    pip install requests

In [1]:
# %load_ext nb_black
# loading the libraries
import pandas as pd
import numpy as np
from datetime import date, datetime
import re

import requests  # performs the URL request and fetches the website's HTML
from bs4 import BeautifulSoup  # the scraping module to parse the HTML
from random import randint
import time

## Section 1. Single page web-scraping – Beautiful Soup
### 1.1 Find one FOMC statement in HTML format

If you download the PDF instead, you will need to convert it to text, which is computationally heavier than using the HTML source. For textual analysis, PDF sources are unfriendly.

For example, we arbitrarily pick one FOMC statement, the [March 3, 2020 statement](https://www.federalreserve.gov/newsevents/pressreleases/monetary20200303a.htm).

In [2]:
web = "https://www.federalreserve.gov/newsevents/pressreleases/monetary20200303a.htm"
response = requests.get(web)

In [3]:
type(response)

requests.models.Response

For more on `requests.get`, see [here](https://docs.python-requests.org/en/master/user/quickstart/#make-a-request).

### 1.2 In Chrome, right click – choose "View Page Source". 
This is what your computer will read. In Python, `requests.get` retrieves the same page source.

In [4]:
webtext = response.text
webtext[:1000]

'ï»¿<!doctype html>\n<html lang="en" class="no-js">\n<head>\n \n\n\n \n \n <meta charset="utf-8">\n <meta http-equiv="X-UA-Compatible" content="IE=edge">\n <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0 maximum-scale=1.6, user-scalable=1">\n <meta name="description" content=" The fundamentals of the U.S. economy remain strong. However, the coronavirus poses evolving risks to economic activity. In light of these risks and in support " />\n <meta property="og:title" content="Federal Reserve issues FOMC statement"/>\n <meta property="og:site_name" content="Board of Governors of the Federal Reserve System"/>\n <meta property="og:type" content="article" /> \n <meta property="og:description" content=" The fundamentals of the U.S. economy remain strong. However, the coronavirus poses evolving risks to economic activity. In light of these risks and in support "/>\n <meta property="og:image"  content="" /> \n <meta name="twitter:card" content="summary" /

### 1.3 Use Beautiful Soup to parse HTML.
You don't need to know HTML well, but you do need to browse the source and find where the data is (simply use CTRL+F). <br>

A web page is built from blocks (elements). Each block has its own tags that tell the browser how to display it. The soup object parses those blocks for us.

[A Simple HTML Document](https://www.w3schools.com/html/html_intro.asp).

In [5]:
# make the soup (parse the HTML)
soup = BeautifulSoup(webtext, "html.parser")

In [6]:
print(soup)

ï»¿<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0 maximum-scale=1.6, user-scalable=1" name="viewport"/>
<meta content=" The fundamentals of the U.S. economy remain strong. However, the coronavirus poses evolving risks to economic activity. In light of these risks and in support " name="description">
<meta content="Federal Reserve issues FOMC statement" property="og:title">
<meta content="Board of Governors of the Federal Reserve System" property="og:site_name">
<meta content="article" property="og:type"/>
<meta content=" The fundamentals of the U.S. economy remain strong. However, the coronavirus poses evolving risks to economic activity. In light of these risks and in support " property="og:description"/>
<meta content="" property="og:image"/>
<meta content="summary" name="twitter:card"/>
<meta content="Federal Reserve issues FO

**Navigating through the source code of the web pages we’re scraping from.**

- In a web browser, right-click and choose "View Page Source".

- Press "Control + F" to find the statement text.

- Pass the enclosing block's tag and class to `soup.find_all`.
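One detail worth knowing about `find_all`: when `class_` is given a string with several class names, Beautiful Soup compares it with the exact value of the `class` attribute, so the order matters. The sketch below shows this on a made-up HTML snippet (not the real FOMC page) and contrasts it with a CSS selector, which ignores the order.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two blocks with the same classes in a different order
html = """
<div class="col-xs-12 col-sm-8 col-md-8"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="col-md-8 col-sm-8 col-xs-12"><p>Same classes, different order.</p></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the first block qualifies
exact = demo_soup.find_all("div", class_="col-xs-12 col-sm-8 col-md-8")

# CSS selector: any div carrying all three classes, in any order
by_selector = demo_soup.select("div.col-xs-12.col-sm-8.col-md-8")

exact_texts = [p.text for p in exact[0].find_all("p")]
print(len(exact), len(by_selector), exact_texts)
```

If a page ever reorders its class names, the exact-string search returns an empty list, which is a useful first thing to check when a scraper stops finding its block.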

In [7]:
detail_tab = soup.find_all("div", class_="col-xs-12 col-sm-8 col-md-8")

detail_tab

[<div class="col-xs-12 col-sm-8 col-md-8">
 <p>The fundamentals of the U.S. economy remain strong. However, the coronavirus poses evolving risks to economic activity. In light of these risks and in support of achieving its maximum employment and price stability goals, the Federal Open Market Committee decided today to lower the target range for the federal funds rate by 1/2 percentage point, to 1 to 1â1/4 percent. The Committee is closely monitoring developments and their implications for the economic outlook and will use its tools and act as appropriate to support the economy.</p>
 <p>Voting for the monetary policy action were Jerome H. Powell, Chair; John C. Williams, Vice Chair; Michelle W. Bowman; Lael Brainard; Richard H. Clarida; Patrick Harker; Robert S. Kaplan; Neel Kashkari; Loretta J. Mester; and Randal K. Quarles.</p>
 <p>For media inquiries, call 202-452-2955.</p>
 <p><a href="/newsevents/pressreleases/monetary20200303a1.htm">Implementation Note issued March 3, 2020</a></

In [8]:
paras = detail_tab[0].find_all("p")

paras[0]

<p>The fundamentals of the U.S. economy remain strong. However, the coronavirus poses evolving risks to economic activity. In light of these risks and in support of achieving its maximum employment and price stability goals, the Federal Open Market Committee decided today to lower the target range for the federal funds rate by 1/2 percentage point, to 1 to 1â1/4 percent. The Committee is closely monitoring developments and their implications for the economic outlook and will use its tools and act as appropriate to support the economy.</p>

In [9]:
type(paras[0])

bs4.element.Tag

In [10]:
paras[0].text

'The fundamentals of the U.S. economy remain strong. However, the coronavirus poses evolving risks to economic activity. In light of these risks and in support of achieving its maximum employment and price stability goals, the Federal Open Market Committee decided today to lower the target range for the federal funds rate by 1/2 percentage point, to 1 to 1â\x80\x911/4 percent. The Committee is closely monitoring developments and their implications for the economic outlook and will use its tools and act as appropriate to support the economy.'
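The `â\x80\x91` sequence in this output (and the `ï»¿` at the very start of the page source) looks like UTF-8 bytes that were decoded as Latin-1. A plausible cause, stated here as an assumption because the server's headers were not inspected, is that `requests` falls back to ISO-8859-1 when a text response declares no charset. The sketch below reproduces the effect on a short made-up string and reverses it.

```python
# "\ufeff" is the byte-order mark; "\u2011" is the non-breaking hyphen
original = "\ufeff1\u20111/4 percent"

raw_bytes = original.encode("utf-8")      # what the server would send
garbled = raw_bytes.decode("latin-1")     # what a wrong encoding guess produces
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired == original)
```

With a live response, setting `response.encoding = "utf-8"` before reading `response.text` is the usual way to avoid the problem in the first place.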

Use the `.join()` method to concatenate all paragraphs (in `paras`), separated by newlines (`\n`).

In [11]:
statement = "\n".join(para.text for para in paras)

In [12]:
print(statement)

The fundamentals of the U.S. economy remain strong. However, the coronavirus poses evolving risks to economic activity. In light of these risks and in support of achieving its maximum employment and price stability goals, the Federal Open Market Committee decided today to lower the target range for the federal funds rate by 1/2 percentage point, to 1 to 1â1/4 percent. The Committee is closely monitoring developments and their implications for the economic outlook and will use its tools and act as appropriate to support the economy.
Voting for the monetary policy action were Jerome H. Powell, Chair; John C. Williams, Vice Chair; Michelle W. Bowman; Lael Brainard; Richard H. Clarida; Patrick Harker; Robert S. Kaplan; Neel Kashkari; Loretta J. Mester; and Randal K. Quarles.
For media inquiries, call 202-452-2955.
Implementation Note issued March 3, 2020


### 1.4 Wrap up the code into a function


In [13]:
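def get_soup(web):
    # Named get_soup so it does not overwrite the `soup` object created in In [5]
    html = requests.get(web).text
    return BeautifulSoup(html, "html.parser")


def webscraping_fomc(web):
    # Fetch the page, locate the statement block, and join its paragraphs
    html_soup = get_soup(web)
    detail_tab = html_soup.find_all("div", class_="col-xs-12 col-sm-8 col-md-8")
    paras = detail_tab[0].find_all("p")
    return "\n".join(para.text for para in paras)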
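Note that `webscraping_fomc` indexes `detail_tab[0]`, so a page without the expected `<div>` raises a bare `IndexError`. One possible variation, sketched below, separates parsing from downloading and raises a clearer error. The name `extract_statement` and the sample HTML are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup

STATEMENT_CLASS = "col-xs-12 col-sm-8 col-md-8"


def extract_statement(html, css_class=STATEMENT_CLASS):
    """Return the paragraphs of the first matching <div>, joined by newlines."""
    page = BeautifulSoup(html, "html.parser")
    blocks = page.find_all("div", class_=css_class)
    if not blocks:
        # Fail with a descriptive message instead of an IndexError
        raise ValueError(f"no <div> with class {css_class!r} found")
    return "\n".join(p.text for p in blocks[0].find_all("p"))


# Made-up HTML standing in for a downloaded page
sample_html = '<div class="col-xs-12 col-sm-8 col-md-8"><p>One.</p><p>Two.</p></div>'
statement = extract_statement(sample_html)
print(statement)
```

Because the function takes an HTML string, it can be tried on saved or hand-written pages without sending any requests.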

## Section 2. Find all FOMC statement associated links 

### 2.1 2016-2021
#### 2.1.1 Browse the [FOMC website](https://www.federalreserve.gov/monetarypolicy/fomc.htm) and find the page where we can find the FOMC statements (2016-2021).


<details><summary>Click here for the links</summary>
Key pages where we can find the links:
    
1. For records from the most recent five years (2016-2021), see [meeting calendars](https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm).<br>

2. For historical records (prior to 2016), see [Historical Materials by Year](https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm).
</details>

#### 2.1.2 Request the page, and use a regular expression to capture all links

In [14]:
web_calendar = "https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm"

web_calendar_text = requests.get(web_calendar).text
captured_1621 = re.findall(
    r'href="(\/newsevents\/pressreleases\/monetary(20\d{6})a.htm)">HTML',
    web_calendar_text,
)

hrefs_1621, dates_1621 = zip(*captured_1621)
hrefs_1621 = ["https://www.federalreserve.gov" + href for href in hrefs_1621]

In [15]:
captured_1621

[('/newsevents/pressreleases/monetary20210127a.htm', '20210127'),
 ('/newsevents/pressreleases/monetary20210317a.htm', '20210317'),
 ('/newsevents/pressreleases/monetary20210428a.htm', '20210428'),
 ('/newsevents/pressreleases/monetary20200129a.htm', '20200129'),
 ('/newsevents/pressreleases/monetary20200303a.htm', '20200303'),
 ('/newsevents/pressreleases/monetary20200315a.htm', '20200315'),
 ('/newsevents/pressreleases/monetary20200323a.htm', '20200323'),
 ('/newsevents/pressreleases/monetary20200429a.htm', '20200429'),
 ('/newsevents/pressreleases/monetary20200610a.htm', '20200610'),
 ('/newsevents/pressreleases/monetary20200729a.htm', '20200729'),
 ('/newsevents/pressreleases/monetary20200916a.htm', '20200916'),
 ('/newsevents/pressreleases/monetary20201105a.htm', '20201105'),
 ('/newsevents/pressreleases/monetary20201216a.htm', '20201216'),
 ('/newsevents/pressreleases/monetary20190130a.htm', '20190130'),
 ('/newsevents/pressreleases/monetary20190320a.htm', '20190320'),
 ('/newsev

The `zip(*captured_1621)` call above unzips the captured pairs into separate tuples of URLs and dates.
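As a small illustration of the unzip idiom, the sketch below applies `zip(*...)` to two of the captured pairs and then zips the result back together.

```python
# Two (href, date) pairs in the same shape that re.findall returns
captured = [
    ("/newsevents/pressreleases/monetary20210127a.htm", "20210127"),
    ("/newsevents/pressreleases/monetary20210317a.htm", "20210317"),
]

# The * operator passes each pair as a separate argument to zip,
# which then groups the first items together and the second items together
hrefs, dates = zip(*captured)
print(dates)

# Zipping the two tuples again restores the original pairs
rezipped = list(zip(hrefs, dates))
```

One caveat: if the regular expression matches nothing, `captured` is empty and the two-name unpacking raises a `ValueError`, so it is worth checking the length of the result first.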

In [16]:
hrefs_1621

['https://www.federalreserve.gov/newsevents/pressreleases/monetary20210127a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20210317a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20210428a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20200129a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20200303a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20200315a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20200323a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20200429a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20200610a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20200729a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20200916a.htm',
 'https://www.federalreserve.gov/newsevents/pressreleases/monetary20201105a.htm',
 'https://www.fe

In [17]:
list(dates_1621)

['20210127',
 '20210317',
 '20210428',
 '20200129',
 '20200303',
 '20200315',
 '20200323',
 '20200429',
 '20200610',
 '20200729',
 '20200916',
 '20201105',
 '20201216',
 '20190130',
 '20190320',
 '20190501',
 '20190619',
 '20190731',
 '20190918',
 '20191011',
 '20191030',
 '20191211',
 '20180131',
 '20180321',
 '20180502',
 '20180613',
 '20180801',
 '20180926',
 '20181108',
 '20181219',
 '20170201',
 '20170315',
 '20170503',
 '20170614',
 '20170726',
 '20170920',
 '20171101',
 '20171213',
 '20160127',
 '20160316',
 '20160427',
 '20160615',
 '20160727',
 '20160921',
 '20161102',
 '20161214']

### 2.2 Historical records: prior to 2016
#### 2.2.1 Browse the website and find which page contains the statement links (prior to 2016)

<details><summary>Click here for the links</summary>
Key pages where we can find the links:
    
1. For records from the most recent five years (2016-2021), see [meeting calendars](https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm).<br>

2. For historical records (prior to 2016), see [Historical Materials by Year](https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm).
</details>


For historical years, each year's index page uses the same URL format:
`https://www.federalreserve.gov/monetarypolicy/fomchistoricalXXXX.htm`,

where XXXX is the year. After we write code that handles one year, we can use a `for` loop to cover all years.
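The URL pattern can be turned into a small helper. This is only a sketch: the names `HISTORICAL_URL` and `historical_url` are illustrative and do not appear elsewhere in the notebook, and no request is sent.

```python
# Template for the per-year index pages described above
HISTORICAL_URL = "https://www.federalreserve.gov/monetarypolicy/fomchistorical{year}.htm"


def historical_url(year):
    """Build the index-page URL for one year."""
    return HISTORICAL_URL.format(year=year)


# range(2011, 2016) covers 2011 through 2015 inclusive
urls = [historical_url(year) for year in range(2011, 2016)]
for url in urls:
    print(url)
```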

#### 2.2.2 Web scraping the links in [2015](https://www.federalreserve.gov/monetarypolicy/fomchistorical2015.htm)

In [18]:
web2015 = "https://www.federalreserve.gov/monetarypolicy/fomchistorical2015.htm"
text2015 = requests.get(web2015).text

re.findall(
    r'a href="(\/newsevents\/pressreleases\/monetary(2015\d{4})a.htm)">Statement</a>',
    text2015,
)

[('/newsevents/pressreleases/monetary20150128a.htm', '20150128'),
 ('/newsevents/pressreleases/monetary20150318a.htm', '20150318'),
 ('/newsevents/pressreleases/monetary20150429a.htm', '20150429'),
 ('/newsevents/pressreleases/monetary20150617a.htm', '20150617'),
 ('/newsevents/pressreleases/monetary20150729a.htm', '20150729'),
 ('/newsevents/pressreleases/monetary20150917a.htm', '20150917'),
 ('/newsevents/pressreleases/monetary20151028a.htm', '20151028'),
 ('/newsevents/pressreleases/monetary20151216a.htm', '20151216')]

#### 2.2.3 Request the historical record pages, loop over the years, and use a regular expression to capture all links (2011-2015)

In [19]:
# create an empty list for saving captured data
captured_1115 = []

for year in range(2011, 2015 + 1):
    webyear = (
        "https://www.federalreserve.gov/monetarypolicy/fomchistorical"
        + str(year)
        + ".htm"
    )
    textyear = requests.get(webyear).text

    # Both pattern pieces are raw strings so that \d is not treated as a string escape
    captured = re.findall(
        r'a href="(\/newsevents\/pressreleases\/monetary('
        + str(year)
        + r'\d{4})a.htm)">Statement</a>',
        textyear,
    )
    captured_1115.append(captured)

    # Wait one or two seconds after each loop
    time.sleep(randint(1, 2))

In [20]:
captured_1115

[[('/newsevents/pressreleases/monetary20110126a.htm', '20110126'),
  ('/newsevents/pressreleases/monetary20110315a.htm', '20110315'),
  ('/newsevents/pressreleases/monetary20110427a.htm', '20110427'),
  ('/newsevents/pressreleases/monetary20110622a.htm', '20110622'),
  ('/newsevents/pressreleases/monetary20110809a.htm', '20110809'),
  ('/newsevents/pressreleases/monetary20110921a.htm', '20110921'),
  ('/newsevents/pressreleases/monetary20111102a.htm', '20111102'),
  ('/newsevents/pressreleases/monetary20111213a.htm', '20111213')],
 [('/newsevents/pressreleases/monetary20120125a.htm', '20120125'),
  ('/newsevents/pressreleases/monetary20120313a.htm', '20120313'),
  ('/newsevents/pressreleases/monetary20120425a.htm', '20120425'),
  ('/newsevents/pressreleases/monetary20120620a.htm', '20120620'),
  ('/newsevents/pressreleases/monetary20120801a.htm', '20120801'),
  ('/newsevents/pressreleases/monetary20120913a.htm', '20120913'),
  ('/newsevents/pressreleases/monetary20121024a.htm', '201210

#### 2.2.4 2010?
The regular expression we used fails to capture the statements in [2010](https://www.federalreserve.gov/monetarypolicy/fomchistorical2010.htm).<br>
The statement URLs do NOT follow a single pattern across years, so we have to modify our regular expression to cover every case.

Please run the code, see what `captured_0615` looks like, and interpret the corresponding regular expression.

In [21]:
def get_hrefs(year):
    web = (
        "https://www.federalreserve.gov/monetarypolicy/fomchistorical"
        + str(year)
        + ".htm"
    )
    html = requests.get(web).text

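    # (?:releases)? and (?:\/)? are optional, so both URL layouts match;
    # both pattern pieces are raw strings so that \d is not a string escape
    return re.findall(
        r'a href="(\/newsevents\/press(?:releases)?\/monetary(?:\/)?('
        + str(year)
        + r'\d{4})a.htm)">Statement</a>',
        html,
    )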


captured_0615 = []

for year in range(2006, 2015 + 1):
    captured_year = get_hrefs(year)
    captured_0615.append(captured_year)
    print(f"{year}, {len(captured_year)}")
    time.sleep(randint(1, 2))

2006, 8
2007, 9
2008, 9
2009, 8
2010, 9
2011, 8
2012, 8
2013, 8
2014, 8
2015, 8
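To see how the optional groups in the `get_hrefs` pattern work, the sketch below applies the same pattern to two hand-written anchor tags, so no request is needed. The paths are taken from the captured results in this section, but the surrounding `<a ...>Statement</a>` markup is an assumption based on what the pattern expects, and `build_pattern` is an illustrative helper name.

```python
import re


def build_pattern(year):
    # (?:releases)? makes "releases" optional; (?:\/)? makes the slash after
    # "monetary" optional, so both URL layouts are accepted
    return (
        r'a href="(\/newsevents\/press(?:releases)?\/monetary(?:\/)?('
        + str(year)
        + r'\d{4})a.htm)">Statement</a>'
    )


# Hand-written anchor tags in the two layouts
old_style = '<a href="/newsevents/press/monetary/20060131a.htm">Statement</a>'
new_style = '<a href="/newsevents/pressreleases/monetary20150128a.htm">Statement</a>'

old_match = re.findall(build_pattern(2006), old_style)
new_match = re.findall(build_pattern(2015), new_style)
print(old_match)
print(new_match)
```

Each match is a tuple because the pattern has two capturing groups: the outer one for the whole path and the inner one for the date.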


In [22]:
captured_0615

[[('/newsevents/press/monetary/20060131a.htm', '20060131'),
  ('/newsevents/press/monetary/20060328a.htm', '20060328'),
  ('/newsevents/press/monetary/20060510a.htm', '20060510'),
  ('/newsevents/press/monetary/20060629a.htm', '20060629'),
  ('/newsevents/press/monetary/20060808a.htm', '20060808'),
  ('/newsevents/press/monetary/20060920a.htm', '20060920'),
  ('/newsevents/press/monetary/20061025a.htm', '20061025'),
  ('/newsevents/press/monetary/20061212a.htm', '20061212')],
 [('/newsevents/press/monetary/20070131a.htm', '20070131'),
  ('/newsevents/press/monetary/20070321a.htm', '20070321'),
  ('/newsevents/press/monetary/20070509a.htm', '20070509'),
  ('/newsevents/press/monetary/20070618a.htm', '20070618'),
  ('/newsevents/press/monetary/20070807a.htm', '20070807'),
  ('/newsevents/press/monetary/20070810a.htm', '20070810'),
  ('/newsevents/press/monetary/20070918a.htm', '20070918'),
  ('/newsevents/press/monetary/20071031a.htm', '20071031'),
  ('/newsevents/press/monetary/20071211

## Section 3. Loop over the links, pull the text and save the DataFrame into csv file
Instead of saving each statement to a separate txt file, we can collect them in a DataFrame for future use.

In [23]:
# Flatten List of Lists Using a List Comprehension
captured_0615f = [
    onemeeting for captured_year in captured_0615 for onemeeting in captured_year
]
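hrefs, dates = zip(*captured_0615f)

hrefs = ["https://www.federalreserve.gov" + href for href in hrefs]
# Use `d` as the loop variable to avoid confusion with the imported `date` class
dates = [datetime.strptime(d, "%Y%m%d") for d in dates]

hrefs.extend(hrefs_1621)
# Parse the 2016-2021 date strings too, so `dates` holds only datetime objects
dates.extend(datetime.strptime(d, "%Y%m%d") for d in dates_1621)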

In [24]:
hrefs

['https://www.federalreserve.gov/newsevents/press/monetary/20060131a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20060328a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20060510a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20060629a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20060808a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20060920a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20061025a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20061212a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20070131a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20070321a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20070509a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20070618a.htm',
 'https://www.federalreserve.gov/newsevents/press/monetary/20070807a.htm',
 'https://www.federalrese

```
# Use webscraping_fomc function to get all statements, and save in a list

fomcs = []
for href in hrefs:
    fomcs.append(webscraping_fomc(href))
    print(href)
    time.sleep(randint(1, 2))
```

**Warning!!** <br>

**The Cell Type has been changed to Markdown so that you don't accidentally run the code.**<br>

**It is not recommended to run the code until you fully understand the code and its consequences.**

In [25]:
fomcs = []

In [26]:
# Save the dataframe
df = pd.DataFrame(
    data=list(zip(dates, hrefs, fomcs)), columns=["dates", "hrefs", "fomcs"]
)

# import os
# os.getcwd()

# f_fomc0621 = "data/fomc0621.csv"
# df.to_csv(f_fomc0621, index=False)
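Statements contain newlines, so it is reasonable to wonder whether they survive a round trip through CSV. The sketch below checks this on a toy DataFrame with placeholder URLs and text, writing to an in-memory buffer rather than to `data/fomc0621.csv`.

```python
import io

import pandas as pd

# Toy data in the same three-column layout; URLs and text are placeholders
toy = pd.DataFrame(
    {
        "dates": pd.to_datetime(["20200303", "20200315"], format="%Y%m%d"),
        "hrefs": ["https://example.com/a.htm", "https://example.com/b.htm"],
        "fomcs": ["First paragraph.\nSecond paragraph.", "Single paragraph."],
    }
)

buffer = io.StringIO()  # in-memory stand-in for a file on disk
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates turns the date column back into datetime values
restored = pd.read_csv(buffer, parse_dates=["dates"])
print(restored.dtypes)
```

Fields with embedded newlines are quoted on the way out and read back intact, so one row per statement works without extra escaping.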

#### Exercise

Get FOMC statement URLs for years 2000-2005.

In [27]:
# insert your code here


<details><summary>Click here for the solution</summary>

```python
# Both snippets assume `year` and `html` are defined as in get_hrefs.
# Every pattern piece is a raw string so that \/ and \d are not string escapes.

# regular expression for 2003-2005
re.findall(
    r'a href="(\/boarddocs\/press\/monetary\/'
    + str(year)
    + r"\/("
    + str(year)
    + r'\d{4})\/(?:default.htm)?)">Statement</a>',
    html,
)

# regular expression for 2000-2002
re.findall(
    r'a href="(\/boarddocs\/press\/(?:general|monetary)\/'
    + str(year)
    + r"\/("
    + str(year)
    + r'\d{4})\/)">Statement</a>',
    html,
)
```

</details>