# Scraping a wider range of dates

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [7]:
from pyquery import PyQuery as pq
from bs4 import BeautifulSoup
from selenium import webdriver
import re
import requests

import time

In [3]:
fedurl_base = "http://search.newyorkfed.org"

## Using Selenium WebDriver

We first used Selenium to obtain all the links of the FOMC minutes on the FOMC website.

Selenium offers an advantage over plain requests: it avoids the PHP/JavaScript tag-selector issues because it drives a real browser, just as a person browsing the site would. (The search engine from which the links are obtained uses a PHP/JavaScript backend.)

In [48]:
# Initialize browser
# Use Ctrl+Enter (don't use Shift+Enter)
browser = webdriver.Firefox()

Selenium allows us to go directly to the search page with text already in the input query.

In [49]:
browser.get("http://search.newyorkfed.org/fomc-docs/search?advanced_search=true&fomc_document_type=minutes&text=.htm&search_precision=All+Words&from_month=3&from_year=1936&to_month=12&to_year=2015&sort=Most+Recent+First&Search=Search")

We iterate through each page of the search results and collect all the links by storing them in a list.

In [50]:
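from selenium.common.exceptions import NoSuchElementException

links = []

while True:
    # Parse the page as currently rendered in the browser
    src = browser.page_source
    soup = BeautifulSoup(src, "html.parser")

    # Each search result's link sits inside a <strong> tag
    for tag in soup.find_all('strong'):
        linkbox = tag.find('a')
        if linkbox:
            links.append(linkbox['href'])
    try:
        nextresults = browser.find_element_by_link_text('Next Page')
        nextresults.click()
        time.sleep(1)  # give the next page time to load
    except NoSuchElementException as e:
        # No "Next Page" link means we have reached the last page
        print("End of results")
        print("=====================================")
        print(e)
        break

# remove duplicates
links = list(set(links))
print(len(links))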

End of results
Message: Unable to locate element: {"method":"link text","selector":"Next Page"}
Stacktrace:
    at FirefoxDriver.prototype.findElementInternal_ (file:///c:/users/george/appdata/local/temp/tmpz5tu8q/extensions/fxdriver@googlecode.com/components/driver-component.js:10659)
    at FirefoxDriver.prototype.findElement (file:///c:/users/george/appdata/local/temp/tmpz5tu8q/extensions/fxdriver@googlecode.com/components/driver-component.js:10668)
    at DelayedCommand.prototype.executeInternal_/h (file:///c:/users/george/appdata/local/temp/tmpz5tu8q/extensions/fxdriver@googlecode.com/components/command-processor.js:12534)
    at DelayedCommand.prototype.executeInternal_ (file:///c:/users/george/appdata/local/temp/tmpz5tu8q/extensions/fxdriver@googlecode.com/components/command-processor.js:12539)
    at DelayedCommand.prototype.execute/< (file:///c:/users/george/appdata/local/temp/tmpz5tu8q/extensions/fxdriver@googlecode.com/components/command-processor.js:12481)
187
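
Note that `list(set(links))` removes duplicates but does not keep the order in which the links were found (the search was sorted most recent first). If that order matters later, an order-preserving variant can be used instead. The following is a minimal sketch, not part of the original notebook; the helper name `dedupe_keep_order` and the sample list are made up for illustration.

```python
def dedupe_keep_order(urls):
    """Return urls without duplicates, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


# Hypothetical sample: the first URL appears twice.
sample = [
    "http://www.federalreserve.gov/fomc/minutes/20030129.htm",
    "http://www.federalreserve.gov/fomc/minutes/19960702.htm",
    "http://www.federalreserve.gov/fomc/minutes/20030129.htm",
]
unique_sample = dedupe_keep_order(sample)
```

Replacing `links = list(set(links))` with `links = dedupe_keep_order(links)` would give the same number of links, in search-result order.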


We save the links that we found.

In [66]:
import pickle
with open("mins_links.p", "wb") as outfile:
    pickle.dump(links, outfile)

In [67]:
import pickle
with open("mins_links.p", "rb") as infile:
    links = pickle.load(infile)

In [68]:
len(links)

187

## Using requests

This approach cannot be used as easily to obtain the links, since the JavaScript used on the FOMC site makes tag selection difficult.
Searching the response for strong tags finds no URLs.


However, given the links that we found above, we can still use requests to obtain the page contents.

In [74]:
from requests.exceptions import ConnectionError

fomc_mins = {}

In [75]:
# this code block can be run multiple times; URLs already fetched are skipped
for url in links:
    # only fetch HTML pages that we have not stored yet
    if url not in fomc_mins and url.endswith("htm"):
        try:
            page = requests.get(url)
            fomc_mins[url] = page.text
        except ConnectionError as e:
            print("Error ==> %s for %s" % (e, url))

        time.sleep(1)

print("Finished getting page sources")

http://www.federalreserve.gov/fomc/MINUTES/1993/19930707min.htm
http://www.federalreserve.gov/monetarypolicy/fomcminutes20110427.htm
http://www.federalreserve.gov/fomc/minutes/20030129.htm
http://www.federalreserve.gov/fomc/minutes/19960702.htm
http://www.federalreserve.gov/fomc/minutes/20001219.htm
http://www.federalreserve.gov/monetarypolicy/files/fomcminutes20150318.pdf
http://www.federalreserve.gov/monetarypolicy/fomcminutes20090128.htm
http://www.federalreserve.gov/monetarypolicy/fomcminutes20110921.htm
http://www.federalreserve.gov/fomc/minutes/20051213.htm
http://www.federalreserve.gov/fomc/MINUTES/1993/19930817min.htm
http://www.federalreserve.gov/fomc/minutes/20030916.htm
http://www.federalreserve.gov/fomc/minutes/20070131.htm
http://www.federalreserve.gov/fomc/minutes/20060629.htm
http://www.federalreserve.gov/monetarypolicy/fomcminutes20111213.htm
http://www.federalreserve.gov/fomc/minutes/19980929.htm
http://www.federalreserve.gov/monetarypolicy/fomcminutes20100921.htm
http
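
The loop above only fetches links ending in `htm`, so PDF links such as the `fomcminutes20150318.pdf` entry in the listing are skipped. To see which links were left out, the list can be split by extension. This is a minimal sketch; the helper name `split_by_suffix` is hypothetical, and the sample URLs are copied from the listing above.

```python
def split_by_suffix(urls, suffix=".htm"):
    """Split urls into (kept, skipped) according to whether they end in suffix."""
    kept = [u for u in urls if u.endswith(suffix)]
    skipped = [u for u in urls if not u.endswith(suffix)]
    return kept, skipped


# Three URLs taken from the listing above: two HTML pages and one PDF.
sample_links = [
    "http://www.federalreserve.gov/fomc/MINUTES/1993/19930707min.htm",
    "http://www.federalreserve.gov/monetarypolicy/files/fomcminutes20150318.pdf",
    "http://www.federalreserve.gov/fomc/minutes/20030129.htm",
]
kept, skipped = split_by_suffix(sample_links)
```

Calling `split_by_suffix(links)` on the full list would show how many of the collected links are not HTML pages and therefore never reach `fomc_mins`.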

We store our page sources in a dictionary indexed by the FOMC minutes URL.

In [78]:
import json

# json.dump writes text, so open the file in text mode
with open("fomc_mins_all.json", "w") as mins_html:
    json.dump(fomc_mins, mins_html)

In [79]:
import json

with open("fomc_mins_all.json", "rb") as infile:
    fomc_mins_all = json.load(infile)

In [80]:
len(fomc_mins_all)

183
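
Since the dictionary is keyed by URL, a meeting date has to be recovered from the URL itself if one is needed. The URLs shown earlier all contain an eight-digit `YYYYMMDD` string, so a regular expression can pull it out. This is a sketch under that assumption, which has not been checked against every stored link; the name `meeting_date` is hypothetical.

```python
import re
from datetime import datetime

# Assumption: each URL contains the meeting date as eight consecutive digits.
DATE_PATTERN = re.compile(r"(\d{8})")


def meeting_date(url):
    """Return the date embedded in url, or None if no valid date is found."""
    match = DATE_PATTERN.search(url)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date
        return None


example = meeting_date(
    "http://www.federalreserve.gov/fomc/MINUTES/1993/19930707min.htm"
)
```

A dictionary keyed by date could then be built with something like `{meeting_date(u): html for u, html in fomc_mins_all.items()}`, after checking that no URL maps to `None`.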