##### Notebook 01: Project Introduction and Data Gathering

# US FOMC Communication Interest Rate Forecaster

### Executive Summary:
Using public communications from the US Federal Open Market Committee (FOMC), we can forecast interest rate sentiment over the following six months with approximately 65% accuracy.  This result is based on FOMC communications since 1960, with the model trained on a random 75% sample of the data.  Removing portions of the data by date significantly reduces the model's accuracy on future periods, because both the features of the communications and the economic and political environment change over time.  When used with caution, an NLP model using FOMC communications as features can be a useful supplemental tool for interest rate forecasting.

### Problem Statement:<br>
Can NLP be used to forecast US rate changes with a useful level of accuracy?

### Requirements:<br>
Python 3.6+<br>
selenium<br>
fredapi<br>
pdfminer<br>
scikit-learn<br>
gensim<br>
nltk<br>
xgboost<br>
pyLDAvis<br>

##### Code Credit:
pdfminer function:
https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/<br>

### Contents:
- [01: Data Gathering](#01:-Data-Gathering)
- [02: Convert PDF Files](#02:-Convert-PDF-Files)

In [1]:
## Import libraries

from time import sleep
import os
import io
import pandas as pd
import datefinder
from selenium import webdriver
from fredapi import Fred
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage

### 01: Data Gathering<br>
FOMC communications are released online as PDF documents for public consumption.  The FOMC associates each PDF document with a month.  This may not be the month in which the document was written, but it is the month in which it was released to the public.  All documents contain a timestamp with the month of release.<br>
Historical daily interest rates (the effective federal funds rate, FRED series `DFF`) will be downloaded from the US Federal Reserve's FRED database using the FRED API.<br>
PDF documents will be saved in a local folder, and interest rate data will be saved as a local CSV.
1. Library imports
2. Define the years and months target for PDF downloads
3. Download FOMC communication documents (all PDF files available for each month/year)
4. Delete any duplicates that may have been downloaded
5. Download rate history and save to CSV

In [2]:
## Define project directory

original_dir = os.getcwd()

In [3]:
## Define download targets

# Months
months = ['January', 'February', 'March', 'April', 'May', 'June', 
         'July', 'August', 'September', 'October', 'November', 'December']
# Years
years = ['1983', '1982', '1981', '1980', '1979', '1978', '1977', '1976', '1975', '1974', 
        '1973', '1972', '1971', '1970', '1969', '1968', '1967', '1966', '1965', '1964', 
        '1963', '1962', '1961', '1960']
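
The year list can also be generated instead of typed out. A minimal sketch, assuming the same span as the hard-coded list above (1983 back to 1960); `first_year` and `last_year` are illustrative names, not part of the original notebook:

```python
# Assumption: same span as the hard-coded list above (1983 down to 1960).
first_year, last_year = 1960, 1983

# range() excludes its stop value, so stop one below first_year when counting down.
years = [str(y) for y in range(last_year, first_year - 1, -1)]
```

Changing the two bounds is then enough to target a different batch of years.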

In [4]:
## Webscraper for downloading PDF files

# Function to start next download only after previous is finished
# Chrome keeps a ".crdownload" file in the folder while a download is in progress
def downloads_done():
    while any(".crdownload" in f for f in os.listdir(download_dir)):
        sleep(0.5)

# Options
download_dir = r"C:\FOMC Documents" # Windows (raw string so the backslash is not treated as an escape)
options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
           "download.default_directory": download_dir,
           "download.extensions_to_open": "applications/pdf",
           "download.prompt_for_download": False}
options.add_experimental_option("prefs", profile)
driver = webdriver.Chrome('./chromedriver/chromedriver', options=options)
driver.implicitly_wait(3)

# FOMC webpage
driver.get('https://fraser.stlouisfed.org/title/677')

# Loop downloads in all desired years and months
for y in years:
    for m in months:
        pdf_links = []
        xpath = "//*[contains(text(), '"+m+"') and contains(text(), '"+y+"')]"
        # Try/except block to accommodate some months not existing (no meetings)
        try:
            period = driver.find_element_by_xpath(xpath)
            period.click()

            elems = driver.find_elements_by_css_selector("a[href*='.pdf']")
            for elem in elems:
                if len(elem.text) > 1:
                    pdf_links.append(elem.text)
            for elem in pdf_links:
                download = driver.find_element_by_link_text(elem)
                download.click()
                sleep(1)
                downloads_done()
        except Exception:
            # No matching link for this month/year, or a click failed: skip it
            pass
driver.quit()
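
The wait-for-download step can hang indefinitely if a download stalls. As a hedged sketch of the same polling idea with a timeout, the helper below checks the folder for Chrome's in-progress `.crdownload` files until none remain or the deadline passes. The name `wait_for_downloads` and the parameters `timeout` and `poll_interval` are hypothetical and not part of the original notebook:

```python
import os
import time


def wait_for_downloads(folder, timeout=120.0, poll_interval=0.5):
    """Poll folder until no '.crdownload' files remain.

    Returns True once the folder is clear, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        pending = [f for f in os.listdir(folder) if f.endswith(".crdownload")]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

In the scraper loop this could stand in for `downloads_done()`, for example `wait_for_downloads(download_dir)`, with a `False` result logged so the stalled file can be retried later.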

In [6]:
## Delete any duplicates that may have been scraped

files = 'C:/FOMC Documents/'
os.chdir(files)
for f in os.listdir('.'):
    if f.endswith('(1).pdf'):
        os.remove(f)
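
Chrome numbers repeated downloads as `name (1).pdf`, `name (2).pdf`, and so on, so a file downloaded three times leaves a `(2)` copy that the check above misses. The sketch below is one possible generalisation: it removes any numbered copy, but only when the un-numbered original is still present. The function name `remove_numbered_duplicates` is hypothetical, and the suffix pattern is an assumption about how the duplicates are named:

```python
import os
import re

# Assumption: duplicates end in "(n).pdf", optionally preceded by a space.
DUPLICATE_SUFFIX = re.compile(r" ?\(\d+\)\.pdf$")


def remove_numbered_duplicates(folder):
    """Delete 'name (n).pdf' copies when 'name.pdf' exists; return removed names."""
    removed = []
    for name in sorted(os.listdir(folder)):
        match = DUPLICATE_SUFFIX.search(name)
        if match is None:
            continue
        original = name[:match.start()] + ".pdf"
        if os.path.exists(os.path.join(folder, original)):
            os.remove(os.path.join(folder, name))
            removed.append(name)
    return removed
```

Because the folder is passed in and joined explicitly, this version does not need `os.chdir`, and a numbered file with no original is kept rather than deleted.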

In [16]:
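## Download daily effective federal funds rate history (FRED series DFF)

# Read the FRED API key from an environment variable rather than hard-coding it
fred = Fred(api_key=os.environ['FRED_API_KEY'])
data = fred.get_series('DFF')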

In [17]:
data.head()

1954-07-01    1.13
1954-07-02    1.25
1954-07-03    1.25
1954-07-04    1.25
1954-07-05    0.88
dtype: float64

In [7]:
## Save locally as CSV

os.chdir(original_dir)
# to_frame names the single column regardless of the Series' own name
data = data.to_frame(name='rate')
data.to_csv('./data/data.csv', sep=',')

### 02: Convert PDF Files
Downloaded PDF files will be converted to text strings in a dataframe.  First, a function is defined to convert PDF text to strings using the pdfminer library.  Next, a list of the PDF documents to convert is created.  The total number of PDF files scraped is very large, and many documents hold no valuable predictive information (such as press statements announcing when a document will be released).  Document types vary over the years, so a simple, robust rule is used: only the PDF document with the largest file size for each month is kept.  Some months have no documents released.<br>
An alternative method of choosing documents by their filename is included for reference.
1. Define function for extracting text
2. Define which documents to use
3. Extract text
4. Save text dataframe as local CSV.

In [7]:
## Function to extract text from PDF files

def extract_text_from_pdf(pdf_path):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, 
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)
            
        text = fake_file_handle.getvalue()
    
    # close open handles
    converter.close()
    fake_file_handle.close()
    
    if text:
        return text

In [None]:
## Import documents based on name of the document
## Deprecated alternative method

# df_beige = pd.DataFrame(columns=['date','text'])
# files = 'C:/FOMC Documents/'
# os.chdir(files)


# # Beige book - Runs from 2012/07 to present
# for f in os.listdir('.'):
#     if f.startswith('Beige'):
#         new_date = list(datefinder.find_dates(f))
#         new_text = extract_text_from_pdf(files + f)

#         new_line = pd.DataFrame({'date': new_date[0], 'text': new_text}, index=[0])
#         df_beige = pd.concat([df_beige, new_line])

In [18]:
## Get the largest PDF for each month
## File names and communication types vary over the years

files = 'C:/FOMC Documents/'
os.chdir(files)

months = ['01','02','03','04','05','06','07','08','09','10','11','12']
years = [2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,
        1999,1998,1997,1996,1995,1994,1993,1992,1991,1990,1989,1987,1986,1985, 1984, 1983, 
        1982, 1981, 1980, 1979, 1978, 1977, 1976, 1975, 1974, 1973, 1972, 1971, 1970, 1969, 
        1968, 1967, 1966, 1965, 1964, 1963, 1962, 1961, 1960]
years = [str(i) for i in years]

pdf_files = []
largest_byte = 0
largest_file = ''

for y in years:
    for m in months:
        date_filter = y+m
        for f in os.listdir('.'):
            if date_filter in f:
                stat = os.stat(f)
                if stat.st_size > largest_byte:
                    largest_byte = stat.st_size
                    largest_file = f
        pdf_files.append(largest_file)
        largest_file = ''
        largest_byte = 0
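
The loop above rescans the whole folder once per month. As a sketch of the same "largest file per month" rule in a single pass, the helper below groups files by the month found in their names. It assumes each file name contains an 8-digit `YYYYMMDD` date; the names `largest_pdf_per_month` and `DATE_IN_NAME` are hypothetical:

```python
import os
import re

# Assumption: file names contain an 8-digit YYYYMMDD date, e.g. "19600301".
DATE_IN_NAME = re.compile(r"(?<!\d)((?:19|20)\d{2})(0[1-9]|1[0-2])\d{2}(?!\d)")


def largest_pdf_per_month(folder):
    """Map 'YYYYMM' to the name of the largest PDF in folder for that month."""
    best = {}  # 'YYYYMM' -> (size_in_bytes, file_name)
    for name in os.listdir(folder):
        if not name.lower().endswith(".pdf"):
            continue
        match = DATE_IN_NAME.search(name)
        if match is None:
            continue
        month_key = match.group(1) + match.group(2)
        size = os.path.getsize(os.path.join(folder, name))
        if month_key not in best or size > best[month_key][0]:
            best[month_key] = (size, name)
    return {key: name for key, (size, name) in sorted(best.items())}
```

Unlike the cell above, months with no documents simply do not appear in the result, so there are no empty strings to skip later.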

In [19]:
len(pdf_files)

696

In [None]:
## Extract text from each file
## Counter will show progress, and confirm the process is working

df_texts = pd.DataFrame(columns=['date','text'])
files = 'C:/FOMC Documents/'
os.chdir(files)
counter = 0

# Extract text from the largest file from each month
for file in pdf_files:
    # Months with no documents were recorded as empty strings; skip them
    if not file:
        continue
    new_date = list(datefinder.find_dates(file))
    new_text = extract_text_from_pdf(files + file)

    new_line = pd.DataFrame({'date': new_date[0], 'text': new_text}, index=[0])
    df_texts = pd.concat([df_texts, new_line])
    print(counter)
    counter += 1
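
The date lookup relies on `datefinder` finding a date in every file name, and `new_date[0]` raises an `IndexError` when it finds none. A stricter alternative, sketched here under the assumption that file names carry an 8-digit `YYYYMMDD` date, parses the date with the standard library and returns `None` when nothing valid is found. The name `date_from_filename` is hypothetical:

```python
import re
from datetime import datetime

# Assumption: the date appears in the file name as exactly 8 digits (YYYYMMDD).
EIGHT_DIGITS = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def date_from_filename(name):
    """Return the first valid YYYYMMDD date in name as a date, or None."""
    for digits in EIGHT_DIGITS.findall(name):
        try:
            return datetime.strptime(digits, "%Y%m%d").date()
        except ValueError:
            # Eight digits that are not a real calendar date; keep looking
            continue
    return None
```

Files for which this returns `None` can be skipped or logged instead of stopping the extraction loop partway through.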

In [12]:
df_texts.shape

(644, 2)

In [13]:
df_texts.tail(10)

| | date | text |
|---|---|---|
| 0 | 1960-03-01 | A meeting of the Federal Open Market Committee... |
| 0 | 1960-04-12 | A meeting of the Federal Open Market Committee... |
| 0 | 1960-05-03 | A meeting of the Federal Open Market Committee... |
| 0 | 1960-06-14 | A meeting of the Federal Open Market Committee... |
| 0 | 1960-07-06 | A meeting of the Federal Open Market Committee... |
| 0 | 1960-08-16 | A meeting of the Federal Open Market Committee... |
| 0 | 1960-09-13 | A meeting of the Federal Open Market Committee... |
| 0 | 1960-10-04 | A meeting of the Federal Open Market Committee... |
| 0 | 1960-11-22 | A meeting of the Federal Open Market Committee... |
| 0 | 1960-12-13 | A meeting of the Federal Open Market Committee... |


In [14]:
os.chdir(original_dir)
df_texts.to_csv('./data/df_texts.csv', sep=',')