This notebook outlines which `nltk` functions will be useful for pre-processing the release text and training the classification algorithm.

Expects a CSV file, `release_df.csv`, in the current directory, with one release per row: its date, text, and classification (pos or neg).

In [30]:
import pandas as pd
import nltk
import re

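import sklearn.feature_extraction.text  # explicit submodule import; a bare "import sklearn" is not guaranteed to load it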

Some pre-processing to get the dataframe into the format we need for content analysis.

In [70]:
release_df = pd.read_csv('release_df.csv', index_col=0)

In [15]:
release_df['year'] = release_df['date'].apply(lambda x: int(x[-4:]))

In [29]:
release_df['month'] = release_df['date'].apply(lambda x: re.findall(r'(\w+) \d', x)[0])
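
The two cells above pull the year and month out of strings such as "January 31, 2018" by slicing and by regex. As a cross-check, the same fields can be obtained by parsing the dates. The sketch below assumes every date follows the `Month DD, YYYY` pattern seen in the outputs; the toy frame `demo_df` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for release_df, using the date format shown in this notebook.
demo_df = pd.DataFrame(
    {"date": ["January 31, 2018", "February 01, 2017", "March 15, 2017"]}
)

# Parse once, then derive both fields from the parsed timestamps.
parsed = pd.to_datetime(demo_df["date"], format="%B %d, %Y")
demo_df["year"] = parsed.dt.year
demo_df["month"] = parsed.dt.month_name()

print(demo_df)
```

Parsing raises on a date that does not match the format, whereas slicing the last four characters can silently return a wrong year, for example when a string has trailing whitespace.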

In [None]:
ratesdf = pd.read_csv('rates.csv')

In [None]:
ratesdf = ratesdf.rename(columns={"Month":'month', "Year":'year'})

In [None]:
release_df = release_df.reset_index(drop=True)

In [101]:
ratesdf

Unnamed: 0,month,year,Effective Rate,Shadow Rate,Change,Positive (1) / Negative (-1)
1,January,2008,3.94,,,
2,February,2008,2.98,,-24.37%,-1.0
3,March,2008,2.61,,-12.42%,-1.0
4,April,2008,2.28,,-12.64%,-1.0
5,May,2008,1.98,,-13.16%,-1.0
6,June,2008,2.00,,1.01%,1.0
7,July,2008,2.01,,0.50%,1.0
8,August,2008,2.00,,-0.50%,-1.0
9,September,2008,1.81,,-9.50%,-1.0
10,October,2008,0.97,,-46.41%,-1.0


In [96]:
release_df.merge(ratesdf, how='left', on = ['year', 'month'])

Unnamed: 0,date,release_text,year,month,Effective Rate,Shadow Rate,Change,Positive (1) / Negative (-1)
0,"January 31, 2018",Information received since the Federal Open Ma...,2018,January,,,,
1,"February 01, 2017",Information received since the Federal Open Ma...,2017,February,,,,
2,"March 15, 2017",Information received since the Federal Open Ma...,2017,March,,,,
3,"May 03, 2017",Information received since the Federal Open Ma...,2017,May,,,,
4,"June 14, 2017",Information received since the Federal Open Ma...,2017,June,,,,
5,"July 26, 2017",Information received since the Federal Open Ma...,2017,July,,,,
6,"September 20, 2017",Information received since the Federal Open Ma...,2017,September,,,,
7,"November 01, 2017",Information received since the Federal Open Ma...,2017,November,,,,
8,"December 13, 2017",Information received since the Federal Open Ma...,2017,December,,,,
9,"January 27, 2016",Information received since the Federal Open Ma...,2016,January,,,,
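
Every rate column in the merged output above is empty, so no (year, month) key in `release_df` matched a key in `ratesdf`. The notebook does not show why. A common cause is keys that look equal but differ in dtype or stray whitespace. The sketch below, built on made-up toy frames and a hypothetical `normalize_keys` helper, shows one way to check with `indicator=True`. Whether this is the actual cause here is an assumption.

```python
import pandas as pd

# Toy frames with made-up values; the real release_df and ratesdf are not reproduced here.
releases = pd.DataFrame({"year": [2017, 2017], "month": ["March", "May"]})
rates = pd.DataFrame({
    "year": ["2017", "2017"],       # years stored as strings
    "month": ["March ", "May"],     # one month name has trailing whitespace
    "Effective Rate": [1.0, 2.0],   # placeholder numbers, not real rates
})

def normalize_keys(df):
    """Hypothetical helper: return a copy with integer years and stripped month names."""
    out = df.copy()
    out["year"] = out["year"].astype(int)
    out["month"] = out["month"].astype(str).str.strip()
    return out

merged = normalize_keys(releases).merge(
    normalize_keys(rates), how="left", on=["year", "month"], indicator=True
)

# "_merge" is "both" for matched rows and "left_only" for rows with no partner in rates.
print(merged["_merge"].value_counts())
```

Comparing `release_df[['year', 'month']].dtypes` with `ratesdf[['year', 'month']].dtypes` is a quick first check before normalizing anything.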


In [24]:
testrelease_df = release_df[(release_df['year']==2017) | (release_df['year']==2018)]

In [26]:
trainrelease_df = release_df.iloc[9:]  # positional split: rows 0-8 are the 2017-2018 releases shown above

In [29]:
testrelease_df[:5]

Unnamed: 0,date,release_text,year,month
0,"January 31, 2018",Information received since the Federal Open Ma...,2018,January
1,"February 01, 2017",Information received since the Federal Open Ma...,2017,February
2,"March 15, 2017",Information received since the Federal Open Ma...,2017,March
3,"May 03, 2017",Information received since the Federal Open Ma...,2017,May
4,"June 14, 2017",Information received since the Federal Open Ma...,2017,June


In [28]:
trainrelease_df[:5]

Unnamed: 0,date,release_text,year,month
9,"January 27, 2016",Information received since the Federal Open Ma...,2016,January
10,"March 16, 2016",Information received since the Federal Open Ma...,2016,March
11,"April 27, 2016",Information received since the Federal Open Ma...,2016,April
12,"June 15, 2016",Information received since the Federal Open Ma...,2016,June
13,"July 27, 2016",Information received since the Federal Open Ma...,2016,July
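
The test set is selected by year while the training set is selected by position (`iloc[9:]`), and the two only agree as long as the 2017-2018 releases occupy the first nine rows. A sketch of deriving both sets from one boolean mask, on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame; the test years are deliberately not contiguous.
demo_df = pd.DataFrame({
    "year": [2018, 2017, 2016, 2017, 2015],
    "release_text": ["a", "b", "c", "d", "e"],
})

# One mask defines the test set; its complement is the training set.
is_test = demo_df["year"].isin([2017, 2018])
test_df = demo_df[is_test]
train_df = demo_df[~is_test]

print(len(test_df), len(train_df))
```

Because the two sets come from the same mask, they cannot overlap and together they cover every row, regardless of row order.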


In [61]:
ratesdf.join(release_df, on=['month', 'year'])

ValueError: len(left_on) must equal the number of levels in the index of "right"
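
The error arises because `DataFrame.join` with `on=` matches the listed columns of the calling frame against the index of the other frame, and `release_df` has a single-level index rather than a (month, year) index. Two ways around it, sketched on toy frames with made-up values: give the right-hand frame a two-level index, or use `merge`, which matches columns to columns.

```python
import pandas as pd

# Toy frames; the rate values are placeholders, not real data.
rates = pd.DataFrame({
    "month": ["January", "February", "March"],
    "year": [2016, 2016, 2016],
    "rate": [0.1, 0.2, 0.3],
})
releases = pd.DataFrame({
    "date": ["January 27, 2016", "March 16, 2016"],
    "release_text": ["text a", "text b"],
    "month": ["January", "March"],
    "year": [2016, 2016],
})

# Option 1: join against a right-hand frame indexed by the same two keys.
joined = rates.join(releases.set_index(["month", "year"]), on=["month", "year"])

# Option 2: merge matches the key columns of both frames directly.
merged = rates.merge(releases, how="left", on=["month", "year"])

print(joined)
```

Both keep all three rate rows and leave the release columns empty for February, which has no release in the toy data.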

In [59]:
ratesdf.shape

(121, 6)

In [60]:
release_df.shape

(103, 4)

Now the text processing, with `nltk` for tokenization and `sklearn` for TF-IDF features.

In [None]:
# release_df['tokenized'] = release_df['release_text']  # placeholder: tokenization not applied yet
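
A minimal sketch of how the `tokenized` column could be filled in. It assumes the text lives in `release_text`, as in the outputs above, and uses `nltk.tokenize.RegexpTokenizer`, which needs no downloaded corpora. The tokenizer choice and the lowercasing step are assumptions, not the notebook's settled method, and the sample sentence is made up.

```python
import pandas as pd
from nltk.tokenize import RegexpTokenizer

# Made-up sentence standing in for a release; not taken from the real data.
demo_df = pd.DataFrame(
    {"release_text": ["Inflation has moved up; the labor market remains strong."]}
)

# Keep runs of word characters, which drops punctuation.
tokenizer = RegexpTokenizer(r"\w+")
demo_df["tokenized"] = demo_df["release_text"].apply(
    lambda text: tokenizer.tokenize(text.lower())
)

print(demo_df["tokenized"][0])
```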

In [31]:
releaseTFVectorizer = sklearn.feature_extraction.text.TfidfVectorizer(max_df=0.5, min_df=3, stop_words='english', norm='l2')

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=None, min_df=3,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
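
The cell above only constructs the vectorizer. With these settings a term is kept if it appears in at least 3 documents (`min_df=3`) and in no more than half of them (`max_df=0.5`). A sketch of fitting it on a made-up six-document corpus (the real release texts are not reproduced here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "committee" is in all 6 documents, "inflation" and
# "employment" are in 3 each, and every other word appears once.
corpus = [
    "committee sees inflation rising",
    "committee expects inflation to moderate",
    "committee notes inflation pressures",
    "committee says employment strengthened",
    "committee views employment gains solid",
    "committee judges employment growth steady",
]

vectorizer = TfidfVectorizer(max_df=0.5, min_df=3, stop_words="english", norm="l2")
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

print(sorted(vectorizer.vocabulary_))
print(tfidf.shape)
```

Here "committee" is dropped by `max_df` because it occurs in every document, and the words that occur only once are dropped by `min_df`, leaving two features.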