<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Modelling lab: Can we predict US government bond interest rates from financial news headlines?

In this exercise we'll combine two datasets, interest rates on US government bonds and business/financial news headlines from various sources, to see whether we can predict changes in interest rates from the news. The question we're getting at is: do news headlines give us an indication of future changes in interest rates?

# Part 1 - US treasury data

The data for this comes [from the Federal Reserve](https://www.federalreserve.gov/releases/H15/default.htm) and is the file `FRB_H15.csv` in the `data` folder.

### 1. Load in the data, making sure excess headers are dealt with
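One way to deal with descriptive rows above the real header is the `skiprows` argument of `pd.read_csv`. The sketch below uses a small inline stand-in rather than the real file: the layout, the number of metadata rows and the column names are all assumptions, so inspect `FRB_H15.csv` before choosing a value.

```python
import io

import pandas as pd

# Hypothetical stand-in for FRB_H15.csv: a few descriptive rows sit above
# the real header row. The layout here is an assumption, so open the real
# file and count its metadata rows before choosing `skiprows`.
raw = io.StringIO(
    "Series Description,Example 10-year rate\n"
    "Unit:,Percent:_Per_Year\n"
    "Unique Identifier:,EXAMPLE_ID\n"
    "Time Period,RATE_10Y\n"
    "2014-01-02,3.00\n"
    "2014-01-03,3.01\n"
)

n_metadata_rows = 3  # assumption for this toy file only
rates = pd.read_csv(raw, skiprows=n_metadata_rows)
rates.columns = ["date", "rate"]  # shorter names, chosen for illustration

print(rates.head())
```

For the real file the same call would take the path (for example `"data/FRB_H15.csv"`) in place of `raw`.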

### 2. Make sure the columns are appropriate types, converting as necessary

Are there any missing values?
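A minimal sketch of the type conversion, on a toy frame. The `"ND"` placeholder for missing observations is an assumption about how gaps might be encoded; check what the real file uses.

```python
import pandas as pd

# Toy frame standing in for the loaded rates; "ND" as a no-data marker is an
# assumption about how gaps might be encoded in the real file.
rates = pd.DataFrame(
    {"date": ["2014-01-02", "2014-01-03", "2014-01-06"],
     "rate": ["3.00", "ND", "2.98"]}
)

rates["date"] = pd.to_datetime(rates["date"])
# errors="coerce" turns anything non-numeric into NaN instead of raising.
rates["rate"] = pd.to_numeric(rates["rate"], errors="coerce")

print(rates.dtypes)
print(rates.isna().sum())
```

Coercing first and counting `NaN` afterwards surfaces missing values that were hidden as text.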

### 3. Plot the time series to get a feel for the data

### 4. Plot the time series for 2014 only

The news dataset is from 2014, so let's get a feel for the interest rates in that year only.
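If the dates are set as the index, a single year can be selected with a partial date string. This is a sketch on made-up values, assuming the frame has a `DatetimeIndex`.

```python
import pandas as pd

# Toy daily series spanning a year boundary (values are made up).
rates = pd.DataFrame(
    {"date": pd.to_datetime(["2013-12-31", "2014-01-02", "2014-12-31", "2015-01-02"]),
     "rate": [3.04, 3.00, 2.17, 2.12]}
).set_index("date")

# With a DatetimeIndex, a year string selects every row in that year.
rates_2014 = rates.loc["2014"]
print(rates_2014)
```

Calling `rates_2014["rate"].plot()` on the result draws the series, provided `matplotlib` is installed.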

### 5. Add a "change from yesterday" column

Rather than predicting the interest rate itself we'll try to predict the overnight change in the value from news items on the previous day. For that we'll need a target column, which is the change in interest rate from the day before.

Create this column.
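One possible approach is `Series.diff()`, sketched here on toy values. The column name `change` is a placeholder.

```python
import pandas as pd

rates = pd.DataFrame(
    {"rate": [3.00, 3.01, 2.98]},
    index=pd.to_datetime(["2014-01-02", "2014-01-03", "2014-01-06"]),
)

# diff() subtracts the previous row, so the first row has no predecessor (NaN).
# Rows must be sorted by date for "previous row" to mean "previous available day".
rates = rates.sort_index()
rates["change"] = rates["rate"].diff()
print(rates)
```

Note that with this approach "yesterday" means the previous row in the data, which across a weekend is the previous Friday rather than the previous calendar day.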

# Part 2 - News

Our news headline data comes from the UCI Machine Learning repository ([via Kaggle](https://www.kaggle.com/uciml/news-aggregator-dataset)). The news dataset originally has multiple categories, but the file provided for this exercise is limited to the "business" category.

### 1. Read in the data

The file is compressed to reduce its size, but `pandas` can read a CSV directly from a Gzip archive (you don't have to decompress it yourself).
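As a sketch of reading a compressed CSV: `pd.read_csv` infers the compression from the file extension by default. The example writes its own tiny `.csv.gz` file so it is self-contained; the file name and column are placeholders, not the lab's real file.

```python
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed CSV so the example is self-contained; the file
# name and columns are placeholders, not the lab's real file.
path = os.path.join(tempfile.mkdtemp(), "example_news.csv.gz")
pd.DataFrame({"title": ["Fed holds rates", "Stocks rally"]}).to_csv(path, index=False)

# compression="infer" is the default: the ".gz" suffix tells pandas to decompress.
news = pd.read_csv(path)
print(news)
```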

### 2. Do the usual sense checking of the dataset

- how many rows are there?
- are there missing values?
- what are the data types?

### 3. Convert the "timestamp" column to a date type

Have a read of [the data documentation](http://archive.ics.uci.edu/ml/datasets/News+Aggregator) on the UCI ML repository site to see what format the column is currently in (if it's not a format you've come across before).
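A sketch of the conversion, assuming the column turns out to hold integer milliseconds since the Unix epoch (confirm this against the documentation). The values and column names below are illustrative.

```python
import pandas as pd

# Assumption: the column holds integer milliseconds since 1970-01-01 (UTC).
# The values below are illustrative, not taken from the file.
news = pd.DataFrame({"timestamp": [1394470370698, 1394470371207]})

news["datetime"] = pd.to_datetime(news["timestamp"], unit="ms")
news["date"] = news["datetime"].dt.normalize()  # midnight of the same day

# The minimum and maximum give a quick check of the period covered.
print(news["date"].min(), news["date"].max())
```

If the unit were wrong (seconds instead of milliseconds, say), the resulting dates would fall far outside 2014, which is what the range check is for.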

### 4. Sanity check your date column by looking at what time period it covers (should be dates within 2014)

# Part 3 - Merging the datasets

Now it's time to combine the two datasets.

We want to make sure the dataset accurately reflects the data generating process, so join the news headlines with the interest rates making sure that the date for the headlines is the day *before* the date of the interest rates. In other words, we assume that any effect of the news takes a day to show up in the rates. You can always tweak this assumption later!

### 1. Create a column in your news dataset called "next day" to store the date + 1 day

*Hint: the `datetime` module has a class called `timedelta` that will help (and `pandas` has an equivalent, `pd.Timedelta`)!*
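A minimal sketch of the date shift using `pd.Timedelta`. The column is named `next_day` here (with an underscore) purely for convenience.

```python
import pandas as pd

news = pd.DataFrame({"date": pd.to_datetime(["2014-03-10", "2014-12-31"])})

# Adding a one-day Timedelta shifts every date forward, rolling over months
# and years correctly.
news["next_day"] = news["date"] + pd.Timedelta(days=1)
print(news)
```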

### 2. Merge the two datasets

Join the rates data to the news data with a left join, so that every headline is kept. Remember to join on the "next day" column instead of the original date in your news dataset.

After joining, check that the result has the same number of rows as your news dataset (to verify the join).
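The join can be sketched as below, on toy frames with placeholder column names. The `validate` argument is an optional safeguard, not something the exercise requires.

```python
import pandas as pd

news = pd.DataFrame(
    {"title": ["a", "b", "c"],
     "next_day": pd.to_datetime(["2014-03-11", "2014-03-11", "2014-03-15"])}
)
rates = pd.DataFrame(
    {"date": pd.to_datetime(["2014-03-11", "2014-03-12"]),
     "change": [0.02, -0.01]}
)

# how="left" keeps every headline; validate raises if a date appears more
# than once in `rates`, which would otherwise duplicate headlines.
merged = news.merge(
    rates, how="left", left_on="next_day", right_on="date", validate="many_to_one"
)

assert len(merged) == len(news)
print(merged)
```

In this toy data headline `c` has a "next day" with no matching rate (a Saturday), so its `change` is `NaN`. Rows like that would need handling before modelling.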

# Part 4 - Feature creation

### 1. Separate your dataset into training and test sets

Your training and test features should only contain the titles for now, and your training/test targets are your interest rate changes.

We want to do this *before* creating text features, because our vocabulary should come from our **training set only**.

Remember, the test set is information you haven't seen yet, so words in those headlines shouldn't count!
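A sketch of the split with scikit-learn's `train_test_split`, on placeholder data. A random split with a 30% test share is an assumption made here; splitting chronologically is another reasonable choice for dated data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame(
    {"title": [f"headline {i}" for i in range(10)],
     "change": [0.01 * (i % 3 - 1) for i in range(10)]}
)

# Features are the raw titles for now; the target is the rate change.
X_train, X_test, y_train, y_test = train_test_split(
    data["title"], data["change"], test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```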

### 2. Create text features using the `TfidfVectorizer`

Now we want to create features that represent the TF-IDF scores of the `N` most common (non-stopword) words.

Try `N`=100 for now.

Use scikit-learn's `TfidfVectorizer` to create a document term matrix of 100 columns, one row per news headline, using the **training set only**.
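A minimal sketch of fitting the vectorizer on training titles only. The headlines are invented, and `max_features` is 5 only because the toy corpus is tiny; the lab asks for 100.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_titles = [
    "Fed signals rate rise as economy grows",
    "Bank profits rise on strong economy",
    "Airlines cut fares as oil prices fall",
    "Oil prices fall while bank shares rise",
]

# max_features keeps the N terms with the highest corpus frequency;
# N is 5 here only because the toy corpus is tiny (the lab uses 100).
vectorizer = TfidfVectorizer(max_features=5, stop_words="english")
X_train_tfidf = vectorizer.fit_transform(train_titles)

print(X_train_tfidf.shape)
```

The result is a sparse matrix with one row per headline and one column per retained term.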

### 3. Examine the vocabulary in your `TfidfVectorizer`

What are the top 100 words it picked out?

### 4. Transform your test set using your trained vectorizer

You should now have a 100-column training set - each column representing a word, and each value representing the TF-IDF score of that word in each title.

Use your trained vectorizer object to create a 100-column test set using the set-aside test set titles.

### 5. Rename columns in the training set

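Use your vectorizer's `vocabulary_` dictionary to get the column names. It maps each word to its column index, so sort the words by that index (which is the same as sorting them alphabetically) to match the column order of the document-term matrix.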

Your training dataset should look something like this:

| ahead | airlines | american | ... |
|---------|---------|------|------|
| 0.45 | 0.3 | 0.0 | ... |
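Transforming the test titles and naming the columns could look like the sketch below. The headlines are invented and no `max_features` limit is set, to keep the toy example small.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

train_titles = ["oil prices fall", "bank shares rise", "oil shares rise"]
test_titles = ["oil prices rise", "unseen words only"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_titles)
# transform (not fit_transform) reuses the training vocabulary and IDF weights.
X_test = vectorizer.transform(test_titles)

# vocabulary_ maps term -> column index, so sorting terms by index gives
# the column order of the matrix.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
train_df = pd.DataFrame(X_train.toarray(), columns=columns)
test_df = pd.DataFrame(X_test.toarray(), columns=columns)

print(train_df.round(2))
print(test_df.round(2))
```

A test headline made up entirely of words outside the training vocabulary becomes a row of zeros, which is the intended behaviour: the test set contributes no new columns.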

# Part 5 - Prediction!

### 1. Examine your target (in the training set) - what is the distribution (i.e. what is a credible range for predictions)?

### 2. Compare the cross-validated performance of two different prediction models

For this you'll need to:

- choose the appropriate type of model (is this classification or regression?)
- based on the above, come up with an appropriate metric to measure performance
- choose two appropriate algorithms to compare with one another
- run both models (with cross-validation) and examine your results!
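The steps above can be sketched as follows, assuming the problem is treated as regression and scored with mean absolute error. The two algorithms, their settings and the synthetic data are illustrative choices only, not the expected answer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the TF-IDF matrix and rate changes.
rng = np.random.default_rng(0)
X = rng.random((60, 8))
y = 0.05 * X[:, 0] + rng.normal(scale=0.01, size=60)

models = {
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn maximises scores, so the error is reported negated.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()
    print(f"{name}: MAE = {results[name]:.4f}")
```

Using the same folds and the same metric for both models is what makes the comparison meaningful.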

### 3. Compare your models' feature importance - what words are important to predict interest rate changes?
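As a rough sketch, a linear model exposes signed coefficients through `coef_` and a random forest exposes non-negative importances through `feature_importances_`. The vocabulary and data below are synthetic, built so that only one word carries signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
words = ["bank", "fed", "oil", "rise"]  # placeholder vocabulary
X = rng.random((80, len(words)))
y = 0.05 * X[:, 1] + rng.normal(scale=0.005, size=80)  # only "fed" matters

ridge = Ridge(alpha=0.1).fit(X, y)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Linear coefficients are signed; forest importances are non-negative and sum to 1.
for word, coef, imp in zip(words, ridge.coef_, forest.feature_importances_):
    print(f"{word:>5}  coef={coef:+.4f}  importance={imp:.3f}")

top_word = words[int(np.argmax(forest.feature_importances_))]
```

The two measures are on different scales, so compare the ranking of words between models rather than the raw numbers.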

### 4. Try something to improve one of your models

Some ideas:

- change the number of features (are more or fewer than 100 a good idea? This may depend on which words your model thought important)
- tune the hyperparameters using grid search
- try something more radical and predict the rates themselves, not the changes
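For the grid search idea, a minimal sketch with `GridSearchCV` on synthetic data might look like this. The model, the parameter grid and the metric are arbitrary starting points, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.random((60, 8))
y = 0.05 * X[:, 0] + rng.normal(scale=0.01, size=60)

# The alpha grid is an arbitrary starting point, not a recommendation.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Run the search on the training set only, so the test set stays untouched for the final evaluation.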

### 5. Evaluate the best version of your best model on the test set - how did it perform?