# Exercise 9 - KNN Classification

In Tutorial 20, we devised a simple classification problem involving daily changes in VIX levels and daily SPY returns.  In particular, we used k-nearest neighbors to identify a given daily return as a gain or a loss by analyzing changes in the VIX from *the same day*.  We found the prediction accuracy to be a little over 80%, which is quite strong.

In this exercise, we extend that analysis to try to predict whether *the following day* will be a gain or a loss by looking at VIX changes from the current day.  As you will see, there is very little predictive power in this methodology.

#### 1) Import the packages that you think you will need.

In [1]:
import pandas as pd
import numpy as np
import sklearn

#### 2) Read in the data from `vix_knn.csv` and assign it to a variable called `df_vix`.

In [2]:
df_vix = pd.read_csv('../data/vix_knn.csv')
df_vix.head()

Unnamed: 0,trade_date,vix_009,vix_030,vix_090,vix_180,spy_ret
0,2011-01-03,,,,,0.010338
1,2011-01-04,0.02,-0.23,-0.01,-0.21,-0.000551
2,2011-01-05,-0.49,-0.36,-0.56,-0.41,0.005198
3,2011-01-06,0.14,0.38,0.3,0.09,-0.001959
4,2011-01-07,-0.7,-0.26,-0.06,0.05,-0.001962


#### 3) Notice that the first row of `df_vix` contains `NaN` values, so remove the first row.

In [3]:
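# .copy() gives an independent frame, so adding columns later does not warn about writing to a slice
df_vix = df_vix[df_vix.trade_date > '2011-01-03'].copy()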
df_vix.head()

Unnamed: 0,trade_date,vix_009,vix_030,vix_090,vix_180,spy_ret
1,2011-01-04,0.02,-0.23,-0.01,-0.21,-0.000551
2,2011-01-05,-0.49,-0.36,-0.56,-0.41,0.005198
3,2011-01-06,0.14,0.38,0.3,0.09,-0.001959
4,2011-01-07,-0.7,-0.26,-0.06,0.05,-0.001962
5,2011-01-10,0.8,0.4,0.19,0.01,-0.001259
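
Filtering on a hard-coded date works here because we know which row is incomplete. As an alternative sketch, `dropna` removes rows with missing values without naming the date. The toy frame below is illustrative only and simply mimics a few columns of `df_vix`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of df_vix (a stand-in, not the real file).
toy = pd.DataFrame({
    'trade_date': ['2011-01-03', '2011-01-04', '2011-01-05'],
    'vix_009': [np.nan, 0.02, -0.49],
    'spy_ret': [0.010338, -0.000551, 0.005198],
})

# Drop every row that has a NaN in the listed feature column.
toy_clean = toy.dropna(subset=['vix_009'])
print(toy_clean)
```

Passing `subset` limits the check to the columns you care about, so a missing value elsewhere would not discard the row.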


#### 4) Add a column to `df_vix` called `spy_label_1`.  These will be the labels that we are trying to predict, and they will be a function of the *next day* return. If the next-day return is a loss, the column will contain an 'L'; otherwise it will contain a 'G'.

In [4]:
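def labeler(ret):
    # Negative returns are losses; zero or positive returns count as gains.
    if ret < 0:
        return 'L'
    return 'G'

# Label each day's own return, then shift up one row so that each row
# carries the label of the *next* trading day.
df_vix['spy_label_1'] = df_vix['spy_ret'].apply(labeler).shift(-1)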
df_vix.tail()

Unnamed: 0,trade_date,vix_009,vix_030,vix_090,vix_180,spy_ret,spy_label_1
2007,2018-12-24,8.66,5.96,2.61,1.5,-0.026423,G
2008,2018-12-26,-7.69,-5.66,-3.15,-1.99,0.050525,G
2009,2018-12-27,-0.83,-0.45,0.2,-0.14,0.007677,L
2010,2018-12-28,-2.86,-1.62,-0.57,-0.28,-0.00129,G
2011,2018-12-31,-3.67,-2.92,-1.59,-1.02,0.008759,
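
To see why `shift(-1)` produces next-day labels, it can help to look at a tiny example. The sketch below uses made-up returns, and `spy_label_0` is a hypothetical column name introduced only to show the same-day label next to the shifted one.

```python
import pandas as pd

# Four made-up daily returns (illustrative values only).
toy = pd.DataFrame({'spy_ret': [0.01, -0.02, 0.03, -0.004]})

# Label each day's own return first ...
same_day = toy['spy_ret'].apply(lambda r: 'L' if r < 0 else 'G')

# ... then shift(-1) moves every label up one row, so row t holds
# the label of day t+1. The last row has no next day and becomes missing.
toy['spy_label_0'] = same_day           # hypothetical same-day label column
toy['spy_label_1'] = same_day.shift(-1)
print(toy)
```

The missing value in the last row is the reason the final row has to be dropped before fitting.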


#### 5) Notice that in the final row of `df_vix`, the `spy_label_1` column contains a `NaN` value.  Remove the final row from `df_vix`.

In [5]:
df_vix = df_vix[df_vix.trade_date < '2018-12-31']
df_vix.tail()

Unnamed: 0,trade_date,vix_009,vix_030,vix_090,vix_180,spy_ret,spy_label_1
2006,2018-12-21,3.32,1.73,0.99,0.53,-0.02049,L
2007,2018-12-24,8.66,5.96,2.61,1.5,-0.026423,G
2008,2018-12-26,-7.69,-5.66,-3.15,-1.99,0.050525,G
2009,2018-12-27,-0.83,-0.45,0.2,-0.14,0.007677,L
2010,2018-12-28,-2.86,-1.62,-0.57,-0.28,-0.00129,G


#### 6) Import the `KNeighborsClassifier` constructor, as well as the `scale` function and the `train_test_split` function.

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

#### 7) Select all four VIX term structure points as your feature set and call it `X`.  Also, isolate the labels you want to predict and call them `y`.

In [7]:
X = df_vix[['vix_009', 'vix_030', 'vix_090', 'vix_180']]
y = df_vix['spy_label_1'].values

#### 8) Use the `scale()` function to normalize the feature set; call the normalized features `Xs`.

In [8]:
Xs = scale(X)
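
KNN relies on distances between observations, so features measured on larger scales would otherwise dominate the neighbor search. As a rough sketch of what `scale()` does with its default settings, the NumPy code below standardizes each column of a made-up array by hand.

```python
import numpy as np

# Made-up feature matrix: 3 observations, 2 features on very different scales.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# Column-wise z-score: subtract the column mean and divide by the
# population standard deviation (ddof=0).
Xs_toy = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(Xs_toy.mean(axis=0))  # approximately 0 for every column
print(Xs_toy.std(axis=0))   # approximately 1 for every column
```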

#### 9) Use `train_test_split()` to generate a training set and a hold out set.  Use the canonical variable names `X_train`, `X_test`, `y_train`, `y_test`.  Set the size of the test set to 20%.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.20, random_state=0)
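
The following sketch, on made-up data, shows how `test_size=0.20` determines the sizes of the two sets and how fixing `random_state` makes the shuffle reproducible. The names `X_toy`, `y_toy`, and the shortened result names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 observations, 2 features
y_toy = np.array(['G', 'L'] * 5)       # 10 made-up labels

# 20% of 10 observations go to the test set, the rest to training.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0
)
print(np.array_equal(X_te, X_te2))
```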

#### 10) Instantiate a KNN classifier with a hyperparameter (`n_neighbors`) of 10, and fit the model to the training set.

In [10]:
clf = KNeighborsClassifier(n_neighbors = 10)
clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=10, p=2,
           weights='uniform')
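
The printed settings (`metric='minkowski'`, `p=2`, `weights='uniform'`) correspond to Euclidean distance with an unweighted majority vote. The sketch below is a simplified, from-scratch illustration of that idea on made-up points. It is not scikit-learn's implementation; `knn_predict` is a hypothetical helper, and its handling of tied votes may differ from the library's.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Predict a label for x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation.
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training observations.
    nearest = np.argsort(dists)[:k]
    # Unweighted vote: each neighbor counts once.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated made-up clusters, one per label.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [2.0, 2.0], [2.1, 1.9], [1.9, 2.2]])
y_toy = np.array(['G', 'G', 'G', 'L', 'L', 'L'])

print(knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3))  # G
print(knn_predict(X_toy, y_toy, np.array([2.0, 2.1]), k=3))    # L
```

With an even `k` such as 10 and two classes, tied votes are possible, which is one reason odd values of `k` are often tried as well.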

#### 11) Check the in-sample accuracy score of the model.

In [11]:
print(clf.score(X_train, y_train))

0.6436567164179104


#### 12) Check the out-of-sample accuracy score using the test set.

In [12]:
print(clf.score(X_test, y_test))

0.5223880597014925
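
For a classifier, `score` reports accuracy: the fraction of predictions that match the true labels. One way to put an out-of-sample figure like this in context is to compare it with a naive rule that always predicts the most frequent training label. The sketch below uses made-up labels, and `accuracy` and `majority_baseline` are hypothetical helpers, not part of scikit-learn.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def majority_baseline(y_train, y_test):
    """Accuracy on y_test of always predicting the most common label in y_train."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return accuracy(y_test, np.full(len(y_test), majority))


# Made-up labels for illustration only.
y_train_toy = np.array(['G', 'G', 'G', 'L', 'L'])
y_test_toy = np.array(['G', 'L', 'G', 'G'])

print(majority_baseline(y_train_toy, y_test_toy))  # 0.75
```

If the model's test accuracy is not clearly above such a baseline, the features are adding little beyond the base rate of gains and losses.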
