In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
import math
import random
import seaborn as sns
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
from sklearn.model_selection import train_test_split


### Part 1: Exploring and Modeling the Data


Load the `smarket.csv` data and print the first 6 rows.

In [6]:
df = pd.read_csv("data/smarket.csv")
df.head(6)

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
0,2001,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
1,2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
2,2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
3,2001,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
4,2001,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up
5,2001,0.213,0.614,-0.623,1.032,0.959,1.3491,1.392,Up


The output variable we will predict is the `Direction` of the stock market. Because it is a categorical variable, it must first be converted into a numerical column.

In [7]:
df['Up'] = (df.Direction == 'Up').astype(int)
df.head(10)

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction,Up
0,2001,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up,1
1,2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up,1
2,2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down,0
3,2001,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up,1
4,2001,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up,1
5,2001,0.213,0.614,-0.623,1.032,0.959,1.3491,1.392,Up,1
6,2001,1.392,0.213,0.614,-0.623,1.032,1.445,-0.403,Down,0
7,2001,-0.403,1.392,0.213,0.614,-0.623,1.4078,0.027,Up,1
8,2001,0.027,-0.403,1.392,0.213,0.614,1.164,1.303,Up,1
9,2001,1.303,0.027,-0.403,1.392,0.213,1.2326,0.287,Up,1


Display the correlation table for all numeric columns of this dataset.

In [8]:
df.corr(numeric_only=True)  # skip the string Direction column (required on pandas >= 2.0)

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Up
Year,1.0,0.0297,0.030596,0.033195,0.035689,0.029788,0.539006,0.030095,0.074608
Lag1,0.0297,1.0,-0.026294,-0.010803,-0.002986,-0.005675,0.04091,-0.026155,-0.039757
Lag2,0.030596,-0.026294,1.0,-0.025897,-0.010854,-0.003558,-0.043383,-0.01025,-0.024081
Lag3,0.033195,-0.010803,-0.025897,1.0,-0.024051,-0.018808,-0.041824,-0.002448,0.006132
Lag4,0.035689,-0.002986,-0.010854,-0.024051,1.0,-0.027084,-0.048414,-0.0069,0.004215
Lag5,0.029788,-0.005675,-0.003558,-0.018808,-0.027084,1.0,-0.022002,-0.03486,0.005423
Volume,0.539006,0.04091,-0.043383,-0.041824,-0.048414,-0.022002,1.0,0.014592,0.022951
Today,0.030095,-0.026155,-0.01025,-0.002448,-0.0069,-0.03486,0.014592,1.0,0.730563
Up,0.074608,-0.039757,-0.024081,0.006132,0.004215,0.005423,0.022951,0.730563,1.0


Create a full logistic regression model on the *entire* dataset using `smf.logit`, print and observe the results

In [10]:
cols = ' + '.join(df.columns[:-2])
results = smf.logit('Up ~ ' + cols, data=df).fit()
results.summary()

  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


         Current function value: inf
         Iterations: 35


LinAlgError: Singular matrix
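The `LinAlgError` above is most likely a symptom of perfect separation: the formula built from `df.columns[:-2]` includes `Today`, and in the rows shown `Direction` is `Up` exactly when `Today` is positive, so the likelihood has no finite maximum. The following is a minimal sketch of a refit restricted to the lag and volume columns. The helper name `fit_lag_volume_logit` and the synthetic stand-in frame are assumptions made so the snippet runs without the CSV; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Predictors that are known before the trading day ends (Today and Year left out).
PREDICTORS = ["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume"]

def fit_lag_volume_logit(frame):
    """Fit Up ~ Lag1 + ... + Lag5 + Volume with statsmodels (hypothetical helper)."""
    formula = "Up ~ " + " + ".join(PREDICTORS)
    return smf.logit(formula, data=frame).fit(disp=False)

# Synthetic stand-in data (NOT the real smarket values), only to show the call.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 6)), columns=PREDICTORS)
demo["Up"] = (rng.random(200) < 0.5).astype(int)

demo_results = fit_lag_volume_logit(demo)
print(demo_results.params)
```

In the notebook itself the call would be `results = fit_lag_volume_logit(df)` followed by `results.summary()`.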

Split the dataset into 2 sets: a training set with data from years before 2005 (**not including** 2005) and a test set with data from 2005 onward.\
**Hint**: once you have the X values of the dataset, you can use this code: <br> `X_train = X[df['Year'].values < 2005]` to select the training rows of the features X
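The hint can be extended to produce all four arrays at once. This is a sketch under assumptions: the function name `split_by_year` and the tiny hand-made frame are illustrative only, and the real call would pass `df` and the feature columns you choose.

```python
import numpy as np
import pandas as pd

def split_by_year(frame, feature_cols, target_col="Up", cutoff=2005):
    """Return X_train, X_test, y_train, y_test split on the Year column."""
    X = frame[feature_cols].values
    y = frame[target_col].values
    is_train = frame["Year"].values < cutoff  # strictly before the cutoff year
    return X[is_train], X[~is_train], y[is_train], y[~is_train]

# Tiny hand-made frame (not the real data) to show the behaviour.
toy = pd.DataFrame({
    "Year": [2003, 2004, 2004, 2005, 2005],
    "Lag1": [0.1, -0.2, 0.3, 0.4, -0.5],
    "Lag2": [0.0, 0.1, -0.2, 0.3, 0.4],
    "Up":   [1, 0, 1, 1, 0],
})
X_train, X_test, y_train, y_test = split_by_year(toy, ["Lag1", "Lag2"])
print(X_train.shape, X_test.shape)
```

Splitting by year rather than with a random `train_test_split` keeps the test set strictly later in time than the training set.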

### Part 2: Time to Predict!

#### Now that we are finally working on predictions (yay!!!), we will use the `sklearn` library from here on to create the predictive models and evaluate them.

Let's first train a logistic regression model on the smarket data using the [`LogisticRegression` class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (scroll down to the Examples section to save time).
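As a rough illustration of the fit/predict pattern, here is a minimal sketch on synthetic arrays. The data below is randomly generated and is an assumption for demonstration; in the notebook you would pass the training and test arrays from the year-based split instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the smarket data): the label depends on the first feature.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 2))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_test = rng.normal(size=(100, 2))

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)              # hard 0/1 class labels
y_prob = clf.predict_proba(X_test)[:, 1]  # estimated probability of class 1
print(y_pred[:10], y_prob[:3])
```

Note that `LogisticRegression` applies L2 regularization by default, so its coefficients will generally differ a little from an unregularized `smf.logit` fit.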

Evaluate your model on the test set and print the resulting confusion matrix. Using the entries of the confusion matrix, calculate the classification accuracy for each of the 2 output classes. Is this model good enough?
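The per-class calculation can be sketched as follows. The label vectors are hand-made toy values, not model output; with a fitted model you would use `y_test` and `clf.predict(X_test)` instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = Down, 1 = Up (illustrative values only).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 1, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

# Fraction of each true class that was predicted correctly (same as per-class recall).
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
overall_accuracy = cm.diagonal().sum() / cm.sum()
print(per_class_accuracy, overall_accuracy)
```

Here 1 of 3 `Down` days and 4 of 5 `Up` days are classified correctly, for an overall accuracy of 5/8.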

Now print the classification report using `sklearn`. Observe the metrics from the report. What do these numbers suggest about your model's performance?
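A short sketch of the report call, again on hand-made toy labels rather than real predictions. The class names passed to `target_names` are an assumption about how you want the rows labelled.

```python
from sklearn.metrics import classification_report

# Toy labels: 0 = Down, 1 = Up (illustrative values only).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 0, 1]

# Human-readable table of precision, recall, F1 and support per class.
print(classification_report(y_true, y_pred, target_names=["Down", "Up"]))

# The same numbers as a nested dict, convenient for comparing models in code.
report = classification_report(
    y_true, y_pred, target_names=["Down", "Up"], output_dict=True
)
print(report["Up"]["recall"])
```

Precision answers "of the days predicted `Up`, how many really were?", while recall answers "of the days that really were `Up`, how many did the model catch?".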

#### Linear Discriminant Analysis and Quadratic Discriminant Analysis

Now fit an [LDA model](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html) on the `smarket` dataset, using only the observations before 2005, and then test the model on the data from 2005 on

Print the number of times the model predicted `Up` for the test data

Print the classification report

Now fit a [QDA model](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html) on the `smarket` dataset, using only the observations before 2005, and then test the model on the data from 2005 on

Print the number of times the model predicted `Up` for the test data

Print the classification report. Is this model better or worse compared to the LDA model above?
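Both discriminant models share the same fit/predict interface, so the two exercises above can be sketched in one loop. The arrays here are synthetic and the `summary` dictionary is a hypothetical convenience; in the notebook you would reuse the training and test arrays from the year-based split.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Synthetic stand-in data (NOT smarket): the label depends on an interaction term.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(400, 2))
y_train = (X_train[:, 0] * X_train[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_test = rng.normal(size=(150, 2))
y_test = (X_test[:, 0] * X_test[:, 1] > 0).astype(int)

summary = {}
models = [("LDA", LinearDiscriminantAnalysis()), ("QDA", QuadraticDiscriminantAnalysis())]
for name, model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    summary[name] = {
        "n_up": int((y_pred == 1).sum()),         # how often class 1 (Up) was predicted
        "accuracy": model.score(X_test, y_test),  # mean accuracy on the test set
    }
    print(name, summary[name])
```

LDA assumes one covariance matrix shared by both classes, which gives a linear boundary; QDA estimates one per class and can therefore fit curved boundaries at the cost of more parameters.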

#### K-Nearest Neighbors

Fit 2-3 prediction models with different numbers of neighbors using [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) and the same test/train split as above. Print and observe the classification reports. Which model is the best at predicting the stock market direction for the test data?

#### Another Case Study: Caravan Insurance Data

Perform feature scaling on the `Caravan` dataset using the [`scale` function from the `preprocessing` module](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html)
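KNN relies on distances, so a column measured on a large scale would otherwise dominate the neighbor search. A minimal sketch of what `scale` does, using a small made-up array rather than the `Caravan` features:

```python
import numpy as np
from sklearn import preprocessing

# Made-up features on very different scales (illustrative only).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Standardize each column to zero mean and unit variance.
X_scaled = preprocessing.scale(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Only the feature columns should be scaled; the target column is left as is.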

We will compare KNN and logistic regression models in this case study. To minimize code duplication, let's create a Python function called `KNN` that accepts a keyword argument called `n_neighbors`. The function returns the predictions, the score, and the class values
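One possible shape for such a helper is sketched below. Passing the train and test arrays explicitly is an assumption made to keep the snippet self-contained; the assignment may intend a version that reads them from the notebook's global variables. The data is synthetic.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def KNN(X_train, y_train, X_test, y_test, n_neighbors=1):
    """Fit a KNN classifier and return (predictions, accuracy score, class values)."""
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    pred = knn.predict(X_test)
    score = knn.score(X_test, y_test)  # mean accuracy on the test set
    return pred, score, knn.classes_

# Synthetic stand-in data (NOT the Caravan data).
rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 3))
y_test = (X_test[:, 0] > 0).astype(int)

# Compare a few neighborhood sizes with the same helper.
scores = {}
for k in (1, 3, 5):
    pred, score, classes = KNN(X_train, y_train, X_test, y_test, n_neighbors=k)
    scores[k] = score
    print(k, round(score, 3))
```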

Create another function called `plot_confusion_matrix` to plot the confusion matrix
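A minimal sketch of such a plotting helper, drawn with plain `matplotlib`. The signature (`y_true`, `y_pred`, `class_names`, optional `ax`) and the toy labels are assumptions for illustration, not the assignment's required interface.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, class_names, ax=None):
    """Draw the confusion matrix as an annotated heatmap and return the Axes."""
    cm = confusion_matrix(y_true, y_pred)
    if ax is None:
        _, ax = plt.subplots()
    ax.imshow(cm, cmap="Blues")
    ax.set_xticks(range(len(class_names)))
    ax.set_xticklabels(class_names)
    ax.set_yticks(range(len(class_names)))
    ax.set_yticklabels(class_names)
    ax.set_xlabel("Predicted label")
    ax.set_ylabel("True label")
    # Write the count inside each cell (row = true class, column = predicted class).
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    return ax

# Toy labels (illustrative only): 0 = "No", 1 = "Yes".
ax = plot_confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], ["No", "Yes"])
```

With `%matplotlib inline` active the figure renders when the cell finishes; outside a notebook, call `plt.show()` afterwards.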

Use a `for` loop to create 3 models with different `n_neighbors` values

Now create a logistic regression model on the dataset and print out a classification report