# 10.

This question should be answered using the ```Weekly``` data set, which is part of the ```ISLR``` package. This data is similar in nature to the ```Smarket``` data from this chapter's lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from mlxtend.plotting import plot_confusion_matrix
from IPython.display import display, Markdown
pd.options.display.max_rows = 999
%matplotlib inline

In [None]:
def printm(input_str):
    display(Markdown(input_str))

In [None]:
df = sm.datasets.get_rdataset("Weekly", "ISLR", cache=True).data

## a)
Produce some numerical and graphical summaries of the ```Weekly``` data. Do there appear to be any patterns?

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.loc[df["Today"] == 0]

In [None]:
df.loc[df.isna().any(axis=1)]

In [None]:
# cdf = df.reset_index().melt(value_vars=["Lag1", "Lag2", "Lag3", "Lag4", "Today"], id_vars="index")
# sns.lineplot(x="index", y="value", hue="variable", data=cdf);
# Realized this is dumb, X is too squished so they all just overlap

In [None]:
# Direction is a string column, so drop it before computing correlations
df.drop(columns="Direction").corr().style.background_gradient(cmap='viridis')

In [None]:
df["Today"].plot();

In [None]:
df["Volume"].plot();

The correlations are pretty weak, except between Volume and Year. The plot above shows Volume generally trending upward over time (although it declines near the end of the sample).

From the weekly returns plot, the series is plausibly stationary. The mean looks stable over time, although the volatility might be increasing.
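One rough way to check the stability of the mean and volatility is to compare rolling means and rolling standard deviations across the sample. The sketch below uses a synthetic series as a stand-in for `df["Today"]`; the 52-week window and the helper name `rolling_summary` are assumptions chosen for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd


def rolling_summary(series, window=52):
    """Return the rolling mean and standard deviation of a series (hypothetical helper)."""
    rolled = series.rolling(window=window)
    # The first window-1 rows are NaN because the window is not yet full.
    return pd.DataFrame({"mean": rolled.mean(), "std": rolled.std()}).dropna()


# Synthetic stand-in for df["Today"]: 1,089 weekly returns (not the real data).
rng = np.random.default_rng(0)
fake_returns = pd.Series(rng.normal(loc=0.15, scale=2.3, size=1089))

summary = rolling_summary(fake_returns)
print(summary.describe())
```

In the notebook, something like `rolling_summary(df["Today"]).plot(subplots=True)` would show whether the rolling standard deviation drifts upward over time.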

No missing values, and the summary statistics all look reasonable. For example, Today and Lag1 have almost identical summary statistics, as you'd expect for two series that are the same returns shifted by one week.

## b)

Use the full data set to perform a logistic regression with ```Direction``` as the response and the five lag variables plus ```Volume``` as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

In [None]:
y = df["Direction"] == "Up"
# The question asks for the five lag variables plus Volume
X = df[[f"Lag{x}" for x in range(1, 6)] + ["Volume"]]

In [None]:
logit = sm.Logit(y, sm.add_constant(X)).fit()
print(logit.summary())

The intercept has a positive and statistically significant coefficient, suggesting that weekly returns are slightly more likely to be positive than negative.

Lag2 is statistically significant and positive (the only lag with a positive sign), implying that the likelihood of a positive return this week increases with the return from two weeks ago.
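To read a logit coefficient as a probability, it can help to push the linear predictor through the logistic function. The sketch below does this for made-up coefficient values (`intercept` and `beta_lag2` are placeholders, not the fitted estimates); in the notebook the real values would come from `logit.params`.

```python
import math


def logistic(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder coefficients for illustration only (not the fitted values).
intercept = 0.25
beta_lag2 = 0.05

# A positive intercept puts P(Up) above 50% even when Lag2 is zero,
# and a positive slope nudges it up as Lag2 grows.
for lag2 in (-2.0, 0.0, 2.0):
    p_up = logistic(intercept + beta_lag2 * lag2)
    print(f"Lag2 = {lag2:+.1f} -> P(Up) = {p_up:.3f}")
```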

## c) 

Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

In [None]:
class_labels = ["Down", "Up"] # took the Up dummy column as my dependent variable, so 1 = Up
predict_prob = logit.predict(sm.add_constant(X))
predict_class = pd.Series(data=0, index=predict_prob.index)
predict_class.loc[predict_prob > 0.5] = 1 # question didn't specify threshold so let's assume 50%
confusion_mat = confusion_matrix(y, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()

In [None]:
accuracy = (predict_class == y).sum() / len(y)
printm(f"Model correctly predicted direction {accuracy:0.1%} of the time")

The confusion matrix shows that the model almost exclusively predicts an Up week, regardless of the true outcome. It looks like the positive intercept is dominating all the other terms in the model, which is not surprising given how weak their statistical results are.
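The types of mistakes can be quantified by splitting the confusion matrix into per-class rates. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes, ordered Down then Up) and uses a small made-up matrix rather than the notebook's results; `class_rates` is a hypothetical helper.

```python
import numpy as np


def class_rates(conf_mat):
    """Accuracy plus per-class recall from a 2x2 confusion matrix.

    Assumes rows are true labels and columns are predictions, ordered
    (Down, Up), which matches sklearn.metrics.confusion_matrix.
    """
    tn, fp, fn, tp = np.asarray(conf_mat).ravel()
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "up_recall": tp / (tp + fn),    # share of true Up weeks predicted Up
        "down_recall": tn / (tn + fp),  # share of true Down weeks predicted Down
    }


# Made-up counts for illustration (not the notebook's output).
toy = np.array([[10, 90],
                [5, 95]])
rates = class_rates(toy)
print(rates)
```

A classifier that leans on its intercept tends to show this signature: very high recall for Up and very low recall for Down.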

## d)

Now fit the logistic regression model using a training data period from 1990 to 2008, with ```Lag2``` as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data.


In [None]:
df_train = df.loc[df["Year"] <= 2008].copy()
df_test = df.loc[df["Year"] > 2008].copy()
y_train = df_train["Direction"] == "Up"
y_test = df_test["Direction"] == "Up"
X_train = df_train["Lag2"]
X_test = df_test["Lag2"]
logit = sm.Logit(y_train, sm.add_constant(X_train)).fit()

In [None]:
predict_prob = logit.predict(sm.add_constant(X_test))
predict_class = pd.Series(data=0, index=predict_prob.index)
predict_class.loc[predict_prob > 0.5] = 1 # question didn't specify threshold so let's assume 50%
confusion_mat = confusion_matrix(y_test, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()
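Part (d) also asks for the overall fraction of correct predictions on the held-out data. A minimal sketch of that calculation is below, alongside a naive baseline that always predicts the most common training class. Both helper names and the toy label vectors are assumptions for illustration; in the notebook the same functions could be applied to `y_test`, `predict_class`, and `y_train`.

```python
import numpy as np


def fraction_correct(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def majority_baseline(y_train, y_test):
    """Accuracy from always predicting the most common training class."""
    majority = bool(np.mean(y_train) >= 0.5)
    return fraction_correct(y_test, np.full(len(y_test), majority))


# Toy labels for illustration (True = Up); not the Weekly data.
toy_train = [True, True, False, True]
toy_test = [True, False, True, True]
toy_pred = [True, True, True, False]

model_acc = fraction_correct(toy_test, toy_pred)
baseline_acc = majority_baseline(toy_train, toy_test)
print(f"model: {model_acc:.1%}, always-majority baseline: {baseline_acc:.1%}")
```

Comparing against the baseline is a useful sanity check here, since a model that mostly predicts Up can look accurate simply because Up weeks are more common.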

## e)

Repeat (d) using LDA.

In [None]:
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
predict_class = lda.predict(X_test)
confusion_mat = confusion_matrix(y_test, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()

## f)

Repeat (d) using QDA.

In [None]:

qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
predict_class = qda.predict(X_test)
confusion_mat = confusion_matrix(y_test, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()

## g)

Repeat (d) using KNN with K=1.

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
predict_class = knn.predict(X_test)
confusion_mat = confusion_matrix(y_test, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()

## h)

Which of these methods appears to provide the best results on this data?
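One way to answer this is to score every method on the same held-out split and rank them by accuracy. The sketch below does that on synthetic data standing in for the `Lag2` / `Direction` split, so the printed numbers say nothing about the Weekly data. Note also that scikit-learn's `LogisticRegression` applies L2 regularization by default, so it is only an approximation to the `sm.Logit` fit used in (d).

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Lag2 / Direction split (not the Weekly data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 1))
X_test = rng.normal(size=(100, 1))
y_train = (X_train[:, 0] + rng.normal(size=400)) > 0
y_test = (X_test[:, 0] + rng.normal(size=100)) > 0

models = {
    "Logistic": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (K=1)": KNeighborsClassifier(n_neighbors=1),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the methods from best to worst held-out accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {score:.1%}")
```

In the notebook, the same loop could be run with the real `X_train`, `X_test`, `y_train`, and `y_test` from (e), which are already reshaped into the two-dimensional form scikit-learn expects.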