# 10.

This question should be answered using the ```Weekly``` data set, which is part of the ```ISLR``` package. This data is similar in nature to the ```Smarket``` data from this chapter's lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from mlxtend.plotting import plot_confusion_matrix
from IPython.display import display, Markdown
pd.options.display.max_rows = 999
%matplotlib inline

In [None]:
def printm(input_str):
    display(Markdown(input_str))

In [None]:
df = sm.datasets.get_rdataset("Weekly", "ISLR", cache=True).data

## a)
Produce some numerical and graphical summaries of the ```Weekly``` data. Do there appear to be any patterns?

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df.loc[df["Today"] == 0]

In [None]:
df.loc[df.isna().any(axis=1)]

In [None]:
# cdf = df.reset_index().melt(value_vars=["Lag1", "Lag2", "Lag3", "Lag4", "Today"], id_vars="index")
# sns.lineplot(x="index", y="value", hue="variable", data=cdf);
# Realized this is dumb, X is too squished so they all just overlap

In [None]:
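# Direction is a string column, so restrict the correlation matrix to numeric columns
df.select_dtypes("number").corr().style.background_gradient(cmap='viridis')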

In [None]:
df["Today"].plot();

In [None]:
df["Volume"].plot();

Pretty weak correlations, except between volume and year. The plot above shows that volume is generally trending upward (although it declines near the end).

Judging from the weekly returns plot, the series is plausibly stationary. The mean looks stable over time, although the volatility might be increasing.

No missing values and the summary statistics all look good. For example, ```Today``` and its one-week lag have almost identical summary statistics, as you'd expect for series that differ only by a one-week shift.
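
One way to check the impression that volatility drifts over time is a rolling mean and rolling standard deviation. The sketch below is only illustrative: it uses a synthetic series as a stand-in for ```df["Today"]```, and the 52-week window is an assumption, not something the exercise prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df["Today"] (assumption: NOT the real Weekly data).
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(loc=0.15, scale=2.0, size=520))

window = 52  # roughly one year of weekly observations (assumed choice)
rolling_mean = returns.rolling(window).mean()
rolling_std = returns.rolling(window).std()

# The first window - 1 entries are NaN because the window is not yet full.
summary = pd.DataFrame(
    {"rolling_mean": rolling_mean, "rolling_std": rolling_std}
).dropna()
print(summary.describe())
```

On the real data, replacing ```returns``` with ```df["Today"]``` and plotting ```rolling_std``` would show whether the spread of returns changes across the sample.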

## b)

Use the full data set to perform a logistic regression with ```Direction``` as the response and the five lag variables plus ```Volume``` as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

In [None]:
y = df["Direction"] == "Up"
X = df[[f"Lag{x}" for x in range(1, 6)] + ["Volume"]]  # five lags plus Volume, as the question asks

In [None]:
logit = sm.Logit(y, sm.add_constant(X)).fit()
print(logit.summary())

The intercept has a positive and statistically significant coefficient, suggesting returns are slightly more likely to be positive.

Lag 2 is statistically significant and positive (the only lag with a positive sign). This implies that the likelihood of a positive return in the current period increases with the return from two weeks ago.
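
To make the coefficient interpretation concrete, a log-odds value can be mapped to a probability with the logistic function. The numbers below are hypothetical placeholders rather than the fitted coefficients; the sketch only shows the mechanics.

```python
import math

def logistic(z):
    """Map a log-odds value z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for illustration only (NOT the fitted values).
intercept = 0.25
beta_lag2 = 0.06

p_baseline = logistic(intercept)                 # all predictors at zero
p_after_gain = logistic(intercept + beta_lag2)   # Lag2 raised by one unit

print(f"P(Up | all predictors = 0) = {p_baseline:.3f}")
print(f"P(Up | Lag2 = 1)           = {p_after_gain:.3f}")
```

A positive intercept pushes the baseline probability above 0.5, and a positive ```Lag2``` coefficient moves it up further as ```Lag2``` increases.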

## c) 

Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

In [None]:
class_labels = ["Down", "Up"] # took the Up dummy column as my dependent variable, so 1 = Up
predict_prob = logit.predict(sm.add_constant(X))
predict_class = pd.Series(data=0, index=predict_prob.index)
predict_class.loc[predict_prob > 0.5] = 1 # question didn't specify threshold so let's assume 50%
confusion_mat = confusion_matrix(y, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()

In [None]:
accuracy = (predict_class == y).sum() / len(y)
printm(f"Model correctly predicted direction {accuracy:0.1%} of the time")

The confusion matrix shows that the model is almost exclusively predicting an Up week, regardless of the true outcome. It looks like the positive intercept coefficient is dominating all the other factors in the model, which is fine given their weak statistical results.
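
The types of mistakes can be summarized with per-class rates read off the confusion matrix. The matrix below is hypothetical, chosen only to mimic a model that predicts Up almost every time; it assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix (NOT the notebook's result).
# Rows are true classes, columns are predicted classes, ordered [Down, Up].
cm = np.array([[10, 90],
               [5, 95]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()
up_recall = tp / (tp + fn)      # share of true Up weeks predicted Up
down_recall = tn / (tn + fp)    # share of true Down weeks predicted Down
share_predicted_up = (fp + tp) / cm.sum()

print(f"accuracy: {accuracy:0.1%}")
print(f"Up recall: {up_recall:0.1%}, Down recall: {down_recall:0.1%}")
print(f"share of weeks predicted Up: {share_predicted_up:0.1%}")
```

A high Up recall paired with a low Down recall is the signature of a classifier that leans on the intercept.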

## d)

Now fit the logistic regression model using a training data period from 1990 to 2008, with ```Lag2``` as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data.


In [None]:
df_train = df.loc[df["Year"] <= 2008].copy()
df_test = df.loc[df["Year"] > 2008].copy()
y_train = df_train["Direction"] == "Up"
y_test = df_test["Direction"] == "Up"
X_train = df_train["Lag2"]
X_test = df_test["Lag2"]
logit = sm.Logit(y_train, sm.add_constant(X_train)).fit()

In [None]:
predict_prob = logit.predict(sm.add_constant(X_test))
predict_class = pd.Series(data=0, index=predict_prob.index)
predict_class.loc[predict_prob > 0.5] = 1 # question didn't specify threshold so let's assume 50%
confusion_mat = confusion_matrix(y_test, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()

## e)

Repeat (d) using LDA

In [None]:
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
predict_class = lda.predict(X_test)
confusion_mat = confusion_matrix(y_test, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()

## f)

Repeat (d) using QDA

In [None]:

qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
predict_class = qda.predict(X_test)
confusion_mat = confusion_matrix(y_test, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()

## g)

Repeat (d) using KNN with K=1

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
predict_class = knn.predict(X_test)
confusion_mat = confusion_matrix(y_test, predict_class)
fig, ax = plot_confusion_matrix(conf_mat=confusion_mat, class_names=class_labels)
ax.set_ylim(len(confusion_mat)-0.5, -0.5) # have to keep this in until matplotlib 3.1.2 comes out
#https://github.com/matplotlib/matplotlib/issues/14751
plt.show()

## h)

Which of these methods appears to provide the best results on this data?

In [None]:
logit_pred = pd.Series(data=0, index=y_test.index)
logit_pred.loc[logit.predict(sm.add_constant(X_test)) > 0.5] = 1 
lda_pred = lda.predict(X_test)
qda_pred = qda.predict(X_test)
knn_pred = knn.predict(X_test)

logit_accuracy = (logit_pred == y_test).sum() / len(y_test)
lda_accuracy = (lda_pred == y_test).sum() / len(y_test)
qda_accuracy = (qda_pred == y_test).sum() / len(y_test)
knn_accuracy = (knn_pred == y_test).sum() / len(y_test)

printm(f"Logit accuracy: {logit_accuracy:0.1%}, LDA accuracy: {lda_accuracy:0.1%}, QDA accuracy: {qda_accuracy:0.1%}, KNN accuracy: {knn_accuracy:0.1%}")

Logit and LDA are tied, with KNN the worst. I don't really trust any of these though.
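
One reason for caution is that accuracy only means something relative to a naive baseline. The sketch below computes the accuracy of always predicting the most common class. The labels are hypothetical, and for simplicity the majority class is taken from the evaluated labels themselves rather than from the training period.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common class in y_true."""
    y_true = np.asarray(y_true)
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / len(y_true)

# Hypothetical held-out labels (True = Up); NOT the real test set.
y_demo = np.array([True] * 6 + [False] * 4)
print(f"Baseline accuracy: {majority_baseline_accuracy(y_demo):0.1%}")
```

Calling ```majority_baseline_accuracy(y_test)``` on the real held-out labels would give the number each model has to beat.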

## i)

Experiment with different combinations of predictors, including possible transformations and interactions for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

Pass.

# 11.

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the ```Auto``` data set.

## a)

Create a binary variable, ```mpg01```, that contains a 1 if ```mpg``` contains a value above its median, and a 0 if ```mpg``` contains a value below its median.

In [None]:
df = (
    sm.datasets.get_rdataset("Auto", "ISLR", cache=True)
    .data
    .assign(mpg01=lambda d: (d["mpg"] > d["mpg"].median()).astype(int))  # 1 = above median, 0 = otherwise
)
df.head()

## b)

Explore the data graphically in order to investigate the association between ```mpg01``` and the other features. Which of the other features seems most likely to be useful in predicting ```mpg01```? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

In [None]:
sns.catplot(x="cylinders", y="mpg", kind="swarm", data=df);

Increasing cylinders decreases mpg. Looks like there's a wider distribution among 4 cylinders, with fat right tails on 6 and 8. Also worth noting that there are very few 3 and 5 cylinder vehicles, so those categories might have to be collapsed.
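
If the sparse 3 and 5 cylinder categories were collapsed, it might look something like the sketch below. The counts are made up, and the mapping (3 into 4, 5 into 6) is just one possible choice, not something the data dictates.

```python
import pandas as pd

# Hypothetical cylinder values standing in for df["cylinders"].
cylinders = pd.Series([4] * 10 + [6] * 5 + [8] * 6 + [3] * 1 + [5] * 1)

# One possible mapping (an assumption): fold 3 into 4 and 5 into 6.
collapsed = cylinders.replace({3: 4, 5: 6})
print(collapsed.value_counts().sort_index())
```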

In [None]:
sns.scatterplot(x="displacement", y="mpg", data=df);

In [None]:
sns.scatterplot(x="horsepower", y="mpg", data=df);

Displacement and horsepower both have a negative and slightly nonlinear looking relationship with mpg. The relationship for both looks quite similar, which makes me wonder if displacement and horsepower are highly correlated.

In [None]:
sns.scatterplot(x="horsepower", y="displacement", data=df);

Looks like they are.
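
The visual impression can be backed up with a correlation coefficient. The sketch below uses synthetic columns in place of the real ```Auto``` data, so the printed value says nothing about the actual data set; it only shows the call.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the Auto columns (assumption: NOT the real data).
rng = np.random.default_rng(1)
displacement = rng.uniform(70, 450, size=200)
horsepower = 40 + 0.3 * displacement + rng.normal(0, 10, size=200)
demo = pd.DataFrame({"displacement": displacement, "horsepower": horsepower})

# Pearson correlation between the two columns.
pearson = demo["horsepower"].corr(demo["displacement"])
print(f"Pearson correlation: {pearson:0.3f}")
```

On the real data the equivalent call would be ```df["horsepower"].corr(df["displacement"])```.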

In [None]:
sns.scatterplot(x="weight", y="mpg", data=df);

Weight has a similar looking relationship to displacement and horsepower, although less clearly non-linear.

In [None]:
sns.scatterplot(x="acceleration", y="mpg", data=df);

There's a positive association here, but it's a lot weaker than the other variables.

In [None]:
sns.catplot(x="year", y="mpg", kind="swarm", data=df);

```mpg``` appears to be improving over time.

In [None]:
sns.catplot(x="origin", y="mpg", kind="swarm", data=df);

Origin 1 pretty clearly has a different distribution than 2 and 3, but it's less clear if 2 and 3 are distinct. 3 has the bulk of its distribution above 2, but the min and max are similar and the weights aren't obviously different. Might be worth combining. As a reminder, 1 is American, 2 is European, and 3 is Japanese.

## c)

Split the data into a training and a test set.

The question doesn't ask but I'm going to do some transformations as well.

In [None]:
df.columns

In [None]:
y = df["mpg01"].values
X = pd.get_dummies(df[["cylinders", "displacement", "horsepower", "weight", "acceleration", "year", "origin"]], columns=["origin"], drop_first=True)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

## d)

Perform LDA on the training data in order to predict ```mpg01``` using the variables that seemed most associated with ```mpg01``` in (b). What is the test error of the model obtained?

In [None]:
subset = ["cylinders", "displacement", "weight", "year", "origin_2", "origin_3"]

In [None]:
lda = LinearDiscriminantAnalysis()
lda.fit(X_train[subset], y_train)
y_pred = lda.predict(X_test[subset])
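accuracy = accuracy_score(y_test, y_pred)  # scikit-learn metrics expect (y_true, y_pred)
f1 = f1_score(y_test, y_pred)
printm(f"accuracy: {accuracy:0.7}")
printm(f"test error: {1 - accuracy:0.7}")  # the question asks for the test error
printm(f"f1: {f1:0.7}")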

## e)

Perform QDA on the training data in order to predict ```mpg01``` using the variables that seemed most associated with ```mpg01``` in (b). What is the test error of the model obtained?

In [None]:
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train[subset], y_train)
y_pred = qda.predict(X_test[subset])
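accuracy = accuracy_score(y_test, y_pred)  # scikit-learn metrics expect (y_true, y_pred)
f1 = f1_score(y_test, y_pred)
printm(f"accuracy: {accuracy:0.7}")
printm(f"test error: {1 - accuracy:0.7}")  # the question asks for the test error
printm(f"f1: {f1:0.7}")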

## f)

Perform Logistic regression on the training data in order to predict ```mpg01``` using the variables that seemed most associated with ```mpg01``` in (b). What is the test error of the model obtained?

In [None]:
# penalty=None replaces the string "none" in recent scikit-learn releases
logit = LogisticRegression(fit_intercept=True, penalty=None, solver="lbfgs", max_iter=500)
logit.fit(X_train[subset], y_train)
y_pred = logit.predict(X_test[subset])
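accuracy = accuracy_score(y_test, y_pred)  # scikit-learn metrics expect (y_true, y_pred)
f1 = f1_score(y_test, y_pred)
printm(f"accuracy: {accuracy:0.7}")
printm(f"test error: {1 - accuracy:0.7}")  # the question asks for the test error
printm(f"f1: {f1:0.7}")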

## g)

Perform KNN on the training data, with several values of K, in order to predict ```mpg01```. Use only the variables that seemed most associated with ```mpg01``` in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

In [None]:
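knn = KNeighborsClassifier()
param_grid = {'n_neighbors': list(range(1,7))}
# the iid argument no longer exists in recent scikit-learn releases, so it is not passed
search = GridSearchCV(knn, param_grid, cv=5)
search.fit(X_train[subset], y_train)  # restrict to the variables chosen in (b), as the question asks
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
y_pred = search.predict(X_test[subset])
accuracy = accuracy_score(y_test, y_pred)  # scikit-learn metrics expect (y_true, y_pred)
f1 = f1_score(y_test, y_pred)
printm(f"accuracy: {accuracy:0.7}")
printm(f"test error: {1 - accuracy:0.7}")  # the question asks for the test error
printm(f"f1: {f1:0.7}")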

# 12.

This problem involves writing functions.

a) Write a function, ```Power()```, that prints out the result of raising 2 to the 3rd power.
b) Create a new function ```Power2()```, that allows you to pass *any* two numbers ```x``` and ```a```, and prints out the value ```x^a```.
c) Using the ```Power2()``` function you just wrote, compute $10^3$, $8^{17}$, and $131^3$.
d) Now create a new function, ```Power3()```, that actually returns the result ```x^a``` as a ```python``` object.
e) Now using the ```Power3``` function, create a plot of $f(x)=x^2$. The *x*-axis should display a range of integers from 1 to 10, and the *y*-axis should display $x^2$. Label the axes appropriately, and use an appropriate title for the figure. Consider displaying either the *x*-axis, the *y*-axis, or both on the log-scale.
f) Create a function, ```PlotPower``` that allows you to create a plot of ```x``` against ```x^a``` for a fixed ```a``` and for a range of values of ```x```.
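
A minimal sketch of parts (a) through (d) might look like the following. It assumes Python's ```**``` operator as the stand-in for the ```x^a``` notation in the question, and it leaves the plotting parts (e) and (f) out.

```python
def Power():
    """Print 2 raised to the 3rd power."""
    print(2 ** 3)


def Power2(x, a):
    """Print x raised to the power a."""
    print(x ** a)


def Power3(x, a):
    """Return x raised to the power a instead of printing it."""
    return x ** a


Power()
for base, exponent in [(10, 3), (8, 17), (131, 3)]:
    Power2(base, exponent)

# Values that part (e) would plot against x = 1, ..., 10.
squares = [Power3(x, 2) for x in range(1, 11)]
```

Returning the value in ```Power3()``` rather than printing it is what makes it reusable for building the plot data in part (e).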

# 13.

Using the ```Boston``` data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.