# Chapter 5 applied exercises

In [None]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

## 5

In chapter 4 we used logistic regression to predict the probability of *default* using *income* and *balance* on the *Default* data set. We will now estimate the test error of this logistic regression model using the validation set approach.

a) Fit a logistic regression model that uses *income* and *balance* to predict *default*.

b) Using the validation set approach, estimate the test error of this model.

c) Repeat the process in b) 3 times, using 3 different splits of the observations into a training set and a test set. Comment on the results obtained.

d) Now consider a logistic regression model that predicts the probability of *default* using *income*, *balance*, and a dummy variable for *student*. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for *student* leads to a reduction in the test error rate.

In [None]:
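# dtype=int keeps the dummies as 0/1 integers; recent pandas versions otherwise return booleans
df = sm.datasets.get_rdataset("Default", "ISLR", cache=True).data.pipe(pd.get_dummies, columns=["default", "student"], drop_first=True, dtype=int)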

In [None]:
df.columns

In [None]:
y = df["default_Yes"]
X = sm.add_constant(df[["income", "balance"]])

In [None]:
def train_test_validation_error():
    """Validation-set error rate for one random split of the global X and y."""
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    logit = sm.Logit(y_train, X_train).fit(disp=0)  # disp=0 silences the convergence message
    # Classify as a default when the predicted probability exceeds 0.5
    predict_class = (logit.predict(X_test) > 0.5).astype(int)
    return (predict_class.values != y_test.values).mean()

In [None]:
validation_errors = [train_test_validation_error() for _ in range(3)]

In [None]:
validation_errors

The three estimates are all quite similar, which is reassuring. Repeating this many more times would give an empirical distribution of the validation error rate. Note that the classes are heavily imbalanced, so a low error rate on its own isn't that impressive.
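To put the imbalance remark in numbers, the error rate can be compared with a classifier that always predicts the most common class. This is a minimal sketch: `majority_class_error` and the labels in `y_demo` are hypothetical, made up for illustration, and are not taken from the *Default* data.

```python
import numpy as np

def majority_class_error(labels):
    """Error rate of a classifier that always predicts the most common class."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return 1.0 - counts.max() / labels.size

# Hypothetical labels: 30 positives out of 1,000, chosen only for illustration
y_demo = np.array([1] * 30 + [0] * 970)
baseline = majority_class_error(y_demo)
print(f"Majority-class baseline error: {baseline:.3f}")
```

In the notebook, the same helper could presumably be called on the held-out labels of a split to see how much the logistic model improves on the baseline.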

d) Now consider a logistic regression model that predicts the probability of *default* using *income*, *balance* and a dummy variable for *student*. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for *student* leads to a reduction in the test error rate

In [None]:
X = sm.add_constant(df[["income", "balance", "student_Yes"]])

In [None]:
validation_errors = [train_test_validation_error() for _ in range(3)]

In [None]:
validation_errors

Inconclusive result from this small test. One result is lower than obtained without the dummy, one is higher, and one is the same.
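One way to judge whether differences between splits are just noise is the binomial standard error of an error rate measured on a validation set. The sketch below assumes independent observations; `error_rate_se` is a hypothetical helper and the error rate of 0.026 is an illustrative value, not a result from this notebook. The validation size of 2,500 follows from the default `test_size=0.25` of `train_test_split` applied to the 10,000 rows of *Default*.

```python
import math

def error_rate_se(error_rate, n_validation):
    """Binomial standard error of an error rate estimated on n_validation observations."""
    return math.sqrt(error_rate * (1.0 - error_rate) / n_validation)

n_validation = 2500   # 25% of 10,000 rows (train_test_split default)
assumed_error = 0.026  # hypothetical error rate, for illustration only
se = error_rate_se(assumed_error, n_validation)
print(f"Standard error of the validation error rate: {se:.4f}")
```

If two models differ by less than a couple of these standard errors on a single split, three splits are unlikely to separate them.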

## 6

We continue to consider the use of a logistic regression model to predict the probability of *default* using *income* and *balance* on the *Default* data set. In particular, we will now compute estimates for the standard errors of the *income* and *balance* logistic regression coefficients in two different ways: 1) using the bootstrap, and 2) using the standard formula for computing the standard errors. 

a) Use statsmodels to determine the estimated standard errors for the coefficients associated with *income* and *balance* in the multiple logistic regression model.

b) Write a function ```boot_fn``` that takes as input the *Default* data set as well as an index of observations, and that outputs the coefficient estimates for *income* and *balance* in the multiple logistic regression model.

c) Use ```boot_fn``` to estimate the standard errors of the logistic regression coefficients for *income* and *balance*.

d) Comment on the results.

In [None]:
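# dtype=int keeps the dummies as 0/1 integers; recent pandas versions otherwise return booleans
df = sm.datasets.get_rdataset("Default", "ISLR", cache=True).data.pipe(pd.get_dummies, columns=["default", "student"], drop_first=True, dtype=int)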

In [None]:
y = df["default_Yes"]
X = sm.add_constant(df[["income", "balance"]])
logit = sm.Logit(y, X).fit()

In [None]:
logit.bse

In [None]:
logit.params

In [None]:
def boot_fn(base_df):
    boot_df = base_df.sample(frac=1, replace=True)
    y = boot_df["default_Yes"]
    X = sm.add_constant(boot_df[["income", "balance"]])
    logit = sm.Logit(y, X).fit(disp=0) # disp = 0 silences convergence notification
    params = logit.params
    return (params.loc["income"], params.loc["balance"])

In [None]:
income_params = list()
balance_params = list()
for _ in range(1_000):
    income, balance = boot_fn(df)
    income_params.append(income)
    balance_params.append(balance)

income_params = np.array(income_params)
balance_params = np.array(balance_params)
income_param_boot = np.mean(income_params)
balance_param_boot = np.mean(balance_params)
income_se_boot = np.std(income_params, ddof=1)  # ddof=1 gives the sample standard deviation (divides by B - 1)
balance_se_boot = np.std(balance_params, ddof=1)
print(f"Income: parameter {income_param_boot}, SE {income_se_boot}")
print(f"Balance: parameter {balance_param_boot}, SE {balance_se_boot}")

The bootstrap estimates of both the coefficients and their standard errors are quite close to the values statsmodels reports from the standard formula.
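The exercise asks for a ```boot_fn``` that takes the data and an index of observations. A minimal sketch of that index-based form follows. To keep it dependency-free it swaps in ordinary least squares on synthetic data instead of logistic regression on *Default*, so the data, the coefficients, and the number of resamples are all assumptions made for illustration.

```python
import numpy as np

def boot_fn(data, index):
    """Return the OLS intercept and slope fitted on the rows of `data` selected by `index`."""
    x, y = data[index, 0], data[index, 1]
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Synthetic data (assumption): y = 1 + 2x + standard normal noise
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
data = np.column_stack([x, y])

# Bootstrap: resample row indices with replacement and refit each time
boot_coefs = np.array([boot_fn(data, rng.integers(0, n, size=n)) for _ in range(1000)])
boot_se = boot_coefs.std(axis=0, ddof=1)

# Standard formula: sqrt of the diagonal of sigma^2 (X'X)^-1
design = np.column_stack([np.ones(n), x])
residuals = y - design @ boot_fn(data, np.arange(n))
sigma2 = residuals @ residuals / (n - 2)
formula_se = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))

print("bootstrap SE:", boot_se)
print("formula SE:  ", formula_se)
```

Passing the index explicitly keeps the resampling outside the fitting function, so the same ```boot_fn``` also returns the full-sample fit when given `np.arange(n)`.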

## 7

In sections 5.3.2 and 5.3.3, we saw that the ```cv.glm()``` function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the ```glm()``` and ```predict.glm()``` functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the ```Weekly``` data set. Recall that in the context of classification problems, the LOOCV error is given in equation (5.4):

$CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{Err}_{i}$

a) Fit a logistic regression model that predicts ```Direction``` using ```Lag1``` and ```Lag2```.

b) Fit a logistic regression model that predicts ```Direction``` using ```Lag1``` and ```Lag2``` *using all but the first observation*.

c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(```Direction="Up"```|```Lag1,Lag2```) > 0.5. Was this observation correctly classified?

d) Write a for loop from $i=1$ to $i=n$ where $n$ is the number of observations in the data set, that performs each of the following steps:
* Fit a logistic regression model using all but the $i$th observation in order to predict whether or not the market moves up.
* Compute the posterior probability for the $i$th observation in order to predict whether or not the market moves up.
* Determine whether or not an error was made in predicting the direction for the $i$th observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

e) Take the average of the $n$ numbers obtained in d) in order to obtain the LOOCV estimate of the test error. Comment on the results.

In [None]:
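# dtype=int keeps the dummy as a 0/1 integer; recent pandas versions otherwise return booleans
df = sm.datasets.get_rdataset("Weekly", "ISLR", cache=True).data.pipe(pd.get_dummies, columns=["Direction"], drop_first=True, dtype=int)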

In [None]:
y = df["Direction_Up"]
X = sm.add_constant(df[["Lag1", "Lag2"]])
logit = sm.Logit(y, X).fit(disp=0) # disp = 0 silences convergence notification

In [None]:
logit_loo = sm.Logit(y.loc[1:], X.loc[1:]).fit(disp=0)

In [None]:
logit_loo.predict(X.loc[0].values)

In [None]:
y.loc[0]

Observation incorrectly classified

In [None]:
errors = list()
for index in y.index:
    # Fit on every observation except the one held out
    y_loo = y.drop(index)
    X_loo = X.drop(index)
    logit = sm.Logit(y_loo, X_loo).fit(disp=0)
    pred_up = logit.predict(X.loc[index].values)[0] > 0.5
    errors.append(pred_up != (y.loc[index] == 1))  # 1 if misclassified, 0 otherwise
np.mean(errors)  # LOOCV estimate of the test error rate

Hey, right a little more than half the time, let's take this baby to the stock market!
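For reference, the quantity the exercise asks for, $CV_{(n)}$, is the average of the 0/1 error indicators, i.e. an error rate rather than an accuracy. The sketch below spells that out on synthetic data using only numpy. The Newton-Raphson fitter `fit_logistic`, the helper `loocv_error`, and the simulated coefficients are all assumptions made for illustration; this is not the statsmodels fit used above and not the ```Weekly``` data.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson; X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def loocv_error(X, y):
    """Average of Err_i, where Err_i is 1 if observation i is misclassified when held out."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # all rows except the i-th
        beta = fit_logistic(X[keep], y[keep])
        prob_up = 1.0 / (1.0 + np.exp(-X[i] @ beta))
        errors[i] = (prob_up > 0.5) != (y[i] == 1)
    return errors.mean()

# Synthetic data (assumption): two predictors with a fairly strong signal
rng = np.random.default_rng(0)
n = 200
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([0.3, -1.5, 1.5])
y_demo = (rng.random(n) < 1.0 / (1.0 + np.exp(-X_demo @ true_beta))).astype(float)

cv_error = loocv_error(X_demo, y_demo)
print(f"LOOCV error rate: {cv_error:.3f}")
```

One minus this value is the accuracy, which is the "right a little more than half the time" figure quoted above.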