# Chapter 5 applied exercises

In [1]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, LeaveOneOut, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

## 5

In chapter 4 we used logistic regression to predict the probability of *default* using *income* and *balance* on the *Default* data set. We will now estimate the test error of this logistic regression model using the validation set approach.

a) Fit a logistic regression model that uses *income* and *balance* to predict *default*

b) Using the validation set approach, estimate the test error of this model.

c) Repeat the process in b) 3 times, using 3 different splits of the observations into a training set and a test set. Comment on the results obtained.

d) Now consider a logistic regression model that predicts the probability of *default* using *income*, *balance*, and a dummy variable for *student*. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for *student* leads to a reduction in the test error rate.

In [2]:
df = sm.datasets.get_rdataset("Default", "ISLR", cache=True).data.pipe(pd.get_dummies, columns=["default", "student"], drop_first=True)

In [3]:
df.columns

Index(['balance', 'income', 'default_Yes', 'student_Yes'], dtype='object')

In [4]:
y = df["default_Yes"]
X = sm.add_constant(df[["income", "balance"]])

In [5]:
def train_test_validation_error():
    # Random 75/25 split (the train_test_split default); uses the global X and y
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    logit = sm.Logit(y_train, X_train).fit()
    predict_prob = logit.predict(X_test)
    # Classify as default when the predicted probability exceeds 0.5
    predict_class = (predict_prob > 0.5).astype(int)
    # Misclassification rate on the held-out set
    validation_error = (predict_class.values != y_test.values).mean()
    return validation_error

In [6]:
validation_errors = [train_test_validation_error() for _ in range(3)]

Optimization terminated successfully.
         Current function value: inf
         Iterations 10
Optimization terminated successfully.
         Current function value: inf
         Iterations 10
Optimization terminated successfully.
         Current function value: inf
         Iterations 10



In [7]:
validation_errors

[0.024, 0.0288, 0.0288]

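Not much to comment on. The three estimates are all quite similar (2.4% to 2.9%), which is good: the validation set estimate looks fairly stable across splits. If I did this many more times I could look at the empirical distribution of the error rate. Note that the classes are heavily imbalanced, so my low error rate isn't that impressive: a classifier that always predicts no default would also have a low error rate.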
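To put a low error rate in context, it helps to compare it with a classifier that always predicts the majority class. The following is a minimal sketch using synthetic labels with an assumed 3% positive rate (a stand-in, not the actual *Default* data); `null_error_rate` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np

def null_error_rate(y):
    """Error rate of always predicting the majority class."""
    y = np.asarray(y).astype(int)
    positive_rate = y.mean()
    # Predicting the majority class is wrong exactly on the minority class
    return min(positive_rate, 1 - positive_rate)

# Synthetic, imbalanced labels: roughly 3% positives (assumed for illustration)
rng = np.random.default_rng(0)
y_demo = rng.random(10_000) < 0.03

baseline = null_error_rate(y_demo)
print(f"Null classifier error: {baseline:.4f}")
```

A fitted model is only clearly useful if its validation error falls noticeably below this baseline.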

d) Now consider a logistic regression model that predicts the probability of *default* using *income*, *balance* and a dummy variable for *student*. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for *student* leads to a reduction in the test error rate

In [8]:
X = sm.add_constant(df[["income", "balance", "student_Yes"]])

In [9]:
validation_errors = [train_test_validation_error() for _ in range(3)]

Optimization terminated successfully.
         Current function value: inf
         Iterations 10
Optimization terminated successfully.
         Current function value: inf
         Iterations 10
Optimization terminated successfully.
         Current function value: inf
         Iterations 10


In [10]:
validation_errors

[0.0268, 0.0272, 0.0292]

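Inconclusive result from this small test. The error rates with the dummy (2.68%, 2.72%, 2.92%) fall in the same range as those without it (2.40%, 2.88%, 2.88%). Each run also uses a different random split, so the differences are no larger than the split-to-split noise, and there is no evidence that adding *student* reduces the test error.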
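One way to make the comparison less noisy is to evaluate both models on identical splits and look at the paired differences. The sketch below assumes synthetic stand-in data (the third column plays the role of the dummy and has no real effect) and uses scikit-learn's `LogisticRegression` with a large `C` as an approximation to an unpenalized fit; `validation_error` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def validation_error(X, y, seed):
    """Misclassification rate on a held-out 25% split fixed by `seed`."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LogisticRegression(C=1e6, max_iter=1000)  # large C: close to unpenalized
    model.fit(X_train, y_train)
    return float((model.predict(X_test) != y_test).mean())

# Synthetic stand-in data (assumed): only the second column drives the outcome
rng = np.random.default_rng(0)
n = 2_000
X_full = np.column_stack(
    [rng.normal(size=n), rng.normal(size=n), rng.integers(0, 2, size=n)]
)
logits = -2.0 + 1.5 * X_full[:, 1]
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

seeds = range(10)
err_without = np.array([validation_error(X_full[:, :2], y_demo, s) for s in seeds])
err_with = np.array([validation_error(X_full, y_demo, s) for s in seeds])
diff = err_with - err_without  # paired: the same split is used for both models
print(f"Mean difference (with - without): {diff.mean():.4f} +/- {diff.std(ddof=1):.4f}")
```

Because the same `random_state` produces the same row split for both feature sets, the difference reflects the extra predictor rather than the luck of the split.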

## 6

We continue to consider the use of a logistic regression model to predict the probability of *default* using *income* and *balance* on the *Default* data set. In particular, we will now compute estimates for the standard errors of the *income* and *balance* logistic regression coefficients in two different ways: 1) using the bootstrap, and 2) using the standard formula for computing the standard errors. 

a) Use statsmodels to determine the estimated standard errors for the coefficients associated with *income* and *balance* in the multiple logistic regression model

b) Write a function ```boot_fn```, that takes as input the *Default* data set as well as an index of observations,  and that outputs the coefficient estimates for *income* and *balance* in the multiple logistic regression model.

c) Use ```boot_fn```  to estimate the standard errors of the logistic regression coefficients for *income* and *balance*

d) Comment on the results.

In [11]:
df = sm.datasets.get_rdataset("Default", "ISLR", cache=True).data.pipe(pd.get_dummies, columns=["default", "student"], drop_first=True)

In [12]:
y = df["default_Yes"]
X = sm.add_constant(df[["income", "balance"]])
logit = sm.Logit(y, X).fit()

Optimization terminated successfully.
         Current function value: inf
         Iterations 10


In [13]:
logit.bse

const      0.434772
income     0.000005
balance    0.000227
dtype: float64

In [14]:
logit.params

const     -11.540468
income      0.000021
balance     0.005647
dtype: float64

In [15]:
def boot_fn(base_df):
    boot_df = base_df.sample(frac=1, replace=True)
    y = boot_df["default_Yes"]
    X = sm.add_constant(boot_df[["income", "balance"]])
    logit = sm.Logit(y, X).fit(disp=0) # disp = 0 silences convergence notification
    params = logit.params
    return (params.loc["income"], params.loc["balance"])

In [16]:
income_params = list()
balance_params = list()
for _ in range(1_000):
    income, balance = boot_fn(df)
    income_params.append(income)
    balance_params.append(balance)

income_params = np.array(income_params)
balance_params = np.array(balance_params)
income_param_boot = np.mean(income_params)
balance_param_boot = np.mean(balance_params)
income_se_boot = np.std(income_params)
balance_se_boot = np.std(balance_params)
print(f"Income: parameter {income_param_boot}, SE {income_se_boot}")
print(f"Balance: parameter {balance_param_boot}, SE {balance_se_boot}")

Income: parameter 2.1132801977675804e-05, SE 4.72192168108425e-06
Balance: parameter 0.005649904104811975, SE 0.00022228528153523292


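Bootstrap estimates of both the parameter values and the standard errors are quite close to the analytic solution: the bootstrap SE is about 4.7e-06 for *income* (against 5e-06 from `logit.bse`) and about 0.000222 for *balance* (against 0.000227).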

## 7

In sections 5.3.2 and 5.3.3, we saw that the ```cv.glm()``` function can be used in order to compute the LOOCV test error estimate. Alternatively, one could compute those quantities using just the ```glm()``` and ```predict.glm()``` functions, and a for loop. You will now take this approach in order to compute the LOOCV error for a simple logistic regression model on the ```Weekly``` data set. Recall that in the context of classification problems, the LOOCV error is given in 5.4:

$CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{Err}_{i}$

a) Fit a logistic regression model that predicts ```Direction``` using ```Lag1``` and ```Lag2```.

b) Fit a logistic regression model that predicts ```Direction``` using ```Lag1``` and ```Lag2``` *using all but the first observation*.

c) Use the model from (b) to predict the direction of the first observation. You can do this by predicting that the first observation will go up if P(```Direction="Up"```|```Lag1,Lag2```) > 0.5. Was this observation correctly classified?

d) Write a for loop from $i=1$ to $i=n$ where $n$ is the number of observations in the data set, that performs each of the following steps:
* Fit a logistic regression model using all but the $i$th observation in order to predict whether or not the market moves up.
* Compute the posterior probability for the $i$th observation in order to predict whether or not the market moves up.
* Determine whether or not an error was made in predicting the direction for the $i$th observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

e) Take the average of the $n$ numbers obtained by d) in order to obtain the LOOCV estimate of the test error. Comment on the results

In [17]:
df = sm.datasets.get_rdataset("Weekly", "ISLR", cache=True).data.pipe(pd.get_dummies, columns=["Direction"], drop_first=True)

In [18]:
y = df["Direction_Up"]
X = sm.add_constant(df[["Lag1", "Lag2"]])
logit = sm.Logit(y, X).fit(disp=0) # disp = 0 silences convergence notification

In [19]:
logit_loo = sm.Logit(y.loc[1:], X.loc[1:]).fit(disp=0)

In [20]:
logit_loo.predict(X.loc[0].values)

array([0.57139232])

In [21]:
y.loc[0]

0

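The model predicts Up (probability 0.571 > 0.5), but the first observation is actually Down (`y.loc[0]` is 0), so the observation is incorrectly classified.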

In [22]:
predictions = list()
for index in y.index:
    y_loo = y.loc[~y.index.isin([index])]
    X_loo = X.loc[~X.index.isin([index])]
    logit = sm.Logit(y_loo, X_loo).fit(disp=0)
    pred = round(logit.predict(X.loc[index].values)[0], 0)
    predictions.append(pred == y.loc[index])
np.mean(predictions)

0.5500459136822773

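Hey, right a little more than half the time, let's take this baby to the stock market! Note that the loop records correct predictions rather than errors, so 0.55 is the LOOCV accuracy; the LOOCV error the question asks for is 1 - 0.55 = 0.45.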
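The same leave-one-out loop can also be expressed with scikit-learn's `LeaveOneOut` splitter and `cross_val_score`. This is a sketch on synthetic stand-in data (two assumed lag-like predictors and an up/down label), with `LogisticRegression` and a large `C` standing in for the unpenalized statsmodels fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in for two lag predictors and an up/down label (assumed data)
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 2))
p_up = 1 / (1 + np.exp(-(0.2 + 0.3 * X_demo[:, 1])))
y_demo = (rng.random(n) < p_up).astype(int)

model = LogisticRegression(C=1e6, max_iter=1000)  # large C: close to unpenalized
# One fit per held-out observation; each score is 1.0 (correct) or 0.0 (wrong)
accuracy_i = cross_val_score(
    model, X_demo, y_demo, cv=LeaveOneOut(), scoring="accuracy"
)
err_i = 1 - accuracy_i      # Err_i in the LOOCV formula
loocv_error = err_i.mean()  # CV_(n) = (1/n) * sum of Err_i
print(f"LOOCV accuracy: {accuracy_i.mean():.4f}, LOOCV error: {loocv_error:.4f}")
```

Keeping `err_i` as an explicit array makes the correspondence with $CV_{(n)}$ direct: the estimate is just the mean of the per-observation error indicators.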

## 8

We will now perform cross-validation on a simulated data set.

a) Generate a simulated data set as follows
```
> set.seed(1)
> y=rnorm(100)
> x= rnorm(100)
> y = x - 2 * x^2 + rnorm(100)
```

In this data set, what is $n$ and what is $p$? Write out the model used to generate the data in equation form

b) Create a scatterplot of $X$ against $Y$. Comment on what you find.

c) Set a random seed, and then compute the LOOCV errors that result from fitting the following four models using least squares:
* $ Y = \beta_0 + \beta_1 X + \epsilon$
* $ Y = \beta_0 + \beta_1 X + \beta_2X^2 + \epsilon$
* $ Y = \beta_0 + \beta_1 X + \beta_2X^2 + \beta_3X^3 + \epsilon$
* $ Y = \beta_0 + \beta_1 X + \beta_2X^2 + \beta_3X^3 + \beta_4X^4 + \epsilon$

Note that you may find it helpful to use the ```data.frame()``` function to create a single data set containing both $X$ and $Y$

d) Repeat c) using another random seed, and report your results. Are they the same as c? Why?

e) Which of the models in c) had the smallest LOOCV error? Is this what you expected? Explain your answer

f) Comment on the statistical significance of the coefficient estimates that results from fitting each of the models in c) using least squares. Do these results agree with the conclusions drawn based on the cross validation results?
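For part a), the R snippet draws $n = 100$ observations of a single underlying predictor $X$, which enters the response through both $X$ and $X^2$. Assuming `rnorm` is used with its default mean 0 and standard deviation 1, the data-generating model can be written as:

$$Y = X - 2X^2 + \epsilon, \qquad X \sim N(0, 1), \quad \epsilon \sim N(0, 1)$$

In the notation of the models in c), this corresponds to $\beta_0 = 0$, $\beta_1 = 1$ and $\beta_2 = -2$, with all higher-order coefficients equal to zero.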

In [23]:
def gen_xy(seed=42, size=100):
    np.random.seed(seed)
    epsilon = np.random.normal(loc=0.0, scale=1.0, size=size)
    x = np.random.normal(loc=0.0, scale=1.0, size=size)
    y = x - (2 * x**2) + epsilon
    return x, y

In [24]:
def gen_poly(x, degrees):
    poly = PolynomialFeatures(degrees, include_bias=False)
    x_poly = poly.fit_transform(x.reshape(-1, 1))
    return x_poly

In [25]:
def loocv(seed, degrees, size=100):
    x, y = gen_xy(seed, size)
    x_poly = gen_poly(x, degrees)
    ydf = pd.DataFrame(y, columns=["y"])
    xdf = pd.DataFrame(x_poly, index=ydf.index, columns=[f"X_{i + 1}" for i in range(x_poly.shape[1])])
    errors = list()
    for i in range(len(y)):
        lm = LinearRegression()
        x_loo = xdf.drop(i)
        y_loo = ydf.drop(i)
        lm.fit(x_loo, y_loo)
        y_pred = lm.predict(xdf.loc[i].values.reshape(1, -1))[0][0]
        # Signed residual for the held-out point (squaring it would give an MSE)
        errors.append(ydf.loc[i, "y"] - y_pred)
    return np.mean(errors)

In [26]:
for i in range(1, 5, 1):
    looc_err = loocv(42, i)
    print(f"Polynomial of degree {i} error: {looc_err:0.4f}")

Polynomial of degree 1 error: -0.0369
Polynomial of degree 2 error: -0.0044
Polynomial of degree 3 error: -0.0111
Polynomial of degree 4 error: 0.0026


In [27]:
for i in range(1, 5, 1):
    looc_err = loocv(53, i)
    print(f"Polynomial of degree {i} error: {looc_err:0.4f}")

Polynomial of degree 1 error: -0.0370
Polynomial of degree 2 error: -0.0020
Polynomial of degree 3 error: -0.0077
Polynomial of degree 4 error: -0.0050


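They're different because I generated different data for each seed; I'm not sure if that was the intent of the question. On a fixed data set the results would obviously be the same, because there's no randomness in the estimation component of LOOCV: each of the $n$ fits leaves out exactly one observation, so there is no random split.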
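The claim that LOOCV has no random component can be checked directly. The sketch below uses NumPy only, computes a LOOCV mean squared error for a quadratic fit, and shuffles the rows of one fixed data set with two different seeds; `loocv_mse` is a hypothetical helper, and the data come from `default_rng` rather than the `gen_xy` function above.

```python
import numpy as np

def loocv_mse(x, y, degree):
    """LOOCV mean squared error for a polynomial least squares fit."""
    design = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, ..., x^degree
    squared_errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i  # every row except the i-th
        beta, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        squared_errors.append((y[i] - design[i] @ beta) ** 2)
    return float(np.mean(squared_errors))

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x - 2 * x**2 + rng.normal(size=100)

# Shuffle the rows with two different seeds; the data set itself is unchanged
perm_a = np.random.default_rng(1).permutation(100)
perm_b = np.random.default_rng(2).permutation(100)
mse_a = loocv_mse(x[perm_a], y[perm_a], degree=2)
mse_b = loocv_mse(x[perm_b], y[perm_b], degree=2)
print(np.isclose(mse_a, mse_b))
```

The two values agree up to floating point rounding, since the set of leave-one-out fits does not depend on the row order or on any seed.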

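The model with the second degree polynomial has the error closest to zero for seed 53 and the second closest for seed 42 (where degree 4 edges it out, 0.0026 against -0.0044). That is roughly what I'd expect, given that the true form of the model is a second degree polynomial. One caveat: `loocv` averages the signed residuals rather than their squares, so positive and negative errors cancel and these figures are not the LOOCV mean squared error the question has in mind.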
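For a comparison based on squared errors, one option is to let scikit-learn handle the leave-one-out loop. The sketch below uses the same data-generating process but draws the data with `default_rng(42)` (an assumption, so the numbers will not match the `gen_xy` output above) and reports the LOOCV MSE for degrees 1 to 4.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same data-generating process as the exercise: y = x - 2x^2 + noise
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = x - 2 * x**2 + rng.normal(size=100)

loocv_mse = {}
for degree in range(1, 5):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False), LinearRegression()
    )
    # Each score is the negated squared error of one held-out point; flip the sign
    scores = cross_val_score(
        model, x.reshape(-1, 1), y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
    )
    loocv_mse[degree] = -scores.mean()
    print(f"Polynomial of degree {degree} LOOCV MSE: {loocv_mse[degree]:.4f}")
```

With squared errors the linear model should stand out as clearly worse, because it cannot capture the $-2X^2$ term at all.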

In [28]:
def est_poly(degrees, seed=42, size=100):
    x, y = gen_xy(seed, size)
    x_poly = gen_poly(x, degrees)
    ydf = pd.DataFrame(y, columns=["y"])
    xdf = pd.DataFrame(x_poly, index=ydf.index, columns=[f"X_{i + 1}" for i in range(x_poly.shape[1])])
    lm = sm.OLS(ydf, sm.add_constant(xdf)).fit()
    print(lm.summary())

In [29]:
for i in range(1, 5, 1):
    est_poly(i)

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                   0.04064
Date:                Fri, 10 Jan 2020   Prob (F-statistic):              0.841
Time:                        21:51:57   Log-Likelihood:                -240.48
No. Observations:                 100   AIC:                             485.0
Df Residuals:                      98   BIC:                             490.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.8846      0.271     -6.959      0.0


The model that matches the functional form has all the right stats: $X$ and $X^2$ are significant, while the intercept is not. For the most part the other models follow that pattern, but it's interesting that the model with just $X$ finds the intercept significant and not $X$. Similarly, the last model generally matches, except $X^4$ is significant even though it's not actually a term in the model for $y$, suggesting overfitting.