<a href="https://colab.research.google.com/github/lauracline/Technical-Specs-of-Automobiles/blob/master/Resampling_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Resampling Methods**

## **Cross Validation**

Usually a test set is not available, so a simple strategy is to split the available data into a training set and a held-out validation (test) set. For quantitative responses, the usual metric is mean squared error (MSE); for categorical responses, options include the error rate, area under the ROC curve, F1 score, or a weighted confusion matrix.
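A minimal sketch of the validation set idea for a quantitative response, using only NumPy. The data are synthetic and the 25% hold-out fraction is an assumption chosen for illustration, not something prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

# Synthetic data (hypothetical): y depends linearly on x plus noise
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Shuffle the row indices and hold out 25% as the validation set
idx = rng.permutation(n)
n_val = n // 4
val_idx, train_idx = idx[:n_val], idx[n_val:]

# Fit a straight line on the training rows only
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

# Validation MSE: average squared error on rows the model never saw
pred = intercept + slope * x[val_idx]
val_mse = np.mean((y[val_idx] - pred) ** 2)
print(f"validation MSE: {val_mse:.3f}")
```

A different seed gives a different split and a somewhat different MSE, which is the main weakness of a single validation split.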

## **Leave One Out Cross Validation**

LOOCV puts a single observation in the test set and uses the other n-1 observations to build the model. n different models are built, leaving out each observation once, and the error is averaged over these n trials. LOOCV improves on the simple validation set approach above: each model is built on nearly all the data, and there is no randomness in the splits since each observation is left out exactly once. It is computationally expensive, especially with large n and a complex model.
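The procedure can be sketched with a plain loop. This is an illustration on synthetic data with a simple linear fit; scikit-learn's `LeaveOneOut` splitter provides the same splits if you prefer a library helper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): small n keeps the n refits cheap
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

squared_errors = np.empty(n)
for i in range(n):
    # Leave observation i out; train on the other n - 1 rows
    train = np.arange(n) != i
    slope, intercept = np.polyfit(x[train], y[train], deg=1)
    # Test on the single held-out observation
    squared_errors[i] = (y[i] - (intercept + slope * x[i])) ** 2

# The LOOCV estimate is the average over the n single-point test errors
loocv_mse = squared_errors.mean()
print(f"LOOCV MSE: {loocv_mse:.3f}")
```

Note that the loop has no random component once the data are fixed: rerunning it always gives the same estimate.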

## **K-Fold Cross Validation**

Similar to LOOCV, but this time you leave more than one observation out at a time. Here, k is the number of partitions (folds) of your sample, so if you have 1000 observations and k = 10, each fold contains 100 observations. Each fold in turn acts as the test set while the model is trained on the remaining 900 observations. Compute the MSE on each held-out fold and average the k values. LOOCV is the special case of k-fold CV in which k equals the number of observations.
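A small sketch of the 1000-observation, k = 10 case described above. The data and the linear model are assumptions made for illustration; scikit-learn's `KFold` would produce equivalent splits.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (hypothetical) matching the sizes in the text
n, k = 1000, 10
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Shuffle once, then cut the indices into k equal folds
folds = np.array_split(rng.permutation(n), k)

fold_mse = []
for test_idx in folds:
    # Every row not in the current fold is used for training
    train_mask = np.ones(n, dtype=bool)
    train_mask[test_idx] = False
    slope, intercept = np.polyfit(x[train_mask], y[train_mask], deg=1)
    pred = intercept + slope * x[test_idx]
    fold_mse.append(np.mean((y[test_idx] - pred) ** 2))

# The k-fold CV estimate is the average of the k per-fold MSEs
cv_mse = np.mean(fold_mse)
print([len(f) for f in folds])  # each fold holds 100 observations
print(f"10-fold CV MSE: {cv_mse:.3f}")
```

Setting `k = n` in this sketch reduces it to LOOCV, at the cost of n model fits instead of 10.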

## **Bias-Variance Tradeoff Between LOOCV and K-Folds**

Since LOOCV trains each model on n-1 observations, nearly the full dataset, its estimate of the test error is approximately unbiased. k-fold CV trains each model on fewer observations, so it tends to overestimate the test error somewhat and is more biased. LOOCV has higher variance, however: its n models are trained on almost identical data, so their errors are highly correlated with one another, and the mean of highly correlated quantities varies more than the mean of less correlated ones. k-fold CV averages over fewer models whose training sets overlap less, so its estimate varies less. A value of k between 5 and 10 is a good rule of thumb that balances the trade-off between bias and variance.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

## Applied Exercises

1. In chapter 4, we used logistic regression to predict the probability of `default` using `income` and `balance` on the `Default` dataset. We will now estimate the test error of this logistic regression model using the validation set approach. 

a. Fit a logistic regression model that uses `income` and `balance` to predict `default`. 

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd

In [21]:
default = pd.read_csv('https://raw.githubusercontent.com/emredjan/ISL-python/master/datasets/Default.csv')
default['student_yes'] = (default['student'] == 'Yes').astype('int')
default['default_yes'] = (default['default'] == 'Yes').astype('int')

In [4]:
default.head()

| | Unnamed: 0 | default | student | balance | income | student_yes | default_yes |
|---|---|---|---|---|---|---|---|
| 0 | 1 | No | No | 729.526495 | 44361.625074 | 0 | 0 |
| 1 | 2 | No | Yes | 817.180407 | 12106.1347 | 1 | 0 |
| 2 | 3 | No | No | 1073.549164 | 31767.138947 | 0 | 0 |
| 3 | 4 | No | No | 529.250605 | 35704.493935 | 0 | 0 |
| 4 | 5 | No | No | 785.655883 | 38463.495879 | 0 | 0 |


In [6]:
X = default[['balance', 'income']]
y = default['default_yes']

No validation set

Using sklearn

In [7]:
# Notice how tol must be changed to less than default value or 
# convergence won't happen
# Use a high value of C to remove regularization 

model = LogisticRegression(C=100000, tol=.0000001)
model.fit(X,y)
model.intercept_, model.coef_

(array([-11.54046839]), array([[5.64710291e-03, 2.08089921e-05]]))

Statsmodels 

Coefficients are similar

In [8]:
import statsmodels.formula.api as smf

In [9]:
result = smf.logit(formula = 'default_yes ~ balance + income', data=default).fit()

Optimization terminated successfully.
         Current function value: 0.078948
         Iterations 10


In [10]:
result.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | default_yes | No. Observations: | 10000 |
| Model: | Logit | Df Residuals: | 9997 |
| Method: | MLE | Df Model: | 2 |
| Date: | Sun, 05 Sep 2021 | Pseudo R-squ.: | 0.4594 |
| Time: | 15:43:27 | Log-Likelihood: | -789.48 |
| converged: | True | LL-Null: | -1460.3 |
| Covariance Type: | nonrobust | LLR p-value: | 4.541e-292 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -11.5405 | 0.435 | -26.544 | 0.000 | -12.393 | -10.688 |
| balance | 0.0056 | 0.000 | 24.835 | 0.000 | 0.005 | 0.006 |
| income | 2.081e-05 | 4.99e-06 | 4.174 | 0.000 | 1.1e-05 | 3.06e-05 |


Error without validation set

This is in-sample prediction, so the values below are training accuracy (one minus the training error). sklearn and statsmodels give the same result.

In [11]:
(model.predict(X) == y).mean()

0.9737

In [12]:
((result.predict(X) > .5)*1 == y).mean()

0.9737

b. Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

i. Split the sample set into a training set and a validation set. 

ii. Fit a multiple logistic regression model using only the training observations. 

iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the `default` category if the posterior probability is greater than 0.5. 

iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified. 
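Steps iii and iv can be sketched on a handful of made-up numbers. The probabilities and labels below are hypothetical and are not taken from the `Default` data; they only show how the 0.5 threshold and the misclassification fraction are computed.

```python
import numpy as np

# Hypothetical predicted probabilities of default for six validation rows
prob_default = np.array([0.02, 0.91, 0.40, 0.65, 0.10, 0.55])
# Hypothetical true labels (1 = default, 0 = no default)
y_val = np.array([0, 1, 1, 1, 0, 0])

# Step iii: classify as default when the probability exceeds 0.5
pred = (prob_default > 0.5).astype(int)

# Step iv: validation error = fraction of misclassified observations
val_error = np.mean(pred != y_val)
accuracy = np.mean(pred == y_val)

print(pred)       # [0 1 0 1 0 1]
print(val_error)  # 2 of the 6 rows are misclassified
```

Accuracy and error rate always sum to one, so an accuracy reported by `(model.predict(X_test) == y_test).mean()` converts to a validation error by subtracting it from 1.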

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [14]:
model = LogisticRegression(C=100000, tol=0.0000001)
model.fit(X_train, y_train)
model.intercept_, model.coef_

(array([-11.51405382]), array([[5.60596598e-03, 2.22188422e-05]]))

In [15]:
X_train_sm = X_train.join(y_train)

In [16]:
result = smf.logit(formula = 'default_yes ~ balance + income', data=X_train_sm).fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.080650
         Iterations 10


| | | | |
|---|---|---|---|
| Dep. Variable: | default_yes | No. Observations: | 7500 |
| Model: | Logit | Df Residuals: | 7497 |
| Method: | MLE | Df Model: | 2 |
| Date: | Sun, 05 Sep 2021 | Pseudo R-squ.: | 0.4581 |
| Time: | 15:49:20 | Log-Likelihood: | -604.87 |
| converged: | True | LL-Null: | -1116.2 |
| Covariance Type: | nonrobust | LLR p-value: | 8.459e-223 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -11.5141 | 0.495 | -23.260 | 0.000 | -12.484 | -10.544 |
| balance | 0.0056 | 0.000 | 21.764 | 0.000 | 0.005 | 0.006 |
| income | 2.222e-05 | 5.7e-06 | 3.897 | 0.000 | 1.1e-05 | 3.34e-05 |


In [17]:
# Validation accuracy is nearly the same as the training accuracy,
# so not much overfitting has happened
(model.predict(X_test) == y_test).mean(), ((result.predict(X_test) > 0.5)*1 == y_test).mean()

(0.9752, 0.9752)

Validation error is only 0.0248 (1 - 0.9752).

c. Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained. 

In [18]:
model = LogisticRegression(C=100000, tol=.0000001)

for i in range(3):
  X_train, X_test, y_train, y_test = train_test_split(X,y)
  model.fit(X_train, y_train)

  X_train_sm = X_train.join(y_train)
  result = smf.logit(formula='default_yes ~ balance + income', data=X_train_sm).fit()
  print((model.predict(X_test) == y_test).mean(), ((result.predict(X_test) > 0.5)*1 == y_test).mean())

Optimization terminated successfully.
         Current function value: 0.077908
         Iterations 10
0.9712 0.9712
Optimization terminated successfully.
         Current function value: 0.082171
         Iterations 10
0.972 0.9808
Optimization terminated successfully.
         Current function value: 0.080580
         Iterations 10
0.9672 0.9728


d. Now consider a logistic regression model that predicts the probability of `default` using `income`, `balance`, and a dummy variable for `student`. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for `student` leads to a reduction in the test error rate. 

In [24]:
X = default[['balance', 'income', 'student_yes']]
y = default['default_yes']

model = LogisticRegression(C=100000, tol=.0000001)

for i in range(3):
  X_train, X_test, y_train, y_test = train_test_split(X,y)
  model.fit(X_train, y_train)

  X_train_sm = X_train.join(y_train)
  result = smf.logit(formula='default_yes ~ balance + income + student_yes', data = X_train_sm).fit()
  print((model.predict(X_test) == y_test).mean(), ((result.predict(X_test) > 0.5)*1 == y_test).mean())

Optimization terminated successfully.
         Current function value: 0.081403
         Iterations 10
0.972 0.9768
Optimization terminated successfully.
         Current function value: 0.080568
         Iterations 10
0.9676 0.974
Optimization terminated successfully.
         Current function value: 0.078957
         Iterations 10
0.9668 0.9728


Looks like the error rate is very similar.

2. We continue to consider the use of a logistic regression model to predict the probability of `default` using `income` and `balance` on the `Default` dataset. In particular, we will now compute estimates for the standard errors of the `income` and `balance` logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the logistic regression function.
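The two approaches can be compared in a rough sketch. Everything here is an assumption made for illustration: the data are synthetic rather than the `Default` dataset, `fit_logit` is a hypothetical Newton-Raphson helper written so the example needs only NumPy, and 200 bootstrap replicates is an arbitrary choice. In the notebook itself the same idea would mean resampling rows of `default` with replacement and refitting with `smf.logit` each time.

```python
import numpy as np

def fit_logit(X, y, n_iter=25, tol=1e-10):
    """Hypothetical helper: logistic regression via Newton-Raphson.

    X must already contain an intercept column. Returns the fitted
    coefficients and the information matrix X' W X at the solution.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        step = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, info

rng = np.random.default_rng(3)

# Synthetic data (an assumption, not the Default dataset)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-1.0, 1.0, 0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

# (2) Standard formula: square roots of the diagonal of the inverse information
beta_hat, info = fit_logit(X, y)
formula_se = np.sqrt(np.diag(np.linalg.inv(info)))

# (1) Bootstrap: resample rows with replacement, refit, and take the
# standard deviation of each coefficient across the refits
B = 200
boot_betas = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot_betas[b], _ = fit_logit(X[idx], y[idx])
boot_se = boot_betas.std(axis=0, ddof=1)

print("formula SE:  ", np.round(formula_se, 3))
print("bootstrap SE:", np.round(boot_se, 3))
```

When the logistic model is a reasonable description of the data, the two sets of standard errors should be of similar size; a large gap would suggest the formula's assumptions are questionable.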