# ISLR-Python Ch5 Applied 6 

- [Load Default Dataset](#Load-Default-Dataset)
- [A. Compute Standard Errors of Coefficients](#A.-Compute-Standard-Errors-of-Coeffecients)
- [B. Write Function to Generate Coefficient Estimates](#B.-Write-Function-to-Generate-Coeffecient-Estimates)
- [C-D. Bootstrap Coefficient Standard Errors and Compare](#C-D.-Bootstrap-Coeffecient-Standard-Errors-and-Compare)

In [1]:
## perform imports and set-up
import numpy as np
import pandas as pd
import scipy
import statsmodels.api as sm

from matplotlib import pyplot as plt

%matplotlib inline
plt.style.use('ggplot') # emulate pretty r-style plots

# print numpy arrays with precision 4
np.set_printoptions(precision=4)

## Load Default Dataset

In [2]:
df = pd.read_csv('../data/Default.csv', index_col=0, true_values=['Yes'], false_values=['No'])
print(len(df))
df.head()

10000


|   | default | student | balance | income |
|---|---------|---------|---------|--------|
| 1 | False | False | 729.526495 | 44361.625074 |
| 2 | False | True | 817.180407 | 12106.1347 |
| 3 | False | False | 1073.549164 | 31767.138947 |
| 4 | False | False | 529.250605 | 35704.493935 |
| 5 | False | False | 785.655883 | 38463.495879 |


## A. Compute Standard Errors of Coeffecients

In [42]:
# Design matrix and response #
##############################
predictors = ['balance','income']
X = sm.add_constant(df[predictors])
y = df.default.values

# Build Classifier and Fit #
############################
model = sm.Logit(y,X)
logit = model.fit()
print(logit.summary())
print('\nThe Standard Errors from Formula are:\n', logit.bse)

Optimization terminated successfully.
         Current function value: 0.078948
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9997
Method:                           MLE   Df Model:                            2
Date:                Wed, 27 Jul 2016   Pseudo R-squ.:                  0.4594
Time:                        13:28:17   Log-Likelihood:                -789.48
converged:                       True   LL-Null:                       -1460.3
                                        LLR p-value:                4.541e-292
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        -11.5405      0.435    -26.544      0.000       -12.393   -10.688
balance        0.0056      0
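
The `std err` column above (and `logit.bse`) is a model-based estimate: for a maximum-likelihood logistic fit, it should correspond to the square roots of the diagonal of the inverse information matrix, X'WX with W = diag(p(1 - p)). The following is a minimal numpy sketch of that calculation, not the statsmodels implementation. The function name `fit_logit_newton` and the synthetic data are assumptions made for illustration; this is not the Default dataset.

```python
import numpy as np

def fit_logit_newton(X, y, n_iter=25, tol=1e-10):
    """Fit a logistic regression by Newton-Raphson; return (beta, standard errors)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities
        info = X.T @ (X * (p * (1.0 - p))[:, None])  # information matrix X'WX
        step = np.linalg.solve(info, X.T @ (y - p))  # Newton step: info^-1 * gradient
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the information matrix at the final estimate
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(info)))       # model-based standard errors
    return beta, se

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-1.0, 0.8, -0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat, se_hat = fit_logit_newton(X, y)
print("coefficients:   ", beta_hat)
print("standard errors:", se_hat)
```

These are the quantities the bootstrap in parts C-D estimates by resampling instead of by formula.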

## B. Write Function to Generate Coeffecient Estimates

In [38]:
def default_boot_sample(dataFrame, indices):
    """
    Returns a single bootstrap estimate of the coefficients of a logistic model fit to the rows of dataFrame
    selected by indices, using the ['balance', 'income'] predictors and the 'default' response. The estimates are
    returned in the order [balance, income].
    """
    predictors = ['balance', 'income']
    response = ['default']
    
    # Get the design matrix and response variables
    X = sm.add_constant(dataFrame[predictors]).loc[indices]
    y = dataFrame[response].loc[indices]
    
    # Create and fit the model; disp=0 suppresses the optimizer output
    results = sm.Logit(y,X).fit(disp=0)
    
    return [results.params[predictors].balance, results.params[predictors].income]
    
# test it out
np.random.seed(0)
indices = np.random.choice(df.index, size=len(df), replace=True)
print(default_boot_sample(df, indices))

[0.0057994903738334703, 2.1728365149007411e-05]
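
Because `np.random.choice(..., replace=True)` samples with replacement, each bootstrap sample repeats some rows and omits others. The sketch below illustrates this on a stand-in index of the same length as the dataset. It assumes numpy's newer `default_rng` generator rather than the `np.random.seed` call used above, and the `labels` array is a hypothetical substitute for `df.index`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                        # same number of rows as the Default data
labels = np.arange(1, n + 1)      # hypothetical stand-in for df.index

# One bootstrap resample of the row labels, drawn with replacement
resample = rng.choice(labels, size=n, replace=True)
frac_unique = np.unique(resample).size / n

print(f"distinct rows in the resample: {frac_unique:.3f}")
print(f"theoretical 1 - (1 - 1/n)^n:   {1 - (1 - 1 / n) ** n:.3f}")
```

On average roughly 63% of the original rows appear in any one resample, which is why the coefficient estimates vary from one bootstrap sample to the next.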


## C-D. Bootstrap Coeffecient Standard Errors and Compare

In [50]:
def se_boot(data, stat_func, num_samples=1000):
    """
    Estimates bootstrap standard errors: draws num_samples bootstrap samples (with replacement) from data, applies
    stat_func to each one, and returns the standard deviation of the resulting estimates.
    """
    boot_samples = []
    
    for _ in range(num_samples):
        # choose a bootstrap sample of row labels, with replacement
        indices = np.random.choice(data.index, size=len(data), replace=True)
        # compute the coefficients for this sample
        boot_samples.append(stat_func(data, indices))
    
    # the se estimate is the standard deviation across the bootstrap estimates
    se_estimate = np.std(boot_samples, axis=0)
    
    return se_estimate

np.random.seed(0)
print(se_boot(df,default_boot_sample, num_samples=1000))

[  2.3616e-04   4.7132e-06]


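These bootstrap standard errors for `balance` and `income` are very close to the model-based values reported by `logit.bse` in part A. statsmodels computes those from the estimated covariance matrix of the maximum-likelihood fit, which is the logistic-regression analogue of formula 3.8 in the text.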
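
The same agreement can be checked on a statistic whose formula-based standard error has a simple closed form. The sketch below compares the bootstrap standard error of a sample mean with s / sqrt(n). It is an illustration under assumed synthetic data, not part of the exercise, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=500)   # synthetic data, illustration only

# Formula-based standard error of the mean: s / sqrt(n)
se_formula = sample.std(ddof=1) / np.sqrt(sample.size)

# Bootstrap: resample with replacement, recompute the mean, measure the spread
num_samples = 2000
idx = rng.integers(0, sample.size, size=(num_samples, sample.size))
boot_means = sample[idx].mean(axis=1)
se_bootstrap = boot_means.std(ddof=1)

print(f"formula SE:   {se_formula:.4f}")
print(f"bootstrap SE: {se_bootstrap:.4f}")
```

The two numbers should agree to within a few percent. The remaining gap is mostly Monte Carlo noise, which shrinks as `num_samples` grows.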