The textbook's instructions are written for R. I reproduce them in Python as closely as I can.

# Preface

From the textbook, p. 198:
> In Chapter 4, we used logistic regression to predict the probability of
default using income and balance on the `Default` data set. We will
now estimate the test error of this logistic regression model using the
validation set approach. Do not forget to set a random seed before
beginning your analysis.

In [1]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, \
                                          QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import statsmodels.api as sm


%matplotlib inline
sns.set()



In [2]:
np.random.seed(1)
default = pd.read_csv('https://raw.githubusercontent.com'
                      '/dsnair/ISLR/master/data/csv/Default.csv')
default = default.replace({'Yes' : 1, 'No' : 0})
x = default.drop('default', axis='columns')
y = default.default
default.head()

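|   | default | student | balance | income |
|---|---------|---------|---------|--------|
| 0 | 0 | 0 | 729.526495 | 44361.625074 |
| 1 | 0 | 1 | 817.180407 | 12106.1347 |
| 2 | 0 | 0 | 1073.549164 | 31767.138947 |
| 3 | 0 | 0 | 529.250605 | 35704.493935 |
| 4 | 0 | 0 | 785.655883 | 38463.495879 |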


# (a)

From the textbook, p. 198:
> Fit a logistic regression model that uses `income` and `balance` to predict `default`.

In [3]:
model_a = sm.Logit(y, x[['income', 'balance']]).fit()

Optimization terminated successfully.
         Current function value: 0.173456
         Iterations 8


# (b)

From the textbook, p. 198:
> Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

> i. Split the sample set into a training set and a validation set.

In [4]:
x_train, x_test, y_train, y_test = train_test_split(x, y)
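
When `random_state` is not given, `train_test_split` draws from NumPy's global random state, so re-running the cell yields a different partition each time. With the default `test_size` of 0.25, a quarter of the observations go to the validation set. A small sketch of pinning the split explicitly, on toy arrays (the `demo_*` and `split_*` names are made up for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 observations, 2 features (hypothetical values).
demo_x = np.arange(40).reshape(20, 2)
demo_y = np.arange(20) % 2

# Passing random_state makes the shuffle reproducible on its own,
# independent of the global NumPy seed.
split_1 = train_test_split(demo_x, demo_y, test_size=0.25, random_state=1)
split_2 = train_test_split(demo_x, demo_y, test_size=0.25, random_state=1)

# Both calls return (x_train, x_test, y_train, y_test) with identical contents.
same = all(np.array_equal(a, b) for a, b in zip(split_1, split_2))
print(same, len(split_1[0]), len(split_1[1]))
```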

> ii. Fit a multiple logistic regression model using only the training observations.

In [5]:
model_b = sm.Logit(y_train, x_train[['income', 'balance']]).fit()

Optimization terminated successfully.
         Current function value: 0.177937
         Iterations 8


> iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

In [6]:
y_pred = model_b.predict(x_test[['income', 'balance']])
y_pred = np.where(y_pred >= 0.5, 1, 0)

> iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

In [7]:
1 - accuracy_score(y_test, y_pred)

0.03159999999999996
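
The validation set error can also be computed directly as the fraction of mismatched labels, which is the same quantity as one minus the accuracy. A tiny worked example with made-up labels and probabilities (all `demo_*` values are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

demo_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
demo_prob = np.array([0.1, 0.6, 0.8, 0.2, 0.4, 0.3, 0.05, 0.9])

# Classify as default when the predicted probability exceeds 0.5.
demo_pred = (demo_prob > 0.5).astype(int)

# Fraction misclassified: 2 of the 8 toy observations disagree.
error_rate = np.mean(demo_pred != demo_true)
print(error_rate, 1 - accuracy_score(demo_true, demo_pred))
```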

# (c)

From the textbook, p. 199:
> Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

In [8]:
for i in range(3):
    x_train, x_test, y_train, y_test = train_test_split(x, y)
    model_c = sm.Logit(y_train, x_train[['income', 'balance']]).fit()
    y_pred = model_c.predict(x_test[['income', 'balance']])
    y_pred = np.where(y_pred >= 0.5, 1, 0)
    print(f'\n{1 - accuracy_score(y_test, y_pred)}\n')

Optimization terminated successfully.
         Current function value: 0.175927
         Iterations 8

0.032399999999999984

Optimization terminated successfully.
         Current function value: 0.171329
         Iterations 8

0.03600000000000003

Optimization terminated successfully.
         Current function value: 0.166725
         Iterations 8

0.036800000000000055



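The validation set error is different with each new split (0.0324, 0.0360, and 0.0368 here). This means that the estimate of the test error depends on which observations happen to land in the training set and which in the validation set.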
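
To get a feel for how much the estimate moves between splits, the procedure can be repeated over many seeds and summarised. This is only a sketch under stated assumptions: it uses synthetic data from `make_classification` and scikit-learn's `LogisticRegression` (which applies L2 regularisation by default, unlike `sm.Logit`), and `validation_error` is a helper name invented for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (NOT the Default data set).
demo_x, demo_y = make_classification(n_samples=2000, n_features=4,
                                     weights=[0.9, 0.1], random_state=0)


def validation_error(x, y, seed):
    """Fit on one random training split and return the validation error."""
    x_tr, x_val, y_tr, y_val = train_test_split(x, y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return np.mean(model.predict(x_val) != y_val)


# One error estimate per seed; the spread shows the split-to-split variability.
errors = np.array([validation_error(demo_x, demo_y, seed)
                   for seed in range(20)])
print(f'mean = {errors.mean():.4f}, std = {errors.std(ddof=1):.4f}')
```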

# (d)

From the textbook, p. 199:
> Now consider a logistic regression model that predicts the probability of default using `income`, `balance`, and a dummy variable for `student`. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

In [9]:
model_d = sm.Logit(y_train, x_train).fit()
y_pred = model_d.predict(x_test)
y_pred = np.where(y_pred >= 0.5, 1, 0)
print(f'\n{1 - accuracy_score(y_test, y_pred)}\n')

Optimization terminated successfully.
         Current function value: 0.119899
         Iterations 9

0.03759999999999997



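It is difficult to tell whether including a dummy variable for `student` reduces the test error rate. This cell reuses the last split from (c), where the model without `student` had an error of 0.0368, so adding `student` gives a slightly higher error (0.0376) on the same split. That gap is smaller than the variation between the three splits in (c) (0.0324 to 0.0368), so there is no evidence here that the dummy variable reduces the test error.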
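
One way to make such a comparison less dependent on a single split is to evaluate both feature sets on the same partitions and look at the paired differences. The sketch below does this on synthetic data in which the third column is a 0/1 dummy that has no effect on the outcome. The data, the `paired_errors` helper, and the use of scikit-learn's `LogisticRegression` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Default data set).
rng = np.random.default_rng(0)
n = 2000
demo_x = rng.normal(size=(n, 3))
demo_x[:, 2] = rng.integers(0, 2, n)   # 0/1 dummy unrelated to the outcome
demo_logit = -2 + 1.5 * demo_x[:, 0] + 1.0 * demo_x[:, 1]
demo_y = rng.binomial(1, 1 / (1 + np.exp(-demo_logit)))


def paired_errors(x, y, cols_a, cols_b, seed):
    """Validation errors of two feature subsets on one shared split."""
    idx_train, idx_val = train_test_split(np.arange(len(y)), random_state=seed)
    errs = []
    for cols in (cols_a, cols_b):
        model = LogisticRegression(max_iter=1000)
        model.fit(x[idx_train][:, cols], y[idx_train])
        errs.append(np.mean(model.predict(x[idx_val][:, cols]) != y[idx_val]))
    return errs


# Difference (without dummy) - (with dummy), one value per split.
diffs = np.array([np.subtract(*paired_errors(demo_x, demo_y, [0, 1], [0, 1, 2], s))
                  for s in range(10)])
print(f'mean difference = {diffs.mean():.4f}')
```

A mean difference close to zero relative to the spread of `diffs` would suggest the extra variable does not change the error in any consistent direction.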