# Quiz2 Coding

In this coding part, you need to implement logistic regression with bootstrapping techniques on the Hitters dataset. The code consists of three parts:

I. preparing the data 
**(do not change if not necessary)**

II. bootstrapping 
**(add your code in this part)**

III. plot 
**(do not change if not necessary)**


## I. Preparing the Data

**(do not change if not necessary)**

As in the practice coding exercise, we use `NewLeague` as the label and the remaining columns as features, and aim to fit a logistic regression model that predicts the label from the features.

In [2]:
import pandas as pd

#the data can be downloaded from "https://github.com/jcrouser/islr-python/blob/master/data/Hitters.csv"
df = pd.read_csv('Hitters.csv')

# print (df.head(10))
df.dropna(inplace=True)

X = df[['AtBat','Hits','HmRun','Runs','RBI','Walks','Years','CAtBat','CHits','CHmRun','CRuns','CRBI','CWalks','League','Division','PutOuts','Assists','Errors','Salary']]
y = df[['NewLeague']]

X_cat = X.select_dtypes(exclude=['int64', 'float64'])
X_dog = X.select_dtypes(include=['int64', 'float64'])

X_cat = pd.get_dummies(X_cat)
X = pd.concat([X_cat, X_dog], axis=1)

NewLeague2number_dict = {
    'A':0,
    'N':1
}

y=y.replace({"NewLeague": NewLeague2number_dict})
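
To make the encoding step concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and only the column names mirror `Hitters.csv`; it shows what `pd.get_dummies` and a label dictionary produce, assuming the same conventions as the cell above.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror Hitters.csv.
toy = pd.DataFrame({
    "Hits": [81, 130, 141],
    "League": ["A", "N", "A"],
    "NewLeague": ["A", "N", "N"],
})

# One-hot encode the categorical feature, as pd.get_dummies does above.
league_dummies = pd.get_dummies(toy[["League"]])
features = pd.concat([league_dummies, toy[["Hits"]]], axis=1)

# Map the label strings to integers with an explicit dictionary.
labels = toy["NewLeague"].map({"A": 0, "N": 1})

print(list(features.columns))  # ['League_A', 'League_N', 'Hits']
print(labels.tolist())         # [0, 1, 1]
```

Each category becomes its own indicator column, so `League` turns into `League_A` and `League_N`.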

## II. Bootstrapping


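Implement logistic regression with bootstrapping on the same features and label as the lasso and ridge regressions.

**Ignore the different lambda values for the logistic regression: since we are not adding a regularization term of our own here, you can simply leave the model at its default settings.** Note that scikit-learn's `LogisticRegression` applies L2 regularization (`C=1.0`) by default, so the default model is lightly regularized rather than strictly unregularized.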
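
One possible shape for the logistic regression step is sketched below. It is not the reference solution: the helper name `logistic_auc_row`, the synthetic data from `make_classification`, and the `max_iter` value are all assumptions made so the example runs on its own. The idea is to fit the model once per split and repeat its AUC for every lambda slot, so the result has the same length as the ridge and lasso rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (an assumption); use the real splits in the notebook.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
x_tr, x_te, y_tr, y_te = X_demo[:160], X_demo[160:], y_demo[:160], y_demo[160:]

def logistic_auc_row(x_train, y_train, x_test, y_test, n_lambdas):
    """Hypothetical helper: fit once, repeat the AUC for every lambda slot."""
    estimator = LogisticRegression(max_iter=10_000)  # otherwise default settings
    estimator.fit(x_train, np.ravel(y_train))
    # Score with the predicted probability of the positive class.
    probs = estimator.predict_proba(x_test)[:, 1]
    return [roc_auc_score(np.ravel(y_test), probs)] * n_lambdas

row = logistic_auc_row(x_tr, y_tr, x_te, y_te, n_lambdas=7)
print(len(row))  # 7
```

Inside the bootstrapping loop, a row like this could be appended to `aucs["logistic"]` once per round.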



In [3]:
from sklearn.linear_model import LogisticRegression,Ridge, Lasso
from sklearn.metrics import roc_auc_score

max_iter = int(1e8)  # must be an integer; recent scikit-learn versions reject a float here

def lambda_aucs (model, lambda_vals, x_train, y_train, x_test, y_test):
    # Fit one model per lambda value and collect the test-set AUCs.
    auc = []

    for lambda_ in lambda_vals:

        if model == "ridge":
            estimator = Ridge(max_iter = max_iter, alpha = lambda_)
        elif model == "lasso":
            estimator = Lasso(max_iter = max_iter, alpha = lambda_)
        else:
            raise ValueError("unknown model: %s" % model)

        estimator.fit(x_train, y_train)
        # Continuous regression outputs are used directly as ranking scores.
        preds = estimator.predict(x_test)

        auc.append(roc_auc_score(y_test, preds))

    return auc
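
The function above passes raw ridge and lasso predictions to `roc_auc_score` even though they are not probabilities. That works because AUC depends only on how the scores rank the samples. A small sketch with made-up scores illustrates this:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # made-up continuous predictions

auc_raw = roc_auc_score(y_true, scores)
# A monotonic transform (shift and scale) preserves the ranking, hence the AUC.
auc_shifted = roc_auc_score(y_true, 10 * scores - 3)

print(auc_raw, auc_shifted)  # 0.75 0.75
```

Three of the four positive/negative pairs are ranked correctly, which gives an AUC of 0.75 in both cases.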

Now we perform the bootstrapping. Each round draws a fresh random 80/20 train/test split:

**(save your results to `aucs["logistic"]`)**


In [4]:
import numpy as np

n_bootstraps = 100  # 100 rounds keep the computation time low

n_test  = int(0.2*len(X))
n_train = len(X) - n_test

lambda_vals = [1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3]

aucs  = {"ridge":[], "lasso":[],"logistic":[]}

for _ in range(n_bootstraps):
    train_test_indicator = np.asarray([True]*n_train + [False]*n_test)
    np.random.shuffle(train_test_indicator)
#     print(train_test_indicator)

    test_indicator = np.logical_not(train_test_indicator)
    x_train, x_test = X[train_test_indicator], X[test_indicator]
    y_train, y_test = y[train_test_indicator], y[test_indicator]
    
    ridge_lambda_auc = lambda_aucs ("ridge", lambda_vals, x_train, y_train, x_test, y_test)
    lasso_lambda_auc = lambda_aucs ("lasso", lambda_vals, x_train, y_train, x_test, y_test)
    
    aucs ["ridge"].append(ridge_lambda_auc)
    aucs ["lasso"].append(lasso_lambda_auc)    
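
The loop above draws each split without replacement. A bootstrap in the stricter sense resamples rows with replacement and can evaluate on the rows that were never drawn (the out-of-bag rows). The following is a minimal sketch of that variant, not what the notebook does; the dataset size and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
n_samples = 10                  # hypothetical dataset size

# Bootstrap sample: draw n_samples row indices *with* replacement.
boot_idx = rng.integers(0, n_samples, size=n_samples)

# Out-of-bag rows: those never drawn; they can serve as the test set.
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[boot_idx] = False
oob_idx = np.flatnonzero(oob_mask)

print("train rows:", np.sort(boot_idx))
print("test rows: ", oob_idx)
```

With a pandas DataFrame, the two index arrays could be applied as `X.iloc[boot_idx]` and `X.iloc[oob_idx]`.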
    


## III. Plotting


**(do not change if not necessary)**

We plot the first 20 AUC vs. lambda curves. The logistic regression curve is constant across lambda values, because that model does not use the lambda grid.

In [5]:
import matplotlib.pyplot as plt


def plot_auc_subplot(ax, auc, model):
    best_lambda_index = np.argmax(auc)
    ax.scatter(range(1, len(auc)+1), auc, color='black')
    ax.plot(range(1, len(auc)+1), auc, linewidth=0.2, color = 'black')
    ax.scatter(best_lambda_index+1, auc[best_lambda_index], color='blue', linewidth=7)
    
    plt.sca(ax)
    plt.title(model)
    plt.xlabel('lambda values')
    plt.xticks(range(1, len(lambda_vals)+1), lambda_vals)
    plt.ylabel('AUC')
    

for i in range(1,21):
    fig, axs = plt.subplots(1,3,figsize=(16,10))
    fig.suptitle("AUC vs lambda value (best value is blue)", fontsize=14)
    for j in range(i):
        auc_ridge = np.asarray(aucs["ridge"][j])
        auc_lasso = np.asarray(aucs["lasso"][j])
        auc_logistic = np.asarray(aucs["logistic"][j])

        plot_auc_subplot(axs[0], auc_ridge, "ridge")
        plot_auc_subplot(axs[1], auc_lasso, "lasso")
        plot_auc_subplot(axs[2], auc_logistic, "logistic")


IndexError: list index out of range

This is a reference plot of the logistic regression AUC curve. If you obtain a similar curve (the exact values will differ because of the randomness), your implementation should be correct:

![image.png](attachment:image.png)

To get the mean AUC for each lambda value, we build a DataFrame from the per-round results of each model and take the mean of each column:

In [None]:
ridge_aucs=pd.DataFrame(aucs["ridge"])
lasso_aucs=pd.DataFrame(aucs["lasso"])
logistic_aucs=pd.DataFrame(aucs["logistic"])

print ("ridge mean AUCs:")
for lambda_val, ridge_auc in zip(lambda_vals, ridge_aucs.mean()):
    print (lambda_val, "AUC:", "%.2f"%ridge_auc)
    
    
print ("\nlasso mean AUCs:")
for lambda_val, lasso_auc in zip(lambda_vals, lasso_aucs.mean()):
    print (lambda_val, "AUC:", "%.2f"%lasso_auc)
    
print ("\nlogistic mean AUCs:")
for lambda_val, logistic_auc in zip(lambda_vals, logistic_aucs.mean()):
    print (lambda_val, "AUC:", "%.2f"%logistic_auc)
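
As a small illustration of the averaging step, the sketch below uses made-up AUC values and a hypothetical three-value grid. Each row is one resampling round and each column is one lambda, so `DataFrame.mean()` returns one mean per lambda and `idxmax()` returns the column label of the best one.

```python
import pandas as pd

lambdas_demo = [0.1, 1, 10]  # hypothetical grid
# Two made-up resampling rounds, one AUC per lambda in each round.
demo = pd.DataFrame([[0.60, 0.70, 0.50],
                     [0.80, 0.90, 0.70]])

col_means = demo.mean()        # averages down the rows, one value per lambda
best_col = col_means.idxmax()  # column label of the highest mean AUC

print(col_means.round(2).tolist())  # [0.7, 0.8, 0.6]
print(lambdas_demo[best_col])       # 1
```

Because the columns carry the default integer labels, the label returned by `idxmax()` can index the lambda list directly.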

Now we plot the histograms of the best-performing ridge, lasso, and logistic models:

In [None]:
n_bins = 20

fig, axs = plt.subplots(1, 3, sharey=True)

axs[0].hist(ridge_aucs[ridge_aucs.mean().idxmax()].values, bins=n_bins)
axs[1].hist(lasso_aucs[lasso_aucs.mean().idxmax()].values, bins=n_bins)
axs[2].hist(logistic_aucs[logistic_aucs.mean().idxmax()].values, bins=n_bins)

fig.suptitle("AUC Histograms", fontsize=14)

def set_axis_title(ax, title):
    plt.sca(ax)
    plt.title(title)

# Raw strings keep "\l" from being read as an escape; %g shows small lambdas such as 0.001.
set_axis_title(axs[0], r"ridge with $\lambda=%g$"%lambda_vals[ridge_aucs.mean().idxmax()])
set_axis_title(axs[1], r"lasso with $\lambda=%g$"%lambda_vals[lasso_aucs.mean().idxmax()])
set_axis_title(axs[2], r"logistic with $\lambda=%g$"%lambda_vals[logistic_aucs.mean().idxmax()])