# 6.5 Lab 1: Subset Selection Methods

In [None]:
from itertools import combinations
import statsmodels.api as sm
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

## 6.5.1 Best Subset Selection

Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year. First of all, we note that the Salary variable is missing for some of the players. In R, the is.na() function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a TRUE for any elements that are missing, and a FALSE for non-missing elements. The sum() function can then be used to count all of the missing elements. The pandas equivalents are isna() and sum().

In [None]:
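df = (
    sm.datasets.get_rdataset("Hitters", "ISLR", cache=True)
    .data
    # dtype=float keeps the dummy columns numeric; recent pandas versions return
    # bool dummies, which can make statsmodels reject the design matrix
    .pipe(pd.get_dummies, columns=["League", "Division", "NewLeague"], drop_first=True, dtype=float)
)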

In [None]:
df.head()

In [None]:
df['Salary'].isna().sum()

Hence we see that Salary is missing for 59 players. In R, the na.omit() function removes all of the rows that have missing values in any variable. Below, dropna(subset=["Salary"]) drops only the rows where Salary is missing.

In [None]:
df = df.dropna(subset=["Salary"])

In [None]:
df['Salary'].isna().sum()

The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.

Fun times, it doesn't look like Python has an equivalent library, so I guess I'm coding this by hand.

In [None]:
y = df["Salary"]
X = df.drop(columns=["Salary"])

In [None]:
# It's too slow to go all the way up to 8 variables, so let's just go up to 4. I'll get the point
def modrsquared(coltuple, x, y):
    """Return R^2 of an OLS fit of y on the given columns of x (plus an intercept)."""
    lm = sm.OLS(y, sm.add_constant(x[list(coltuple)])).fit()
    return lm.rsquared

def best_subset(max_vars, x, y):
    """For each size 1..max_vars, return the column tuple with the highest R^2."""
    models = dict()
    for i in range(1, max_vars + 1):
        # For a fixed model size, the highest R^2 is the same as the lowest RSS
        models[i] = max(combinations(x.columns, i), key=lambda cols: modrsquared(cols, x, y))
    return models

models = best_subset(4, X, y)
models
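To see why the exhaustive search slows down so quickly, it helps to count the fits. This is a small sketch that assumes the 19 predictors left after dummy coding and uses only the standard library; `n_fits` is a hypothetical helper name, not part of any package.

```python
from math import comb

P = 19  # assumed number of candidate predictors after dummy coding


def n_fits(max_vars, p=P):
    """Number of OLS fits best subset selection needs for sizes 1..max_vars."""
    return sum(comb(p, k) for k in range(1, max_vars + 1))


for max_vars in (4, 8, 11, 19):
    print(max_vars, n_fits(max_vars))
```

Under that assumption, stopping at 4 variables costs 5,035 fits, while going up to 11 costs 430,103, which is why the larger runs take so long.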

The summary() function also returns $R^2$, RSS, adjusted $R^2$, $C_p$, and BIC. We can examine these to try to select the best overall model. For instance, we see that the $R^2$ statistic increases from 32%, when only one variable is included in the model, to almost 55%, when all variables are included. As expected, the $R^2$ statistic increases monotonically as more variables are included. Plotting RSS, adjusted $R^2$, $C_p$, and BIC for all of the models at once will help us decide which model to select. Note the type="l" option tells R to connect the plotted points with lines.

In [None]:
# Statsmodels has AIC but not C_p. ISLR notes the two are proportional for least squares models, so I'll use AIC as a stand-in
result_df = pd.DataFrame()
for i in models.keys():
    lm = sm.OLS(y, sm.add_constant(X[list(models[i])])).fit()
    result_df.loc[i, "R_square"] = lm.rsquared
    result_df.loc[i, "adj_R_square"] = lm.rsquared_adj
    result_df.loc[i, "RSS"] = lm.ssr  # residual sum of squares (mse_resid would be RSS / df_resid)
    result_df.loc[i, "AIC"] = lm.aic
    result_df.loc[i, "BIC"] = lm.bic
result_df
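If I did want $C_p$ itself, it is easy to compute from RSS. The sketch below uses the definitions from ISLR, $C_p = \frac{1}{n}(\mathrm{RSS} + 2d\hat\sigma^2)$ and adjusted $R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}$, on synthetic data with plain numpy. The helper names `mallows_cp`, `adjusted_r2`, and `rss_for` are my own, and $\hat\sigma^2$ is assumed to come from the full model.

```python
import numpy as np


def mallows_cp(rss, d, n, sigma2):
    """C_p = (RSS + 2 * d * sigma2) / n, with sigma2 an estimate of the error variance."""
    return (rss + 2 * d * sigma2) / n


def adjusted_r2(rss, tss, d, n):
    """Adjusted R^2 for a model with d predictors fitted on n observations."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))


# Synthetic data: only the first two columns carry signal
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)


def rss_for(cols):
    """RSS of an OLS fit (with intercept) on the given column indices."""
    A = np.column_stack([np.ones(n), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))


tss = float(np.sum((y - y.mean()) ** 2))
sigma2 = rss_for(list(range(p))) / (n - p - 1)  # error variance estimated from the full model

# Nested models using the first d columns, d = 1..p
cp = {d: mallows_cp(rss_for(list(range(d))), d, n, sigma2) for d in range(1, p + 1)}
adj = {d: adjusted_r2(rss_for(list(range(d))), tss, d, n) for d in range(1, p + 1)}
print(cp)
print(adj)
```

Lower $C_p$ and higher adjusted $R^2$ both point to the better model; the values are on a different scale from the likelihood-based `aic` and `bic` that statsmodels reports.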

In [None]:
cdf = result_df.reset_index().melt(id_vars=["index"])
sns.relplot(x="index", y="value", col="variable", kind="line", facet_kws={"sharey": False}, data=cdf);

## 6.5.2 Forward and Backward Stepwise Selection
We can also use the ```regsubsets()``` function to perform forward stepwise or backward stepwise selection, using the argument ```method="forward"``` or ```method="backward"```.

Sweet, we don't have this in python either. 
I'll base my implementation on [this](https://planspace.org/20150423-forward_selection_with_statsmodels/)

In [None]:
def forward_selected(x, y, maxvars):
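    """Linear model designed by forward selection.

    Parameters:
    -----------
    x: DataFrame, potential exogenous variables
    y: Series, variable to predict
    maxvars: int, maximum number of variables to select

    Returns a dict mapping model size to the list of selected columns.
    Selection stops early once adding a variable no longer improves adjusted R^2.
    """
    remaining = set(x.columns)
    selected = []
    models = {}
    current_score, best_new_score = 0.0, 0.0
    # Strict inequality so that at most maxvars variables are ever selected
    while remaining and len(selected) < maxvars and current_score == best_new_score: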
        scores_with_candidates = []
        for candidate in remaining:
            X_candidate = sm.add_constant(x[selected + [candidate]])
            score = sm.OLS(y, X_candidate).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
            models[len(selected)] = selected[:]
    return models

In [None]:
forward_models = forward_selected(X, y, maxvars=20)

In [None]:
def backward_selected(x, y):
    """Linear model designed by backward selection.

    Parameters:
    -----------
    x: DataFrame, potential exogenous variables
    y: Series, variable to predict

    Returns a dict mapping model size to the list of columns still in the model.
    """
    selected = list(x.columns)
    models = {}
    while len(selected) > 1:
        scores_with_candidates = []
        for candidate in selected:
            # Score the model that drops this one candidate
            kept = [col for col in selected if col != candidate]
            score = sm.OLS(y, sm.add_constant(x[kept])).fit().rsquared_adj
            # Store the column name, not the DataFrame, so ties can be compared safely
            scores_with_candidates.append((score, candidate))
        # Drop the variable whose removal leaves the highest adjusted R^2
        best_score, dropped = max(scores_with_candidates)
        selected.remove(dropped)
        models[len(selected)] = selected[:]
    return models

In [None]:
backward_models = backward_selected(X, y)

In [None]:
backward_models[7]

In [None]:
forward_models[7]

## 6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation

We just saw that it is possible to choose among a set of models of different sizes using $C_p$, BIC, and adjusted $R^2$. We will now consider how to do this using the validation set and cross-validation approaches.

In order for these approaches to yield accurate estimates of the test error, we must use *only the training observations* to perform all aspects of model-fitting - including variable selection. Therefore, the determination of which model of a given size is best must be made using *only the training observations*. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error. 

In order to use the validation set approach, we begin by splitting the observations into a training set and a test set. 

Now we apply ```regsubsets()``` to the training set in order to perform best subset selection.

Notice that we subset the ```Hitters``` data frame directly in the call in order to access only the training subset of the data, using the expression ```Hitters[train,]```. We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.

The ```model.matrix()``` function is used in many regression packages for building an "X" matrix from data. Now we run a loop, and for each size ```i``` we extract the coefficients from ```regfit.best``` for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.

We find that the best model is the one that contains ten variables.

This was a little tedious, partly because there is no ```predict()``` method for ```regsubsets()```. Since we will be using this function again, we can capture our steps above and write our own predict method.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
# best_subset(11, X_train, y_train)

In [None]:
# That takes forever to run so I'll just embed the results here
best_subset_11 = {1: ('CRuns',),
 2: ('Hits', 'CRBI'),
 3: ('Hits', 'CRuns', 'PutOuts'),
 4: ('AtBat', 'Hits', 'CRuns', 'PutOuts'),
 5: ('Hits', 'CAtBat', 'CHits', 'CHmRun', 'PutOuts'),
 6: ('Hits', 'CAtBat', 'CHits', 'CHmRun', 'PutOuts', 'Division_W'),
 7: ('AtBat', 'Hits', 'CAtBat', 'CRuns', 'CRBI', 'PutOuts', 'League_N'),
 8: ('AtBat',
  'Hits',
  'CAtBat',
  'CRuns',
  'CRBI',
  'PutOuts',
  'League_N',
  'Division_W'),
 9: ('AtBat',
  'Hits',
  'CAtBat',
  'CRuns',
  'CRBI',
  'CWalks',
  'PutOuts',
  'League_N',
  'Division_W'),
 10: ('AtBat',
  'Hits',
  'Walks',
  'CAtBat',
  'CRuns',
  'CRBI',
  'CWalks',
  'PutOuts',
  'League_N',
  'Division_W'),
 11: ('AtBat',
  'Hits',
  'Walks',
  'Years',
  'CAtBat',
  'CRuns',
  'CRBI',
  'CWalks',
  'PutOuts',
  'League_N',
  'Division_W')}
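The loop described earlier (for each size, fit the chosen columns on the training rows only, then compute the test MSE) can be sketched as below. This is a numpy-only illustration on synthetic data, not a run on Hitters: `validation_mse` is a hypothetical helper, and the `models` dict here uses column indices where the notebook's dict uses column names.

```python
import numpy as np


def validation_mse(models, X_train, y_train, X_test, y_test):
    """Fit OLS on the training rows for each column subset; return test MSE by model size."""
    errors = {}
    for size, cols in models.items():
        cols = list(cols)
        A_train = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
        A_test = np.column_stack([np.ones(len(X_test)), X_test[:, cols]])
        # Coefficients come from the training rows only
        beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
        errors[size] = float(np.mean((y_test - A_test @ beta) ** 2))
    return errors


# Synthetic stand-in for the train/test split
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = X[:134], X[134:], y[:134], y[134:]

models = {1: (0,), 2: (0, 1), 3: (0, 1, 2)}  # size -> column indices (assumed already selected)
errors = validation_mse(models, X_train, y_train, X_test, y_test)
best_size = min(errors, key=errors.get)
print(errors, best_size)
```

With the real data the same idea applies with `X_train[list(cols)]` on the DataFrame and the `best_subset_11` dict, keeping in mind that the subsets themselves must also have been chosen on the training rows.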

Summarizing, the next part is to do best model selection with k-fold cross-validation. For $k=10$ we want a $10 \times 19$ matrix holding, for each of the 10 folds, the test MSE of the best model with 1 through 19 variables. This is going to take forever to run so I'm going to skip it.
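For reference, here is roughly what that fold-by-size error matrix looks like in code. To keep it fast, this sketch swaps best subset for greedy forward selection by training error and runs on synthetic data with numpy only, so it shows the shape of the procedure, not the result for Hitters. All function names are my own.

```python
import numpy as np


def ols_fit(X, y, cols):
    """OLS coefficients (intercept first) for the given column indices."""
    A = np.column_stack([np.ones(len(X)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def ols_mse(X, y, cols, beta):
    """Mean squared error of the fitted coefficients on the rows of X, y."""
    A = np.column_stack([np.ones(len(X)), X[:, cols]])
    return float(np.mean((y - A @ beta) ** 2))


def forward_path(X, y, max_size):
    """Greedy forward selection by training MSE; returns {size: [column indices]}."""
    selected, path = [], {}
    remaining = list(range(X.shape[1]))
    for size in range(1, max_size + 1):
        def train_mse(c):
            cols = selected + [c]
            return ols_mse(X, y, cols, ols_fit(X, y, cols))
        best = min(remaining, key=train_mse)
        selected = selected + [best]
        remaining.remove(best)
        path[size] = selected
    return path


def cv_errors(X, y, k, max_size, seed=0):
    """Return a (k, max_size) matrix of held-out MSE: one row per fold, one column per size."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = np.empty((k, max_size))
    for j, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != j])
        # Variable selection is redone inside every fold, using training rows only
        path = forward_path(X[train_idx], y[train_idx], max_size)
        for size, cols in path.items():
            beta = ols_fit(X[train_idx], y[train_idx], cols)
            errors[j, size - 1] = ols_mse(X[test_idx], y[test_idx], cols, beta)
    return errors


rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

errors = cv_errors(X, y, k=10, max_size=5)
mean_err = errors.mean(axis=0)  # average over folds, one value per model size
print(mean_err)
```

Averaging the matrix over its rows gives one cross-validation error per model size, and the size with the smallest average is the one to refit on the full data.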

# 6.6 Lab 2: Ridge Regression and the Lasso

We will use the glmnet package in order to perform ridge regression and the lasso. The main function in this package is glmnet(), which can be used to fit ridge regression models, lasso models, and more. This function has slightly different syntax from other model-fitting functions that we have encountered thus far in this book. In particular, we must pass in an x matrix as well as a y vector, and we do not use the y ∼ x syntax. We will now perform ridge regression and the lasso in order to predict Salary on the Hitters data. Before proceeding, ensure that the missing values have been removed from the data as described above in section 6.5.

In [None]:
df.head()

In [None]:
y = df["Salary"]
X = df.drop(columns=["Salary"])
lambdas = 10**np.linspace(10,-2,100)

In [None]:
y_sk = y.to_numpy().reshape(-1, 1)
yscaler = StandardScaler(with_mean=False).fit(y_sk)
y_rescale = yscaler.scale_
y_scaled = yscaler.transform(y_sk)

In [None]:
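# lam = len(X) * lambdas[49] / 2  # earlier attempt at matching glmnet's scaling, not used
lam = lambdas[49]
# Ridge's `normalize` argument was removed in scikit-learn 1.2, so standardize in a pipeline instead.
# The coefficients below are on the standardized scale and will differ from older normalize=True runs.
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge(alpha=lam, fit_intercept=True))])
pipe.fit(X, y)
ridge = pipe.named_steps["ridge"]
print(ridge.intercept_)
ridge.coef_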

The model.matrix() function is particularly useful for creating x; not only does it produce a matrix corresponding to the 19 predictors but it also automatically transforms any qualitative variables into dummy variables. The latter property is important because glmnet() can only take numerical, quantitative inputs. Here pd.get_dummies() played that role when df was built.

## 6.6.1 Ridge Regression

The ```glmnet()``` function has an alpha argument that determines what type of model is fit. If ```alpha=0``` then a ridge regression model is fit, and if ```alpha=1``` then a lasso model is fit. We first fit a ridge regression model.
By default the ```glmnet()``` function performs ridge regression for an automatically selected range of $\lambda$ values. However, here we have chosen to implement the function over a grid of values ranging from $\lambda = 10^{10}$ to $\lambda = 10^{-2}$, essentially covering the full range of scenarios from the null model containing only the intercept, to the least squares fit. As we will see, we can also compute model fits for a particular value of $\lambda$ that is not one of the original grid values. Note that by default, the ```glmnet()``` function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument ```standardize=FALSE```.
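A quick check of what that grid contains, using numpy only. `np.logspace` is just another way to write the same grid as the `10**np.linspace(10,-2,100)` expression used in this notebook.

```python
import numpy as np

lambdas = 10 ** np.linspace(10, -2, 100)  # 100 values from 10^10 down to 10^-2
same_grid = np.logspace(10, -2, 100)      # equivalent construction

print(lambdas[0], lambdas[-1])
# 50th and 60th grid values (indices 49 and 59), rounded to the nearest integer
print(int(round(float(lambdas[49]))), int(round(float(lambdas[59]))))
```

The 50th and 60th values come out at about 11,498 and 705, which are the two penalties hard-coded in the cells at the end of this section.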

Associated with each value of $\lambda$ is a vector of ridge regression coefficients, stored in a matrix that can be accessed by ```coef()```. In this case, it is a 20×100 matrix, with 20 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of $\lambda$).
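To make the coefficient matrix concrete, here is a minimal numpy sketch of ridge in closed form, $\hat\beta_\lambda = (X^TX + \lambda I)^{-1}X^Ty$, on standardized predictors and a centered response. It uses synthetic data with 3 predictors and leaves the intercept out of the matrix, so the result is 3×100 where the lab's is 20×100. The penalty scale follows the $\mathrm{RSS} + \lambda\lVert\beta\rVert_2^2$ objective, which is an assumption and not necessarily glmnet's parameterization.

```python
import numpy as np


def ridge_closed_form(X, y, lam):
    """Ridge coefficients on standardized X and centered y (centering absorbs the intercept)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

lambdas = 10 ** np.linspace(10, -2, 100)
# One column of coefficients per lambda, largest penalty first
coef_path = np.column_stack([ridge_closed_form(X, y, lam) for lam in lambdas])
l2_norms = np.linalg.norm(coef_path, axis=0)

print(coef_path.shape)
print(l2_norms[0], l2_norms[-1])
```

The $\ell_2$ norm of the coefficients is essentially zero at $\lambda = 10^{10}$ and grows toward the least squares solution as $\lambda$ shrinks, which is the pattern the lab goes on to inspect at individual grid values.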

Something about this isn't fitting the same as glmnet. It's a difference in what the two libraries are optimizing for, see [Cross Validated](https://stats.stackexchange.com/questions/160096/what-are-the-differences-between-ridge-regression-using-rs-glmnet-and-pythons#160213)

In [None]:
lambdas = 10**np.linspace(10,-2,100)
ridge_coeffs = pd.DataFrame(index=lambdas, columns=X.columns, dtype=float)

for lam in lambdas:
    # Standardize the predictors, then fit ridge with penalty lam
    # (the `normalize` argument was removed from Ridge in scikit-learn 1.2)
    pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge(alpha=lam))])
    pipe.fit(X, y_scaled)
    # y_scaled is a column vector, so coef_ has shape (1, n_features); flatten it to one row
    ridge_coeffs.loc[lam] = pipe.named_steps["ridge"].coef_.ravel()
ridge_coeffs

In [None]:
lam = 11_498 / 2
scaler = StandardScaler(with_mean=False)
ridge = Ridge(alpha=lam)  # the `normalize` argument was removed in scikit-learn 1.2
pipe = Pipeline([("scaler", scaler), ("ridge", ridge)])
pipe.fit(X, y_scaled)
coef = pipe.named_steps["ridge"].coef_.ravel()
# Undo both scalings: divide by each predictor's scale and multiply by the response's scale
coef * y_rescale / pipe.named_steps["scaler"].scale_

In [None]:
lam = 705 / 2
scaler = StandardScaler(with_mean=False)
ridge = Ridge(alpha=lam)  # the `normalize` argument was removed in scikit-learn 1.2
pipe = Pipeline([("scaler", scaler), ("ridge", ridge)])
pipe.fit(X, y_scaled)
coef = pipe.named_steps["ridge"].coef_.ravel()
# Undo both scalings: divide by each predictor's scale and multiply by the response's scale
coef * y_rescale / pipe.named_steps["scaler"].scale_
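A note on moving coefficients between scales, since it is easy to get the direction wrong. If the model is fitted on $X / s$, then $X_{scaled}\,b = X\,(b / s)$, so original-scale coefficients are the fitted ones divided by the scale. The sketch below checks this with plain least squares on synthetic data (no penalty, so the two fits are exactly equivalent); the helper `ols` is my own.

```python
import numpy as np

rng = np.random.default_rng(2)
# Three predictors on very different scales
X = rng.normal(size=(80, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([3.0, 0.2, -0.01]) + rng.normal(scale=0.1, size=80)

scale = X.std(axis=0)
X_scaled = X / scale  # same transformation as StandardScaler(with_mean=False)


def ols(A, b):
    """Least squares slopes (intercept fitted but not returned)."""
    A1 = np.column_stack([np.ones(len(A)), A])
    beta, *_ = np.linalg.lstsq(A1, b, rcond=None)
    return beta[1:]


coef_scaled = ols(X_scaled, y)
coef_original = ols(X, y)

# Dividing by the scale recovers the original-scale slopes
print(np.allclose(coef_scaled / scale, coef_original))
```

With a ridge penalty the fit on scaled predictors is no longer equivalent to a fit on the raw ones, but the same division still expresses the fitted coefficients in the original units. If the response was scaled as well, the coefficients also need multiplying by the response's scale.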