In [None]:
from itertools import combinations
import statsmodels.api as sm
import pandas as pd

Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball player's Salary on the basis of various statistics associated with performance in the previous year. First of all, we note that the Salary variable is missing for some of the players. The isna() method (the pandas counterpart of R's is.na()) can be used to identify the missing observations. It returns an object of the same shape as its input, with True for any elements that are missing and False for non-missing elements. The sum() method can then be used to count all of the missing elements.

In [None]:
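df = (
    sm.datasets.get_rdataset("Hitters", "ISLR", cache=True)
    .data
    # dtype=float keeps the dummy columns numeric; recent pandas versions
    # default to bool dummies, which sm.OLS may reject alongside numeric columns
    .pipe(pd.get_dummies, columns=["League", "Division", "NewLeague"], drop_first=True, dtype=float)
)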

In [None]:
df.head()

In [None]:
df['Salary'].isna().sum()

Hence we see that Salary is missing for 59 players. The dropna() method (the pandas counterpart of R's na.omit()) removes rows that have missing values; passing subset=["Salary"] restricts the check to the Salary column.

In [None]:
df = df.dropna(subset=["Salary"])

In [None]:
df['Salary'].isna().sum()
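
As a small illustration of the difference between dropping rows with a missing value in any column and dropping only rows with a missing Salary, here is a sketch on a made-up four-row frame. The `toy` data and its values are hypothetical and are not taken from Hitters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Hitters set
toy = pd.DataFrame({
    "Hits": [81.0, np.nan, 130.0, 141.0],
    "Salary": [475.0, 480.0, np.nan, 500.0],
})

# Count the missing values in each column
missing_counts = toy.isna().sum()

# Drop rows with a missing value in any column (the behaviour of na.omit() in R)
dropped_any = toy.dropna()

# Drop only the rows whose Salary is missing
dropped_salary = toy.dropna(subset=["Salary"])

print(missing_counts)
print(len(dropped_any), len(dropped_salary))
```

In Hitters, Salary should be the only column with missing values, so both calls would be expected to keep the same rows there; the subset form just makes the intent explicit.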

In R, the regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. Its syntax is the same as for lm(), and the summary() command outputs the best set of variables for each model size.

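Fun times: it doesn't look like Python has an equivalent library, so I'm coding this by hand. For a fixed number of predictors, the model with the highest R-squared is also the one with the lowest RSS, so ranking candidates by R-squared gives the same winner.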

In [None]:
y = df["Salary"]
X = df.drop(columns=["Salary"])

In [None]:
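def modrsquared(coltuple):
    # Fit OLS with an intercept on the given columns and return its R-squared
    lm = sm.OLS(y, sm.add_constant(X[list(coltuple)])).fit()
    return lm.rsquared

# For each model size, keep the column subset with the highest R-squared
# (equivalently the lowest RSS, since the total sum of squares is fixed).
# Sizes 1 through 8 match the default nvmax=8 of regsubsets().
# This is an exhaustive search (roughly 170,000 fits with 19 predictors), so it is slow.
models = dict()
for i in range(1, 9):
    i_models = {cols: modrsquared(cols) for cols in combinations(X.columns, i)}
    models[i] = max(i_models, key=i_models.get)
models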
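
The search above ranks subsets by R-squared, while regsubsets() is described in terms of RSS. As a quick sanity check that the two criteria pick the same subset at each size, here is a minimal sketch on synthetic data using only NumPy. The names `X_toy`, `y_toy`, and `rss` are made up for this illustration, and the data are simulated, not Hitters.

```python
from itertools import combinations

import numpy as np

# Simulated data (an assumption for illustration): only columns 0 and 3 drive y
rng = np.random.default_rng(0)
n, p = 200, 5
X_toy = rng.normal(size=(n, p))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=n)

def rss(cols):
    """Residual sum of squares of a least-squares fit (with intercept) on cols."""
    A = np.column_stack([np.ones(n), X_toy[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
    resid = y_toy - A @ coef
    return float(resid @ resid)

# Total sum of squares is the same for every candidate model
tss = float(((y_toy - y_toy.mean()) ** 2).sum())

by_rss, by_r2 = {}, {}
for k in range(1, p + 1):
    subsets = list(combinations(range(p), k))
    # Best subset of size k by lowest RSS
    by_rss[k] = min(subsets, key=rss)
    # Best subset of size k by highest R-squared = 1 - RSS / TSS
    by_r2[k] = max(subsets, key=lambda c: 1.0 - rss(c) / tss)

print(by_rss)
print(by_rss == by_r2)
```

Because R-squared is a decreasing function of RSS once the response is fixed, the two dictionaries should agree. Comparing models of different sizes is a separate question that neither criterion answers on its own, since both always favour the larger model.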