# Subset Selection
## CMSE 381 - Fall 2023
## Oct 13, 2023. Lecture 17



In [None]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [None]:
# First, we're going to do all the data loading we've had for a while for this data set
auto = pd.read_csv('../../DataSets/Auto.csv')
auto = auto.replace('?', np.nan)
auto = auto.dropna()
auto.horsepower = auto.horsepower.astype('int')

# This shuffles my data set in advance so that I don't need to worry about it later
auto = auto.sample(frac=1).reset_index(drop=True)


auto.head()


Let's try to run subset selection on the `auto` data set! We're going to use `cylinders`, `horsepower`, `weight`, and `acceleration` to predict `mpg`. 

In [None]:
inputvars = ['cylinders','horsepower','weight', 'acceleration']

The first tool we are going to use is the `itertools` package, which gives us a way to get subsets of whatever size we want using the `combinations` command.  

In [None]:
from itertools import combinations

The weird thing is that it returns an iterator, so if I just try to print it, I only see the iterator object and not the subsets themselves. 

In [None]:
combinations(inputvars,2)

But if I use it in a for loop it does what I want!

In [None]:
for x in combinations(inputvars,2):
    print(x)
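
If you would rather see all the subsets at once, one option is to wrap the iterator in `list`. A small sketch, with the variable list typed in again so the snippet runs on its own:

```python
from itertools import combinations

# Hard-coded copy of the variable names used above, so this runs on its own
inputvars = ['cylinders', 'horsepower', 'weight', 'acceleration']

# list() consumes the iterator and stores every pair as a tuple
pairs = list(combinations(inputvars, 2))
print(len(pairs))   # 4 choose 2 = 6 pairs
print(pairs[0])     # ('cylinders', 'horsepower')
```

The tuples come out in the same order as the input list, so the first pair is always built from the first two variables.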

Here's some code stolen from the last few days to run linear regression on a subset of the input variables. 

In [None]:
def myscore(df,listofvars, outputvar = 'mpg'):
    X = df[list(listofvars)]
    y = df[outputvar]
    
    #build linear regression model
    model = LinearRegression()
    

    #use 5-fold CV to evaluate model
    scores = cross_val_score(model, X,y, 
                             scoring='neg_mean_squared_error',
                             cv=5)

    #return the average CV mean squared error (the scores are negative MSE, so take absolute values)
    return np.average(np.absolute(scores))
    

myvars = ('cylinders', 'acceleration')
myscore(auto,myvars)
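
The `neg_mean_squared_error` scorer reports the negative of the MSE on each fold, which is why `myscore` takes absolute values before averaging. Here is a minimal sketch of that sign convention. The data is synthetic (an assumption made only so the snippet runs without the `auto` file), so the numbers say nothing about the real data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption, not the auto data): y = 3x + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_squared_error', cv=5)

print(scores)            # five values, one per fold, all <= 0
cv_mse = -scores.mean()  # flip the sign to get the usual MSE
print(cv_mse)
```

Since every fold score is at most zero, flipping the sign of the mean gives the same number as averaging the absolute values.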


&#9989; **<font color=red>Do this:</font>** Modify the code below as follows: 
- Set up two nested for loops to get every subset of size $p \in \{1,\cdots,4\}$ of the list of variables I want to use
- For each of these subsets, use the `myscore` function to get the k-fold CV error.
- Append it into a data frame as shown
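
As a warm-up for the loop structure, here is a sketch on a toy list of three letters (not the `auto` variables). The outer loop runs over subset sizes and the inner loop runs over the subsets of that size; the scoring step is left out on purpose.

```python
from itertools import combinations

letters = ['a', 'b', 'c']   # toy list, not the auto variables

subsets = []
for size in range(1, len(letters) + 1):         # outer loop: subset size
    for subset in combinations(letters, size):  # inner loop: subsets of that size
        subsets.append(subset)

print(len(subsets))  # 7 = 2**3 - 1 nonempty subsets
```

By the same count, four variables give $2^4 - 1 = 15$ nonempty subsets.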


In [None]:
myvars = []
myscores = []

#-----
# your loop goes in here
#-----
        
myResults = pd.DataFrame({'Vars':myvars, 'Score':myscores})
myResults

We got all our main subsets; we're just missing the null model. This is the model that predicts the sample mean `mpg` for any input data. 

&#9989; **<font color=red>Do this:</font>** What is the MSE on our data set if we just predict the mean for every data point? Add this entry to your `myResults` data frame.

*Hint: you can get a numpy array with every entry being the same output by using the `np.full` command.*
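
To see how `np.full` fits in, here is a sketch on four made-up response values (not the `auto` data). It builds a constant prediction and computes the MSE by hand.

```python
import numpy as np

# Toy response values (made up for illustration, not the auto data)
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

# Null model: predict the sample mean for every observation
y_pred = np.full(len(y_toy), y_toy.mean())

# MSE of the constant prediction
null_mse = np.mean((y_toy - y_pred) ** 2)
print(null_mse)  # 125.0
```

Note that this is the MSE on the same data used to compute the mean, not a cross-validated error, and it equals the variance of `y_toy` with NumPy's default divisor of $n$.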

In [None]:
## Your code here ##

nullscore = np.nan #<---- fix this to get your score! Then run
                   #      the cell below to append it to your 
                   #      dataframe. 

In [None]:
newresult = pd.DataFrame({'Vars':['empty'], 'Score':[nullscore]})

myResults = pd.concat([newresult,myResults],ignore_index = True)

# If you print it out now, you should have 16 models scored
myResults

&#9989; **<font color=red>Do this:</font>** What is the minimum score over all subsets? Which model makes it happen? The `idxmin` command will likely be useful for this. 
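
For reference, here is how `idxmin` behaves on a small made-up data frame (the scores below are invented and are not results from the `auto` data). It returns the index label of the smallest entry, which can then be passed to `.loc`.

```python
import pandas as pd

# Made-up scores, only to show how idxmin behaves
toy = pd.DataFrame({'Vars': ['a', 'b', 'c'], 'Score': [3.0, 1.5, 2.0]})

best_row = toy['Score'].idxmin()   # index label of the smallest score
print(best_row)                    # 1
print(toy.loc[best_row, 'Vars'])   # b
```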

In [None]:
# Your code here #

## Stretch project

Have some free time? 

&#9989; **<font color=red>Stretch project:</font>** Figure out how to do this for forward selection. 

- First, try to write pseudo code for how to search through these subsets. 
- Then if you still have time, see if you can get it working on this set of four variables. 
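
If you get stuck, here is one possible sketch of the greedy search, written against a generic scoring function and not tied to the `auto` data. The names `forward_selection`, `toy_score`, and `gains` are hypothetical and exist only for this illustration; the toy score simply lowers the error by a fixed amount for each variable included.

```python
def forward_selection(candidates, score):
    """Greedy forward selection.

    candidates: list of variable names.
    score: function taking a tuple of names and returning an error
           (lower is better).
    Returns a list of (variables, error) pairs, one per model size.
    """
    chosen = []
    remaining = list(candidates)
    path = []
    while remaining:
        # Try adding each remaining variable to the current set
        trials = [(score(tuple(chosen + [v])), v) for v in remaining]
        best_err, best_var = min(trials)
        chosen.append(best_var)
        remaining.remove(best_var)
        path.append((tuple(chosen), best_err))
    return path


# Hypothetical scoring function for illustration only
gains = {'a': 5.0, 'b': 1.0, 'c': 3.0}

def toy_score(variables):
    return 10.0 - sum(gains[v] for v in variables)

path = forward_selection(['a', 'b', 'c'], toy_score)
for variables, err in path:
    print(variables, err)
```

With four variables this search scores $4 + 3 + 2 + 1 = 10$ models, fewer than the 15 that best subset selection checks. To use it in this notebook, the `score` argument would need to be a function of the variable tuple alone that returns the k-fold CV error.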

In [None]:
# Your code here #



-----
### Congratulations, we're done!
Written by Dr. Liz Munch, Michigan State University

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.