
# AML 22-23 S2W5 Webinar: Automated Feature Selection

## Some Take Home Messages from This Week's Lecture/Lab

* **human-in-the-loop**: feature selection is a tool; automate the tasks, but validate, cross-check, sanity-check, tweak, and override the results where needed.

* different stages/**iterations** have different needs
    - you might start with broad strokes, and then refine.

* it is another building block for your workflows/pipelines: you might use it across iterations when exploring, or in several steps of automated data processing (in a pipeline), with different configurations/strategies.

* you rarely run out of improvements/things to do. It is a matter of time/resource management. Have you got the budget? What would you gain from more sophisticated models, feature selection, and so on?

* strategies that we have seen:
   - univariate: measure the strength of each feature's relationship with the target; select the $k$ best.
   - RFE: start with all features; build a model; obtain **feature importances** from the model; remove the bottom-$p$ (e.g., $p=1$) features; repeat until the desired $k$ features remain.
   - SFS-forward: start with no features; build a separate model for each candidate feature added to the current set; keep the addition with the best **model score**; repeat until the desired $k$ features are selected.

* the search space grows exponentially with the number of features ($2^n$ possible subsets of $n$ features), so we settle for **approximate solutions**: greedy search and the like.
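
To make the SFS-forward recipe concrete, here is a minimal by-hand sketch of greedy forward selection. The synthetic data from `make_regression`, the helper name `forward_select`, and the choice of `LinearRegression` with 5-fold cross-validated $R^2$ as the model score are all assumptions made for illustration; scikit-learn's own `SequentialFeatureSelector` is the tool to reach for in practice.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# synthetic data (an assumption for illustration): 8 features, 3 of them informative
X_toy, y_toy = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

def forward_select(X, y, k, cv=5):
    """Greedy forward selection: repeatedly add the feature that gives the best CV score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        # one candidate model per feature not yet selected
        candidate_scores = {
            j: cross_val_score(
                LinearRegression(), X[:, selected + [j]], y, cv=cv
            ).mean()
            for j in remaining
        }
        best = max(candidate_scores, key=candidate_scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

chosen = forward_select(X_toy, y_toy, k=3)
print(chosen)  # column indices, in the order they were added
```

For 8 features and $k=3$ this evaluates 8 + 7 + 6 = 21 candidate feature sets, rather than all 56 three-feature subsets; the price is that a greedy search is not guaranteed to find the best subset.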

In [None]:
from itertools import combinations

In [None]:
# number of combinations is the binomial coefficient "n choose k": n! / (k! * (n - k)!)
features = [ 'x0', 'x1', 'x2', 'x3', 'x4']

In [None]:
list(combinations(features, 2))

[('x0', 'x1'),
 ('x0', 'x2'),
 ('x0', 'x3'),
 ('x0', 'x4'),
 ('x1', 'x2'),
 ('x1', 'x3'),
 ('x1', 'x4'),
 ('x2', 'x3'),
 ('x2', 'x4'),
 ('x3', 'x4')]
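
As a quick check on how fast the search space grows, the counts can be computed directly with `math.comb` from the standard library. This is only a small sketch; the feature counts 10, 20, and 40 in the loop are arbitrary illustrative values.

```python
from math import comb

n = 5  # number of features in the toy list above

n_pairs = comb(n, 2)
print(n_pairs)  # 10, matching the ten pairs listed above

# subsets of every size k = 0..n add up to 2**n
n_subsets = sum(comb(n, k) for k in range(n + 1))
print(n_subsets)  # 32

# the total number of subsets doubles with every extra feature
for n_features in (10, 20, 40):
    print(n_features, 2 ** n_features)
```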

## Packages

In [None]:
!pip install --upgrade scikit-learn -q --user
# need to restart kernel, if latest versions not already installed

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# https://seaborn.pydata.org/tutorial/aesthetics.html
sns.set(
    style='ticks',
    context='talk',
    font_scale=0.8,
    rc={'figure.figsize': (8,6)}
)

In [None]:
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, ParameterGrid

In [None]:
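from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # RMSE as the square root of MSE; avoids `squared=False`, which recent scikit-learn releases no longer accept
    return np.sqrt(mean_squared_error(y_true, y_pred))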

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.dummy import DummyRegressor
from statsmodels.formula.api import ols
from sklearn.ensemble import RandomForestRegressor

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
from sklearn.datasets import make_regression

## Ames Dataset

In [None]:
ames = pd.read_csv('https://raw.githubusercontent.com/gerberl/6G7V0017_2223/main/datasets/ames/train.csv')

In [None]:
ames.head()

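| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |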


In [None]:
X, y = ames.drop(columns='SalePrice'), ames['SalePrice']

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

In [None]:
X_num = X.select_dtypes(exclude='object')

In [None]:
X_num.columns

Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold'],
      dtype='object')

In [None]:
X_num.head()

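| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706 | ... | 548 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 |
| 1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978 | ... | 460 | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 |
| 2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486 | ... | 608 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 |
| 3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216 | ... | 642 | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 |
| 4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655 | ... | 836 | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 |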


In [None]:
X_num = X_num.drop(columns='Id')

In [None]:
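# this fails: X_num still contains missing values, which the selector does not accept (hence the imputation step that follows)
selector = SelectKBest(f_regression, k=10).fit(X_num, y)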

ValueError: ignored

In [None]:
imputer = make_pipeline(SimpleImputer()).set_output(transform='pandas').fit(X_num, y)

In [None]:
X_num_filled = imputer.transform(X_num)

In [None]:
# fit the pipeline itself (rather than the step inside it), mirroring the imputer cell above
selector = make_pipeline(
    SelectKBest(f_regression, k=10)
).set_output(transform='pandas').fit(X_num_filled, y)

In [None]:
selector.get_feature_names_out()

array(['OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF',
       '1stFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars',
       'GarageArea'], dtype=object)

In [None]:
X_sel = selector.transform(X_num_filled)

In [None]:
X_sel.head()

| | OverallQual | YearBuilt | YearRemodAdd | TotalBsmtSF | 1stFlrSF | GrLivArea | FullBath | TotRmsAbvGrd | GarageCars | GarageArea |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 2003.0 | 2003.0 | 856.0 | 856.0 | 1710.0 | 2.0 | 8.0 | 2.0 | 548.0 |
| 1 | 6.0 | 1976.0 | 1976.0 | 1262.0 | 1262.0 | 1262.0 | 2.0 | 6.0 | 2.0 | 460.0 |
| 2 | 7.0 | 2001.0 | 2002.0 | 920.0 | 920.0 | 1786.0 | 2.0 | 6.0 | 2.0 | 608.0 |
| 3 | 7.0 | 1915.0 | 1970.0 | 756.0 | 961.0 | 1717.0 | 1.0 | 7.0 | 3.0 | 642.0 |
| 4 | 8.0 | 2000.0 | 2000.0 | 1145.0 | 1145.0 | 2198.0 | 2.0 | 9.0 | 3.0 | 836.0 |


In [None]:
model = RandomForestRegressor().fit(X_sel, y)

In [None]:
scores = cross_val_score(model, X_sel, y, cv=20, scoring='neg_root_mean_squared_error')
scores.mean()*-1, scores.std()

(31133.463378909306, 9234.032999351293)

In [None]:
# ideally, we would fit on training data and estimate performance on held-out test data
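
One possible way to follow up on that remark is sketched below: imputation, selection, and the model go into a single pipeline, so that each cross-validation fold fits the imputer and the selector on its own training part only, and a held-out test set is kept for a final estimate. The synthetic data, the `_demo` variable names, `k=5`, and the forest settings are assumptions for illustration, not the notebook's Ames setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# synthetic stand-in for the numeric columns (an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # knock out roughly 5% of the values

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# imputer -> selector -> model, refitted from scratch within every CV fold
pipe_demo = make_pipeline(
    SimpleImputer(),
    SelectKBest(f_regression, k=5),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

cv_scores = cross_val_score(
    pipe_demo, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error'
)
print(-cv_scores.mean(), cv_scores.std())

# final check on data that played no part in fitting or selection
pipe_demo.fit(X_train, y_train)
test_r2 = pipe_demo.score(X_test, y_test)
print(test_r2)
```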

I would like you to take over from here. Some ideas:

* Include categorical features (careful with high-cardinality features... you might want to use `OneHotEncoder` with `min_frequency`, or `TargetEncoder`).

* Use pipelines, rather than keeping track of intermediate results (inputs/outputs of models/transformers) by hand.

* Try out `RFE`, `RFECV`, SFS (`SequentialFeatureSelector`), and perhaps `SelectFromModel`.

* Could we select the best features from SHAP output?!

* https://scikit-learn.org/stable/modules/feature_selection.html

In [None]:
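# baseline for comparison: DummyRegressor (by default) always predicts the mean of the training targets
scores = cross_val_score(DummyRegressor(), X_sel, y, cv=20, scoring='neg_root_mean_squared_error')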

In [None]:
scores.mean()*-1, scores.std()

(78615.89815341146, 11628.188644916067)