## Feature Selection by Random Shuffling

A popular method of **feature selection** consists of **randomly shuffling the values of a specific variable** and determining **how that permutation affects the performance metric of the machine learning algorithm**. In other words, the idea is to **permute the values of each feature**, one feature at a time, and **measure how much the permutation (or shuffling of its values) degrades the accuracy, the roc_auc, the mse, or any other performance metric of the machine learning model**. If a variable is important, a random permutation of its values will dramatically worsen any of these metrics. Conversely, shuffling the values of an unimportant variable should have little to no effect on the model performance metric we are assessing.

The procedure:

1. Build a machine learning model and store its performance metric.
2. Shuffle one feature and make a new prediction using the previously trained model.
3. Determine the performance of this prediction.
4. Determine the change in performance between the prediction with the shuffled feature and the original one.
5. Repeat for each feature.
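The procedure can be sketched as a small helper function. This is a minimal illustration, not the notebook's own code: the function name `shuffle_importance`, the toy columns `signal` and `noise`, and all hyperparameters are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def shuffle_importance(model, X, y, random_state=0):
    """Return the drop in ROC-AUC when each column of X is shuffled in turn."""
    rng = np.random.default_rng(random_state)
    # Baseline performance with all features intact
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = {}
    for col in X.columns:
        X_shuffled = X.copy()
        # Permute one column; the model is NOT retrained
        X_shuffled[col] = rng.permutation(X_shuffled[col].to_numpy())
        shuffled = roc_auc_score(y, model.predict_proba(X_shuffled)[:, 1])
        drops[col] = baseline - shuffled
    return pd.Series(drops)


# Toy data (hypothetical): only "signal" is related to the target
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y_demo = (X_demo["signal"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, max_depth=2, random_state=0)
model.fit(X_demo, y_demo)

demo_importance = shuffle_importance(model, X_demo, y_demo)
print(demo_importance.sort_values(ascending=False))
```

In this toy setting, shuffling `signal` should hurt the ROC-AUC far more than shuffling `noise`, which is the behaviour the procedure relies on.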

To select features, we choose those whose shuffling induced a decrease in model performance beyond an arbitrarily set threshold. We will see how to select features based on random shuffling in a classification and a regression problem.
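As a small illustration of the thresholding step, the snippet below keeps only the features whose performance drop exceeds a cut-off. The feature names, the drop values and the threshold of 0.001 are all made up for the example; in practice the threshold is a judgment call.

```python
import pandas as pd

# Hypothetical drops in ROC-AUC observed after shuffling each feature
drops = pd.Series({
    "var_a": 0.0716,
    "var_b": 0.0155,
    "var_c": 0.0001,
    "var_d": 0.0,
    "var_e": -0.0015,
})

threshold = 0.001  # arbitrary cut-off, assumed for this example

# Keep the features whose shuffling degraded performance beyond the threshold
selected = drops[drops > threshold].index.tolist()
print(selected)
```

A threshold of 0 keeps every feature that caused any drop at all, while a stricter threshold discards features whose effect is within the noise of a single shuffle.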

**Note** For the demonstration, I will continue to use **Random Forests**, but this selection procedure can be used with any machine learning algorithm. In fact, the importance of the features is determined specifically for the algorithm used. Therefore, different algorithms may return different subsets of important features.

In [36]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score, mean_squared_error, r2_score

## Classification

In [37]:
data = pd.read_csv('dataset_2.csv')
data.shape

(50000, 109)

In [38]:
data.head()

Unnamed: 0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,var_10,...,var_100,var_101,var_102,var_103,var_104,var_105,var_106,var_107,var_108,var_109
0,4.53271,3.280834,17.982476,4.404259,2.34991,0.603264,2.784655,0.323146,12.009691,0.139346,...,2.079066,6.748819,2.941445,18.360496,17.726613,7.774031,1.473441,1.973832,0.976806,2.541417
1,5.821374,12.098722,13.309151,4.125599,1.045386,1.832035,1.833494,0.70909,8.652883,0.102757,...,2.479789,7.79529,3.55789,17.383378,15.193423,8.263673,1.878108,0.567939,1.018818,1.416433
2,1.938776,7.952752,0.972671,3.459267,1.935782,0.621463,2.338139,0.344948,9.93785,11.691283,...,1.861487,6.130886,3.401064,15.850471,14.620599,6.849776,1.09821,1.959183,1.575493,1.857893
3,6.02069,9.900544,17.869637,4.366715,1.973693,2.026012,2.853025,0.674847,11.816859,0.011151,...,1.340944,7.240058,2.417235,15.194609,13.553772,7.229971,0.835158,2.234482,0.94617,2.700606
4,3.909506,10.576516,0.934191,3.419572,1.871438,3.340811,1.868282,0.439865,13.58562,1.153366,...,2.738095,6.565509,4.341414,15.893832,11.929787,6.954033,1.853364,0.511027,2.599562,0.811364


**Important**

Select the features by **examining only the training set** to **avoid overfitting.**

In [39]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['target'], axis=1),
    data['target'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((35000, 108), (15000, 108))

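**Reset the indices of the returned datasets!** The shuffling step later assigns the permuted values back by index, so the index needs to run from 0 to n-1.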

In [40]:
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)

### Train ML algo with all features

To determine feature importance by **feature shuffling**, first build the machine learning model, for example Random Forests, with few and shallow trees to avoid overfitting. Then print the **roc-auc** in the train and test sets!

In [41]:
rf = RandomForestClassifier(
    n_estimators=50, max_depth=2, random_state=2909, n_jobs=4)
rf.fit(X_train, y_train)
print('train auc score: ',
      roc_auc_score(y_train, (rf.predict_proba(X_train.fillna(0)))[:, 1]))
print('test auc score: ',
      roc_auc_score(y_test, (rf.predict_proba(X_test.fillna(0)))[:, 1]))

train auc score:  0.690997114685582
test auc score:  0.6857035229040285


### Shuffle features and assess performance drop

**Shuffle each feature of the dataset, one by one!** Then use the dataset with the shuffled variable to make predictions with the **random forests** trained in the previous cell! The baseline is the overall train roc-auc obtained using all the features.

In [42]:
train_roc = roc_auc_score(y_train, (rf.predict_proba(X_train))[:, 1])
performance_shift = []  # list to capture the performance shift
for feature in X_train.columns:  # selection  logic
    X_train_c = X_train.copy()
    X_train_c[feature] = X_train_c[feature].sample(  # shuffle individual feature
        frac=1, random_state=10).reset_index(drop=True)
    shuff_roc = roc_auc_score(y_train, rf.predict_proba(X_train_c)[:, 1]) # make prediction!
    drift = train_roc - shuff_roc
    performance_shift.append(drift) # save the drop in roc-auc
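A single shuffle per feature gives a noisy estimate, which is one plausible reason small positive and negative values can appear for features the model barely uses. One way to reduce that noise is to repeat the shuffle several times and average the drops. The sketch below is an assumption-laden illustration: `mean_shuffle_drop`, `toy_accuracy` and the toy data are hypothetical names introduced here, and the scoring function stands in for a fitted model.

```python
import numpy as np
import pandas as pd


def mean_shuffle_drop(score_fn, X, y, n_repeats=5, random_state=0):
    """Average drop in score_fn(X, y) over several shuffles of each column."""
    rng = np.random.default_rng(random_state)
    baseline = score_fn(X, y)
    result = {}
    for col in X.columns:
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            X_shuffled[col] = rng.permutation(X_shuffled[col].to_numpy())
            drops.append(baseline - score_fn(X_shuffled, y))
        result[col] = {"mean": np.mean(drops), "std": np.std(drops)}
    # One row per feature, with the mean and spread of the drops
    return pd.DataFrame(result).T


# Toy data (hypothetical): the target depends on x1 only
rng = np.random.default_rng(1)
X_toy = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
y_toy = (X_toy["x1"] > 0).astype(int)


def toy_accuracy(X, y):
    # Stand-in for a fitted model: predicts 1 whenever x1 is positive
    predictions = (X["x1"] > 0).astype(int).to_numpy()
    return float((predictions == y.to_numpy()).mean())


summary = mean_shuffle_drop(toy_accuracy, X_toy, y_toy)
print(summary)
```

With the notebook's objects, the scoring function could be something like `lambda X, y: roc_auc_score(y, rf.predict_proba(X)[:, 1])`, passed together with `X_train` and `y_train`.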

**The list of performance shifts**

In [43]:
performance_shift

[0.0,
 -9.919140466640997e-05,
 -5.777064524881137e-05,
 0.0,
 0.0,
 -3.334693591705573e-05,
 8.796265542265758e-05,
 0.0,
 0.0,
 0.0,
 2.2864223244933868e-05,
 -6.957465160806198e-05,
 4.371055238927557e-05,
 0.0,
 0.0,
 0.015497219959881292,
 -0.00012937667354617766,
 0.0,
 0.0,
 0.0,
 0.0013551709383483601,
 0.0,
 -7.38081844428029e-05,
 0.0,
 1.2296120844856873e-05,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.0015281413151823076,
 2.6043867067171433e-05,
 0.0,
 0.0,
 0.001336418904640313,
 0.0,
 0.0,
 0.0,
 0.00015224539098712686,
 0.0,
 0.0,
 0.0,
 0.0,
 3.020099856532177e-06,
 0.0,
 0.0,
 2.17743806627535e-05,
 0.0,
 0.0017028464517550024,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.07156079421237771,
 0.0,
 -0.00015256447891842662,
 4.3209449511305564e-05,
 0.0,
 0.0,
 0.0,
 -0.00018709114134207727,
 -8.323251390618402e-05,
 0.00010579113180830824,
 1.5033086339877322e-05,
 0.0,
 3.216945650863501e-05,
 0.008743101448133728,
 0.0035736702283486466,
 0.0,
 0.0,
 0.00010268788932177308,
 8.20055983392

**Transform the list into a pandas Series for easy manipulation! Then add variable names in the index!**

In [44]:
feature_importance = pd.Series(performance_shift)
feature_importance.index = X_train.columns
feature_importance.head()

var_1    0.000000
var_2   -0.000099
var_3   -0.000058
var_4    0.000000
var_5    0.000000
dtype: float64

**Sort the Series according to the drop in performance caused by feature shuffling!**

In [45]:
feature_importance.sort_values(ascending=False)

var_55     0.071561
var_16     0.015497
var_69     0.008743
var_108    0.006602
var_70     0.003574
             ...   
var_63    -0.000187
var_102   -0.000230
var_86    -0.000236
var_84    -0.000351
var_30    -0.001528
Length: 108, dtype: float64

**Show the top 10 features whose shuffling caused the largest drop in the roc-auc (aka model performance)!**

In [46]:
feature_importance.sort_values(ascending=False).head(10)

var_55     0.071561
var_16     0.015497
var_69     0.008743
var_108    0.006602
var_70     0.003574
var_48     0.001703
var_21     0.001355
var_34     0.001336
var_91     0.001059
var_88     0.000683
dtype: float64

**Original number of features (rows in this case)!**

In [47]:
feature_importance.shape[0]

108

**The number of features that cause a drop in performance when shuffled!**

In [48]:
feature_importance[feature_importance>0].shape[0]

30

Only 30 out of the 108 features caused a drop in the performance of the random forests when their values were permuted. This means that we could select those features and discard the rest, and we would expect to keep roughly the original random forest performance.

**Print the important features!**

In [49]:
feature_importance[feature_importance>0].index

Index(['var_7', 'var_11', 'var_13', 'var_16', 'var_21', 'var_25', 'var_31',
       'var_34', 'var_38', 'var_43', 'var_46', 'var_48', 'var_55', 'var_58',
       'var_65', 'var_66', 'var_68', 'var_69', 'var_70', 'var_73', 'var_74',
       'var_79', 'var_88', 'var_91', 'var_92', 'var_96', 'var_98', 'var_104',
       'var_105', 'var_108'],
      dtype='object')

### Select features

Capture the selected features, then train a new **random forests** using only those features!

In [50]:
selected_features = feature_importance[feature_importance > 0].index
rf = RandomForestClassifier(n_estimators=50,
                            max_depth=2,
                            random_state=2909,
                            n_jobs=4)
rf.fit(X_train[selected_features], y_train)
# print roc-auc in train and testing sets
print(
    'train auc score: ',
    roc_auc_score(y_train, (rf.predict_proba(X_train[selected_features]))[:, 1]))
print(
    'test auc score: ',
    roc_auc_score(y_test, (rf.predict_proba(X_test[selected_features]))[:, 1]))

train auc score:  0.6954703746877449
test auc score:  0.6932896839648326


**The random forests with the selected features** show a **similar performance** (or even slightly higher) to the random forests built using all of the features, while providing a **simpler, faster and arguably more reliable model.**
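As an alternative to the manual loop, scikit-learn provides `sklearn.inspection.permutation_importance`, which shuffles each feature several times and reports the mean and standard deviation of the score drop. The sketch below runs it on synthetic data rather than on `dataset_2.csv`; the dataset parameters and variable names (`X_syn`, `y_syn`, `clf`) are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: with shuffle=False the first 2 columns are the informative ones
X_syn, y_syn = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    shuffle=False,
    random_state=0,
)

clf = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=0)
clf.fit(X_syn, y_syn)

# Shuffle each feature 5 times and measure the drop in roc-auc
result = permutation_importance(
    clf, X_syn, y_syn, scoring="roc_auc", n_repeats=5, random_state=0)

# Print features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

Here a positive importance always means the score got worse after shuffling, whichever scorer is used, because scikit-learn scorers follow a higher-is-better convention.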

## Regression

In [51]:
data = pd.read_csv('HousingPrices_train.csv')
data.shape

(1460, 81)

In practice, feature selection should be done **after data pre-processing,** so ideally, all the **categorical variables are encoded into numbers,** and then you can assess **how predictive** they are of the target! Here, for simplicity, use **only numerical variables**! **Select numerical columns:**

In [52]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(data.select_dtypes(include=numerics).columns)
data = data[numerical_vars]
data.shape

(1460, 38)

In [53]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels=['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((1022, 36), (438, 36))

**Reset the indices of the returned datasets!**

In [54]:
X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)

In [55]:
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

### Train ML algo with all features

To determine feature importance by **feature shuffling**, first **build the machine learning model of choice**, for example **Random Forests**, with few and shallow trees to avoid overfitting! Then print performance metrics!

In [56]:
rf = RandomForestRegressor(n_estimators=100,
                           max_depth=3,
                           random_state=2909,
                           n_jobs=4)
rf.fit(X_train, y_train)
print('train rmse: ', mean_squared_error(y_train, rf.predict(X_train), squared=False))
print('train r2: ', r2_score(y_train, (rf.predict(X_train))))
print()
print('test rmse: ', mean_squared_error(y_test, rf.predict(X_test), squared=False))
print('test r2: ', r2_score(y_test, rf.predict(X_test)))

train rmse:  34125.46855017603
train r2:  0.8090829266232026

test rmse:  39164.18326517837
test r2:  0.7740705281238518


### Shuffle features and assess performance drift

**Shuffle each feature of the dataset, one by one!** Use the dataset with the **shuffled variable** to make predictions with the **trained random forests**! The baseline is the overall train rmse obtained using all the features.

In [57]:
train_rmse = mean_squared_error(y_train, rf.predict(X_train), squared=False)
performance_shift = []  # list to capture the performance shift
for feature in X_train.columns:  # for each feature:
    X_train_c = X_train.copy()
    X_train_c[feature] = X_train_c[feature].sample(
        frac=1, random_state=11).reset_index(drop=True)  # shuffle individual feature
    shuff_rmse = mean_squared_error(
        y_train, rf.predict(X_train_c), squared=False)  # rmse with the shuffled feature
    drift = train_rmse - shuff_rmse  # negative when shuffling increases the rmse
    performance_shift.append(drift)  # store the change in rmse

**Transform the list into a pandas Series for easy manipulation! Add variable names in the index!**

In [58]:
feature_importance = pd.Series(performance_shift)
feature_importance.index = X_train.columns
feature_importance.head()

MSSubClass         0.000000
LotFrontage      -45.751607
LotArea         -390.252300
OverallQual   -42772.508176
OverallCond       -6.967475
dtype: float64

**Check the rmse: the smaller, the better! We compute original_rmse - shuffled_data_rmse. If the feature was important, the shuffled data would increase the rmse, so we are looking for negative values! Here is the number of features that cause a drop in performance when shuffled:**

In [59]:
feature_importance[feature_importance<0].shape[0]

28
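The sign convention differs between the two examples: for roc-auc a positive difference marks an important feature, while for rmse a negative one does. A small helper can put both on the same footing. This is only a sketch; `performance_drop` is a hypothetical name, and the numbers are rounded, illustrative values rather than exact outputs.

```python
def performance_drop(baseline, shuffled, greater_is_better):
    """Return a value that is positive when shuffling made the model worse."""
    diff = baseline - shuffled
    # For error metrics (lower is better), flip the sign
    return diff if greater_is_better else -diff


# ROC-AUC (higher is better): 0.69 before shuffling, 0.62 after
auc_drop = performance_drop(0.69, 0.62, greater_is_better=True)

# RMSE (lower is better): 34125 before shuffling, 76898 after
rmse_drop = performance_drop(34125.0, 76898.0, greater_is_better=False)

print(auc_drop, rmse_drop)
```

With such a helper, the same `> threshold` filter would apply to classification and regression metrics alike.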

**The variable names!**

In [60]:
feature_importance[feature_importance<0].index

Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF',
       '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath',
       'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', 'ScreenPorch', 'MoSold', 'YrSold'],
      dtype='object')

### Select features

Compare the performance of a **random forest built only using the selected features**! Slice the data!

In [61]:
feat = feature_importance[feature_importance<0].index
X_train = X_train[feat]
X_test = X_test[feat]

In [62]:
X_train.shape, X_test.shape

((1022, 28), (438, 28))

**Build and evaluate the model! Print performance metrics!**

In [63]:
rf = RandomForestRegressor(n_estimators=100,
                           max_depth=3,
                           random_state=2909,
                           n_jobs=4)
rf.fit(X_train, y_train)
print('train rmse: ', mean_squared_error(y_train, rf.predict(X_train), squared=False))
print('train r2: ', r2_score(y_train, (rf.predict(X_train))))
print()
print('test rmse: ', mean_squared_error(y_test, rf.predict(X_test), squared=False))
print('test r2: ', r2_score(y_test, rf.predict(X_test)))

train rmse:  34114.694591609696
train r2:  0.8092034587666255

test rmse:  39561.867880230304
test r2:  0.7694589242329681


**The model with fewer features shows similar performance to that with all features.**