# Exercise - Feature Selection, Importance and Interpretation


In this exercise, you will train a base model and investigate which features appear to drive its performance. You will then apply feature selection techniques to reduce the feature set and examine how this affects the model's performance.


In [1]:
# DO NOT MODIFY - imports
import pandas as pd
import numpy as np

## 1. Setup, Baseline Model and Baseline Performance Score


Execute the cells below to create a synthetic dataset for binary classification with 50 features and 10,000 examples. Imagine the target `y` is the direction of price movements which we would like to predict using the 50 features at our disposal.

In [2]:
# DO NOT MODIFY - create dataset and display basic statistics
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10_000, n_classes=2, n_features=50, n_informative=10, n_redundant=10, class_sep=0.4, n_clusters_per_class=3, random_state=52)

X = pd.DataFrame(X)
y = pd.Series(y)

# DO NOT MODIFY - Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

In [4]:
X.tail()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9995 | 1.069238 | -2.532424 | 0.83371 | 0.490532 | -2.785205 | -4.208075 | -0.603974 | -1.416313 | -0.502804 | 1.956463 | ... | 0.041912 | -1.699398 | 2.115717 | -1.225379 | 2.201482 | -1.073025 | -2.019165 | -2.872325 | -0.784192 | -1.01401 |
| 9996 | 0.355791 | -0.779096 | -0.436056 | -2.00135 | -2.683171 | -7.662517 | -0.30445 | -0.157975 | 0.788791 | -0.368927 | ... | -0.752433 | -0.291018 | 2.380161 | -0.49336 | -0.449471 | -0.649293 | -4.454454 | -1.411793 | -0.901456 | -0.371381 |
| 9997 | -2.207905 | -1.473431 | -0.679165 | -3.057086 | 0.378352 | -3.405337 | 0.896345 | 0.464524 | -1.161364 | -0.642827 | ... | -1.112232 | 0.082142 | -1.916762 | 0.992739 | 1.024438 | -7.948777 | -7.483838 | 2.639158 | -1.021207 | 1.170794 |
| 9998 | -1.396371 | 4.541563 | -1.764635 | 2.208096 | 5.659466 | -0.485101 | 3.33759 | -0.409131 | -1.170648 | -0.300203 | ... | 0.009379 | -1.26228 | -2.41528 | -0.349376 | 0.916443 | -3.661179 | -4.244927 | -0.873561 | -0.283725 | -0.334594 |
| 9999 | 1.847969 | -2.291281 | -0.909441 | 2.803908 | -0.833997 | -5.657078 | -1.996714 | -1.728738 | -0.618445 | 1.465954 | ... | -0.715081 | -3.001926 | 8.876385 | 0.511807 | 1.399261 | 3.245872 | -7.441697 | -10.990252 | 0.467966 | 0.114167 |


In [5]:
y.tail()

9995    0
9996    0
9997    0
9998    0
9999    1
dtype: int64

Before we continue, we would like to establish a baseline score. We will choose accuracy as the relevant performance metric.  
Write code to calculate and display the accuracy score on the _test set_ of a naive baseline model that always predicts the majority class (based on the majority class in the _training set_).

> **HINT:**
> First, find the majority class (either `0` or `1`) of the target variable in the _training set_. You can use `y_train.value_counts()`, or you can take the `mode()` of the target, since it has only two classes.  
> Next, create an array with the same length as `y_test` with all elements equal to the majority class you just found.  
> Finally, use this as the vector of predictions to evaluate this naive baseline model on the _test set_.


In [6]:
# DO NOT MODIFY - imports
from sklearn.metrics import accuracy_score

# FILL IN - Find the majority class in the training set
majority_class = y_train.mode()[0]
# Note: There are two classes, so the majority class is the mode of the target variable.
# But there are many ways to find the majority class. E.g. the student may look at the
# distribution of the target variable and assign the majority class manually.

# FILL IN - Calculate the accuracy of the majority class classifier on the test set
baseline_test_acc = accuracy_score(y_test, [majority_class] * len(y_test))
baseline_test_acc

0.4965
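
As a cross-check on the manual baseline, the same majority-class predictor can be built with scikit-learn's `DummyClassifier`. This is a minimal sketch, not part of the original exercise: it rebuilds the dataset with the same call as above so that it runs on its own, and the names `dummy`, `dummy_test_acc` and `manual_test_acc` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Recreate the exercise data so the snippet is self-contained.
X, y = make_classification(n_samples=10_000, n_classes=2, n_features=50, n_informative=10,
                           n_redundant=10, class_sep=0.4, n_clusters_per_class=3, random_state=52)
X, y = pd.DataFrame(X), pd.Series(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# strategy="most_frequent" always predicts the most common label seen during fit().
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy_test_acc = accuracy_score(y_test, dummy.predict(X_test))

# Manual version, mirroring the cell above.
majority_class = y_train.mode()[0]
manual_test_acc = accuracy_score(y_test, [majority_class] * len(y_test))

print(dummy_test_acc, manual_test_acc)
```

Both numbers should agree, since each approach predicts the training-set majority class for every test example.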

## 2. Feature Selection

Run the code cell below to train a `LogisticRegression` model with its default hyperparameter values, using all 50 features.

In [7]:
# DO NOT MODIFY - import and train a LogisticRegression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=52)
clf.fit(X_train, y_train)

Run the code cells below to get the cross-validated accuracy score and the actual accuracy score on the test set.

In [8]:
from sklearn.model_selection import cross_val_score
cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy").mean()

np.float64(0.5875)

In [9]:
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.5835

Take a look at the permutation importance scores of the features using the test set. Use `n_repeats=10` and `random_state=52`. Store the average permutation importance scores in `mean_perm_imps`.

In [10]:
# DO NOT MODIFY - import
from sklearn.inspection import permutation_importance

# FILL IN - Calculate permutation importance scores for the features in the test set
# Use n_repeats=10 and random_state=52
perm_imps = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=52)
mean_perm_imps = perm_imps.importances_mean
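
To make the idea behind permutation importance concrete, the sketch below computes it by hand: shuffle one column, re-score the already fitted model, and record how much the accuracy drops. It is an illustration under assumptions, not scikit-learn's implementation. It uses a small hypothetical dataset (`X_demo`, `y_demo`) and a hypothetical helper `manual_permutation_importance`, and its values will not match `permutation_importance` exactly because the random shuffles differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small hypothetical dataset, unrelated to the exercise data.
X_demo, y_demo = make_classification(n_samples=2_000, n_features=10, n_informative=4,
                                     n_redundant=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

def manual_permutation_importance(model, X, y, column, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling a single column n_repeats times."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = []
    for _ in range(n_repeats):
        X_shuffled = X.copy()
        # Shuffling breaks the link between this feature and the target.
        X_shuffled[:, column] = rng.permutation(X_shuffled[:, column])
        drops.append(baseline - accuracy_score(y, model.predict(X_shuffled)))
    return float(np.mean(drops))

importances = [manual_permutation_importance(model, X_te, y_te, c) for c in range(X_te.shape[1])]
for c, imp in enumerate(importances):
    print(f"{c}: {imp:.4f}")
```

A negative value means the model happened to score slightly better with the column shuffled, which usually indicates that the feature contributes little beyond noise.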

Run the cell below to print out the features listed in decreasing order of absolute value of mean permutation importance.

In [11]:
# DO NOT MODIFY - Sort and print the features by decreasing absolute permutation importance
sorted_idx = np.argsort(np.abs(mean_perm_imps))[::-1]
for i in sorted_idx:
    print(f"{i}: {mean_perm_imps[i]}")

3: 0.013700000000000023
5: 0.013150000000000018
46: 0.011800000000000033
41: -0.0072999999999999845
28: 0.005300000000000038
42: -0.005099999999999982
47: -0.004599999999999982
19: 0.0044000000000000376
38: 0.004350000000000043
14: -0.0027499999999999747
40: 0.0027000000000000244
37: -0.0025499999999999746
15: -0.0024499999999999635
48: -0.002299999999999991
1: 0.0021500000000000186
25: -0.0020999999999999795
30: -0.0019499999999999739
21: 0.0017000000000000459
13: -0.001399999999999968
17: 0.001350000000000029
22: 0.0012000000000000344
36: 0.0012000000000000123
8: -0.0011999999999999789
31: 0.0011000000000000454
44: -0.0010499999999999843
2: -0.0010499999999999733
6: 0.001000000000000023
0: 0.0008000000000000229
35: -0.0007999999999999674
24: 0.0007500000000000395
34: 0.0007500000000000284
12: 0.0007500000000000284
16: -0.0007499999999999619
23: 0.0007000000000000339
7: 0.0007000000000000229
4: -0.0006999999999999674
9: -0.0005999999999999672
32: -0.0005999999999999672
26: 0.000500000
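
Many of the scores above are small and several are negative, so it can help to compare each mean against its spread across repeats. The object returned by `permutation_importance` also exposes `importances_std` and the raw per-repeat `importances`. The sketch below uses a small hypothetical dataset, and the "mean minus two standard deviations is above zero" rule is a common heuristic assumed here, not a rule from this exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small hypothetical dataset, unrelated to the exercise data.
X_demo, y_demo = make_classification(n_samples=2_000, n_features=10, n_informative=4,
                                     n_redundant=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

# Heuristic (an assumption): treat a feature as clearly useful only if its
# mean importance exceeds zero by more than two standard deviations.
clearly_positive = np.flatnonzero(result.importances_mean - 2 * result.importances_std > 0)

for i in np.argsort(result.importances_mean)[::-1]:
    marker = "*" if i in clearly_positive else " "
    print(f"{marker} {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

Features without a `*` have importances that are hard to distinguish from shuffle noise at this number of repeats.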

Reduce the feature set by dropping features whose mean permutation importance has an absolute value less than `0.003`. Store the resulting reduced feature sets in `X_train_reduced` and `X_test_reduced`.

In [12]:
# FILL IN - Filter out features with absolute permutation importance less than 0.003
unimportant_features = [i for i in sorted_idx if np.abs(mean_perm_imps[i]) < 0.003]
# The column labels are the integers 0-49, so the positional indices double as labels.
# Using drop() keeps the column labels, so the surviving features stay identifiable.
X_train_reduced = X_train.drop(columns=unimportant_features)
X_test_reduced = X_test.drop(columns=unimportant_features)

Run the cell below to see how many features remain. (There should be 9.)

In [13]:
# DO NOT MODIFY - There should be 9 features remaining
X_train_reduced.shape[1]

9

Re-train the classifier from earlier and check its average cross-validated accuracy and test accuracy scores.

In [14]:
# FILL IN - Train a new LogisticRegression model on the reduced feature set
clf = LogisticRegression(random_state=52)  # same settings as the baseline model
clf.fit(X_train_reduced, y_train)

In [15]:
# FILL IN - Calculate the mean cross-validated accuracy of the new model
cross_val_score(clf, X_train_reduced, y_train, cv=5, scoring="accuracy").mean()

np.float64(0.5925)

In [16]:
# FILL IN - Calculate the accuracy of the new model on the test set
y_pred = clf.predict(X_test_reduced)
accuracy_score(y_test, y_pred)

0.588

**NOTE:** Reducing the feature set may or may not improve performance. Here, both the cross-validated accuracy (0.5875 to 0.5925) and the test accuracy (0.5835 to 0.588) rose slightly, but that is not guaranteed: even less "important" features can carry some information, and eliminating them can lower performance scores. A small reduction in performance may still be worthwhile if it means faster model training and a more interpretable model.
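
The trade-off described in the note can be measured directly by recording both fit time and test accuracy for the full and the reduced feature sets. The sketch below is a variation under stated assumptions rather than the exercise's method. It uses a small hypothetical dataset, ranks features on a validation slice carved out of the training data (so the test score is not used for selection), and keeps a fixed top 5 features instead of applying a threshold. The helper `fit_and_score` is hypothetical.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Small hypothetical dataset, unrelated to the exercise data.
X_demo, y_demo = make_classification(n_samples=3_000, n_features=20, n_informative=5,
                                     n_redundant=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)
# Hold out part of the training data purely for ranking features.
X_fit, X_val, y_fit, y_val = train_test_split(X_tr, y_tr, test_size=0.25, shuffle=False)

ranker = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_imps = permutation_importance(ranker, X_val, y_val, n_repeats=10, random_state=0).importances_mean
# Keep the 5 features with the largest absolute importance (a hypothetical cut-off).
keep = np.argsort(np.abs(val_imps))[::-1][:5]

def fit_and_score(columns):
    """Fit on the chosen columns of the training set and report fit time and test accuracy."""
    start = time.perf_counter()
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, columns], y_tr)
    fit_seconds = time.perf_counter() - start
    test_acc = accuracy_score(y_te, model.predict(X_te[:, columns]))
    return {"n_features": len(columns), "fit_seconds": fit_seconds, "test_acc": test_acc}

full = fit_and_score(np.arange(X_tr.shape[1]))
reduced = fit_and_score(keep)
print(full)
print(reduced)
```

On data this small the timing difference is likely to be negligible. The point of the sketch is the comparison pattern, which becomes more informative with wider datasets or slower models.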