<center><img src=img/MScAI_brand.png width=70%></center>

# Scikit-Learn: Feature Selection & Feature Engineering

* Feature selection: we choose a subset of existing features
* Feature engineering: we construct new features from existing data

In Scikit-Learn, both are implemented as *transformers*: objects with a `transform(X)` method, usually called after `fit(X)`, and with `fit_transform(X)` as a shortcut that does both in one call.
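
To make that interface concrete, here is a minimal sketch of a toy transformer. The class name `ColumnCentrer` is hypothetical (it is not part of Scikit-Learn): it learns the column means in `fit` and subtracts them in `transform`, while inheriting from `TransformerMixin` supplies `fit_transform`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnCentrer(TransformerMixin, BaseEstimator):
    """Hypothetical toy transformer: subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        # Learn one statistic per column and store it on the object.
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the stored statistics; nothing is re-learned here.
        return np.asarray(X, dtype=float) - self.means_

X_demo = np.array([[1.0, 10.0], [3.0, 30.0]])
two_step = ColumnCentrer().fit(X_demo).transform(X_demo)
one_step = ColumnCentrer().fit_transform(X_demo)  # provided by TransformerMixin
print(two_step)
```

Both routes give the same result; the separate `fit` and `transform` calls matter once we have a test set, as we'll see below.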

### Feature selection

* **Filter approach**: calculate a statistic per feature and choose those above a threshold.
* **Wrapper approach**: try different subsets and see which gives best performance after training.

Reference:
* https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection



### Filter approach


`VarianceThreshold` throws away features with too little variance. 

In [1]:
import numpy as np
from sklearn.feature_selection import VarianceThreshold
X = np.array([[0.5, 1.0, 0.1], 
              [0.3, 1.0, 0.2], 
              [0.1, 1.0, 0.2], 
              [0.9, 1.0, 0.2], 
              [0.8, 1.0, 0.1]])

By default, only features with zero variance are thrown away.

In [2]:
sel = VarianceThreshold() 
sel.fit_transform(X)

array([[0.5, 0.1],
       [0.3, 0.2],
       [0.1, 0.2],
       [0.9, 0.2],
       [0.8, 0.1]])

But we can set the threshold:

In [3]:
sel = VarianceThreshold(threshold=0.01) 
sel.fit_transform(X)

array([[0.5],
       [0.3],
       [0.1],
       [0.9],
       [0.8]])
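
To see why the third column was dropped this time, we can inspect the fitted selector. This is a small sketch on the same data (renamed `X_var` here so the block stands alone), using the `variances_` attribute and `get_support()`:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X_var = np.array([[0.5, 1.0, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 1.0, 0.2],
                  [0.9, 1.0, 0.2],
                  [0.8, 1.0, 0.1]])

sel = VarianceThreshold(threshold=0.01).fit(X_var)
print(sel.variances_)     # per-column variance learned in fit
print(sel.get_support())  # boolean mask: True for the columns that are kept
```

Only the first column has a variance above 0.01, so it is the only one kept.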

`SelectKBest` still considers just one feature at a time, but is more general. Eg we can use it to select features based on their relationship to the target $y$.

The *chi-squared* ($\chi^2$) score is a measure of dependence between a non-negative numerical feature and a discrete target, so it's suitable for this. (Scikit-Learn's `chi2` requires all feature values to be non-negative.) We choose how many features to keep, eg if `k=2` here we will keep $X_2$ and $X_3$.

<center><img src=img/feature_selection.svg width=55%></center>

In [4]:
# https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target

In [5]:
sel = SelectKBest(chi2, k=2).fit(X, y)
X_new = sel.transform(X)
print("Scores", sel.scores_) # the chi-squared scores
print("Shapes", X.shape, X_new.shape)

Scores [ 10.81782088   3.7107283  116.31261309  67.0483602 ]
Shapes (150, 4) (150, 2)
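
The scores tell us how each feature ranks, but not directly which columns survived. A short sketch, using `get_support` on the same Iris data, recovers their indices and names:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
sel = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

kept = sel.get_support(indices=True)                # column indices of kept features
kept_names = [iris.feature_names[i] for i in kept]  # map indices back to names
print(kept, kept_names)
```

The two kept columns are the ones with the two highest chi-squared scores.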


### Wrapper approach

The *wrapper* approach to feature selection is to try different subsets and see which gives best performance when used inside the ML model we want to use them in. There are at least three types of wrapper approach:

* Forward
* Backward
* Metaheuristic

**Backward feature elimination** works this way:

1. Start with all features
2. Train the model
3. Inspect the feature importances, eg by looking at `coef_`, and choose the **least important**
4. Remove that feature and repeat from 2.
5. At the end, choose the best model of all those we've seen.

* Implementation: [`RFECV`](https://scikit-learn.org/stable/modules/feature_selection.html#rfe)
* [Example](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_rfe_with_cross_validation.html#sphx-glr-auto-examples-feature-selection-plot-rfe-with-cross-validation-py)
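
As a rough sketch of backward elimination in practice, we can run `RFECV` on Iris. The choice of `LogisticRegression` as the wrapped model, and the settings `step=1` and `cv=5`, are assumptions made for illustration. Note that `RFECV` uses cross-validation to decide how many features to keep, which is one way of carrying out step 5 above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X_iris, y_iris = load_iris(return_X_y=True)

# The wrapped model must expose coef_ or feature_importances_ after fitting.
estimator = LogisticRegression(max_iter=1000)

# Remove one feature per round (step=1), scoring each subset size with 5-fold CV.
rfecv = RFECV(estimator, step=1, cv=5).fit(X_iris, y_iris)

print("Number of features kept:", rfecv.n_features_)
print("Mask of kept features:", rfecv.support_)
print("Ranking (1 = kept):", rfecv.ranking_)

X_reduced = rfecv.transform(X_iris)
```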

### Feature engineering

* Scaling
* Missing values
* One-hot encoding
* Arithmetic feature transformations

### Scaling

Some ML methods work better if features are normalised to $[0, 1]$ or standardised to have mean 0 and standard deviation 1. Scikit-Learn provides `StandardScaler`, for example, for the latter. The calculation is simple: $(X - \bar{X}) / \sigma(X)$. 
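
For the former, normalising to $[0, 1]$, one option is `MinMaxScaler`. The sketch below uses a small made-up column `X_demo` and checks the result against the formula $(X - \min X) / (\max X - \min X)$:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_demo = np.array([[0, 4, 1, 6, 7, 8, 5, 9.0]]).T  # made-up single-column data

X_mm = MinMaxScaler().fit_transform(X_demo)

# The same calculation by hand, column by column.
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)
print(X_mm.ravel())
```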

<center><img src=img/leak.jpg width=20%>
    
<font size=1>  
Photo by <a href="https://unsplash.com/@sanatoga?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Joe Zlomek</a> on <a href="https://unsplash.com/photos/zrb_TkHPVtE?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
</font></center>
    
We must not **leak information about the test set** into our training, so:

* First, calculate the mean and std of the train set
* Use them to transform the train set
* Then use the same mean and std to transform the test set
* Never use the mean and standard deviation of the test set!

In [6]:
from sklearn.preprocessing import StandardScaler
X_train = np.array([[0, 4, 1, 6, 7, 8, 5, 9.0]]).T
X_test = np.array([[3.3, 4.5, 5.5]]).T
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# do not fit on X_test!
X_test = scaler.transform(X_test) 

As a result, `X_train` is now standardised:

In [7]:
X_train

array([[-1.66666667],
       [-0.33333333],
       [-1.33333333],
       [ 0.33333333],
       [ 0.66666667],
       [ 1.        ],
       [ 0.        ],
       [ 1.33333333]])

Note that `X_test` will not, in general, have zero mean and unit variance, because it was transformed using the training set's mean and standard deviation. This is expected.

In [8]:
X_test

array([[-0.56666667],
       [-0.16666667],
       [ 0.16666667]])
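
We can check where those numbers come from. In this sketch (same data as above, redefined so the block stands alone) the fitted scaler stores the training mean in `mean_` and the training standard deviation in `scale_`, and applying them by hand reproduces the transformed test set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0, 4, 1, 6, 7, 8, 5, 9.0]]).T
X_test = np.array([[3.3, 4.5, 5.5]]).T

scaler = StandardScaler().fit(X_train)  # statistics come from the train set only
print(scaler.mean_, scaler.scale_)

# Apply the *training* mean and std to the test set by hand.
X_test_manual = (X_test - scaler.mean_) / scaler.scale_
X_test_scaled = scaler.transform(X_test)
```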

### Imputing missing values

It's common to have missing values in our data:

In [9]:
X = np.array([[0, 4, 1, 6, 7, np.nan, 5, 9.0]]).T
X

array([[ 0.],
       [ 4.],
       [ 1.],
       [ 6.],
       [ 7.],
       [nan],
       [ 5.],
       [ 9.]])

A common strategy is just to impute the mean of the values present in the column. 

In [10]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2

array([[0.        ],
       [4.        ],
       [1.        ],
       [6.        ],
       [7.        ],
       [4.57142857],
       [5.        ],
       [9.        ]])
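
The same train/test discipline applies to imputing. In this sketch the test column `X_test` is hypothetical: the imputer learns the mean from the training data (stored in `statistics_`) and uses that same value to fill gaps in the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[0, 4, 1, 6, 7, np.nan, 5, 9.0]]).T
X_test = np.array([[np.nan, 2.0]]).T  # hypothetical test data with one gap

imp = SimpleImputer(strategy="mean").fit(X_train)  # mean of the 7 present values
print(imp.statistics_)

X_test_filled = imp.transform(X_test)  # gap filled with the *training* mean
print(X_test_filled)
```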

But perhaps a safer approach is to drop any rows containing missing values. In Pandas you can use `dropna`, while in NumPy you can use code like the following. Again, we won't overwrite our `X`.

In [11]:
X2 = X[~np.isnan(X).any(axis=1)]

How does this work? `isnan` returns a 2D `array` of `bool`:

In [12]:
np.isnan(X)

array([[False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False]])

We reduce this to 1D using `any` along `axis=1`, giving one flag per row:

In [13]:
np.isnan(X).any(axis=1)

array([False, False, False, False, False,  True, False, False])

But be careful! Suppose we also have `y` and we're planning to train a model with `X` and `y`.

In [14]:
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
y.shape

(8,)

Then we must drop the **same rows** in `y`, based on the missing values in `X`:

In [15]:
y2 = y[~np.isnan(X).any(axis=1)]

In [16]:
print(X2.shape, y2.shape)

(7, 1) (7,)
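
One way to make it harder to get this wrong is to compute the row mask once and reuse it for both arrays. A sketch with the same data (the name `keep` is just a choice made here):

```python
import numpy as np

X = np.array([[0, 4, 1, 6, 7, np.nan, 5, 9.0]]).T
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])

keep = ~np.isnan(X).any(axis=1)  # True for rows with no missing values
X2, y2 = X[keep], y[keep]        # the same mask keeps X and y aligned
print(X2.shape, y2.shape)
```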


### Arithmetic feature transformations

<center><img src=img/arithmetic-feature-transformation.png width=50%></center> 
<font size=2>Derived from PDSH; code in `code/make_arithmetic_transformation_plot.py`</font>

Suppose we have data like the above. We'll find that linear regression $y = a+bx$ doesn't model it well (left). But if we added the feature $x^2$ to give the model $y = a+bx+cx^2$, we could find a good fit (right)!

The same idea can in principle work for $x^3$ and higher. These are called *polynomial features*. 

In [17]:
from sklearn.preprocessing import PolynomialFeatures
X = np.array([[0, 1.5, 2, 4, 4.5, 5, 6, 7, 8]]).T
poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(X)
print(X2)

[[  0.      0.      0.   ]
 [  1.5     2.25    3.375]
 [  2.      4.      8.   ]
 [  4.     16.     64.   ]
 [  4.5    20.25   91.125]
 [  5.     25.    125.   ]
 [  6.     36.    216.   ]
 [  7.     49.    343.   ]
 [  8.     64.    512.   ]]
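
To connect this back to the regression example, here is a sketch comparing a plain linear fit with a fit on polynomial features. The data is synthetic (a noisy quadratic invented for this illustration, not the data in the figure), and chaining the steps with `make_pipeline` is one convenient option rather than the only way to do it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a quadratic curve plus a little noise (made up for this sketch).
rng = np.random.default_rng(0)
x = np.linspace(0, 8, 30)
y = 1.0 + 4.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)
X = x.reshape(-1, 1)  # Scikit-Learn expects a 2D feature array

# Model 1: y = a + bx
linear = LinearRegression().fit(X, y)

# Model 2: y = a + bx + cx^2, by adding x^2 as a feature before the regression.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

r2_linear = linear.score(X, y)        # R^2 on the training data
r2_quadratic = quadratic.score(X, y)
print(r2_linear, r2_quadratic)
```

The regression itself is still linear in its coefficients; only the features have changed.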


### One-hot Encoding

In *one-hot encoding*, we convert a single categorical feature `f` with $n$ levels to $n$ binary features `f0`, `f1`, etc:

`f` | `f0` | `f1` | `f2`
----|------|------|-----
 `a`|  1   |   0  |  0
 `b`|  0   |   1  |  0 
 `a`|  1   |   0  |  0
 `c`|  0   |   0  |  1

Of course, Scikit-Learn provides that for us:

In [31]:
from sklearn.preprocessing import OneHotEncoder
X_train = [["a"], ["b"], ["a"], ["c"]]
# `sparse_output` requires scikit-learn >= 1.2 (earlier versions call it `sparse`)
X_train_enc = OneHotEncoder(sparse_output=False).fit_transform(X_train)
X_train_enc

array([[1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

The above works if we only want to transform one dataset. 

But if we have a test set as well, we should use the same `OneHotEncoder` to transform it, and again we should not re-`fit` it on the test set. If we did re-fit, and the test set didn't contain the same **set** of categories, the encoded columns would no longer line up with those of the training set. Eg, in the following case our test set would have the wrong number of columns:

In [32]:
X_test = [["c"], ["b"]]
X_test_enc = OneHotEncoder(sparse_output=False).fit_transform(X_test)
X_test_enc

array([[0., 1.],
       [1., 0.]])

So, the best way to do it is to keep the `OneHotEncoder` object, `fit` on the training set and then `transform` both the training and test sets with it.

In [33]:
ohe = OneHotEncoder(sparse_output=False).fit(X_train)
X_train_enc = ohe.transform(X_train)
X_test_enc = ohe.transform(X_test)
X_test_enc

array([[0., 0., 1.],
       [0., 1., 0.]])
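
A related problem is a test set containing a category that never appeared in training. By default `transform` raises an error in that case. One option, sketched below with a hypothetical unseen category `"d"`, is `handle_unknown="ignore"`, which encodes unknown categories as a row of zeros. We call `.toarray()` on the default sparse result so the block doesn't depend on the name of the dense-output parameter.

```python
from sklearn.preprocessing import OneHotEncoder

X_train = [["a"], ["b"], ["a"], ["c"]]
X_test = [["c"], ["d"]]  # "d" is a hypothetical category unseen in training

ohe = OneHotEncoder(handle_unknown="ignore").fit(X_train)
print(ohe.categories_)   # the categories learned from the training set

X_test_enc = ohe.transform(X_test).toarray()  # sparse matrix -> dense array
print(X_test_enc)
```

Whether an all-zero row is acceptable depends on the model and the application, so treat this as one possible choice.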

### Conclusion

Scikit-Learn gives us lots of methods for feature **selection** and feature **engineering**, and we've seen a few of the most important and simplest ones.

Data hygiene is important: eg when selecting features, normalising, or one-hot encoding, we use only the training set to decide **what** to do, then **do that** to both the training set and test set.
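
One way to get this hygiene almost for free is to chain the steps in a pipeline, so that every step is fitted on the training set only. The sketch below is an illustration under assumptions: the particular steps, the 70/30 split and the use of `LogisticRegression` are choices made here, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# chi2 needs non-negative features, so we select before standardising.
pipe = make_pipeline(
    SelectKBest(chi2, k=2),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

pipe.fit(X_train, y_train)  # selection and scaling are fitted on the train set only
test_accuracy = pipe.score(X_test, y_test)  # the test set is only transformed
print(test_accuracy)
```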