# Feature engineering: selection

As discussed previously, the first step in a machine learning pipeline is data preparation; earlier notebooks covered outliers, imputation, encoding, and transformation. An eager approach would be to feed the prepared data straight to estimators. In some situations, that works. Most often, however, passing the prepared features to estimators unchanged is not advisable, either because there are too many features (e.g. due to one-hot encoding) or because the information they convey is redundant.

**Feature engineering** is the ML pipeline step where we select the most relevant features, or even produce new features that better represent the information conveyed by the original ones. Two sets of approaches can be adopted:
- **Selection:** a subset of the original features is selected.
- **Extraction:** new features are produced from the original ones.

In this notebook, we'll discuss **feature selection** approaches, particularly the ones provided by scikit-learn. In a nutshell, these methods can be categorized as *filter*, *wrapper*, or *embedded*. To understand these from a practical perspective, let's load the Boston house price prediction dataset provided by scikit-learn:

In [0]:
import pandas as pd
import seaborn as sns

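# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2
from sklearn.datasets import load_boston
boston_data = load_boston()
print(boston_data.DESCR)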

The data is provided as a structure where the `data` and `feature_names` attributes hold the input features and their names, respectively, and the `target` attribute holds the feature to be predicted (median market value):

In [0]:
X = pd.DataFrame(boston_data.data, columns = boston_data.feature_names)
y = pd.Series(boston_data.target, index = X.index, name = "MEDV")
X.head()

In [0]:
y.head()

## Filter approaches

This kind of approach is based on thresholds: only properties of the feature itself are used to decide whether it should be selected. The most trivial example is discarding features based on missing values, as we discussed before. This can be done directly with Pandas, using the `dropna()` method with the `thresh` parameter, which sets the minimum number of non-missing values a feature needs in order to be kept. Let's see if this would apply to our case:

In [0]:
X.isna().sum()

Well, as usual with toy datasets, no missing values! Let's see anyway how this would work if we had to do it:

In [0]:
n_samples = X.shape[0]
# thresh is the minimum number of non-missing values a column needs to be kept
X.dropna(axis=1, thresh=int(0.1 * n_samples))

The code above removes all features for which over 90% of the values are missing. 
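Since the Boston data has no missing values, the effect of `thresh` is easier to see on a small made-up frame. The sketch below uses a hypothetical `toy` DataFrame (not part of the original notebook) with one complete column, one half-missing column, and one column that is 95% missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 samples, three columns with different amounts of missing values
n_samples = 20
toy = pd.DataFrame({
    "complete": np.arange(n_samples, dtype=float),            # 0% missing
    "half_missing": [1.0, np.nan] * (n_samples // 2),         # 50% missing
    "mostly_missing": [5.0] + [np.nan] * (n_samples - 1),     # 95% missing
})

# Keep only columns with at least 10% non-missing values (here: at least 2 values)
min_non_missing = int(0.1 * n_samples)
filtered = toy.dropna(axis=1, thresh=min_non_missing)
kept = list(filtered.columns)

print(toy.isna().mean())  # fraction of missing values per column
print(kept)
```

Only `mostly_missing` falls below the minimum of two non-missing values, so it is the only column dropped.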

> Note that the approach above is not compatible with scikit-learn pipelines, so it should be applied even before data preparation ;)

A second example of a filter approach is based on variance: features whose variance is below a given threshold are filtered out. Scikit-learn provides this filter as the `VarianceThreshold` transformer. Let's first look at the variances in our data:

In [0]:
X.var()

> Hmm.. The variances are all over the place, so how can we set a common threshold? 

Remember that feature selection usually comes after data preparation, so at that point your numerical features should have been standardized to unit standard deviation (and hence unit variance). Let's set the threshold to 0.1, which keeps only features whose variance after scaling is at least 0.1, i.e. 10% of the unit variance:

> In the code below, we get the preserved column names using the `get_support()` method. This method returns a boolean mask that we can use to identify which columns have been kept and which have been discarded.

In [0]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline

var_selector = VarianceThreshold(threshold=0.1)
pipe = make_pipeline(MinMaxScaler(),
                     StandardScaler(),
                     var_selector)

pipe.fit(X)
column_mask = var_selector.get_support()
pd.DataFrame(pipe.transform(X), 
             columns=X.columns[column_mask])

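Note that no features have been discarded with a 0.1 threshold. This is expected: `StandardScaler` gives every non-constant feature a variance of exactly 1, so any threshold below 1 keeps them all. Since we don't have many features, there's no need to filter more aggressively here.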
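To see the variance filter actually discard something, the sketch below compares the two scalers on hypothetical synthetic data (the `toy` frame and the 0.05 threshold are illustrative assumptions, not part of the original notebook). With min-max scaling alone, features keep different variances, so a nearly constant feature can be told apart from the others:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
n = 200
# Hypothetical features with very different spreads
toy = pd.DataFrame({
    "spread": rng.uniform(0, 100, n),                          # evenly spread values
    "binary": rng.integers(0, 2, n).astype(float),             # roughly balanced 0/1 flag
    "near_constant": np.concatenate([np.zeros(n - 2), [1.0, 1.0]]),  # 0 in 99% of rows
})

# Min-max scaling maps every feature to [0, 1] but preserves differences in variance
minmax_scaled = MinMaxScaler().fit_transform(toy)
minmax_selector = VarianceThreshold(threshold=0.05).fit(minmax_scaled)
kept_minmax = list(toy.columns[minmax_selector.get_support()])

# Standardization forces unit variance, so the same threshold keeps everything
standardized = StandardScaler().fit_transform(toy)
standard_selector = VarianceThreshold(threshold=0.05).fit(standardized)
kept_standard = list(toy.columns[standard_selector.get_support()])

print(minmax_selector.variances_)  # variances seen by the filter after min-max scaling
print(kept_minmax)
print(kept_standard)
```

The fitted `variances_` attribute is a convenient way to pick a threshold before committing to one.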

## Wrapper approaches

This family of approaches builds models to understand feature importance. Overall, they can be categorized as **uni-** or **multivariate**.

### Univariate selection

Univariate selection is done using statistical tests on individual features. Test scores are used as measures of feature importance, allowing an absolute or relative number of features to be retrieved. Scikit-learn offers two selectors for this:

- Absolute number of features: `SelectKBest`
- Relative number of features: `SelectPercentile`

Let's see a practical example using two different statistical tests:

In [0]:
from sklearn.feature_selection import SelectKBest, f_regression
# k must be passed by keyword in recent scikit-learn versions
kbest_selector = SelectKBest(f_regression, k=5)
pd.DataFrame(kbest_selector.fit_transform(X, y), 
             columns=X.columns[kbest_selector.get_support()])

Let's discuss the example above for a minute. The statistical test applied is one of the options provided by scikit-learn for regression problems. It is based on the correlation between each input feature and the target, so we have to provide `y` as an argument to `fit_transform`. Let's compare the selected features above with the correlation map between input and target features to see that they match:

> If you missed the episode on cluster maps, check [pandas-zero](https://github.com/leobezerra/pandas-zero).

In [0]:
import matplotlib.pyplot as plt

full_data = pd.concat([X, y], axis=1)

# clustermap creates its own figure, so the size is passed to it directly
sns.clustermap(full_data.corr(), annot=True, cmap="Reds", figsize=(12, 10),
               cbar_pos=None, dendrogram_ratio=0.1, fmt='.2g')
plt.show()

Note that the five features that best correlate with the target are indeed `LSTAT` (-0.74), `RM` (0.7), `PTRATIO` (-0.51), `INDUS` (-0.48), and `TAX` (-0.47). We could have, instead, provided a statistical test based on mutual information between the input and target features. Let's see how that works, this time selecting a relative number of features:

In [0]:
from sklearn.feature_selection import SelectPercentile, mutual_info_regression
# percentile must be passed by keyword in recent scikit-learn versions
percentile_selector = SelectPercentile(mutual_info_regression, percentile=50)
pd.DataFrame(percentile_selector.fit_transform(X, y), 
             columns=X.columns[percentile_selector.get_support()])
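The link between `f_regression` and correlation can also be checked numerically through the selector's `scores_` attribute. The sketch below is a minimal check on hypothetical synthetic data from `make_regression` (the data and column names are assumptions for illustration). Because the F statistic grows monotonically with the squared correlation, ranking features by F score or by absolute Pearson correlation should give the same order:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical synthetic regression problem: 6 features, 3 of them informative
X_arr, y_arr = make_regression(n_samples=300, n_features=6, n_informative=3,
                               noise=10.0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
y_toy = pd.Series(y_arr, name="target")

selector = SelectKBest(f_regression, k=3).fit(X_toy, y_toy)

# Per-feature F scores computed by the selector, and absolute Pearson correlations
f_scores = pd.Series(selector.scores_, index=X_toy.columns)
abs_corr = X_toy.corrwith(y_toy).abs()

selected = set(X_toy.columns[selector.get_support()])
top_by_corr = set(abs_corr.nlargest(3).index)

print(pd.DataFrame({"F score": f_scores, "|r|": abs_corr}).sort_values("|r|"))
print(selected == top_by_corr)
```

The same `scores_` attribute is available on `SelectPercentile`, whichever score function is used.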

Scikit-learn offers several other statistics for regression and classification problems, but they would exceed the scope of this notebook. Let's move on to multivariate approaches.

### Multivariate selection

Approaches that build models from multiple input features are called multivariate. Selection based on multivariate approaches can be done in a single pass (one-shot) or iteratively.

#### One-shot selection

One-shot multivariate approaches fit a single model and select features based on the importance that model assigns to each feature. Scikit-learn provides a handful of estimators that compute feature importance internally and can therefore be used for this type of selection. Let's see a practical example of that:

In [0]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
oneshot_selector = SelectFromModel(RandomForestRegressor())
pd.DataFrame(oneshot_selector.fit_transform(X, y), 
             columns=X.columns[oneshot_selector.get_support()])

> That was brutal!

Let's understand what happened. First, we provided a random forest estimator to the `SelectFromModel` selector. Random forests are among the most used estimators for feature selection, alongside the following options:

| Approach | Regression | Classification |
| --- | --- | --- |
| L1-norm | `Lasso` | `LogisticRegression`, `LinearSVC` |
| Tree-based | `ExtraTreesRegressor` | `ExtraTreesClassifier` |
| | `RandomForestRegressor` | `RandomForestClassifier` |

Still, the choice of estimator alone doesn't explain such a brutal feature reduction. That happened because we didn't configure a threshold, so `SelectFromModel` used the mean feature importance as the selection threshold. Let's see what happens if we use the median feature importance instead:

In [0]:
oneshot_selector = SelectFromModel(RandomForestRegressor(),
                                   threshold="median")
pd.DataFrame(oneshot_selector.fit_transform(X, y), 
             columns=X.columns[oneshot_selector.get_support()])

Note that now we get about half of the features, and that they are not necessarily a superset of the features we would get with univariate selection. It's also possible to set a maximum number of features to be returned, using the `max_features` parameter:

In [0]:
oneshot_selector = SelectFromModel(RandomForestRegressor(),
                                   threshold="median",
                                   max_features=5)
pd.DataFrame(oneshot_selector.fit_transform(X, y), 
             columns=X.columns[oneshot_selector.get_support()])

> Check the documentation on `SelectFromModel` to see how to configure selection based only on `max_features`!
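As a hint for the note above, the scikit-learn documentation suggests setting `threshold=-np.inf` so that only `max_features` limits the selection. The sketch below illustrates this on hypothetical synthetic data (the data, column names, and forest settings are assumptions for illustration); the selector then simply keeps the features with the highest importances:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Hypothetical synthetic regression problem: 10 features, 4 of them informative
X_arr, y_arr = make_regression(n_samples=200, n_features=10, n_informative=4,
                               random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    threshold=-np.inf,   # disables the importance threshold
    max_features=3,      # so only this limit decides what is kept
)
selector.fit(X_toy, y_arr)

selected = list(X_toy.columns[selector.get_support()])
# Importances of the fitted forest, for comparison with the selection
importances = pd.Series(selector.estimator_.feature_importances_,
                        index=X_toy.columns)

print(importances.sort_values(ascending=False))
print(selected)
```

The fitted estimator is exposed as `estimator_`, which is handy for inspecting why a feature was kept or discarded.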

#### Iterative selection

In contrast to one-shot selection, where feature importance is computed from a single model, iterative selection fits a new model at each round and discards features between rounds, until the desired number of features remains. In scikit-learn, the `RFE` (recursive feature elimination) selector iteratively discards the least important feature:

In [0]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

iterative_selector = RFE(estimator=Lasso())
pd.DataFrame(iterative_selector.fit_transform(X, y), 
             columns=X.columns[iterative_selector.get_support()])

The `RFE` selector lets you set the total number of features to select (`n_features_to_select`) and the number of features to discard at each iteration (`step`):

In [0]:
iterative_selector = RFE(estimator=Lasso(),
                         n_features_to_select=5,
                         step=2)
pd.DataFrame(iterative_selector.fit_transform(X, y), 
             columns=X.columns[iterative_selector.get_support()])
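To see the order in which `RFE` eliminates features, one can inspect its `ranking_` attribute after fitting. The sketch below uses hypothetical synthetic data (the data and column names are assumptions for illustration): selected features get rank 1, and higher ranks mean earlier elimination. With 9 features, `n_features_to_select=3`, and `step=2`, three rounds are needed (9 → 7 → 5 → 3):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

# Hypothetical synthetic regression problem: 9 features, 3 of them informative
X_arr, y_arr = make_regression(n_samples=200, n_features=9, n_informative=3,
                               noise=5.0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(9)])

rfe = RFE(estimator=Lasso(), n_features_to_select=3, step=2)
rfe.fit(X_toy, y_arr)

# Rank 1 = selected; larger ranks were eliminated in earlier rounds
ranking = pd.Series(rfe.ranking_, index=X_toy.columns).sort_values()
selected = list(X_toy.columns[rfe.support_])

print(ranking)
print(selected)
```

A larger `step` means fewer model fits, at the price of a coarser ranking.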

### Adding wrappers to the pipeline

As we have discussed before, everything that happens in machine learning is part of a pipeline. In scikit-learn, wrapper selectors should also be added to pipelines, as we did with data preparation. Let's see how that would work using two selectors:

In [0]:
prep_sel_pipe = make_pipeline(MinMaxScaler(),
                              StandardScaler(),
                              var_selector,
                              iterative_selector
                              )

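prep_sel_pipe.fit(X, y)
# iterative_selector only sees the columns kept by var_selector,
# so its mask must be applied on top of the first one
kept_columns = X.columns[var_selector.get_support()][iterative_selector.get_support()]
pd.DataFrame(prep_sel_pipe.transform(X), columns=kept_columns)

Note that preserving feature names gets trickier with the addition of more components to the pipeline. Each selector's mask refers to the columns that selector received, so we apply the masks one after the other, in pipeline order. Combining them with the and operator `&` would only work while the first selector keeps every column, since otherwise the masks have different lengths.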

> Scikit-learn should make this easier!

## Embedded selection

In filter and wrapper approaches, feature selection is a step that comes prior to estimation (even if wrapper approaches may use auxiliary model fitting). In embedded approaches, feature selection is done internally by the estimator during model fitting. Not all estimators provide this possibility, though. Besides the estimators discussed in the wrapper approach section, another estimator that deserves to be mentioned is the multi-layer perceptron.
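As an illustration of embedded selection, the sketch below fits a `Lasso` model, one of the L1-norm estimators mentioned in the wrapper section, on hypothetical synthetic data where only the first three of eight features influence the target (the data, coefficients, and `alpha` value are assumptions for illustration). The L1 penalty drives the coefficients of uninformative features to exactly zero, so the selection happens as a side effect of fitting:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical data: 8 standard-normal features, only the first 3 affect the target
X_arr = rng.normal(size=(300, 8))
true_coef = np.array([5.0, -3.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_arr = X_arr @ true_coef + rng.normal(scale=0.5, size=300)

# Fitting the model is also the selection step: no separate selector is involved
lasso = Lasso(alpha=0.5).fit(X_arr, y_arr)

# Features with a non-zero coefficient are the ones the model effectively uses
selected_idx = np.flatnonzero(lasso.coef_)

print(np.round(lasso.coef_, 2))
print(selected_idx)
```

Note that the surviving coefficients are shrunk towards zero compared with the true ones, which is the price paid for the sparsity.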

Note that the different selection approaches offer different trade-offs between the level of information used for selection and the computational cost required to produce it. In principle, embedded approaches should be preferred if the application domain is favorable to the estimators that support them. Filter approaches are the computationally cheapest, but their selection capabilities are quite limited. Lastly, wrapper approaches are a very interesting alternative, with the particular methods for model fitting and selection chosen according to the problem at hand. An important thing to remember when using wrapper approaches, though, is to use cheap-to-compute auxiliary estimators, especially in iterative approaches.