# [How to Choose a Feature Selection Method For Machine Learning](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)

__Feature selection__ is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

Statistical-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measures depends on the data type of both the input and output variables.

What you will learn here:
1. There are two main types of feature selection techniques: supervised and unsupervised, and supervised methods may be divided into wrapper, filter and intrinsic.
2. Filter-based feature selection methods use statistical measures to score the correlation or dependence between each input variable and the target variable, and these scores are used to filter for the most relevant features.
3. Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.

This tutorial is divided into 4 parts; they are:

1. Feature Selection Methods
2. Statistics for Filter Feature Selection Methods
    - Numerical Input, Numerical Output
    - Numerical Input, Categorical Output
    - Categorical Input, Numerical Output
    - Categorical Input, Categorical Output
3. Tips and Tricks for Feature Selection
    - Correlation Statistics
    - Selection Method
    - Transform Variables
    - What Is the Best Method?
4. Worked Examples
    - Regression Feature Selection
    - Classification Feature Selection

## 1. Feature Selection Methods

__Feature selection__ methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable.

Some predictive modeling problems have a large number of variables that can slow the development and training of models and require a large amount of system memory. Additionally, the performance of some models can degrade when including input variables that are not relevant to the target variable.

One way to think about feature selection methods is in terms of supervised and unsupervised methods.
> The difference has to do with whether features are selected based on the target variable or not. Unsupervised feature selection techniques ignore the target variable, such as methods that remove redundant variables using correlation. Supervised feature selection techniques use the target variable, such as methods that remove irrelevant variables.
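
As a rough illustration of the unsupervised case, the sketch below drops columns that are nearly duplicates of columns already kept, using only pairwise correlation between inputs. The synthetic data, the greedy keep-first strategy, and the `threshold` value are all assumptions made for this example, not recommendations from the original article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three independent columns plus a near-copy of column 0.
X = rng.normal(size=(200, 3))
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=200)])

# Absolute pairwise correlations between input columns (no target involved).
corr = np.abs(np.corrcoef(X, rowvar=False))

threshold = 0.95  # illustrative cutoff only
keep = []
for j in range(X.shape[1]):
    # Keep column j only if it is not highly correlated with a column already kept.
    if all(corr[j, i] < threshold for i in keep):
        keep.append(j)

X_reduced = X[:, keep]
print(keep)
```

Which member of a correlated pair survives depends on column order here; a real pipeline might prefer a different tie-breaking rule.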

Another way to consider feature selection is by the mechanism used to select features, which may be divided into __wrapper__ and __filter__ methods. These methods are almost always supervised and are evaluated based on the performance of a resulting model on a hold out dataset.

__Wrapper feature selection methods__ create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. 
> RFE is a good example of a wrapper feature selection method.
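
A minimal sketch of RFE using scikit-learn's `RFE` class follows. The synthetic dataset, the choice of `LogisticRegression` as the wrapped estimator, and the number of features to keep are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data: 10 inputs, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)

# RFE repeatedly fits the estimator and removes the weakest feature
# until the requested number of features remains.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 marks a selected feature
X_selected = rfe.transform(X)
```

Because a model is refit at every elimination step, the cost grows with the number of features, which is the computational expense mentioned above.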

Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model.

> Filter methods evaluate the relevance of the predictors outside of the predictive models and subsequently model only the predictors that pass some criterion.

Finally, there are some machine learning algorithms that perform feature selection automatically as part of learning the model. We might refer to these techniques as __intrinsic feature selection methods__. The model will only include predictors that help maximize accuracy. In these cases, the model can pick and choose which representation of the data is best.

This includes algorithms such as penalized regression models like Lasso and decision trees, including ensembles of decision trees like random forest.
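
To make the intrinsic idea concrete, here is a small sketch with Lasso: the L1 penalty can drive some coefficients to exactly zero, so the features with nonzero coefficients are the ones the model effectively kept. The synthetic dataset and the `alpha` value are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data: 10 inputs, 3 of them informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=1)

# Fit a Lasso model; alpha controls the strength of the L1 penalty.
model = Lasso(alpha=1.0)
model.fit(X, y)

# Indices of features whose coefficients were not shrunk to zero.
selected = np.flatnonzero(model.coef_)
print(selected)
```

A larger `alpha` generally zeroes out more coefficients, so the number of retained features is tuned together with the model itself.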


#### Difference between Feature selection and dimensionality reduction

Feature selection is also related to dimensionality reduction techniques in that both methods seek fewer input variables to a predictive model. The difference is that __feature selection__ selects features to keep or remove from the dataset, whereas __dimensionality reduction__ creates a projection of the data resulting in entirely new input features. As such, dimensionality reduction is an alternative to feature selection rather than a type of feature selection.
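
The contrast can be sketched in a few lines: a selector returns a subset of the original columns unchanged, while PCA returns new columns built from all of them. The dataset and the choice of three output columns are assumptions for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=1)

# Feature selection: keep 3 of the original 8 columns.
selector = SelectKBest(score_func=f_classif, k=3)
X_sel = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

# Dimensionality reduction: project all 8 columns onto 3 new components.
X_pca = PCA(n_components=3).fit_transform(X)

# The selected columns are exact copies of original columns.
print(np.array_equal(X_sel, X[:, kept]))
print(X_sel.shape, X_pca.shape)
```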


## We can summarize feature selection as follows.

- Feature Selection: Select a subset of input features from the dataset.
    - Unsupervised: Do not use the target variable (e.g. remove redundant variables).
         - Correlation
    - Supervised: Use the target variable (e.g. remove irrelevant variables).
        - __Wrapper__: Search for well-performing subsets of features.
            - RFE
        - __Filter__: Select subsets of features based on their relationship with the target.
             - Statistical Methods
             - Feature Importance Methods
        - __Intrinsic__: Algorithms that perform automatic feature selection during training.
             - Decision Trees, Lasso/Elastic-net regression.

- Dimensionality Reduction: Project input data into a lower-dimensional feature space (e.g. PCA, SVD).

![](fs1.PNG)

## 2. Statistics for Filter-Based Feature Selection Methods

It is common to use correlation type statistical measures between input and output variables as the basis for filter feature selection.

As such, the choice of statistical measures is highly dependent upon the variable data types.

Common data types include numerical (such as height) and categorical (such as a label), although each may be further subdivided such as integer and floating point for numerical variables, and boolean, ordinal, or nominal for categorical variables.

Common input variable data types:

1. Numerical Variables
    - Integer Variables.
    - Floating Point Variables.
2. Categorical Variables.
    - Boolean Variables (dichotomous).
    - Ordinal Variables.
    - Nominal Variables.

![](fs2.PNG)

The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. As such, they are referred to as univariate statistical measures. This may mean that any interaction between input variables is not considered in the filtering process.

> Most of these techniques are univariate, meaning that they evaluate each predictor in isolation. In this case, the existence of correlated predictors makes it possible to select important, but redundant, predictors. The obvious consequences of this issue are that too many predictors are chosen and, as a result, collinearity problems arise.

With this framework, let’s review some univariate statistical measures that can be used for filter-based feature selection.

![](fs3.PNG)

#### Numerical Input, Numerical Output
This is a regression predictive modeling problem with numerical input variables.

The most common techniques are to use a correlation coefficient, such as Pearson’s for a linear correlation, or rank-based methods for a nonlinear correlation.

- Pearson’s correlation coefficient (linear).
- Spearman’s rank coefficient (nonlinear)
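
The difference between the two coefficients shows up on a relationship that is monotonic but not linear. The sketch below uses SciPy's `pearsonr` and `spearmanr` on made-up data where the target is an exponential function of the input.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Hypothetical input and a target that grows monotonically but nonlinearly.
x = rng.uniform(0, 5, size=200)
y = np.exp(x)

r, _ = pearsonr(x, y)      # measures linear association
rho, _ = spearmanr(x, y)   # measures monotonic association via ranks

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Spearman's coefficient reaches 1 here because the ranks of `x` and `y` agree perfectly, while Pearson's coefficient is lower because the relationship is not a straight line.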

#### Numerical Input, Categorical Output
This is a classification predictive modeling problem with numerical input variables.

This might be the most common example of a classification problem.

Again, the most common techniques are correlation based, although in this case, they must take the categorical target into account.

- ANOVA F-statistic (linear).
- Kendall’s rank coefficient (nonlinear).

Kendall’s rank coefficient does assume that the categorical variable is ordinal.
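
Both measures can be computed per feature as in the sketch below, which uses scikit-learn's `f_classif` for the ANOVA F-statistic and SciPy's `kendalltau` for Kendall's coefficient. The synthetic dataset is an assumption, and its binary target is treated as ordinal (0 < 1) for the Kendall calculation.

```python
from scipy.stats import kendalltau
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# Numerical inputs, binary categorical target.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=1)

# ANOVA F-statistic and p-value for each input column.
f_scores, p_values = f_classif(X, y)

# Kendall's tau for each input column against the class label.
taus = []
for j in range(X.shape[1]):
    tau, _ = kendalltau(X[:, j], y)
    taus.append(tau)

for j, (f, tau) in enumerate(zip(f_scores, taus)):
    print(f"feature {j}: F = {f:.2f}, tau = {tau:.3f}")
```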

#### Categorical Input, Numerical Output
This is a regression predictive modeling problem with categorical input variables.

This is a strange example of a regression problem (i.e. you would not encounter it often).

Nevertheless, you can use the same “Numerical Input, Categorical Output” methods (described above), but in reverse.
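
One way to read "in reverse" is to pass the numerical target where the numerical input would normally go, and the categorical input where the class label would go. The sketch below does this with `f_classif` on made-up data; the variable names and the data-generating process are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Hypothetical categorical inputs with 3 levels each.
category = rng.integers(0, 3, size=300)    # drives the target
noise_cat = rng.integers(0, 3, size=300)   # unrelated to the target

# Numerical target that depends on `category` only.
target = 2.0 * category + rng.normal(size=300)

# Reverse the usual roles: the numerical target is the "feature",
# the categorical input supplies the groups for the ANOVA.
f_related, _ = f_classif(target.reshape(-1, 1), category)
f_unrelated, _ = f_classif(target.reshape(-1, 1), noise_cat)

print(f_related[0], f_unrelated[0])
```

The F-statistic only depends on how the numerical values split across groups, so swapping the roles of input and output does not change the calculation.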

#### Categorical Input, Categorical Output
This is a classification predictive modeling problem with categorical input variables.

The most common correlation measure for categorical data is the chi-squared test. You can also use mutual information (information gain) from the field of information theory.

- Chi-Squared test (contingency tables).
- Mutual Information.

In fact, mutual information is a powerful method that may prove useful for both categorical and numerical data, i.e. it is agnostic to the data types.
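
A small sketch of both measures on integer-encoded categorical data follows. It uses SciPy's `chi2_contingency` on an explicit contingency table and scikit-learn's `mutual_info_classif` with `discrete_features=True`. The data and the `contingency` helper are hypothetical and written only for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Binary target, one categorical input that agrees with it about 90% of
# the time, and one unrelated categorical input with 3 levels.
y = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1
x_inf = np.where(flip, 1 - y, y)
x_noise = rng.integers(0, 3, size=n)


def contingency(a, b):
    """Count co-occurrences of integer-encoded categories a and b."""
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    np.add.at(table, (a, b), 1)
    return table


# Chi-squared test of independence on each contingency table.
chi_inf, p_inf, _, _ = chi2_contingency(contingency(x_inf, y))
chi_noise, p_noise, _, _ = chi2_contingency(contingency(x_noise, y))

# Mutual information between each discrete input and the target.
X = np.column_stack([x_inf, x_noise])
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)

print(chi_inf, chi_noise)
print(mi_scores)
```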

## 3. Tips and Tricks for Feature Selection
This section provides some additional considerations when using filter-based feature selection.

### Correlation Statistics
The scikit-learn library provides an implementation of most of the useful statistical measures.

For example:
- __Pearson’s Correlation Coefficient__: [f_regression()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html) (reports an F-statistic derived from the correlation)
- __ANOVA__: [f_classif()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html)
- __Chi-Squared__: [chi2()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html)
- __Mutual Information__: [mutual_info_classif()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) and [mutual_info_regression()](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html)

Also, the SciPy library provides an implementation of many more statistics, such as Kendall’s tau ([kendalltau](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html)) and Spearman’s rank correlation ([spearmanr](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html)).

### Selection Method
The scikit-learn library also provides many different filtering methods once statistics have been calculated for each input variable with the target.

Two of the more popular methods include:

- Select the top k variables: [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html)
- Select the top percentile variables: [SelectPercentile](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html)
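
Both selectors take a scoring function and are used the same way, as in this sketch. The synthetic regression dataset, `k=5`, and `percentile=25` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Synthetic regression data: 20 inputs, 5 of them informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       random_state=1)

# Keep the 5 highest-scoring features.
top_k = SelectKBest(score_func=f_regression, k=5)
X_k = top_k.fit_transform(X, y)

# Keep the top 25% of features by score.
top_pct = SelectPercentile(score_func=f_regression, percentile=25)
X_pct = top_pct.fit_transform(X, y)

print(top_k.get_support(indices=True))
print(X_k.shape, X_pct.shape)
```

Swapping `f_regression` for another scoring function such as `mutual_info_regression` changes the statistic without changing the rest of the code.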

### Transform Variables
Consider transforming the variables in order to access different statistical methods.

For example, you can transform a categorical variable to ordinal, even if it is not, and see if any interesting results come out.

You can also make a numerical variable discrete (e.g. by binning it) and try categorical-based measures.
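
As a sketch of the binning idea, the example below discretizes numerical inputs with scikit-learn's `KBinsDiscretizer` and then scores the binned columns with mutual information as if they were categorical. The data, the number of bins, and the uniform binning strategy are assumptions for this example.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)

# Hypothetical data: one input shifted by class, one pure-noise input.
y = rng.integers(0, 2, size=400)
x_related = rng.normal(loc=2.0 * y, scale=1.0)
x_noise = rng.normal(size=400)
X = np.column_stack([x_related, x_noise])

# Replace each numerical value with the index of its equal-width bin.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X).astype(int)

# Score the binned columns with a measure for discrete inputs.
mi = mutual_info_classif(X_binned, y, discrete_features=True, random_state=0)
print(mi)
```

The number of bins is itself a modeling choice: too few can hide a relationship, and too many can leave sparse categories.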

Some statistical measures assume properties of the variables, such as Pearson’s, which assumes a Gaussian probability distribution for the observations and a linear relationship. You can transform the data to meet the expectations of the test, or try the test regardless of the expectations, and compare the results.

### What Is the Best Method?
There is no best feature selection method, just as there is no best set of input variables or best machine learning algorithm. At least not universally.

Instead, you must discover what works best for your specific problem using careful systematic experimentation.

Try a range of different models fit on different subsets of features chosen via different statistical measures and discover what works best for your specific problem.
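
One possible shape for such an experiment is sketched below: each combination of scoring function and number of features is wrapped in a `Pipeline` and evaluated with cross-validation, so that selection is refit inside every fold. The dataset, the model, the candidate values of `k`, and the accuracy metric are all assumptions for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=1)

# Candidate scoring functions; the seed makes mutual information repeatable.
score_funcs = {
    "anova": f_classif,
    "mutual_info": partial(mutual_info_classif, random_state=0),
}

results = {}
for name, score_func in score_funcs.items():
    for k in (2, 4, 8, 15):
        pipeline = Pipeline([
            ("select", SelectKBest(score_func=score_func, k=k)),
            ("model", LogisticRegression(max_iter=1000)),
        ])
        # Mean accuracy over 5 folds for this (statistic, k) combination.
        scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
        results[(name, k)] = scores.mean()

best = max(results, key=results.get)
print(best, results[best])
```

Keeping the selector inside the pipeline avoids scoring features on data that is later used to evaluate the model.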