```
From: https://github.com/ksatola
Version: 0.0.1

TODOs
1. https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e
2. https://pypi.org/project/rfpimp/
3. https://explained.ai/rf-importance/
4. https://medium.com/@vigneshmadanan/linear-regression-basics-and-regularization-methods-b40359b0aea5

```

# Feature Selection
We use feature selection to select features that are useful to the model. Irrelevant features may have a negative effect on a model. Correlated features can make coefficients in regression (or feature importance in tree models) unstable or difficult to interpret.

The `curse of dimensionality` is another issue to consider. As you increase the number of dimensions of your data, it becomes more sparse. This can make it difficult to pull out a signal unless you have more data. Neighbor calculations tend to lose their usefulness as more dimensions are added.
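A minimal sketch of why neighbor calculations degrade, assuming points drawn uniformly from the unit hypercube (the helper name `distance_contrast` and the chosen sizes are illustrative, not from the original notes). It measures how much farther the farthest point is than the nearest one, relative to the nearest distance. As the number of dimensions grows, this contrast tends to shrink, so "nearest" carries less information.

```python
import numpy as np


def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min of the distances from one query point to the rest."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


# Hypothetical dimension counts chosen only for illustration.
contrasts = {d: distance_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```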

Also, training time is usually a function of the number of columns (and sometimes it is worse than linear). If you can be concise and precise with your columns, you can have a better model in less time.

Statistical-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measures depends on the data type of both the input and output variables.

Perhaps the simplest case of feature selection is regression predictive modeling with numerical input variables and a numerical target. Here the strength of the relationship between each input variable and the target can be measured with correlation, and the resulting scores can be compared with one another.
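As a rough illustration of ranking numerical features by their correlation with a numerical target, the sketch below builds a small synthetic dataset (the column names `strong`, `weak`, `noise` and the coefficients are assumptions made up for this example) and sorts the features by absolute Pearson correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Synthetic features: only "strong" and "weak" actually drive the target.
X_demo = pd.DataFrame(
    {
        "strong": rng.normal(size=n),
        "weak": rng.normal(size=n),
        "noise": rng.normal(size=n),
    }
)
y_demo = 3.0 * X_demo["strong"] + 1.5 * X_demo["weak"] + rng.normal(size=n)

# Absolute Pearson correlation of each feature with the target, highest first.
corr_with_target = X_demo.corrwith(y_demo).abs().sort_values(ascending=False)
top_features = corr_with_target.head(2).index.tolist()

print(corr_with_target)
print("Selected:", top_features)
```

Pearson correlation only captures linear relationships, so a feature with a strong non-linear effect can still score low with this approach.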

Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.

Three **benefits of performing feature selection before modeling your data** are:

- **Reduces Overfitting:** Less redundant data means less opportunity to make decisions based on noise.
- **Improves Accuracy:** Less misleading data means modeling accuracy improves.
- **Reduces Training Time:** Less data means that algorithms train faster.

## Methods
`Feature selection` methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable. `Feature selection` is primarily focused on removing non-informative or redundant predictors from the model.

One way to think about feature selection methods is in terms of `supervised` and `unsupervised` methods. The difference has to do with whether features are selected based on the target variable or not. `Unsupervised feature selection` techniques ignore the target variable, such as methods that remove redundant variables using correlation. `Supervised feature selection` techniques use the target variable, such as methods that remove irrelevant variables.
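A possible sketch of unsupervised selection, which never looks at the target: first drop constant columns with scikit-learn's `VarianceThreshold`, then drop one column from each highly correlated pair. The synthetic columns and the 0.95 correlation cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)

# Synthetic data: "a_copy" is nearly a duplicate of "a", "constant" never varies.
X_demo = pd.DataFrame(
    {
        "a": a,
        "a_copy": a + rng.normal(scale=0.01, size=n),
        "b": rng.normal(size=n),
        "constant": np.ones(n),
    }
)

# Step 1: remove zero-variance features.
vt = VarianceThreshold(threshold=0.0).fit(X_demo)
X_var = X_demo.loc[:, vt.get_support()]

# Step 2: for each pair with |correlation| > 0.95, drop the later column.
corr = X_var.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X_var.drop(columns=to_drop)

print("Dropped for correlation:", to_drop)
print("Remaining:", list(X_reduced.columns))
```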

Another way to classify feature selection methods is by the mechanism used to select features, which divides them into `wrapper` and `filter` methods. These methods are almost always supervised and are evaluated based on the performance of a resulting model on a hold-out dataset.

`Wrapper feature selection` methods create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. [RFE](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) is a good example of a wrapper feature selection method. 

`Filter feature selection` methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model. Filter methods evaluate the relevance of the predictors outside of the predictive models and subsequently model only the predictors that pass some criterion.
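One way a filter method might look in scikit-learn is `SelectKBest` with the ANOVA F-test (`f_classif`) as the scoring function. The synthetic dataset and the choice of `k=3` are assumptions for this sketch. Note that no predictive model is trained during selection.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Score every feature against the target and keep the 3 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)
selected_idx = selector.get_support(indices=True)

print("F-scores:", selector.scores_.round(2))
print("Kept column indices:", selected_idx)
```

Because each feature is scored on its own, a filter like this can miss features that are only useful in combination with others.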

Finally, some machine learning algorithms perform feature selection automatically (built-in feature selection) as part of learning the model. We might refer to these techniques as `intrinsic/embedded feature selection` methods. In these cases, the model can pick and choose which representation of the data is best. This includes penalized regression models like `Lasso`, as well as `MARS` and `decision trees`, including `ensembles of decision trees` like `random forest`.
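A small sketch of intrinsic selection, assuming an L1-penalized model: `SelectFromModel` wraps `LassoCV` and keeps the features whose fitted coefficients are not (effectively) zero. The dataset is synthetic and the parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Synthetic regression data: 20 features, only 5 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=300,
    n_features=20,
    n_informative=5,
    noise=10.0,
    random_state=0,
)

# Fitting the Lasso performs the selection: weak coefficients are driven to zero.
selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X_demo, y_demo)
kept_idx = selector.get_support(indices=True)
X_selected = selector.transform(X_demo)

print("Non-zero coefficients:", int(np.count_nonzero(selector.estimator_.coef_)))
print("Kept column indices:", kept_idx)
```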

Feature selection is also related to `dimensionality reduction techniques` in that both methods seek fewer input variables to a predictive model. The difference is that feature selection selects features to keep or remove from the dataset, whereas dimensionality reduction creates a projection of the data resulting in entirely new input features. As such, **dimensionality reduction is an alternative to feature selection rather than a type of feature selection**.
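To make the distinction concrete, the sketch below reduces the same synthetic data to three columns in two ways: `SelectKBest` (feature selection) and `PCA` (dimensionality reduction). The helper `columns_come_from` is a hypothetical name introduced here to check whether the output columns are copies of the original ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif


def columns_come_from(candidate: np.ndarray, original: np.ndarray) -> bool:
    """Return True if every column of `candidate` equals some column of `original`."""
    return all(
        any(np.allclose(candidate[:, j], original[:, i]) for i in range(original.shape[1]))
        for j in range(candidate.shape[1])
    )


# Synthetic data, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

X_sel = SelectKBest(f_classif, k=3).fit_transform(X_demo, y_demo)  # keeps 3 original columns
X_pca = PCA(n_components=3).fit_transform(X_demo)  # builds 3 new columns

sel_is_subset = columns_come_from(X_sel, X_demo)
pca_is_subset = columns_come_from(X_pca, X_demo)
print("Selection keeps original columns:", sel_is_subset)
print("PCA keeps original columns:", pca_is_subset)
```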

**Feature Selection:** Select a subset of input features from the dataset.
- **Unsupervised:** Do not use the target variable (e.g. remove redundant variables).
    - Variation
    - Correlation (multicollinearity - between independent variables)
- **Supervised:** Use the target variable (e.g. remove irrelevant variables).
    - **Wrapper:** Search for well-performing subsets of features.
        - RFE
    - **Filter:** Select subsets of features based on their relationship with the target.
        - Correlation (between a feature and the target)
        - Statistical Methods
        - Feature Importance Methods
    - **Intrinsic:** Algorithms that perform automatic feature selection during training.
        - Linear Regression with regularization (Lasso, ElasticNet; Ridge only shrinks coefficients and does not set them to zero) - https://medium.com/@vigneshmadanan/linear-regression-basics-and-regularization-methods-b40359b0aea5
        - (Ensembles of) Decision Trees
        - XGBoost, CatBoost

**Dimensionality Reduction:** Project input data into a lower-dimensional feature space.

## Feature Selection Checklist
Isabelle Guyon and Andre Elisseeff, the authors of [An Introduction to Variable and Feature Selection](https://jmlr.csail.mit.edu/papers/volume3/guyon03a/guyon03a.pdf) (PDF), provide an excellent checklist that you can use the next time you need to select data features for your predictive modeling problem.

1. Do you have domain knowledge? If yes, construct a better set of ad hoc features.
1. Are your features commensurate? If no, consider normalizing them.
1. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
1. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.
1. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
1. Do you need a predictor? If no, stop.
1. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.
1. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion or use the 0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
1. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.
1. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps”.
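The last checklist item (redoing the analysis on several bootstraps) could be sketched as follows. This is one possible reading, not the procedure from the paper: it resamples rows with replacement, runs a simple filter (`SelectKBest` with `f_regression`) on each resample, and reports how often each feature is picked. The data, the number of rounds, and `k=3` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data, used only for illustration.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

rng = np.random.default_rng(0)
n_rounds = 50
counts = np.zeros(X_demo.shape[1])

for _ in range(n_rounds):
    # Bootstrap: draw row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    selector = SelectKBest(f_regression, k=3).fit(X_demo[idx], y_demo[idx])
    counts += selector.get_support()  # add 1 for each selected feature

# Fraction of bootstrap rounds in which each feature was selected.
selection_frequency = counts / n_rounds
print(selection_frequency.round(2))
```

Features selected in nearly every round are stable choices, while features that appear only occasionally are likely artifacts of a particular sample.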

## Lasso Regression
If you use lasso regression, you can set an `alpha` parameter that controls the strength of regularization. As you increase the value, the coefficients of less important features shrink, and some reach exactly zero. Here we use the `LassoLarsCV` model to iterate over various values of alpha and track the feature coefficients.

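In [68]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

# Assumes X (a pandas DataFrame of features) and its train split
# X_train, y_train were created earlier in the notebook.
model = linear_model.LassoLarsCV(cv=10, max_n_alphas=10).fit(X_train, y_train)

fig, ax = plt.subplots(figsize=(12, 8))
# One distinct color per feature, sampled from the tab20 colormap.
cm = iter(plt.get_cmap("tab20")(np.linspace(0, 1, X.shape[1])))

# Plot each feature's coefficient along the regularization path.
for i in range(X.shape[1]):
    c = next(cm)
    ax.plot(
        model.alphas_,
        model.coef_path_.T[:, i],
        c=c,
        alpha=0.8,
        label=X.columns[i],
    )

# Mark the alpha chosen by cross-validation.
ax.axvline(
    model.alpha_,
    linestyle="-",
    c="k",
    label="alphaCV",
)

ax.set_ylabel("Regression Coefficients")
ax.set_xlabel("alpha")
ax.set_title("Regression Coefficients Progression for Lasso Paths")
# Build the legend from the labels set above so that "alphaCV" is included.
ax.legend(bbox_to_anchor=(1, 1))
fig.savefig("images/feature_selection_regression_coefs.png", dpi=300)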

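Since the cell above depends on data defined elsewhere, here is a self-contained sketch of the same idea on synthetic data. It uses plain `Lasso` at a handful of hand-picked `alpha` values (an assumption for illustration, not the `LassoLarsCV` path used above) and counts how many coefficients remain non-zero at each one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 8 features, 3 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)  # put features on a common scale

# Hypothetical grid of regularization strengths, from weak to very strong.
alphas = [0.01, 0.1, 1.0, 10.0, 1000.0]
nonzero_counts = []
for alpha in alphas:
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo).coef_
    nonzero_counts.append(int(np.count_nonzero(coef)))
    print(f"alpha={alpha:>7}: {nonzero_counts[-1]} non-zero coefficients")
```

Features whose coefficients survive at larger values of `alpha` are the ones the lasso treats as most important.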

## Recursive Feature Elimination
Recursive feature elimination (RFE) fits a model, ranks the features, removes the weakest ones, and then refits on the remaining features, repeating until the requested number is left. To rank the features it needs a scikit-learn model that exposes a `.coef_` or `.feature_importances_` attribute. We will use recursive feature elimination to find the 4 most important features.

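In [None]:
from sklearn import ensemble
from sklearn.feature_selection import RFE

# Assumes X (a pandas DataFrame of features) and y were created earlier in the notebook.
model = ensemble.RandomForestClassifier(n_estimators=100)
# n_features_to_select must be passed by keyword in recent scikit-learn releases.
rfe = RFE(model, n_features_to_select=4)
rfe.fit(X, y)
# rfe.support_ is a boolean mask over the columns of X, the data RFE was fit on.
X.columns[rfe.support_]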
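For a version that runs without the notebook's data, the sketch below applies RFE to a synthetic dataset. The column names `f0` to `f9` and the forest size are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic classification data wrapped in a DataFrame so columns have names.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

# Repeatedly fit the forest and drop the least important feature until 4 remain.
rfe = RFE(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=4,
)
rfe.fit(X_demo, y_demo)

selected_columns = X_demo.columns[rfe.support_]
# ranking_ is 1 for kept features; larger values were eliminated earlier.
ranking = pd.Series(rfe.ranking_, index=X_demo.columns).sort_values()

print("Selected:", list(selected_columns))
print(ranking)
```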