# **Feature selection Info**

* In feature selection we use the `sklearn.feature_selection` module to improve the model's accuracy score and to avoid the curse of dimensionality.

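* Always do the train/test split first, and then fit the feature selection technique on the training data only. Selecting features on the full dataset leaks information from the test set, which leads to overfitting and an over-optimistic score.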

* For more detail, see: https://scikit-learn.org/stable/modules/feature_selection.html
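The split-then-select order described above can be sketched as follows. This is a minimal illustration, not a recommended setup: the breast cancer dataset, `SelectKBest` with `f_classif`, and `k=10` are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Example data (assumption: any tabular classification dataset would do).
X, y = load_breast_cancer(return_X_y=True)

# 1) Split first, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2) Fit the selector on the training data only.
selector = SelectKBest(score_func=f_classif, k=10)
X_train_sel = selector.fit_transform(X_train, y_train)

# 3) Reuse the fitted selector on the test data (transform only, no refit).
X_test_sel = selector.transform(X_test)
```

The test set is only ever passed to `transform`, so it has no influence on which features are kept.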

## Different types of feature selection


1) Removing features with low variance (VarianceThreshold)
2) Univariate feature selection:

            a) For regression: f_regression, mutual_info_regression
            b) For classification: chi2, f_classif, mutual_info_classif


3) Correlation Matrix (Using Pearson correlation)
4) Recursive feature elimination
5) Feature selection using SelectFromModel:

            a) L1-based feature selection:
                    a.1) Regression : Lasso
                    a.2) Classification : LogisticRegression and Linear SVM
                    
            b) Tree-based feature selection
            
6) Sequential Feature Selection
7) Feature selection as part of a pipeline


## 1) Removing features with low variance (VarianceThreshold)

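* `VarianceThreshold` removes all features whose variance does not meet a given threshold. By default it removes only zero-variance features, i.e. features that have the same value in every sample.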
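A small sketch of `VarianceThreshold` on a hand-made array (the data is made up for illustration). The first column is constant, so it is the only one dropped with the default threshold of `0.0`.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): column 0 is constant, columns 1 and 2 vary.
X = np.array([
    [0, 2.0, 1],
    [0, 1.0, 4],
    [0, 3.0, 1],
    [0, 2.5, 3],
])

# threshold=0.0 is the default: drop features with zero variance.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

# Boolean mask of the columns that were kept.
kept = selector.get_support()
```

A higher `threshold` would also drop features that vary only slightly, but the right value depends on the scale of each feature.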

## 2) Univariate feature selection:

* Univariate feature selection works by selecting the best features based on univariate statistical tests.

* Here we extract the top features manually or by using the `SelectKBest` and `SelectPercentile` transformers.

* For regression we use the scoring functions `f_regression` and `mutual_info_regression`.

* For classification we use the scoring functions `chi2`, `f_classif` and `mutual_info_classif`.
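The selectors and scoring functions listed above can be combined as in the sketch below. The datasets (iris, diabetes) and the values of `k` and `percentile` are assumptions picked for illustration only.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.feature_selection import (
    SelectKBest,
    SelectPercentile,
    chi2,
    f_classif,
    f_regression,
)

# Classification: iris has 4 non-negative features, so chi2 is applicable.
X_clf, y_clf = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi2 score.
X_kbest = SelectKBest(score_func=chi2, k=2).fit_transform(X_clf, y_clf)

# Keep the top 50% of features ranked by the ANOVA F-value.
X_pct = SelectPercentile(score_func=f_classif, percentile=50).fit_transform(
    X_clf, y_clf
)

# Regression: diabetes has 10 features and a continuous target.
X_reg, y_reg = load_diabetes(return_X_y=True)
X_reg_best = SelectKBest(score_func=f_regression, k=3).fit_transform(X_reg, y_reg)
```

Note that `chi2` requires non-negative feature values, so it is not a drop-in choice for every dataset.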

## 3) Correlation Matrix (Using Pearson correlation)

* Correlation states how the features are related to each other or the target variable.

* Correlation can be positive (an increase in the feature's value goes with an increase in the target variable) or negative (an increase in the feature's value goes with a decrease in the target variable).

* A heatmap makes it easy to identify which features are most related to the target variable. We plot the heatmap of the correlation matrix using the seaborn library.
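A possible way to compute the Pearson correlations with pandas is sketched below. The diabetes dataset and the `0.4` cut-off are assumptions for illustration; the cut-off in particular is arbitrary and not a rule. The seaborn call is left as a comment so the snippet runs without a plotting backend.

```python
from sklearn.datasets import load_diabetes

# Load the data as a DataFrame that includes the "target" column.
df = load_diabetes(as_frame=True).frame

# Pearson correlation between every pair of columns (pandas default method).
corr = df.corr(method="pearson")

# Absolute correlation of each feature with the target, strongest first.
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)

# Keep features whose absolute correlation exceeds an arbitrary cut-off.
threshold = 0.4  # hypothetical value, tune for your data
strong_features = target_corr[target_corr > threshold].index.tolist()

# To plot the heatmap with seaborn:
# import seaborn as sns
# sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Pearson correlation only captures linear relations, so a feature with a low score can still be useful to a non-linear model.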

## 4) Recursive feature elimination

* The goal of RFE is to select features by recursively considering smaller and smaller sets of features.

* First, the estimator is trained on the initial set of features and the importance of each feature is obtained through `coef_` or `feature_importances_`.

* Then, the least important features are pruned from the current set of features. This procedure is repeated on the pruned set until the desired number of features is reached.
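The train, rank, prune loop described above is what `RFE` automates. The sketch below assumes a logistic regression as the wrapped estimator and `n_features_to_select=5`; both are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale so that coefficient magnitudes are comparable across features.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# step=1 removes one feature per iteration until 5 remain.
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    step=1,
)
rfe.fit(X_train_scaled, y_train)

selected_mask = rfe.support_  # True for the features that were kept
ranking = rfe.ranking_        # 1 = selected, larger = eliminated earlier
```

`RFECV` is the cross-validated variant, which can pick the number of features instead of requiring it up front.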

## 5) Feature selection using SelectFromModel:

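* `SelectFromModel` is a meta-transformer that wraps an estimator: it fits the model and keeps the features whose importance (taken from `coef_` or `feature_importances_`) is above a threshold.

* Two kinds of estimators are commonly used with it.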

### a) L1-based feature selection:

* Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero.

* Here we can use `SelectFromModel` to select the features with non-zero coefficients.

* For regression we use the `Lasso` model.

* For classification we use `LogisticRegression` and `LinearSVC` (linear SVM).
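A minimal sketch of L1-based selection for both cases. The datasets and the regularization strengths (`alpha=0.5`, `C=0.01`) are assumptions for illustration; stronger regularization (larger `alpha`, smaller `C`) generally keeps fewer features.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVC

# Regression: Lasso drives some coefficients to exactly zero.
X_reg, y_reg = load_diabetes(return_X_y=True)
reg_selector = SelectFromModel(Lasso(alpha=0.5)).fit(X_reg, y_reg)
X_reg_sel = reg_selector.transform(X_reg)

# Classification: linear SVM with an L1 penalty (requires dual=False).
X_clf, y_clf = load_iris(return_X_y=True)
svc = LinearSVC(C=0.01, penalty="l1", dual=False)
clf_selector = SelectFromModel(svc).fit(X_clf, y_clf)
X_clf_sel = clf_selector.transform(X_clf)

# Fitted coefficients of the wrapped Lasso model, for inspection.
lasso_coefs = reg_selector.estimator_.coef_
```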

### b) Tree-based feature selection

* Tree-based estimators (see the `sklearn.tree` module and the forests of trees in the `sklearn.ensemble` module) can be used to compute impurity-based feature importances.

* These importances can in turn be used to discard irrelevant features when coupled with the `SelectFromModel` meta-transformer.
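A short sketch of tree-based selection. The choice of `ExtraTreesClassifier` with 50 trees on the iris dataset is an assumption for illustration. With no explicit `threshold`, `SelectFromModel` keeps the features whose importance is at least the mean importance.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Fit a forest and let SelectFromModel read its feature_importances_.
forest = ExtraTreesClassifier(n_estimators=50, random_state=0)
selector = SelectFromModel(forest).fit(X, y)

# Impurity-based importances of the fitted forest (they sum to 1).
importances = selector.estimator_.feature_importances_

# Keep only the features at or above the mean importance.
X_selected = selector.transform(X)
```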

## 6) Sequential Feature Selection

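* Sequential Feature Selection (SFS) is available in the `SequentialFeatureSelector` transformer. SFS can be either forward or backward: forward selection starts with no features and greedily adds the feature that most improves the cross-validated score, while backward selection starts with all features and greedily removes them one at a time.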

* For more information, see: https://scikit-learn.org/stable/modules/feature_selection.html

* For a worked example, see: https://scikit-learn.org/stable/auto_examples/feature_selection/plot_select_from_model_diabetes.html#sphx-glr-auto-examples-feature-selection-plot-select-from-model-diabetes-py
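Forward and backward SFS can be sketched as below. The k-nearest-neighbors estimator, the iris dataset, and `n_features_to_select=2` are illustrative assumptions. `SequentialFeatureSelector` needs scikit-learn 0.24 or newer.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Forward: start from zero features and add the best one at each step.
sfs_forward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# Backward: start from all features and remove the worst one at each step.
sfs_backward = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

forward_mask = sfs_forward.get_support()
backward_mask = sfs_backward.get_support()
```

The two directions do not have to agree on the selected features. Unlike RFE and `SelectFromModel`, SFS does not need `coef_` or `feature_importances_`, but it fits many more models and is therefore slower.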

## 7) Feature selection as part of a pipeline

* Feature selection is usually used as a pre-processing step before doing the actual learning. The recommended way to do this in scikit-learn is to use a Pipeline.
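A minimal sketch of such a pipeline. The step names, the dataset, and the choice of `SelectKBest` followed by a logistic regression are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling, selection and the classifier are fitted together on the
# training data; the same fitted steps are then applied to the test data.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Accuracy on the held-out test set.
test_accuracy = pipe.score(X_test, y_test)
```

Because the selector lives inside the pipeline, tools such as `cross_val_score` or `GridSearchCV` refit it on each training fold, which keeps the selection step from seeing the validation data.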