# Feature Selection


* The ` sklearn.feature_selection ` module can be used for feature selection / dimensionality reduction.

* This can improve an estimator's accuracy score or its performance on high-dimensional data.

https://scikit-learn.org/stable/modules/feature_selection.html

## Removing features with low variance

* Variance thresholding is a simple baseline method that removes features with low variance.

* A single variance threshold is chosen and applied to every feature.

* ` VarianceThreshold ` removes every column whose variance does not exceed the given threshold.

* By default (` threshold = 0 `), ` VarianceThreshold ` removes only the columns with zero variance, that is, columns with the same value in every sample.



In [0]:
from sklearn.feature_selection import VarianceThreshold

In [2]:
X =  [[0,0,1], [0,1,0], [1,0,0], [0,1,1], [0,1,0], [0,1,1]]
sel = VarianceThreshold(threshold = 0.16)
sel.fit_transform(X)

array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
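
To see why the first column was dropped, the per-column variances can be checked directly. The following is a minimal sketch that recomputes them with NumPy (population variance, `ddof=0`) and compares them with the fitted `variances_` attribute. For a Boolean column with a fraction `p` of ones, the variance is `p * (1 - p)`, so the first column (one 1 in six samples) has variance 5/36, which is below 0.16.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

sel = VarianceThreshold(threshold=0.16).fit(X)

# Population variance of each column, the quantity compared with the threshold
manual_variances = X.var(axis=0)

print(manual_variances)   # variances computed by hand
print(sel.variances_)     # variances stored by the fitted selector
print(sel.get_support())  # boolean mask of the columns that are kept
```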

## Univariate Feature Selection

* Univariate feature selection works by selecting the best features based on univariate statistical tests.

* It can be seen as a preprocessing step before an estimator.

* Use ` SelectKBest ` and apply ` fit_transform `.

* ` SelectKBest ` removes all but the ` k ` highest scoring features.

* ` SelectPercentile ` removes all but a user-specified highest scoring percentage of features.

* Common univariate statistical tests can also be applied to each feature: false positive rate ` SelectFpr `, false discovery rate ` SelectFdr `, or family-wise error ` SelectFwe `.

* Let us apply the $ \chi^2 $ test to the samples to retrieve only the two best features.

In [0]:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [0]:
iris = load_iris()
X, y = iris.data, iris.target

In [6]:
print(X.shape)

(150, 4)


In [0]:
SB = SelectKBest(chi2, k=2)
X_new = SB.fit_transform(X,y)

In [8]:
print(X_new.shape)

(150, 2)
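
The shape alone does not say which two columns survived. As a small sketch, the fitted selector's `scores_`, `pvalues_` and `get_support()` can be lined up with the iris feature names; the variable name `selected_names` is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)

# One row per feature: chi-squared score, p-value, and whether it was kept
for name, score, p_value, kept in zip(
    iris.feature_names, selector.scores_, selector.pvalues_, selector.get_support()
):
    print(f"{name:20s} score={score:8.2f}  p={p_value:.2e}  kept={kept}")

selected_names = [
    name for name, kept in zip(iris.feature_names, selector.get_support()) if kept
]
print(selected_names)
```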


* These objects take a scoring function as input; the scoring function returns univariate scores and p-values (or only scores for ` SelectKBest ` and ` SelectPercentile `).

* Some guidelines for choosing a scoring function:

* For regression: ` f_regression `, ` mutual_info_regression `
* For classification: ` chi2 `, ` f_classif `, ` mutual_info_classif `

* The methods based on the F-test estimate the degree of linear dependency between two random variables.

* Mutual information methods can capture any kind of statistical dependency, but because they are nonparametric they require more samples for accurate estimation.
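
The difference between the F-test and mutual information can be illustrated on synthetic data. The sketch below assumes a made-up target that depends linearly on the first feature, through a sine on the second, and not at all on the third; the data and coefficients are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
# y depends linearly on x0, non-linearly on x1, and not at all on x2
y = X[:, 0] + np.sin(6 * np.pi * X[:, 1]) + 0.1 * rng.randn(1000)

f_scores, _ = f_regression(X, y)
mi_scores = mutual_info_regression(X, y, random_state=0)

# Normalise each set of scores by its maximum so they are comparable
print("F-test:", np.round(f_scores / f_scores.max(), 2))
print("MI:    ", np.round(mi_scores / mi_scores.max(), 2))
```

In this setup the F-test is expected to favour the linearly related feature, while mutual information also picks up the sinusoidal dependency.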

## Recursive Feature Elimination

* Given an external estimator that assigns weights to features, recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features.

* First, the estimator is trained on the initial set of features and the importance of each feature is obtained through the ` coef_ ` attribute or the ` feature_importances_ ` attribute.

* Then, the least important features are pruned from the current set of features.

* This procedure is repeated on the pruned set until the desired number of features is reached.
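
The procedure above can be sketched as a plain loop. This is a simplified illustration, not scikit-learn's implementation: the helper `manual_rfe` is hypothetical, it removes one feature per iteration, and it ranks features by the absolute value of ` coef_ `. Its result is compared with ` RFE ` on the same data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")


def manual_rfe(estimator, X, y, n_features_to_select):
    """Hypothetical helper: drop the weakest feature until enough remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Refit a fresh copy of the estimator on the surviving columns
        model = clone(estimator).fit(X[:, remaining], y)
        importances = np.abs(model.coef_).ravel()
        # Remove the column with the smallest absolute coefficient
        del remaining[int(np.argmin(importances))]
    return remaining


manual = manual_rfe(estimator, X, y, n_features_to_select=5)
reference = RFE(estimator, n_features_to_select=5, step=1).fit(X, y)

print(manual)
print(np.flatnonzero(reference.support_).tolist())
```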

### E.g. Recursive feature elimination

In [0]:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

In [0]:
X,y = make_friedman1(n_samples = 50, n_features=10, random_state = 0)
estimator = SVR(kernel = 'linear')

In [24]:
print(X.shape)

(50, 10)


In [0]:
# The estimator must expose a coef_ or feature_importances_ attribute after fitting
# estimator is the estimator used to rank the features
# n_features_to_select is the number of features to keep
# step is the number of features removed at each iteration
selector = RFE(estimator, n_features_to_select= 5, step=1)

In [0]:
selector = selector.fit(X,y)

In [27]:
# Use selector.support_ to display the mask of features, that is, which features were selected
print(selector.support_) 

[ True  True  True  True  True False False False False False]


In [28]:
# selector.ranking_[i] is the rank of the i-th feature
# Selected (best) features are ranked 1
print(selector.ranking_)

[1 1 1 1 1 6 4 3 2 5]


### E.g. Recursive feature elimination using cross-validation

Feature ranking with recursive feature elimination, where cross-validation selects the best number of features.

In [0]:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

In [0]:
X, y = make_friedman1(n_samples = 50, n_features = 10, random_state=0)
estimator = SVR(kernel = 'linear')
# cv is the cross-validation strategy; cv=5 requests 5-fold cross-validation
selector = RFECV(estimator, min_features_to_select=5, cv = 5)
selector = selector.fit(X,y)

In [31]:
selector.support_

array([ True,  True,  True,  True,  True, False, False, False, False,
       False])

In [32]:
selector.ranking_

array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
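
With ` RFECV `, ` min_features_to_select ` is only a lower bound; the number of features actually kept is chosen by cross-validation and stored in ` n_features_ `. The sketch below passes an explicit ` KFold ` splitter instead of the integer ` cv = 5 `; that choice is an assumption made here for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Explicit 5-fold splitter; an integer cv=5 could be passed instead
cv = KFold(n_splits=5)
selector = RFECV(SVR(kernel="linear"), min_features_to_select=5, cv=cv).fit(X, y)

# Number of features chosen by cross-validation (never below the minimum)
print(selector.n_features_)
print(selector.support_.sum())
```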

## Feature selection using SelectFromModel

* ` SelectFromModel ` is a meta-transformer that can be used with any estimator that exposes a ` coef_ ` or ` feature_importances_ ` attribute after fitting.

* Features are considered unimportant and removed if the corresponding ` coef_ ` or ` feature_importances_ ` values are below the provided ` threshold ` parameter.

* Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument such as "mean" or "median", optionally scaled by a float, as in "0.1*mean".
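
As a rough sketch of how the string thresholds behave, the loop below fits ` SelectFromModel ` on iris with a random forest (an estimator chosen here only for illustration) and records the resolved ` threshold_ ` together with the number of features kept. Since forest importances sum to 1, the "mean" threshold over four features should come out at 0.25.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

results = {}
for threshold in ["mean", "median", "1.25*mean"]:
    # Not prefit: SelectFromModel fits a clone of the forest itself
    selector = SelectFromModel(forest, threshold=threshold).fit(X, y)
    n_kept = selector.transform(X).shape[1]
    results[threshold] = (selector.threshold_, n_kept)
    print(f"{threshold:10s} threshold_={selector.threshold_:.4f}  kept={n_kept}")
```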


### L1-based feature Selection

* Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero.
* When the goal is to reduce the dimensionality of the data before using another classifier, they can be combined with ` feature_selection.SelectFromModel ` to select the features with non-zero coefficients.
* In particular, sparse estimators useful for this purpose are ` linear_model.Lasso ` for regression, and ` linear_model.LogisticRegression ` and ` svm.LinearSVC ` for classification.

In [0]:
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

In [0]:
iris = load_iris()

In [0]:
X,y = iris.data, iris.target

In [38]:
print(X.shape)

(150, 4)


In [0]:
lsvc = LinearSVC(C = 0.01, penalty = "l1", dual = False, max_iter = 2500)
lsvc = lsvc.fit(X,y)

In [0]:
# estimator: the base estimator whose coefficients or importances drive the selection
# prefit: whether an already fitted model is passed to the constructor.
# If True, transform must be called directly, and SelectFromModel cannot be used with
# cross_val_score, GridSearchCV and similar utilities that clone the estimator.
# Otherwise, train the model with fit and then call transform to do feature selection.
model = SelectFromModel(lsvc, prefit = True)

In [44]:
X_new = model.transform(X)
print(X_new.shape)

(150, 3)


* With SVMs and logistic-regression, the parameter C controls the sparsity: the smaller C the fewer features selected. 

* With Lasso, the higher the alpha parameter, the fewer features selected.
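
The effect of ` C ` on sparsity can be checked with a short loop. This sketch counts, for a few illustrative values of ` C `, how many iris features keep a non-negligible coefficient in at least one class; the cut-off ` 1e-5 ` and the fixed ` random_state ` are assumptions made here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

n_kept = {}
for C in [1.0, 0.1, 0.01]:
    lsvc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=5000, random_state=0)
    lsvc.fit(X, y)
    # Sum |coef| over classes; a feature survives if the sum is non-negligible
    n_kept[C] = int((np.abs(lsvc.coef_).sum(axis=0) > 1e-5).sum())
    print(f"C={C}: {n_kept[C]} features with non-zero coefficients")
```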

### Tree-based feature Selection

* Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features.

In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel

In [0]:
iris = load_iris()

In [0]:
X, y = iris.data, iris.target

In [49]:
print(X.shape)

(150, 4)


In [0]:
clf = RandomForestClassifier(n_estimators=100, n_jobs = -1, random_state=0)
clf = clf.fit(X,y)

In [53]:
clf.feature_importances_

array([0.09090795, 0.02453104, 0.46044474, 0.42411627])

In [0]:
model = SelectFromModel(clf, threshold = 0.3, prefit = True)
X_new = model.transform(X)

In [57]:
print(X_new.shape)

(150, 2)


## Feature Selection as Part of ML Pipeline

* Feature selection is usually used as a preprocessing step before the actual learning.

* The recommended way to do this is to use ` sklearn.pipeline.Pipeline `.

In [0]:
from sklearn.pipeline import Pipeline

In [0]:
clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(max_iter = 8000))),
    ('classification', RandomForestClassifier(n_estimators = 100))   
])

In [0]:
clf = clf.fit(X,y)

* In this snippet we make use of `sklearn.svm.LinearSVC` with ` SelectFromModel `.
* ` SelectFromModel ` selects the important features and passes them to `RandomForestClassifier`.
* `RandomForestClassifier` is trained only on the relevant features passed on by the previous pipeline step.
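
One practical benefit of wrapping the selector in a pipeline is that it can be cross-validated as a single estimator, so feature selection is refit on each training fold. The following is a minimal sketch on iris; ` dual = False ` and the fixed ` random_state ` are choices made here for illustration, not part of the snippet above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

clf = Pipeline([
    ("feature_selection", SelectFromModel(LinearSVC(dual=False, max_iter=8000))),
    ("classification", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Selection and classification are refit together on every training fold
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

# After a full fit, the selection step can be inspected by name
clf.fit(X, y)
mask = clf.named_steps["feature_selection"].get_support()
print(mask)
```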