# Automatic Feature Selection

Features play a key role in the performance of any machine learning model, and unimportant features can degrade it. Here we look at automatic feature selection techniques in scikit-learn.

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Three benefits of performing feature selection before modeling your data are:

1. __Reduces Overfitting__: Less redundant data means less opportunity to make decisions based on noise.
2. __Improves Accuracy__: Less misleading data means modeling accuracy improves.
3. __Reduces Training Time__: Less data means that algorithms train faster.

## Data

We will use the Pima Indians onset of diabetes dataset for the examples and implement five feature selection techniques.

In [1]:
import pandas
import numpy

# load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

basedataframe = pandas.read_csv("diabetes.csv", names=names)
basedataframe.head()

Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


# 1.  Removing features with low variance

__VarianceThreshold__ is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

As an example, suppose that we have a dataset with boolean features, and we want to remove all features that are either one or zero (on or off) in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such variables is given by

$$\mathrm{Var}[X] = p(1 - p)$$

so we can select using the threshold .8 * (1 - .8).

To test this, we add a new column to the dataset in which about 10% of the values are 1 and 90% are 0, and check whether VarianceThreshold removes it with the 80% threshold.
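As a quick check of the arithmetic, the sketch below compares the Bernoulli variance of such a column with the threshold. It assumes 768 rows (the size of this dataset) and plain NumPy; the variable names are illustrative only.

```python
import numpy as np

# Threshold for "same value in more than 80% of the samples".
threshold = 0.8 * (1 - 0.8)

# Synthetic column: every 10th value is 1, the rest are 0.
n_samples = 768  # assumed row count, taken from the dataset shape
newcol = (np.arange(n_samples) % 10 == 0).astype(int)

p = newcol.mean()               # fraction of ones
analytic_var = p * (1 - p)      # Bernoulli variance p(1 - p)
empirical_var = newcol.var()    # population variance (ddof=0)

print("threshold:", round(threshold, 4))
print("analytic variance:", round(analytic_var, 4))
print("empirical variance:", round(empirical_var, 4))
print("below threshold:", empirical_var < threshold)
```

Because the column's variance sits well below the threshold, a selector configured this way would be expected to drop it.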

In [2]:
# work on a copy so the extra column does not leak into basedataframe
data = basedataframe.copy()

# 1 in every 10th row, 0 elsewhere (about 10% ones)
data['newcol'] = (numpy.arange(len(data)) % 10 == 0).astype(int)

data.head(10)


Unnamed: 0,preg,plas,pres,skin,test,mass,pedi,age,class,newcol
0,6,148,72,35,0,33.6,0.627,50,1,1
1,1,85,66,29,0,26.6,0.351,31,0,0
2,8,183,64,0,0,23.3,0.672,32,1,0
3,1,89,66,23,94,28.1,0.167,21,0,0
4,0,137,40,35,168,43.1,2.288,33,1,0
5,5,116,74,0,0,25.6,0.201,30,0,0
6,3,78,50,32,88,31.0,0.248,26,1,0
7,10,115,0,0,0,35.3,0.134,29,0,0
8,2,197,70,45,543,30.5,0.158,53,1,0
9,8,125,96,0,0,0.0,0.232,54,1,0


In [3]:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(data)


array([[  6. , 148. ,  72. , ...,  33.6,  50. ,   1. ],
       [  1. ,  85. ,  66. , ...,  26.6,  31. ,   0. ],
       [  8. , 183. ,  64. , ...,  23.3,  32. ,   1. ],
       ...,
       [  5. , 121. ,  72. , ...,  26.2,  30. ,   0. ],
       [  1. , 126. ,  60. , ...,  30.1,  47. ,   1. ],
       [  1. ,  93. ,  70. , ...,  30.4,  23. ,   0. ]])

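Here you can see that newcol, the last column, has been eliminated because its variance (about 0.09) is below the threshold of 0.16. The pedi column is also missing from the output (in the first row, 33.6 is followed directly by 50.), so its variance falls below the threshold as well.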
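Reading the kept columns off a raw array is error-prone. A minimal sketch of mapping the selector's mask back to column names is shown below; it uses a hypothetical toy frame rather than the diabetes data, and the column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: one high-variance column, one balanced binary
# column, and one binary column that is 1 in only 10% of the rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "wide": rng.normal(0, 5, size=200),
    "balanced": np.arange(200) % 2,
    "rare": (np.arange(200) % 10 == 0).astype(int),
})

sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
reduced = sel.fit_transform(toy)

mask = sel.get_support()            # boolean mask, True for kept columns
kept = list(toy.columns[mask])
dropped = list(toy.columns[~mask])

print("kept:", kept)
print("dropped:", dropped)
print("variances:", dict(zip(toy.columns, sel.variances_.round(3))))
```

The same `get_support()` call on the selector fitted above would show which of the diabetes columns were removed.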

# 2. Univariate feature selection

Univariate feature selection works by selecting the features that show the strongest statistical relationship with the output variable. scikit-learn provides several routines that can help:

1. **SelectKBest** removes all but the k highest scoring features
2. **SelectPercentile** removes all but a user-specified highest scoring percentage of features
3. **GenericUnivariateSelect** performs univariate feature selection with a configurable strategy. This makes it possible to select the best univariate selection strategy with a hyper-parameter search estimator.


We also need a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile). For regression and classification there are separate scoring functions.

1. **For regression**: f_regression, mutual_info_regression
2. **For classification**: chi2, f_classif, mutual_info_classif
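To illustrate how the selectors and scoring functions listed above combine, here is a small sketch on hypothetical toy data (not the diabetes table). Column 0 is built to track the label and the other columns are noise; all values are non-negative because chi2 requires that.

```python
import numpy as np
from sklearn.feature_selection import (
    GenericUnivariateSelect, SelectPercentile, chi2, f_classif)

# Hypothetical toy classification data: 200 samples, 4 features.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
X = rng.uniform(0, 1, size=(200, 4))
X[:, 0] += 2 * y   # make column 0 strongly related to the label

# Keep the top 25% of features (1 of 4), ranked by the ANOVA F-score.
top = SelectPercentile(f_classif, percentile=25).fit(X, y)
print(top.get_support())

# The configurable wrapper: mode picks the strategy, param is its setting
# (here k=1 for the 'k_best' strategy), scored with chi2.
generic = GenericUnivariateSelect(chi2, mode="k_best", param=1).fit(X, y)
print(generic.get_support())
print(generic.scores_.round(2))
```

Both selectors should keep only column 0 here, since it is the only feature that carries information about the label.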



In the following example we use the SelectKBest method with the chi-squared scoring function.



In [4]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

data = basedataframe
datavalues = data.values
X = datavalues[:,0:8]
Y = datavalues[:,8]

print(X.shape)
print(Y.shape)


(768, 8)
(768,)


In [5]:
X_new = SelectKBest(chi2, k=4).fit_transform(X, Y)
print(X_new.shape)

(768, 4)


We can see that only the four highest scoring features remain. Let us look at the selected features:

In [6]:
print(X_new[0:5,:])

[[148.    0.   33.6  50. ]
 [ 85.    0.   26.6  31. ]
 [183.    0.   23.3  32. ]
 [ 89.   94.   28.1  21. ]
 [137.  168.   43.1  33. ]]


Comparing these values with the original table, it is clear that plas, test, mass and age are selected.

# 3. Recursive Feature Elimination

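Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on the features that remain. At each step it ranks the features using the fitted model's **coef_** or **feature_importances_** attribute and drops the weakest, which identifies the features (or combination of features) that contribute most to predicting the dependent variable.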

RFE works with any estimator that exposes **coef_** or **feature_importances_** after fitting. In the following example we use both Logistic Regression and SVC.


In [7]:
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

data = basedataframe
datavalues = data.values
X = datavalues[:,0:8]
Y = datavalues[:,8]

In [8]:
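# feature extraction using logistic regression

Logistic_model = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn releases
Logistic_rfe = RFE(Logistic_model, n_features_to_select=4)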
Logistic_fit = Logistic_rfe.fit(X, Y)

print("Num Features: ",Logistic_fit.n_features_)
print("Selected Features: ",Logistic_fit.support_)
print("Feature Ranking: ",Logistic_fit.ranking_)


Num Features:  4
Selected Features:  [ True  True False False False  True  True False]
Feature Ranking:  [1 1 2 4 5 1 1 3]


In [9]:
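# feature extraction using svc

svc = SVC(kernel="linear", C=1)
# n_features_to_select must be passed by keyword in recent scikit-learn releases
svc_rfe = RFE(svc, n_features_to_select=4)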
svc_fit = svc_rfe.fit(X, Y)

print("Num Features: ",svc_fit.n_features_)
print("Selected Features: ",svc_fit.support_)
print("Feature Ranking: ",svc_fit.ranking_)

Num Features:  4
Selected Features:  [ True  True False False False  True  True False]
Feature Ranking:  [1 1 2 4 5 1 1 3]


Both models identified the same features: preg, plas, mass and pedi.
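The boolean mask and ranking are easier to read next to feature names. The sketch below does this on hypothetical synthetic data from `make_classification`; the `f0` to `f7` names are placeholders, and `max_iter=1000` is an assumption added to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data standing in for the diabetes table.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

# step=1 drops one feature per iteration until four remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4, step=1)
rfe.fit(X, y)

# Rank 1 marks the kept features; larger ranks were eliminated earlier.
rows = sorted(zip(feature_names, rfe.support_, rfe.ranking_),
              key=lambda row: row[2])
for name, keep, rank in rows:
    print(f"{name}: selected={keep}, rank={rank}")
```

With the diabetes data, zipping the `names` list against `support_` in the same way would label the selected columns directly.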

# 4. Principal Component Analysis

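Principal component analysis (PCA) is a dimensionality reduction technique that uses the Singular Value Decomposition of the data to project it onto a lower dimensional space. A property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result. Unlike the other techniques here, PCA does not keep a subset of the original columns: each component is a linear combination of all the features.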

In [11]:
from sklearn.decomposition import PCA

# load data
data = basedataframe
datavalues = data.values
X = datavalues[:,0:8]
Y = datavalues[:,8]

# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)

# summarize components
print("Explained Variance:", fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.88854663 0.06159078 0.02579012]
[[-2.02176587e-03  9.78115765e-02  1.60930503e-02  6.07566861e-02
   9.93110844e-01  1.40108085e-02  5.37167919e-04 -3.56474430e-03]
 [-2.26488861e-02 -9.72210040e-01 -1.41909330e-01  5.78614699e-02
   9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
 [-2.24649003e-02  1.43428710e-01 -9.22467192e-01 -3.07013055e-01
   2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]


So we can see that the first 3 principal components capture about 97.6% of the variance in the data (0.889 + 0.062 + 0.026).
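The running total can be checked with a short sketch that reuses the numbers printed above (copied by hand, so nothing is refit here). It also looks up which original feature has the largest loading in the first component; the `names` list mirrors the column names used when loading the data.

```python
import numpy as np

# Explained variance ratios copied from the output above.
ratios = np.array([0.88854663, 0.06159078, 0.02579012])
cumulative = np.cumsum(ratios)
print(cumulative.round(4))  # running total after 1, 2 and 3 components

# First row of components_ copied from the output above.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
first_component = np.array([
    -2.02176587e-03, 9.78115765e-02, 1.60930503e-02, 6.07566861e-02,
    9.93110844e-01, 1.40108085e-02, 5.37167919e-04, -3.56474430e-03])

# Feature with the largest absolute loading in the first component.
dominant = names[int(np.argmax(np.abs(first_component)))]
print(dominant)
```

The first component is dominated by a single column, which is typical when PCA is applied to features on very different scales without standardizing them first.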

# 5. Feature Importance and Select from Model

__SelectFromModel__ is a meta-transformer that can be used along with any estimator that has a **coef_** or **feature_importances_** attribute after fitting. Features are considered unimportant and removed if the corresponding **coef_** or **feature_importances_** values are below the provided threshold parameter.

Tree-based estimators (see the __sklearn.tree module__ and forests of trees in the __sklearn.ensemble module__) can be used to compute feature importances, which in turn can be used to discard irrelevant features (when coupled with the __SelectFromModel__ meta-transformer).

In [17]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# load data
data = basedataframe
datavalues = data.values
X = datavalues[:,0:8]
Y = datavalues[:,8]

model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

[0.11669173 0.21368335 0.09742078 0.08665977 0.07444054 0.14084862
 0.12243711 0.1478181 ]


We can see that 5 features (preg, plas, mass, pedi and age) have an importance above 10%, while the remaining 3 fall below it. Only 3 features (plas, mass and age) have an importance above 13%, so if we pass a threshold of 0.13 the selector should keep only those 3 features.
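The counts can be verified directly from the importances printed above. This sketch copies those values by hand and assumes the feature order matches the `names` list used when loading the data.

```python
import numpy as np

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Importances copied from the output above.
importances = np.array([0.11669173, 0.21368335, 0.09742078, 0.08665977,
                        0.07444054, 0.14084862, 0.12243711, 0.1478181])

# For each candidate threshold, list the features that exceed it.
kept_by_threshold = {
    threshold: [n for n, v in zip(names, importances) if v > threshold]
    for threshold in (0.10, 0.13)
}
for threshold, kept in kept_by_threshold.items():
    print(threshold, len(kept), kept)
```

Since ExtraTreesClassifier is randomized, a rerun without a fixed `random_state` may give somewhat different importances and therefore different counts.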

In [19]:
target = SelectFromModel(model, threshold=.13, prefit=True)
X_new = target.transform(X)
X_new.shape

(768, 3)
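A reproducible variant might fix `random_state` and use a relative threshold instead of a hand-picked number. The sketch below is one way to do that on hypothetical synthetic data; the `"mean"` threshold keeps features whose importance is at least the average importance, and the `f0` to `f7` names are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical synthetic data standing in for the diabetes table.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # placeholder names

# A fixed random_state makes the importances repeatable between runs.
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Keep features whose importance is at least the mean importance.
selector = SelectFromModel(model, threshold="mean", prefit=True)
X_new = selector.transform(X)

mask = selector.get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]
print(X_new.shape, selected)
```

A relative threshold adapts to the number of features, whereas a fixed value such as 0.13 has to be revisited whenever the feature set changes.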