### Objective: Study feature selection techniques

Dataset used: Pima Indians onset of diabetes dataset <br>
Problem: This is a binary classification problem in which all input variables are numeric.<br>

This notebook covers four automatic feature selection techniques:

- Univariate Selection.
- Recursive Feature Elimination.
- Principal Component Analysis.
- Feature Importance.

In [1]:
import pandas as pd
import numpy as np

In [6]:
df = pd.read_csv('./Data/Feature_selection_Prima_indiana.csv',sep=",",header=None)

In [7]:
df.head()

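|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |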


### 1. Univariate Selection
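Statistical tests can be used to select the features that have the strongest relationship with the output variable. The example here uses the chi-squared test (`chi2`), which requires non-negative feature values, and keeps the 4 highest-scoring features.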

In [8]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [9]:
df.columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

In [12]:
X = df.loc[:,'preg':'age']
Y = df.loc[:,'class']

In [13]:
# feature extraction: keep the 4 features with the highest chi-squared scores
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

In [14]:
# summarize scores - see the fit scores for each feature
np.set_printoptions(precision=3)
print(fit.scores_)

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]
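To make scores like these less opaque, here is a minimal NumPy sketch of how a chi-squared score of this kind can be computed: each feature's per-class totals are compared with the totals expected if the feature were independent of the class. The helper name `chi2_scores` and the toy arrays are illustrative assumptions, not part of the notebook, and the sketch does not handle features whose values are all zero.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic per feature, treating non-negative values as counts."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if (X < 0).any():
        raise ValueError("chi-squared scoring needs non-negative features")
    classes = np.unique(y)
    # One-hot encode the labels: shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                       # per-class total of each feature
    feature_total = X.sum(axis=0)            # overall total of each feature
    class_prob = Y.mean(axis=0)              # share of samples in each class
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data (made up): column 0 differs between classes, column 1 does not
X_toy = np.array([[1, 5], [2, 5], [8, 5], [9, 5]])
y_toy = np.array([0, 0, 1, 1])
scores = chi2_scores(X_toy, y_toy)
print(scores)
```

A feature that is distributed identically across classes scores 0, while a feature whose totals differ strongly between classes scores high. Because the statistic is built from raw totals, features measured on larger scales tend to produce larger scores.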


In [16]:
# Select the features (keeps the 4 highest-scoring columns)
selected_features = fit.transform(X)
selected_features[0:5, :]

array([[148. ,   0. ,  33.6,  50. ],
       [ 85. ,   0. ,  26.6,  31. ],
       [183. ,   0. ,  23.3,  32. ],
       [ 89. ,  94. ,  28.1,  21. ],
       [137. , 168. ,  43.1,  33. ]])
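The transformed array drops the column names, so it is not obvious which features survived. One way to recover them is the selector's `get_support()` mask. The sketch below uses a small synthetic frame (the column names `noise_a`, `signal`, `noise_b` and the data are made up for illustration) rather than the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative data: only "signal" depends on the class label
rng = np.random.default_rng(0)
y_demo = np.array([0] * 50 + [1] * 50)
X_demo = pd.DataFrame({
    "noise_a": rng.integers(0, 10, size=100),
    "signal": y_demo * 20 + rng.integers(0, 5, size=100),
    "noise_b": rng.integers(0, 10, size=100),
})

selector = SelectKBest(score_func=chi2, k=1).fit(X_demo, y_demo)

# Boolean mask over the input columns: True where the column was kept
mask = selector.get_support()
selected_names = list(X_demo.columns[mask])

# Scores labelled by column name, highest first
score_table = pd.Series(selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(selected_names)
print(score_table)
```

Applied to this notebook, `X.columns[fit.get_support()]` should name the four retained columns. Judging by the scores printed earlier, these would be `plas`, `test`, `mass` and `age`.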

### 2. Recursive Feature Elimination
The Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.<br>

In scikit-learn, RFE ranks the attributes by the fitted model's coefficients or feature importances and drops the weakest ones at each step, until the requested number of attributes remains.
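The elimination loop can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: it assumes a binary problem, drops one feature per round, and ranks features by the absolute value of the logistic regression coefficient. The function name `manual_rfe` and the synthetic data are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def manual_rfe(X, y, n_keep):
    """Repeatedly refit and drop the feature with the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))   # column indices still in play
    eliminated = []                       # indices in the order they were dropped
    while len(remaining) > n_keep:
        model = LogisticRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_[0])))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated

# Synthetic data: the label depends on columns 0 and 2; columns 1 and 3 are noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 2] + 0.3 * rng.normal(size=400) > 0).astype(int)

kept, dropped = manual_rfe(X_demo, y_demo, n_keep=2)
print("kept:", kept, "dropped:", dropped)
```

Coefficient magnitudes are only comparable when the features share a similar scale, which holds for this synthetic data but not necessarily for raw measurements.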

In [17]:
# Example for RFE - with Logistic Regression to select the top 3 features
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [26]:
# feature extraction
model = LogisticRegression()
# step=5 removes 5 features per iteration, so the 8 features are cut to 3 in a single pass
rfe = RFE(model, n_features_to_select=3, step=5, verbose=3)

fit = rfe.fit(X, Y)

Fitting estimator with 8 features.


In [27]:
print("Num Features : %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features : 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 2 2 2 1 1 2]
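The ranking above contains only the values 1 and 2 because `step=5` discards all five rejected features in one pass, so they tie. A smaller step gives a finer ordering of the rejected features. The sketch below contrasts the two settings on synthetic data (the arrays and variable names are made up for illustration).

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 4 features, the label depends on columns 0 and 2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 2] > 0).astype(int)

# Large step: every rejected feature is removed in the same pass and shares rank 2
rfe_coarse = RFE(LogisticRegression(), n_features_to_select=2, step=5).fit(X_demo, y_demo)

# step=1: one feature is removed per pass, so rejected features get distinct ranks
rfe_fine = RFE(LogisticRegression(), n_features_to_select=2, step=1).fit(X_demo, y_demo)

print("step=5 ranking:", rfe_coarse.ranking_)
print("step=1 ranking:", rfe_fine.ranking_)
```

With `step=1` the feature ranked last is the first one eliminated, at the cost of refitting the model once per removed feature.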


In [30]:
# Select the features (keeps the 3 columns chosen by RFE)
selected_features = fit.transform(X)
selected_features[0:5, :]

array([[ 6.   , 33.6  ,  0.627],
       [ 1.   , 26.6  ,  0.351],
       [ 8.   , 23.3  ,  0.672],
       [ 1.   , 28.1  ,  0.167],
       [ 0.   , 43.1  ,  2.288]])

### 3. Principal Component Analysis
Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form.<br>

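This is generally called a data reduction technique: the principal components are linear combinations of the original attributes rather than a subset of them.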

In [31]:
# The example below uses PCA to extract 3 principal components
from sklearn.decomposition import PCA

In [32]:
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)

In [36]:
# Summarize components
print("Explained Variance : %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance : [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]
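In the output above, the first component's largest loading (0.993) sits on the fifth column, `test`, and that component alone accounts for 0.889 of the variance. A likely reason is that PCA on unscaled data is dominated by whichever feature has the largest variance. The NumPy sketch below illustrates the effect on synthetic data; the helper name `pca_explained_variance_ratio` and the arrays are assumptions made for the example.

```python
import numpy as np

def pca_explained_variance_ratio(X, n_components):
    """PCA via SVD of the centered data; returns variance ratios and components."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    variance = s ** 2 / (len(X) - 1)
    return (variance / variance.sum())[:n_components], vt[:n_components]

# Synthetic data: two small-scale columns and one column with a much larger scale
rng = np.random.default_rng(0)
small = rng.normal(0, 1, size=(200, 2))
large = rng.normal(0, 100, size=(200, 1))
X_demo = np.hstack([small, large])

# Raw data: the large-scale column dominates the first component
ratio_raw, comps_raw = pca_explained_variance_ratio(X_demo, 1)

# Standardized data: each column contributes on an equal footing
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
ratio_std, _ = pca_explained_variance_ratio(X_std, 1)

print("raw:", ratio_raw, comps_raw)
print("standardized:", ratio_std)
```

If the components are meant to reflect structure across all attributes rather than measurement units, standardizing the columns before fitting PCA is a common choice.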


In [34]:
# Project the data onto the 3 principal components
selected_features = fit.transform(X)
selected_features[0:5, :]

array([[-7.571e+01, -3.595e+01, -7.261e+00],
       [-8.236e+01,  2.891e+01, -5.497e+00],
       [-7.463e+01, -6.791e+01,  1.946e+01],
       [ 1.108e+01,  3.490e+01, -5.302e-02],
       [ 8.974e+01, -2.747e+00,  2.521e+01]])

### 4. Feature Importance
Ensembles of decision trees such as Random Forest and Extra Trees can be used to estimate the importance of features.


In [37]:
# ExtraTreesClassifier 
from sklearn.ensemble import ExtraTreesClassifier

In [38]:
# feature extraction (no random_state is set, so the importances vary slightly between runs)
model = ExtraTreesClassifier()
model.fit(X,Y)
print(model.feature_importances_)

[0.109 0.235 0.095 0.077 0.083 0.127 0.127 0.147]
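The raw array is easier to read when each importance is paired with its column name and sorted. The sketch below does this on synthetic data, with `random_state` fixed so the result is repeatable; the column names, the data and the choice of `n_estimators=100` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: only the "signal" column determines the label
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["noise_a", "signal", "noise_b"])
y_demo = (X_demo["signal"] > 0).astype(int)

# Fixing random_state makes the importances reproducible across runs
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Label each importance with its column name and sort, highest first
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(ranked)
```

In this notebook the equivalent would be `pd.Series(model.feature_importances_, index=X.columns)`. In the output above the largest value, 0.235, belongs to the second column, `plas`.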
