# Feature Selection For Machine Learning

1. Univariate Selection.
2. Recursive Feature Elimination.
3. Principal Component Analysis.
4. Feature Importance.


Performing feature selection before modeling your data has three benefits:

- Reduces overfitting: less redundant data means fewer opportunities to make decisions based on noise.
- Improves accuracy: less misleading data means the model can predict more accurately.
- Reduces training time: fewer features mean that algorithms train faster.

In [401]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

## Read file using pandas

In [402]:
# The CSV has no header row, so the column names are supplied explicitly
filename = 'data/05/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(filename, names=names)
dataframe

,preg,plas,pres,skin,test,mass,pedi,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


## Univariate Selection


The scikit-learn library provides the [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) class, which scores each feature with a chosen statistical test and keeps the `k` highest-scoring features. The example below uses the chi-squared (`chi2`) test, which requires non-negative feature values, to select 2 features.
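Because `SelectKBest` accepts different score functions, the test can be swapped to suit the data. The sketch below is only an illustration on synthetic data (the names `X_demo`, `y_demo` and `f_selector` are made up for this example and are not part of the notebook): it uses the ANOVA F-test (`f_classif`), which accepts negative values, and shows that `chi2` rejects the same input.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Synthetic data (assumption for illustration): 4 features with negative values,
# where only feature 0 is shifted by the class label
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=100)
X_demo = rng.normal(size=(100, 4))
X_demo[:, 0] += 3 * y_demo

# The ANOVA F-test works with negative feature values
f_selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)
X_reduced = f_selector.transform(X_demo)
print(X_reduced.shape)

# chi2 is defined for non-negative features only and raises ValueError otherwise
try:
    SelectKBest(score_func=chi2, k=2).fit(X_demo, y_demo)
    chi2_rejected = False
except ValueError:
    chi2_rejected = True
print(chi2_rejected)
```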

In [403]:
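array = dataframe.values
# Columns 0-7 are the input features; column 8 is the class label
X = np.array(array[:,0:8], dtype=np.float32)
Y = np.array(array[:,8], dtype=np.float32)

# feature selection: keep the 2 features with the highest chi-squared scores
test = SelectKBest(score_func=chi2, k=2)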

In [404]:
# fit() returns the fitted selector itself (not transformed data), despite the name X_new
X_new = test.fit(X, Y)

# summarize scores
np.set_printoptions(precision=3)
print(X_new.scores_)

[ 111.52  1411.887   17.605   53.108 2175.565  127.669    5.393  181.304]


In [405]:
features = X_new.transform(X)
print(features.shape)

(768, 2)


In [406]:
# summarize selected features
print(features[0:5,:])

[[148.   0.]
 [ 85.   0.]
 [183.   0.]
 [ 89.  94.]
 [137. 168.]]
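The two highest scores above sit at positions 1 and 4, which correspond to the `plas` and `test` columns, and the first rows of the output match those columns in the dataframe. Rather than reading positions off by eye, the selector's `get_support()` mask can be applied to the column names. A minimal sketch on synthetic data follows; the dataframe `demo` and its columns `a` to `d` are hypothetical, not the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative data (assumption for illustration):
# columns "a" and "c" depend on the label, "b" and "d" are pure noise
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    "a": 10 * y_demo + rng.random(200),
    "b": rng.random(200),
    "c": 5 * y_demo + rng.random(200),
    "d": rng.random(200),
})

selector = SelectKBest(score_func=chi2, k=2).fit(demo, y_demo)

# get_support() returns a boolean mask over the input columns
selected_names = demo.columns[selector.get_support()].tolist()
print(selected_names)
```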


## Recursive Feature Elimination

In [399]:
# feature selection: RFE repeatedly fits the model and drops the weakest feature
model = LogisticRegression(solver='lbfgs', max_iter=200)

rfe = RFE(model, n_features_to_select=2)
fit = rfe.fit(X, Y)

In [400]:
print("Num Features:", fit.n_features_)
print("Selected Features", fit.support_)
print("Feature Ranking:", fit.ranking_)

Num Features: 2
Selected Features [ True False False False False False  True False]
Feature Ranking: [1 3 5 7 6 2 1 4]
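In the output above, `support_` is `True` at positions 0 and 6, that is, the `preg` and `pedi` columns, and `ranking_` gives 1 to the selected features and higher numbers to features that were eliminated earlier. The sketch below shows one way to pair the ranking with feature names. It runs on synthetic data with hypothetical names (`f0` to `f5`), and it standardizes the features first, which is a choice made here because logistic regression coefficients depend on feature scale; the notebook itself does not scale.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 6 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]

# Put all features on the same scale so coefficient sizes are comparable
X_scaled = StandardScaler().fit_transform(X_demo)

rfe_demo = RFE(LogisticRegression(max_iter=200), n_features_to_select=2)
rfe_demo.fit(X_scaled, y_demo)

# Names of the features that survived elimination
selected = [name for name, keep in zip(demo_names, rfe_demo.support_) if keep]
print("Selected:", selected)

# Rank 1 = selected; larger ranks were eliminated earlier
for name, rank in sorted(zip(demo_names, rfe_demo.ranking_), key=lambda p: p[1]):
    print(name, rank)
```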


## Principal Component Analysis

In [393]:
import numpy as np
from sklearn.decomposition import PCA
# Small toy dataset, kept in its own variable so the diabetes features in X are not overwritten
X_toy = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X_toy)

print(pca.explained_variance_ratio_)

print(pca.singular_values_)

[0.992 0.008]
[6.301 0.55 ]


In [392]:
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X_toy)

print(pca.explained_variance_ratio_)

print(pca.singular_values_)

[0.992 0.008]
[6.301 0.55 ]
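The two printed quantities are related: each component's variance is its squared singular value divided by `n_samples - 1`, and the explained variance ratio is that variance divided by the total. The sketch below recomputes both with plain NumPy on the same six-point toy array and should reproduce the values printed above (the variable names are chosen for this example only).

```python
import numpy as np
from sklearn.decomposition import PCA

# Same six 2-D points as the toy example
X_toy = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
pca_toy = PCA(n_components=2).fit(X_toy)

# PCA centers the data and then takes its singular value decomposition
X_centered = X_toy - X_toy.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Variance along each component is s**2 / (n_samples - 1)
variance = singular_values ** 2 / (X_toy.shape[0] - 1)
ratio = variance / variance.sum()

np.set_printoptions(precision=3)
print(singular_values)
print(ratio)
```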


In [386]:
# feature extraction: project the 8 diabetes features onto 3 principal components
pca = PCA(n_components=3)
fit = pca.fit(X)

# summarize components
print("Explained Variance:", fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.889 0.062 0.026]
[[-2.022e-03  9.781e-02  1.609e-02  6.076e-02  9.931e-01  1.401e-02
   5.372e-04 -3.565e-03]
 [-2.265e-02 -9.722e-01 -1.419e-01  5.786e-02  9.463e-02 -4.697e-02
  -8.168e-04 -1.402e-01]
 [-2.246e-02  1.434e-01 -9.225e-01 -3.070e-01  2.098e-02 -1.324e-01
  -6.400e-04 -1.255e-01]]
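In the components above, the first component has a loading of about 0.993 on the fifth column (`test`) and explains about 89% of the variance. This pattern is typical when one column has a much larger spread than the others, since PCA follows raw variance. A common remedy, sketched here on synthetic data rather than the diabetes set, is to standardize the features before fitting; `X_demo` and the scale factors are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption for illustration): 4 independent features,
# with the third one on a scale 1000 times larger than the rest
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4)) * np.array([1.0, 1.0, 1000.0, 1.0])

# Without scaling, the first component is dominated by the large-scale column
raw_pca = PCA(n_components=3).fit(X_demo)
print(raw_pca.explained_variance_ratio_)

# After standardizing, every feature has unit variance and none dominates
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_pca = PCA(n_components=3).fit(X_scaled)
print(scaled_pca.explained_variance_ratio_)

# transform() projects the data onto the 3 retained components
X_reduced = scaled_pca.transform(X_scaled)
print(X_reduced.shape)
```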


## Feature Importance

In [385]:
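# feature importance: tree ensembles expose one importance score per feature (the scores sum to 1)
# no random_state is set, so the scores vary slightly between runs
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)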

[0.11  0.233 0.099 0.082 0.076 0.136 0.12  0.144]
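The largest importance above (0.233) is in the second position, which corresponds to `plas`. To make such a reading less error-prone, the scores can be sorted and printed next to the feature names. The sketch below does this on synthetic data with hypothetical names (`f0` to `f5`) and fixes `random_state` so that repeated runs give the same scores, which the notebook cell above does not do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (assumption for illustration): 6 features, 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]

# Fixing random_state makes the importance scores reproducible
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

# Sort feature indices from most to least important
order = np.argsort(forest.feature_importances_)[::-1]
for i in order:
    print(f"{demo_names[i]}: {forest.feature_importances_[i]:.3f}")
```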
