
In [0]:
from pandas import read_csv 
from numpy import set_printoptions 
from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2
from sklearn.feature_selection import RFE 
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier

In [0]:
# load data 
filename = '/content/diabetes_moddd.csv' 
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names=names) 
array = dataframe.values 
X = array[:,0:8] 
Y = array[:,8]

In [4]:
X

array([[  6.   , 148.   ,  72.   , ...,  33.6  ,   0.627,  50.   ],
       [  1.   ,  85.   ,  66.   , ...,  26.6  ,   0.351,  31.   ],
       [  8.   , 183.   ,  64.   , ...,  23.3  ,   0.672,  32.   ],
       ...,
       [  5.   , 121.   ,  72.   , ...,  26.2  ,   0.245,  30.   ],
       [  1.   , 126.   ,  60.   , ...,  30.1  ,   0.349,  47.   ],
       [  1.   ,  93.   ,  70.   , ...,  30.4  ,   0.315,  23.   ]])

# Univariate Selection

Statistical tests can be used to select the features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features. The example below uses the chi-squared (chi2) statistical test, which requires non-negative features, to select the 4 best features from the Pima Indians onset of diabetes dataset.


In [5]:
# feature extraction 
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y) 
# summarize scores 
set_printoptions(precision=3) 
fit.scores_

array([ 111.52 , 1411.887,   17.605,   53.108, 2175.565,  127.669,
          5.393,  181.304])

In [6]:
features = fit.transform(X)
# summarize selected features 
features[0:5,:]

array([[148. ,   0. ,  33.6,  50. ],
       [ 85. ,   0. ,  26.6,  31. ],
       [183. ,   0. ,  23.3,  32. ],
       [ 89. ,  94. ,  28.1,  21. ],
       [137. , 168. ,  43.1,  33. ]])
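
In the output above, the four highest scores sit at column indices 1, 4, 5 and 7 (plas, test, mass and age), which matches the columns returned by `transform`. Reading that off by eye is error-prone, so the sketch below pairs scores with names via `get_support()`. It runs on synthetic data rather than the diabetes CSV (an assumption made so the snippet is self-contained); the column names are reused only as labels, and `X_demo`, `y_demo`, `selected` and `ranked` are illustrative names.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the dataset (assumption): 200 rows, 8 non-negative features.
rng = np.random.default_rng(0)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
X_demo = rng.integers(0, 100, size=(200, 8)).astype(float)
# The label depends only on the 'plas' and 'mass' columns, so they should score highest.
y_demo = (X_demo[:, 1] + X_demo[:, 5] > 100).astype(int)

selector = SelectKBest(score_func=chi2, k=4).fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
selected = [name for name, keep in zip(names, selector.get_support()) if keep]

# Pair each score with its column name and sort from highest to lowest.
ranked = sorted(zip(names, selector.scores_), key=lambda pair: pair[1], reverse=True)

print("selected:", selected)
for name, score in ranked:
    print(f"{name}: {score:.1f}")
```

The same two lines (`get_support()` plus the sorted `zip`) should work unchanged on the `fit` object from the cell above.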

# Recursive Feature Elimination

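Recursive Feature Elimination (RFE) works by recursively removing attributes and building a model on the attributes that remain. In scikit-learn, RFE ranks the attributes by the fitted model's coefficients or feature importances and drops the weakest one at each step, which identifies the attributes (and combinations of attributes) that contribute the most to predicting the target attribute.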


In [10]:
# feature extraction 
model = LogisticRegression() 
rfe = RFE(model, n_features_to_select=3)  # keyword form; recent scikit-learn releases reject the positional argument
fit = rfe.fit(X, Y) 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [11]:
fit.n_features_

3

In [12]:
fit.support_
# Feature order: ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# RFE chose preg, mass and pedi as the top 3 features. They are marked True in the
# support_ array and given rank 1 in the ranking_ array.

array([ True, False, False, False, False,  True,  True, False])

In [13]:
fit.ranking_

array([1, 2, 4, 5, 6, 1, 1, 3])
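
The convergence warning printed above suggests scaling the data or raising `max_iter`. The sketch below does both before running RFE. It uses `make_classification` data instead of the diabetes CSV (an assumption, to keep the snippet runnable on its own), so the selected columns here say nothing about the real dataset. Because RFE with a linear model ranks features by coefficient size, putting the columns on a common scale also makes the coefficients more comparable.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo[:, 4] *= 1000.0  # give one column a much larger scale, as raw data often has

# Standardize each column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_demo)

# max_iter is raised as a second guard against the convergence warning.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_scaled, y_demo)

print(rfe.support_)   # True for the 3 kept features
print(rfe.ranking_)   # 1 for kept features, larger numbers were eliminated earlier
```

For a proper evaluation the scaler would be fitted on training data only; scaling the full array is done here purely to keep the sketch short.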

# Principal Component Analysis

Principal Component Analysis (PCA) uses linear algebra to transform the dataset into a compressed form. This is generally called a data reduction technique. A property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result. The example below uses PCA and selects 3 principal components.

In [16]:
# feature extraction 
pca = PCA(n_components=3) 
fit = pca.fit(X) 
# summarize components 
fit.explained_variance_ratio_

array([0.889, 0.062, 0.026])

In [17]:
fit.components_

array([[-2.022e-03,  9.781e-02,  1.609e-02,  6.076e-02,  9.931e-01,
         1.401e-02,  5.372e-04, -3.565e-03],
       [-2.265e-02, -9.722e-01, -1.419e-01,  5.786e-02,  9.463e-02,
        -4.697e-02, -8.168e-04, -1.402e-01],
       [-2.246e-02,  1.434e-01, -9.225e-01, -3.070e-01,  2.098e-02,
        -1.324e-01, -6.400e-04, -1.255e-01]])
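
In the output above, the first component explains 88.9% of the variance and its largest loading (9.931e-01) is on the fifth column. That pattern is what PCA tends to produce when one column has a much larger numeric range than the others. The sketch below illustrates the effect on synthetic data (an assumption, not the diabetes CSV) by fitting PCA with and without standardization; `X_demo`, `raw` and `scaled` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 8 independent columns, one on a far larger scale.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
X_demo[:, 4] *= 100.0

# PCA on the raw values: the large-scale column dominates the first component.
raw = PCA(n_components=3).fit(X_demo)

# PCA after standardizing every column to unit variance.
scaler = StandardScaler().fit(X_demo)
X_std = scaler.transform(X_demo)
scaled = PCA(n_components=3).fit(X_std)

print(raw.explained_variance_ratio_)
print(scaled.explained_variance_ratio_)
# Cumulative share of variance retained by the first 1, 2 and 3 components.
print(np.cumsum(scaled.explained_variance_ratio_))

# Project the data onto the 3 components to get the reduced dataset.
X_reduced = scaled.transform(X_std)
print(X_reduced.shape)
```

Whether to standardize depends on whether the original units are meaningful for the problem; the comparison is shown only to make the scale sensitivity visible.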

# Feature Importance

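Ensembles of decision trees such as Random Forest and Extra Trees can be used to estimate the importance of features. The example below fits an ExtraTreesClassifier and reads its feature_importances_ attribute, where larger values indicate more important features.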

In [21]:
model = ExtraTreesClassifier() 
model.fit(X, Y) 
model.feature_importances_

array([0.114, 0.234, 0.097, 0.083, 0.072, 0.14 , 0.117, 0.143])
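
Because Extra Trees builds its trees with random splits, the importances above will differ slightly from run to run. The sketch below fixes `random_state` for repeatable numbers and sorts the features by importance. It uses `make_classification` data with generic column names (both assumptions, chosen so the snippet is self-contained), not the diabetes CSV.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption): 8 features with generic, hypothetical names.
names = [f"f{i}" for i in range(8)]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# A fixed random_state makes the importances repeatable across runs.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X_demo, y_demo)

# Indices of the features, from most to least important.
order = np.argsort(model.feature_importances_)[::-1]
for i in order:
    print(f"{names[i]}: {model.feature_importances_[i]:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.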