The motivation for this notebook is that the features we include in our ML models have a huge influence on performance. Thus, it pays to exclude irrelevant features!

In this notebook, we will cover several methods for doing so: univariate selection with a chi-squared test, recursive feature elimination (RFE), PCA, and feature importances from an extra-trees classifier.

Feature selection should occur before modelling!

Good feature selection will:

- reduce overfitting (fewer irrelevant features means less noise to fit)
- often improve accuracy
- reduce training time (for very large datasets)

Let's use the Pima diabetes dataset for our walkthrough.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
np.set_printoptions(precision=3)

In [2]:
diabetes = pd.read_csv("pima-indians-diabetes.data.csv", header=None)

In [3]:
cols = ["Number of times pregnant.",
"Plasma glucose concentration a 2 hours in an oral glucose tolerance test.",
"Diastolic blood pressure (mm Hg).",
"Triceps skinfold thickness (mm).",
"2-Hour serum insulin (mu U/ml).",
"Body mass index (weight in kg/(height in m)^2).",
"Diabetes pedigree function.",
"Age (years).",
"Class variable (0 or 1)."] #obtained from UCI website. Simple copy paste.

# Map each column position (0-8) to its description.
final_reference = dict(enumerate(cols))

In [4]:
final_reference

{0: 'Number of times pregnant.',
 1: 'Plasma glucose concentration a 2 hours in an oral glucose tolerance test.',
 2: 'Diastolic blood pressure (mm Hg).',
 3: 'Triceps skinfold thickness (mm).',
 4: '2-Hour serum insulin (mu U/ml).',
 5: 'Body mass index (weight in kg/(height in m)^2).',
 6: 'Diabetes pedigree function.',
 7: 'Age (years).',
 8: 'Class variable (0 or 1).'}

In [5]:
diabetes.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
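Since the file is read with header=None, the columns are only known by position. As a minimal sketch of an alternative, short column names can be passed to read_csv through its names parameter. The short labels below are hypothetical shorthand (they are not part of the original file), and the five rows are the ones shown by diabetes.head(), inlined so the sketch runs without the CSV.

```python
import io

import pandas as pd

# Hypothetical short labels for the nine columns (not part of the original file).
short_names = ["pregnancies", "glucose", "blood_pressure", "skinfold",
               "insulin", "bmi", "pedigree", "age", "outcome"]

# The first five rows shown above, inlined so the sketch runs without the CSV file.
sample_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)

named = pd.read_csv(sample_csv, header=None, names=short_names)

# Columns can now be picked by name instead of by position.
print(named[["glucose", "insulin", "bmi", "age"]])
```

The rest of this notebook keeps the positional columns, so nothing below depends on these names.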


# **Univariate Selection via Stats**

The scikit-learn library provides us with a host of tools for this.

Let's start with the chi-squared statistical test, which will tell us which features have the strongest relationship with our output (label) variable. Note that scikit-learn's chi2 only accepts non-negative feature values.

We will use the SelectKBest class that can be used with different **statistical tests** to select a specified number of features.
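To make the score less of a black box, here is a rough sketch of what the chi-squared score computes, checked against scikit-learn's chi2 on synthetic data. The helper chi2_by_hand and the X_demo / y_demo arrays are made up for illustration; the idea is that each feature's per-class total is compared with the total we would expect if the feature were independent of the class.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
# Synthetic, non-negative stand-in data (chi2 rejects negative values).
X_demo = rng.integers(0, 10, size=(200, 3)).astype(float)
y_demo = rng.integers(0, 2, size=200)


def chi2_by_hand(X, y):
    """Illustrative re-implementation of the chi-squared feature score."""
    classes = np.unique(y)
    # One-hot class membership, shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                       # per-class sum of each feature
    feature_total = X.sum(axis=0)            # overall sum of each feature
    class_prob = Y.mean(axis=0)              # share of samples in each class
    expected = np.outer(class_prob, feature_total)
    return ((observed - expected) ** 2 / expected).sum(axis=0)


manual = chi2_by_hand(X_demo, y_demo)
sk_scores, p_values = chi2(X_demo, y_demo)
print(manual)
print(sk_scores)

# Multiplying a feature by 10 multiplies its score by 10 as well.
scaled_scores, _ = chi2(X_demo * 10, y_demo)
print(scaled_scores / sk_scores)
```

One consequence worth keeping in mind: because the score grows with the magnitude of a feature, columns measured on larger scales can receive larger scores, so raw scores are not directly comparable across features with very different units.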

In [6]:
X = diabetes.iloc[:,0:8].values
y = diabetes.iloc[:, 8].values

Now that we have our features and output variable selected, say we want to know what the best 4 (k=4) features are based on the chi-squared test.

In [7]:
selection = SelectKBest(score_func=chi2, k=4)
fit = selection.fit(X, y)

In [8]:
feature_scores = []
for i, val in enumerate(fit.scores_):
    feature_scores.append((i, val))

In [9]:
best_features = sorted(feature_scores, key=lambda x: x[1], reverse=True)[:4]

We have to select the top 4 ourselves, as our fit object calculates the scores for all variables! However, when you use the fit object to transform the X variable directly, it will only return the top 4 feature columns (see below).

Note that "fit.scores_" presents the scores in the original column order, not sorted by score!

In [10]:
best_features

[(4, 2175.5652729220137),
 (1, 1411.8870406441411),
 (7, 181.30368904430023),
 (5, 127.669343331037)]

The above lines of code score every attribute and then keep the 4 attributes with the highest scores.

We do this step so that we are aware of which columns are the most pertinent.

We can also get these 4 columns directly by transforming X with the fit object (see final_features further below). The transformed array has only 4 columns, as we specified earlier in our code.

This is great, but it isn't as clear in discerning which columns are the most important, as you would have to compare the matrix to our original diabetes set, i.e. the first row in the array corresponds to the first row in our dataframe, with only the 4 important variables.

We might want to know the names of these columns. We can do this below!

In [11]:
for i in best_features:
    index, score = i
    print(final_reference[index])
    print()

2-Hour serum insulin (mu U/ml).

Plasma glucose concentration a 2 hours in an oral glucose tolerance test.

Age (years).

Body mass index (weight in kg/(height in m)^2).



The above lists the feature names in descending order of chi-squared score.

In [12]:
final_features = fit.transform(X)  # only our 4 columns of interest are present
print(final_features[:5])  # matches diabetes.head() for our 4 columns of interest, see the df below

[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]


In [13]:
diabetes.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
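Comparing the two outputs above, the transformed array keeps the selected columns in their original order (1, 4, 5, 7), not in score order (4, 1, 7, 5). A small sketch of how to read the kept columns straight off a fitted selector with get_support follows. It uses synthetic non-negative data rather than the diabetes file, and the names X_demo, y_demo and selector are only for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
# Synthetic, non-negative stand-in data with 6 features.
X_demo = rng.integers(0, 50, size=(300, 6)).astype(float)
y_demo = rng.integers(0, 2, size=300)

selector = SelectKBest(score_func=chi2, k=3).fit(X_demo, y_demo)

# Indices of the kept columns, in ascending (original) column order.
kept = selector.get_support(indices=True)

# The same columns, ordered from highest to lowest score.
by_score = np.argsort(selector.scores_)[::-1][:3]

reduced = selector.transform(X_demo)

print("kept (column order):", kept)
print("kept (score order): ", by_score)
print("reduced shape:", reduced.shape)
```

Indexing the original array with kept gives back exactly what transform returns, which is handy when you need to label the reduced columns.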


At this point, we can use our final_features variable moving forward to split our dataset into training and testing!


Awesome - so that was how to use a chi-squared score for feature selection. Let's move to something called Recursive Feature Elimination (RFE)!

# **RECURSIVE FEATURE ELIMINATION**

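RFE recursively removes attributes and builds a model on those attributes that remain. Very cool!

In scikit-learn, the decision to eject a feature is based on the fitted model itself: at each step the model is fit on the remaining features, and the feature with the smallest importance (taken from the model's coefficients or feature importances) is dropped. Because every refit sees the remaining features together, the ranking reflects how features work in combination rather than one at a time.

Let's use RFE with a logistic regression where we only want the top 4 features! (For consistency.) You specify how many features to keep; if you don't, scikit-learn keeps half of them.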

In [14]:
lgr = LogisticRegression()
rfe = RFE(lgr, n_features_to_select=4)  # keyword form; newer scikit-learn versions reject the positional 4
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print()
print("Selected Features: %s" % fit.support_)
print()
print("Feature Ranking: %s" % fit.ranking_)
print()

Num Features: 4

Selected Features: [ True  True False False False  True  True False]

Feature Ranking: [1 1 2 4 5 1 1 3]



From the above we can see that it selected the best 4 features and gave each of them a ranking of 1. In the Selected Features array, those columns with a 'True' correspond to a feature that was selected. In the Feature Ranking array, those columns with a 1 correspond to a feature that was selected; higher numbers mark features that were eliminated earlier.

Let's check what those columns are!

In [15]:
for i, val in enumerate(fit.ranking_):
    if val == 1:
        print(final_reference[i])
        print()

Number of times pregnant.

Plasma glucose concentration a 2 hours in an oral glucose tolerance test.

Body mass index (weight in kg/(height in m)^2).

Diabetes pedigree function.



Here we can see that RFE with logistic regression picked the above 4 columns as the most important!

Awesome!
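To see what "recursively removes attributes" means in practice, here is a minimal hand-rolled elimination loop next to scikit-learn's RFE. This is only a sketch under a few assumptions: the data is synthetic (make_classification), the importance of a feature is taken to be its squared logistic regression coefficient, and one feature is dropped per round. The names X_demo, y_demo and remaining are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 8 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Hand-rolled loop: refit, drop the feature with the smallest squared coefficient.
remaining = list(range(X_demo.shape[1]))
while len(remaining) > 4:
    model = LogisticRegression(max_iter=1000).fit(X_demo[:, remaining], y_demo)
    weakest = int(np.argmin(model.coef_[0] ** 2))
    del remaining[weakest]

# The library version, asked for the same number of features.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X_demo, y_demo)

print("manual loop kept:", remaining)
print("RFE kept:        ", list(np.flatnonzero(rfe.support_)))
print("RFE ranking:     ", rfe.ranking_)
```

Note that nothing in the loop measures accuracy; the features are judged purely by the size of their coefficients in each refit, which is also why scaling the inputs can change which features survive.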

In [16]:
v = fit.transform(X)  # we can then use these features to split our dataset into training and test!

In [17]:
v[:5]

array([[  6.000e+00,   1.480e+02,   3.360e+01,   6.270e-01],
       [  1.000e+00,   8.500e+01,   2.660e+01,   3.510e-01],
       [  8.000e+00,   1.830e+02,   2.330e+01,   6.720e-01],
       [  1.000e+00,   8.900e+01,   2.810e+01,   1.670e-01],
       [  0.000e+00,   1.370e+02,   4.310e+01,   2.288e+00]])

In [18]:
v.shape

(768, 4)

In [19]:
diabetes.shape

(768, 9)

An important thing to note at this point is that whenever we called fit.transform(X), we obtained results that were identical to our original dataset, just with the unselected columns removed. This is great! However, this will NOT be the case with PCA: its output columns are new variables built from all of the original columns, so they cannot be matched back to individual features.

Also note that the chi-squared and RFE methods have only 2 features in common (plasma glucose and body mass index)! It is interesting to observe how different methods lead to different feature selections. It is nice to have options, as when you later run your models you can see which selection gives the higher accuracy or cross-validation score.
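One way to run that comparison is sketched below: each selector is put in a Pipeline with the model, so the selection is re-fit inside every cross-validation fold and the held-out rows never influence which features are chosen. The sketch rests on assumptions: the data is synthetic and contains negative values, so f_classif stands in for chi2 as the univariate score, and the dictionary keys are just labels.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 8 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Two ways of keeping 4 features (f_classif replaces chi2 because of negative values).
selectors = {
    "univariate": SelectKBest(score_func=f_classif, k=4),
    "rfe": RFE(LogisticRegression(max_iter=1000), n_features_to_select=4),
}

results = {}
for name, selector in selectors.items():
    pipe = Pipeline([
        ("select", selector),                         # fit on the training folds only
        ("model", LogisticRegression(max_iter=1000)),
    ])
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

On the real diabetes data, the univariate step could use chi2 as in the cells above, since all of its columns are non-negative.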

Let's now move on to PCA - a form of dimensionality reduction! It is usually used when we have a very large number of features and we wish to compress them into a smaller number of components. Again, we have to choose the number of dimensions that we want to reduce to.

# **PCA**

Here we will use PCA to reduce our data to 4 components, again for consistency!

In [20]:
pca = PCA(n_components=4)
fit = pca.fit(X)

In [21]:
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print()
print(fit.components_)

Explained Variance: [ 0.889  0.062  0.026  0.013]

[[ -2.022e-03   9.781e-02   1.609e-02   6.076e-02   9.931e-01   1.401e-02
    5.372e-04  -3.565e-03]
 [ -2.265e-02  -9.722e-01  -1.419e-01   5.786e-02   9.463e-02  -4.697e-02
   -8.168e-04  -1.402e-01]
 [ -2.246e-02   1.434e-01  -9.225e-01  -3.070e-01   2.098e-02  -1.324e-01
   -6.400e-04  -1.255e-01]
 [ -4.905e-02   1.198e-01  -2.627e-01   8.844e-01  -6.555e-02   1.928e-01
    2.699e-03  -3.010e-01]]


In [22]:
fit.components_.shape

(4, 8)

In [23]:
fit.transform(X).shape

(768, 4)

In [24]:
fit.transform(X)  # we can use this data to split into train/test sets

array([[-75.715, -35.951,  -7.261,  15.669],
       [-82.358,  28.908,  -5.497,   9.005],
       [-74.631, -67.906,  19.462,  -5.653],
       ..., 
       [ 32.113,   3.377,  -1.588,  -0.878],
       [-80.214, -14.186,  12.351, -14.294],
       [-81.308,  21.621,  -8.153,  13.822]])

This concludes PCA! Note that the original column names cannot be retrieved, as each principal component is a weighted linear combination of all columns!
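The "weighted linear combination" can be checked directly: the transformed data is the mean-centred input multiplied by the rows of components_. The sketch below shows this on synthetic data whose columns sit on very different scales (an assumption made to mimic the diabetes columns), and also how standardizing first changes the explained variance. X_demo and the scale factors are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: four independent columns on very different scales.
X_demo = rng.normal(size=(200, 4)) * np.array([100.0, 10.0, 1.0, 0.1])

pca = PCA(n_components=2).fit(X_demo)
projected = pca.transform(X_demo)

# Each component is a weighted sum of ALL (mean-centred) original columns.
manual = (X_demo - pca.mean_) @ pca.components_.T
print(np.allclose(projected, manual))

# Explained variance on raw vs. standardized data.
raw_ratio = pca.explained_variance_ratio_
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X_demo)
).explained_variance_ratio_
print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

On raw data the first component mostly follows the column with the largest spread, which is consistent with the first row of components_ above being dominated by column 4 (insulin). Whether to standardize before PCA is a judgment call that depends on whether the original units are meaningful to compare.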

# **Extra Trees Classifier**

In [32]:
etc = ExtraTreesClassifier()
etc.fit(X, y)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [33]:
etc.feature_importances_

array([ 0.108,  0.261,  0.092,  0.078,  0.068,  0.145,  0.123,  0.126])

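Each value corresponds to a column in the original dataset, in the same order. The higher the score, the more important the feature. The scores sum to 1, and because no random_state was set, they will vary slightly from run to run.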
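Because extra trees are randomized, the ranking of closely scored features can change between runs. A small sketch of making the importances repeatable follows, assuming synthetic data and an arbitrary choice of 100 trees and random_state=0 (neither value comes from the cells above).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 8 features, 4 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Fixing random_state makes two fits produce the same importances.
etc_a = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
etc_b = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

importances = etc_a.feature_importances_
top4 = np.argsort(importances)[::-1][:4]  # indices of the 4 highest scores

print("importances:", importances)
print("sum:", importances.sum())
print("top 4 columns:", top4)
```

More trees generally give steadier importances, at the cost of a longer fit.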

In [34]:
feature_scores = []
for i, val in enumerate(etc.feature_importances_):
    feature_scores.append((i, val))

In [35]:
best_features = sorted(feature_scores, key=lambda x: x[1], reverse=True)[:4]

In [36]:
best_features

[(1, 0.26064746831779861),
 (5, 0.14529529703003063),
 (7, 0.12620896153185038),
 (6, 0.12285697106678732)]

As we can see above, the ETC has ranked our features. We can look at the names of the top 4 below:

In [37]:
for i in best_features:
    index, score = i
    print(final_reference[index])
    print()

Plasma glucose concentration a 2 hours in an oral glucose tolerance test.

Body mass index (weight in kg/(height in m)^2).

Age (years).

Diabetes pedigree function.
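With three methods run, a quick way to compare them is to treat each result as a set of column indices. The indices below are copied from the outputs in this notebook (chi-squared best_features, the RFE support mask, and the ETC best_features); the variable names are just for illustration.

```python
from collections import Counter

# Column indices selected by each method, copied from the outputs above.
chi2_top = {4, 1, 7, 5}
rfe_top = {0, 1, 5, 6}
etc_top = {1, 5, 7, 6}

# Features that every method agrees on.
common = chi2_top & rfe_top & etc_top
print("selected by all three:", sorted(common))

# How many methods picked each column.
votes = Counter()
for selected in (chi2_top, rfe_top, etc_top):
    votes.update(selected)
for column, count in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"column {column}: picked by {count} method(s)")
```

Columns 1 and 5 (plasma glucose and body mass index) are the only ones chosen by all three methods in this run.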



This concludes our feature selection process!