# Trial Book: Forward Feature Selection

In [1]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

%matplotlib inline

Load the data and define the column names, then separate the feature columns (X) from the class column (y).

In [10]:
iris = datasets.load_iris()
feat_labels = ['Sepal Length','Sepal Width','Petal Length','Petal Width']

X = iris.data
y = iris.target

print('First 5 rows of dataset:')
print(X[0:5])
print('\nClass col from dataset:')
print(y)

First 5 rows of dataset:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

Class col from dataset:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


We now split the data into a training set and a test set (60/40).

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
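
As a quick sanity check on the split, the sketch below repeats it in isolation and prints the resulting shapes and per-class counts. The np.bincount calls are an addition for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Same call as in the notebook: 40% of the 150 samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

print(X_train.shape, X_test.shape)  # (90, 4) (60, 4)

# Per-class sample counts; the split is not stratified, so they may be uneven
print(np.bincount(y_train), np.bincount(y_test))
```

If balanced classes in both sets matter, train_test_split also accepts stratify=y.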

Next we initialize and train the random forest classifier, then use it to determine the importance of each feature.

In [12]:
# n_estimators is the number of trees in the forest; the split criterion defaults to gini
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(X_train, y_train)

# feature_importances_ holds the gini importance of each feature
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

('Sepal Length', 0.11024282328064565)
('Sepal Width', 0.016255033655398394)
('Petal Length', 0.45028123999239533)
('Petal Width', 0.42322090307156124)


This result ranks the features by importance as follows: Petal Length (0.45), Petal Width (0.42), Sepal Length (0.11), Sepal Width (0.02). Petal Length and Petal Width are similarly important and far more important than Sepal Length and Sepal Width.
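
The ranking can also be produced directly instead of being read off the printed list. A minimal sketch, assuming a smaller forest (200 trees rather than 10000, chosen here only to keep the example fast), so the exact values will differ slightly from the ones printed above:

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

feat_labels = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)

# Assumption: 200 trees instead of the notebook's 10000, for speed only
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

importances = clf.feature_importances_

# argsort is ascending, so reverse it to list the most important feature first
order = np.argsort(importances)[::-1]
for i in order:
    print(f"{feat_labels[i]:<13} {importances[i]:.3f}")

# The importances are normalized, so they sum to 1 (up to rounding)
print(importances.sum())
```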

Now we create a selector and fit it so that it keeps only the features whose importance is at or above a chosen threshold (0.15).

In [15]:
# pass the classifier and the threshold to the selector
sfm = SelectFromModel(clf, threshold=0.15)

sfm.fit(X_train, y_train)

# get_support(indices=True) returns the indices of the selected features
# [2, 3] means columns 2 and 3, the petal columns
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])

Petal Length
Petal Width
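
The fixed threshold of 0.15 is one option. As a sketch of an alternative, SelectFromModel also accepts string thresholds such as "mean" or "median", which derive the cut-off from the importances themselves. The example assumes a smaller forest (200 trees) for speed, and sfm_mean is a name introduced here for illustration only.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

feat_labels = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)

# Assumption: 200 trees instead of the notebook's 10000, for speed only
clf = RandomForestClassifier(n_estimators=200, random_state=0)

# threshold="mean" keeps features whose importance is >= the mean importance
sfm_mean = SelectFromModel(clf, threshold="mean")
sfm_mean.fit(X_train, y_train)

# threshold_ is the numeric cut-off that was actually applied
print(sfm_mean.threshold_)

# get_support() without arguments returns a boolean mask over the columns
mask = sfm_mean.get_support()
print([label for label, keep in zip(feat_labels, mask) if keep])
```

With four features whose importances sum to 1, the mean threshold works out to 0.25. Judging by the importances printed earlier, both petal features would clear that bar as well, so the selection should match the one obtained with 0.15.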


Now that we know which features are important, we can derive reduced datasets from the original ones. We use the selector to create training and test sets that contain only the features we identified as important.

In [18]:
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)
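
To make clear what transform does here, the sketch below checks that it is equivalent to indexing the columns with the selector's boolean mask. It again assumes a smaller forest (200 trees) so that it runs quickly.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)

# Assumption: 200 trees instead of the notebook's 10000, for speed only
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0), threshold=0.15
)
sfm.fit(X_train, y_train)

X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)

# transform keeps exactly the columns flagged in the boolean support mask
mask = sfm.get_support()
print(np.array_equal(X_important_train, X_train[:, mask]))

# The number of rows is unchanged; only the number of columns shrinks
print(X_train.shape, '->', X_important_train.shape)
print(X_test.shape, '->', X_important_test.shape)
```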

Using the reduced training set, we initialize a new random forest classifier and fit it on only the important features.

In [19]:
clf_important = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf_important.fit(X_important_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10000, n_jobs=-1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

# Results

In [20]:
# RF with original/full dataset
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.9333333333333333


In [21]:
# RF with only important features
y_important_pred = clf_important.predict(X_important_test)
print(accuracy_score(y_test, y_important_pred))

0.8833333333333333


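On the 60 test samples, the classifier trained on all four features is more accurate (0.933, or 56 correct) than the one trained on only the two petal features (0.883, or 53 correct), a difference of three samples. The second classifier, which uses only half of the features, may still be accurate enough. When time or memory is constrained, it might be the preferable choice.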
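
A single 60-sample test set gives a fairly coarse comparison. One way to get a steadier estimate, sketched here as a suggestion rather than as part of the original experiment, is to cross-validate both variants. The pipeline, the 5-fold setting, and the smaller forests (100 trees) are all assumptions made for this example.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Variant 1: random forest on all four features
full = RandomForestClassifier(n_estimators=100, random_state=0)

# Variant 2: importance-based selection followed by a second forest.
# Wrapping both steps in a pipeline refits the selector inside every fold,
# so the held-out fold never influences which features are kept.
reduced = make_pipeline(
    SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold=0.15,
    ),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

scores_full = cross_val_score(full, X, y, cv=5)
scores_reduced = cross_val_score(reduced, X, y, cv=5)

print('all features:      ', scores_full.mean(), '+/-', scores_full.std())
print('important features:', scores_reduced.mean(), '+/-', scores_reduced.std())
```

If the two means lie within about one standard deviation of each other, the gap seen on the single split may not be meaningful.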