# Trial Book : Forward Feature Selection

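Forward selection is a wrapper method: it judges candidate feature subsets by training a model on them and measuring its performance. It is iterative and starts with no features in the model. In each iteration we add the feature that improves the model the most, and we stop once adding another feature no longer improves performance. The forward_feature_selection function in the next cell is a simplified version that stops after a fixed number of features, n, instead.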
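For comparison, scikit-learn (version 0.24 and later) ships a built-in forward selector, SequentialFeatureSelector. The sketch below is not the approach used in this notebook. It is a minimal illustration, and the seed, the target of two features, the 5-fold cross-validation and the micro-averaged F1 scoring are all assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)

# Assumptions for this sketch: seed 0, two features, 5-fold CV, micro-averaged F1.
selector = SequentialFeatureSelector(
    SGDClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward",
    scoring="f1_micro",
    cv=5,
)
selector.fit(X, y)

# Boolean mask over the four iris columns; True marks a selected feature.
mask = selector.get_support()
print(mask)
```

Unlike the hand-written loop, this scores each candidate with cross-validation on the data it is given rather than on a single held-out split.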

In [None]:
def forward_feature_selection(x_train, x_cv, y_train, y_cv, n):
    # start with an empty feature set
    feature_set = []
    # for the number of features that you want to select (n - based on model performance)
    for num_features in range(n):
        # entries are (metric value, feature) tuples
        metric_list = []
        # choose a model
        model = SGDClassifier()
        # for all of the features available in the original dataset
        for feature in x_train.columns:
            # if the feature hasn't already been added to the selected feature set
            if feature not in feature_set:
                # make a copy of the selected feature set
                f_set = feature_set.copy()
                # add the new feature to it
                f_set.append(feature)
                # fit the model using the selected feature set + the new feature
                model.fit(x_train[f_set], y_train)
                # evaluate the model using the chosen metric and the validation data;
                # evaluate_metric is a placeholder for whatever scoring function you use
                # add a tuple containing the result and the feature to the metric list
                metric_list.append((evaluate_metric(model, x_cv[f_set], y_cv), feature))
        # sort the metric list so the best score comes first
        metric_list.sort(key=lambda x: x[0], reverse=True)
        # add the feature with the best metric to the selected feature set
        feature_set.append(metric_list[0][1])
    return feature_set
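The function above always selects exactly n features, whereas the description at the top stops once a new feature no longer helps. A minimal sketch of that stopping rule follows. The names evaluate_metric (here: micro-averaged F1), forward_selection_early_stop, max_features and seed are hypothetical, and the split seed is an arbitrary choice made so the example is repeatable.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def evaluate_metric(model, x, y):
    # Hypothetical helper: micro-averaged F1 on held-out data.
    return f1_score(y, model.predict(x), average="micro")


def forward_selection_early_stop(x_train, x_cv, y_train, y_cv, max_features, seed=0):
    feature_set = []
    best_so_far = float("-inf")
    # Never ask for more rounds than there are columns.
    for _ in range(min(max_features, x_train.shape[1])):
        metric_list = []
        for feature in x_train.columns:
            if feature in feature_set:
                continue
            f_set = feature_set + [feature]
            model = SGDClassifier(random_state=seed)
            model.fit(x_train[f_set], y_train)
            metric_list.append((evaluate_metric(model, x_cv[f_set], y_cv), feature))
        # max() returns the first of several tied candidates.
        best_score, best_feature = max(metric_list, key=lambda item: item[0])
        # Stop as soon as the best candidate fails to beat the current score.
        if best_score <= best_so_far:
            break
        feature_set.append(best_feature)
        best_so_far = best_score
    return feature_set, best_so_far


iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
x_train, x_cv, y_train, y_cv = train_test_split(
    data, iris.target, test_size=0.2, stratify=iris.target, random_state=0
)
selected, score = forward_selection_early_stop(
    x_train, x_cv, y_train, y_cv, max_features=4
)
print(selected, score)
```

Requiring a strict improvement means ties end the search. Whether that is the right rule depends on how noisy the metric is on a small validation set.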

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

%matplotlib inline

Load the data and the column names, then separate the feature columns from the class column.

In [2]:
iris = datasets.load_iris()
feat_labels = ['Sepal Length','Sepal Width','Petal Length','Petal Width']
X = iris.data
y = iris.target

print('First 5 rows of dataset:')
print(X[0:5])
print('\nClass col from dataset:')
print(y)

First 5 rows of dataset:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

Class col from dataset:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


We now split the data into a training set and a test set (80/20), stratified by class so that each class keeps the same proportion in both parts.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
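As a quick check of what the stratified split produces, the sketch below counts the samples per class in each part. The random_state=0 argument is an assumption added here for repeatability and is not part of the cell above. With 150 samples and three balanced classes, the test set holds 30 samples, 10 per class. This is why the scores printed later are all multiples of 1/30: for single-label multiclass predictions, micro-averaged F1 equals accuracy.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# random_state=0 is an assumption added here so the check is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Count how many samples of each class landed in each part.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
# [40 40 40] [10 10 10]
```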

Now we follow the steps from the commented code above. We start by creating a list to hold the selected features and choosing how many features to select. In each round we create a model, in this case an SGD classifier, then go through every feature that hasn't been selected yet and add it, one at a time, to a copy of the selected feature set. We fit the model on this extended feature set and evaluate its performance on the test set. The extended feature set with the best performance becomes the new selected feature set.

In [20]:
feature_set = []
n = 4
for num_features in range(n):
    # entries are (metric value, feature index) tuples
    metric_list = []
    # using the SGD Classifier, like in the example
    model = SGDClassifier(random_state=1000)
    # for all of the features available in the iris dataset (sepal length, sepal width, petal length, petal width)
    for feature in range(len(feat_labels)):
        if feature not in feature_set:
            f_set = feature_set.copy()
            f_set.append(feature)

            X_train_tmp = X_train[:, f_set]  # keep only the candidate columns
            model.fit(X_train_tmp, y_train)

            X_test_tmp = X_test[:, f_set]
            metric_list.append((f1_score(y_test, model.predict(X_test_tmp), average='micro'), feature))
            print(metric_list)
    # sort the list, best score first (the sort is stable, so ties keep the lower feature index first)
    metric_list.sort(key=lambda x: x[0], reverse=True)
    print('metric_list')
    print(metric_list)
    # add the feature with the best metric to the selected feature set
    feature_set.append(metric_list[0][1])
    print('feature_set')
    print(feature_set)

[(0.3333333333333333, 0)]
[(0.3333333333333333, 0), (0.3333333333333333, 1)]
[(0.3333333333333333, 0), (0.3333333333333333, 1), (0.6666666666666666, 2)]
[(0.3333333333333333, 0), (0.3333333333333333, 1), (0.6666666666666666, 2), (0.6666666666666666, 3)]
metric_list
[(0.6666666666666666, 2), (0.6666666666666666, 3), (0.3333333333333333, 0), (0.3333333333333333, 1)]
feature_set
[2]
[(0.5666666666666667, 0)]
[(0.5666666666666667, 0), (0.6666666666666666, 1)]
[(0.5666666666666667, 0), (0.6666666666666666, 1), (0.7666666666666667, 3)]
metric_list
[(0.7666666666666667, 3), (0.6666666666666666, 1), (0.5666666666666667, 0)]
feature_set
[2, 3]
[(0.6333333333333333, 0)]
[(0.6333333333333333, 0), (0.36666666666666664, 1)]
metric_list
[(0.6333333333333333, 0), (0.36666666666666664, 1)]
feature_set
[2, 3, 0]
[(0.6666666666666666, 1)]
metric_list
[(0.6666666666666666, 1)]
feature_set
[2, 3, 0, 1]
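The final feature_set holds column indices, so it is easier to read once mapped back to the names in feat_labels. A small sketch, with the selection order copied by hand from the output above:

```python
feat_labels = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']
# Selection order copied from the final feature_set printed above.
feature_set = [2, 3, 0, 1]

selected_names = [feat_labels[i] for i in feature_set]
print(selected_names)
# ['Petal Length', 'Petal Width', 'Sepal Length', 'Sepal Width']
```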




This was giving inconsistent results from run to run. It seems to have had something to do with the SGDClassifier: because random_state defaults to None, the fitted model can be slightly different each time. The train_test_split call above is also unseeded, so the split itself changes between runs as well. I set the classifier's random_state to 1000 and it produces the result I expected, but there are other values for which it produces different results.
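One way to see how much the outcome depends on the seed is to repeat the whole procedure for several seeds and count the selection orders that come out. The sketch below does that under a few assumptions: select_features is a hypothetical wrapper around the loop above, the same seed is used for both the split and the classifier, and the range of ten seeds is arbitrary.

```python
from collections import Counter

from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def select_features(seed, n=4):
    # Seed both the split and the classifier so a given seed is repeatable.
    X, y = datasets.load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    feature_set = []
    for _ in range(n):
        metric_list = []
        for feature in range(X.shape[1]):
            if feature in feature_set:
                continue
            f_set = feature_set + [feature]
            model = SGDClassifier(random_state=seed)
            model.fit(X_train[:, f_set], y_train)
            score = f1_score(y_test, model.predict(X_test[:, f_set]), average='micro')
            metric_list.append((score, feature))
        metric_list.sort(key=lambda x: x[0], reverse=True)
        feature_set.append(metric_list[0][1])
    return tuple(feature_set)


# Hypothetical choice of ten seeds; count how often each selection order occurs.
orders = Counter(select_features(seed) for seed in range(10))
for order, count in orders.most_common():
    print(order, count)
```

If several different orders show up, the ranking is sensitive to the split and the optimizer's randomness, and a score averaged over several splits would be a steadier basis for choosing features than a single 30-sample test set.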