## Ranking and selecting features

This example demonstrates some of scikit-learn's scoring functions for ranking features by importance. We'll reuse our running example, the Adult dataset from the first exercise.

In [None]:
import pandas as pd

train_data = pd.read_csv('adult_train.csv')

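# The last column holds the label; all other columns are features.
n_cols = len(train_data.columns)
Xtrain_dicts = train_data.iloc[:, :n_cols-1].to_dict('records')
Ytrain = train_data.iloc[:, n_cols-1]

# The test set is assumed to have the same column layout as the training set.
test_data = pd.read_csv('adult_test.csv')
Xtest_dicts = test_data.iloc[:, :n_cols-1].to_dict('records')
Ytest = test_data.iloc[:, n_cols-1]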

As you might recall, the instances in this dataset consist of several features describing each individual.

In [None]:
Xtrain_dicts[0]

We first convert the training set into numerical vectors.

In [None]:
import numpy as np

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
dv.fit(Xtrain_dicts)

X_vec = dv.transform(Xtrain_dicts)

The first scoring function we'll investigate is [mutual information](https://en.wikipedia.org/wiki/Mutual_information). The scikit-learn documentation for [`mutual_info_classif`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) describes how this scoring function works.

(For the formula used to compute the mutual information score, see the [description](https://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html) in the book *Introduction to Information Retrieval* by Manning, Raghavan, and Schütze.)
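
As a rough illustration of that formula, the sketch below computes the mutual information between one discrete feature and a label by summing $p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$ over all value pairs, and compares the result with `sklearn.metrics.mutual_info_score`. The arrays and the helper `mutual_information` are made up for this example. Note that `mutual_info_classif` uses a different, nearest-neighbor-based estimator for continuous features, so this only mirrors the discrete case.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Hypothetical binary feature and binary label, for illustration only.
x = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y = np.array([1, 1, 0, 0, 0, 0, 1, 1])

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability
            if p_xy > 0:
                p_x = np.mean(x == xv)  # marginal of the feature
                p_y = np.mean(y == yv)  # marginal of the label
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

manual_mi = mutual_information(x, y)
library_mi = mutual_info_score(x, y)
print(manual_mi, library_mi)  # the two values should agree
```

A score of zero would mean that the feature and the label are statistically independent in the sample; larger values indicate a stronger dependence.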

We apply the scoring function to all the features and then print the 10 highest-scoring ones. Please refer back to the perceptron example in the previous lecture for an explanation of the step where we sort the features by importance.

In [None]:
from sklearn.feature_selection import mutual_info_classif

feature_scores = mutual_info_classif(X_vec, Ytrain)

for score, fname in sorted(zip(feature_scores, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)
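
The sorting step above pairs each score with its feature name and sorts the pairs in descending order. An equivalent way to write it, sketched here with made-up scores and placeholder names, is to use `np.argsort` to get the column indices ordered by score.

```python
import numpy as np

# Hypothetical scores and placeholder feature names, for illustration only.
toy_scores = np.array([0.02, 0.31, 0.0, 0.12])
toy_names = np.array(["f0", "f1", "f2", "f3"])

# argsort gives ascending order, so reverse it and keep the first two indices.
top_indices = np.argsort(toy_scores)[::-1][:2]

ranked = [(str(toy_names[i]), float(toy_scores[i])) for i in top_indices]
print(ranked)
# [('f1', 0.31), ('f3', 0.12)]
```

The index-based form can be convenient when the column positions are needed later, for instance to slice the selected columns out of the feature matrix.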

The second scoring function uses the $F$-statistic from an [ANOVA test](https://en.wikipedia.org/wiki/Analysis_of_variance).

The top-10 list produced by this scorer overlaps with the previous list, but the two are not identical.

In [None]:
from sklearn.feature_selection import f_classif

feature_scores = f_classif(X_vec, Ytrain)[0]

for score, fname in sorted(zip(feature_scores, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)
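
To see what the $F$-statistic measures, the following sketch compares `f_classif` with a per-column one-way ANOVA computed by `scipy.stats.f_oneway`. The small matrix is made up: the first column separates the two classes well, while the second has the same mean in both classes.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

# Hypothetical data: column 0 differs between classes, column 1 does not.
toy_X = np.array([
    [1.0, 5.0],
    [2.0, 3.0],
    [3.0, 4.0],
    [6.0, 4.5],
    [7.0, 3.5],
    [8.0, 4.0],
])
toy_y = np.array([0, 0, 0, 1, 1, 1])

# f_classif returns a pair: the F-scores and the corresponding p-values.
f_scores, p_values = f_classif(toy_X, toy_y)

# One-way ANOVA per column, grouping the values by class label.
manual_scores = [
    f_oneway(toy_X[toy_y == 0, j], toy_X[toy_y == 1, j]).statistic
    for j in range(toy_X.shape[1])
]
print(f_scores, manual_scores)  # the two sets of scores should agree
```

The score is high when the class means of a feature are far apart relative to the spread within each class, which is why the first column ranks above the second here. The `[0]` in the cell above selects the scores and discards the p-values.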

The third scoring function is based on the well-known [$\chi^2$ statistical test](https://en.wikipedia.org/wiki/Chi-squared_test).

In [None]:
from sklearn.feature_selection import chi2

feature_scores = chi2(X_vec, Ytrain)[0]

for score, fname in sorted(zip(feature_scores, dv.get_feature_names_out()), reverse=True)[:10]:
    print(fname, score)
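
The $\chi^2$ score compares, for each feature, the total feature value observed in each class with the total that would be expected if the feature were independent of the class. The sketch below recomputes the score by hand on a small made-up matrix and checks it against `chi2`. Keep in mind that `chi2` only accepts non-negative feature values, such as counts or 0/1 indicators.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical non-negative count features, for illustration only.
toy_X = np.array([
    [1, 0, 3],
    [0, 1, 1],
    [2, 0, 0],
    [0, 2, 1],
    [1, 1, 4],
    [0, 3, 0],
], dtype=float)
toy_y = np.array([1, 0, 1, 0, 1, 0])

# One-hot encode the labels: shape (n_samples, n_classes).
classes = np.unique(toy_y)
Y = (toy_y[:, None] == classes[None, :]).astype(float)

observed = Y.T @ toy_X                       # per-class feature totals
class_prob = Y.mean(axis=0)                  # class frequencies
feature_count = toy_X.sum(axis=0)            # overall feature totals
expected = np.outer(class_prob, feature_count)

manual_chi2 = ((observed - expected) ** 2 / expected).sum(axis=0)
library_chi2, p_values = chi2(toy_X, toy_y)
print(manual_chi2, library_chi2)  # the two sets of scores should agree
```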

In practice, to use feature selection in scikit-learn, we simply plug a selector into our pipeline. `SelectKBest` and `SelectPercentile` are two commonly used selectors: the first keeps the `k` highest-scoring features, the second keeps a given percentage of them. Both use a feature scoring function (such as the ones above) to rank the features; by default, the `f_classif` scoring function is used.

In [None]:
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

pipeline = make_pipeline(
    DictVectorizer(),
    SelectKBest(k=100),  # or SelectPercentile(...)
    DecisionTreeClassifier()
)
pipeline.fit(Xtrain_dicts, Ytrain)
accuracy_score(Ytest, pipeline.predict(Xtest_dicts))