## Feature Selection with scikit-learn (sklearn)

Feature selection is one of the essential steps in data science, machine learning, and data mining exercises. Effective use of feature selection techniques helps a data scientist build a better model. This note gives a brief overview of feature selection with scikit-learn (sklearn). The goal of a feature selection exercise is to find the most important and descriptive features in a given data set.

#### Find K-Best features for classification and regression
The first method we are going to explore is selecting the K best features using the SelectKBest utility in sklearn. We will use a two-class version of the famous Iris data set.

The first example we are going to look at is feature selection for classification.

In [52]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

def select_kbest_clf(data_frame, target, k=2):
    """
    Selecting K-Best features for classification
    :param data_frame: A pandas dataFrame with the training data
    :param target: target variable name in DataFrame
    :param k: desired number of features from the data
    :returns feature_scores: scores for each feature in the data as 
    pandas DataFrame
    """
    feat_selector = SelectKBest(f_classif, k=k)
    _ = feat_selector.fit(data_frame.drop(target, axis=1), data_frame[target])
    
    feat_scores = pd.DataFrame()
    feat_scores["F Score"] = feat_selector.scores_
    feat_scores["P Value"] = feat_selector.pvalues_
    feat_scores["Support"] = feat_selector.get_support()
    feat_scores["Attribute"] = data_frame.drop(target, axis=1).columns
    
    return feat_scores

iris_data = pd.read_csv("/resources/iris.csv")

kbest_feat = select_kbest_clf(iris_data, "Class", k=2)
# DataFrame.sort was removed from pandas; sort_values is the current method
kbest_feat = kbest_feat.sort_values(["F Score", "P Value"], ascending=[False, False])
kbest_feat


| | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 2 | 2498.618817 | 1.504801e-71 | True | petal-length |
| 3 | 1830.624469 | 3.230375e-65 | True | petal-width |
| 0 | 236.735022 | 6.892546e-28 | False | sepal-length |
| 1 | 41.607003 | 4.246355e-09 | False | sepal-width |


##### What just happened?

The select_kbest_clf function accepts a pandas DataFrame, the target variable name, and k as parameters. First we create a SelectKBest object with f_classif as the score function (because we are working with a classification problem). Then we fit the selector to the data. Once it is fitted, information on feature importance is available on the fitted object. The ANOVA F scores of the features are accessible through the scores_ attribute and the p-values through pvalues_. The get_support method returns a boolean mask that is True for each selected feature.
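
To make the ANOVA F score less of a black box, here is a minimal sketch that computes the one-way ANOVA F statistic by hand and compares it with f_classif. The helper name anova_f_scores and the synthetic two-class data are our own assumptions for illustration; they are not part of the iris example above.

```python
import numpy as np
from sklearn.feature_selection import f_classif


def anova_f_scores(X, y):
    """One-way ANOVA F statistic for each column of X, grouped by the labels in y.

    Hypothetical helper written for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)

    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for label in classes:
        group = X[y == label]
        group_mean = group.mean(axis=0)
        # Variation of the group means around the overall mean
        ss_between += group.shape[0] * (group_mean - grand_mean) ** 2
        # Variation of the samples around their own group mean
        ss_within += ((group - group_mean) ** 2).sum(axis=0)

    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_samples - n_classes)
    return ms_between / ms_within


# Synthetic two-class data (assumed): only column 0 differs between the classes
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 3))
X[:, 0] += 2.0 * y

manual_scores = anova_f_scores(X, y)
sklearn_scores, sklearn_pvalues = f_classif(X, y)
print(manual_scores)
print(sklearn_scores)
```

A feature gets a high F score when its class means are far apart relative to the spread inside each class, which is why the shifted column stands out here.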

Now the question is: how can I determine which features are selected? The easy way is to look at Support: the features where it is True are the selected ones. The higher the F score and the lower the p-value, the better the feature.

Let's examine the results we obtained from the iris data. The attributes 'petal-length' and 'petal-width' have the highest F scores and the lowest p-values, and their Support is True. So those features are important compared to the other features. To understand the real power of this method, you should try it on data with more dimensions.
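
If we only need the names of the selected columns, or the reduced data itself, the boolean mask from get_support can be applied directly to the DataFrame. The sketch below assumes the iris copy bundled with scikit-learn (load_iris), which has three classes and different column names than the two-class CSV used above, so the scores will not match the table.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Bundled three-class iris data (an assumption; the article reads a CSV instead)
iris = load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.Series(iris.target, name="Class")

selector = SelectKBest(f_classif, k=2).fit(features, target)

# Boolean mask with one entry per input column
mask = selector.get_support()
selected_columns = features.columns[mask].tolist()

# Keep only the selected columns, preserving their names
reduced = features.loc[:, mask]
print(selected_columns)
print(reduced.shape)
```

Indexing the DataFrame with the mask keeps the column labels, which the plain array returned by the selector's transform method would not.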

##### Next

In the next example we see how to apply this technique to a regression problem. There is not much difference in the code: we change the score function to f_regression. We will try this with the Boston house price data set.

In [53]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression


def select_kbest_reg(data_frame, target, k=5):
    """
    Selecting K-Best features for regression
    :param data_frame: A pandas dataFrame with the training data
    :param target: target variable name in DataFrame
    :param k: desired number of features from the data
    :returns feature_scores: scores for each feature in the data as 
    pandas DataFrame
    """
    feat_selector = SelectKBest(f_regression, k=k)
    _ = feat_selector.fit(data_frame.drop(target, axis=1), data_frame[target])
    
    feat_scores = pd.DataFrame()
    feat_scores["F Score"] = feat_selector.scores_
    feat_scores["P Value"] = feat_selector.pvalues_
    feat_scores["Support"] = feat_selector.get_support()
    feat_scores["Attribute"] = data_frame.drop(target, axis=1).columns
    
    return feat_scores

boston = pd.read_csv("/resources/boston.csv")

kbest_feat = select_kbest_reg(boston, "price", k=5)

# DataFrame.sort was removed from pandas; sort_values is the current method
kbest_feat = kbest_feat.sort_values(["F Score", "P Value"], ascending=[False, False])
kbest_feat


| | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 12 | 601.617871 | 5.081103e-88 | True | 12 |
| 5 | 471.84674 | 2.487229e-74 | True | 5 |
| 10 | 175.105543 | 1.609509e-34 | True | 10 |
| 2 | 153.954883 | 4.90026e-31 | True | 2 |
| 9 | 141.761357 | 5.637734e-29 | True | 9 |
| 4 | 112.59148 | 7.065042e-24 | False | 4 |
| 0 | 88.151242 | 2.08355e-19 | False | 0 |
| 8 | 85.914278 | 5.465933e-19 | False | 8 |
| 6 | 83.477459 | 1.569982e-18 | False | 6 |
| 1 | 75.257642 | 5.713584e-17 | False | 1 |
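
To see where the regression F scores come from, the sketch below recomputes them from the Pearson correlation between each feature and the target. To our understanding, f_regression with its default settings follows this relation, F = r^2 / (1 - r^2) * (n - 2). The synthetic data is an assumption for illustration and has nothing to do with the Boston file.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic regression data (assumed): the target depends on columns 0 and 2 only
rng = np.random.default_rng(42)
n_samples = 200
X = rng.normal(size=(n_samples, 4))
y = 3.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n_samples)

f_scores, p_values = f_regression(X, y)

# Pearson correlation of each column with the target
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Convert each correlation into an F statistic with (1, n - 2) degrees of freedom
manual_f = r ** 2 / (1.0 - r ** 2) * (n_samples - 2)

print(f_scores)
print(manual_f)
```

Because each feature is scored on its own, this test only captures linear, one-at-a-time relationships with the target; interactions between features are not seen.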


##### Select features according to a percentile of the highest scores.

The next technique we are going to explore is feature selection with SelectPercentile. It keeps the features whose scores fall in the given top percentile. Let's see it in action with the Boston data.

In [54]:
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_regression


def select_percentile(data_frame, target, percentile=15):
    """
    Percentile based feature selection for regression
    :param data_frame: A pandas dataFrame with the training data
    :param target: target variable name in DataFrame
    :param percentile: percent of features to keep
    :returns feature_scores: scores for each feature in the data as 
    pandas DataFrame
    """
    feat_selector = SelectPercentile(f_regression, percentile=percentile)
    _ = feat_selector.fit(data_frame.drop(target, axis=1), data_frame[target])
    
    feat_scores = pd.DataFrame()
    feat_scores["F Score"] = feat_selector.scores_
    feat_scores["P Value"] = feat_selector.pvalues_
    feat_scores["Support"] = feat_selector.get_support()
    feat_scores["Attribute"] = data_frame.drop(target, axis=1).columns
    
    return feat_scores

boston = pd.read_csv("/resources/boston.csv")

kbest_feat = select_percentile(boston, "price", percentile=50)

# DataFrame.sort was removed from pandas; sort_values is the current method
kbest_feat = kbest_feat.sort_values(["F Score", "P Value"], ascending=[False, False])
kbest_feat

| | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 12 | 601.617871 | 5.081103e-88 | True | 12 |
| 5 | 471.84674 | 2.487229e-74 | True | 5 |
| 10 | 175.105543 | 1.609509e-34 | True | 10 |
| 2 | 153.954883 | 4.90026e-31 | True | 2 |
| 9 | 141.761357 | 5.637734e-29 | True | 9 |
| 4 | 112.59148 | 7.065042e-24 | True | 4 |
| 0 | 88.151242 | 2.08355e-19 | False | 0 |
| 8 | 85.914278 | 5.465933e-19 | False | 8 |
| 6 | 83.477459 | 1.569982e-18 | False | 6 |
| 1 | 75.257642 | 5.713584e-17 | False | 1 |
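
The percentile argument translates into a number of kept features. The following sketch counts how many features survive for a few percentile values; it assumes synthetic data from make_regression with ten features rather than the Boston file.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression

# Synthetic data with 10 features (assumed, for illustration only)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0
)

kept = {}
for percentile in (20, 50, 100):
    selector = SelectPercentile(f_regression, percentile=percentile).fit(X, y)
    # Count the True entries in the support mask
    kept[percentile] = int(selector.get_support().sum())
    print(percentile, kept[percentile])
```

With ten features and distinct scores, the count is simply the matching share of the columns. When the number of features does not divide evenly, the count depends on how the percentile threshold falls between scores.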


##### Univariate feature selection
The next method we are going to explore is generic univariate feature selection with GenericUnivariateSelect. We will use the same Boston data for this example.

In [55]:
import pandas as pd
from sklearn.feature_selection import GenericUnivariateSelect, f_regression


def select_univarite(data_frame, target, mode='fdr'):
    """
    Generic univariate feature selection for regression
    :param data_frame: A pandas dataFrame with the training data
    :param target: target variable name in DataFrame
    :param mode: selection mode passed to GenericUnivariateSelect
    :returns feature_scores: scores for each feature in the data as 
    pandas DataFrame
    """
    feat_selector = GenericUnivariateSelect(f_regression, mode=mode)
    _ = feat_selector.fit(data_frame.drop(target, axis=1), data_frame[target])
    
    feat_scores = pd.DataFrame()
    feat_scores["F Score"] = feat_selector.scores_
    feat_scores["P Value"] = feat_selector.pvalues_
    feat_scores["Support"] = feat_selector.get_support()
    feat_scores["Attribute"] = data_frame.drop(target, axis=1).columns
    
    return feat_scores

boston = pd.read_csv("/resources/boston.csv")

kbest_feat = select_univarite(boston, "price", mode='fpr')

# DataFrame.sort was removed from pandas; sort_values is the current method
kbest_feat = kbest_feat.sort_values(["F Score", "P Value"], ascending=[False, False])
kbest_feat

| | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 12 | 601.617871 | 5.081103e-88 | True | 12 |
| 5 | 471.84674 | 2.487229e-74 | True | 5 |
| 10 | 175.105543 | 1.609509e-34 | True | 10 |
| 2 | 153.954883 | 4.90026e-31 | True | 2 |
| 9 | 141.761357 | 5.637734e-29 | True | 9 |
| 4 | 112.59148 | 7.065042e-24 | True | 4 |
| 0 | 88.151242 | 2.08355e-19 | True | 0 |
| 8 | 85.914278 | 5.465933e-19 | True | 8 |
| 6 | 83.477459 | 1.569982e-18 | True | 6 |
| 1 | 75.257642 | 5.713584e-17 | True | 1 |


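In this example the mode argument chooses the selection strategy applied to the univariate scores: 'fdr' selects by false discovery rate, 'fpr' by false positive rate, and 'fwe' by family-wise error rate, while 'percentile' and 'k_best' perform percentile and K-best selection. The threshold for the chosen mode is passed through the param argument. With its default value of 1e-5, every p-value shown above is below the 'fpr' threshold, so all of these features are marked as supported.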
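
One way to check what a mode does is to compare GenericUnivariateSelect with the dedicated selector for that mode. The sketch below does this for 'k_best' and 'fpr' on synthetic data; the data and the chosen param values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import (
    GenericUnivariateSelect,
    SelectFpr,
    SelectKBest,
    f_regression,
)

# Synthetic data with 10 features (assumed, for illustration only)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0
)

# mode='k_best' with param=3 should behave like SelectKBest(k=3)
generic_k = GenericUnivariateSelect(f_regression, mode="k_best", param=3).fit(X, y)
direct_k = SelectKBest(f_regression, k=3).fit(X, y)

# mode='fpr' with param=0.01 should behave like SelectFpr(alpha=0.01)
generic_fpr = GenericUnivariateSelect(f_regression, mode="fpr", param=0.01).fit(X, y)
direct_fpr = SelectFpr(f_regression, alpha=0.01).fit(X, y)

same_k = np.array_equal(generic_k.get_support(), direct_k.get_support())
same_fpr = np.array_equal(generic_fpr.get_support(), direct_fpr.get_support())
print(same_k, same_fpr)
```

The generic class is handy when the selection strategy itself is something we want to tune, since mode and param can be searched like any other hyperparameter.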

##### Family-wise error rate
The next method we are going to explore is selection based on the family-wise error rate with SelectFwe. We will use the same Boston data for this example.

In [56]:
import pandas as pd
from sklearn.feature_selection import SelectFwe, f_regression


def select_univarite(data_frame, target):
    """
    Family-wise error rate based feature selection for regression
    :param data_frame: A pandas dataFrame with the training data
    :param target: target variable name in DataFrame
    :returns feature_scores: scores for each feature in the data as 
    pandas DataFrame
    """
    feat_selector = SelectFwe(f_regression)
    _ = feat_selector.fit(data_frame.drop(target, axis=1), data_frame[target])
    
    feat_scores = pd.DataFrame()
    feat_scores["F Score"] = feat_selector.scores_
    feat_scores["P Value"] = feat_selector.pvalues_
    feat_scores["Support"] = feat_selector.get_support()
    feat_scores["Attribute"] = data_frame.drop(target, axis=1).columns
    
    return feat_scores

boston = pd.read_csv("/resources/boston.csv")

kbest_feat = select_univarite(boston, "price")

# DataFrame.sort was removed from pandas; sort_values is the current method
kbest_feat = kbest_feat.sort_values(["F Score", "P Value"], ascending=[False, False])
kbest_feat

| | F Score | P Value | Support | Attribute |
|---|---|---|---|---|
| 12 | 601.617871 | 5.081103e-88 | True | 12 |
| 5 | 471.84674 | 2.487229e-74 | True | 5 |
| 10 | 175.105543 | 1.609509e-34 | True | 10 |
| 2 | 153.954883 | 4.90026e-31 | True | 2 |
| 9 | 141.761357 | 5.637734e-29 | True | 9 |
| 4 | 112.59148 | 7.065042e-24 | True | 4 |
| 0 | 88.151242 | 2.08355e-19 | True | 0 |
| 8 | 85.914278 | 5.465933e-19 | True | 8 |
| 6 | 83.477459 | 1.569982e-18 | True | 6 |
| 1 | 75.257642 | 5.713584e-17 | True | 1 |
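
To our understanding, SelectFwe keeps a feature when its p-value is below alpha divided by the number of features, which is a Bonferroni-style correction. The sketch below checks this against the fitted selector on synthetic data; the data and the alpha value are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFwe, f_regression

# Synthetic data with 10 features (assumed, for illustration only)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0
)

alpha = 0.05
selector = SelectFwe(f_regression, alpha=alpha).fit(X, y)

# Bonferroni-style rule: compare each p-value with alpha / number of features
bonferroni_mask = selector.pvalues_ < alpha / X.shape[1]

matches = np.array_equal(selector.get_support(), bonferroni_mask)
print(selector.get_support())
print(matches)
```

Dividing alpha by the number of tests makes the rule stricter as more features are examined, so with many features fewer of them pass.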
