## Filter Methods - Fisher Score

Compute the chi-squared statistic between each non-negative feature and the class (target).

- This score should be used to evaluate categorical variables in a classification task.

It compares the observed distribution of the target classes (Y) across the categories of the feature against the distribution that would be expected if the target classes were independent of the feature categories. I explained this in more detail in the introductory lecture of this section.
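
To make the observed-versus-expected comparison concrete, here is a minimal sketch of the textbook contingency-table calculation. The counts are hypothetical and are not taken from the dataset used in this notebook. Note also that `sklearn.feature_selection.chi2` works on the feature values directly, so its numbers need not match this hand calculation.

```python
import numpy as np

# Hypothetical contingency table (illustrative counts only):
# rows = categories of one feature, columns = classes of the target
observed = np.array([[30, 10],
                     [10, 30]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # samples per feature category
col_totals = observed.sum(axis=0, keepdims=True)  # samples per target class
n = observed.sum()

# counts expected if the target were independent of the feature
expected = row_totals @ col_totals / n

# chi-squared statistic: sum over cells of (observed - expected)^2 / expected
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# degrees of freedom: (number of categories - 1) * (number of classes - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(chi2_stat, dof)
```

The larger the statistic for a given number of degrees of freedom, the further the observed counts are from what independence would predict.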


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn import preprocessing

In [None]:
# load dataset and features from previous method
features = np.load('../features/featuresFromMIClassif.npy').tolist()
data = pd.read_pickle('../../data/features/features.pkl').loc[:,features]


data.shape

### encode categorical variables

In [None]:
# select categorical variables in the dataset and encode them into numbers
catvars = ['var1', 'var2', 'var3',..., 'varn']

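# for each categorical variable, fit a label encoder and replace
# the categories with integer codes (0, 1, 2, ...)
for catvar in catvars:
    le = preprocessing.LabelEncoder()
    # assign the whole column so it takes the integer dtype of the codes
    data[catvar] = le.fit_transform(data[catvar])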


### split train - test

In [None]:
# In all feature selection procedures, it is good practice to select the
# features by examining only the training set. This avoids overfitting.

# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data[catvars],
    data['target'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

### calculate chi2

In [None]:
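# calculate the chi2 p-value between each of the variables
# and the target
# chi2 returns 2 arrays: the first contains the chi2 statistics, which are
# then evaluated against the chi2 distribution to obtain the p-values;
# the p-values are in the second array, see below
# note: chi2 requires non-negative input, so a negative fill value such as
# -9999 would raise an error; any remaining missing values get their own
# non-negative code instead (column maximum + 1)

f_score = chi2(X_train.fillna(X_train.max() + 1), y_train)
f_score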

In [None]:
# let's add the variable names and sort the p-values for clearer visualisation

pvalues = pd.Series(f_score[1])
pvalues.index = X_train.columns
pvalues.sort_values(ascending=False)

Keep in mind that, contrary to MI, where we were interested in the higher MI values, for the Fisher score the smaller the p-value, the more significant the feature is for predicting the target.

**Note**
One thing to keep in mind when using the Fisher score or other univariate selection methods is that in very big datasets most of the features will show a small p-value, and therefore look like they are highly predictive. This is in fact an effect of the sample size: with enough samples, even a very weak association becomes statistically significant. So care should be taken when selecting features using these procedures. An ultra-tiny p-value does not necessarily highlight an ultra-important feature; it may simply reflect the very large number of samples in the dataset.
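
The sample-size effect can be illustrated with a small sketch. The two tables below are hypothetical: they have exactly the same class proportions, but the second has 100 times more samples. The helper names are made up for this illustration, and the p-value shortcut is valid only for one degree of freedom (a 2 x 2 table).

```python
import math
import numpy as np

def chi2_statistic(observed):
    """Contingency-table chi-squared statistic (hypothetical helper)."""
    observed = np.asarray(observed, dtype=float)
    expected = (observed.sum(axis=1, keepdims=True)
                @ observed.sum(axis=0, keepdims=True)) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())

def p_value_1dof(stat):
    """Upper-tail p-value of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# same weak association (55% vs 45%), different sample sizes
small = np.array([[22, 18],
                  [18, 22]])
large = small * 100

stat_small = chi2_statistic(small)
stat_large = chi2_statistic(large)
p_small = p_value_1dof(stat_small)
p_large = p_value_1dof(stat_large)

print(stat_small, p_small)
print(stat_large, p_large)
```

The strength of the association is identical in both tables, yet the statistic grows in proportion to the number of samples, so the p-value of the larger table is many orders of magnitude smaller.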

### save features

In [None]:
# how many variables would you like to keep from the previous Fisher analysis?
NCATVAR = 10

In [None]:
features_to_keep = pvalues.sort_values(ascending=True).index.tolist()[:NCATVAR]

In [None]:
np.save('../features/featuresFromFisherScore.npy', features_to_keep)
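
As a possible alternative to ranking the p-values by hand, scikit-learn's `SelectKBest` can keep the top-scoring features directly. The sketch below uses synthetic data (the column names `informative`, `noise_a`, `noise_b`, `noise_c` and the value of `K` are made up for illustration) and is not the procedure used in the cells above. Because `chi2` uses the same number of degrees of freedom for every feature, keeping the `K` highest scores should match keeping the `K` smallest p-values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# synthetic, already label-encoded data (illustration only)
rng = np.random.default_rng(0)
n_samples = 500
y = rng.integers(0, 2, size=n_samples)

# 'informative' agrees with the target about 90% of the time
agree = rng.random(n_samples) < 0.9
X = pd.DataFrame({
    'informative': np.where(agree, y, 1 - y),
    'noise_a': rng.integers(0, 4, size=n_samples),
    'noise_b': rng.integers(0, 4, size=n_samples),
    'noise_c': rng.integers(0, 4, size=n_samples),
})

K = 2  # hypothetical number of features to keep

# fit the selector on the (training) data and read off the kept columns
selector = SelectKBest(score_func=chi2, k=K).fit(X, y)
selected = X.columns[selector.get_support()].tolist()

# same selection done by sorting the p-values, as in the cells above
pvalues = pd.Series(selector.pvalues_, index=X.columns)
by_pvalue = pvalues.sort_values(ascending=True).index.tolist()[:K]

print(selected)
print(by_pvalue)
```

As with the manual ranking, the selector should be fitted on the training set only.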