### Fisher Score: Chi-Square Test for Feature Selection

Compute the chi-squared statistic between each non-negative feature and the class label.

**Note:** This score should be used to evaluate categorical variables in a classification task.

This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.
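As a minimal sketch of that selection step, scikit-learn's `SelectKBest` can be paired with `chi2` to keep the `k` highest-scoring features. The arrays `X_toy` and `y_toy` below are hypothetical toy data invented for illustration, not the Titanic data used later.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: 6 samples, 3 non-negative count features.
X_toy = np.array([
    [5, 1, 0],
    [4, 0, 1],
    [6, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
])
y_toy = np.array([1, 1, 1, 0, 0, 0])

# Keep only the single feature with the highest chi-squared statistic.
selector = SelectKBest(score_func=chi2, k=1)
X_selected = selector.fit_transform(X_toy, y_toy)

# Column indices of the features that were kept.
kept = selector.get_support(indices=True)
print(kept, X_selected.shape)
```

Here the first column separates the two classes most clearly, so it is the one that survives the selection.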

Recall that the chi-square test measures dependence between stochastic variables, so using this function “weeds out” the features that are the most likely to be independent of class and therefore irrelevant for classification. **The Chi Square statistic is commonly used for testing relationships between categorical variables.**

It compares the observed distribution of the different classes of target Y among the different categories of the feature, against the expected distribution of the target classes, regardless of the feature categories.
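The observed-versus-expected comparison can be sketched in plain Python for a small contingency table. The counts in `toy_counts` are hypothetical, and `chi_square_statistic` is an illustrative helper, not a library function.

```python
def chi_square_statistic(observed):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if feature and class were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are feature categories, columns are target classes.
toy_counts = [[20, 80],
              [70, 30]]
stat = chi_square_statistic(toy_counts)
print(stat)
```

Note that scikit-learn's `chi2` treats the feature values themselves as counts rather than building a full contingency table, so its statistics need not match a table-based calculation like this one.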

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
df = sns.load_dataset('titanic')
df.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB


In [4]:
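# Keep the categorical features and the target; copy to avoid SettingWithCopyWarning later
df = df[['sex', 'embarked', 'alone', 'pclass', 'survived']].copy()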
df.head()

,sex,embarked,alone,pclass,survived
0,male,S,False,3,0
1,female,C,False,1,1
2,female,S,True,3,1
3,female,S,False,1,1
4,male,S,True,3,0


In [5]:
# Label-encode the sex column: male -> 1, female -> 0
df['sex'] = np.where(df['sex'] == "male", 1, 0)
df.head()

,sex,embarked,alone,pclass,survived
0,1,S,False,3,0
1,0,C,False,1,1
2,0,S,True,3,1
3,0,S,False,1,1
4,1,S,True,3,0


In [6]:
# Label-encode embarked in order of first appearance (S -> 0, C -> 1, ...)
ordinal_label = {k: i for i, k in enumerate(df['embarked'].unique())}
df['embarked'] = df['embarked'].map(ordinal_label)

In [7]:
df.head()

,sex,embarked,alone,pclass,survived
0,1,0,False,3,0
1,0,1,False,1,1
2,0,0,True,3,1
3,0,0,False,1,1
4,1,0,True,3,0


In [8]:
# Label-encode the alone column: True -> 1, False -> 0
df['alone'] = np.where(df['alone'], 1, 0)

In [9]:
df.head()

,sex,embarked,alone,pclass,survived
0,1,0,0,3,0
1,0,1,0,1,1
2,0,0,1,3,1
3,0,0,0,1,1
4,1,0,1,3,0


In [11]:
X = df.drop('survived', axis=1)
y = df['survived']

In [13]:
# Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [14]:
X_train.head()

,sex,embarked,alone,pclass
445,1,0,0,1
650,1,0,1,3
172,0,0,0,3
450,1,0,0,2
314,1,0,0,2


In [15]:
# Perform the chi2 test on the training data only
# chi2 returns two arrays, one entry per feature:
# the chi-squared statistics and the p-values
from sklearn.feature_selection import chi2
f_p_values = chi2(X_train, y_train)

In [16]:
f_p_values

(array([60.41964418, 11.363384  ,  8.35778435, 17.40807208]),
 array([7.66442567e-15, 7.49062219e-04, 3.84038546e-03, 3.01542662e-05]))

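The first array holds the chi-squared statistics and the second holds the p-values. The higher the chi-squared statistic, and the lower the p-value, the stronger the evidence that the feature depends on the target, and so the more useful the feature is likely to be for the model.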
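The two arrays carry the same information in different forms. As a sketch, assuming one degree of freedom (two target classes minus one), each p-value is the upper-tail probability of the corresponding statistic, which for one degree of freedom reduces to a complementary error function. The helper `chi2_p_value_df1` is hypothetical and only covers this one-degree-of-freedom case.

```python
import math


def chi2_p_value_df1(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Statistics copied from the chi2 output above (sex, embarked, alone, pclass).
statistics = [60.41964418, 11.363384, 8.35778435, 17.40807208]
p_from_stats = [chi2_p_value_df1(s) for s in statistics]

for s, p in zip(statistics, p_from_stats):
    print(f"chi2 = {s:9.4f} -> p = {p:.3e}")
```

Because the mapping is monotonic, ranking features by descending statistic or by ascending p-value gives the same order here.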

In [17]:
p_values = pd.Series(f_p_values[1])
p_values.index = X_train.columns
p_values

sex         7.664426e-15
embarked    7.490622e-04
alone       3.840385e-03
pclass      3.015427e-05
dtype: float64

In [18]:
# Sort by p-value so the most significant feature comes first
p_values.sort_values(ascending=True)

sex         7.664426e-15
pclass      3.015427e-05
embarked    7.490622e-04
alone       3.840385e-03
dtype: float64

#### Observation
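The `sex` column has the lowest p-value (and the highest chi-squared statistic), so it shows the strongest dependence on the target `survived`. It is followed by `pclass`, `embarked`, and `alone`.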
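One possible way to turn the p-values into a selection rule is to keep every feature below a significance threshold. The sketch below uses the p-values reported above; the threshold `alpha = 0.05` is an assumed, conventional choice and is not taken from the analysis itself.

```python
# p-values copied from the chi2 output above.
p_vals = {
    "sex": 7.66442567e-15,
    "embarked": 7.49062219e-04,
    "alone": 3.84038546e-03,
    "pclass": 3.01542662e-05,
}

alpha = 0.05  # assumed significance threshold

# Rank features from most to least significant, then apply the threshold.
ranked = sorted(p_vals, key=p_vals.get)
selected = [name for name in ranked if p_vals[name] < alpha]
print(ranked)
print(selected)
```

With this threshold all four features pass, so the ranking is more informative here than the cut-off.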
