# Chi-Square Test for Feature Selection

* The chi-squared test computes a statistic between each non-negative feature and the class.
* This score should be used to evaluate categorical variables in a classification task.

* This score can be used to select the n_features features with the highest values for the test chi-squared statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

* The Chi Square statistic is commonly used for testing relationships between categorical variables.
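
As a rough illustration of what this statistic measures, the sketch below computes it by hand for a single non-negative feature. The per-class sum of the feature (observed) is compared with the sum expected if the feature and the class were independent. It is intended to mirror the calculation behind scikit-learn's `chi2`, but treat it as a sketch: the helper name `chi2_score` and the toy data are made up for this example.

```python
def chi2_score(feature, labels):
    """Chi-squared statistic of one non-negative feature against class labels.

    Hypothetical helper for illustration; assumes the feature is not all zeros.
    """
    n_samples = len(labels)
    feature_total = sum(feature)
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: how much of the feature falls into this class.
        observed = sum(x for x, y in zip(feature, labels) if y == cls)
        # Expected under independence: class frequency times the feature total.
        class_prob = sum(1 for y in labels if y == cls) / n_samples
        expected = class_prob * feature_total
        score += (observed - expected) ** 2 / expected
    return score


# Toy data (made up): the first feature follows the label, the second does not.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
related = [1, 1, 1, 0, 0, 0, 1, 0]
unrelated = [1, 0, 1, 0, 1, 0, 1, 0]

related_score = chi2_score(related, labels)
unrelated_score = chi2_score(unrelated, labels)
print(related_score, unrelated_score)
```

A feature that is spread evenly across the classes scores 0, and the score grows as the feature concentrates in one class.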

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = sns.load_dataset('titanic')

In [3]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [5]:
# keep only these columns: ['sex', 'embarked', 'alone', 'pclass', 'survived']
# .copy() avoids pandas' SettingWithCopyWarning when the columns are re-encoded below
df = df[['sex','embarked','alone','pclass','survived']].copy()
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,male,S,False,3,0
1,female,C,False,1,1
2,female,S,True,3,1
3,female,S,False,1,1
4,male,S,True,3,0


In [6]:
# label encoding on the sex column, i.e. replacing male with 1 and female with 0
df['sex'] = np.where(df['sex']=="male", 1, 0)
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,S,False,3,0
1,0,C,False,1,1
2,0,S,True,3,1
3,0,S,False,1,1
4,1,S,True,3,0


In [7]:
# label encoding on embarked column
ordinal_label = {k: i for i, k in enumerate(df['embarked'].unique(), 0)}
df['embarked'] = df['embarked'].map(ordinal_label)
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,False,3,0
1,0,1,False,1,1
2,0,0,True,3,1
3,0,0,False,1,1
4,1,0,True,3,0
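
The dictionary above is built from the order in which values first appear in the column, so the codes can change if the rows are reordered. A minimal sketch of an alternative follows, assuming a fixed, hand-written mapping and a dedicated code for missing ports. The names `port_codes` and `UNKNOWN_PORT` and the toy series are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy column with one missing value (made-up data, not the Titanic frame).
ports = pd.Series(["S", "C", None, "Q", "S"], name="embarked")

# Hypothetical fixed mapping: the codes no longer depend on row order.
port_codes = {"S": 0, "C": 1, "Q": 2}
UNKNOWN_PORT = len(port_codes)  # extra non-negative code for missing values

# Values absent from the mapping become NaN, which is then filled explicitly.
encoded = ports.map(port_codes).fillna(UNKNOWN_PORT).astype(int)
print(encoded.tolist())
```

Keeping every code non-negative matters here because `chi2` only accepts non-negative features.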


In [8]:
# label encoding on the alone column: True becomes 1 and False becomes 0
df['alone'] = np.where(df['alone'], 1, 0)
df.head()

Unnamed: 0,sex,embarked,alone,pclass,survived
0,1,0,0,3,0
1,0,1,0,1,1
2,0,0,1,3,1
3,0,0,0,1,1
4,1,0,1,3,0


In [9]:
X = df[['sex','embarked','alone','pclass']]
y = df['survived']

In [10]:
# train/test split is usually done to avoid overfitting;
# the chi-squared scores below are computed on the training set only
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

In [11]:
X_train.isnull().sum()

sex         0
embarked    0
alone       0
pclass      0
dtype: int64

# Applying the Chi-square test

In [12]:
from sklearn.feature_selection import chi2
f_p_values = chi2(X_train,y_train)
f_p_values

(array([65.67929505,  7.55053653, 10.88471585, 21.97994154]),
 array([5.30603805e-16, 5.99922095e-03, 9.69610546e-04, 2.75514881e-06]))

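The tuple above holds two arrays: the first contains the chi-squared statistics (stored as f scores in the code below) and the second contains the corresponding p-values.

* The higher the chi-squared score, the more important the feature is.
* The lower the p-value, the more important the feature is.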
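
The two arrays carry the same information in different forms. With a binary target the test has one degree of freedom, and under that assumption the p-value can be recovered from the score with the complementary error function. The sketch below applies this to the scores printed above; the helper name `chi2_p_value_1dof` is made up, and the formula holds only for one degree of freedom.

```python
import math


def chi2_p_value_1dof(score):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom.

    Hypothetical helper: a chi-squared variable with 1 dof is a squared
    standard normal, so P(X > score) = erfc(sqrt(score / 2)).
    """
    return math.erfc(math.sqrt(score / 2.0))


# Scores copied from the chi2 output above.
scores = {
    "sex": 65.67929505,
    "embarked": 7.55053653,
    "alone": 10.88471585,
    "pclass": 21.97994154,
}

p_by_feature = {name: chi2_p_value_1dof(s) for name, s in scores.items()}

# Smallest p-value first, which is the same order as largest score first.
ranked = sorted(p_by_feature, key=p_by_feature.get)
print(ranked)
```

Because the p-value falls as the score rises, ranking features by ascending p-value and by descending score gives the same order.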

In [13]:
# mapping the chi-squared scores (named f_scores here) to the column names
f_scores = pd.Series(f_p_values[0])
f_scores.index = X_train.columns
f_scores

sex         65.679295
embarked     7.550537
alone       10.884716
pclass      21.979942
dtype: float64

In [14]:
# sorting the values in descending order
f_scores.sort_values(ascending = False)

sex         65.679295
pclass      21.979942
alone       10.884716
embarked     7.550537
dtype: float64

In [15]:
# mapping the p values to the column names
p_values = pd.Series(f_p_values[1])
p_values.index = X_train.columns
p_values

sex         5.306038e-16
embarked    5.999221e-03
alone       9.696105e-04
pclass      2.755149e-06
dtype: float64

In [16]:
# sorting the p-values in ascending order (smallest, i.e. most significant, first)
p_values.sort_values(ascending = True)

sex         5.306038e-16
pclass      2.755149e-06
alone       9.696105e-04
embarked    5.999221e-03
dtype: float64

The sex column is the most important feature with respect to the target, survived: it has the highest chi-squared score and the lowest p-value.

## We can select the top features manually or by using the SelectKBest class

In [17]:
# here we select the single most important feature (k = 1)
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

sel_one_cols = SelectKBest(chi2, k = 1)
sel_one_cols.fit(X_train, y_train)
X_train.columns[sel_one_cols.get_support()]

Index(['sex'], dtype='object')
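
Once fitted, the selector can also drop the unselected columns through its `transform` method, and the same call applies to both the training and the test features. A minimal sketch on made-up data follows; the arrays `X_toy` and `y_toy` are hypothetical and stand in for `X_train` and `y_train`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Made-up data: 50 samples of each class.
y_toy = np.array([0] * 50 + [1] * 50)
informative = y_toy.copy()    # follows the label exactly
noise = np.tile([0, 1], 50)   # alternates regardless of the label
X_toy = np.column_stack([noise, informative])

# Fit on the training data only, then reuse the fitted selector elsewhere.
selector = SelectKBest(chi2, k=1)
selector.fit(X_toy, y_toy)

# transform keeps only the columns flagged by get_support().
X_reduced = selector.transform(X_toy)
print(selector.get_support(), X_reduced.shape)
```

Fitting on the training split and only transforming the test split keeps information from the test data out of the feature selection step.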

In [18]:
# To put the scores in a DataFrame, use the code below

dfscores = pd.DataFrame(sel_one_cols.scores_)
dfcolumns = pd.DataFrame(X_train.columns)

#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Column name','Score']  #naming the dataframe columns

In [19]:
featureScores

Unnamed: 0,Column name,Score
0,sex,65.679295
1,embarked,7.550537
2,alone,10.884716
3,pclass,21.979942


To learn more about the chi-squared test for feature selection, see the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html