## Chi-square

The chi-squared test computes a **chi-squared statistic** between each **non-negative feature and the class label**. Use this score to **evaluate categorical variables in a classification task**.
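To make the statistic concrete, here is a minimal sketch that recomputes it by hand on a tiny made-up dataset and compares the result with scikit-learn's `chi2`. The toy arrays `X` and `y` are assumptions for illustration only. The manual steps (per-class feature sums as observed counts, class frequency times feature total as expected counts) reflect my understanding of what `chi2` does internally.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data (made up for illustration): 2 non-negative features, binary target.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

classes = np.unique(y)

# Observed: sum of each feature within each class (n_classes x n_features).
observed = np.array([X[y == c].sum(axis=0) for c in classes])

# Expected under independence: class frequency times the feature's overall total.
class_prob = np.array([(y == c).mean() for c in classes])
expected = np.outer(class_prob, X.sum(axis=0))

# Chi-squared statistic per feature: sum over classes of (O - E)^2 / E.
manual_stat = ((observed - expected) ** 2 / expected).sum(axis=0)

sk_stat, sk_pvalues = chi2(X, y)
print(manual_stat)
print(sk_stat)
```

A larger statistic means the feature's per-class totals deviate more from what independence would predict.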

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

**Load dataset!**

In [2]:
data = pd.read_csv('titanic.csv')
data.shape

(1309, 14)

In [3]:
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22,S,,,"Montreal, PQ / Chesterville, ON"


**First encode the labels of the categories into numbers!**

In [4]:
# Encode sex as binary: male -> 1, female -> 0
data['sex'] = np.where(data['sex'] == 'male', 1, 0)
# Encode embarked as integer codes, numbered in order of first appearance
ordinal_label = {k: i for i, k in enumerate(data['embarked'].unique())}
data['embarked'] = data['embarked'].map(ordinal_label)
# pclass is already ordinal
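If `embarked` contains missing values, it can be safer to give them an explicit category before encoding, so that every code is a known non-negative integer, as chi2 requires. The following is a sketch on a toy Series standing in for `data['embarked']`. The `'Missing'` label and the variable names are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy column standing in for data['embarked']; the values are illustrative.
embarked = pd.Series(['S', 'C', np.nan, 'Q', 'S'])

# Give missing values their own explicit category before encoding.
embarked_filled = embarked.fillna('Missing')

# Same dictionary-based encoding as above: codes follow order of first appearance.
mapping = {k: i for i, k in enumerate(embarked_filled.unique())}
encoded = embarked_filled.map(mapping)

print(mapping)
print(encoded.tolist())
```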

**Split the data into training and test sets first, and compute the feature scores on the training set only, to avoid overfitting!**

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['pclass', 'sex', 'embarked']],
    data['survived'],
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((916, 3), (393, 3))

**Calculate the chi2 p-value between each of the variables and the target. chi2 returns 2 arrays: first the chi-squared statistics (not F-scores, despite the variable name below), second the p-values!**

In [6]:
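# chi2 cannot handle NaN, so fill any remaining missing values with 0
# f_score is a tuple: (chi-squared statistics, p-values)
f_score = chi2(X_train.fillna(0), y_train)
f_score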

(array([27.18283095, 95.93492132,  8.51621324]),
 array([1.85095118e-07, 1.18722647e-22, 3.51996172e-03]))
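As a sanity check on the two arrays above, the p-values can be recovered from the statistics with the chi-squared survival function. This sketch assumes one degree of freedom (number of classes minus one, for the binary `survived` target), which is my understanding of how scikit-learn's `chi2` derives its p-values. The names `stats` and `chi2_dist` are illustrative.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

# Chi-squared statistics copied from the output above (pclass, sex, embarked).
stats = np.array([27.18283095, 95.93492132, 8.51621324])

# Assumption: degrees of freedom = n_classes - 1 = 1 for a binary target.
n_classes = 2
pvalues = chi2_dist.sf(stats, df=n_classes - 1)
print(pvalues)
```

Because every feature shares the same degrees of freedom, a larger statistic always corresponds to a smaller p-value, so both arrays rank the features identically.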

**Capture the p-values from the second array! Add the variable names! Order the variables by p-value, smallest first!**

In [7]:
pvalues = pd.Series(f_score[1], index=X_train.columns)
pvalues.sort_values(ascending=True)

sex         1.187226e-22
pclass      1.850951e-07
embarked    3.519962e-03
dtype: float64

With **mutual information (MI)**, we keep the features with the **highest MI values**. With **chi2**, we keep the features with the **smallest p-values**, which indicate the strongest evidence of an association with the target. Here, **sex has the smallest p-value**, so it is the **most important feature** of the three.

In this demo, we used chi2 to determine the predictive value of only 3 categorical variables. If the dataset contained many categorical variables, we could combine this procedure with **SelectKBest** or **SelectPercentile**, as we did in the previous notebook, to select the top k features, or the features in the top n percentile, based on their chi2 scores.

Let's select the top 1 feature for the demo:

**Use SelectKBest or SelectPercentile! This is especially preferable when the data has many categorical variables.**

In [8]:
sel_ = SelectKBest(chi2, k=1).fit(X_train, y_train)
X_train.columns[sel_.get_support()]  # display features

Index(['sex'], dtype='object')

**Remove the rest of the features:**

In [9]:
X_train = sel_.transform(X_train)
X_test = sel_.transform(X_test)
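
The same pattern works with **SelectPercentile**. Below is a minimal sketch on synthetic data rather than the Titanic set. The column names, the random seed, and the 25% cutoff are all assumptions chosen for illustration. One column is a copy of the target, so it should be the only one kept.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, chi2

# Synthetic data (illustrative only): one informative column, three noise columns.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    'informative': y,                        # copy of the target
    'noise_a': rng.integers(0, 3, size=n),   # unrelated to the target
    'noise_b': rng.integers(0, 3, size=n),
    'noise_c': rng.integers(0, 2, size=n),
})

# Keep the top 25% of features by chi2 score (1 of 4 here).
sel = SelectPercentile(chi2, percentile=25).fit(X, y)
selected = list(X.columns[sel.get_support()])
print(selected)

# transform drops the unselected columns.
X_reduced = sel.transform(X)
print(X_reduced.shape)
```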