# Chi-Squared Test for Feature Selection

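A higher Chi-Square statistic is stronger evidence that a feature and the label are not independent, so features with larger statistics (and smaller p-values) are better candidates to keep.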

Steps: 
- Define the null and alternative hypotheses
- Build a contingency table
- Find the expected values
- Calculate the Chi-Square statistic
- Reject or fail to reject the null hypothesis
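
The steps above can be sketched by hand for a single table. This is a minimal illustration, not the notebook's pipeline: the `observed` counts and the `alpha = 0.05` threshold are made-up assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 contingency table (rows: feature levels, columns: label classes).
observed = np.array([[30, 10, 20],
                     [20, 30, 40]])

# Expected count per cell under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value from the upper tail of the chi-square distribution.
p_value = stats.chi2.sf(chi2_stat, dof)

# Reject the null hypothesis of independence when p falls below the chosen threshold.
alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(chi2_stat, dof, p_value, reject_null)
```

For tables larger than 2x2, `stats.chi2_contingency(observed)` should return the same statistic, p-value, and expected counts in one call.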

Assumptions:
- The observations are independent
- No expected cell count is 0
- No more than 20% of the cells have an expected cell count below 5
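
The two expected-count assumptions can be checked before trusting a test result. The sketch below assumes a hypothetical helper named `check_expected_counts`; it relies on `scipy.stats.contingency.expected_freq` to compute the expected table.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def check_expected_counts(observed, min_count=5, max_fraction=0.20):
    """Hypothetical helper: return (assumptions_ok, fraction_of_small_cells)."""
    expected = expected_freq(np.asarray(observed))
    # Assumption: no expected cell count is 0.
    no_zero_cells = bool((expected > 0).all())
    # Assumption: at most 20% of cells have an expected count below 5.
    fraction_small = float((expected < min_count).mean())
    return no_zero_cells and fraction_small <= max_fraction, fraction_small

# Made-up tables for illustration.
ok_large, frac_large = check_expected_counts([[30, 10, 20], [20, 30, 40]])
ok_small, frac_small = check_expected_counts([[12, 3], [8, 2]])
print(ok_large, frac_large)
print(ok_small, frac_small)
```

The independence of observations cannot be checked from the table itself; it depends on how the data were collected.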

Current goal: build a contingency table for every feature and compute the Chi-Square statistic for each feature against the label (or, more generally, for every pair of features).
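
One possible shape for that goal is a helper that loops over the columns and collects the results in a table. This is a sketch under assumptions: `chi2_against_label` is a hypothetical name, the columns are assumed to be categorical already, and the `toy` DataFrame is synthetic data for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

def chi2_against_label(df, label_col):
    """Hypothetical helper: chi-square test of every column of df against label_col."""
    rows = []
    for col in df.columns:
        if col == label_col:
            continue
        # One contingency table per feature: feature levels by label classes.
        table = pd.crosstab(df[col], df[label_col])
        chi2_stat, p_value, dof, _ = stats.chi2_contingency(table)
        rows.append({"feature": col, "chi2": chi2_stat, "p_value": p_value, "dof": dof})
    # Largest statistic first, so the most label-dependent features are at the top.
    return pd.DataFrame(rows).sort_values("chi2", ascending=False).reset_index(drop=True)

# Synthetic data: one column copies the label, one is unrelated noise.
rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=500)
toy = pd.DataFrame({
    "informative": label,
    "noise": rng.integers(0, 2, size=500),
    "label": label,
})
summary = chi2_against_label(toy, "label")
print(summary)
```

Returning a DataFrame rather than printing inside the loop makes it easier to sort, filter by p-value, or pick the top features afterwards.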

In [103]:
############ IMPORTS: SELECT FEATURES MOST ASSOCIATED WITH LABEL ############
import pandas as pd 
from sklearn.datasets import load_iris 
from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2 
import scipy.stats as stats
import numpy as np

In [105]:
#This is what I implemented --Fabi
#load dataset
data = np.loadtxt("uniform_large_d_1.tex")

# np.loadtxt already returns a NumPy array, so this only makes a copy
array = np.array(data)

# Converting to Pandas DataFrame
df_table = pd.DataFrame(array)

# Displaying the table
print(df_table)

          0         1         2         3         4         5         6    \
0    2.014037  2.842330  2.093059  2.314322  2.550290  2.556514  2.063987   
1    2.655125  2.439494  2.387897  2.414520  2.677007  2.066587  2.221681   
2    2.397686  2.129261  2.228847  2.574741  2.672454  2.330393  2.379493   
3    0.021023  0.884131  0.570157  0.950007  0.570792  0.741419  0.251829   
4    0.087550  0.596086  0.355909  0.447322  0.680048  0.198563  0.192330   
..        ...       ...       ...       ...       ...       ...       ...   
495  0.611156  0.236036  0.896368  0.773777  0.538057  0.402998  0.090796   
496  2.761173  2.080949  2.939479  2.325925  2.977614  2.109083  2.517269   
497  0.401104  0.340544  0.555580  0.230778  0.600226  0.992868  0.274078   
498  0.248207  0.096274  0.516660  0.946114  0.271408  0.845261  0.546188   
499  2.647101  2.363681  2.077603  2.632778  2.676110  2.920187  2.866320   

          7         8         9    ...       141       142       143  \
0  

In [107]:
#Chi-square dependent with the label column or dependent for all features (with each other)

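# subsetting to the last 15 columns, which include the label
# (.copy() avoids chained-assignment warnings on the slice)
df_table = df_table.iloc[:, 136:151].copy()

# truncate floats to ints, then cast to category
# note: only the first five columns of the subset are converted, so only
# these five reach df_chi and df_chi.iloc[:, -1] is the fifth of them
for col in df_table.columns[:5]:
    df_table[col] = df_table[col].astype(int).astype("category")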

# subset to select only categorical variables
df_chi = df_table.select_dtypes(include = ["category"])

In [None]:
#Ace's demo code
#df_chi = df_table.round(0)

In [90]:
#Chi-square dependent with the label column 
# Number of features, excluding label
var_count = len(df_chi.columns)-1

# Collect one test result per feature
out = []

for i in range(0, var_count):

    # Create contingency table
    crosstab = pd.crosstab(df_chi.iloc[:, i], df_chi.iloc[:, -1])

    # Passing contingency table into chi-squared test
    result = stats.chi2_contingency(crosstab)
    out.append(result)
    print(result)

Chi2ContingencyResult(statistic=496.008, pvalue=7.023536136418314e-110, dof=1, expected_freq=array([[125., 125.],
       [125., 125.]]))
Chi2ContingencyResult(statistic=496.008, pvalue=7.023536136418314e-110, dof=1, expected_freq=array([[125., 125.],
       [125., 125.]]))
Chi2ContingencyResult(statistic=496.008, pvalue=7.023536136418314e-110, dof=1, expected_freq=array([[125., 125.],
       [125., 125.]]))
Chi2ContingencyResult(statistic=496.008, pvalue=7.023536136418314e-110, dof=1, expected_freq=array([[125., 125.],
       [125., 125.]]))
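
One reading of the identical statistics above, offered as an assumption rather than a verified diagnosis: the expected counts of 125 per cell imply two levels of 250 rows each on both axes, and a perfectly aligned 2x2 table of that shape produces exactly this value once Yates' continuity correction (which `chi2_contingency` applies by default when `dof == 1`) is taken into account. The table below is hypothetical, built to match that description.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: 500 rows, two levels of 250 each, perfectly aligned.
observed = np.array([[250, 0],
                     [0, 250]])

# Default call: Yates' continuity correction is applied because dof == 1.
corrected, _, _, _ = stats.chi2_contingency(observed)

# Same table without the correction.
uncorrected, _, _, _ = stats.chi2_contingency(observed, correction=False)

print(corrected, uncorrected)
```

If this reading is right, every tested column carries the same two-level split as the column it is compared with, so the test cannot rank the features against each other.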


In [109]:
# Chi-square test of association between every pair of categorical features
############### CHI-SQUARE TEST FOR ALL FEATURES V. ALL FEATURES ########################

# Number of features, excluding label
var_count = len(df_chi.columns)-1

for j in range(0, var_count):

    for i in range(0, var_count):
   
        # Create contingency table
        crosstab = pd.crosstab(df_chi.iloc[:, i], df_chi.iloc[:,j])
   
        # Passing contingency table into chi-squared test
        chi, p, dof, exp = stats.chi2_contingency(crosstab)
        print("V", i, "V", j)
        print("Chi-squared:", chi)
        print("p-value:", p)
        print(" ")


V 0 V 0
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 1 V 0
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 2 V 0
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 3 V 0
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 0 V 1
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 1 V 1
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 2 V 1
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 3 V 1
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 0 V 2
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 1 V 2
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 2 V 2
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 3 V 2
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 0 V 3
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 1 V 3
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 2 V 3
Chi-squared: 496.008
p-value: 7.023536136418314e-110
 
V 3 V 3
Chi-squared: 496.008
p-value: 7.023536136418314