# Chi-Squared Test for Feature Selection

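A larger chi-squared statistic is stronger evidence against the null hypothesis that the feature and the label are independent, so features with higher values are better candidates for selection.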

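Steps:
- Define the hypotheses (the null hypothesis is that the feature and the label are independent)
- Build a contingency table of observed counts
- Find the expected counts under independence
- Calculate the chi-squared statistic
- Reject or fail to reject the null hypothesis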
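
The steps above can be sketched by hand for a single feature. This is a minimal illustration, not the notebook's own data: the 2x2 table of counts, the variable names, and the 0.05 significance level are all assumptions chosen for the example. The result is cross-checked against `scipy.stats.chi2_contingency` with the continuity correction turned off so the two numbers are comparable.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts: rows = feature value, columns = label value
observed = np.array([[30, 10],
                     [10, 50]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected = row_totals @ col_totals / n

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom and p-value from the chi-squared distribution
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

# Cross-check with SciPy (correction=False disables Yates' continuity correction)
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)

alpha = 0.05  # assumed significance level
reject_null = p_manual < alpha
print(chi2_manual, p_manual, reject_null)
```

A p-value below the chosen level leads to rejecting the null hypothesis of independence; otherwise we fail to reject it.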

Assumptions:
- The observations are independent
- No expected cell count is 0
- No more than 20% of the cells have an expected count below 5
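
The two assumptions about expected counts can be checked before running the test. The helper below is a hypothetical sketch (the name `check_expected_counts` and the example tables are not from the notebook); it relies on `scipy.stats.contingency.expected_freq` to compute the expected counts.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def check_expected_counts(observed):
    """Return (any_zero, share_below_5) for a table of observed counts.

    any_zero      -- True if any expected cell count is 0
    share_below_5 -- fraction of cells whose expected count is below 5
    """
    expected = expected_freq(np.asarray(observed))
    any_zero = bool((expected == 0).any())
    share_below_5 = float((expected < 5).mean())
    return any_zero, share_below_5

# Hypothetical tables: one with large expected counts, one that is sparse
dense_result = check_expected_counts([[30, 10], [10, 50]])
sparse_result = check_expected_counts([[8, 2], [1, 9]])
print(dense_result, sparse_result)
```

A table passes when `any_zero` is `False` and `share_below_5` is at most 0.2.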

Current goal: build a contingency table for every feature and compute the chi-squared statistic for each feature against the label, and then for every pair of features.

In [44]:
import pandas as pd 
from sklearn.datasets import load_iris 
from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2 
import scipy.stats as stats
import numpy as np
from itertools import combinations

In [52]:
#This is what I implemented --Fabi
#load dataset
data = np.loadtxt("uniform_large_d_1.tex")

# np.loadtxt already returns a NumPy array, so no extra conversion is needed

# Converting to Pandas DataFrame
df_table = pd.DataFrame(data)

# Displaying the table
print(df_table)

          0         1         2         3         4         5         6    \
0    2.014037  2.842330  2.093059  2.314322  2.550290  2.556514  2.063987   
1    2.655125  2.439494  2.387897  2.414520  2.677007  2.066587  2.221681   
2    2.397686  2.129261  2.228847  2.574741  2.672454  2.330393  2.379493   
3    0.021023  0.884131  0.570157  0.950007  0.570792  0.741419  0.251829   
4    0.087550  0.596086  0.355909  0.447322  0.680048  0.198563  0.192330   
..        ...       ...       ...       ...       ...       ...       ...   
495  0.611156  0.236036  0.896368  0.773777  0.538057  0.402998  0.090796   
496  2.761173  2.080949  2.939479  2.325925  2.977614  2.109083  2.517269   
497  0.401104  0.340544  0.555580  0.230778  0.600226  0.992868  0.274078   
498  0.248207  0.096274  0.516660  0.946114  0.271408  0.845261  0.546188   
499  2.647101  2.363681  2.077603  2.632778  2.676110  2.920187  2.866320   

          7         8         9    ...       141       142       143  \
0  

In [54]:
####################### Creating small data set #############################
# Goal: test each feature against the label column, and each feature against every other feature

# Subset to the last 16 columns: 15 features (135-149) plus the label (column 150)
df_table = df_table.iloc[:, 135:151].copy()

# Truncate the first 5 features to ints, then convert them to categories
for i in range(0, 5):
    df_table.iloc[:, i] = df_table.iloc[:, i].astype(int)
    df_table.iloc[:, i] = df_table.iloc[:, i].astype("category")

# Turn the label (column 150) into a categorical column
df_table.iloc[:, 15] = df_table.iloc[:, 15].astype('category')

# Subset of only the CATEGORICAL variables + LABEL
# Note: column 150 is the label itself, so it appears twice below,
# once as "150" and once as "label"
df_categorical = df_table.select_dtypes(include=['category']).copy()
df_categorical['label'] = df_table.iloc[:, 15]
df_categorical



     135  136  137  138  139  150  label
0    2.0  2.0  2.0  2.0  2.0  1.0    1.0
1    2.0  2.0  2.0  2.0  2.0  1.0    1.0
2    2.0  2.0  2.0  2.0  2.0  1.0    1.0
3    0.0  0.0  0.0  0.0  0.0  0.0    0.0
4    0.0  0.0  0.0  0.0  0.0  0.0    0.0
..   ...  ...  ...  ...  ...  ...    ...
495  0.0  0.0  0.0  0.0  0.0  0.0    0.0
496  2.0  2.0  2.0  2.0  2.0  1.0    1.0
497  0.0  0.0  0.0  0.0  0.0  0.0    0.0
498  0.0  0.0  0.0  0.0  0.0  0.0    0.0


In [60]:
############### CHI-SQUARE TEST FOR LABEL V. ALL FEATURES #####################
# Number of features, excluding the label (last column)
var_count = len(df_categorical.columns) - 1

# Collect one result row per feature
results = []

for i in range(0, var_count):

    # Create contingency table of feature i v. label
    crosstab = pd.crosstab(df_categorical.iloc[:, i], df_categorical.iloc[:, -1])

    # Compute the chi-squared statistic and p-value in a single call
    # (named chi2_stat so it does not shadow sklearn's chi2 imported above)
    chi2_stat, p, _, _ = stats.chi2_contingency(crosstab)

    # Append results to the list
    results.append({
        "Feature": df_categorical.columns[i],
        "Chi Squared Statistic": chi2_stat,
        "P-Value": p})

# Create a DataFrame from the results
results_df = pd.DataFrame(results)

# Print the DataFrame
print("Label:", df_categorical.columns.values[-1])
print(results_df.to_string(index=False))

Label: label
 Feature  Chi Squared Statistic       P-Value
     135                496.008 7.023536e-110
     136                496.008 7.023536e-110
     137                496.008 7.023536e-110
     138                496.008 7.023536e-110
     139                496.008 7.023536e-110
     150                496.008 7.023536e-110
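
Every row above shows the same statistic, 496.008. One explanation consistent with the output is that each of these columns separates the two label classes perfectly, and that `chi2_contingency` applies Yates' continuity correction by default whenever the table has one degree of freedom (a 2x2 table). The sketch below uses a hypothetical table, 500 rows split 250/250 with the feature matching the label exactly; the balanced split is an assumption and has not been checked against the data file.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: perfect association, balanced classes, 500 rows
perfect = np.array([[250, 0],
                    [0, 250]])

# Default behaviour: Yates' continuity correction is applied because dof == 1
chi2_yates, p_yates, dof, _ = stats.chi2_contingency(perfect)

# Same table without the correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(perfect, correction=False)

print(round(chi2_yates, 3))  # 496.008
print(round(chi2_plain, 3))  # 500.0
```

Without the correction the statistic for a perfectly associated 2x2 table equals the number of rows, so identical values across features say that all of them are equally (and completely) informative here, not that they can be ranked.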


In [62]:
############### CHI-SQUARE TEST FOR ALL FEATURES V. ALL FEATURES ########################
# Chi-squared test of association for every pair of categorical columns (label included)
data = df_categorical
# Extract variable names
variable_names = list(data.columns)

# Initialize matrices to store chi-squared and p-values
# (diagonal entries are never computed and stay at 0)
num_variables = len(variable_names)
chi_squared = np.zeros((num_variables, num_variables))
p_values = np.zeros((num_variables, num_variables))

# Compute chi-squared and p-values for each pair of variables
for i, j in combinations(range(num_variables), 2):
    contingency_table = pd.crosstab(data.iloc[:, i], data.iloc[:, j])

    # Compute the chi-squared statistic and p-value in a single call
    # (named chi2_stat so it does not shadow sklearn's chi2 imported above)
    chi2_stat, p, _, _ = stats.chi2_contingency(contingency_table)

    # Assign results to both symmetric positions in each matrix
    chi_squared[i, j] = chi2_stat
    chi_squared[j, i] = chi2_stat
    p_values[i, j] = p
    p_values[j, i] = p

# Create a DataFrame with variable names as index and columns
chi_squared_df = pd.DataFrame(chi_squared, index=variable_names, columns=variable_names)
p_values_df = pd.DataFrame(p_values, index=variable_names, columns=variable_names)

# Printing the matrix-like output with variable names
print("Chi-Squared Values:")
print(chi_squared_df)
print("\nP-Values:")
print(p_values_df)



Chi-Squared Values:
           135      136      137      138      139      150    label
135      0.000  496.008  496.008  496.008  496.008  496.008  496.008
136    496.008    0.000  496.008  496.008  496.008  496.008  496.008
137    496.008  496.008    0.000  496.008  496.008  496.008  496.008
138    496.008  496.008  496.008    0.000  496.008  496.008  496.008
139    496.008  496.008  496.008  496.008    0.000  496.008  496.008
150    496.008  496.008  496.008  496.008  496.008    0.000  496.008
label  496.008  496.008  496.008  496.008  496.008  496.008    0.000

P-Values:
                 135            136            137            138  \
135     0.000000e+00  7.023536e-110  7.023536e-110  7.023536e-110   
136    7.023536e-110   0.000000e+00  7.023536e-110  7.023536e-110   
137    7.023536e-110  7.023536e-110   0.000000e+00  7.023536e-110   
138    7.023536e-110  7.023536e-110  7.023536e-110   0.000000e+00   
139    7.023536e-110  7.023536e-110  7.023536e-110  7.023536e-110   
150