## CHI-SQUARE TEST OF INDEPENDENCE

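The chi-square test of independence evaluates whether two categorical variables are associated. It compares the observed cell counts of a contingency table with the counts that would be expected if the variables were independent.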

The $\chi^2$ test of independence makes the following assumptions:

* The two samples are independent
* No expected cell count is 0
* No more than 20% of the cells have an expected cell count < 5
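
The two expected-count assumptions above can be checked programmatically. The following is a minimal sketch, assuming the region by age-category counts that appear in the crosstab later in this example are typed in by hand; the variable names (`observed`, `any_zero`, `share_below_5`) are illustrative only.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts, copied from the region-by-agecat crosstab in this example
observed = np.array([
    [46, 83, 37],    # NE
    [162, 92, 30],   # N Cntrl
    [139, 68, 43],   # South
    [160, 73, 23],   # West
])

# Expected counts under independence: row total * column total / grand total
expected = expected_freq(observed)

# Assumption: no expected cell count is 0
any_zero = bool((expected == 0).any())

# Assumption: no more than 20% of cells have an expected count below 5
share_below_5 = float((expected < 5).mean())

print("Any expected count equal to 0:", any_zero)
print("Share of cells with expected count < 5:", share_below_5)
print("Assumptions met:", (not any_zero) and share_below_5 <= 0.20)
```

Independence of the samples cannot be verified from the table itself; it depends on how the data were collected.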

Hypotheses

* **$H_0$: Variables are independent**
* **$H_A$: Variables are dependent**

Test statistic

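$\chi^2 = \sum_{i,j} \frac{(O_{i,j} - \hat{E}_{i,j})^2}{\hat{E}_{i,j}}$

where $O_{i,j}$ is the observed count in row $i$ and column $j$, and $\hat{E}_{i,j}$ is the count expected under $H_0$, computed as (row $i$ total $\times$ column $j$ total) divided by the grand total.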
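
To make the formula concrete, here is a short sketch that evaluates it by hand with NumPy. It assumes the counts from the region by age-category crosstab used in this example, entered manually; it is meant as an illustration of the arithmetic, not a replacement for a library routine.

```python
import numpy as np

# Observed counts O_ij, copied from the crosstab in this example
observed = np.array([
    [46, 83, 37],    # NE
    [162, 92, 30],   # N Cntrl
    [139, 68, 43],   # South
    [160, 73, 23],   # West
], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)  # shape (4, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()

# Expected counts E_ij under independence (broadcasts to shape (4, 3))
expected = row_totals * col_totals / n

# Sum of (O - E)^2 / E over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"chi-square = {chi2_stat:.4f}, dof = {dof}")
```

The value should agree with the statistic that `stats.chi2_contingency` reports for the same table.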
 
Reject the null hypothesis, $H_0$, if the calculated $\chi^2$ test statistic exceeds the critical $\chi^2$ value for the given degrees of freedom and $\alpha$ level. The degrees of freedom are $(r-1)(c-1)$, where $r$ is the number of rows and $c$ is the number of columns in the contingency table.

The critical $\chi^2$ value is looked up using the calculated degrees of freedom and the chosen $\alpha$ level; statistical software typically does this for the user and reports a p-value.
Before deciding whether to reject or fail to reject $H_0$, check the assumptions.
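
The table look-up can also be done in code. The sketch below assumes $\alpha = 0.05$ (the text does not fix a value) and a 4 x 3 table such as the region by age-category crosstab in this example; the test statistic is copied from the `chi2_contingency` output shown there.

```python
from scipy import stats

alpha = 0.05            # assumed significance level
n_rows, n_cols = 4, 3   # regions by age categories in this example

# Degrees of freedom: (r - 1)(c - 1)
dof = (n_rows - 1) * (n_cols - 1)

# Critical value: the (1 - alpha) quantile of the chi-square distribution
critical_value = float(stats.chi2.ppf(1 - alpha, dof))

# Test statistic copied from the chi2_contingency output in this example
chi2_stat = 61.28767688406036

reject_h0 = chi2_stat > critical_value

print(f"dof = {dof}, critical value = {critical_value:.3f}")
print("Reject H0:", reject_h0)
```

Comparing the statistic with the critical value gives the same decision as comparing the p-value with $\alpha$.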

The data used in this example come from Stata: 1980 U.S. census data for 956 cities.

In [1]:
import pandas as pd
import researchpy as rp
import scipy.stats as stats

# To load a sample dataset for this demonstration
import statsmodels.api as sm

df = sm.datasets.webuse("citytemp2")

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 956 entries, 0 to 955
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   division  956 non-null    category
 1   region    956 non-null    category
 2   heatdd    953 non-null    float64 
 3   cooldd    953 non-null    float64 
 4   tempjan   954 non-null    float32 
 5   tempjuly  954 non-null    float32 
 6   agecat    956 non-null    category
dtypes: category(3), float32(2), float64(2)
memory usage: 26.0 KB


In [4]:
# rp.summary_cat(df[["agecat", "region"]])

In [5]:
crosstab = pd.crosstab(df["region"], df["agecat"])
crosstab

agecat   19-29  30-34  35+
region
NE          46     83   37
N Cntrl    162     92   30
South      139     68   43
West       160     73   23


In [6]:
stats.chi2_contingency(crosstab)

Chi2ContingencyResult(statistic=61.28767688406036, pvalue=2.463382670201326e-11, dof=6, expected_freq=array([[ 88.03556485,  54.87029289,  23.09414226],
       [150.61506276,  93.87447699,  39.51046025],
       [132.58368201,  82.63598326,  34.78033473],
       [135.76569038,  84.61924686,  35.61506276]]))
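
The returned result can be unpacked and compared against a significance level. This is a minimal sketch, assuming $\alpha = 0.05$ and re-entering the crosstab counts by hand so that the snippet runs on its own; in the notebook, `crosstab` could be passed directly instead of `observed`.

```python
import numpy as np
from scipy import stats

# Observed counts, copied from the crosstab above
observed = np.array([
    [46, 83, 37],    # NE
    [162, 92, 30],   # N Cntrl
    [139, 68, 43],   # South
    [160, 73, 23],   # West
])

# The result unpacks as (statistic, p-value, degrees of freedom, expected counts)
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05  # assumed significance level

if p_value < alpha:
    decision = "reject H0: region and age category appear to be dependent"
else:
    decision = "fail to reject H0: no evidence against independence"

print(f"chi-square({dof}) = {chi2_stat:.2f}, p = {p_value:.3g}")
print(decision)
```

With 6 degrees of freedom and a p-value far below 0.05, the test points to an association between region and age category in these data.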

In [7]:
crosstab['30-34']

region
NE         83
N Cntrl    92
South      68
West       73
Name: 30-34, dtype: int64

## Resources

* https://www.pythonfordatascience.org/chi-square-test-of-independence-python/
* https://stats.stackexchange.com/questions/2391/what-is-the-relationship-between-a-chi-squared-test-and-test-of-equal-proportion