**The Chi-Square Independence Test**

The $\chi^2$ **independence test** is an inferential method to decide whether an association exists between two variables. As in other hypothesis tests, the null hypothesis states that the two variables are not associated. In contrast, the alternative hypothesis states that the two variables are associated.

Recall that statistically **dependent variables** are called **associated variables**. In contrast, non-associated variables are called statistically independent variables. Further recall the concept of **contingency tables** (also known as two-way tables, cross-tabulation tables or cross tabs), which display the frequency distributions of bivariate data.

**$\chi^2$ Independence Test**

The basic idea behind the $\chi^2$ **independence test** is to compare the **observed frequencies** in a contingency table with the **expected frequencies**, given the null hypothesis of non-association is true. The expected frequency for each cell of a contingency table is given by

$$E = \frac{R \times C}{n},$$

where $R$ is the row total, $C$ is the column total, and $n$ is the sample size.

Let us construct an example for a better understanding. We consider an exit poll in the form of a contingency table that displays the age of $n=1189$ people in the categories 18-29, 30-44, 45-64 and 65 and older, and their political affiliation, which is "Conservative", "Socialist" or "Other". This table corresponds to the observed frequencies.

**Observed frequencies:**

\begin{array}{|l|c|c|c|c|}
\hline
\ & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\
\hline
\ \text{18-29}  & 141 & 68 & 4 & 213\\
\ \text{30-44}  & 179 & 159 & 7 & 345\\
\ \text{45-64} & 220 & 216 & 4 & 440\\
\ \text{65 & older} & 86 & 101 & 4 & 191\\
\hline 
\  \text{Total} & 626 & 544 & 19 & 1189\\
\hline 
\end{array}

**Expected frequencies:**

\begin{array}{|l|c|c|c|c|}
\hline
\ & \text{Conservative} & \text{Socialist} & \text{Other} & \text{Total} \\
\hline
\ \text{18-29}  & \frac{213 \times 626 }{1189} \approx 112.14 & \frac{213 \times 544 }{1189} \approx97.45 & \frac{213 \times 19 }{1189} \approx3.4 & 213\\
\ \text{30-44}  & \frac{345 \times 626 }{1189} \approx181.64 &\frac{345 \times 544 }{1189} \approx 157.85 & \frac{345 \times 19 }{1189} \approx5.51 & 345\\
\ \text{45-64} & \frac{440 \times 626 }{1189} \approx231.66 & \frac{440 \times 544 }{1189} \approx201.31 &\frac{440 \times 19 }{1189} \approx 7.03 & 440\\
\ \text{65 & older} &\frac{191 \times 626 }{1189} \approx 100.56 &\frac{191 \times 544 }{1189} \approx 87.39 & \frac{191 \times 19 }{1189} \approx3.05 & 191\\
\hline 
\ \text{Total} & 626 & 544 & 19 & 1189\\
\hline 
\end{array}
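The expected-frequency table can be reproduced in a few lines. The following is a minimal sketch using NumPy; the variable names (`observed`, `row_totals`, `col_totals`, `expected`) are chosen for illustration and are not part of the original example.

```python
import numpy as np

# Observed frequencies from the exit poll
# (rows: age groups, columns: Conservative, Socialist, Other)
observed = np.array([
    [141,  68, 4],   # 18-29
    [179, 159, 7],   # 30-44
    [220, 216, 4],   # 45-64
    [ 86, 101, 4],   # 65 & older
])

row_totals = observed.sum(axis=1)   # R for each row
col_totals = observed.sum(axis=0)   # C for each column
n = observed.sum()                  # sample size

# E = R * C / n for every cell, via the outer product of the marginal totals
expected = np.outer(row_totals, col_totals) / n

print(np.round(expected, 2))
```

Using the outer product of the row and column totals avoids writing out the twelve fractions by hand and works for a table of any size.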



Once we know the expected frequencies, we have to check two assumptions. First, all expected frequencies must be 1 or greater, and second, at most 20% of the expected frequencies may be less than 5. Looking at the table, the smallest expected frequency is about 3.05, and only two of the twelve cells (about 17%) are less than 5, so both assumptions are fulfilled.
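The two assumptions can also be checked programmatically. This is a small sketch under the same setup as the exit poll example; the names `all_at_least_one`, `share_below_five` and `assumptions_ok` are illustrative.

```python
import numpy as np

# Observed frequencies from the exit poll
observed = np.array([
    [141,  68, 4],
    [179, 159, 7],
    [220, 216, 4],
    [ 86, 101, 4],
])

# Expected frequencies under the null hypothesis: E = R * C / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Assumption 1: every expected frequency is 1 or greater
all_at_least_one = bool((expected >= 1).all())

# Assumption 2: at most 20% of the expected frequencies are less than 5
share_below_five = float((expected < 5).mean())

assumptions_ok = all_at_least_one and share_below_five <= 0.20
print(all_at_least_one, round(share_below_five, 3), assumptions_ok)
```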

The actual comparison is based on the $\chi^2$ test statistic, which is computed from the observed and the expected frequencies. If the null hypothesis is true, the test statistic approximately follows a $\chi^2$ distribution. It is given by

$$\chi^2 = \sum{\frac{(O-E)^2}{E}},$$

where $O$ represents the observed frequency and $E$ represents the expected frequency. Please note that $\frac{(O-E)^2}{E}$ is evaluated for each cell and then summed up.

The number of degrees of freedom is given by

$$df=(r-1) \times (c-1),$$

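where $r$ and $c$ are the numbers of possible values of the two variables under consideration, that is, the number of rows and the number of columns of the contingency table (not counting the totals).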

Applied to the example from above, this leads to a somewhat long expression, which for the sake of brevity is only written out for the first and last rows of the contingency tables.

$$\chi^2 = \frac{(141 - 112.14)^2}{112.14} + \frac{(68 - 97.45)^2}{97.45} + \frac{(4 - 3.4)^2}{3.4}  + \dots + \frac{(86 - 100.56)^2}{100.56} + \frac{(101 - 87.39)^2}{87.39} + \frac{(4 - 3.05)^2}{3.05}$$
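Rather than writing out all twelve terms, the sum $\sum (O-E)^2/E$ and the degrees of freedom can be evaluated numerically. The following sketch assumes the exit poll table from above; the variable names are illustrative.

```python
import numpy as np

# Observed frequencies from the exit poll
observed = np.array([
    [141,  68, 4],
    [179, 159, 7],
    [220, 216, 4],
    [ 86, 101, 4],
])

# Expected frequencies: E = R * C / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# (O - E)^2 / E is evaluated cell by cell and then summed up
chi2_statistic = ((observed - expected) ** 2 / expected).sum()

# df = (r - 1) * (c - 1), with r rows and c columns
r, c = observed.shape
df = (r - 1) * (c - 1)

print(round(chi2_statistic, 2), df)
```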

If the null hypothesis is true, the observed and expected frequencies are roughly equal, resulting in a small value of the $\chi^2$ test statistic, which is consistent with $H_0$. If, however, the value of the $\chi^2$ test statistic is large, the data provides evidence against $H_0$. In the next sections we further discuss how to assess the value of the $\chi^2$ test statistic in the framework of hypothesis testing.
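To judge whether a value of the test statistic is "large", it can be compared with the $\chi^2$ distribution with the appropriate degrees of freedom. The sketch below does this for the exit poll table using `scipy.stats.chi2`. The significance level of 0.05 is an assumption made here for illustration, since the text does not fix one for the exit poll example.

```python
import numpy as np
from scipy.stats import chi2

# Observed frequencies from the exit poll
observed = np.array([
    [141,  68, 4],
    [179, 159, 7],
    [220, 216, 4],
    [ 86, 101, 4],
])

expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
chi2_statistic = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)

alpha = 0.05  # assumed significance level (not given in the text for this example)

# p-value: probability of a value at least this large under H0 (upper tail)
p_value = chi2.sf(chi2_statistic, df)

# Critical value: statistics above this threshold lead to rejection at level alpha
critical_value = chi2.ppf(1 - alpha, df)

print(p_value, critical_value, chi2_statistic > critical_value)
```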

**$\chi^2$ Independence Test: An example**

In order to get some hands-on experience we apply the $\chi^2$ independence test in an exercise. To do so, we load the students data set.

In [1]:
import pandas as pd

students_df = pd.read_csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")

In this exercise we want to examine **if there is an association between the variables *gender* and *major*, or in other words, we want to know if male students favor different study subjects compared to female students.**

**Data preparation**

We start with the data preparation. We do not want to deal with the whole data set of $8239$ entries, thus we randomly select $865$ students from the data set. The first step of data preparation is to display our data of interest in the form of a contingency table.

In [2]:
sample_df = students_df.sample(n=865)

In [3]:
data = sample_df.groupby('major').gender.value_counts()

In [4]:
data

major                       gender
Biology                     Female     88
                            Male       65
Economics and Finance       Male       97
                            Female     45
Environmental Sciences      Male      100
                            Female     72
Mathematics and Statistics  Male      106
                            Female     32
Political Science           Female     95
                            Male       57
Social Sciences             Female     80
                            Male       28
Name: gender, dtype: int64

In [9]:
# Build the contingency table manually from the counts printed above

table = {'Major': ['Biology', 'Economics and Finance', 'Environmental Sciences',
                   'Mathematics and Statistics','Political Science','Social Sciences'],
        'Female': [88,45,72,32,95,80], 'Male': [65,97,100,106,57,28]}

table_df = pd.DataFrame(table)

table_df.set_index('Major', inplace=True)

table_df


                            Female  Male
Major
Biology                         88    65
Economics and Finance           45    97
Environmental Sciences          72   100
Mathematics and Statistics      32   106
Political Science               95    57
Social Sciences                 80    28
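Instead of typing the counts by hand, a contingency table can be derived directly from the raw data with `pd.crosstab`. The sketch below demonstrates the call on a tiny made-up data frame (`toy_df` and its values are hypothetical and only serve as illustration); the column names `major` and `gender` are those of the students data set.

```python
import pandas as pd

# Tiny made-up sample (hypothetical values, for illustration only)
toy_df = pd.DataFrame({
    "major":  ["Biology", "Biology", "Biology", "Social Sciences", "Social Sciences"],
    "gender": ["Female", "Female", "Male", "Female", "Male"],
})

# Rows: categories of 'major', columns: categories of 'gender', cells: counts
toy_table = pd.crosstab(toy_df["major"], toy_df["gender"])

print(toy_table)
```

Applied to the sample drawn earlier, the corresponding call would be `pd.crosstab(sample_df["major"], sample_df["gender"])`. Because the sample is drawn at random, the counts differ from run to run unless a `random_state` is passed to `sample`.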


**Hypothesis testing**

In order to conduct the $\chi^2$ **independence test** we follow the same step-wise implementation procedure for hypothesis testing as discussed in the previous sections:

\begin{array}{ll}
\hline
\ \text{Step 1}  & \text{State the null hypothesis } H_0 \text{ and alternative hypothesis } H_A \text{.}\\
\ \text{Step 2}  & \text{Decide on the significance level, } \alpha\text{.} \\
\ \text{Step 3}  & \text{Compute the value of the test statistic.} \\
\ \text{Step 4} &\text{Determine the p-value.} \\
\ \text{Step 5} & \text{If } p \le \alpha \text{, reject }H_0 \text{; otherwise, do not reject } H_0 \text{.} \\
\ \text{Step 6} &\text{Interpret the result of the hypothesis test.} \\
\hline 
\end{array}

**Step 1: State the null hypothesis $H_0$ and alternative hypothesis $H_A$**

The null hypothesis states that there is no association between the gender and the major study subject of students.

$$H_0 : \text{ No association between gender and major study subject}$$

$$H_A: \text{ There is an association between gender and major study subject}$$

**Step 2: Decide on the significance level, $\alpha$**

$$\alpha = 0.05$$

In [6]:
alpha=0.05

**Steps 3 and 4: Compute the value of the test statistic and the $p$-value.**


In [11]:
# test statistic

import numpy as np

from scipy.stats import chi2_contingency

obs = np.array([[88, 65],[45,97],[72,100],[32,106],[95,57],[80,28]])

chi2_contingency(obs)

(99.55642172734031,
 6.554511834745349e-20,
 5,
 array([[72.87398844, 80.12601156],
        [67.63468208, 74.36531792],
        [81.92369942, 90.07630058],
        [65.72947977, 72.27052023],
        [72.39768786, 79.60231214],
        [51.44046243, 56.55953757]]))

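The output of `chi2_contingency` contains, in this order, the value of the test statistic ($\chi^2 \approx 99.56$), the $p$-value, the degrees of freedom ($df = (6-1) \times (2-1) = 5$), and the array of expected frequencies.

**Step 5: If $p \le \alpha$, reject $H_0$; otherwise, do not reject $H_0$.**

The $p$-value is less than the specified significance level of $0.05$; we reject $H_0$. The test results are statistically significant at the $5$% level and provide very strong evidence against the null hypothesis.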
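The decision rule can also be expressed in code. The following sketch unpacks the four values returned by `chi2_contingency` and compares the $p$-value with $\alpha$; the names `statistic`, `p_value`, `dof`, `expected` and `reject_h0` are chosen here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

alpha = 0.05

# Observed counts (rows: majors, columns: Female, Male)
obs = np.array([[88, 65], [45, 97], [72, 100], [32, 106], [95, 57], [80, 28]])

# The result unpacks into test statistic, p-value, degrees of freedom
# and the array of expected frequencies
statistic, p_value, dof, expected = chi2_contingency(obs)

# Decision rule: reject H0 if p <= alpha
reject_h0 = p_value <= alpha

print(f"chi2 = {statistic:.2f}, df = {dof}, p = {p_value:.3g}, reject H0: {reject_h0}")
```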

**Step 6: Interpret the result of the hypothesis test.**

$p \approx 6.55 \times 10^{-20}$. At the $5$% significance level, the data provides very strong evidence to conclude that there is an association between gender and the major study subject.