# Chi-Square Test

## Setup

In [1]:
import pandas as pd
import scipy.stats as stats
import numpy as np

## Example 1

Let's apply the chi-square test of independence to an example in which a random sample of 500 U.S. adults is questioned about political affiliation and opinion on a tax reform bill. We will test, at the 5% level of significance, whether political affiliation and opinion on the bill are dependent. The null hypothesis is that the two variables are independent.

### Load data

In [2]:
df = pd.read_csv('political_affiliation.txt', sep='\t')
df.index = ['democrat', 'republican']

### Observed table

In [3]:
observed = df.copy()
observed

| | favor | indifferent | opposed |
|---|---|---|---|
| democrat | 138 | 83 | 64 |
| republican | 64 | 67 | 84 |
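
The file `political_affiliation.txt` is not included here, so the following is a minimal sketch that rebuilds the same observed table inline from the counts shown above. It assumes the file holds nothing beyond these six counts; the sanity checks on the totals are additions for illustration.

```python
import pandas as pd

# Observed counts copied from the table above (rows: affiliation, columns: opinion)
observed = pd.DataFrame(
    {
        'favor': [138, 64],
        'indifferent': [83, 67],
        'opposed': [64, 84],
    },
    index=['democrat', 'republican'],
)

# Sanity checks: the sample size and the number of respondents per affiliation
sample_size = int(observed.to_numpy().sum())
row_sizes = observed.sum(axis=1).tolist()

print(observed)
print('Sample size:', sample_size)
print('Row sizes:', row_sizes)
```

With the table built this way, the remaining cells can be run without the data file.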


### Calculate totals

In [4]:
df['row_totals'] = df.sum(axis=1)
# DataFrame.append was removed in pandas 2.0; add the totals row by label instead
df.loc['col_totals'] = df.sum(axis=0)

df

| | favor | indifferent | opposed | row_totals |
|---|---|---|---|---|
| democrat | 138 | 83 | 64 | 285 |
| republican | 64 | 67 | 84 | 215 |
| col_totals | 202 | 150 | 148 | 500 |


### Calculate expected table

In [5]:
row_totals = df['row_totals'][:-1]
col_totals = df.loc['col_totals', :][:-1]
totals = df.iloc[-1, -1]

expected = np.outer(row_totals, col_totals) / (1.0 * totals)
expected = pd.DataFrame(expected)
expected.columns = observed.columns
expected.index = observed.index

expected

| | favor | indifferent | opposed |
|---|---|---|---|
| democrat | 115.14 | 85.5 | 84.36 |
| republican | 86.86 | 64.5 | 63.64 |
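
The outer-product computation corresponds to the usual expected-count formula under independence. As a worked check for one cell, with the symbols $R_i$ (row total), $C_j$ (column total), and $N$ (grand total) introduced here only for illustration:

$$
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
E_{\text{democrat},\,\text{favor}} = \frac{285 \times 202}{500} = 115.14
$$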


### Calculate the chi-square statistic

In [6]:
chi_square_stat = (((observed-expected)**2)/expected).sum().sum()

print('Chi-square test statistic: {}'.format(chi_square_stat))

Chi-square test statistic: 22.152468645918482
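
For reference, the statistic sums the squared difference between observed and expected counts, scaled by the expected count, over every cell. The notation $O_{ij}$ (observed) and $E_{ij}$ (expected) is introduced here for illustration, and the per-cell contributions are rounded to two decimals, listed row by row:

$$
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\approx 4.54 + 0.07 + 4.91 + 6.02 + 0.10 + 6.51 \approx 22.15
$$

Most of the statistic comes from the "favor" and "opposed" columns, while the "indifferent" column contributes almost nothing.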


### Calculate critical value and p value

In [7]:
degree = 2  # Degrees of freedom: (rows - 1) * (columns - 1) = (2 - 1) * (3 - 1)
crit = stats.chi2.ppf(q=0.95,  # Critical value at the 5% significance level
                      df=degree)

print('Critical value: {}'.format(crit))

p_value = 1 - stats.chi2.cdf(x=chi_square_stat,  # Find the p-value
                             df=degree)

print('P value: {}'.format(p_value))

Critical value: 5.99146454710798
P value: 1.547578021399154e-05
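
The degrees of freedom were hard-coded above. As a minimal sketch, they can instead be derived from the shape of the observed table, and the p-value taken from the survival function `stats.chi2.sf`, which is the same quantity as `1 - cdf`. The variable names `dof` and `reject_null` are illustrative, and the counts are re-entered inline so the block runs on its own.

```python
import numpy as np
from scipy import stats

# Observed counts from Example 1 (rows: democrat, republican)
observed = np.array([[138, 83, 64],
                     [64, 67, 84]])

# Degrees of freedom for an r x c contingency table: (r - 1) * (c - 1)
n_rows, n_cols = observed.shape
dof = (n_rows - 1) * (n_cols - 1)

# Expected counts under independence, then the chi-square statistic
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
chi_square_stat = ((observed - expected) ** 2 / expected).sum()

# Critical value at the 5% level and the upper-tail p-value
crit = stats.chi2.ppf(0.95, df=dof)
p_value = stats.chi2.sf(chi_square_stat, df=dof)

# Reject the null hypothesis when the statistic exceeds the critical value
reject_null = bool(chi_square_stat > crit)

print('Degrees of freedom:', dof)
print('Critical value:', crit)
print('P value:', p_value)
print('Reject null hypothesis:', reject_null)
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision, so they always agree.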


__Note:__ You can use the function `stats.chi2_contingency` to compute the test statistic, p-value, degrees of freedom, and expected table in a single call.

In [8]:
ret = stats.chi2_contingency(observed=observed)

print('Result: {}'.format(ret))
print('Chi-square test statistic: {}'.format(ret[0]))
print('P value: {}'.format(ret[1]))

Result: (22.152468645918482, 1.547578021398957e-05, 2, array([[115.14,  85.5 ,  84.36],
       [ 86.86,  64.5 ,  63.64]]))
Chi-square test statistic: 22.152468645918482
P value: 1.547578021398957e-05
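
Rather than indexing into the result, the four returned values can be unpacked into named variables. The sketch below re-enters the Example 1 counts so that it runs on its own; the variable names are illustrative. One detail to keep in mind is that `chi2_contingency` applies Yates' continuity correction by default only when there is one degree of freedom (a 2x2 table), so for this 2x3 table the result matches the manual calculation.

```python
import numpy as np
from scipy import stats

# Observed counts from Example 1 (rows: democrat, republican)
observed = np.array([[138, 83, 64],
                     [64, 67, 84]])

# The result unpacks as: statistic, p-value, degrees of freedom, expected counts
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print('Chi-square test statistic:', chi2_stat)
print('P value:', p_value)
print('Degrees of freedom:', dof)
print('Expected counts:')
print(np.round(expected, 2))
```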


Because the p-value (about 1.55e-05) is less than 0.05, we reject the null hypothesis that political affiliation and opinion on the tax reform bill are independent. We conclude that they are dependent, that is, there is an association between the two variables.

## Example 2

The operations manager of a company that manufactures tires wants to determine whether the quality of workmanship differs among the three daily shifts. She randomly selects 496 tires and carefully inspects them. Each tire is classified as perfect, satisfactory, or defective, and the shift that produced it is also recorded. The two categorical variables of interest are the shift and the condition of the tire produced.

### Load data

In [9]:
df = pd.read_csv('shift_quality.txt', sep='\t')
df.index = ['Shift1', 'Shift2', 'Shift3']

### Observed table

In [10]:
observed = df.copy()
observed

| | Perfect | Satisfactory | Defective |
|---|---|---|---|
| Shift1 | 106 | 124 | 1 |
| Shift2 | 67 | 85 | 1 |
| Shift3 | 37 | 72 | 3 |


### Calculate totals

In [11]:
df['row_totals'] = df.sum(axis=1)
# DataFrame.append was removed in pandas 2.0; add the totals row by label instead
df.loc['col_totals'] = df.sum(axis=0)

df

| | Perfect | Satisfactory | Defective | row_totals |
|---|---|---|---|---|
| Shift1 | 106 | 124 | 1 | 231 |
| Shift2 | 67 | 85 | 1 | 153 |
| Shift3 | 37 | 72 | 3 | 112 |
| col_totals | 210 | 281 | 5 | 496 |


### Calculate expected table

In [12]:
row_totals = df['row_totals'][:-1]
col_totals = df.loc['col_totals', :][:-1]
totals = df.iloc[-1, -1]

expected = np.outer(row_totals, col_totals) / (1.0 * totals)
expected = pd.DataFrame(expected)
expected.columns = observed.columns
expected.index = observed.index

expected

| | Perfect | Satisfactory | Defective |
|---|---|---|---|
| Shift1 | 97.802419 | 130.868952 | 2.328629 |
| Shift2 | 64.778226 | 86.679435 | 1.542339 |
| Shift3 | 47.419355 | 63.451613 | 1.129032 |


### Calculate chi-square test statistic and p value

In [13]:
ret = stats.chi2_contingency(observed=observed)

print('Result: {}'.format(ret))
print('Chi-square test statistic: {}'.format(ret[0]))
print('P value: {}'.format(ret[1]))

Result: (8.646695992462913, 0.07056326693766583, 4, array([[ 97.80241935, 130.86895161,   2.32862903],
       [ 64.77822581,  86.67943548,   1.54233871],
       [ 47.41935484,  63.4516129 ,   1.12903226]]))
Chi-square test statistic: 8.646695992462913
P value: 0.07056326693766583


In this example, the result is not significant at the 5% level because the p-value (about 0.0706) is greater than 0.05, so we fail to reject the null hypothesis that shift and tire condition are independent. Even a significant result could not be trusted here, because 3 of the 9 cells (33.3%) have expected counts below 5, and the chi-square approximation is unreliable when expected counts are that small.
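
The expected-count check can be automated. The following is a minimal sketch that counts the cells with expected counts below 5 for the Example 2 table. The 20% ceiling in `max_share` is a commonly cited rule of thumb and is an assumption here, not something taken from this notebook's data; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts from Example 2 (rows: shifts 1-3; columns: perfect, satisfactory, defective)
observed = np.array([[106, 124, 1],
                     [67, 85, 1],
                     [37, 72, 3]])

# chi2_contingency returns the expected table as its fourth value
_, _, _, expected = stats.chi2_contingency(observed)

# Count the cells whose expected count falls below 5
n_small = int((expected < 5).sum())
share_small = n_small / expected.size

# Assumed rule of thumb: at most 20% of cells may have expected counts below 5
max_share = 0.20
approximation_ok = share_small <= max_share

print('Cells with expected count < 5:', n_small, 'of', expected.size)
print('Share of small cells: {:.1%}'.format(share_small))
print('Chi-square approximation acceptable:', approximation_ok)
```

All three small cells sit in the "Defective" column, which holds only 5 of the 496 tires. Possible remedies, such as merging sparse categories or using an exact test, are outside the scope of this example.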

## References

1. [9.1 - Chi-Square Test of Independence](https://onlinecourses.science.psu.edu/stat500/node/56)
2. [Python for Data Analysis Part 25: Chi-Squared Tests](http://hamelg.blogspot.hk/2015/11/python-for-data-analysis-part-25-chi.html)