** Chi-square Test of Independence **

How do we test whether two categorical variables are independent? We use the Chi-square test of independence.

As with all prior statistical tests, we need to define null and alternative hypotheses.
The null hypothesis is what is assumed to be true until we have evidence to go against it.

Here we are interested in whether two categorical variables are related or associated (i.e. dependent). Until we have evidence to suggest that they are, we must assume that they are not. This is the motivation behind the hypotheses for the Chi-square Test of Independence:

H0: In the population, the two categorical variables are independent.
Ha: In the population, the two categorical variables are dependent.

Once we have gathered our data, we summarize it in a two-way contingency table.

This table represents the observed counts and is called the Observed Counts Table or simply the Observed Table.

We then ask: under the null hypothesis that the two variables are independent, what would we expect to find in our data if the two variables (e.g. Party Affiliation and Opinion) were not related?
To answer this we need what is called the Expected Counts Table or simply the Expected Table.
This table displays what the counts would be for our sample data if there were no association between the variables.


Finding Expected Counts from Observed Counts
Once we have the observed counts, we need to compute the expected counts under the null hypothesis that the two categorical variables are independent. This is done using the marginal totals and the overall total. In words, to find the expected count for a cell, we multiply the row total and the column total for that cell and divide by the overall total. As a formula, for each cell:

$$E = \frac{\text{row total} \times \text{column total}}{\text{sample size}}$$
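
As a quick check of the formula, the sketch below computes the expected count for a single cell by hand. It takes the Democrat / favor cell of the party-affiliation example used in this notebook (row total 285, column total 202, sample size 500); the variable names are illustrative only.

```python
# Expected count for one cell: (row total x column total) / sample size.
# Values are the Democrat row total, the "favor" column total and the
# overall total from the party-affiliation example.
row_total = 285
col_total = 202
sample_size = 500

expected_count = row_total * col_total / sample_size
print(expected_count)
```

The result should agree with the Democrat / favor entry of the expected table computed for the whole table further down.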


In [11]:
import pandas as pd
import numpy as np

In [2]:
party = ['democrat','democrat','democrat','republican','republican','republican']
opinion = ['favor','indifferent','opposed','favor','indifferent','opposed']
value = [138,83,64,64,67,84]
df = pd.DataFrame({'party':party,'opinion':opinion,'value':value})

In [3]:
df

Unnamed: 0,opinion,party,value
0,favor,democrat,138
1,indifferent,democrat,83
2,opposed,democrat,64
3,favor,republican,64
4,indifferent,republican,67
5,opposed,republican,84


In [4]:
df2 = pd.crosstab(df.party,df.opinion,df.value,aggfunc="sum",margins=True)

In [5]:
df2

opinion,favor,indifferent,opposed,All
party,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
democrat,138.0,83.0,64.0,285.0
republican,64.0,67.0,84.0,215.0
All,202.0,150.0,148.0,500.0


In [6]:
df2.columns = ["favour","indifferent","opposed","row_totals"]

df2.index = ["democrat","republican","col_totals"]

In [7]:
df2

Unnamed: 0,favour,indifferent,opposed,row_totals
democrat,138.0,83.0,64.0,285.0
republican,64.0,67.0,84.0,215.0
col_totals,202.0,150.0,148.0,500.0


In [8]:
#create observed table
observed = df2.iloc[0:2,0:3]   # Get table without totals for later use
observed

Unnamed: 0,favour,indifferent,opposed
democrat,138.0,83.0,64.0
republican,64.0,67.0,84.0


In [24]:
df2["row_totals"].iloc[-1]   # grand total; .iloc makes the positional lookup explicit

500.0

In [25]:
# Expected counts under H0: outer product of the row and column totals, divided by the grand total

expected =  np.outer(df2["row_totals"].iloc[0:2],
                     df2.loc["col_totals"].iloc[0:3]) / df2["row_totals"].iloc[-1]
expected = pd.DataFrame(expected)
expected.columns = ["favour","indifferent","opposed"]
expected.index = ["democrat","republican"]
expected

Unnamed: 0,favour,indifferent,opposed
democrat,115.14,85.5,84.36
republican,86.86,64.5,63.64


The statistical question becomes, "Are the observed counts so different from the expected counts that we can conclude a relationship between the two variables?"  

To conduct this test we compute a Chi-square test statistic in which we compare each cell's observed count to its respective expected count.

This Chi-square test statistic is calculated as follows:

$$\chi^{2*} = \sum_i \frac{(O_i - E_i)^2}{E_i}$$



We make our decision either by comparing the value of the test statistic to a critical value (rejection region approach) or by finding the probability of getting this test statistic value or one more extreme (p-value approach). The critical value for our Chi-square test is
$\chi^2_\alpha$ with degrees of freedom = (r - 1)(c - 1), where r and c are the numbers of rows and columns in the observed table,

while the p-value is found by $P(\chi^2 > \chi^{2*})$ with degrees of freedom = (r - 1)(c - 1).

Next we calculate the chi-square statistic, the critical value and the p-value:

Note: We call .sum() twice: once to get the column sums and a second time to add the column sums together, returning the sum of the entire 2D table.

In [30]:
chi_squared_stat = (((observed-expected)**2)/expected).sum().sum()

print(chi_squared_stat)

22.1524686459
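
To see which cells drive the statistic, it can help to look at each cell's contribution (O - E)^2 / E before summing. The following is a minimal, self-contained sketch that rebuilds the observed table from the counts above; the names `obs`, `exp` and `contributions` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Observed counts from the party-affiliation example (no margins).
obs = pd.DataFrame(
    [[138, 83, 64], [64, 67, 84]],
    index=["democrat", "republican"],
    columns=["favour", "indifferent", "opposed"],
)

# Expected counts under independence: outer product of the margins / grand total.
exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.to_numpy().sum()

# Per-cell contribution to the chi-square statistic.
contributions = (obs - exp) ** 2 / exp
print(contributions.round(4))

# The contributions add up to the chi-square statistic.
total = contributions.to_numpy().sum()
print(total)

# Cell with the largest contribution, as a (row label, column label) pair.
largest_cell = contributions.stack().idxmax()
print(largest_cell)
```

Cells with large contributions are the ones where the observed count departs most from what independence would predict.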


** Note: The degrees of freedom for a test of independence equal (number of rows - 1) x (number of columns - 1). In this case we have a 2x3 table, so df = 1 x 2 = 2. **
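
The degrees-of-freedom rule can be written as a small helper. This is only a sketch; `independence_dof` is a hypothetical name introduced here, and it assumes the table is passed without its marginal totals.

```python
import numpy as np

def independence_dof(table):
    """Return (r - 1) * (c - 1) for an r x c table of counts without margins."""
    n_rows, n_cols = np.asarray(table).shape
    return (n_rows - 1) * (n_cols - 1)

# 2x3 party-affiliation table from this notebook.
party_table = [[138, 83, 64], [64, 67, 84]]
print(independence_dof(party_table))
```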

** Method 1 using critical value to evaluate the hypotheses **

In [33]:
import scipy.stats as stats

# Critical value at the 5% significance level (95th percentile) for 2 degrees of freedom
crit = stats.chi2.ppf(q = 0.95, df = 2)

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,df=2)
print("P value")
print(p_value)



Critical value
5.99146454711
P value
1.5475780214e-05


** Note **

The output shows the critical value (5.99146454711) and the p-value (1.5475780214e-05, i.e. 0.000 to three decimal places) for the chi-square statistic of 22.1524686459 with 2 degrees of freedom.

Since the test statistic 22.1524686459 exceeds the critical value 5.99146454711, we reject the null hypothesis and conclude that the two categorical variables are dependent.

Hence we conclude that political affiliation and opinion on a tax reform bill are dependent.

** Method 2 using p value to evaluate null hypothesis **

Use the stats.chi2_contingency() function to conduct a test of independence automatically, given a frequency table of observed counts:

In [23]:
import scipy.stats as stats
# The first return value is the chi-square test statistic, not a critical value
chi2_stat,p_value,d_freedom,expected = stats.chi2_contingency(observed= observed)
print("Chi-square statistic \n ",chi2_stat)
print(" p value \n {:.4f}".format(p_value))
print(" degree of freedom \n",  d_freedom)
print(" Expected table \n",expected)

Chi-square statistic 
  22.1524686459
 p value 
 0.0000
 degree of freedom 
 2
 Expected table 
 [[ 115.14   85.5    84.36]
 [  86.86   64.5    63.64]]


** The p-value is $P(\chi^2 > 22.152)$ with degrees of freedom = (2-1)(3-1) = 2, which is approximately 1.5e-05 (0.0000 to four decimal places). **

Since this p-value is less than the significance level (alpha) of 0.05,

we reject the null hypothesis that political affiliation and opinion on a tax reform bill are independent.

We conclude that they are dependent, that there is an association between the two variables.
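
As a cross-check, the two methods can be compared side by side. The sketch below is self-contained and uses the same observed counts; names such as `manual_stat` are illustrative. One point worth knowing: `chi2_contingency` applies Yates' continuity correction by default only when the degrees of freedom equal 1 (a 2x2 table), so for this 2x3 table its statistic should match the hand-computed one.

```python
import numpy as np
from scipy import stats

observed_counts = np.array([[138, 83, 64], [64, 67, 84]])

# Method 2: let SciPy compute everything.
chi2_stat, p_val, dof, expected_counts = stats.chi2_contingency(observed_counts)

# Method 1: hand-computed statistic, critical value and p-value.
manual_stat = ((observed_counts - expected_counts) ** 2 / expected_counts).sum()
critical_value = stats.chi2.ppf(0.95, df=dof)
manual_p = stats.chi2.sf(manual_stat, df=dof)  # survival function = 1 - cdf

reject_by_critical_value = bool(manual_stat > critical_value)
reject_by_p_value = bool(manual_p < 0.05)

print(np.isclose(chi2_stat, manual_stat))
print(reject_by_critical_value, reject_by_p_value)
```

Both decision rules are equivalent at a given significance level, so they should always lead to the same conclusion.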


** Independence from Summarized Data **

Condition for Using the Chi-square Test
Exercise caution when there are small expected counts. Some statisticians hesitate to use the chi-square test if more than 20% of the cells have expected frequencies below five, especially if the p-value is small and these cells make a large contribution to the total chi-square value.

Example: Tire Quality
    
The operations manager of a company that manufactures tires wants to determine whether there are any differences in the quality of workmanship among the three daily shifts. She randomly selects 496 tires and carefully inspects them. Each tire is either classified as perfect, satisfactory, or defective, and the shift that produced it is also recorded. The two categorical variables of interest are: shift and condition of the tire produced. The data (shift_quality.txt) can be summarized by the accompanying two-way table. Do these data provide sufficient evidence at the 5% significance level to infer that there are differences in quality among the three shifts?

In [42]:
shift = ['Shift 1','Shift 1','Shift 1','Shift 2','Shift 2','Shift 2','Shift 3','Shift 3','Shift 3']
obs = ['Perfect','Satisfactory','Defective','Perfect','Satisfactory','Defective','Perfect','Satisfactory','Defective']
data = [106,124,1,67,85,1,37,72,3]
df_1 = pd.DataFrame({'shift':shift,'obs':obs,'data':data})
df_1  = df_1[['shift','obs','data']]
df_1

Unnamed: 0,shift,obs,data
0,Shift 1,Perfect,106
1,Shift 1,Satisfactory,124
2,Shift 1,Defective,1
3,Shift 2,Perfect,67
4,Shift 2,Satisfactory,85
5,Shift 2,Defective,1
6,Shift 3,Perfect,37
7,Shift 3,Satisfactory,72
8,Shift 3,Defective,3


In [46]:
df_pivot = pd.pivot_table(df_1,index='shift',columns='obs',values='data',aggfunc=[np.sum],margins=True)

In [49]:
df_pivot

Unnamed: 0_level_0,sum,sum,sum,sum
obs,Defective,Perfect,Satisfactory,All
shift,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Shift 1,1.0,106.0,124.0,231.0
Shift 2,1.0,67.0,85.0,153.0
Shift 3,3.0,37.0,72.0,112.0
All,5.0,210.0,281.0,496.0


In [50]:
df_pivot.columns = ["Defective","Perfect","Satisfactory","row_count"]

df_pivot.index = ["Shift 1","Shift 2","Shift 3","col_count"]

In [51]:
df_pivot

Unnamed: 0,Defective,Perfect,Satisfactory,row_count
Shift 1,1.0,106.0,124.0,231.0
Shift 2,1.0,67.0,85.0,153.0
Shift 3,3.0,37.0,72.0,112.0
col_count,5.0,210.0,281.0,496.0


In [53]:
#create observed table
observed = df_pivot.iloc[0:3,0:3]   # Get table without totals for later use
observed

Unnamed: 0,Defective,Perfect,Satisfactory
Shift 1,1.0,106.0,124.0
Shift 2,1.0,67.0,85.0
Shift 3,3.0,37.0,72.0


In [54]:
import scipy.stats as stats
# The first return value is the chi-square test statistic, not a critical value
chi2_stat,p_value,d_freedom,expected = stats.chi2_contingency(observed= observed)
print("Chi-square statistic \n ",chi2_stat)
print(" p value \n {:.4f}".format(p_value))
print(" degree of freedom \n",  d_freedom)
print(" Expected table \n",expected)

Chi-square statistic 
  8.64669599246
 p value 
 0.0706
 degree of freedom 
 4
 Expected table 
 [[   2.32862903   97.80241935  130.86895161]
 [   1.54233871   64.77822581   86.67943548]
 [   1.12903226   47.41935484   63.4516129 ]]


In the above example, the result is not significant at the 5% significance level, since the p-value (0.071) is greater than 0.05. Even if it had been significant, we could not trust the result, because 3 of the 9 cells (33.3%) have expected counts below 5.0, which exceeds the 20% guideline.
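
The small-expected-count condition can be checked programmatically. The following sketch recomputes the expected counts for the tire data with NumPy and counts the cells below five; the threshold of five and the 20% guideline come from the condition stated above, while the variable names are introduced here for illustration.

```python
import numpy as np

# Observed tire counts (rows: Shift 1-3; columns: Defective, Perfect, Satisfactory).
tire_counts = np.array([[1, 106, 124],
                        [1, 67, 85],
                        [3, 37, 72]])

# Expected counts under independence.
row_totals = tire_counts.sum(axis=1)
col_totals = tire_counts.sum(axis=0)
tire_expected = np.outer(row_totals, col_totals) / tire_counts.sum()

# How many cells have an expected count below 5, and what share is that?
n_small = int((tire_expected < 5).sum())
share_small = n_small / tire_expected.size

print(n_small, round(share_small, 3))
print("Guideline violated:", share_small > 0.20)
```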