# Chi$^2$ ($\chi^2$) Test for Independence

Also known as Pearson's Chi$^2$ test. 'Chi' is pronounced 'kai', with the same 'ki' sound as in 'kite'.


https://docs.google.com/presentation/d/13V7cMcgbM6bIQL2fbMtONre15iiNKpxnX7ECiWTxrVI/edit?usp=sharing

This test lets us check the hypothesis that one categorical group is independent of another:
- $H_0$ is always that there is no association between the groups (they are independent)
- $H_a$ is that there is an association between the groups (they are not independent)


The null hypothesis assumes that the observed frequencies for a categorical variable match the frequencies we would expect if the variables were independent.
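
To make "expected frequencies" concrete, here is the standard textbook form of the calculation. The symbols are notation introduced here for illustration: $O_{ij}$ is the observed count in row $i$ and column $j$ of the contingency table, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total.

$$E_{ij} = \frac{R_i \, C_j}{N}, \qquad \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

For a table with $r$ rows and $c$ columns, the statistic is compared against a $\chi^2$ distribution with $(r - 1)(c - 1)$ degrees of freedom.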

## The Quick Way To Run a Chi$^2$ Test

1. form hypothesis
2. make contingency table
3. use `stats.chi2_contingency`
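
The three steps above can be sketched end to end as follows. This is a minimal sketch, not the notebook's own solution: the `survey` DataFrame and its `plan` and `churned` columns are hypothetical data invented for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical data, invented purely for illustration
survey = pd.DataFrame({
    "plan":    ["basic", "basic", "basic", "premium", "premium", "premium"] * 10,
    "churned": ["yes", "yes", "no", "no", "no", "yes"] * 10,
})

# 1. Form hypothesis
#    H_0: plan and churned are independent
#    H_a: plan and churned are associated
alpha = 0.05

# 2. Make contingency table (counts for every combination of categories)
observed = pd.crosstab(survey["plan"], survey["churned"])

# 3. Use stats.chi2_contingency, which returns four values
chi2, p, degf, expected = stats.chi2_contingency(observed)

print(observed)
print(f"chi^2 = {chi2:.4f}, p = {p:.4f}, degrees of freedom = {degf}")
print("We reject the null" if p < alpha else "We fail to reject the null")
```

For a 2x2 table like this one, `stats.chi2_contingency` applies Yates' continuity correction by default; passing `correction=False` turns it off.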

## Example 1 - Tips Data

In [1]:
import pandas as pd
import numpy as np

from pydataset import data
from scipy import stats

In [None]:
#load tips dataset


In [None]:
#check out the dataset


## Is smoking independent of time of day?

### 1. Form hypothesis

- $H_0$: There is no association between smoker and time of day (independence)
- $H_a$: There is an association between smoker and time of day

### 2. Make contingency table

In [None]:
#set our alpha


In [None]:
#look at smoker data


In [None]:
#look at time data


In [None]:
#make 'contingency' table using pandas crosstab


### 3. Use stats.chi2_contingency

In [None]:
#use stats chi2_contingency


In [None]:
#chi2_contingency returns 4 values - chi2, p-value, degrees of freedom, expected values
chi2, p, degf, expected

In [None]:
#output values
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

In [None]:
if p < alpha:
    print('We reject the null')
else:
    print("we fail to reject the null")

> A low chi^2 value typically leads to a high p-value

## Example 2 - Attrition Data

In [2]:
#get data
df = pd.read_csv("https://gist.githubusercontent.com/ryanorsinger/6ba2dd985c9aa92f5598fc0f7c359f6a/raw/b20a508cee46e6ac69eb1e228b167d6f42d665d8/attrition.csv")

In [4]:
#always look at your data!!
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


### What could be categorical data???

Let's guess by looking at how many different values each column has
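
As a rough sketch of that idea, the snippet below counts distinct values per column and flags the columns with only a few of them. The `staff` DataFrame and the cutoff of 3 are hypothetical choices made for illustration, not part of the attrition data.

```python
import pandas as pd

# Hypothetical employee records, invented for illustration only
staff = pd.DataFrame({
    "department": ["Sales", "Sales", "HR", "R&D", "R&D", "R&D"],
    "left":       ["Yes", "No", "No", "No", "Yes", "No"],
    "daily_rate": [1102, 279, 1373, 1392, 591, 1005],
})

# nunique() returns how many distinct values each column holds
distinct_counts = staff.nunique()
print(distinct_counts)

# Columns with few distinct values are candidates for categorical variables
# (the cutoff of 3 is an arbitrary choice for this tiny example)
candidates = distinct_counts[distinct_counts <= 3].index.tolist()
print(candidates)
```

A low count is only a hint: a numeric rating with four levels may be worth treating as categorical, while a constant column with a single value carries no information for the test.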

In [None]:
#.nunique counts the number of unique values in each column


## Ex. Is Attrition independent from Business Travel amount?

### 1. Form hypothesis

$H_0$: Attrition and Business travel have no association (They are independent)

$H_a$: Attrition and Business travel are associated (they are dependent)

### 2. Make contingency table

Let's scope out our columns and see what categories we have

In [None]:
#get unique values and counts from Attrition


In [None]:
#get unique values and counts from BusinessTravel


In [None]:
#make contingency table


### 3. Use stats.chi2_contingency

In [None]:
#calculate chi2 values
chi2, p, degf, expected

In [None]:
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

> Typically a high chi^2 value leads to a low p-value, but how low depends on the degrees of freedom
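
To see how the degrees of freedom change the picture, the sketch below converts a few chi^2 statistics into p-values with `stats.chi2.sf`, the survival function of the chi^2 distribution. The statistic values and degrees of freedom are illustrative numbers, not results from the attrition data.

```python
from scipy import stats

# Illustrative chi^2 statistics and degrees of freedom (not taken from a dataset)
for chi2_stat in (1.0, 5.0, 10.0):
    for dof in (1, 2, 8):
        # sf gives P(X >= chi2_stat) for a chi^2 distribution with dof degrees of freedom
        p_value = stats.chi2.sf(chi2_stat, dof)
        print(f"chi^2 = {chi2_stat:>4}, dof = {dof}: p = {p_value:.4f}")
```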

In [None]:
if p < alpha:
    print('We reject the null')
else:
    print("we fail to reject the null")

## Mini Exercise:
### Is Attrition independent from Department?

### 1. Form hypothesis

- $H_0$: There is no association between Attrition and Department (They are independent)
- $H_a$: There is an association between Attrition and Department (They are not independent)

### 2. Make contingency table

In [6]:
# how many categories do we have in the 'Department' column? (hint: value_counts())

df.Department.value_counts()

Research & Development    961
Sales                     446
Human Resources            63
Name: Department, dtype: int64

In [15]:
# crosstab for observed values between Attrition and Depts
observed = pd.crosstab(df.Department, df.Attrition)
observed

| Department | Attrition = No | Attrition = Yes |
|---|---|---|
| Human Resources | 51 | 12 |
| Research & Development | 828 | 133 |
| Sales | 354 | 92 |


### 3. Use stats.chi2_contingency

In [13]:
# use stats.chi2_contingency test 

chi2, p, degf, expected = stats.chi2_contingency(observed)


In [16]:
#output values
print('Observed')
print(observed.values)
print('\nExpected')
print(expected.astype(int))
print('\n----')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

Observed
[[ 51  12]
 [828 133]
 [354  92]]

Expected
[[ 52  10]
 [806 154]
 [374  71]]

----
chi^2 = 10.7960
p     = 0.0045
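
As a cross-check on where these numbers come from, the sketch below rebuilds the expected counts and the chi^2 statistic by hand from the observed counts printed above, then compares them with `stats.chi2_contingency`. The variable names (`row_totals`, `chi2_manual`, and so on) are chosen here for illustration.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab output above
observed = np.array([[51, 12], [828, 133], [354, 92]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count per cell under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / n

# chi^2 statistic: squared gaps between observed and expected, scaled by expected
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

# Compare with SciPy's result
chi2, p, degf, expected_scipy = stats.chi2_contingency(observed)

print(np.round(expected, 2))
print(f"manual: chi^2 = {chi2_manual:.4f}, p = {p_manual:.4f}, dof = {dof}")
print(f"scipy : chi^2 = {chi2:.4f}, p = {p:.4f}, dof = {degf}")
```

Note that `expected.astype(int)` in the cell above truncates rather than rounds, so an expected count of about 52.84 is displayed as 52.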


In [None]:
if p < alpha:
    print("We reject the null hypothesis")
else:
    print("We fail to reject the null hypothesis")

## Correlation Extra FYI

> All expected values should be at least 5 for the chi^2 approximation to be reliable; we normally don't run into this issue

> If any expected value is below 5, use Fisher's exact test instead
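
One way to apply this rule of thumb is sketched below: compute the expected counts first, then pick the test. The 2x2 `observed` table is hypothetical, invented so that some expected counts fall below 5, and `test_used` is just a label added for illustration.

```python
import numpy as np
from scipy import stats
from scipy.stats.contingency import expected_freq

# Hypothetical 2x2 table with small counts, invented for illustration
observed = np.array([[3, 9], [7, 2]])

# Expected counts under independence, without running the full test
expected = expected_freq(observed)
print(np.round(expected, 2))

if (expected < 5).any():
    # Fisher's exact test; for a 2x2 table it returns the odds ratio and the p-value
    odds_ratio, p = stats.fisher_exact(observed)
    test_used = "fisher_exact"
else:
    chi2, p, degf, _ = stats.chi2_contingency(observed)
    test_used = "chi2_contingency"

print(f"{test_used}: p = {p:.4f}")
```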