# Chi-Squared Test

Determine whether two categorical variables are associated with each other, or whether a single categorical variable follows an expected distribution.

In [16]:
import numpy as np
import scipy.stats as stats

## Goodness of Fit Test

Determine whether an observed categorical variable follows an expected distribution

- Null Hypothesis: The observed variable follows the expected distribution
- Alternative Hypothesis: The observed variable does not follow the expected distribution

In [2]:
observations = [650, 570, 420, 480, 510, 380, 490]
expectations = [500, 500, 500, 500, 500, 500, 500]
result = stats.chisquare(f_obs=observations, f_exp=expectations)
result

Power_divergenceResult(statistic=97.6, pvalue=7.943886923343835e-19)
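The statistic above can be reproduced by hand, which makes it clearer what `stats.chisquare` computes. The following is a minimal sketch, assuming the usual Pearson formula (sum of squared deviations divided by the expected counts) with degrees of freedom equal to the number of categories minus one; the variable names `statistic`, `dof`, and `pvalue` are chosen here for illustration.

```python
import numpy as np
from scipy import stats

observations = np.array([650, 570, 420, 480, 510, 380, 490])
expectations = np.array([500, 500, 500, 500, 500, 500, 500])

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = np.sum((observations - expectations) ** 2 / expectations)

# Degrees of freedom: number of categories minus one
dof = len(observations) - 1

# p-value: upper-tail probability of the chi-squared distribution
pvalue = stats.chi2.sf(statistic, dof)

print(statistic, dof, pvalue)
```

A p-value this small means the null hypothesis is rejected at any conventional significance level.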

## Chi-Squared Test for Independence

Determine if two categorical variables are related to each other

In [4]:
observations = np.array([[850, 450],[1300, 900]])
result = stats.contingency.chi2_contingency(observations, correction=False)
result

Chi2ContingencyResult(statistic=13.660757846804358, pvalue=0.00021898310129108426, dof=1, expected_freq=array([[ 798.57142857,  501.42857143],
       [1351.42857143,  848.57142857]]))
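The `expected_freq` array in the result comes from the row and column totals of the contingency table. The sketch below recomputes it by hand under the assumption of independence (expected count = row total × column total / grand total), without a continuity correction, to match `correction=False`. The names `row_totals`, `col_totals`, and `expected` are illustrative.

```python
import numpy as np
from scipy import stats

observations = np.array([[850, 450], [1300, 900]])

# Marginal totals, kept 2-D so they can be multiplied as matrices
row_totals = observations.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observations.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observations.sum()

# Expected counts if the two variables were independent
expected = row_totals @ col_totals / grand_total

# Pearson chi-squared statistic, with no continuity correction
statistic = np.sum((observations - expected) ** 2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observations.shape[0] - 1) * (observations.shape[1] - 1)
pvalue = stats.chi2.sf(statistic, dof)

print(expected)
print(statistic, dof, pvalue)
```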

# Analysis of Variance (ANOVA)

A group of statistical techniques that test for differences in means among three or more groups.

In [17]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import f_oneway
import seaborn as sns

## One Way

Compare the means of one continuous dependent variable based on three or more groups of one categorical variable
- Null Hypothesis: All group means are equal
- Alternative Hypothesis: At least one group mean differs from the others

In [13]:
diamonds = sns.load_dataset("diamonds")
model = ols(formula = "price ~ C(color)", data = diamonds).fit()

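# Only the last expression in a cell is displayed, so the table below is the Type III result
sm.stats.anova_lm(model, typ = 2)  # Type II sums of squares
sm.stats.anova_lm(model, typ = 1)  # Type I (sequential) sums of squares
sm.stats.anova_lm(model, typ = 3)  # Type III sums of squares (displayed)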

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| Intercept | 68079330000.0 | 1.0 | 4415.122902 | 0.0 |
| C(color) | 26849110000.0 | 6.0 | 290.205881 | 0.0 |
| Residual | 831624000000.0 | 53933.0 | | |


In [12]:
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]

f_oneway(performance1, performance2, performance3, performance4)

F_onewayResult(statistic=4.625000000000002, pvalue=0.01633645983978022)
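The F statistic reported by `f_oneway` is the ratio of between-group variability to within-group variability. The sketch below recomputes it by hand for the same four performance lists, assuming the standard one-way decomposition into between-group and within-group sums of squares. The names `ss_between`, `ss_within`, and `f_statistic` are illustrative.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([89, 89, 88, 78, 79]),
    np.array([93, 92, 94, 89, 88]),
    np.array([89, 88, 89, 93, 90]),
    np.array([81, 78, 81, 92, 82]),
]

k = len(groups)                    # number of groups
n = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))
pvalue = stats.f.sf(f_statistic, k - 1, n - k)

print(f_statistic, pvalue)
```

With a p-value of about 0.016, the null hypothesis of equal group means is rejected at the 0.05 level.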

## Two Way

Compare the means of one continuous dependent variable across the groups defined by two categorical variables (C1 and C2), including the interaction between them

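Main effects (C1, C2)
- Null Hypothesis: The mean of the dependent variable is the same across all levels of the variable
- Alternative Hypothesis: At least one level has a different mean

Interaction
- Null Hypothesis: The effect of C1 on the dependent variable does not depend on C2 (no interaction)
- Alternative Hypothesis: The effect of C1 on the dependent variable depends on the level of C2 (interaction present)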

In [14]:
model = ols(formula = "price ~ C(color) + C(cut) + C(color):C(cut)", data = diamonds).fit()

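# Only the last expression in a cell is displayed, so the table below is the Type III result
sm.stats.anova_lm(model, typ = 2)  # Type II sums of squares
sm.stats.anova_lm(model, typ = 1)  # Type I (sequential) sums of squares
sm.stats.anova_lm(model, typ = 3)  # Type III sums of squares (displayed)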

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| Intercept | 19589000000.0 | 1.0 | 1287.312574 | 1.262195e-278 |
| C(color) | 9758118000.0 | 6.0 | 106.877566 | 1.888935e-134 |
| C(cut) | 1548399000.0 | 4.0 | 25.438686 | 4.357543e-21 |
| C(color):C(cut) | 1653455000.0 | 24.0 | 4.527442 | 1.00078e-12 |
| Residual | 820270900000.0 | 53905.0 | | |


## Assumptions
- The dependent variable in each group comes from a normal distribution
- The variances across groups are equal
- Observations are independent of each other
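The first two assumptions can be checked with standard tests. The sketch below is one possible approach, not the only one: it applies the Shapiro-Wilk test (`scipy.stats.shapiro`) to each group for normality and Levene's test (`scipy.stats.levene`) for equal variances, reusing the four performance lists from the one-way example. The choice of these two tests and the 0.05 threshold are assumptions made for illustration; independence is a matter of study design and cannot be tested this way.

```python
from scipy import stats

groups = [
    [89, 89, 88, 78, 79],
    [93, 92, 94, 89, 88],
    [89, 88, 89, 93, 90],
    [81, 78, 81, 92, 82],
]
alpha = 0.05  # illustrative significance level

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal)
shapiro_pvalues = [stats.shapiro(g).pvalue for g in groups]
for i, p in enumerate(shapiro_pvalues, start=1):
    verdict = "no evidence against normality" if p > alpha else "normality rejected"
    print(f"group {i}: Shapiro-Wilk p = {p:.4f} ({verdict})")

# Equal variances: Levene's test (null hypothesis: all group variances are equal)
levene_result = stats.levene(*groups)
verdict = "no evidence against equal variances" if levene_result.pvalue > alpha else "equal variances rejected"
print(f"Levene p = {levene_result.pvalue:.4f} ({verdict})")
```

With only five observations per group these tests have little power, so a non-significant result is weak evidence that the assumption holds.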


# Post Hoc Test
Pairwise comparisons between all available groups while controlling the family-wise error rate

## Tukey's HSD (Honestly Significant Difference) Test

In [19]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_oneway = pairwise_tukeyhsd(endog = diamonds["price"], groups = diamonds["color"], alpha = 0.05)
tukey_oneway.summary()

| group1 | group2 | meandiff | p-adj | lower | upper | reject |
|---|---|---|---|---|---|---|
| D | E | -93.2016 | 0.7437 | -276.1437 | 89.7404 | False |
| D | F | 554.9323 | 0.0 | 370.9936 | 738.871 | True |
| D | G | 829.1816 | 0.0 | 651.2593 | 1007.1038 | True |
| D | H | 1316.7151 | 0.0 | 1127.1688 | 1506.2614 | True |
| D | I | 1921.9209 | 0.0 | 1710.9515 | 2132.8902 | True |
| D | J | 2153.8639 | 0.0 | 1894.0127 | 2413.7152 | True |
| E | F | 648.1339 | 0.0 | 481.6095 | 814.6584 | True |
| E | G | 922.3832 | 0.0 | 762.5293 | 1082.2371 | True |
| E | H | 1409.9167 | 0.0 | 1237.2183 | 1582.6151 | True |
| E | I | 2015.1225 | 0.0 | 1819.1505 | 2211.0945 | True |
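To see the same idea on a dataset small enough to inspect by eye, the sketch below runs Tukey's HSD on the four performance lists from the one-way ANOVA example. It uses `scipy.stats.tukey_hsd` rather than the statsmodels function above, which is a swapped-in alternative and assumes SciPy 1.8 or newer. The 0.05 threshold and the `significant_pairs` name are illustrative.

```python
from itertools import combinations

from scipy import stats

groups = [
    [89, 89, 88, 78, 79],
    [93, 92, 94, 89, 88],
    [89, 88, 89, 93, 90],
    [81, 78, 81, 92, 82],
]
alpha = 0.05  # illustrative family-wise significance level

# result.statistic[i, j] is the difference in means between group i and group j;
# result.pvalue[i, j] is the p-value adjusted for all pairwise comparisons
result = stats.tukey_hsd(*groups)
interval = result.confidence_interval(confidence_level=1 - alpha)

significant_pairs = []
for i, j in combinations(range(len(groups)), 2):
    reject = result.pvalue[i, j] < alpha
    if reject:
        significant_pairs.append((i + 1, j + 1))
    print(
        f"performance{i + 1} vs performance{j + 1}: "
        f"meandiff = {result.statistic[i, j]:.2f}, p-adj = {result.pvalue[i, j]:.4f}, "
        f"CI = [{interval.low[i, j]:.2f}, {interval.high[i, j]:.2f}], reject = {reject}"
    )
```

Each printed line corresponds to one row of a summary table like the one above: the mean difference, the adjusted p-value, the confidence bounds, and whether the null hypothesis of equal means is rejected for that pair.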
