### **Chi-Squared Goodness-Of-Fit Test**



In our study of t-tests, we introduced the one-sample t-test to check whether a sample mean differs from an expected (population) mean. The chi-squared goodness-of-fit test is an analog of the one-sample t-test for categorical variables: it tests whether the distribution of sample categorical data matches an expected distribution. For example, you could use a chi-squared goodness-of-fit test to check whether the race demographics of members at your church or school match those of the entire U.S. population, or whether the web browser preferences of your friends match those of Internet users as a whole. <br/>
When working with categorical data, the values themselves aren't of much use for statistical testing because categories like "male", "female", and "other" have no mathematical meaning. Tests dealing with categorical variables are therefore based on the counts of each category rather than on the category values themselves. <br/>
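As a minimal sketch of what "working with counts" means, the snippet below tallies a small made-up categorical sample. The `responses` values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample of a categorical variable (made-up values)
responses = pd.Series(["male", "female", "female", "other", "male", "female"])

# The labels have no numeric meaning, but their counts do
counts = responses.value_counts()
print(counts)

# Proportions are what get compared against an expected distribution
proportions = counts / counts.sum()
print(proportions)
```

The test statistic introduced later is built entirely from tallies like `counts`, never from the labels themselves.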
Let's generate some fake demographic data for the U.S. and Minnesota (stored below as national and city) and walk through the chi-squared goodness-of-fit test to check whether their distributions differ: <br/>


In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats

In [2]:
national = pd.DataFrame(["bluecollar"]*100000 + ["student"]*60000 +
                        ["whitecollar"]*50000 + ["old"]*15000 + ["other"]*35000)

city = pd.DataFrame(["bluecollar"]*600 + ["student"]*300 +
                    ["whitecollar"]*250 + ["old"]*75 + ["other"]*150)

national_table = pd.crosstab(index=national[0], columns="count")
city_table = pd.crosstab(index=city[0], columns="count")

print("National")
print(national_table)
print(" ")
print("City")
print(city_table)

National
col_0         count
0                  
bluecollar   100000
old           15000
other         35000
student       60000
whitecollar   50000
 
City
col_0        count
0                 
bluecollar     600
old             75
other          150
student        300
whitecollar    250


$\chi^2$ tests are based on the so-called chi-squared statistic. You calculate the chi-squared statistic with the following formula:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$
 
In the formula, observed is the actual observed count for each category and expected is the expected count based on the distribution of the population for the corresponding category. Let's calculate the chi-squared statistic for our data to illustrate:

In [5]:
observed = city_table

national_ratios = national_table/len(national)  # Get population ratios

expected = national_ratios * len(city)   # Get expected counts

chi_squared_stat = (((observed-expected)**2)/expected).sum()

print(chi_squared_stat)

col_0
count    18.194805
dtype: float64
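To see where the statistic comes from, here is a sketch that repeats the calculation with plain NumPy arrays and prints each category's term of the sum. The counts are copied from the tables above; the variable names are chosen for this illustration only.

```python
import numpy as np

categories = ["bluecollar", "old", "other", "student", "whitecollar"]
observed_counts = np.array([600, 75, 150, 300, 250])             # city counts
national_counts = np.array([100000, 15000, 35000, 60000, 50000])

# Scale the national proportions to the size of the city sample
expected_counts = national_counts / national_counts.sum() * observed_counts.sum()

# One term of the sum per category
contributions = (observed_counts - expected_counts) ** 2 / expected_counts

for name, exp, contrib in zip(categories, expected_counts, contributions):
    print(f"{name:12s} expected={exp:8.2f} contribution={contrib:6.3f}")

print("chi-squared statistic:", contributions.sum())
```

Most of the statistic comes from the bluecollar and other categories (roughly 9.6 and 6.7 of the 18.19 total), which is where the city sample departs most from the national proportions.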


*Note: The chi-squared test assumes none of the expected counts are less than 5. <br/>
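A quick way to check that assumption for this example is sketched below. It recomputes the expected counts from the figures above; `assumption_ok` is a name introduced here for illustration.

```python
import numpy as np

national_counts = np.array([100000, 15000, 35000, 60000, 50000])
sample_size = 1375  # number of rows in the city sample

expected_counts = national_counts / national_counts.sum() * sample_size

# Rule of thumb from the note: every expected count should be at least 5
assumption_ok = bool((expected_counts >= 5).all())
print(expected_counts.round(2))
print("All expected counts >= 5:", assumption_ok)
```

Here the smallest expected count is about 79, so the assumption holds comfortably.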
Just as we compared the t statistic to a critical value from the t-distribution to decide whether a result is significant, in the chi-squared test we compare the chi-squared statistic to a critical value from the chi-squared distribution. In scipy, the chi-squared distribution is available as scipy.stats.chi2. Let's use it to find the critical value for a 95% confidence level and the p-value of our result: <br/>


In [6]:
crit = stats.chi2.ppf(q = 0.95, # Find the critical value for 95% confidence*
                      df = 4)   # Df = number of variable categories - 1

print("Critical value")
print(crit)

p_value = 1 - stats.chi2.cdf(x=chi_squared_stat,  # Find the p-value
                             df=4)
print("P value")
print(p_value)

Critical value
9.487729036781154
P value
[0.00113047]
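The same numbers can be obtained from the distribution's survival function, which is defined as 1 - cdf. The sketch below uses the statistic printed earlier and assumes a significance level of 0.05, which corresponds to q = 0.95 above.

```python
import scipy.stats as stats

chi_squared_stat = 18.19480519  # statistic computed above
df = 4                          # 5 categories - 1
alpha = 0.05                    # assumed significance level (matches q = 0.95)

p_value = stats.chi2.sf(chi_squared_stat, df)  # survival function = 1 - cdf
crit = stats.chi2.isf(alpha, df)               # same value as ppf(1 - alpha)

print("p-value:", p_value)
print("critical value:", crit)
print("reject null:", chi_squared_stat > crit and p_value < alpha)
```

Comparing the statistic to the critical value and comparing the p-value to alpha are two views of the same decision, so they always agree.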



*Note: we are only interested in the right tail of the chi-squared distribution, because the deviations are squared and any mismatch between observed and expected counts can only increase the statistic. <br/>
Since our chi-squared statistic (18.19) exceeds the critical value (9.49), and the p-value (about 0.001) is well below 0.05, we reject the null hypothesis that the two distributions are the same. <br/>
You can carry out a chi-squared goodness-of-fit test automatically using the scipy function scipy.stats.chisquare():


In [8]:
stats.chisquare(f_obs= observed,   # Array of observed counts
                f_exp= expected)   # Array of expected counts

#  The test results agree with the values we calculated above.

Power_divergenceResult(statistic=array([18.19480519]), pvalue=array([0.00113047]))
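The result above holds one-element arrays because the inputs were single-column DataFrames. As a sketch of an alternative, passing 1-D arrays gives plain scalars, and the fields can be read by name. The counts are copied from the tables above.

```python
import numpy as np
import scipy.stats as stats

observed_counts = np.array([600, 75, 150, 300, 250])
national_counts = np.array([100000, 15000, 35000, 60000, 50000])

# Expected counts must be on the same scale as the observed counts
expected_counts = national_counts / national_counts.sum() * observed_counts.sum()

result = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)
print("statistic:", result.statistic)
print("p-value:", result.pvalue)
```

Note that the expected counts are scaled so they sum to the same total as the observed counts. Recent SciPy releases check this and raise an error when the totals disagree.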

### **Chi-Squared Test of Independence**


Independence is a key concept in probability that describes a situation where knowing the value of one variable 
tells you nothing about the value of another. For instance, the month you were born probably 
doesn't tell you anything about which web browser you use, 
so we'd expect birth month and browser preference to be independent. 
On the other hand, your month of birth might be related to whether 
you excelled at sports in school, so month of birth and sports performance might not be independent. <br/>
The chi-squared test of independence tests whether two categorical variables are independent. 
The test of independence is commonly used to determine whether variables like education, 
political views and other preferences vary based on demographic factors like gender, 
race and religion. Let's generate some fake voter polling data and perform a test of independence: <br/>


In [11]:
np.random.seed(10)

# Sample data randomly at fixed probabilities
voter_profession = np.random.choice(a= ["a","b","c","d","e"],
                              p = [0.05, 0.15 ,0.25, 0.05, 0.5],
                              size=1000)

# Sample data randomly at fixed probabilities
voter_party = np.random.choice(a= ["democrat","independent","republican"],
                              p = [0.4, 0.2, 0.4],
                              size=1000)

voters = pd.DataFrame({"profession":voter_profession, 
                       "party":voter_party})

voter_tab = pd.crosstab(voters.profession, voters.party, margins = True)

voter_tab.columns = ["democrat","independent","republican","row_totals"]

voter_tab.index = ["a","b","c","d","e","col_totals"]

observed = voter_tab.iloc[0:5,0:3]   # Get table without totals for later use
voter_tab

            democrat  independent  republican  row_totals
a                 21            7          32          60
b                 65           25          64         154
c                107           50          94         251
d                 15            8          15          38
e                189           96         212         497
col_totals       397          186         417        1000
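If profession and party were independent, the probability of landing in any cell would be the product of its row and column proportions, so the expected count for a cell would be row total × column total / grand total. The sketch below applies that to the table above with the counts hard-coded. It is an illustration of the idea rather than the only way to compute it.

```python
import numpy as np

# Observed counts copied from the table above (rows a-e, columns
# democrat / independent / republican), without the totals
observed = np.array([[21, 7, 32],
                     [65, 25, 64],
                     [107, 50, 94],
                     [15, 8, 15],
                     [189, 96, 212]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / n
print(expected.round(2))
```

For example, the expected number of democrats in profession a is 60 × 397 / 1000 = 23.82, compared with the 21 observed.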
