# The Chi-square goodness of fit test

#### WHAT IS IT?
The chi-square goodness-of-fit test is a hypothesis test used to determine whether a categorical variable is likely to follow a specified distribution. Because the p-value is computed from a chi-square reference distribution, the procedure is called a chi-square goodness-of-fit test.

#### HOW DOES IT WORK? 

The test statistic is based on how different the numbers of observations in the various categories are from the corresponding expected numbers when H0 is true.

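chi-square test statistic = sum over all categories of (observed count - expected count)^2 / expected count

The statistic is zero when every observed count equals its expected count, and it grows as the observed counts move away from the expected ones.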

Null hypothesis (H0): the variable follows the specified distribution. <br>
Alternative hypothesis: the variable does not follow the specified distribution.

#### WHEN CAN I USE THE TEST?

When you have counts for each category of a single categorical variable, you can use the test to check whether the variable follows a specified distribution.

The degrees of freedom equal the number of categories minus one.

#### WHY DOES IT WORK? 
The observed probability distribution is compared with the expected probability distribution.

More specifically, if the calculated chi-square test statistic is greater than the critical (table) value for the chosen significance level, the test rejects the null hypothesis and concludes that the observed and expected frequencies differ significantly. Otherwise, the test fails to reject the null hypothesis.

We use the chisquare() function from the scipy.stats module. <br> 
It returns the chi-square goodness-of-fit test statistic and the p-value.

In [4]:
import scipy.stats as stats
import numpy as np

The main parameters of the chisquare function are: <br> 
f_obs: an array of observed counts (obs_data below), and <br> 
f_exp: an array of expected counts (exp_data below), whose total should match the total of the observed counts.

In [37]:
obs_data = [8, 6, 10, 7, 8, 11, 9]
exp_data = [9, 8, 11, 8, 10, 7, 6]

Recall that every statistical test produces: <br>
(1) a test statistic (that indicates how closely your data match the null hypothesis), and <br>
(2) a corresponding p-value: the probability of obtaining a result at least as extreme as the observed one if the null hypothesis is true.

<br>
The p-value determines statistical significance: <br>
a low p-value (below the chosen significance level) is evidence against the null hypothesis, and <br>
a high p-value means the data are consistent with the null hypothesis.

In [38]:
# Chi-Square Goodness of Fit Test
statistic, p_value = stats.chisquare(obs_data, exp_data)

# print the values:
print('The chi-square test statistic is ' + str(statistic))
print('The p-value of this hypothesis test is ' + str(p_value))

The chi-square test statistic is 5.0127344877344875
The p-value of this hypothesis test is 0.542180861413329
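
To see where the test statistic comes from, it can be reproduced by hand from the observed and expected counts. The following is a minimal sketch in plain Python. The names `contributions` and `manual_statistic` are introduced here for illustration and are not part of scipy.

```python
# Observed and expected counts copied from the example above
obs_data = [8, 6, 10, 7, 8, 11, 9]
exp_data = [9, 8, 11, 8, 10, 7, 6]

# Contribution of each category: (observed - expected)^2 / expected
contributions = [(o - e) ** 2 / e for o, e in zip(obs_data, exp_data)]

# The chi-square test statistic is the sum of the contributions
manual_statistic = sum(contributions)

print('Manually computed chi-square statistic: ' + str(round(manual_statistic, 4)))
```

Rounded to four decimals, this gives 5.0127, which agrees with the statistic reported by chisquare().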


#### What can we conclude about the goodness of fit of the observed data?

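The p-value (about 0.54) is far above a 0.05 significance level, so we fail to reject the null hypothesis: the observed counts are consistent with the expected distribution.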

A chi-square critical value is the threshold for statistical significance: we reject the null hypothesis when the test statistic exceeds it. <br>
The degrees of freedom in a chi-square goodness-of-fit test are: <br>
df = (number of groups) - 1 <br>
The example above has 7 groups, so df = 6.

In [13]:
# find Chi-Square critical value
critical_value = stats.chi2.ppf(1-0.05, df=6)

print('The chi-square critical value is ' + str(critical_value))

The chi-square critical value is 12.591587243743977
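
The critical value can now be compared with the test statistic from the first example. The sketch below applies that decision rule to rounded copies of the two printed values rather than recomputing them. The variable `reject_h0` is a name chosen here for illustration.

```python
# Rounded copies of values printed earlier in this notebook
statistic = 5.0127        # chi-square test statistic for the 7-category example
critical_value = 12.5916  # critical value for significance level 0.05 and df = 6

# Decision rule: reject H0 only if the statistic exceeds the critical value
reject_h0 = statistic > critical_value

if reject_h0:
    print('Reject H0: the observed counts differ significantly from the expected counts.')
else:
    print('Fail to reject H0: the observed counts are consistent with the expected counts.')
```

Because the statistic is below the critical value, this rule does not reject the null hypothesis, which matches the large p-value obtained earlier.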


### Group Exercise

The article “Racial Stereotypes in Children’s Television Commercials” (J. of Adver. Res., 2008: 80–93) reported
the following frequencies with which ethnic characters appeared in recorded commercials that aired on Philadelphia television stations.

Ethnicity (Frequency):

African American  (57) <br>
Asian (11) <br>
Caucasian (330) <br>
Hispanic (6) <br>
 
The 2000 census proportions for these four ethnic groups
are .177, .032, .734, and .057, respectively. Does the data
suggest that the proportions in commercials are different
from the census proportions? Carry out a test of appropriate
hypotheses using a significance level of .01.

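Here we have a single categorical variable (ethnicity) with four categories, so the chi-square goodness-of-fit test applies. <br>
H_0: the proportions in commercials equal the census proportions. <br>
H_a: at least one proportion in commercials differs from its census proportion.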

In [49]:
obs_data = [57, 11, 330, 6]
sample_size = sum(obs_data)
proportions = [.177, .032, .734, .057]

# expected counts under the null hypothesis
exp_data = sample_size*np.array(proportions)

print('The expected counts under H_0 are ' + str(exp_data))

The expected counts under H_0 are [ 71.508  12.928 296.536  23.028]


In [60]:
# chi-squared goodness-of-fit test
statistic, p_value = stats.chisquare(obs_data, exp_data)

print('The chi-squared goodness-of-fit test gives a test statistic of ' + str(round(statistic,2)) 
      + ' and a p-value of ' + str(round(p_value,4)))

The chi-squared goodness-of-fit test gives a test statistic of 19.6 and a p-value of 0.0002


#### What's your conclusion?

The p-value (0.0002) is well below the significance level of .01. Hence, we reject H_0 and conclude that at least one of the ethnic-group proportions in commercials differs from the corresponding census proportion.
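
The test itself does not say which group drives the rejection. One informal way to explore this is to look at each category's contribution to the test statistic. The sketch below does this in plain Python. It is an exploratory aid rather than a formal follow-up test, and the names `categories`, `contributions` and `largest` are introduced here for illustration.

```python
# Data from the group exercise
categories = ['African American', 'Asian', 'Caucasian', 'Hispanic']
obs_data = [57, 11, 330, 6]
proportions = [.177, .032, .734, .057]

# Expected counts under H_0: sample size times the census proportion
sample_size = sum(obs_data)
exp_data = [sample_size * p for p in proportions]

# Per-category contribution: (observed - expected)^2 / expected
contributions = {
    name: (o - e) ** 2 / e
    for name, o, e in zip(categories, obs_data, exp_data)
}

for name, value in contributions.items():
    print(name + ': ' + str(round(value, 2)))

# Category with the largest contribution to the test statistic
largest = max(contributions, key=contributions.get)
print('Largest contribution: ' + largest)
```

The Hispanic category contributes the most: 6 characters were observed where about 23 would be expected under the census proportions.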