# Non-Parametric Statistics: Chi-Square (Instructional Worksheet)

## One Variable Chi-Square: Goodness of Fit

Goal: Compare categorical data with a theoretical distribution.

We want to test the following two hypotheses:

1. Null hypothesis: the data follow the theoretical distribution.
2. Alternative hypothesis: the data do not follow the theoretical distribution.

Suppose we have a yard full of flowers. We count them and find 705 red flowers and 224 white flowers. The expected proportions are 3/4 red flowers and 1/4 white flowers. Let's run a one-variable chi-square test to see whether the flowers in our yard follow the expected proportions.

In [1]:
import pandas as pd
from scipy import stats

In [2]:
flower = pd.DataFrame(
    {'color': ['red', 'white'],
     'freq' : [705, 224]})



In [3]:
flower

| | color | freq |
|---|---|---|
| 0 | red | 705 |
| 1 | white | 224 |


In [11]:
# This function needs expected frequencies in each group
# instead of percentage of expected frequencies.
stats.chisquare(flower.freq, 
                [.75*flower.freq.sum(),
                 .25*flower.freq.sum()])

Power_divergenceResult(statistic=0.3907427341227126, pvalue=0.5319092473839089)

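In the output, we have the chi-square statistic and the p-value. The degrees of freedom are not printed; for a goodness-of-fit test they equal the number of categories minus one, which is 1 here. Because the p-value (0.532) is greater than 0.05, we fail to reject the null hypothesis. In other words, the data are consistent with the expected proportions. Failing to reject the null hypothesis does not prove that the data follow the theoretical distribution; it only means we found no evidence against it.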
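The statistic and p-value returned by `stats.chisquare` can be reproduced by hand, which may help show where the numbers come from. The following is a minimal sketch, assuming the same counts and 3:1 proportions as above; the variable names (`observed`, `proportions`, `chi2_stat`, and so on) are chosen here for illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([705, 224])           # red and white counts from the worksheet
proportions = np.array([0.75, 0.25])      # expected 3:1 ratio
expected = proportions * observed.sum()   # expected counts for the observed total

# Pearson chi-square statistic: sum over categories of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for goodness of fit: number of categories minus one
dof = len(observed) - 1

# p-value: upper-tail probability of the chi-square distribution
p_value = stats.chi2.sf(chi2_stat, dof)

print(chi2_stat, dof, p_value)
```

The printed statistic and p-value should agree with the `Power_divergenceResult` shown above, up to floating-point rounding.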

## Two Variable Chi-Square: Test of Independence

Goal: Determine whether or not two or more variables are related (i.e., not independent).

We want to test the following two hypotheses:

1. The variables are independent.
2. The variables are not independent.

Using our flower example from above, suppose that in addition to flower color, we also know whether or not the plant survived for the season. We are interested in whether or not survival is related to flower color. Therefore, our null hypothesis is that there is no relationship between flower color and plant survival, and our alternative hypothesis is that there is a relationship between flower color and plant survival.

In [19]:
flower['surv'] = [448,103]
flower[['freq', 'surv']]

| | freq | surv |
|---|---|---|
| 0 | 705 | 448 |
| 1 | 224 | 103 |


In [20]:
# If you press Shift-Tab on the chi2_contingency function
# in Jupyter, you will see what its return values are
stats.chi2_contingency(flower[['freq', 'surv']])

(5.58918333753006,
 0.018071719777705584,
 1,
 array([[723.74121622, 429.25878378],
        [205.25878378, 121.74121622]]))
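The last element of the result is the table of expected frequencies under independence. As a rough sketch of where it comes from, each expected cell is (row total × column total) / grand total. The names `table` and `expected_manual` below are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

# Same 2x2 table that was passed to chi2_contingency above
table = np.array([[705, 448],
                  [224, 103]])

row_totals = table.sum(axis=1)   # one total per row
col_totals = table.sum(axis=0)   # one total per column
grand_total = table.sum()

# Expected count for each cell under independence
expected_manual = np.outer(row_totals, col_totals) / grand_total

# Unpack SciPy's result: statistic, p-value, degrees of freedom, expected table
chi2, p, dof, expected_scipy = stats.chi2_contingency(table)

print(np.allclose(expected_manual, expected_scipy))
```

Note that for a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the reported statistic is slightly smaller than the uncorrected Pearson value computed from these expected frequencies.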

In [24]:
flower

| | color | freq | surv |
|---|---|---|---|
| 0 | red | 705 | 448 |
| 1 | white | 224 | 103 |


Looking at the results, the function returns, in order, the chi-square statistic (5.589), the p-value (0.01807), the degrees of freedom (1), and the array of expected frequencies. With a p-value of 0.01807, which is below 0.05, we reject our null hypothesis in favor of the alternative hypothesis. We conclude that there is a relationship between flower color and plant survival.
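One caveat is worth checking. If `surv` counts the plants out of `freq` that survived (an assumption; the worksheet does not state this explicitly), then the `freq` and `surv` columns overlap, and a test of independence between color and survival would usually be run on a survived / did-not-survive table instead. Here is a minimal sketch under that assumption; the `died` column is a hypothetical name introduced for this example.

```python
import pandas as pd
from scipy import stats

flower = pd.DataFrame({'color': ['red', 'white'],
                       'freq': [705, 224],
                       'surv': [448, 103]})

# Assumption: 'surv' is a subset of 'freq', so the remainder did not survive
flower['died'] = flower['freq'] - flower['surv']

# Rows are flower colors; columns are the two mutually exclusive outcomes
chi2, p, dof, expected = stats.chi2_contingency(flower[['surv', 'died']])

print(chi2, p, dof)
```

With this layout every plant is counted in exactly one cell, which is what the test of independence assumes.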

## Problem Set

1. You are studying wild dogs and hyenas in a national park in Africa. You survey the park and encounter 27 packs of wild dogs and 43 packs of hyenas. Ten years earlier, the same survey encountered 16 packs of wild dogs and 44 packs of hyenas. Do the current survey's encounters match the encounters expected from the survey 10 years earlier, or is there a difference between the observed and expected values? How do you know? (Hint: `stats.chisquare` needs expected frequencies, not probabilities. Convert the earlier counts to proportions by dividing each by the earlier total, then multiply by the current total so that the expected frequencies sum to the observed total.)

2. Suppose that during our survey, we also collected data on the number of prey species encountered for both wild dogs and hyenas: 122 prey species for wild dogs and 201 prey species for hyenas. Is the number of wild dog and hyena encounters related to the number of prey species encountered (i.e., are the two variables not independent)? How do you know?

In [21]:
animal = pd.DataFrame({'animal': ['wildDog', 'hyena'],
                       'freq'  : [27, 43]})

stats.chisquare(animal.freq, 
                [16/60*animal.freq.sum(),
                 44/60*animal.freq.sum()])
# encounter rates in current survey are not the same as expected encounter rates from 10 years before
# p-value 0.0243 - reject null hypothesis

Power_divergenceResult(statistic=5.073051948051946, pvalue=0.02430056331812505)
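The scaling step in the solution above (dividing the earlier counts by their total and multiplying by the current total) can be wrapped in a small helper so the totals are not hard-coded. This is only a sketch; `expected_from_reference` is a hypothetical function written for this worksheet, not part of SciPy.

```python
import numpy as np
from scipy import stats

def expected_from_reference(observed, reference):
    """Scale reference counts so that they sum to the observed total."""
    observed = np.asarray(observed, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return reference / reference.sum() * observed.sum()

current = [27, 43]   # wild dog and hyena packs in the current survey
prior = [16, 44]     # wild dog and hyena packs 10 years earlier

expected = expected_from_reference(current, prior)
result = stats.chisquare(current, expected)

print(expected, result.statistic, result.pvalue)
```

The statistic and p-value should match the `Power_divergenceResult` printed above.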

In [22]:
animal['prey'] = [122, 201]
stats.chi2_contingency(animal[['freq', 'prey']])
# p-value 0.9914 - fail to reject null hypothesis
# no evidence that the two variables are dependent:
# the number of wild dog and hyena encounters does not appear to depend on the number of prey species

(0.00011486277111559088,
 0.9914489116752431,
 1,
 array([[ 26.5394402, 122.4605598],
        [ 43.4605598, 200.5394402]]))

In [23]:
animal

| | animal | freq | prey |
|---|---|---|---|
| 0 | wildDog | 27 | 122 |
| 1 | hyena | 43 | 201 |
