In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import chi2
from pylab import fill_between

# stats60 specific
figsize = (8,8)


## Is the roulette game fair?

* In an earlier example, we considered whether a roulette wheel was rigged, based on the results of betting on RED.
* We came up with a test of "rigged" or "not rigged" based on one bet on the wheel.
* But we could have looked at more than just this bet: we could have looked at all of the outcomes.
* Tests like this are called "goodness of fit" tests.

## A simpler example than roulette

* Suppose we have a die and we want to
decide whether it is fair or not.

* We roll the die 60 times. These are the outcomes:

In [None]:
%%capture
from ipy_table import make_table
data_table = make_table([('Value', 'Observed'), (1,4), (2,6), (3,17), (4,16), (5,8), (6,9), ('Total', 60)])

In [None]:
data_table


* Looks like the number of 3's and 4's might be a little high (though
we already decided we were going to form this test...)

## Comparing to expected

* We can add another column: the expected
counts **if the die is fair**.
* This is the null hypothesis: $H_0: \text{die is fair}$.

In [None]:
%%capture
data_table2 = make_table([('Value', 'Observed', 'Expected'), (1,4,10), (2,6,10), (3,17,10), (4,16,10), (5,8,10), (6,9,10), ('Total', 60,60)])

In [None]:
data_table2

## Comparison to expected

* If the die is unfair, the counts in some cells will tend to be higher or lower
than the expected counts under $H_0$.

* The raw differences (Observed - Expected) always sum to 0, so we use the square of the difference instead of the difference.
  

In [None]:
%%capture
data_table3 = make_table([('Value', 'Observed', 'Expected', '(Observed - Expected)^2'), 
                          (1,4,10, '(4-10)^2=36'), 
                          (2,6,10, '(6-10)^2=16'), 
                          (3,17,10, '(17-10)^2=49'), 
                          (4,16,10, '(16-10)^2=36'), 
                          (5,8,10, '(8-10)^2=4'), 
                          (6,9,10, '(9-10)^2=1'), ('Total', 60,60, '' )])

In [None]:
data_table3

## Pearson's $X^2$ statistic

* To get an overall test, we combine the rows into *Pearson’s $X^2$*
   $$
   \begin{aligned}
X^2 &= \sum_i \frac{\text{(observed[i]-expected[i])}^2}{\text{expected[i]}}\\
 &= \sum_i \frac{(O_i-E_i)^2}{E_i}\\
\end{aligned}
$$
* In our die example,
$$
\begin{aligned}
X^2 &= \frac{36}{10} + \frac{16}{10} + \frac{49}{10} + \frac{36}{10} + \frac{4}{10} + \frac{1}{10} \\
&= \frac{142}{10} \\
&= 14.2
\end{aligned}
$$

* Is this big, or could the statistic be this big by chance?
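The arithmetic for the die example can be checked numerically. This is a minimal sketch using the observed counts from the table; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# Observed counts for faces 1..6, taken from the die table.
observed = np.array([4, 6, 17, 16, 8, 9])

# Under H0 (fair die) every face is expected N/6 times.
n_rolls = observed.sum()
expected = np.full(observed.shape, n_rolls / 6)

# Pearson's X^2: sum of (O - E)^2 / E over the six faces.
x2 = ((observed - expected) ** 2 / expected).sum()
print(x2)  # about 14.2
```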

In [None]:
%%capture

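def tail_chi2(observed, df, upper_lim=None):
    """Plot the chi-squared density with `df` degrees of freedom, shading the
    5% rejection region (green) and the tail beyond `observed` (red)."""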
    if upper_lim is None:
        upper_lim = 10*df

    X = np.linspace(1.e-10, upper_lim, 201)
    D = chi2.pdf(X, df)
    fig = plt.figure(figsize=(6,6))
    ax = fig.gca()
    ax.plot(X, D, 'k', linewidth=5)
    cutoff = chi2.ppf(0.95, df)
    x = np.linspace(cutoff, upper_lim, 501)
    ax.fill_between(x, 0, chi2.pdf(x, df), hatch='\\', facecolor='green', label='5% cutoff',
                    alpha=0.5)
    x = np.linspace(observed, upper_lim, 501)
    ax.fill_between(x, 0, chi2.pdf(x, df), hatch='\\', facecolor='red', label='observed',
                    alpha=0.5)
    ax.set_xlabel(r'$\chi^2$ units', fontsize=15)
    ax.set_ylabel(r'Percent per $\chi^2$ units', fontsize=15)
    ax.set_xlim([0, upper_lim])
    ax.legend(loc='upper right')
    return fig, ax

die_fig, die_ax = tail_chi2(14.2, 5, upper_lim=20)

### What are the chances?

In [None]:
die_fig


This is the $\chi^2_5$ probability histogram. The <font color='red'> red area </font>, to the right of the observed value 14.2, is 1.4%.
The <font color="green"> green area </font> is the 5% rejection rule for $\chi^2_5$.

## What was that last histogram?

* It is a new kind of probability histogram, called a $\chi^2$ probability histogram or curve.
* The $\chi^2$ probability histogram or curve also has *degrees of freedom*
   associated with it.
* To figure out the degrees of freedom, we need a box.

## Degrees of freedom

* Our box is [1,2,3,4,5,6].
* Our goal is to see if our observed data fit the box. Our data is supposed to be 60 draws with replacement from our box.
* There are 6 different objects in the box and we have an observation for each object. Maybe the degrees of freedom is 6?
* Not quite, it is 6-1=5. Why the -1? Because when we roll 60 times, the observed counts must sum to 60 – there are only 5 free variables.

## The $\chi^2$ curve

* Even if the die is fair, the $X^2$ statistic will have some variability in it.
* The $\chi^2_5$ probability histogram describes this variability under $${H_0: \text{the die is fair}}.$$
* The 1.4% is the $P$-value, or the observed significance level.
* It is the probability we would observe an $X^2$ statistic at least as large as our observed value
   if ${H_0}$ is true.
* **It is not the probability $H_0$ is true.**
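The 1.4% can be reproduced in a few ways. The sketch below computes the tail area of the $\chi^2_5$ curve, calls `scipy.stats.chisquare` (which defaults to equal expected counts), and, as a rough sanity check, simulates many sets of 60 fair rolls. The seed and the number of simulations are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([4, 6, 17, 16, 8, 9])

# Upper-tail area of the chi-squared curve with 5 degrees of freedom.
p_curve = chi2.sf(14.2, 5)

# chisquare assumes equal expected counts and df = k - 1 by default.
stat, p_scipy = chisquare(observed)

# Monte Carlo check under H0: roll a fair die 60 times, many times over.
rng = np.random.default_rng(0)  # arbitrary seed
counts = rng.multinomial(60, [1 / 6] * 6, size=100_000)

# Sum of squared deviations from the expected count of 10.
# X^2 >= 14.2 is the same as this sum being >= 142 (integer comparison).
sum_sq = ((counts - 10) ** 2).sum(axis=1)
p_sim = (sum_sq >= 142).mean()

print(p_curve, p_scipy, p_sim)
```

The simulated fraction will not match the curve exactly, since the $\chi^2$ curve is only an approximation to the true probability histogram of $X^2$.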
  

### $\chi^2$ curves

In [None]:
%%capture
df5_fig, df5_ax = tail_chi2(chi2.ppf(0.95,5), 5, upper_lim=20)
df5_ax.set_title(r'5%% rejection rule for $\chi^2_5$: %0.1f' % chi2.ppf(0.95,5),
                  fontsize=15)

In [None]:
df5_fig

In [None]:
%%capture
df10_fig, df10_ax = tail_chi2(chi2.ppf(0.95,10), 10, upper_lim=30)
df10_ax.set_title(r'5%% rejection rule for $\chi^2_{10}: %0.1f$' % chi2.ppf(0.95,10),
                  fontsize=15)

In [None]:
df10_fig

## Using the $\chi^2$ test

* A general rule of thumb: every expected count should be 5 or more for the $\chi^2$ curve to approximate the probability histogram of the $X^2$ statistic.
* The approximation would not be reliable for 100 draws from the box below, where four of the five expected counts are only 1:

In [None]:
box = [0,1,2,3] + [4]*96
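The rule of thumb can be checked directly for this box. This is a small sketch; the names `n_draws`, `expected`, and `rule_ok` are introduced here for illustration.

```python
from collections import Counter

box = [0, 1, 2, 3] + [4] * 96
n_draws = 100

# Expected count of each value = n_draws * (fraction of tickets with that value).
expected = {value: n_draws * count / len(box)
            for value, count in Counter(box).items()}

# Rule of thumb: every expected count should be at least 5.
rule_ok = all(e >= 5 for e in expected.values())
print(expected, rule_ok)
```

Here the values 0 through 3 each have an expected count of 1, so the check fails.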

## Difference between $\chi^2$ test and $z$ test

* The $z$ test is a statement about the average of the box.
* The $\chi^2$ test is a test of whether the observed data follow the box model.
* If there are only two values in the box, then the $\chi^2$ test is identical to the (two-sided) $z$ test.

## Example

* Suppose the box is `[A,A,B,B,B]`.
* In 100 draws with replacement, we observe 46 `A`’s (and 54 `B`’s).
* The $X^2$ test statistic is 
$$X^2 = \frac{(46-40)^2}{40} + \frac{(54-60)^2}{60} = 1.5$$
* The $z$ statistic for testing $H_0:$ the expected proportion of `A`’s = 0.4 is
$$z = \frac{0.46-0.40}{\sqrt{0.4 \times 0.6} / \sqrt{100}} \approx 1.225$$
* Finally, $z^2 \approx (1.225)^2 \approx 1.5$. This is not a coincidence...
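The agreement between $X^2$ and $z^2$ in this example can be verified numerically. The sketch below recomputes both statistics from the numbers given above; the variable names are illustrative.

```python
import math

n = 100
p_a = 2 / 5                      # fraction of A tickets in the box [A,A,B,B,B]
obs_a, obs_b = 46, 54
exp_a, exp_b = n * p_a, n * (1 - p_a)

# Pearson's X^2 with two categories.
x2 = (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b

# z statistic for the proportion of A's.
se = math.sqrt(p_a * (1 - p_a)) / math.sqrt(n)
z = (obs_a / n - p_a) / se

print(x2, z, z ** 2)
```

Since $O_B - E_B = -(O_A - E_A)$, the two terms of $X^2$ combine to $(O_A - np)^2 / (np(1-p))$, which is exactly $z^2$.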

In [None]:
%%capture
df1_fig, df1_ax = tail_chi2(chi2.ppf(0.95, 1), 1, upper_lim=5)
df1_ax.set_title(r'5%% rejection rule for $\chi^2_{1}: %0.2f$' % chi2.ppf(0.95,1),
                  fontsize=15)
df1_ax.set_ylim([0,.3])

In [None]:
df1_fig

In [None]:
1.96**2, chi2.ppf(0.95,1)

## Structure of a $\chi^2$ test

### Basic Data 

- The number of draws, $N$, and the resulting draws.

- Data: 

Value | Observed Count
----|----
  1 | 4
  2 | 6
  3 | 17
  4 | 16
  5 | 8
  6 | 9
  Total | 60 (=$N$)

- Box: [1,2,3,4,5,6]

- Degrees of freedom: the number of "free parameters." Call this number
`df`. In our example, `df` is 5.

- $P$-value: Computed using the $\chi^2_{df}$ curve. This was about 1.4% in our example.

## Testing independence: another $\chi^2$ test

### Handedness and gender

* Data example from book:
 
Handedness   | Male | Female
-------------|------|-------
Right        | 934  | 1070
Left         | 113  | 92
Ambidextrous | 20   | 8

* Is handedness related to gender (or not)?

## Marginal totals

* There are both "row totals" and "column totals"; these are called *marginals*.
   

Handedness   | Male | Female | Total(Handedness)
-------------|------|--------|----------
Right        | 934  | 1070   | 2004
Left         | 113  | 92     | 205
Ambidextrous | 20   | 8      |   28
Total(Gender)| 1067 | 1170   | 2237


## Test of independence

* The null hypothesis is **$H_0$: handedness is independent of gender.**
* This means that the probability a person (drawn at random) from the population is, say, a left-handed male, is the product of two probabilities: the probability a person is left-handed and the probability a person is male.
* Or, 
$$P(\text{left-handed and male}) = P(\text{left-handed}) \times P(\text{male})$$

## Expected value under $H_0$

* Having specified the box, we can express ${H_0}$ via some equalities about the tickets in the box: $$\begin{aligned}
       p_{L,M} &= p_L \times p_M \\
       p_{R,M} &= p_R \times p_M \\
       p_{A,M} &= p_A \times p_M \\
       p_{L,F} &= p_L \times p_F \\
       p_{R,F} &= p_R \times p_F \\
       p_{A,F} &= p_A \times p_F \\
       \end{aligned}$$
* Some of these equalities are redundant. This affects the degrees of freedom.

## Idea behind the test of independence

* If $H_0$ is true, the observed counts should follow a similar structure: the proportion of males who are left-handed should be close to the proportion of females who are left-handed, etc.
* This is our model for the Expected
   or $E$ part of the $X^2$ statistic, which we use to construct a table of expected counts.
* The sample proportion of men is 48%, and the sample proportion of left-handed people is 9.2%.
* Under $H_0$, the independence model, we estimate that, in a sample of size 2237, we would see $2237 \times 0.48 \times 0.092 \approx 98$ left-handed males.

## Expected counts under $H_0$

* Continuing for all 6 cases yields a table of "Expected Counts"




Handedness   | Male | Female | Total(Handedness)
-------------|------|--------|----------
Right        | 956  | 1048   | 2004
Left         | 98   | 107    | 205
Ambidextrous | 13   | 15     |   28
Total(Gender)| 1067 | 1170   | 2237
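The expected counts can be reproduced from the marginal totals alone. The following sketch uses `np.outer` to form (row total × column total) / N for every cell; the array layout (rows = handedness, columns = male, female) is a choice made for this illustration.

```python
import numpy as np

# Rows: Right, Left, Ambidextrous. Columns: Male, Female.
observed = np.array([[934, 1070],
                     [113,   92],
                     [ 20,    8]])

row_totals = observed.sum(axis=1)   # handedness marginals
col_totals = observed.sum(axis=0)   # gender marginals
n = observed.sum()

# Under independence: E[i, j] = N * (row_i / N) * (col_j / N).
expected = np.outer(row_totals, col_totals) / n
print(np.round(expected).astype(int))
```

Rounding to whole numbers gives the table above.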



## Computing the $X^2$ statistic

* The $X^2$ statistic is computed in exactly the same way:
$$\begin{aligned}
           X^2 &= \frac{(934-956)^2}{956} + \frac{(1070-1048)^2}{1048} +
            \frac{(113-98)^2}{98} \\
           & \qquad +  \frac{(92-107)^2}{107} + \frac{(20-13)^2}{13} + \frac{(8-15)^2}{15}  \\
           &\approx 12
         \end{aligned}$$
* In symbols, $X^2 = \sum_{i=1}^3 \sum_{j=1}^2 \frac{(O_{ij}-E_{ij})^2}{E_{ij}}$
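As a check on the hand calculation, the sketch below evaluates the double sum twice: once with the unrounded expected counts and once with the rounded counts from the table. Both come out close to 12; the small gap between them is only rounding. The variable names are illustrative.

```python
import numpy as np

observed = np.array([[934, 1070],
                     [113,   92],
                     [ 20,    8]])

# Unrounded expected counts from the marginals.
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Rounded expected counts, as printed in the table of expected counts.
expected_rounded = np.array([[956, 1048],
                             [ 98,  107],
                             [ 13,   15]])

x2 = ((observed - expected) ** 2 / expected).sum()
x2_rounded = ((observed - expected_rounded) ** 2 / expected_rounded).sum()
print(x2, x2_rounded)
```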

### Degrees of freedom and $P$-value

* This leaves the last two parts of the $\chi^2$ test: degrees of freedom and the $P$-value.
* The degrees of freedom for this test are actually only 2. This can be seen in the table of differences (Observed - Expected):


Handedness   | Male | Female | Total(Handedness)
-------------|------|--------|----------
Right        | -22  | 22     | 0
Left         | 15   | -15    | 0
Ambidextrous | 7    | -7     | 0
Total(Gender)| 0    | 0      | 0


* By construction all the marginal totals of the difference table are 0. So, we can only set two of the values freely.
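To see why only two entries are free, the sketch below rebuilds the whole 3 × 2 difference table from just two chosen values, using nothing but the requirement that every row and column sums to 0. The helper `fill_difference_table` is hypothetical, written only for this illustration.

```python
import numpy as np

def fill_difference_table(d_right_male, d_left_male):
    """Complete a 3x2 difference table whose row and column sums are all 0."""
    # The Male column must sum to 0, which fixes the Ambidextrous entry.
    male = np.array([d_right_male, d_left_male,
                     -(d_right_male + d_left_male)])
    # Each row must sum to 0, which fixes the entire Female column.
    return np.column_stack([male, -male])

table = fill_difference_table(-22, 15)
print(table)
```

Starting from the two values -22 and 15, the remaining four entries are forced.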

### $\chi^2_2$ probability histogram

In [None]:
%%capture
df2_fig, df2_ax = tail_chi2(chi2.ppf(0.95, 2), 2, upper_lim=8)
df2_ax.set_title(r'5%% rejection rule for $\chi^2_{2}: %0.1f$' % chi2.ppf(0.95,2),
                  fontsize=15)
df2_ax.set_ylim([0,.3])

In [None]:
df2_fig

The observed $X^2 \approx 12$ exceeds the 5% cutoff of about 6.0 for $\chi^2_2$. At level 5%, we reject the independence null hypothesis and conclude
that handedness is related to gender (in this population).
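The same conclusion can be reached with SciPy's built-in test of independence. This is a sketch using `scipy.stats.chi2_contingency`; passing `correction=False` is a choice made here to keep the plain Pearson statistic.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[934, 1070],
                     [113,   92],
                     [ 20,    8]])

# Returns the statistic, the P-value, the degrees of freedom,
# and the table of expected counts.
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# 5% cutoff for the chi-squared curve with `dof` degrees of freedom.
cutoff = chi2.ppf(0.95, dof)
print(stat, dof, p_value, cutoff)
```

The statistic is well above the cutoff, and the $P$-value is far below 5%.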

## Tests of independence in two-way tables

- We could have looked at a different table. The table
may have more than two columns or three rows.

- For example, instead of gender
we might have looked at sexual orientation (even though its relation to handedness may not be an
interesting question). This could add more columns to our table.

- In general, we might have a $R \times C$ table with $R$ categories
in the rows and $C$ categories in the columns.

- The calculation of the $X^2$ statistic is identical.

- The degrees of freedom is $(R-1) \times (C-1)$.

## Details of the calculations

- Let $O[i,j]$ be the observed counts and
$$
N = \sum_{i=1}^R \sum_{j=1}^C O[i,j]
$$
- Compute marginal proportions
$$
\begin{aligned}
\pi_R[i] &= \frac{\sum_{j=1}^C O[i,j]}{N} \\
\pi_C[j] &= \frac{\sum_{i=1}^R O[i,j]}{N} \\
\end{aligned}
$$
- Compute expected values
$$
E[i,j] = N \pi_R[i] \pi_C[j].
$$
- The statistic:
$$
X^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{\left(O[i,j] - E[i,j]\right)^2}{E[i,j]}.
$$
- Compute chances with the $\chi^2_{(R-1) \times (C-1)}$ curve.
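The steps above translate almost line for line into code. The function below is a minimal sketch for a general $R \times C$ table; `independence_test` is a name chosen for this illustration, not a library function.

```python
import numpy as np
from scipy.stats import chi2

def independence_test(O):
    """Return (X2, df, P-value) for an R x C table of observed counts."""
    O = np.asarray(O, dtype=float)
    N = O.sum()

    # Marginal proportions for rows and columns.
    pi_R = O.sum(axis=1) / N
    pi_C = O.sum(axis=0) / N

    # Expected counts under independence: E[i, j] = N * pi_R[i] * pi_C[j].
    E = N * np.outer(pi_R, pi_C)

    # Pearson's X^2 and its degrees of freedom.
    X2 = ((O - E) ** 2 / E).sum()
    R, C = O.shape
    df = (R - 1) * (C - 1)

    # Upper-tail area under the chi-squared curve with df degrees of freedom.
    return X2, df, chi2.sf(X2, df)

# Usage with the handedness table (rows: Right, Left, Ambidextrous).
X2, df, p = independence_test([[934, 1070], [113, 92], [20, 8]])
print(X2, df, p)
```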