In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.mlab import csv2rec
import numpy as np
from numpy import mean, sqrt, std
import matplotlib
from scipy.stats import norm as ndist
from ipy_table import make_table

# stats60 specific
from code.utils import (sample_density, 
                        probability_histogram)
from code.week1 import (stylized_density, 
                        standardize_right,
                        standardize_left,
                        standardize_interval,
                        CAdensity,
                        normal_curve,
                        SD_rule_of_thumb_normal,
                        percentile_figure)
from code.week7 import studentT_curve
from code.probability import (Sample,
                              BoxModel, 
                              SampleMean, 
                              SampleSD,
                              Binomial,
                              Uniform)
figsize = (8,8)

## Presidential approval ratings

* If you ever watch CNN / Fox News / MSNBC you don't have
to wait too long until you hear reports about opinion polls.

* [Rasmussen](http://www.rasmussenreports.com/public_content/politics/obama_administration/daily_presidential_tracking_poll) tracks presidential approval rating.
 
* So does [Gallup](http://www.gallup.com/poll/113980/gallup-daily-obama-job-approval.aspx)

* Both polls are trying to measure the same thing.
* Why don’t they agree?
* Polls are sometimes wrong: [Truman / Dewey 1948](http://www.chicagotribune.com/news/politics/chi-chicagodays-deweydefeats-story,0,6484067.story)
    - Part of the error in this poll was due to the sampling: *quota sampling*.
    - A random sample might have reduced this bias.

## Model for percentages

* An opinion poll to estimate a percentage is based on a sample of size $N$. 

* We assume this is a **simple random sample**: a sample of size $N$ without replacement.

* We will draw from boxes with `B`=blue and `R`=red marbles. Our goal
is to estimate the percentage of blue marbles in the box.

* The sample is from a box with one ball per population member, each having a 0-1 label with `B` corresponding to 1 and `R` corresponding to 0.

* The estimated percentage is 

       estimated percentage of B = (sum of N draws from box) / N
  
 
  
  

## Estimating a percentage     

- In our boxes, the proportion of blue to red is fixed: there
     are 50% blue and 50% red. Our goal is to estimate
     the proportion of blue.
- When we sample, we don't recover exactly 50%. Our observed
     proportion is a *chance process* (i.e. a
     random variable).
     
- We also call this random variable a *statistic*. A statistic is a 
chance process (random variable) computed on a sample.
     
- Here is a (hopefully familiar) model:

      estimated percentage of B = 50% + chance error

  
- **Note:** If the poll is biased, a better model is

      estimated percentage of B = 50% + bias + chance error

## Accuracy of percentages


- How big is the chance error?
   
- Since the observed proportion is random, it has its own SE.
- The rule for computing the SE of a percentage
     for a simple random sample is related to SE of drawing
     from a box of 0's and 1's.
- The standard error after drawing `N=1000` times with replacement
from a box with p=50% `B` marbles
would be:

      SE(estimated percentage of B) = sqrt(p * (1 - p)) / sqrt(N)
                                    = sqrt(1/2 * (1 - 1/2)) / 
                                        sqrt(1000)

- The standard error after drawing `N=1000` times with replacement
from a box with p=57% `B` marbles
would be:

      SE(estimated percentage of B) = sqrt(p * (1 - p)) / sqrt(N)
                                    = sqrt(0.57 * (1 - 0.57)) / 
                                         sqrt(1000)


- If sampling with replacement, the SE does not depend on how many tickets are in the box.

- When sampling without replacement, this SE is approximate (but quite accurate
if the population is large enough).
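
As a quick numerical check of the two calculations above, here is a minimal sketch in plain Python. The helper name `se_percentage` is ours for illustration and is not part of the course code.

```python
import math

def se_percentage(p, n):
    """SE of the estimated proportion of B after n draws with replacement."""
    return math.sqrt(p * (1 - p)) / math.sqrt(n)

se_balanced = se_percentage(0.50, 1000)  # box with 50% B
se_57 = se_percentage(0.57, 1000)        # box with 57% B

print(f"p = 50%: SE = {100 * se_balanced:.2f}%")
print(f"p = 57%: SE = {100 * se_57:.2f}%")
```

Both values round to about 1.6%, so the SE changes very little when p moves from 50% to 57%.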


## Correction factor for SE

* To get the actual SE when we sample without replacement, we should multiply by a correction factor.
* If there are $N_{pop}$ people in the population and we sample $N_{sample}$
then
$$
\begin{aligned}
\text{correction factor} &= \sqrt{\frac{N_{pop}-N_{sample}}{N_{pop}-1}} \\
&= \sqrt{\frac{10000-1000}{9999}} \\
\end{aligned}$$
* For example, in our largest boxes, $N_{pop}=10000, N_{sample}=1000$
the correction factor is 0.95.
* An organization like Gallup samples from a MUCH BIGGER population, so this
correction factor is virtually 1.

          SE(drawing WITHOUT replacement) = 
               correction factor * (SE drawing WITH replacement)
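
The correction factor can be evaluated directly. The sketch below uses the box sizes from this section; the population of 25 million is a made-up figure standing in for "a MUCH BIGGER population", and `correction_factor` is a hypothetical helper name.

```python
import math

def correction_factor(n_pop, n_sample):
    """Correction factor for drawing n_sample tickets WITHOUT replacement."""
    return math.sqrt((n_pop - n_sample) / (n_pop - 1))

cf_box = correction_factor(10_000, 1_000)      # largest box in these notes
cf_big = correction_factor(25_000_000, 1_000)  # hypothetical large population

se_with = math.sqrt(0.5 * 0.5) / math.sqrt(1_000)  # SE drawing WITH replacement
se_without = cf_box * se_with                      # SE drawing WITHOUT replacement

print(f"correction factor (box)        = {cf_box:.3f}")
print(f"correction factor (large pop.) = {cf_big:.5f}")
print(f"SE without replacement (box)   = {100 * se_without:.2f}%")
```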

## Example

- Suppose  57% of voting age Californians approve of Governor Brown’s job performance. 

- If we sample 100 voting age Californians at random, we expect the percentage who approve of Brown’s performance to be $\bbox[5px,border:2px solid red]{57\%}$, give or take $\bbox[5px,border:2px solid red]{\sqrt{0.57 ( 1 - 0.57) / 100} * 100\%}$.

## Example

- **Answer** The expected percentage or proportion is 57%. The SE is (ignoring the correction factor) 

$$\text{ SE(estimated percentage)} = 
\sqrt{0.57 * 0.43} / \sqrt{100} \approx 5\%$$  
     

* We expect the percentage to be 57% give or take 5%.

## Example (continued)

- What if we had sampled 1000 voting age Californians?
- The expected proportion is the same. The SE is now (ignoring the correction factor) 

$$\text{ SE(estimated percentage)} = 
\sqrt{0.57 * 0.43} / \sqrt{1000} \approx 1.6\%$$

- A sample size of 1000 is quite commonly used because in a balanced box, the
SE is roughly 1.6% so the 2 SD rule is $\pm 3.2\%$.
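
One way to see that these SE formulas are sensible is to simulate many polls and look at how much the estimated proportion varies. The sketch below is only an illustration: it draws with replacement (so it ignores the correction factor), and the number of simulated polls and the seed are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible
p = 0.57        # assumed true approval rate, as in the example

def simulated_se(n, n_polls=2000):
    """SD of the estimated proportion over many simulated polls of size n."""
    estimates = []
    for _ in range(n_polls):
        approvals = sum(random.random() < p for _ in range(n))
        estimates.append(approvals / n)
    return statistics.pstdev(estimates)

results = {}
for n in (100, 1000):
    formula = math.sqrt(p * (1 - p) / n)
    results[n] = (formula, simulated_se(n))
    print(f"n = {n}: formula SE = {100 * results[n][0]:.2f}%, "
          f"simulated SE = {100 * results[n][1]:.2f}%")
```

The simulated values should land close to 5% and 1.6%, up to simulation noise.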

## A more realistic picture

* In practice, when a poll is carried out, we don’t know how many `B` or `R`
there are in the population.

* Taking a sample of size 500, say, gives us information on 500 of this population.
* Our goal, as statisticians, is to give the politicians some idea of the *true*
   proportion of `B` vs. `R` 
   in the box

* This is our first true *statistics*
   problem …

### Box we will poll …

In [None]:
%%capture
brown_approval = Sample(5700, 4300, 500)
brown_approval.alpha = 0.1
brown_approval.ptsize = 10
brown_approval.draw(color={'R':'gray','B':'gray'})

In [None]:
brown_approval.figure

### Poll sample

In [None]:
prop = brown_approval.trial(bgcolor={'R':'gray','B':'gray'},
                            color={'R':'red','B':'blue'})
brown_approval.figure

## Can we use the normal approximation?

- Applying the rule $np > 10$ and $n(1-p)>10$ to the sample proportion
suggests we should have at least 10 votes for Brown and 10 votes against Brown in our poll.
- This is not quite the same rule because we don't know the true $p$, we only
observe $\hat{p}$...
- Nevertheless, this is roughly how points were awarded on assignments.

## Estimating SE when proportions unknown


- Given a poll, we can work out the *observed*
   proportion of 1’s. Call this 
   $$\bbox[5px,border:2px solid orange]{\widehat{p}} = \text{observed proportion} = \frac{\text{$\#$ B's in sample}}{\text{$\#$ in sample}}$$

- If we knew the true proportion of Brown's supporters, say, $p$ we would use
$$
 \text{SE}(\bbox[5px,border:2px solid orange]{\widehat{p}}) \approx  \sqrt{\bbox[5px,border:2px solid blue]{p} \times (1 - \bbox[5px,border:2px solid blue]{p})} / \sqrt{\text{$\#$ in sample}}.
$$

- We do not know $p$. BUT we can **estimate** the SE of observed proportion of B’s 
$$\begin{aligned}
     \text{SE}(\bbox[5px,border:2px solid orange]{\widehat{p}})& \approx  \sqrt{\bbox[5px,border:2px solid orange]{\widehat{p}} \times (1 - \bbox[5px,border:2px solid orange]{ \widehat{p}})} / \sqrt{\text{$\#$ in sample}} \\
         \end{aligned}$$

- The quantity $\sqrt{\widehat{p} \times (1 - \widehat{p})}$ is an estimate of the SD of the box. 

     - It depends on the poll we used to estimate $\bbox[5px,border:2px solid blue]{p}$. 
     - It is not a fixed number: it changes from poll to poll.


- We call this the **bootstrap** estimate of the SD of the box.
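
To make the plug-in idea concrete, the sketch below uses a hypothetical poll (280 `B` out of 500, numbers invented for illustration) and checks that $\sqrt{\widehat{p}(1-\widehat{p})}$ is exactly the SD of the 0-1 list of sampled tickets.

```python
import math
import statistics

n_blue, n_sample = 280, 500  # hypothetical poll result
sample = [1] * n_blue + [0] * (n_sample - n_blue)  # 1 = B, 0 = R

p_hat = sum(sample) / n_sample
sd_box_estimate = math.sqrt(p_hat * (1 - p_hat))  # estimate of SD(box)
sd_of_sample = statistics.pstdev(sample)          # SD of the 0-1 list itself
se_estimate = sd_box_estimate / math.sqrt(n_sample)

print(f"p_hat = {p_hat:.2f}")
print(f"sqrt(p_hat * (1 - p_hat)) = {sd_box_estimate:.4f}")
print(f"SD of sampled 0-1 list    = {sd_of_sample:.4f}")
print(f"estimated SE of p_hat     = {100 * se_estimate:.2f}%")
```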



## Our first confidence interval

- Since we don't know $\bbox[2px,border:2px solid blue]{p}$, we can't compute 
$SE(\widehat{p})$.

- We use the **bootstrap estimate** instead.

- We call the interval:
$$
\bbox[2px,border:2px solid red]{
\left[\widehat{p} - \frac{2\sqrt{\widehat{p}(1-\widehat{p})}}{\sqrt{1000}},                                                                                            
    \widehat{p} + \frac{2\sqrt{\widehat{p}(1-\widehat{p})}}{\sqrt{1000}}\right] }
$$
an (approximate) 95% confidence interval for the true proportion  $\bbox[2px,border:2px solid blue]{p}$.
 

## Example

- Find a 95% confidence interval for Governor Brown’s approval rating given 549 out of 1000 voting age Californians polled approved of his job performance.


         We already saw the estimated SE of the proportion 
         was approximately 1.6%.
         
         A 95% confidence interval is therefore: 
         
 $$\bbox[5px,border:2px solid red]{[54.9-2 \times 1.6, 54.9+2\times 1.6] = [51.7,58.1]}$$
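
The same interval can be computed without rounding the SE first. This is a minimal sketch using only the numbers in the example; the variable names are ours.

```python
import math

approve, n = 549, 1000
p_hat = approve / n
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)  # bootstrap estimate of the SE

lower = p_hat - 2 * se_hat
upper = p_hat + 2 * se_hat
print(f"estimated SE = {100 * se_hat:.2f}%")
print(f"approximate 95% CI = [{100 * lower:.1f}%, {100 * upper:.1f}%]")
```

Because the SE is not rounded to 1.6% here, the endpoints differ from the hand calculation by less than a tenth of a percentage point.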
 

### An illustration of confidence intervals

In [None]:
%%capture
from random import sample as random_sample

def interval(approval, nsample=1000):
    prop = np.array(random_sample(approval, nsample)).mean()
    SE = np.sqrt(prop * (1 - prop) / nsample)
    return 100 * (prop-2*SE), 100 * (prop+2*SE)

def confidence_intervals(num_intervals=100, true_proportion=.57,
                         npop=100000,
                         nsample=1000):
    blue = int(npop*true_proportion)
    approval = [0] * (npop-blue) + [1] * blue

    confidence_intervals = plt.figure(figsize=figsize)
    ax = confidence_intervals.gca()
    intervals = []
    missed = 0
    for i in range(num_intervals):
        L, U = interval(approval, nsample=nsample)
        in_interval = (L <= 100*true_proportion) * (U >= 100*true_proportion)
        if in_interval:
            ax.plot([L,U], [i,i], color='gray', linewidth=2)
        else:
            missed += 1
            ax.plot([L,U], [i,i], color='red', linewidth=5)
        

    ax.plot([100*true_proportion]*2,[0,num_intervals], 'k--', linewidth=4)
    ax.set_yticks([])
    ax.set_ylabel('Different polls of size 1000', rotation='vertical', fontsize=15)
    ax.set_xlabel('Percent who support Governor Brown', fontsize=15)
    ax.set_title("# covering %0.2d%%=%d, # not covering %0.2d%%=%d" % (round(100*true_proportion),
                                                                     num_intervals-missed,
                                                                     round(100*true_proportion),
                                                                     missed))

In [None]:
confidence_intervals()

## Average of draws

- The **average of n draws** (or sample average) is

         average of n draws = (sum of n draws) / n
         
- The expected value of the average of n draws is

         expected(average of n draws) = average(box)
         
- The SE for the average of n draws is

         SE(average of n draws) = SD(box) / sqrt(n)
- Just like the SE of sample proportion – it decreases as # draws increase.

## Bootstrap estimate


- Given a sample ${[X_1, \dots, X_n]}$ of $n$ draws, we can compute the *sample mean*.
   Call this
   $$
   { \bar{X}}  = \frac{{ \text{ sum of draws}}}{{ \text{$\#$ of draws}}}  = \frac{{ \sum_{i=1}^n X_i}}{{ n}}
   $$
   - We know
   $$
   \begin{aligned}
   { \text{SE}(\bar{X})} = \frac{{\text{SD( box)}}}{{ \sqrt{\text{$\#$ of draws}}}}
   \end{aligned}
   $$

   - Unfortunately, we don't know ${\text{SD( box)}}$.

   - Use **plug-in / bootstrap** estimate
   $$
   {\widehat{ \text{SE}(\bar{X})}} = \frac{{\text{SD($[X_1, \dots, X_n]$)}}}{{\sqrt{ \text{$\#$ of draws}}}} =  \frac{{ \sqrt{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2}}}{ \sqrt{n}}
   $$
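
The plug-in formula can be computed in a few lines. The draws below are invented for illustration; `statistics.pstdev` divides by $n$, matching the SD in the formula.

```python
import math
import statistics

draws = [3.1, 4.7, 2.2, 5.0, 3.9, 4.1, 2.8, 3.6]  # hypothetical draws
n = len(draws)

x_bar = statistics.mean(draws)
sd_sample = statistics.pstdev(draws)  # sqrt of (1/n) * sum of squared deviations
se_hat = sd_sample / math.sqrt(n)     # plug-in / bootstrap estimate of SE

print(f"sample mean = {x_bar:.3f}")
print(f"SD of draws = {sd_sample:.3f}")
print(f"estimated SE of the mean = {se_hat:.3f}")
```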




## Using our bootstrap estimate of SE

- Even though we don't know  SD( box) we estimated
$$
   \frac{ \text{SD( box)}}{ \sqrt{100}}                                                                   $$     
   by the bootstrap estimate of SE  
   $$ \frac{\text{SD}([X_1, \dots, X_{100}])}{ \sqrt{100}}$$
   
   
- If we can plug this in (and we can if sample is large enough), we see that
   $$
   \begin{aligned}
   \text{P} \left(\text{$\mu$ between ${ \bar{X} \pm 2 \times \; \text{SD}([X_1, \dots, X_{100}]) / \sqrt{100}}$}\right) &\approx 95\% \\
   \end{aligned}
   $$
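
A small simulation can illustrate the 95% claim. The box below (the faces of a die) is a hypothetical stand-in, and the seed and number of repetitions are arbitrary; the point is only that the interval built from the plug-in SE covers the box average in roughly 95% of repetitions.

```python
import math
import random
import statistics

random.seed(1)
box = [1, 2, 3, 4, 5, 6]   # hypothetical box
mu = statistics.mean(box)  # true average of the box
n_draws, n_repeats = 100, 2000

covered = 0
for _ in range(n_repeats):
    draws = random.choices(box, k=n_draws)  # draws with replacement
    x_bar = statistics.mean(draws)
    se_hat = statistics.pstdev(draws) / math.sqrt(n_draws)
    if x_bar - 2 * se_hat <= mu <= x_bar + 2 * se_hat:
        covered += 1

coverage = covered / n_repeats
print(f"fraction of intervals covering mu: {coverage:.3f}")
```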


## Different rules of computing SE

- We have seen various different rules for computing SE.

         SE(sum of B marbles in n draws) = \
               sqrt(n) * sqrt(p[B] * (1 - p[B])) 
         
         SE(proportion of B marbles in n draws) = \
               sqrt(p[B] * (1 - p[B])) / sqrt(n)
         
         SE(sum of n draws) = sqrt(n) * SD(box)
         
         SE(average of n draws) = SD(box) / sqrt(n)
         
         
- They are all examples of the first rule for $\text{SE( sum of draws)}$ followed by unit conversion.

- Once we have figured out the appropriate SE and expected value, we can use normal approximation if # of draws is large enough.
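
The "unit conversion" can be spelled out numerically. This sketch reuses p = 57% and n = 1000 from the earlier example and starts from the rule for the sum of draws.

```python
import math

p_blue, n = 0.57, 1000                     # values from the earlier example
sd_box = math.sqrt(p_blue * (1 - p_blue))  # SD of a 0-1 box

se_sum = math.sqrt(n) * sd_box  # SE(sum of n draws)
se_proportion = se_sum / n      # divide by n: sum -> proportion (average)
se_percent = 100 * se_proportion  # multiply by 100: proportion -> percent

print(f"SE(sum)        = {se_sum:.2f} marbles")
print(f"SE(proportion) = {se_proportion:.4f}")
print(f"SE(percent)    = {se_percent:.2f}%")
```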

## Normal approximation for $\bbox[5px,border:2px solid orange]{ \widehat{\theta}}$

* Suppose we are trying to estimate *something about a chance
process*
   called $\bbox[5px,border:2px solid blue]{\theta}$ with an estimator $\bbox[5px,border:2px solid orange]{ \widehat{\theta}}$.
* Under the appropriate conditions, a normal approximation may hold for $\bbox[5px,border:2px solid orange]{ \widehat{\theta}}$.
* If $E(\bbox[5px,border:2px solid orange]{\widehat{\theta}} )= \bbox[5px,border:2px solid blue]{ \theta}$ and the normal approximation holds, then $$P \left(\frac{\bbox[5px,border:2px solid orange]{  \widehat{\theta}} - \bbox[5px,border:2px solid blue]{ \theta}}{SE(\bbox[5px,border:2px solid orange] { \widehat{\theta}})} \leq c \right)$$ 
can be expressed as the area under the standard normal curve, i.e. it can be computed using table A-104 from the book.
* For example, suppose $c=-1.5$, then …

In [None]:
%%capture
normal_fig = plt.figure(figsize=figsize)
ax = normal_curve()
interval = np.linspace(-4,-1.5,101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval), 
                hatch='+', facecolor='red')
ax.set_title('Area is about 6.7%', fontsize=15)


In [None]:
normal_fig

## Normal approximation and confidence intervals

- If a normal approximation holds for 
$\bbox[5px,border:2px solid orange]{ \widehat{\theta}}$
 then, for example, 
 $$ \bbox[5px,border:2px solid orange]{ \widehat{\theta}}
  \pm 1.65 \times SE(\bbox[5px,border:2px solid orange]{ \widehat{\theta}})
  $$
   is a 90% confidence interval for $\bbox[5px,border:2px solid blue]{ \theta}$.
- Often, we only have an estimate $\bbox[5px,border:2px solid orange]
{\widehat{\text{SE}(\widehat{\theta})}}$

- We can then compute an approximate 90% confidence interval:
$$ \bbox[5px,border:2px solid orange]{ \widehat{\theta}}
  \pm 1.65 \times \bbox[5px,border:2px solid orange]
{\widehat{\text{SE}(\widehat{\theta})}}
  $$

- An approximate 95% confidence interval is:
$$ \bbox[5px,border:2px solid orange]{ \widehat{\theta}}
  \pm 2 \times \bbox[5px,border:2px solid orange]
{\widehat{\text{SE}(\widehat{\theta})}}
  $$
  
- **Caution: if the normal approximation does not hold, then we cannot use these rules for confidence intervals.**
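
The multipliers 1.65 and 2 are rounded values from the standard normal curve. As a sketch, Python's standard library `statistics.NormalDist` can reproduce them instead of a table lookup.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve

mult_90 = z.inv_cdf(0.95)   # leaves 5% in each tail -> 90% in the middle
mult_95 = z.inv_cdf(0.975)  # leaves 2.5% in each tail -> 95% in the middle
area_within_2 = z.cdf(2) - z.cdf(-2)

print(f"90% multiplier: {mult_90:.3f}")
print(f"95% multiplier: {mult_95:.3f}")
print(f"area within +/- 2 SE: {100 * area_within_2:.1f}%")
```

Using 2 instead of 1.96 gives an interval with slightly more than 95% coverage under the normal approximation.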
  

## Gauss model

- The Gauss model assumes that each measurement has the form

         measurement = true value + chance error
         
      
- When the Gauss model holds, taking a measurement corresponds to drawing from an  error box and adding a  true value.

- If the measurement is biased, the Gauss model is


         measurement = bias + true value
                       + chance error


## Sampling from the Gauss model

- Suppose we observe a sample of $n$ draws $[X_1, \dots, X_n]$ from the Gauss model.
- Then, $$\begin{aligned}
       E(\bar{X}) &= \text{true value} \\
       \text{SE}(X_1) &= \text{SE(one draw from error box)} \\
       \text{SE}(\bar{X}) &= \frac{1}{\sqrt{n}}  \text{SE(one draw from error box)}
       \end{aligned}$$
- A reasonable estimate of $\text{SE}(\bar{X})$ is
$$
\text{SE}(\bar{X}) \approx \frac{1}{\sqrt{n}} \text{SD}([X_1, \dots, X_n]).
$$

- If you know the SE from previous data, use the true SE rather than the bootstrap estimate.

## A special case of the Gauss model

- A special case of the Gauss model is when the errors
follow a normal curve.

- The normal curve is also called the *Gaussian* distribution.

- The book does not assume the errors follow the normal curve, but
tells you when they do.


## Sample averages with normal errors


In [None]:
other_model = Uniform(3, 2)
sample_mean = SampleMean(other_model, 3)
sample_mean.trial()

In [None]:
std(sample_mean.sample(5000)), 2 / sqrt(3)

In [None]:
%%capture
sample_mean_fig = plt.figure(figsize=figsize)
ax = sample_density(sample_mean.sample(15000), bins=30, facecolor='orange')[0]
ax.set_title(r'True value = 3, sample size 3, sample mean SE = 2 / $\sqrt{3}$')

In [None]:
sample_mean_fig

# Testing hypotheses 

## Null and alternative hypotheses

- The naming of the hypotheses corresponds to an "innocent until proven guilty" approach.
- Since our observations (in standardized units) seem attributable to chance variation, we decided we cannot declare the null hypothesis to be false. Or, we cannot reject the null hypothesis.
- In legalese, "there is reasonable doubt as to the guilt of the roulette game so we do not convict".

## $Z$-scores

- In this hypothesis test, the quantity
$$
Z = \frac{\text{observed} - \text{expected}}{\text{SE(observed)}}
$$
is called a **$z$-score**.

- The quantities **expected, SE(observed)** are computed **assuming the null hypothesis is true.**

- It measures how many standardized units the **observed** value is from what
is expected (if the null hypothesis were true).

## $P$-values

- The chances we computed are the chances, if the roulette game was fair, that we would observe a standardized value less than our observed standardized value of  -2.2.
- In general, if we test a null hypothesis with some  observed data
  or  observed test statistic, the  $P$-value
   is the chance, assuming the null hypothesis is true, that we would observe such an extreme test statistic.
* When computing chances using a $z$-score, the test is called a **$z$-test.**
* **Note:**  the $\bbox[5px,border:2px solid orange]{P\text{-value}}$ is random!
* **The $P$-value is NOT the chance that the null hypothesis is correct!**
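
For the observed standardized value of -2.2 mentioned above, the one-sided $P$-value is the area under the standard normal curve to the left of -2.2. A minimal sketch using the standard library:

```python
from statistics import NormalDist

z_observed = -2.2
p_value = NormalDist().cdf(z_observed)  # area to the left of z_observed

print(f"one-sided P-value: {100 * p_value:.1f}%")
```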

## One-sided vs. two-sided

* If we want to conclude a one-sided alternative like $H_a$:"the average difference in blood pressure is less than -7 mm Hg".
* Then, we can take the null hypothesis to be $H_0$:"the average difference in blood pressure is greater than or equal to -7 mm Hg". We reject for $z$-scores that are negative and large in absolute value.
* On the other hand, if we want to conclude a two-sided alternative like $H_a$:"the average difference in blood pressure is not -7 mm Hg".
* Then, we can take the null hypothesis to be $H_0$:"the average difference in blood pressure is equal to -7 mm Hg". We reject for large $z$-scores in absolute value.


## Normal approximation and hypothesis tests

* If a normal approximation holds for $\bbox[5px,border:2px solid orange]{\widehat{\theta}}$
(i.e. $E(\widehat{\theta}) \approx \bbox[5px,border:2px solid blue]{\theta}$ and $\widehat{\theta}-\theta$ follows a normal curve with an SE we can approximate). 

* Then, we can test the null hypothesis $H_0:  \theta=\theta_0$ against $H_a:  \theta \neq \theta_0$ (or any variation of one-sided vs. two-sided).
* For instance, our first null hypothesis was $\theta_0=0$. In the second, $\theta_0=-7$.
* The test statistic, called a  $z$ score
   for testing $H_0: \theta=\theta_0$ is 
   $$z = \frac{\bbox[5px,border:2px solid orange]{\widehat{\theta}} - \bbox[5px,border:2px solid blue]{\theta_0}}{\text{SE}(\bbox[5px,border:2px solid orange]{\widehat{\theta}})}
   $$
   
* We call $z$ a $Z$-statistic or a $Z$-score.
* If $H_0$ is true, then $ z$ follows the standard normal curve.


## Normal approximation and hypothesis tests

* If $H_0$ is not true, then $Z$ does not usually follow the standard normal curve. If it does, you have a very poor test...
* It may follow a normal curve with mean $\neq 0$.
* The logic of the hypothesis test is as follows: if $H_0$ is true, then our observed test statistic should be a "typical value" under $H_0$.
* The  $P$-value
   depends on what $H_a$ is.
* It is often easier to use the rejection rule instead of the $P$-value.
* For null hypotheses like $H_0:\theta \leq \theta_0$ and $H_0:{ \theta \geq \theta_0}$ we use the rejection rules with the *same $z$-score* but whether we reject or not depends on whether the $z$-score is positive or negative.


### One sided test (alternative negative)

In [None]:
%%capture
normal_fig3 = plt.figure(figsize=figsize)
ax = normal_curve()
interval = np.linspace(-4,-1.65, 101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval),
                hatch='+', color='green', alpha=0.5)
ax.set_title('The green area is %0.0f%%' % (100 * ndist.cdf(-1.65)), fontsize=20, color='green')


In [None]:
normal_fig3


 5% rejection rule
 for $H_0:\theta \geq \theta_0, H_a: \theta < \theta_0.$

### One sided test (alternative positive)

In [None]:
%%capture
normal_fig5 = plt.figure(figsize=figsize)
ax = normal_curve()
interval = np.linspace(1.65,4, 101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval),
                hatch='+', color='green', alpha=0.5)
ax.set_title('The green area is %0.0f%%' % (100 * ndist.sf(1.65)), fontsize=20, color='green')


In [None]:
normal_fig5


 5% rejection rule
 for $H_0:\theta \leq \theta_0, H_a: \theta > \theta_0.$

### Two sided test

In [None]:
%%capture
normal_fig4 = plt.figure(figsize=figsize)
ax = normal_curve()
interval = np.linspace(-4,-2, 101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval),
                hatch='+', color='green', alpha=0.5)
interval = np.linspace(2,4, 101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval),
                hatch='+', color='green', alpha=0.5)

ax.set_title('The green area is %0.0f%%' % (2 * 100 * ndist.cdf(-2)), fontsize=20, color='green')


In [None]:
normal_fig4


 5% rejection rule
 for $ H_0: \theta = \theta_0, H_a: \theta \neq \theta_0.$

## Interpretation of 5% rejection rules

- Call the rejection rules 
$$
\begin{aligned}
R^+ &= [\theta_0 + 1.65 \cdot SE(\hat{\theta}), \infty) \\
R^- &= (-\infty, \theta_0 - 1.65 \cdot SE(\hat{\theta})] \\
R^{\pm} &= (-\infty, \theta_0 - 2 \cdot SE(\hat{\theta})] \cup [\theta_0 + 2 \cdot SE(\hat{\theta}), \infty)
\end{aligned}
$$
- So, $R^+$ corresponds to the pair $H_0: \theta \leq \theta_0, 
H_a: \theta > \theta_0$.
- The rejection rules are set up so that, for instance,
$$
\begin{aligned}
P(\hat{\theta} \in R^+) \leq 5\%, &\qquad \text{if $H_0: \theta \leq \theta_0$ is true.} \\
P(\hat{\theta} \in R^-) \leq 5\%, &\qquad \text{if $H_0: \theta \geq \theta_0$ is true.} \\
P(\hat{\theta} \in R^{\pm}) = 5\%, &\qquad \text{if $H_0: \theta = \theta_0$ is true.}
\end{aligned}
$$

- In other words, the rejection regions are set up so that there is at most a 5% chance of declaring a false positive.

- Here's an illustration of the $R^+$ rejection region. 
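
The false-positive guarantee for $R^+$ can also be checked by simulation. The sketch below assumes $\hat{\theta}$ follows a normal curve exactly, with hypothetical values $\theta_0 = 0$ and SE $= 1$, and estimates how often $\hat{\theta}$ lands in $R^+$ for two values of $\theta$ that satisfy $H_0: \theta \leq \theta_0$.

```python
import random

random.seed(2)
theta_0, se = 0.0, 1.0       # hypothetical null value and SE
cutoff = theta_0 + 1.65 * se  # left endpoint of R+

def rejection_rate(theta, n_repeats=20000):
    """Fraction of simulated estimates that fall in R+ when the truth is theta."""
    hits = sum(random.gauss(theta, se) >= cutoff for _ in range(n_repeats))
    return hits / n_repeats

rate_boundary = rejection_rate(theta_0)      # theta exactly at theta_0
rate_inside = rejection_rate(theta_0 - 0.5)  # theta strictly below theta_0

print(f"theta = theta_0:       rejection rate = {100 * rate_boundary:.1f}%")
print(f"theta = theta_0 - 0.5: rejection rate = {100 * rate_inside:.1f}%")
```

The rate is about 5% at the boundary of $H_0$ and smaller when $\theta$ is further inside $H_0$, which is why the guarantee is stated with $\leq$.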

## Relation between hypothesis tests and confidence intervals

* Which values are reasonable?
* Well, -7.0 is certainly a reasonable value for the true average difference if our sample average were -7, because our $z$ score would be 0.
* Hence, we would not reject $H_0$:"the average difference is -7" if we observed a sample average of -7
* The set of all values $\theta$ for which we would not reject $H_0$: "the average difference is $\theta$" at level 5% is basically the standard 95% confidence interval!
* Therefore, one can test $H_0:$"the average difference is 0" by checking to see whether 0 is in the confidence interval.

## Testing fairness via a confidence interval

- Let's go back to our roulette example. Suppose we made an additional 10 bets
and won 3 more times, making a total of 13 successes in 30 bets.
- An approximate 95% confidence interval for the true  RED
   success rate (fair or not) based on our 30 bets is $$ \frac{13}{30} \pm 2 \times \sqrt{\frac{13}{30} \times \frac{17}{30} \times \frac{1}{30}} =  0.43 \pm 0.18$$
      
- (This assumes the online roulette game is doing independent trials, though not necessarily fair trials)
- The success rate for  RED
   in the fair model is ${ 18/38 \approx 0.47}$.
- We see that 0.47 is within our 95% confidence interval. Therefore, we would not reject $H_0$:"the roulette table is fair" at level 5%.
* **Note:** we should ensure that we have enough trials so the normal approximation holds. 
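
The interval and the check against the fair success rate can be reproduced as follows. This is a sketch using only the numbers in the example; the variable names are ours.

```python
import math

wins, bets = 13, 30
p_hat = wins / bets
margin = 2 * math.sqrt(p_hat * (1 - p_hat) / bets)
lower, upper = p_hat - margin, p_hat + margin

p_fair = 18 / 38  # success rate for RED in the fair model
fair_is_plausible = lower <= p_fair <= upper

print(f"95% CI: {p_hat:.2f} +/- {margin:.2f}")
print(f"fair rate {p_fair:.2f} inside interval: {fair_is_plausible}")
```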

## Tests and confidence intervals for small samples

* Our tests (and confidence intervals) have so far relied on normal approximations (i.e. we have used A-104 to compute all chances).
* If the sample size is small, the normal approximations may not be very good.
* If the sample size is small, we can sometimes get good confidence intervals using something called a $T$ statistic.
* The formula for the $T$ statistic is almost identical to that of the $z$ statistic; it is the *chances*
   that can be quite different.

## Tests and confidence intervals for small samples

* Suppose the Gauss model holds 

           measurement = true value + chance error

* **And, the histogram of the error box is not too different from a normal probability histogram or curve!**
  
* Then, there are very good confidence intervals even for very small samples.
* If the histogram of the error box is exactly a normal probability histogram, then these tests and confidence intervals are *exact*.

## The $T$ statistic

* Suppose we observed only 5 blood pressure changes: [-4,-6,-8,-2,-1].
* The average is -4.2 mm Hg, and the SD of the list is 2.6 mm Hg.
* Our usual $z$ score to test $H_0$: average difference $\geq 0$ against $H_a$: average difference $<0$ $${ z = \frac{-4.2}{2.6 / \sqrt{5}} \approx -3.7}$$
* The $T$ statistic replaces the SD of the list with SD$^+$ of the list which is 2.9 mm Hg. 
* The $T$ statistic is $${ \bbox[5px,border:2px solid orange]{ T} = \frac{-4.2}{2.9 / \sqrt{5}} \approx -3.3}$$
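
Both statistics can be computed from the five observations. In the sketch below, `statistics.pstdev` divides by $n$ (the SD of the list) and `statistics.stdev` divides by $n-1$ (SD$^+$).

```python
import math
import statistics

changes = [-4, -6, -8, -2, -1]  # observed blood pressure changes (mm Hg)
n = len(changes)

avg = statistics.mean(changes)
sd = statistics.pstdev(changes)      # SD of the list (divide by n)
sd_plus = statistics.stdev(changes)  # SD+ of the list (divide by n - 1)

z = avg / (sd / math.sqrt(n))
T = avg / (sd_plus / math.sqrt(n))

print(f"average = {avg:.1f}, SD = {sd:.1f}, SD+ = {sd_plus:.1f}")
print(f"z = {z:.1f}, T = {T:.1f}")
```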

## What’s different about the $T$ statistic?

* For one thing, it uses $\text{SD}^+$ instead of $\text{SD}$.
* Why does it use $\text{SD}^+$?
* For small samples, $\text{SD}^+$ is a better estimate of SD(box) than SD.
* Unfortunately, though, the $T$ statistic does not follow the normal curve. This is the biggest difference.

## Computing the chances for the $T$ test

* It *almost*
   follows the normal curve. For large samples, it gets closer and closer.
* For each sample size, there is a different curve, or probability histogram.
* These curves are indexed by what we call *degrees of freedom*.
* In this example, the degrees of freedom are $n-1$.
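
To see how different the chances can be, the sketch below compares the area to the left of -3.3 under the normal curve with the area under Student's curve with 4 degrees of freedom. To stay within the standard library, the Student area is obtained by numerically integrating the density with Simpson's rule; the helper names are ours, and a statistics package would normally be used instead.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's T curve with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_lower_tail(t, df, steps=2000):
    """P(T <= t) for t < 0: 0.5 minus the Simpson's-rule area over [t, 0]."""
    h = (0 - t) / steps
    total = t_pdf(t, df) + t_pdf(0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return 0.5 - total * h / 3

t_tail = t_lower_tail(-3.3, df=4)
normal_tail = NormalDist().cdf(-3.3)

print(f"area left of -3.3, T with 4 df: {100 * t_tail:.2f}%")
print(f"area left of -3.3, normal:      {100 * normal_tail:.3f}%")
```

With only 4 degrees of freedom the tail area is many times larger than under the normal curve, so using the normal table here would overstate the evidence.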

### Student’s $T$

In [None]:
%%capture
df=4
normal_fig6 = plt.figure(figsize=figsize)
ax = normal_fig6.gca()
normal_curve(ax=ax, label='Normal', color='blue', alpha=0.)
studentT_curve(ax=ax, label='$T_{%d}$' % df, color='green', alpha=0., df=df)
ax.set_title('Comparison of normal curve to $T_{%d}$' % df, fontsize=15)
ax.legend()

In [None]:
normal_fig6

## SE of a difference

* The denominator should be 
$$ \text{SE[average(2004 sample) -
     average(1990 sample)]}.$$
* The SE of the difference of two **independent**
   quantities can be found from the individual SEs.
* For our two box model, the  **average(1990 sample)**
   is independent of the  **average(2004 sample)**.
* The rule is 
$$\begin{aligned}
   \text{SE[average(2004 sample) - average(1990 sample)]} = \sqrt{\text{SE(1990 sample)}^2 + \text{SE(2004 sample)}^2}
     \end{aligned}$$

## An example with proportions

* In 1999, 13% of the 17 year-old students had taken calculus compared to 17% in 2004 according to the NAEP samples. Is the difference real or chance variation?
* This question asks us to compare two proportions. We can take the null hypothesis to be "the proportion of 17 year olds who took calculus in 1999 is equal to (or greater than or equal to) the proportion who took calculus in 2004."
* The alternative is "the proportion of 17 year olds who took calculus in 1999 is less than the proportion who took calculus in 2004."
* This is another example of choosing the hypothesis after seeing the data. Bad.

## SE of the difference of two proportions

* We could rewrite this as $H_0: p_{1999} \geq p_{2004}$ vs. $H_a:p_{1999} < p_{2004}$.
* Our sample estimates are $\widehat{p}_{1999}=13\%, \widehat{p}_{2004}=17\%$.
* They have estimated standard errors: $$\text{SE}(\widehat{p}_{1999}) \approx \sqrt{0.13 \times 0.87 / 1000} = 1.1\%,$$
$$\text{SE}(\widehat{p}_{2004}) \approx \sqrt{0.17 \times 0.83 / 1000} = 1.2\%.$$

* The SE of the difference is $\text{SE}(\widehat{p}_{1999} - \widehat{p}_{2004}) \approx \sqrt{1.1^2 + 1.2^2} \approx 1.6\%.$

* The $z$-score is $$z = \frac{13 - 17 - 0}{1.6} \approx -2.5$$
* The one-sided $P$-value is about 1%.
* The two-sided $P$-value (the better one to use, in my opinion) is about 2%.
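
The whole calculation can be sketched in code. The sample sizes of 1000 are the ones implied by the SE formulas above; because nothing is rounded along the way, the $P$-values come out slightly different from the rounded hand calculation.

```python
import math
from statistics import NormalDist

p_1999, p_2004 = 0.13, 0.17
n_1999 = n_2004 = 1000  # sample sizes implied by the SE formulas above

se_1999 = math.sqrt(p_1999 * (1 - p_1999) / n_1999)
se_2004 = math.sqrt(p_2004 * (1 - p_2004) / n_2004)
se_diff = math.sqrt(se_1999 ** 2 + se_2004 ** 2)  # independent samples

z = (p_1999 - p_2004 - 0) / se_diff
p_one_sided = NormalDist().cdf(z)             # H_a: p_1999 < p_2004
p_two_sided = 2 * NormalDist().cdf(-abs(z))   # H_a: p_1999 != p_2004

print(f"SE of difference = {100 * se_diff:.2f}%")
print(f"z = {z:.2f}")
print(f"one-sided P = {100 * p_one_sided:.2f}%, two-sided P = {100 * p_two_sided:.2f}%")
```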

## Randomized experiments

* This *two-sample $z$-test*
   can also be used for randomized controlled experiments.
* Example: 200 subjects are split randomly into treatment and placebo in a study of the effect of vitamin C on the number of colds.
* In the treatment group **average(treatment group)=2.3, SD(treatment group) = 3.1.**
* In the placebo group **average(placebo group)=2.6, SD(placebo group) = 2.9.**
* Is there a difference in the number of colds in treatment vs. placebo group?

## Box model for experiment

* The best box model for this randomized experiment is a box with 200 tickets.
* Each ticket has two responses for each subject:
    - $A$: the number of colds if they receive the vitamin C;
    - $B$: the number of colds if they receive the placebo.
* In the experiment, we only observe $A$ or $B$ for each subject.
* Which one we see depends on the randomization.
* Statistical theory says the $z$-test is applicable if the sample is large enough to decide between the hypotheses
    - $H_0: \text{average of all $A$'s $\geq$ average of all $B$'s}$ 
    - $H_a: \text{average of all $A$'s $<$ average of all $B$'s}$ 
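
The box model can be mimicked in a few lines. Everything in the sketch below is hypothetical: the 200 tickets are generated at random rather than taken from the study, and the $z$-score uses the SE-of-a-difference rule from the earlier section. It only illustrates that each subject reveals one of the two responses, depending on the randomization.

```python
import math
import random
import statistics

random.seed(3)
# Hypothetical box: each ticket is (A, B) = (colds with vitamin C, colds with placebo).
tickets = [(random.randint(0, 6), random.randint(0, 6)) for _ in range(200)]

# Randomization: 100 subjects get the treatment, the other 100 get the placebo.
treated = set(random.sample(range(200), 100))
observed_A = [a for i, (a, b) in enumerate(tickets) if i in treated]
observed_B = [b for i, (a, b) in enumerate(tickets) if i not in treated]

def se_of_average(values):
    """Plug-in SE of the average of a list of observations."""
    return statistics.pstdev(values) / math.sqrt(len(values))

diff = statistics.mean(observed_A) - statistics.mean(observed_B)
se_diff = math.sqrt(se_of_average(observed_A) ** 2 + se_of_average(observed_B) ** 2)
z = diff / se_diff

print(f"observed {len(observed_A)} A's and {len(observed_B)} B's")
print(f"difference in averages = {diff:.2f}, SE = {se_diff:.2f}, z = {z:.2f}")
```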
    

## Complete data

In [None]:
%%capture
sample_fig = plt.figure(figsize=(8,8))
ax = sample_fig.gca()
np.random.seed(0)
X = np.mgrid[0:10:10j,0:20:20j].reshape((2,200))   + np.random.sample((2,200)) * 0.05
X = X.T
sample = np.random.sample(200) 

data = csv2rec('data/vitaminC.csv')
treatment_sample = data['treatment']
placebo_sample = data['placebo']
placebo = np.array([t == 'placebo' for t in data['group']], bool)  # builtin bool; the np.bool alias was removed from NumPy

idx = np.arange(200)
np.random.shuffle(idx)

placebo = placebo[idx]
placebo_sample = placebo_sample[idx]
treatment_sample = treatment_sample[idx]

for i in range(200):
   ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
   ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
   ax.text(X[i,0]-0.2, X[i,1], '%d' % treatment_sample[i], ha='center', va='center', size=10)
   ax.text(X[i,0]+0.2, X[i,1], '%d' % placebo_sample[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
ax.set_title('Complete data (unobserved)', fontsize=15)

In [None]:
sample_fig

## Placebo group

In [None]:
%%capture
placebo_fig = plt.figure(figsize=(8,8))
ax = placebo_fig.gca()
for i in range(200):
   if placebo[i]:
       ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
       ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
       ax.text(X[i,0]+0.2, X[i,1], '%d' % placebo_sample[i], ha='center', va='center', size=10)  # placebo value on the right half, as in the other figures
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
plt.title("Placebo sample: average(placebo)=%0.1f, SD(placebo)=%0.1f" % (np.around(placebo_sample[placebo].mean(),1), np.around(placebo_sample[placebo].std(),1)))

In [None]:
placebo_fig

## Treatment group

In [None]:
%%capture
treatment_fig = plt.figure(figsize=(8,8))
ax = treatment_fig.gca()
for i in range(200):
   if not placebo[i]:
       ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
       ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
       # treatment values go in the left-hand (red) half, as in the complete-data figure
       ax.text(X[i,0]-0.2, X[i,1], '%d' % treatment_sample[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
plt.title("Treatment sample: average(treatment)=%0.1f, SD(treatment)=%0.1f" % (treatment_sample[~placebo].mean(), treatment_sample[~placebo].std()))


In [None]:
treatment_fig

## Observed data

In [None]:
%%capture
observed_fig = plt.figure(figsize=(8,8))
ax = observed_fig.gca()
for i in range(200):
   ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
   ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
   if not placebo[i]:
       ax.text(X[i,0]-0.2, X[i,1], '%d' % treatment_sample[i], ha='center', va='center', size=10)
   elif placebo[i]:
       ax.text(X[i,0]+0.2, X[i,1], '%d' % placebo_sample[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])

In [None]:
observed_fig

## Carrying out the test

* We know $$\begin{aligned}
       \text{average(treatment)} &= 2.6 \\
       \text{SE(average(treatment))} &= 2.9 / \sqrt{100} = 0.29 \\
       \text{average(placebo)} &= 2.3 \\
       \text{SE(average(placebo))} &= 3.1 / \sqrt{100} = 0.31 \\
       \end{aligned}$$
       
* Therefore, $$\begin{aligned}
       \text{SE(average(treatment)-average(placebo))}
       & = \sqrt{0.31^2+0.29^2} \\
       &= 0.42
       \end{aligned}$$
 and $$z = \frac{2.3-2.6}{0.42} = -0.7$$
       
* The one-sided $P$-value, the area under the normal curve to the left of $z = -0.7$, is about 24% when we test $$H_0: \text{average of all $A$'s $\geq$ average of all $B$'s}.$$
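The arithmetic on this slide can be retraced in a few lines. This is only a sketch: it takes the rounded averages and SDs quoted above as given, assumes 100 subjects in each group, and uses a small helper `normal_cdf` (a name introduced here, not part of the course code) for the area under the normal curve.

```python
from math import sqrt, erfc

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * erfc(-z / sqrt(2))

# Summary statistics quoted in the text (assumed 100 subjects per group).
avg_treatment, sd_treatment, n_treatment = 2.6, 2.9, 100
avg_placebo, sd_placebo, n_placebo = 2.3, 3.1, 100

# SE of each sample average, then SE of their difference.
se_treatment = sd_treatment / sqrt(n_treatment)
se_placebo = sd_placebo / sqrt(n_placebo)
se_diff = sqrt(se_treatment ** 2 + se_placebo ** 2)

# z statistic, using the same order of subtraction as the slide.
z = (avg_placebo - avg_treatment) / se_diff

# One-sided P-value: area to the left of z.
p_left = normal_cdf(z)

print("SE of difference: %0.2f" % se_diff)
print("z: %0.1f" % z)
print("left-tail area: %0.1f%%" % (100 * p_left))
```

Because the inputs are rounded, the printed values can differ in the last digit from a calculation based on the raw data.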


## Is the die fair?

* Suppose we have a die and we want to
decide whether it is fair or not.

* We roll the die 60 times. These are the outcomes:

In [None]:
%%capture
data_table = make_table([('Value', 'Observed'), (1,4), (2,6), (3,17), (4,16), (5,8), (6,9), ('Total', 60)])

In [None]:
data_table


* It looks like the number of 3's and 4's might be a little high (though
we had already decided we were going to perform this test...)

## Comparison to expected

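* If the die is fair, we expect each face to come up $60/6 = 10$ times. If the die is unfair, the observed
counts in some cells will tend to be higher or lower than expected.

* We use the square of the difference instead of the difference, so that positive and negative deviations do not cancel.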
  

In [None]:
%%capture
data_table3 = make_table([('Value', 'Observed', 'Expected', '(Observed - Expected)^2'), 
                          (1,4,10, '(4-10)^2=36'), 
                          (2,6,10, '(6-10)^2=16'), 
                          (3,17,10, '(17-10)^2=49'), 
                          (4,16,10, '(16-10)^2=36'), 
                          (5,8,10, '(8-10)^2=4'), 
                          (6,9,10, '(9-10)^2=1'), ('Total', 60,60, '' )])

In [None]:
data_table3

## Pearson's $X^2$ statistic

* To get an overall test, we combine the rows into *Pearson’s $X^2$*
   $$
   \begin{aligned}
X^2 &= \sum_i \frac{\text{(observed[i]-expected[i])}^2}{\text{expected[i]}}\\
 &= \sum_i \frac{(O_i-E_i)^2}{E_i}\\
\end{aligned}
$$
* In our die example,
$$
\begin{aligned}
X^2 &= \frac{36}{10} + \frac{16}{10} + \frac{49}{10} + \frac{36}{10} + \frac{4}{10} + \frac{1}{10} \\
&= \frac{142}{10} \\
&= 14.2
\end{aligned}
$$

* Is this big, or could the statistic be this big by chance?
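The hand calculation above can be checked directly from the table of observed counts. The sketch below assumes the null hypothesis of a fair die, so every face has expected count $N/6$; the variable names are illustrative.

```python
# Observed counts from the 60 rolls in the text.
observed = {1: 4, 2: 6, 3: 17, 4: 16, 5: 8, 6: 9}

n_rolls = sum(observed.values())

# Under a fair die, each face is expected n_rolls / 6 times.
expected = {face: n_rolls / 6 for face in observed}

# Pearson's X^2: sum over faces of (O - E)^2 / E.
x2 = sum((observed[face] - expected[face]) ** 2 / expected[face]
         for face in observed)

print("N = %d, X^2 = %0.1f" % (n_rolls, x2))
```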

In [None]:
%%capture
from scipy.stats import chi2
from pylab import fill_between

def tail_chi2(observed, df, upper_lim=None):
    if upper_lim is None:
        upper_lim = 10*df

    X = np.linspace(1.e-10, upper_lim, 201)
    D = chi2.pdf(X, df)
    fig = plt.figure(figsize=(6,6))
    ax = fig.gca()
    ax.plot(X, D, 'k', linewidth=5)
    cutoff = chi2.ppf(0.95, df)
    x = np.linspace(cutoff, upper_lim, 501)
    ax.fill_between(x, 0, chi2.pdf(x, df), hatch='\\', facecolor='green', label='5% cutoff',
                    alpha=0.5)
    x = np.linspace(observed, upper_lim, 501)
    ax.fill_between(x, 0, chi2.pdf(x, df), hatch='\\', facecolor='red', label='observed',
                    alpha=0.5)
    # raw strings keep the LaTeX backslash from being read as a string escape
    ax.set_xlabel(r'$\chi^2$ units', fontsize=15)
    ax.set_ylabel(r'Percent per $\chi^2$ units', fontsize=15)
    ax.set_xlim([0, upper_lim])
    ax.legend(loc='upper right')
    return fig, ax

die_fig, die_ax = tail_chi2(14.2, 5, upper_lim=20)

### What are the chances?

In [None]:
die_fig


In the $\chi^2_5$ probability histogram, the <font color='red'> red area </font> to the right of the observed $X^2 = 14.2$ is about 1.4%.
The <font color="green"> green area </font> is the 5% rejection region for $\chi^2_5$.
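One way to see what "by chance" means here is to simulate the box model directly: roll a fair die 60 times, compute $X^2$, and repeat many times. The sketch below does this with Python's standard `random` module. The seed, the number of repetitions, and the helper names are arbitrary choices made for illustration, and the simulated tail fraction is only an approximation to the area under the $\chi^2_5$ curve.

```python
import random

def pearson_x2(counts, expected):
    """Pearson's X^2 for parallel lists of observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(counts, expected))

def simulate_fair_die_x2(n_rolls, rng):
    """Roll a fair die n_rolls times and return the resulting X^2."""
    counts = [0] * 6
    for _ in range(n_rolls):
        counts[rng.randrange(6)] += 1
    return pearson_x2(counts, [n_rolls / 6] * 6)

rng = random.Random(0)   # fixed seed so the run is reproducible
n_sims = 10000           # arbitrary number of repetitions
observed_x2 = 14.2

sims = [simulate_fair_die_x2(60, rng) for _ in range(n_sims)]

# Fraction of simulated statistics at least as large as the observed one.
# The small tolerance guards against floating-point rounding at exactly 14.2.
tail_fraction = sum(x >= observed_x2 - 1e-9 for x in sims) / n_sims

print("simulated tail fraction: %0.1f%%" % (100 * tail_fraction))
```

The simulated fraction should land in the neighborhood of the area read off the curve; the two will not match exactly, because the curve is itself an approximation to the probability histogram of $X^2$.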

## Using the $\chi^2$ test

* A general rule of thumb: every expected count should be 5 or more for the $\chi^2$ curve to approximate the probability histogram of the $X^2$ statistic.
* The rule is not satisfied for 100 draws from the box below, where each of the values 0, 1, 2, 3 has an expected count of only 1:

In [None]:
box = [0,1,2,3] + [4]*96
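The rule of thumb can be checked mechanically by computing the expected count of each value in the box. This is a minimal sketch; the threshold of 5 is the rule of thumb stated above, and the variable names are illustrative.

```python
from collections import Counter

box = [0, 1, 2, 3] + [4] * 96
n_draws = 100

# Expected count of a value = n_draws * (fraction of tickets with that value).
ticket_counts = Counter(box)
expected = {value: n_draws * count / len(box)
            for value, count in ticket_counts.items()}

# Values whose expected count falls below the rule-of-thumb threshold.
too_small = [value for value, e in expected.items() if e < 5]

print(expected)
print("values with expected count below 5:", too_small)
```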

## Difference between $\chi^2$ test and $z$ test

* The $z$ test is a test about the average of the box.
* The $\chi^2$ test is a test of whether the observed data follow the box model.
* If there are only two values in the box, then the $\chi^2$ test is equivalent to the (two-sided) $z$ test: $X^2 = z^2$.
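The two-value case can be illustrated with a small numerical check. The data here are hypothetical (100 tosses of a coin with 60 heads, tested against a fair coin) and are not taken from the course; the point is only that the $X^2$ statistic equals the square of the $z$ statistic.

```python
from math import sqrt

# Hypothetical data: 100 tosses, 60 heads, null hypothesis of a fair coin.
n_tosses, n_heads = 100, 60
p_heads = 0.5

# z statistic for the number of heads under the null.
expected_heads = n_tosses * p_heads
se_heads = sqrt(n_tosses) * sqrt(p_heads * (1 - p_heads))
z = (n_heads - expected_heads) / se_heads

# Pearson's X^2 over the two cells (heads, tails).
observed = [n_heads, n_tosses - n_heads]
expected = [expected_heads, n_tosses - expected_heads]
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print("z = %0.1f, z^2 = %0.1f, X^2 = %0.1f" % (z, z ** 2, x2))
```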

## Structure of a $\chi^2$ test

- Draws: the number of draws, $N$, and the resulting draws.

- Data: 

Value | Observed Count
----|----
  1 | 4
  2 | 6
  3 | 17
  4 | 16
  5 | 8
  6 | 9
  Total | 60 (=$N$)

- Box: [1,2,3,4,5,6]

- Degrees of freedom: the number of "free parameters." With six possible values whose counts must add up to $N$, this is $6 - 1 = 5$ in our example. Call this number
`df`.

- $P$-value: Computed using the $\chi^2_{df}$ curve. This was about 1.4% in our example.

## Testing independence: another $\chi^2$ test

### Handedness and gender

* Data example from book:
 
Handedness   | Male | Female
-------------|------|-------
Right        | 934  | 1070
Left         | 113  | 92
Ambidextrous | 20   | 8

* Is handedness related to gender (or not)?

## Test of independence

* The null hypothesis is  **$H_0$: handedness is independent of gender.**
* This means that the probability a person (drawn at random) from the population is, say, a left-handed male, is the product of two probabilities: the probability a person is left-handed and the probability a person is male.
* Or, 
$$P(\text{left-handed and male}) = P(\text{left-handed}) \times P(\text{male})$$

## Expected counts under $H_0$

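* Under $H_0$, the expected count in a cell is $N$ times the product of the two marginal proportions, that is, (row total $\times$ column total) / $N$. For left-handed males this is $205 \times 1067 / 2237 \approx 98$.
* Continuing for all 6 cases yields a table of "Expected Counts":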




Handedness   | Male | Female | Total(Handedness)
-------------|------|--------|----------
Right        | 956  | 1048   | 2004
Left         | 98   | 107    | 205
Ambidextrous | 13   | 15     |   28
Total(Gender)| 1067 | 1170   | 2237
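The table of expected counts can be rebuilt from the observed table using only its margins. This sketch uses plain dictionaries with illustrative names; each expected count is the row total times the column total divided by the grand total, which is what independence implies.

```python
# Observed counts from the handedness table in the text.
observed = {
    'Right':        {'Male': 934, 'Female': 1070},
    'Left':         {'Male': 113, 'Female': 92},
    'Ambidextrous': {'Male': 20,  'Female': 8},
}

# Margins of the table.
row_totals = {hand: sum(row.values()) for hand, row in observed.items()}
col_totals = {sex: sum(row[sex] for row in observed.values())
              for sex in ('Male', 'Female')}
grand_total = sum(row_totals.values())

# Under independence: expected = row total * column total / grand total.
expected = {hand: {sex: row_totals[hand] * col_totals[sex] / grand_total
                   for sex in col_totals}
            for hand in observed}

for hand, row in expected.items():
    print("%-12s Male: %7.1f  Female: %7.1f" % (hand, row['Male'], row['Female']))
```

Rounding the printed values to whole numbers gives the entries shown in the table.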



## Computing the $X^2$ statistic

* The $X^2$ statistic is computed in exactly the same way: $$\begin{aligned}
           X^2 &= \frac{(934-956)^2}{956} + \frac{(1070-1048)^2}{1048} +
            \frac{(113-98)^2}{98} \\
           & \qquad +  \frac{(92-107)^2}{107} + \frac{(20-13)^2}{13} + \frac{(8-15)^2}{15}  \\
           &\approx 12
         \end{aligned}$$
* In symbols, $X^2 = \sum_{i=1}^3 \sum_{j=1}^2 \frac{(O_{ij}-E_{ij})^2}{E_{ij}}$

## Tests of independence in two-way tables

- We could have looked at a different table. The table
may have more than two columns or three rows.

- For example, instead of gender
we might have looked at sexual orientation (even though its relation to handedness may not be an
interesting question). This would add more columns to our table.

- In general, we might have a $R \times C$ table with $R$ categories
in the rows and $C$ categories in the columns.

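- The calculation of the $X^2$ statistic is identical:
$$X^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

- The degrees of freedom are $(R-1) \times (C-1)$. For the $3 \times 2$ handedness table this gives $(3-1) \times (2-1) = 2$.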
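The general recipe for an $R \times C$ table can be sketched as a short function. The name `chi2_independence` is made up for this illustration and is not part of the course code. For the $P$-value, the sketch relies on the fact that the $\chi^2$ curve with 2 degrees of freedom has upper-tail area $e^{-x/2}$; that shortcut applies to the handedness table only because it has 2 degrees of freedom, and other tables would need a $\chi^2$ table or a routine such as `scipy.stats.chi2.sf`.

```python
from math import exp

def chi2_independence(table):
    """Return (X^2, df) for a two-way table given as a list of rows."""
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    x2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Expected count under independence.
            e = row_totals[i] * col_totals[j] / total
            x2 += (table[i][j] - e) ** 2 / e

    df = (n_rows - 1) * (n_cols - 1)
    return x2, df

# Handedness (rows) by gender (columns: Male, Female), from the text.
handedness = [[934, 1070],
              [113, 92],
              [20, 8]]

x2, df = chi2_independence(handedness)

# Upper-tail area of the chi-squared curve; this formula is valid only for df == 2.
p_value = exp(-x2 / 2)

print("X^2 = %0.1f, df = %d, P-value = %0.2f%%" % (x2, df, 100 * p_value))
```

Because the function uses unrounded expected counts, its $X^2$ differs slightly from a hand computation with expected counts rounded to whole numbers; both are close to 12.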