In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.mlab import csv2rec
import numpy as np
from numpy import mean, std, fabs

# stats60 specific

figsize = (8,8)

## NAEP reading test

* In 1990, sample average was 290, SD 37 (sample size 1000).
* In 2004, sample average was 285, SD 40 (sample size 1000).
* Is this chance variation, or has something changed?

## A two-box model

* One box for 1990, average(1990 sample) = 290, SD(1990 sample) = 37.
* A second box for 2004, average(2004 sample) = 285, SD(2004 sample) = 40.
* So, $$\begin{aligned}
     \text{SE(average(1990 sample))} &\approx 37 / \sqrt{1000} \\
      &\approx 1.2\\
     \text{SE(average(2004 sample))} &\approx 40 / \sqrt{1000} \\
     &\approx 1.3\\
     \end{aligned}$$

## Two boxes

* The null hypothesis is:  "the average of the 1990 box equals (or is less than or equal to) the average of the 2004 box."
  
* The alternative is  "the average of the 1990 box is greater than the average of the 2004 box."

* Again, it is not quite correct to choose the alternative after seeing
a decrease.
  
* Our  $z$-score
   should  have numerator
   $$285 - 290 - 0.$$

* We are using $$ \text{average(2004 sample)} - \text{average(1990 sample)}$$ to estimate the difference between the 2004 and 1990 box averages.

* What about the denominator?

## SE of a difference

* The denominator should be 
$$ \text{SE[average(2004 sample) -
     average(1990 sample)]}.$$
* The SE of the difference of two **independent**
   quantities can be found from the individual SEs.
* For our two box model, the  **average(1990 sample)**
   is independent of the  **average(2004 sample)**.
* The rule is 
$$\begin{aligned}
   \text{SE[average(2004 sample) -
     average(1990 sample)]} = \sqrt{\text{SE(average(1990 sample))}^2 + \text{SE(average(2004 sample))}^2}
     \end{aligned}$$
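
A quick simulation can make the square-root rule plausible. The sketch below assumes, purely for illustration, that each box is normally shaped with the sample average and SD as its parameters; the normal shape, the seed, and the names `rng` and `n_reps` are assumptions and not part of the NAEP data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_reps = 1000, 2000

# Hypothetical boxes: the normal shape is an assumption made for illustration.
avg_1990 = rng.normal(290, 37, size=(n_reps, n)).mean(axis=1)
avg_2004 = rng.normal(285, 40, size=(n_reps, n)).mean(axis=1)

# Spread of the difference across repeated, independent pairs of samples.
simulated_se = (avg_2004 - avg_1990).std()

# SE predicted by the square-root rule.
rule_se = np.sqrt((37 / np.sqrt(n)) ** 2 + (40 / np.sqrt(n)) ** 2)

print(simulated_se, rule_se)
```

The two printed values should agree up to simulation error. If the two sample averages were not independent, the rule would not be expected to hold.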

## Back to the NAEP

* Applying the rule, 
$$
 \text{SE[average(2004 sample) -
     average(1990 sample)]} = \sqrt{1.2^2+1.3^2} \approx 1.8
$$

* The $z$-statistic is $$z = \frac{-5}{1.8} \approx -2.8.$$
* The one-sided $P$-value is about 0.3%.
* The two-sided $P$-value (the better one to use, in my opinion) is 0.6%.
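
The arithmetic above can be checked without a normal table. This is a minimal sketch using only the standard library; `normal_cdf` is a helper name introduced here, not something from the course code. Because it skips the intermediate rounding of the SEs, it gives an SE near 1.72 and a $z$ near -2.9, so the $P$-values come out a little smaller than the hand calculation, with the same conclusion.

```python
from math import erfc, sqrt

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * erfc(-z / sqrt(2))

n = 1000
se_1990 = 37 / sqrt(n)
se_2004 = 40 / sqrt(n)
se_diff = sqrt(se_1990 ** 2 + se_2004 ** 2)

z = (285 - 290 - 0) / se_diff
p_one_sided = normal_cdf(z)             # alternative: the 2004 box average is lower
p_two_sided = 2 * normal_cdf(-abs(z))   # alternative: the box averages differ

print(round(se_diff, 2), round(z, 2), p_one_sided, p_two_sided)
```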

## An example with proportions

* In 1999, 13% of the 17 year-old students had taken calculus compared to 17% in 2004 according to the NAEP samples. Is the difference real or chance variation?
* This question asks us to compare two proportions. We can take the null hypothesis to be "the proportion of 17 year olds who took calculus in 1999 is equal to (or greater than or equal to) the proportion who took calculus in 2004."
* The alternative is "the proportion of 17 year olds who took calculus in 1999 is less than the proportion who took calculus in 2004."
* This is another example of choosing the hypothesis after seeing the data. Bad.

## SE of the difference of two proportions

* We could rewrite this as $H_0: p_{1999} \geq p_{2004}$ vs. $H_a:p_{1999} < p_{2004}$.
* Our sample estimates are $\widehat{p}_{1999}=13\%, \widehat{p}_{2004}=17\%$.
* They have estimated standard errors: $$\text{SE}(\widehat{p}_{1999}) \approx \sqrt{0.13 \times 0.87 / 1000} = 1.1\%,$$
$$\text{SE}(\widehat{p}_{2004}) \approx \sqrt{0.17 \times 0.83 / 1000} = 1.2\%.$$

* The SE of the difference is $\text{SE}(\widehat{p}_{1999} - \widehat{p}_{2004}) \approx \sqrt{1.1^2 + 1.2^2} \approx 1.6\%.$

* The $z$-score is $$z = \frac{13 - 17 - 0}{1.6} \approx -2.5$$
* The one-sided $P$-value is about 0.6%.
* The two-sided $P$-value (the better one to use, in my opinion) is about 1.2%.
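
The same steps can be wrapped in a small function for proportions. This is a sketch, not course code: `two_proportion_z` is a hypothetical helper, and the sample sizes of 1000 are read off the SE formulas above.

```python
from math import erfc, sqrt

def two_proportion_z(p1, n1, p2, n2):
    """Return (SE of p1 - p2, z), estimating each SE from its own sample proportion."""
    se1 = sqrt(p1 * (1 - p1) / n1)
    se2 = sqrt(p2 * (1 - p2) / n2)
    se_diff = sqrt(se1 ** 2 + se2 ** 2)
    return se_diff, (p1 - p2 - 0) / se_diff

se_diff, z = two_proportion_z(0.13, 1000, 0.17, 1000)

# Left tail of the normal curve, since the alternative is p_1999 < p_2004.
p_one_sided = 0.5 * erfc(-z / sqrt(2))
# Both tails.
p_two_sided = erfc(abs(z) / sqrt(2))

print(round(100 * se_diff, 1), round(z, 1), p_one_sided, p_two_sided)
```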

## Randomized experiments

* This *two-sample $z$-test*
   can also be used for randomized controlled experiments.
* Example: 200 subjects are split randomly into treatment and placebo groups of 100 each in a study of the effect of vitamin C on the number of colds.
* In the treatment group **average(treatment group)=2.3, SD(treatment group) = 3.1.**
* In the placebo group **average(placebo group)=2.6, SD(placebo group) = 2.9.**
* Is there a difference in the number of colds in treatment vs. placebo group?

## Is it OK to use the difference formula here?

* Naively applying the two-sample $z$-test to this situation yields $z = \frac{(2.3 - 2.6) - 0}{\sqrt{\left(\frac{3.1}{\sqrt{100}}\right)^2 + \left(\frac{2.9}{\sqrt{100}}\right)^2}} = \frac{-0.3}{0.42} \approx -0.7.$
* If we had taken a one-sided alternative, this would be a $P$-value of about 24%.
* Our two-box model does not *quite* apply here because the groups were selected from the same "box" of 200 subjects.
* The two samples are not independent: a patient who is in the treatment group cannot also be in the placebo group.
* However, the short answer is **yes, the SE of a difference formula is OK here
because this was a randomized controlled experiment.**

## Box model for experiment

* The best box model for this randomized experiment is a box with 200 tickets.
* Each ticket carries two responses for one subject:
    * $A$: the number of colds if they receive the vitamin C;
    * $B$: the number of colds if they receive the placebo.
* In the experiment, we only observe $A$ or $B$ for each subject.
* Which one we see depends on the randomization.
* Statistical theory says that, if the sample is large enough, the $z$-test can be used to decide between the hypotheses
    - $H_0: \text{average of all $A$'s $\geq$ average of all $B$'s}$ 
    - $H_a: \text{average of all $A$'s $<$ average of all $B$'s}$ 
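
The claim that the SE formula still works under randomization can be explored by simulation. The sketch below builds a hypothetical box of 200 two-sided tickets (the Poisson counts and the way $B$ is tied to $A$ are assumptions made only for illustration), re-randomizes many times, and compares the actual spread of the observed difference with what the two-sample formula reports.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_treat = 200, 100

# Hypothetical box of 200 tickets: A = colds with vitamin C, B = colds with placebo.
# Poisson counts are an assumption; B is built to be correlated with A.
A = rng.poisson(2.5, size=n_subjects)
B = A + rng.integers(0, 2, size=n_subjects)

diffs, formula_ses = [], []
for _ in range(2000):
    # Randomize: 100 subjects get vitamin C, the other 100 get placebo.
    treated = np.zeros(n_subjects, dtype=bool)
    treated[rng.choice(n_subjects, size=n_treat, replace=False)] = True
    a_obs, b_obs = A[treated], B[~treated]   # only one side of each ticket is seen
    diffs.append(a_obs.mean() - b_obs.mean())
    formula_ses.append(np.sqrt(a_obs.var() / n_treat + b_obs.var() / n_treat))

true_se = np.std(diffs)                  # chance variability over randomizations
typical_formula_se = np.mean(formula_ses)
print(true_se, typical_formula_se)
```

In this setup the two numbers should be close, even though the treatment and placebo groups are drawn from the same 200 subjects.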
    

## Complete data

In [None]:
%%capture
sample_fig = plt.figure(figsize=figsize)
ax = sample_fig.gca()
np.random.seed(0)
X = np.mgrid[0:10:10j,0:20:20j].reshape((2,200)) + np.random.sample((2,200)) * 0.05
X = X.T
sample = np.random.sample(200) 

data = csv2rec('data/vitaminC.csv')
treatment_sample = data['treatment']
placebo_sample = data['placebo']
placebo = np.array([t == 'placebo' for t in data['group']], bool)

idx = np.arange(200)
np.random.shuffle(idx)

placebo = placebo[idx]
placebo_sample = placebo_sample[idx]
treatment_sample = treatment_sample[idx]

for i in range(200):
   ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
   ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
   ax.text(X[i,0]-0.2, X[i,1], '%d' % treatment_sample[i], ha='center', va='center', size=10)
   ax.text(X[i,0]+0.2, X[i,1], '%d' % placebo_sample[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
ax.set_title('Complete data (unobserved)', fontsize=15)

In [None]:
sample_fig

## Placebo group

In [None]:
%%capture
placebo_fig = plt.figure(figsize=figsize)
ax = placebo_fig.gca()
for i in range(200):
   if placebo[i]:
       ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
       ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
       ax.text(X[i,0]+0.2, X[i,1], '%d' % placebo_sample[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
plt.title("Placebo sample: average(placebo)=%0.1f, SD(placebo)=%0.1f" % (np.around(placebo_sample[placebo].mean(),1), np.around(placebo_sample[placebo].std(),1)))

In [None]:
placebo_fig

## Treatment group

In [None]:
%%capture
treatment_fig = plt.figure(figsize=figsize)
ax = treatment_fig.gca()
for i in range(200):
   if not placebo[i]:
       ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
       ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
       ax.text(X[i,0]-0.2, X[i,1], '%d' % treatment_sample[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
plt.title("Treatment sample: average(treatment)=%0.1f, SD(treatment)=%0.1f" % (treatment_sample[~placebo].mean(), treatment_sample[~placebo].std()))


In [None]:
treatment_fig

## Observed data

In [None]:
%%capture
observed_fig = plt.figure(figsize=figsize)
ax = observed_fig.gca()
for i in range(200):
   ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
   ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
   if not placebo[i]:
       ax.text(X[i,0]-0.2, X[i,1], '%d' % treatment_sample[i], ha='center', va='center', size=10)
   else:
       ax.text(X[i,0]+0.2, X[i,1], '%d' % placebo_sample[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])

In [None]:
observed_fig

## Carrying out the test

* We know $$\begin{aligned}
       \text{average(treatment)} &= 2.3 \\
       \text{SE(average(treatment))} &= 3.1 / \sqrt{100} = 0.31 \\
       \text{average(placebo)} &= 2.6 \\
       \text{SE(average(placebo))} &= 2.9 / \sqrt{100} = 0.29 \\
       \end{aligned}$$
       
* Therefore, $$\begin{aligned}
       \text{SE(average(treatment)-average(placebo))}
       & = \sqrt{0.31^2+0.29^2} \\
       &= 0.42
       \end{aligned}$$
 and $$z = \frac{2.3-2.6}{0.42} = -0.7$$
       
* The one-sided $P$-value is about 24%, when we test $$H_0: \text{average of all $A$'s $\geq$ average of all $B$'s}.$$


## Randomized experiments for binary responses

* The randomization model applies not only to quantitative responses like the number of colds. The tickets can also have 0-1 values.
* In an example discussed in the book, doctors are asked to 
read information about a surgical procedure and decide whether to
recommend surgery or radiation.
* There are two groups, each given different forms (A and B) which have
the same numerical data but worded slightly differently.
* We assume that there are two responses for each doctor: one
if they had read form A, the other if they had read form B.
* In the experiment, we only observe ${A}$ or ${B}$ for each subject.
* Which one we see depends on the randomization.
* Statistical theory says that, if the sample is large enough, the $z$-test can be used to test $$H_0: \text{proportion of all 1's in $A$} \geq 
 \text{proportion of all 1's in $B$}$$
against the alternative
 $$H_a: \text{proportion of all 1's in $A$} < 
 \text{proportion of all 1's in $B$}$$

### Box model for binary outcome

In [None]:
%%capture
forms = np.array([1]*80 + [0]*87, bool)
box_fig = plt.figure(figsize=figsize)
ax = box_fig.gca()
np.random.seed(0)
X = np.mgrid[0:1:2j,0:1:2j].reshape((2,4))  # + np.random.sample((2,4)) * 0.05
X = X.T

sampleA = np.array([0,0,1,1])
sampleB = np.array([0,1,0,1])
for i in range(4):
    
   ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.4,X[i,1]+.4], facecolor='green', alpha=0.3)
   ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.4,X[i,1]+.4], facecolor='red', alpha=0.3)
   ax.text(X[i,0]-0.2, X[i,1], '%d' % sampleA[i], ha='center', va='center', size=20)
   ax.text(X[i,0]+0.2, X[i,1], '%d' % sampleB[i], ha='center', va='center', size=20)
ax.set_xticks([]);    ax.set_xlim([-0.8,1.8])
ax.set_yticks([]);    ax.set_ylim([-0.6,1.6])
ax.set_title('Box model', fontsize=15)

In [None]:
box_fig

### Box model

- Tickets are drawn repeatedly at random from the above box 
with **weights**.

- Alternatively, the box could have many more tickets, with the **weights**
representing the proportions in which the above 4 ticket types appear in the big box.
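
Drawing from such a weighted box might look like the sketch below. The weights are hypothetical, chosen only so that about half of the $A$ values and about 85% of the $B$ values are 1; the names `tickets`, `weights`, `a_values`, and `b_values` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# The four ticket types (A, B) shown in the box above.
tickets = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hypothetical weights: the proportion of each ticket type in the big box.
weights = np.array([0.1, 0.4, 0.05, 0.45])

# Draw 167 tickets at random, one per doctor, according to the weights.
draws = tickets[rng.choice(len(tickets), size=167, p=weights)]
a_values, b_values = draws[:, 0], draws[:, 1]

print(a_values.mean(), b_values.mean())
```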

### Complete data

In [None]:
%%capture
forms = np.array([1]*80 + [0]*87, bool)
binary_fig = plt.figure(figsize=figsize)
ax = binary_fig.gca()
np.random.seed(0)
X = np.mgrid[0:10:10j,0:20:17j].reshape((2,170))   + np.random.sample((2,170)) * 0.05
X = X.T

while True:
    sample1 = np.random.binomial(1, 0.5, size=80)
    sample2 = np.random.binomial(1, 0.84, size=87)
    if sample1.sum() == 40 and sample2.sum() == 73:
        break
sampleA = np.hstack([sample1, np.random.binomial(1,0.6, size=87)])
sampleB = np.hstack([np.random.binomial(1,0.6, size=80), sample2])
idx = np.arange(sampleA.shape[0])
np.random.shuffle(idx)
forms = forms[idx]
sampleA = sampleA[idx]
sampleB = sampleB[idx]

for i in range(sampleA.shape[0]):
   ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
   ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
   ax.text(X[i,0]-0.2, X[i,1], '%d' % sampleA[i], ha='center', va='center', size=10)
   ax.text(X[i,0]+0.2, X[i,1], '%d' % sampleB[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
ax.set_title('Complete data (unobserved)', fontsize=15)

In [None]:
binary_fig

## Doctors receiving form A

In [None]:
%%capture
formA_fig = plt.figure(figsize=figsize)
ax = formA_fig.gca()
for i in range(forms.shape[0]):
   if forms[i]:
       ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
       ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
       ax.text(X[i,0]-0.2, X[i,1], '%d' % sampleA[i], ha='center', va='center', size=10)

ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
ax.set_title("Form A: average(surgery)=%0.2f" % (sampleA[forms].mean()))


In [None]:
formA_fig

## Doctors receiving form B

In [None]:
%%capture
formB_fig = plt.figure(figsize=figsize)
ax = formB_fig.gca()
for i in range(forms.shape[0]):
   if not forms[i]:
       ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
       ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
       ax.text(X[i,0]+0.2, X[i,1], '%d' % sampleB[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
ax.set_title("Form B: average(surgery)=%0.2f" % (sampleB[~forms].mean()))


In [None]:
formB_fig

### Observed data

In [None]:
%%capture
observed_fig = plt.figure(figsize=figsize)
ax = observed_fig.gca()
for i in range(forms.shape[0]):
   ax.fill_between([X[i,0]-.4,X[i,0]], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='green', alpha=0.3)
   ax.fill_between([X[i,0],X[i,0]+.4], [X[i,1]-.4,X[i,1]-.4], [X[i,1]+.3,X[i,1]+.3], facecolor='red', alpha=0.3)
   if forms[i]:
       ax.text(X[i,0]-0.2, X[i,1], '%d' % sampleA[i], ha='center', va='center', size=10)
   else:
       ax.text(X[i,0]+0.2, X[i,1], '%d' % sampleB[i], ha='center', va='center', size=10)
ax.set_xticks([]);    ax.set_xlim([-1,11])
ax.set_yticks([]);    ax.set_ylim([-1,21])
ax.set_title('Observed data: average(surgery,A), average(surgery, B) = (%0.2f,%0.2f)' % (sampleA[forms].mean(), sampleB[~forms].mean()))

In [None]:
observed_fig

## Carrying out the test

* The numerator is $0.5 - 0.84$. What about the denominator?

* Our estimate of SE is
$$
\sqrt{0.5 * 0.5 / 80 + 0.84 * 0.16 / 87} \approx 6.8\%
$$
(This agrees with the book's 6.8%.)

    - The quantity $0.5 * 0.5 / 80$ is our bootstrap estimate of the SE (squared)
    of the proportion recommending surgery among those doctors who read form A.
    
    - The quantity $0.84 * 0.16 / 87$ is our bootstrap estimate of the SE (squared)
    of the proportion recommending surgery among those doctors who read form B.
      
- Our $z$-score is 
$$z = \frac{0.5-0.84 - 0}{0.068} \approx -5.0$$
(The book gets 5 by using the opposite sign for the
numerator).
- The one-sided $P$-value is almost 0 (as is the two-sided $P$-value which is the
more appropriate quantity to report here).
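
As a cross-check on the $z$-test, one can run a permutation test, which is a different technique that mimics the randomization directly. The sketch below uses the counts from the simulated data above (40 of 80 doctors on form A and 73 of 87 on form B recommending surgery); the number of reshuffles and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed responses (1 = recommends surgery): 40 of 80 on form A, 73 of 87 on form B.
responses = np.concatenate([np.repeat([1, 0], [40, 40]), np.repeat([1, 0], [73, 14])])
n_a = 80
observed_diff = responses[:n_a].mean() - responses[n_a:].mean()

# Under the strict null the form has no effect on any doctor, so re-randomizing
# which 80 doctors got form A should produce differences like the observed one.
n_reps = 10000
perm_diffs = np.empty(n_reps)
for r in range(n_reps):
    shuffled = rng.permutation(responses)
    perm_diffs[r] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

# One-sided P-value: how often chance alone gives a difference this negative.
p_one_sided = np.mean(perm_diffs <= observed_diff)
print(observed_diff, p_one_sided)
```

A reshuffled difference as extreme as the observed one should be very rare, in line with the near-zero $P$-value from the $z$-test.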



## When is the $z$-test OK to use?

- The book discusses when it is OK to use the $z$-test.
- Here, the book is specifically referring to when the $z$-test can be
used for comparing two groups.
- It is OK to use the $z$-test to compare two groups
where the data are *independent*.
- It is also OK to use the $z$-test to compare two groups
in a randomized controlled experiment with a treatment and a placebo.
- For the randomized problems, the box model is that each ticket has
two values: one if the subject receives treatment, the other
if the subject receives placebo.
