In [None]:
%matplotlib inline
%load_ext rpy2.ipython
import matplotlib.pyplot as plt
import numpy as np
from numpy import mean, sqrt, std, fabs

# stats60 specific
from code import roulette
from code.probability import Normal, SampleMean, Uniform
from code.utils import sample_density
figsize = (8,8)

## Measurement models and draws from a box

- We have seen how to deal with the average of draws from a box.
- Not all real measurements fit this model; when they do not, our SE rules do not apply.
- Examples for which our rules do not apply:
     - [Population of U.S.](http://www.census.gov/popclock/) by year: it is always increasing.
     - Daily max temperature in Palo Alto: there is a seasonal trend in it.

In [None]:
%%R -h 800 -w 800
PA.temp <- read.table('http://stats191.stanford.edu/data/paloaltoT.table', header=F, skip=2)
plot(PA.temp[,3], xlab='Day', ylab='Average Max Temp (F)', pch=23, bg='orange')

## Gauss model

- The Gauss model assumes that each measurement has the form

         measurement = true value + chance error
         
      
- When the Gauss model holds, taking a measurement corresponds to drawing from an error box and adding the true value.

- If the measurement is biased, the Gauss model is


         measurement = bias + true value
                       + chance error
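
A minimal sketch of the two forms of the model, using only the Python standard library rather than the course's `code` package. The error box is assumed here to be normal with average 0, and the names `measure`, `error_sd` and `bias` are illustrative, not part of the course code.

```python
import random

def measure(true_value, error_sd, bias=0.0, rng=random):
    """One measurement: bias + true value + chance error.

    The chance error is a draw from an error box with average 0;
    a normal error box is assumed here purely for illustration.
    """
    chance_error = rng.gauss(0.0, error_sd)
    return bias + true_value + chance_error

rng = random.Random(0)  # fixed seed so the run is reproducible
unbiased = [measure(3.0, 2.0, rng=rng) for _ in range(20000)]
biased = [measure(3.0, 2.0, bias=0.5, rng=rng) for _ in range(20000)]

avg_unbiased = sum(unbiased) / len(unbiased)
avg_biased = sum(biased) / len(biased)
print(avg_unbiased)  # close to the true value 3
print(avg_biased)    # close to true value + bias = 3.5
```

Averaging many measurements shrinks the chance error but leaves the bias untouched.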


## Sampling from the Gauss model

- Suppose we observe a sample of $n$ draws $[X_1, \dots, X_n]$ from the Gauss model.
- Then, $$\begin{aligned}
       E(\bar{X}) &= \text{true value} \\
       \text{SE}(X_1) &= \text{SE(one draw from error box)} \\
       \text{SE}(\bar{X}) &= \frac{1}{\sqrt{n}}  \text{SE(one draw from error box)}
       \end{aligned}$$
- A reasonable estimate of $\text{SE}(\bar{X})$ is
$$
\text{SE}(\bar{X}) \approx \frac{1}{\sqrt{n}} \text{SD}([X_1, \dots, X_n]).
$$

- If you know the SE from previous data, use the true SE rather than the bootstrap estimate.
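
The estimate above can be sketched as follows. The data are simulated under assumed values (true value 3, normal error box with SE 2), and `sd` and `estimated_se_of_average` are hypothetical helper names. The SD divides by $n$, which is also the default convention of `numpy.std`.

```python
import math
import random

def sd(values):
    """SD of a list, dividing by n."""
    avg = sum(values) / float(len(values))
    return math.sqrt(sum((v - avg) ** 2 for v in values) / len(values))

def estimated_se_of_average(values):
    """Bootstrap estimate: SD of the sample divided by sqrt(n)."""
    return sd(values) / math.sqrt(len(values))

rng = random.Random(1)
n = 400
sample = [3.0 + rng.gauss(0.0, 2.0) for _ in range(n)]

se_hat = estimated_se_of_average(sample)
se_true = 2.0 / math.sqrt(n)  # known here only because the data are simulated
print(se_hat, se_true)
```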

## No box, no inference

- If you can’t accurately describe your chance process as drawing from a box, you can’t use these formulae for the SE, because they were all based on drawing from a box.
- Example: suppose a problem with your computer causes each draw from the box to be inserted into your list twice instead of once.
- Suppose the box is [1,3,5,7] and you observe $[1,1,3,3,5,5]$. The usual estimate of the SE for a sum of 6 draws starts from an estimated SD(box) of
$$\widehat{\text{SD(box)}} = \sqrt{\frac{1}{6} (2 \times (-2)^2 + 2 \times 0^2 + 2 \times 2^2)} = 1.63$$
- The bootstrap rule for estimating  SE(sum of 6 draws from box)
   will yield $$\widehat{\text{SE(sum of 6 draws)}} = \sqrt{6} \times 1.63 = \sqrt{6} \times \sqrt{\frac{16}{6}} = 4.$$
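
The arithmetic above can be checked directly. This is a small sketch using only the standard library; the variable names are illustrative.

```python
import math

observed = [1, 1, 3, 3, 5, 5]  # each real draw was recorded twice
n = len(observed)
avg = sum(observed) / float(n)

# Estimated SD(box): SD of the observed list, dividing by n.
sd_hat = math.sqrt(sum((x - avg) ** 2 for x in observed) / n)

# Bootstrap rule, treating the 6 entries as 6 independent draws.
se_sum_naive = math.sqrt(n) * sd_hat

print(round(sd_hat, 2))        # 1.63
print(round(se_sum_naive, 2))  # 4.0
```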

## Example (continued)

- **But** the sum of these 6 entries is actually twice the sum of 3 draws. Using the same estimated SD(box), its SE is $$2 \times \sqrt{3} \times \widehat{\text{SD(box)}} = 2 \times \sqrt{3} \times \sqrt{\frac{8}{3}} = \sqrt{32} \approx 5.66$$
- So we will have underestimated the actual SE.
- This is not an artifact of only taking 6 draws.
- Ignoring the duplicates will yield an estimate that is too small by a factor of $1/\sqrt{2}$.
- The normal approximation will still hold for the sum of draws with duplicates, but we will have the wrong SE.
- Our confidence intervals will be too small!
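
A simulation sketch of the duplicated-draw problem, using only the standard library. It uses the true SD of the box [1,3,5,7], which is $\sqrt{5}$, rather than the estimate from the six observed entries, and `duplicated_sum` is a hypothetical helper name.

```python
import math
import random

box = [1, 3, 5, 7]
box_avg = sum(box) / float(len(box))
sd_box = math.sqrt(sum((x - box_avg) ** 2 for x in box) / len(box))

def duplicated_sum(rng, n_real_draws=3):
    """Sum of a list in which every real draw was recorded twice."""
    draws = [rng.choice(box) for _ in range(n_real_draws)]
    return 2 * sum(draws)

rng = random.Random(2)
sums = [duplicated_sum(rng) for _ in range(50000)]
mean_sum = sum(sums) / float(len(sums))
actual_se = math.sqrt(sum((s - mean_sum) ** 2 for s in sums) / len(sums))

naive_se = math.sqrt(6) * sd_box        # treats the 6 entries as independent draws
correct_se = 2 * math.sqrt(3) * sd_box  # twice the sum of 3 independent draws

print(actual_se, correct_se, naive_se)
print(correct_se / naive_se)  # sqrt(2), about 1.414
```

The simulated SE should track `correct_se`, while the naive formula comes out smaller by the factor $1/\sqrt{2}$.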

## A special case of the Gauss model

- A special case of the Gauss model is when the errors
follow a normal curve.

- The normal curve is also called the *Gaussian* distribution.

- The book does not assume the errors follow the normal curve, but
tells you when they do.

- I often use "the Gauss model" to mean the Gauss model with Gaussian (normal) errors, treating the two as interchangeable.

In [None]:
true_value = 3
SE_error_box = 2
normal_model = Normal(true_value, SE_error_box)
print(normal_model.trial())
mean(normal_model.sample(2000)), std(normal_model.sample(2000))

In [None]:
%%capture
normal_model_fig = plt.figure(figsize=figsize)
ax = sample_density(normal_model.sample(15000), bins=30, facecolor='orange')[0]
ax.set_title('True value = 3, normal error box SE = 2')

In [None]:
normal_model_fig

## Sample averages with normal errors

- The average of a sample of any size follows the normal curve if the errors
in the box follow the normal curve.

In [None]:
sample_mean = SampleMean(normal_model, 3)
sample_mean.trial()

In [None]:
std(sample_mean.sample(5000)), 2 / sqrt(3)

In [None]:
%%capture
sample_mean_fig = plt.figure(figsize=figsize)
ax = sample_density(sample_mean.sample(15000), bins=30, facecolor='orange')[0]
ax.set_title(r'True value = 3, sample size 3, sample mean SE = 2 / $\sqrt{3}$')

In [None]:
sample_mean_fig

## A different measurement error

- Not all chance processes will have errors that follow the
normal curve.
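
As one illustration, consider a uniform error box. A uniform distribution on $[a, b]$ has SD $(b-a)/\sqrt{12}$, so errors uniform on $[-\sqrt{3}\,\text{SE}, \sqrt{3}\,\text{SE}]$ have the stated SE. Whether the course's `Uniform(true_value, SE_error_box)` is parametrized this way is an assumption; the sketch below uses only the standard library, and `uniform_measurement` is a hypothetical name.

```python
import math
import random

def uniform_measurement(true_value, error_se, rng):
    """true value + a chance error uniform on [-h, h], with h = sqrt(3) * SE."""
    half_width = math.sqrt(3) * error_se
    return true_value + rng.uniform(-half_width, half_width)

rng = random.Random(3)
draws = [uniform_measurement(3.0, 2.0, rng) for _ in range(50000)]
avg = sum(draws) / len(draws)
sd = math.sqrt(sum((d - avg) ** 2 for d in draws) / len(draws))
print(avg, sd)  # close to 3 and 2

# Averages of 20 such measurements: the SE should be close to 2 / sqrt(20).
averages = [sum(uniform_measurement(3.0, 2.0, rng) for _ in range(20)) / 20.0
            for _ in range(5000)]
avg_of_avgs = sum(averages) / len(averages)
se_avg = math.sqrt(sum((a - avg_of_avgs) ** 2 for a in averages) / len(averages))
print(se_avg, 2 / math.sqrt(20))
```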

In [None]:
other_model = Uniform(true_value, SE_error_box)
mean(other_model.sample(2000)), std(other_model.sample(2000))

In [None]:
%%capture
other_model_fig = plt.figure(figsize=figsize)
ax = sample_density(other_model.sample(15000), bins=30, facecolor='orange')[0]
ax.set_title('True value = 3, error box SE = 2')

In [None]:
other_model_fig

In [None]:
%%capture
other_mean = SampleMean(other_model, 3)
other_mean_fig = plt.figure(figsize=figsize)
ax = sample_density(other_mean.sample(15000), bins=30, facecolor='orange')[0]
ax.set_title(r'True value = 3, sample size 3, sample mean SE = 2 / $\sqrt{3}$')

In [None]:
other_mean_fig

In [None]:
%%capture
other_mean = SampleMean(other_model, 20)
other_mean_fig20 = plt.figure(figsize=figsize)
ax = sample_density(other_mean.sample(15000), bins=30, facecolor='orange')[0]
ax.set_title(r'True value = 3, sample size 20, sample mean SE = 2 / $\sqrt{20}$')

In [None]:
other_mean_fig20

## 2 SD rule revisited

- Earlier in the quarter, we saw the 2SD rule for lists that said:

        For many lists, 95% of the entries will be within 2SD(list) of
        the average(list).
        
- How does that relate to the Gauss model? Suppose we make
$n$ measurements with the Gauss model.

- If the errors are normally distributed, then this statement is true **for every $n$** if
we replace `average(list)` with `true value` and `SD(list)` with `SE(error box)`.

- Even with the sample quantities `average(list)` and `SD(list)`, the statement holds approximately if we take enough measurements.
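
A sketch of the first claim, using only the standard library. With normal errors, the chance that one measurement lands within 2 SE of the true value is $\text{erf}(2/\sqrt{2}) \approx 0.9545$, whatever the sample size. The values 3 and 2 match the model used earlier; `proportion_within_2se` is a hypothetical helper name.

```python
import math
import random

# Exact chance that a normal error lands within 2 SE of 0.
exact = math.erf(2 / math.sqrt(2))
print(round(exact, 4))  # 0.9545

def proportion_within_2se(n, true_value=3.0, error_se=2.0, rng=random):
    """Fraction of n normal-error measurements within 2 SE of the true value."""
    xs = [true_value + rng.gauss(0.0, error_se) for _ in range(n)]
    return sum(abs(x - true_value) < 2 * error_se for x in xs) / float(n)

rng = random.Random(4)
for n in (5, 25, 500):
    # Average the proportion over many repeated samples of size n.
    avg_prop = sum(proportion_within_2se(n, rng=rng) for _ in range(2000)) / 2000
    print(n, round(avg_prop, 3))
```

Because the true value and SE(error box) are used instead of sample quantities, the average proportion should sit near 0.9545 for every $n$.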

In [None]:
def twoSD_proportion(sample_list):
    # proportion of entries within 2 SD(list) of average(list)
    return mean(fabs(sample_list - mean(sample_list)) < 2 * std(sample_list))
twoSD_proportion(normal_model.sample(500))

With the sample average and SD, the 2 SD rule is generally conservative for small $n$, but by $n=25$ its coverage is
pretty accurate.

In [None]:
print('sample size 5: %s' % mean([twoSD_proportion(normal_model.sample(5)) for _ in range(1000)]))
print('sample size 25: %s' % mean([twoSD_proportion(normal_model.sample(25)) for _ in range(1000)]))