In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from numpy import mean, sqrt, std
from scipy.stats import norm as ndist

# stats60 specific
from code.probability import Sample
from code.week1 import standardize_left, normal_curve
figsize = (8,8)

## A more realistic picture

* In practice, when a poll is carried out, we don’t know how many `B` or `R`
there are in the population.

* Taking a sample of size 500, say, gives us information on only 500 members of this population.
* Our goal, as statisticians, is to give the politicians some idea of the *true* proportion of `B` vs. `R` in the box.

* This is our first true *statistics* problem.
   


## Box we will poll: 57% support Gov. Brown

- But we don't know which are his supporters.

- A sample of 1000 should estimate the percentage with an SE of
$$
\sqrt{0.57 \times 0.43 / 1000} \approx 1.6\%.
$$
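
The SE formula above is easy to check numerically. The following is a minimal sketch; the helper name `se_of_proportion` is ours for illustration and is not part of the course code.

```python
import math

def se_of_proportion(p, n):
    """SE of a sample proportion for n draws from a 0/1 box with fraction p of 1's."""
    return math.sqrt(p * (1 - p) / n)

# Box with 57% supporters, poll of 1000 (the numbers used in these notes).
se = se_of_proportion(0.57, 1000)
print(f"SE = {100 * se:.2f}%")  # about 1.57%, which rounds to 1.6%
```
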

In [None]:
%%capture
brown_approval = Sample(5700, 4300, 500)
brown_approval.alpha = 0.1
brown_approval.ptsize = 10
brown_approval.draw(color={'R':'gray','B':'gray'})


In [None]:
brown_approval.figure

### Poll sample

In [None]:
prop = brown_approval.trial(bgcolor={'R':'gray','B':'gray'},
                            color={'R':'red','B':'blue'})
brown_approval.figure


## Normal approximation for estimated percentages

- Estimated percentages are just averages.

- We can use the normal approximation on them: we need to know the
**expected(estimated percentage)** and **SE(estimated percentage)**.


- What are the chances a poll of 1000 voting age Californians would show an estimated approval rating for Brown less than 54%? (Remember, his true approval is  57% for the purposes of these notes.)

In [None]:
%%capture
with plt.xkcd():
    fig = plt.figure(figsize=(10,3))
    ax = fig.gca() 
    standardize_left(54, 57, sqrt(57*43./1000), data=False, standardized=True)

In [None]:
fig

In [None]:
%%capture
normal_fig = plt.figure(figsize=(10,10))
ax = normal_curve()
interval = np.linspace(-4,-1.9,101)
ax.fill_between(interval, 0*interval, ndist.pdf(interval), 
                hatch='+', facecolor='red')
ax.set_title('Area is about 2.8%', fontsize=15)


In [None]:
normal_fig
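
The shaded area can also be computed directly rather than read off the picture. This sketch standardizes 54% using the true proportion and SE, then looks up the left tail of the normal curve. It uses `statistics.NormalDist` from the Python standard library as a stand-in for `ndist.cdf`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

true_pct = 57.0   # true approval, in percent (known only for these notes)
n = 1000

# SE of the estimated percentage, in percentage points.
se_pct = sqrt(true_pct * (100 - true_pct) / n)

# Standardize 54% and take the area to its left under the normal curve.
z = (54.0 - true_pct) / se_pct
chance = NormalDist().cdf(z)

print(f"z = {z:.2f}, chance = {100 * chance:.1f}%")  # z is about -1.92, chance about 2.8%
```
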

## Estimating SE when proportions unknown


- Given a poll, we can work out the *observed*
   proportion of 1’s. Call this 
   $$\bbox[5px,border:2px solid orange]{\widehat{p}} = \text{observed proportion} = \frac{\text{$\#$ B's in sample}}{\text{$\#$ in sample}}$$

- If we knew the true proportion of Brown's supporters, say $p$, we would use
$$
 \text{SE}(\bbox[5px,border:2px solid orange]{\widehat{p}}) \approx  \sqrt{\bbox[5px,border:2px solid blue]{p} \times (1 - \bbox[5px,border:2px solid blue]{p})} / \sqrt{\text{$\#$ in sample}}.
$$

- We do not know $p$, but we can **estimate** the SE of the observed proportion of B’s by plugging in $\widehat{p}$ for $p$:
$$\begin{aligned}
     \text{SE}(\bbox[5px,border:2px solid orange]{\widehat{p}})& \approx  \sqrt{\bbox[5px,border:2px solid orange]{\widehat{p}} \times (1 - \bbox[5px,border:2px solid orange]{ \widehat{p}})} / \sqrt{\text{$\#$ in sample}} \\
         \end{aligned}$$

- The quantity $\sqrt{\widehat{p} \times (1 - \widehat{p})}$ is an estimate of the SD of the box. 

     - It depends on the poll we used to estimate $\bbox[5px,border:2px solid blue]{p}$. 
     - It is not a fixed number: it changes from poll to poll.


- We call this the **bootstrap** estimate of the SD of the box.




## Example

-  A poll is undertaken to measure Governor Brown’s job approval rating: 1000 Californians of voting age are polled, of whom 549 approve of his job performance. What is the estimated approval rating? Estimate its SE.

- The estimated approval rating is 54.9%. We can estimate its SE as $$\bbox[5px,border:2px solid red]{
      \text{SE(approval rating)} \approx \frac{\sqrt{0.549 \times 0.451}}{\sqrt{1000}} \approx 1.6\%.
     }$$
     
- Note that this is *very close* to the actual SE, which is $\sqrt{0.57 \times 0.43 / 1000} \approx 1.6\%$.

- If we were to compute the area under the normal curve above using the *estimated SE* our answer would be almost the same!
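
To see how little the answer changes, the sketch below computes the chance of an estimate under 54% twice: once with the true SE and once with the SE estimated from the poll of 549 out of 1000. The helper `chance_below` is a hypothetical name introduced here, and `statistics.NormalDist` stands in for the normal curve used in the notes.

```python
from math import sqrt
from statistics import NormalDist

n = 1000
p_hat = 549 / n

se_est = sqrt(p_hat * (1 - p_hat) / n)   # bootstrap (plug-in) estimate from the poll
se_true = sqrt(0.57 * 0.43 / n)          # uses the true proportion, known only in these notes

def chance_below(cutoff, center, se):
    """Normal-curve area to the left of cutoff, for a curve centered at center."""
    return NormalDist(mu=center, sigma=se).cdf(cutoff)

area_true = chance_below(0.54, 0.57, se_true)
area_est = chance_below(0.54, 0.57, se_est)

print(f"true SE      = {100 * se_true:.2f}%, area = {100 * area_true:.1f}%")
print(f"estimated SE = {100 * se_est:.2f}%, area = {100 * area_est:.1f}%")
```

Both areas come out near 2.8%, because the two SEs differ by less than a hundredth of a percentage point.
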

## A new statistical concept: confidence intervals

- If we knew Governor Brown’s true approval rating (57%), we would know that a poll of 1000 voting age Californians should show an approval rating of about 57%, give or take 1.6%.

- Using the normal curve, there is about a 95% probability that a poll of 1000 voting age Californians will yield an estimated approval rating in the interval $\bbox[2px,border:2px solid red]{[53.8,60.2]}$, which is 57% plus or minus 2 SEs. So,
$$
P (\text{$\bbox[5px,border:2px solid orange]{\widehat{p}}$ in the interval $[53.8,60.2]$}) \approx 95\%.
$$

- A confidence interval reverses this process: **given a poll of 1000 voting age Californians, what can we say about Governor Brown’s true approval rating?**
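
The 95% figure for the interval $[53.8,60.2]$ comes from the normal approximation. As a rough check, the sketch below adds up exact binomial probabilities instead. It assumes the 1000 draws are independent (as if made with replacement), which is an assumption of this sketch rather than something stated in the notes; `binomial_pmf` is an illustrative helper.

```python
from math import exp, lgamma, log

def binomial_pmf(k, n, p):
    """P(exactly k ones in n independent draws), computed on the log scale."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

n, p = 1000, 0.57

# 53.8% and 60.2% of 1000 correspond to 538 and 602 approvals.
prob_in_interval = sum(binomial_pmf(k, n, p) for k in range(538, 603))

print(f"P(53.8% <= p_hat <= 60.2%) = {prob_in_interval:.3f}")
```

The exact probability should land close to the 95% given by the normal curve.
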

## Reversing the picture

- Recall our model 

$$\text{observed proportion} =  \text{true proportion} + \text{chance error}.$$
      
- Or, 

$$ \bbox[5px,border:2px solid orange]{\widehat{p}} = \bbox[5px,border:2px solid blue]{p} + \text{chance error}.
$$

- We know (if we have a simple random sample):
     - $\text{expected}(\bbox[5px,border:2px solid orange]{\widehat{p}}) = E(\bbox[5px,border:2px solid orange]{\widehat{p}}) = \bbox[5px,border:2px solid blue]{p}.$
     - $\text{SE}(\bbox[5px,border:2px solid orange]{\widehat{p}}) = \sqrt{\bbox[5px,border:2px solid blue]{p} \times (1-\bbox[5px,border:2px solid blue]{p}) / 1000}.$

## Reversing the picture

- The normal approximation says:

$$
   \begin{aligned}
   \text{P} \left(\text{$\bbox[5px,border:2px solid orange]{\widehat{p}}$ greater than $\bbox[5px,border:2px solid blue]{{p}} + \frac{2\sqrt{\bbox[5px,border:2px solid blue]{{p}} \times (1-\bbox[5px,border:2px solid blue]{{p}})}}{\sqrt{1000}}$}\right) &\approx 2.5\% \\
   \text{P} \left(\text{$\bbox[5px,border:2px solid orange]{\widehat{p}}$ less than $\bbox[5px,border:2px solid blue]{{p}} - \frac{2\sqrt{\bbox[5px,border:2px solid blue]{{p}} \times (1-\bbox[5px,border:2px solid blue]{{p}})}}{\sqrt{1000}}$}\right) &\approx 2.5\% \\
   \end{aligned}
   $$
   
- This is the same as saying                                                                                                                                                
   $$
   \begin{aligned}
   \text{P} \left(\text{$\bbox[5px,border:2px solid blue]{{p}}$ less than $\bbox[5px,border:2px solid orange]{\widehat{p}} - \frac{2\sqrt{\bbox[5px,border:2px solid blue]{{p}} \times (1-\bbox[5px,border:2px solid blue]{{p}})}}{\sqrt{1000}}$}\right) &\approx 2.5\% \\
   \text{P} \left(\text{$\bbox[5px,border:2px solid blue]{{p}}$ greater than $\bbox[5px,border:2px solid orange]{\widehat{p}} + \frac{2\sqrt{\bbox[5px,border:2px solid blue]{{p}} \times (1-\bbox[5px,border:2px solid blue]{{p}})}}{\sqrt{1000}}$}\right) &\approx 2.5\% \\
   \end{aligned}
   $$
   
- Or, in other words,
   $$
   \begin{aligned}
   \text{P} \left(\text{$\bbox[5px,border:2px solid blue]{{p}}$ between $\bbox[5px,border:2px solid orange]{\widehat{p}} \pm \frac{2\sqrt{\bbox[5px,border:2px solid blue]{{p}} \times (1-\bbox[5px,border:2px solid blue]{{p}})}}{\sqrt{1000}}$}\right) &\approx 95\% \\
   \end{aligned}
   $$   
   
- If we knew $\bbox[2px,border:2px solid blue]{p}$, the expression inside the probability would be an interval, centered at the observed proportion, that covers the true proportion about 95% of the time.

- But we don't know $\bbox[2px,border:2px solid blue]{p}$.
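
The reversal above can be checked by simulation. The sketch below draws many polls from a box with $p = 0.57$, estimates the two tail probabilities, and confirms that "$\widehat{p}$ is more than 2 SEs above $p$" is the same event as "$p$ is more than 2 SEs below $\widehat{p}$". It assumes independent draws (binomial counts) and an arbitrary random seed; neither comes from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)       # seed chosen arbitrarily for reproducibility
p, n, reps = 0.57, 1000, 100_000

se = np.sqrt(p * (1 - p) / n)                  # SE using the true p
p_hat = rng.binomial(n, p, size=reps) / n      # one observed proportion per simulated poll

# Tail events written in terms of p_hat ...
upper_tail = np.mean(p_hat > p + 2 * se)
lower_tail = np.mean(p_hat < p - 2 * se)

# ... and the same events rewritten in terms of p.
same_upper = np.array_equal(p_hat > p + 2 * se, p < p_hat - 2 * se)
same_lower = np.array_equal(p_hat < p - 2 * se, p > p_hat + 2 * se)

inside = np.mean(np.abs(p_hat - p) <= 2 * se)

print(f"upper tail: {upper_tail:.3f}, lower tail: {lower_tail:.3f}, inside: {inside:.3f}")
print("events agree poll by poll:", same_upper and same_lower)
```

Each tail should come out in the neighborhood of 2.5%, and the two ways of writing each event agree on every simulated poll.
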

## Our first confidence interval

- Since we don't know $\bbox[2px,border:2px solid blue]{p}$, we can't compute 
$SE(\widehat{p})$.

- We use the **bootstrap estimate** instead.

- We call the interval:
$$
\bbox[2px,border:2px solid red]{
\left[\widehat{p} - \frac{2\sqrt{\widehat{p}(1-\widehat{p})}}{\sqrt{1000}},                                                                                            
    \widehat{p} + \frac{2\sqrt{\widehat{p}(1-\widehat{p})}}{\sqrt{1000}}\right] }
$$
an (approximate) 95% confidence interval for the true proportion  $\bbox[2px,border:2px solid blue]{p}$.
 
- We emphasize that this depends only on the estimated proportion and not the true proportion:
$$
\bbox[2px,border:2px solid red]{
\left[\bbox[2px,border:2px solid orange]{\widehat{p}} - \frac{2\sqrt{\bbox[2px,border:2px solid orange]{\widehat{p}}(1-\bbox[2px,border:2px solid orange]{\widehat{p}})}}{\sqrt{1000}},   
\bbox[2px,border:2px solid orange]{\widehat{p}} + \frac{2\sqrt{\bbox[2px,border:2px solid orange]{\widehat{p}}(1-\bbox[2px,border:2px solid orange]{\widehat{p}})}}{\sqrt{1000}} \right] }
$$

- The chances that the true $\bbox[2px,border:2px solid blue]{p}$ is in this  random interval
   are about 95%.
   
- **Note:** I will often drop the word *approximate* when we talk about confidence intervals.

## Example

- Find a 95% confidence interval for Governor Brown’s approval rating given 549 out of 1000 voting age Californians polled approved of his job performance.


         We already saw the estimated SE of the proportion 
         was approximately 1.6%.
         
         A 95% confidence interval is therefore: 
         
 $$\bbox[5px,border:2px solid red]{[54.9-2 \times 1.6, 54.9+2\times 1.6] = [51.7,58.1]}$$
 
- This interval does include 57%, his true approval rating.
We say this interval "covers" the true proportion. 

- Not all 95% confidence intervals do: only about 95% of them do.

- Recall the other interval we computed, $[53.8,60.2]$: if the true proportion is 57%, there is a 95% probability that the observed proportion falls in that interval.
- These intervals are different: one is based on an  observed proportion, the other is based on a box model with the true proportion known to be 57%.
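
The confidence interval in this example can be computed with a few lines of code. This is a minimal sketch; the function name `approx_95_ci` is ours and not part of the course code.

```python
from math import sqrt

def approx_95_ci(successes, n):
    """Approximate 95% confidence interval: p_hat plus or minus 2 estimated SEs."""
    p_hat = successes / n
    se_hat = sqrt(p_hat * (1 - p_hat) / n)   # bootstrap estimate of the SE
    return p_hat - 2 * se_hat, p_hat + 2 * se_hat

low, high = approx_95_ci(549, 1000)
print(f"[{100 * low:.1f}%, {100 * high:.1f}%]")  # [51.8%, 58.0%]
```

The endpoints differ slightly from $[51.7,58.1]$ because the hand calculation rounds the SE up to 1.6% before doubling it; without rounding, the SE is about 1.57%.
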

### An illustration of confidence intervals

In [None]:
%%capture
from random import sample as random_sample

def interval(approval, nsample=1000):
    prop = np.array(random_sample(approval, nsample)).mean()
    SE = np.sqrt(prop * (1 - prop) / nsample)
    return 100 * (prop-2*SE), 100 * (prop+2*SE)

def confidence_intervals(num_intervals=100, true_proportion=.57,
                         npop=100000,
                         nsample=1000):
    # round() guards against floating-point error in npop*true_proportion
    blue = int(round(npop*true_proportion))
    approval = [0] * (npop-blue) + [1] * blue

    fig = plt.figure(figsize=figsize)
    ax = fig.gca()
    missed = 0
    for i in range(num_intervals):
        L, U = interval(approval, nsample=nsample)
        # the interval covers if the true percentage lies between L and U
        in_interval = L <= 100*true_proportion <= U
        if in_interval:
            ax.plot([L,U], [i,i], color='gray', linewidth=2)
        else:
            missed += 1
            ax.plot([L,U], [i,i], color='red', linewidth=5)

    # dashed vertical line at the true percentage
    ax.plot([100*true_proportion]*2,[0,num_intervals], 'k--', linewidth=4)
    ax.set_yticks([])
    ax.set_ylabel('Different polls of size %d' % nsample, rotation='vertical', fontsize=15)
    ax.set_xlabel('Percent who support Governor Brown', fontsize=15)
    ax.set_title("# covering %0.2d%%=%d, # not covering %0.2d%%=%d" % (round(100*true_proportion),
                                                                     num_intervals-missed,
                                                                     round(100*true_proportion),
                                                                     missed))

In [None]:
confidence_intervals()

- The true proportion doesn’t change from poll to poll.
 
- The interval changes with each poll. 

- **The confidence intervals are random.**


- Each random interval either covers the true proportion or it does not.

## Interpreting a confidence interval

- Recall that our confidence interval above was $[51.7,58.1]$.

- It is a 95% confidence interval, but what does that mean?

- It is tempting to say

         There is a 95% chance that the true proportion
         of Brown supporters is in the interval [51.7,58.1].
         
- **This is wrong!**

- Either the true proportion (57%) is in the interval $[51.7,58.1]$ or it
is not. In this case, it happens to be in the interval.

- The 95% refers to repeating the procedure: if we go back, sample another
1000 voters, and form a new interval each time, then over many repetitions
of this process about 95% of the intervals will contain 57%.
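
The "many repetitions" reading can be made concrete with a larger simulation than the 100 intervals plotted earlier. The sketch below repeats the poll 20,000 times and records how often the interval covers 57%. It assumes independent draws (binomial counts) in place of sampling without replacement from a finite box, and the seed is arbitrary; both are choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)       # arbitrary seed for reproducibility
true_p, n, reps = 0.57, 1000, 20_000

# One observed proportion and one estimated SE per simulated poll.
p_hat = rng.binomial(n, true_p, size=reps) / n
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

# Each interval either covers the true proportion or it does not.
covers = (p_hat - 2 * se_hat <= true_p) & (true_p <= p_hat + 2 * se_hat)
coverage = covers.mean()

print(f"fraction of intervals covering 57%: {coverage:.3f}")
```

The fraction should be close to 0.95. Any single interval, such as $[51.7,58.1]$, is just one draw from this process.
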