In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from numpy import mean, sqrt, std

# stats60 specific
from code.utils import sample_density
from code.probability import Sample
figsize = (8,8)

## Presidential approval ratings

* If you ever watch CNN / Fox News / MSNBC, you don't have
to wait long before you hear reports about opinion polls.

* [Rasmussen](http://www.rasmussenreports.com/public_content/politics/obama_administration/daily_presidential_tracking_poll) tracks presidential approval rating.
 
* So does [Gallup](http://www.gallup.com/poll/113980/gallup-daily-obama-job-approval.aspx)

* Both polls are trying to measure the same thing.
* Why don’t they agree?
* Polls are sometimes wrong: [Truman / Dewey 1948](http://www.chicagotribune.com/news/politics/chi-chicagodays-deweydefeats-story,0,6484067.story)
    - Part of the error in this poll was due to the sampling: *quota sampling*.
    - A random sample might have reduced this bias.

## Sources of bias

* Some would say that Rasmussen is biased (others might say 
Gallup is biased).

* In the 2012 presidential election, Mitt Romney thought
the [polls were wrong](http://www.thewire.com/politics/2012/11/whole-romney-ticket-believed-unskewed-polls/58852/).

* It is possible the polls were somewhat biased.
* Some examples of bias:
     * If what you measure differs between employed and unemployed people, then calling during the daytime can bias the sample towards the unemployed.
     * If what you measure differs between cell-phone users and land-line users, then using a telephone book can bias the sample towards land-line owners.

* Bias cannot be remedied by taking a bigger poll: see [1936 election](http://en.wikipedia.org/wiki/United_States_presidential_election,_1936).

* **Read Chapter 19 of the textbook for discussion of sampling bias.**

## Model for percentages

* An opinion poll to estimate a percentage is based on a sample of size $N$. 

* We assume this is a **simple random sample**: a sample of size $N$ without replacement.

* We will draw from boxes with `B`=blue and `R`=red marbles. Our goal
is to estimate the percentage of blue marbles in the box.

* The sample is from a box with one marble per population member, each having a 0-1 label with `B` corresponding to 1 and `R` corresponding to 0.

* The estimated percentage is 

       estimated percentage of B = (sum of N draws from box) / N
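
As a minimal sketch of this model (using plain `numpy` rather than the course's `Sample` class, and with a box size and seed chosen only for illustration), one poll can be simulated by drawing 0-1 labels without replacement and dividing the sum by $N$:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

# Box with one 0-1 label per population member: 1 = B, 0 = R.
box = np.array([1] * 500 + [0] * 500)

N = 20
draws = rng.choice(box, size=N, replace=False)  # simple random sample

# estimated percentage of B = (sum of N draws from box) / N
estimated_percentage = 100 * draws.sum() / N
print(f"estimated percentage of B = {estimated_percentage:.1f}%")
```

Rerunning with a different seed gives a different estimate, which is the sampling variability explored in the following sections.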
  
 
  
  

## A population with 500 Red, 500 Blue

In [None]:
opinion_poll = Sample(500, 500, 20)


In [None]:
opinion_poll.nblue, opinion_poll.nred, opinion_poll.ndraw

In [None]:
opinion_poll.trial()

In [None]:
opinion_poll.figure

## Sampling variability in opinion polls

In [None]:
opinion_poll.draw_fig = False
opinion_poll.sample(5)

In [None]:
fig = plt.figure(figsize=figsize)
ax = sample_density(opinion_poll.sample(10000), facecolor='orange')[0]
ax.set_title('10000 repeats of drawing 20 from box', fontsize=15)

## Sampling variability

- As expected, with a larger sample, there is less variability.

- There may also be an effect due to the fact that we are sampling without
replacement.

- We will see later that this effect is relatively small.

In [None]:
opinion100 = Sample(500, 500, 100)
opinion100.sample(5)

In [None]:
opinion100.draw_fig = False
fig100 = plt.figure(figsize=figsize)
ax100 = sample_density(opinion100.sample(10000), facecolor='orange')[0]
ax100.set_title('10000 repeats of drawing 100 from box', fontsize=15)

## Larger box: 5000 blue, 5000 red

In [None]:
opinion100_larger = Sample(5000, 5000, 100)
opinion100_larger.alpha = 0.1
opinion100_larger.ptsize = 10
opinion100_larger.sample(5)

In [None]:
opinion100_larger.draw_fig = False
fig100_larger = plt.figure(figsize=figsize)
ax100_larger = sample_density(opinion100_larger.sample(10000), facecolor='orange')[0]
ax100_larger.set_title('10000 repeats of drawing 100 from larger box', fontsize=15)

## Taking a sample of size 1000

In [None]:
opinion1000 = Sample(5000, 5000, 1000)
opinion1000.alpha = 0.1
opinion1000.ptsize = 3
opinion1000.sample_ptsize = 30
opinion1000.sample(5)

In [None]:
opinion1000.draw_fig = False
fig1000 = plt.figure(figsize=figsize)
ax1000 = sample_density(opinion1000.sample(2000), facecolor='orange')[0]
ax1000.set_title('2000 repeats of drawing 1000 from larger box', fontsize=15)

## Estimating a percentage     

- In our boxes, the proportion of blue to red is fixed: there
     are 50% blue and 50% red. Our goal is to estimate
     the proportion of blue.
- When we sample, we don't recover exactly 50%. Our observed
     proportion is a *chance process* (i.e. a
     random variable).
     
- We also call this random variable a *statistic*. A statistic is a 
chance process (random variable) computed on a sample.
     
- Here is a (hopefully familiar) model:

      estimated percentage of B = 50% + chance error

  
- **Note:** If the poll is biased, a better model is

      estimated percentage of B = 50% + bias + chance error

## Accuracy of percentages


- How big is the chance error?
   
- Since the observed proportion is random, it has its own SE.
- The rule for computing the SE of a percentage
     for a simple random sample is related to SE of drawing
     from a box of 0's and 1's.
- The standard error after drawing `N=1000` times with replacement
from a box with p=50% `B` marbles
would be:

      SE(estimated percentage of B) = sqrt(p * (1 - p)) / sqrt(N)
                                    = sqrt(1/2 * (1 - 1/2)) / 
                                        sqrt(1000)

- The standard error after drawing `N=1000` times with replacement
from a box with p=57% `B` marbles
would be:

      SE(estimated percentage of B) = sqrt(p * (1 - p)) / sqrt(N)
                                    = sqrt(0.57 * (1 - 0.57)) / 
                                         sqrt(1000)


- If sampling with replacement, the SE does not depend on how many tickets are in the box.

- When sampling without replacement, this SE is approximate (but quite accurate
if the population is large enough).
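
The two computations above can be sketched in Python. The helper name `se_proportion` is an illustrative choice, not part of the course code:

```python
from math import sqrt

def se_proportion(p, n):
    """SE of the estimated proportion for n draws with replacement
    from a 0-1 box whose fraction of 1's is p."""
    return sqrt(p * (1 - p)) / sqrt(n)

# The two boxes discussed above, each with N = 1000 draws.
for p in (0.50, 0.57):
    print(f"p = {p:.2f}, N = 1000: SE = {100 * se_proportion(p, 1000):.2f}%")
```

Both values come out close to 1.6%: the SE changes very little as $p$ moves from 50% to 57%.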


## Unbalanced boxes

- Not all populations are evenly split between `B` and `R`.

- How do we compute the expected percentage and its SE for unbalanced
boxes?

         expected(estimated percentage of B) = proportion B in box
                                             = p[B]
         
- As for the SE:

         SE(estimated percentage of B) = sqrt(p[B] * (1-p[B])) 
                                         / sqrt(N).
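
A quick simulation can check both formulas. This is only a sketch: the box composition (7000 `B`, 3000 `R`), the sample size, and the seed are hypothetical values chosen for illustration, and `numpy`'s hypergeometric sampler stands in for drawing without replacement.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility

n_blue, n_red, n_draw = 7000, 3000, 100   # hypothetical unbalanced box
p_blue = n_blue / (n_blue + n_red)

# Number of B marbles in each of 10000 samples drawn WITHOUT replacement.
counts = rng.hypergeometric(ngood=n_blue, nbad=n_red, nsample=n_draw, size=10000)
estimates = counts / n_draw

# With-replacement formula: sqrt(p[B] * (1 - p[B])) / sqrt(N)
formula_se = np.sqrt(p_blue * (1 - p_blue)) / np.sqrt(n_draw)

print("mean of estimates:", estimates.mean(), "vs p[B]:", p_blue)
print("SD of estimates:  ", estimates.std(), "vs formula SE:", formula_se)
```

The simulated mean should sit near `p[B]` and the simulated SD near the formula SE, with a small gap because the draws are made without replacement.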

## Correction factor for SE

* To get the actual SE when we sample without replacement, we should multiply by a correction factor.
* If there are $N_{pop}$ people in the population and we sample $N_{sample}$
then
$$
\begin{aligned}
\text{correction factor} &= \sqrt{\frac{N_{pop}-N_{sample}}{N_{pop}-1}} \\
&= \sqrt{\frac{10000-1000}{9999}} \\
\end{aligned}$$
* For example, in our largest boxes, with $N_{pop}=10000$ and $N_{sample}=1000$,
the correction factor is about 0.95.
* An organization like Gallup samples from a MUCH BIGGER population, so this
correction factor is virtually 1.

          SE(drawing WITHOUT replacement) = 
               correction factor * (SE drawing WITH replacement)
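
To see how quickly the correction factor approaches 1, here is a small sketch. The function name and the 25 million population figure are illustrative assumptions, not values from these notes:

```python
from math import sqrt

def correction_factor(n_pop, n_sample):
    """Correction factor for the SE when sampling without replacement."""
    return sqrt((n_pop - n_sample) / (n_pop - 1))

print(correction_factor(10_000, 1_000))       # largest box used in these notes
print(correction_factor(25_000_000, 1_000))   # hypothetical large population
```

For the large population the factor is indistinguishable from 1 for practical purposes, which is why the with-replacement SE is usually good enough.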

In [None]:
sqrt((10000 - 1000) / (10000 - 1.))

## Example

- Suppose 57% of voting age Californians approve of Governor Brown’s job performance. 

- If we sample 100 voting age Californians at random, we expect the percentage who approve of Brown’s performance to be $\bbox[5px,border:2px solid red]{57\%}$, give or take $\bbox[5px,border:2px solid red]{\sqrt{0.57 ( 1 - 0.57) / 100} * 100\%}$.

## Example

- **Answer** The expected percentage or proportion is 57%. The SE is (ignoring the correction factor) 

$$\text{ SE(estimated percentage)} = 
\sqrt{0.57 * 0.43} / \sqrt{100} \approx 5\%$$  
     

* We expect the percentage to be 57% give or take 5%.

## Example (continued)

- What if we had sampled 1000 voting age Californians?
- The expected proportion is the same. The SE is now (ignoring the correction factor) 

$$\text{ SE(estimated percentage)} = 
\sqrt{0.57 * 0.43} / \sqrt{1000} \approx 1.6\%$$

- A sample size of 1000 is quite commonly used because, in a balanced box, the
SE is roughly 1.6%, so the 2 SD rule gives a margin of about $\pm 3.2\%$.

## A more realistic picture

* In practice, when a poll is carried out, we don’t know how many `B` or `R`
there are in the population.

* Taking a sample of size 500, say, gives us information on 500 members of this population.
* Our goal, as statisticians, is to give the politicians some idea of the *true*
   proportion of `B` vs. `R` 
   in the box

* This is our first true *statistics*
   problem …

### Box we will poll …

In [None]:
%%capture
brown_approval = Sample(5700, 4300, 500)
brown_approval.alpha = 0.1
brown_approval.ptsize = 10
brown_approval.draw(color={'R':'gray','B':'gray'})

In [None]:
brown_approval.figure

### Poll sample

In [None]:
prop = brown_approval.trial(bgcolor={'R':'gray','B':'gray'},
                            color={'R':'red','B':'blue'})
brown_approval.figure

## Can we use the normal approximation?

- Applying the rule $np > 10$ and $n(1-p)>10$ to the sample proportion
suggests we should have at least 10 votes for Brown and 10 votes against Brown in our poll.
- This is not quite the same rule because we don't know the true $p$; we only
observe $\hat{p}$...
