In [1]:
# Run this cell to set up your notebook

import numpy as np
from scipy import stats
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# These lines make warnings look nicer
import warnings
warnings.simplefilter('ignore', FutureWarning)

# Week 8 Part 4 #

## "Back-calculating" Covariance ##
We've been focusing on using covariance to find variance. Sometimes you can also "work backwards" and use variance to find covariance. Here's an example.

A random number generator draws $n$ times at random with replacement from 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Let $N_0$ be the number of 0s and let $N_9$ be the number of 9s. Find $r(N_0, N_9)$.

First note that the answer should be negative. The number of draws is fixed at $n$, so the more 0s you get, the fewer 9s you will get.

We know: 

- $r(N_0, N_9) = \frac{Cov(N_0, N_9)}{SD(N_0)SD(N_9)}$

- Each of $N_0$ and $N_9$ is binomial $(n, 0.1)$, so $Var(N_0) = Var(N_9) = n\cdot0.1\cdot0.9$

All we need is the numerator, $Cov(N_0, N_9)$. You can write each of $N_0$ and $N_9$ as a sum of indicators and use bilinearity (carefully: the same draw can't be both 0 and 9). But here's another way.

$N_0 + N_9$ is the total number of 0s and 9s, and thus has the binomial $(n, 0.2)$ distribution. So in the formula

$$
Var(N_0 + N_9) ~ = ~ Var(N_0) + Var(N_9) + 2Cov(N_0, N_9)
$$

we know all the variance terms and can therefore solve for the covariance term:

$$
n\cdot0.2\cdot0.8 ~ = ~ n\cdot0.1\cdot0.9 + n\cdot0.1\cdot0.9 + 2Cov(N_0, N_9)
$$

so

$$
Cov(N_0, N_9) ~ = ~ \frac{1}{2}(n\cdot0.2\cdot0.8 - 2n\cdot0.1\cdot0.9)
$$

This gives us the correlation:

$$
r(N_0, N_9) = \frac{Cov(N_0, N_9)}{SD(N_0)SD(N_9)} = \frac{\frac{1}{2}(n\cdot0.2\cdot0.8 - 2n\cdot0.1\cdot0.9)}{n\cdot0.1\cdot0.9} = \frac{0.8 - 0.9}{0.9} = -\frac{1}{9}
$$
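
As a quick sanity check, here is a minimal sketch that computes the covariance two ways with exact fractions: once by back-calculating from the three binomial variances as above, and once directly from the joint distribution of $(N_0, N_9)$. The choice $n = 20$ and the helper name `joint_prob` are assumptions made only for this illustration.

```python
from fractions import Fraction
from math import comb

n = 20                      # number of draws (arbitrary, for illustration)
p = Fraction(1, 10)         # chance a single draw shows a particular digit

# Route 1: back-calculate the covariance from the three binomial variances
var_single = n * p * (1 - p)                 # Var(N_0) = Var(N_9)
var_sum = n * (2 * p) * (1 - 2 * p)          # Var(N_0 + N_9)
cov_back = (var_sum - 2 * var_single) / 2
r_back = cov_back / var_single               # SD(N_0) * SD(N_9) = var_single

# Route 2: E(N_0 * N_9) from the joint (multinomial) distribution
def joint_prob(i, j):
    """P(N_0 = i, N_9 = j): i zeros, j nines, the rest other digits."""
    ways = comb(n, i) * comb(n - i, j)
    return ways * p**i * p**j * (1 - 2 * p)**(n - i - j)

e_product = sum(i * j * joint_prob(i, j)
                for i in range(n + 1) for j in range(n + 1 - i))
cov_direct = e_product - (n * p) ** 2

print(cov_back, cov_direct, r_back)
```

The two covariances agree, and the correlation comes out to $-\frac{1}{9}$ regardless of the value chosen for $n$.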

## Census ##
To be able to apply the method above, you have to already know the variance of the sum as well as the individual variances. 

In the context of simple random sampling, there is one extreme case in which the variance of the sample sum is known, and that is when you have a census. That's when your simple random sample size $n$ equals the population size $N$.

In this case, there is only one possible sample of that size. **For this sample size, there's no variation in any statistic from sample to sample, because there's only one possible sample.** 

So suppose $X_1, X_2, \ldots, X_n$ is a simple random sample from a population of size $N$. Denote the mean of the population by $\mu$ and the variance of the population by $\sigma^2$. 

Now let $S_n = X_1 + X_2 + \cdots + X_n$ be the sample sum.
Then by an old symmetry argument, $X_1, X_2, \ldots, X_n$ are identically distributed, each with the same distribution as the population. So $E(S_n) = n\mu$.

We don't yet have a formula for $Var(S_n)$, **but in the special case $n = N$**, we know that it's 0. That is, $Var(S_N) = 0$.
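
To see the census fact concretely, the sketch below repeatedly draws all $N$ elements without replacement from a small made-up population and collects the distinct values of the sample sum. The population values are hypothetical.

```python
import random

population = [3, 1, 4, 1, 5, 9, 2, 6]   # hypothetical population of size N = 8
N = len(population)

random.seed(0)
# random.sample draws without replacement; with k = N every draw is a census
census_sums = {sum(random.sample(population, k=N)) for _ in range(1000)}

print(census_sums)   # a single value: the population total
```

The order of the draws changes from one repetition to the next, but the sum never does, which is why $Var(S_N) = 0$.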

A lovely piece of math shows us how to use this fact to find $Var(S_n)$ for any $n$.

## Reading: Variance of Simple Random Sample Sum ##

You're all set. Work through [this section of the textbook](http://prob140.org/textbook/Chapter_13/03_Sums_of_Simple_Random_Samples.html#Variance-of-a-Simple-Random-Sample-Sum).

It's just beautiful. It puts together

$$
Var(S_n) ~ = ~ n\sigma^2 + n(n-1)Cov(X_1, X_2) ~~~ \text{ for all } n
$$

and 

$$
0 = Var(S_N) ~ = ~ N\sigma^2 + N(N-1)Cov(X_1, X_2)
$$

to show that

$$
Var(S_n) ~ = ~ n\sigma^2\frac{N-n}{N-1}
$$

Hence

$$
SD(S_n) ~ = ~ \sqrt{n}\sigma\sqrt{\frac{N-n}{N-1}}
$$
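
The variance formula can be checked by brute force on a small population. The sketch below lists every equally likely sample of each size $n$, computes the exact variance of the sample sum, and compares it with $n\sigma^2\frac{N-n}{N-1}$. The population values and the helper names `mean`, `variance`, and `var_sample_sum` are assumptions for this illustration.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 3, 3, 5, 8, 9]          # hypothetical population, N = 6
N = len(population)

def mean(values):
    return Fraction(sum(values), len(values))

def variance(values):
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])

sigma_sq = variance(population)          # population variance (divide by N)

def var_sample_sum(n):
    """Exact Var(S_n), found by listing every equally likely sample of size n."""
    positions = combinations(range(N), n)
    sums = [sum(population[i] for i in s) for s in positions]
    return variance(sums)

for n in range(1, N + 1):
    formula = n * sigma_sq * Fraction(N - n, N - 1)
    print(n, var_sample_sum(n), formula)
```

Samples are indexed by position rather than by value so that repeated values in the population are counted separately. The last two columns agree for every $n$, and both are 0 when $n = N$.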

Notes:

- The formula is the SD of the sample sum when sampling **with** replacement ($\sqrt{n}\sigma$ from Part 2), times a factor that is less than 1 whenever $n > 1$. That factor is called the *finite population correction*. You are "correcting" the formula that is based on with-replacement sampling for the case where you are sampling without replacement.
- The variance of the hypergeometric in Part 3 is the special case when the population consists of $G$ 1s and $B = N-G$ 0s, and hence $\sigma^2 = \frac{G}{N}\cdot\frac{B}{N}$.
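
The hypergeometric special case in the second note can be checked the same way. This sketch computes the variance directly from the hypergeometric probabilities and compares it with the simple random sample formula using $\sigma^2 = \frac{G}{N}\cdot\frac{B}{N}$. The values of $N$, $G$, and $n$ are assumptions chosen only for illustration.

```python
from fractions import Fraction
from math import comb

N, G, n = 52, 4, 5        # hypothetical example: aces in a 5-card hand
B = N - G

# Hypergeometric probabilities P(g good elements in the sample)
probs = {g: Fraction(comb(G, g) * comb(B, n - g), comb(N, n))
         for g in range(max(0, n - B), min(G, n) + 1)}

mean = sum(g * pr for g, pr in probs.items())
var_from_pmf = sum((g - mean) ** 2 * pr for g, pr in probs.items())

sigma_sq = Fraction(G, N) * Fraction(B, N)          # variance of the 0/1 population
var_from_formula = n * sigma_sq * Fraction(N - n, N - 1)

print(var_from_pmf, var_from_formula)
```

Because the arithmetic uses exact fractions, the two printed values are identical rather than merely close.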

## Vitamins ##

**1.** A die is rolled 20 times. Let $X$ be the number of threes and let $Y$ be the number of sixes. True or false: $r(X, Y) \ge 0$.

**2.** All 52 cards are dealt at random without replacement from a standard deck. Let $X$ be the number of aces dealt. Find $Var(X)$.

**3.** A population consists of 1000 households. The number of people in these households has an average of 2.4 and an SD of 1.8. Let $X$ be the total number of people in a simple random sample of 30 households from this population. Find $E(X)$ and $SD(X)$. You should get 72 and roughly 9.71.

In [None]:
# Vitamin 3



## Congratulations! You're done with Chapter 13 ##
We are skipping [Section 13.4](http://prob140.org/textbook/Chapter_13/04_Finite_Population_Correction.html), which takes a closer look at the finite population correction. But it's an easy read, and it has a very useful summary table of variances at the start, so you might want to glance through it.

### Note on Lab ###
Some of the lab can be done based on counting (and hypothesis testing from Data 8), but for the later sections you will need the formula for the variance of the simple random sample sum derived in this Part, applied to a simple random sample from the population of numbers $1, 2, \ldots, N$. The mean and variance of that population are in [Chapter 12](http://prob140.org/textbook/Chapter_12/01_Definition.html#Uniform), but be careful about what $n$ and $N$ mean in all the different places.