In [1]:
# Run this cell to set up your notebook

import numpy as np
from scipy import stats
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# These lines make warnings look nicer
import warnings
warnings.simplefilter('ignore', FutureWarning)

# Week 8 Part 3 #

## Handling Dependence ##
We know:

- $E(X_1 + X_2 + \cdots + X_n) = E(X_1) + E(X_2) + \cdots + E(X_n)$ for **all** $X_1, X_2, \ldots, X_n$
- $Var(X_1 + X_2 + \cdots + X_n) = Var(X_1) + Var(X_2) + \cdots + Var(X_n)$ for **independent** $X_1, X_2, \ldots, X_n$

If $X_1, X_2, \ldots, X_n$ are *not* independent then we do have to calculate the covariance terms in the formula for the variance of the sum:

$$
Var(\sum_{i=1}^n X_i) ~ = ~ \sum_{i=1}^n Var(X_i) ~ + ~ \mathop{\sum\sum}_{1 \le i\ne j \le n} Cov(X_i, X_j)
$$

There are two situations in which the covariance terms are manageable:

- If the random variables being added are indicators
- If there is symmetry

Let's start with indicators.

## Reading 1: Indicators ##
Before you read, remember that if you have two 0/1 valued functions on the same space, then their product is also a 0/1 valued function. At any point in the domain, the product is 1 if and only if both functions have the value 1 at that point.

Now work through how to find the [covariance of two indicators](http://prob140.org/textbook/Chapter_13/03_Sums_of_Simple_Random_Samples.html#Indicators), including the discussion that connects the sign of the covariance and the direction of the association. 

You might want to make this set of rules the home screen on your phone or something equally important:

For indicators $I_A$ and $I_B$:

- $E(I_A) = P(A)$
- $Var(I_A) = P(A)(1-P(A))$
- $Cov(I_A, I_B) = P(AB) - P(A)P(B)$
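
If you want to convince yourself of these rules numerically, here is a minimal sketch. The setup is an assumption made just for illustration: one roll of a fair die, with $A$ the event "even" and $B$ the event "at least 4". Because the six outcomes are equally likely, averaging over the outcome space gives exact expectations. The variance and covariance are computed from their definitions and then compared with the rules above.

```python
import numpy as np

# Assumed illustrative setup: one roll of a fair die, all outcomes equally likely
outcomes = np.arange(1, 7)

# Indicators as 0/1 valued functions on the outcome space
I_A = (outcomes % 2 == 0).astype(int)   # A: the roll is even
I_B = (outcomes >= 4).astype(int)       # B: the roll is at least 4
I_AB = I_A * I_B                        # the product is the indicator of AB

# Outcomes are equally likely, so the mean over outcomes is the expectation
p_A, p_B, p_AB = I_A.mean(), I_B.mean(), I_AB.mean()

# Variance and covariance straight from the definitions
var_def = np.mean((I_A - p_A) ** 2)
cov_def = np.mean((I_A - p_A) * (I_B - p_B))

# The same quantities from the indicator rules
var_rule = p_A * (1 - p_A)
cov_rule = p_AB - p_A * p_B

print("Var(I_A):", var_def, var_rule)
print("Cov(I_A, I_B):", cov_def, cov_rule)
```

In this example the covariance comes out positive, which matches the direction of the association: knowing the roll is at least 4 makes "even" more likely.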

## Example (Not in Textbook) ##
Suppose you make $n$ draws at random with replacement from a population in which 60% of the individuals are red, 30% are blue, and 10% are green.

Let $X$ be the number of colors that don't appear. Find $E(X)$ and $Var(X)$.

The expected number of categories that don't appear ... that's a familiar calculation using indicators.

$X = I_r + I_b + I_g$ where $I_r$ is the indicator of the event that the color red doesn't appear in the $n$ draws and the other two indicators are defined analogously.

Then 

$$
E(X) ~ = ~ 0.4^n + 0.7^n + 0.9^n
$$

$Var(X)$ is "the sum of all the variances and all the covariances". 

- $Var(I_r) = 0.4^n(1 - 0.4^n)$
- $Cov(I_r, I_b) = 0.1^n - 0.4^n0.7^n$

Add all the analogous terms:

$$
Var(X) = 0.4^n(1 - 0.4^n) + 0.7^n(1 - 0.7^n) + 0.9^n(1 - 0.9^n) + \\
2\big{(}(0.1^n - 0.4^n0.7^n) + (0.3^n - 0.4^n0.9^n) + (0.6^n - 0.7^n0.9^n)\big{)}
$$

This is do-able, though a bit clunky to write out.
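
As a sanity check on the algebra, the sketch below compares the formulas for $E(X)$ and $Var(X)$ with a brute-force calculation that lists every possible sequence of $n$ draws. The function names `formula_mean_var` and `brute_force_mean_var` are made up for this illustration, and the brute-force approach is only practical for small $n$ since there are $3^n$ sequences.

```python
import itertools
import numpy as np

# Population proportions from the example
probs = {'red': 0.6, 'blue': 0.3, 'green': 0.1}

def formula_mean_var(n):
    """E(X) and Var(X) from the indicator formulas in the text."""
    r, b, g = 0.4**n, 0.7**n, 0.9**n   # P(no red), P(no blue), P(no green)
    mean = r + b + g
    variances = r*(1 - r) + b*(1 - b) + g*(1 - g)
    covariances = (0.1**n - r*b) + (0.3**n - r*g) + (0.6**n - b*g)
    return mean, variances + 2*covariances

def brute_force_mean_var(n):
    """E(X) and Var(X) by enumerating all 3^n sequences of draws."""
    ex, ex2 = 0.0, 0.0
    for seq in itertools.product(probs, repeat=n):
        p = np.prod([probs[c] for c in seq])   # chance of this exact sequence
        x = 3 - len(set(seq))                  # number of colors that don't appear
        ex += p * x
        ex2 += p * x**2
    return ex, ex2 - ex**2

for n in range(1, 6):
    print(n, formula_mean_var(n), brute_force_mean_var(n))
```

For $n = 1$ exactly two colors are missing no matter what, so the variance should be 0; that is a quick check that the covariance terms have the right signs.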

When there is some symmetry in the problem, however, the variance formula simplifies greatly, as follows.

## Counting Terms ##
On the right hand side of the formula

$$
Var(\sum_{i=1}^n X_i) ~ = ~ \sum_{i=1}^n Var(X_i) ~ + ~ \mathop{\sum\sum}_{1 \le i\ne j \le n} Cov(X_i, X_j)
$$

- The sum of the variances has $n$ terms.
- The sum of the covariances has $n(n-1)$ terms.

(No, not $\binom{n}{2}$ terms; you should think about why.)

**If you have enough symmetry** so that the following are true:

- all the individual variances are the same
- for all pairs, the covariances are the same

then the variance of the sum is just

$$
Var(\sum_{i=1}^n X_i) ~ = ~ nVar(X_1) + n(n-1)Cov(X_1, X_2)
$$
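
One way to see both the term count and the simplification is to lay out all the variances and covariances in an $n \times n$ grid: variances on the diagonal, covariances off it. The variance of the sum is the sum of every entry in the grid. The sketch below does this with numbers chosen purely for illustration (`n`, `v`, and `c` are assumptions, not values from the text).

```python
import numpy as np

# Assumed illustrative values: n variables, common variance v, common covariance c
n = 6
v, c = 2.0, -0.3

# Grid of covariances: v on the diagonal, c everywhere else
cov_matrix = np.full((n, n), c)
np.fill_diagonal(cov_matrix, v)

# Count the terms: n diagonal entries, and the rest are off-diagonal
num_variance_terms = n
num_covariance_terms = cov_matrix.size - n

# Variance of the sum: add every entry, then compare with the symmetry formula
var_from_grid = cov_matrix.sum()
var_from_formula = n*v + n*(n - 1)*c

print(num_variance_terms, num_covariance_terms)
print(var_from_grid, var_from_formula)
```

The grid has an entry for $(i, j)$ and a separate entry for $(j, i)$, which is why the covariance count is $n(n-1)$.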

The most significant example makes use of symmetry in simple random sampling:

## Reading 2: Variance of Hypergeometric (Huge Example) ##

Work through [this](http://prob140.org/textbook/Chapter_13/03_Sums_of_Simple_Random_Samples.html#Variance-of-the-Hypergeometric) and stop after the first line of calculation:

$$
n\frac{G}{N}\cdot\frac{B}{N} + n(n-1)\big{(}\frac{G}{N}\cdot\frac{G-1}{N-1} - \frac{G}{N}\cdot\frac{G}{N}\big{)}
$$

You should be able to understand where every piece of that came from.

Ignore the subsequent algebra: we'll avoid it in the next Part by doing a more general calculation.

Just go to the final result:

$$
Var(X) ~ = ~ npq\frac{N-n}{N-1} ~~~~~~ \text{ where } p = \frac{G}{N}, ~q = 1-p
$$

Look at that: the same as the variance of the binomial, but multiplied by a factor that is less than 1 whenever $n > 1$. For the same $n$ and $p$ ($=G/N$), the hypergeometric histogram is narrower than the binomial. This confirms our intuitive sense that sampling without replacement should be more accurate (less deviation from the expected value) than sampling with replacement.
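
Since we are skipping the algebra, it is reasonable to check numerically that the first line of the calculation and the final result agree. The sketch below does that for one assumed set of values ($N = 100$, $G = 30$, $n = 10$, chosen only for illustration) and also compares with `scipy.stats`. Note that SciPy's `hypergeom` takes its arguments in the order (population size, number of good elements, sample size).

```python
from scipy import stats

# Assumed illustrative values: population N, good elements G, sample size n
N, G, n = 100, 30, 10
B = N - G
p, q = G / N, B / N

# First line of the calculation: n variances plus n(n-1) covariances
first_line = n * p * q + n * (n - 1) * ((G / N) * ((G - 1) / (N - 1)) - p * p)

# Final result, with the correction factor (N - n)/(N - 1)
final_result = n * p * q * (N - n) / (N - 1)

# Variances according to scipy, for comparison
scipy_hypergeom_var = stats.hypergeom.var(N, G, n)
binomial_var = stats.binom.var(n, p)

print("first line:      ", first_line)
print("final result:    ", final_result)
print("scipy hypergeom: ", scipy_hypergeom_var)
print("binomial npq:    ", binomial_var)
```

The first three numbers should agree, and all of them should be smaller than the binomial variance.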

### Example ###
Applications are numerous: if you recognize that a random variable has a hypergeometric distribution, you can just plug into formulas for mean and variance.

- If $X$ is the number of kings in a 5-card poker hand dealt from a standard deck, then $X$ is hypergeometric $(52, 4, 5)$. So $E(X) = 5\cdot\frac{4}{52}$ and $Var(X) = 5\cdot\frac{4}{52}\cdot\frac{48}{52}\cdot\frac{52-5}{52-1}$.
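
If you would rather not take the plug-in formulas on faith, here is a sketch that computes the mean and variance of the number of kings directly from the hypergeometric probabilities and compares them with the formulas. It uses exact fractions so there is no rounding; the helper name `pmf` is just a label chosen for this illustration.

```python
from fractions import Fraction
from math import comb

# Kings in a 5-card hand: population 52, good elements 4, sample size 5
N, G, n = 52, 4, 5

def pmf(k):
    """Exact chance of k kings in the hand (hypergeometric probability)."""
    return Fraction(comb(G, k) * comb(N - G, n - k), comb(N, n))

# Mean and variance computed directly from the distribution
ks = range(0, min(G, n) + 1)
mean = sum(k * pmf(k) for k in ks)
var = sum(k**2 * pmf(k) for k in ks) - mean**2

# The plug-in formulas from the text
p = Fraction(G, N)
formula_mean = n * p
formula_var = n * p * (1 - p) * Fraction(N - n, N - 1)

print(mean == formula_mean, var == formula_var)
print(float(mean), float(var))
```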

## Vitamins ##

**1.** True or false: If $Cov(I_A, I_B) < 0$ then $P(AB) < P(A)P(B)$

**2.** In the formula for $Var(X_1 + X_2 + \cdots + X_n)$, how many terms are there in the sum $\sum_{i=1}^n Var(X_i)$?

**3.** In the formula for $Var(X_1 + X_2 + \cdots + X_n)$, how many terms are there in the sum $\mathop{\sum\sum}_{1 \le i\ne j \le n} Cov(X_i, X_j)$?

**4.** A bowl contains 20 marbles, 14 of which are red. A simple random sample of 8 marbles is drawn. Use the code cell below to find the expectation and variance of the number of red marbles drawn. You should get 5.6 and roughly 1.061.

In [None]:
# scratch work for vitamins


## Break time. Very clever argument coming up. ##

### Note on Review Set ###
In [Review Set 3](http://prob140.org/textbook/Chapter_15/06_Review_Problems_Set_3.html) you can now do 7, 23-27 (some of which you could also have done after Part 2), 29, 30.