# Chapter 4: Expectation

In [None]:
# The source of the content is freely available online
# https://drive.google.com/file/d/1VmkAAGOYCTORq1wxSQqy255qLJjTNvBI/view
# https://projects.iq.harvard.edu/stat110/

<h4>Story 4.3.1 (Geometric Distribution)</h4>

Consider a sequence of independent Bernoulli trials, each with the same success probability $p \in (0,1)$, with trials performed until a success occurs. Let $X$ be the number of failures before the first successful trial. Then $X$ has the Geometric distribution with parameter $p$.

Imagine the Bernoulli trials as a string of zeroes (failures) ending in a single success ($1$). Each $0$ has probability $q = 1-p$ and the final $1$ has probability $p$, so a string of $k$ failures followed by one success has probability $q^kp$.

The PMF of the Geometric is $P(X=k) = q^kp$ for $k = 0,1,2,\ldots,$ where $q=1-p$. This is a valid PMF because, summing a geometric series, we have:

$\sum_{k=0}^{\infty} q^k p = p \sum_{k=0}^{\infty} q^k = p \cdot \frac{1}{1-q} = 1$

There are different conventions for the definition of the Geometric distribution. Some sources define it as the total number of trials, including the success. In this book, the Geometric distribution excludes the success, and the FirstSuccess distribution includes the success. If $X \text{~} Geom(p)$, then $X+1 \text{~} FS(p)$.

<h4>Example 4.3.7 (First Success Expectation)</h4>

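Let $Y \text{~} FS(p)$. We can write $Y = X+1$, where $X \text{~} Geom(p)$ has $E(X) = \frac{q}{p}$. By linearity:

$E(Y) = E(X+1) = \frac{q}{p} + 1 = \frac{1}{p}$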
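As a quick numerical sanity check (a sketch, not part of the original notes), the Geometric PMF $q^kp$ can be summed over a truncated range to confirm that it totals approximately $1$, that its mean is close to $q/p$, and that the First Success mean is close to $1/p$. The value `p = 0.3` and the truncation point are arbitrary choices made for illustration.

```python
def geom_pmf(k: int, p: float) -> float:
    """P(X = k) for X ~ Geom(p): k failures before the first success."""
    return (1.0 - p) ** k * p

p = 0.3          # arbitrary success probability for illustration
ks = range(500)  # truncation of the infinite support; the tail is negligible here

total = sum(geom_pmf(k, p) for k in ks)
geom_mean = sum(k * geom_pmf(k, p) for k in ks)
fs_mean = sum((k + 1) * geom_pmf(k, p) for k in ks)  # Y = X + 1 ~ FS(p)

print(total)                     # close to 1
print(geom_mean, (1 - p) / p)    # both close to q/p
print(fs_mean, 1 / p)            # both close to 1/p
```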

<h4>Story 4.3.8 (Negative Binomial Distribution)</h4>

In a sequence of independent Bernoulli trials with success probability $p$, if $X$ is the number of failures before the $r^{th}$ success, then $X$ is said to have the Negative Binomial distribution with parameters $r$ and $p$.

<h4>Theorem 4.3.9</h4>

If $X \text{~} NBin(r,p)$, then the PMF of $X$ is:

$P(X=n) = \binom{n+r-1}{r-1} p^r q^n$

for $n = 0,1,2,\ldots$, where $q=1-p$.

<h4>Theorem 4.3.10</h4>

Let $X \text{~} NBin(r,p)$, viewed as the number of failures before the $r^{th}$ success in a sequence of independent Bernoulli trials with success probability $p$. Then we can write $X=X_1 + \ldots + X_r$, where the $X_i$ are i.i.d. $Geom(p)$.

<h4>Example 4.3.11 (Negative Binomial Expectation)</h4>

By linearity:

$E(X) = E(X_1) + \ldots + E(X_r) = r \cdot \frac{q}{p}$
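The Negative Binomial results can be checked numerically. The sketch below assumes the PMF $P(X=n) = \binom{n+r-1}{r-1} p^r q^n$ and compares the truncated mean against $r \cdot q/p$. The parameter values and the truncation point are arbitrary illustrations.

```python
from math import comb

def nbin_pmf(n: int, r: int, p: float) -> float:
    """P(X = n) for X ~ NBin(r, p): n failures before the r-th success."""
    q = 1.0 - p
    return comb(n + r - 1, r - 1) * p**r * q**n

r, p = 3, 0.4         # arbitrary parameters for illustration
support = range(400)  # truncated support; the remaining tail is negligible here

total = sum(nbin_pmf(n, r, p) for n in support)
mean = sum(n * nbin_pmf(n, r, p) for n in support)

print(total)                   # close to 1
print(mean, r * (1 - p) / p)   # both close to r * q / p
```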

<h4>Example 4.3.12 (Coupon Collector)</h4>

There are $n$ types of toys, which you are collecting one by one with the goal of getting a complete set. The toy types are random (e.g., included in cereal boxes): each time you collect a toy, it is equally likely to be any of the $n$ types. What is the expected number of toys needed until you have a complete set?

<i>Answer:</i>

Let $N$ be the number of toys needed. Our strategy will be to break up $N$ into a sum of simpler random variables so that we can apply linearity.

$N = N_1 + N_2 + \ldots + N_n$

where $N_1$ is the number of toys until the first toy type you haven't seen before, $N_2$ is the additional number of toys until the second toy you haven't seen before, and so forth.

**img pg 162**

By the story of the FS distribution, $N_2 \text{~} FS((n-1)/n)$: after collecting the first toy type, there is a $1/n$ chance of getting the same toy you already had (failure) and an $(n-1)/n$ chance of getting something new (success). Similarly, $N_3 \text{~} FS((n-2)/n)$, and in general $N_j \text{~} FS((n-j+1)/n)$, so $E(N_j) = \frac{n}{n-j+1}$. By linearity:

$E(N) = E(N_1) + E(N_2) + \ldots + E(N_n) = 1 + \frac{n}{n-1} + \frac{n}{n-2} + \ldots + n = n \sum_{j=1}^n \frac{1}{j}$
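The coupon collector result can be illustrated with a small simulation. This is a sketch under the stated assumption that each toy is equally likely to be any of the $n$ types. The function names, the seed, the choice `n = 6`, and the number of trials are hypothetical choices for illustration.

```python
from fractions import Fraction
import random

def expected_toys(n: int) -> Fraction:
    """Exact E(N) = n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(Fraction(1, j) for j in range(1, n + 1))

def simulate_toys(n: int, rng: random.Random) -> int:
    """Draw toys uniformly at random until all n types have been seen."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 6
trials = 20_000
avg = sum(simulate_toys(n, rng) for _ in range(trials)) / trials

print(float(expected_toys(n)))  # exact expectation from the formula
print(avg)                      # simulated average, should be close
```

For $n = 6$ the formula gives $6 \cdot \frac{49}{20} = 14.7$ toys on average.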

<h3>4.5 Law of the Unconscious Statistician (LOTUS)</h3>

$E(g(X))$ does not equal $g(E(X))$ in general if $g$ is not linear. But it is possible to find $E(g(X))$ directly using the distribution of $X$, without having to find the distribution of $g(X)$.

<h4>Theorem 4.5.1 (LOTUS)</h4>

If $X$ is a discrete random variable and $g$ is a function from $\mathbb{R}$ to $\mathbb{R}$, then:

$E(g(X)) = \sum_x g(x) P(X=x)$

This means that we can get the expected value of $g(X)$ knowing only $P(X=x)$, the PMF of $X$. We don't need to know the PMF of $g(X)$.
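LOTUS translates directly into code: sum $g(x) P(X=x)$ over the support of $X$. The sketch below uses a fair six-sided die as a hypothetical example (not from the original notes) and shows that $E(X^2)$ differs from $(E(X))^2$.

```python
from fractions import Fraction

# PMF of a fair six-sided die (a hypothetical example chosen for illustration).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def lotus(g, pmf):
    """E(g(X)) = sum over x of g(x) * P(X = x)."""
    return sum(g(x) * prob for x, prob in pmf.items())

mean = lotus(lambda x: x, pmf)         # E(X)
mean_sq = lotus(lambda x: x**2, pmf)   # E(X^2), computed without the PMF of X^2

print(mean, mean_sq, mean**2)  # E(X^2) is not equal to (E(X))^2
```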

<h3>4.6 Variance</h3>

<h4>Definition 4.6.1 (Variance and Standard Deviation)</h4>

The variance of a random variable is:

$Var(X) = E\left[ (X - E(X))^2 \right]$

The square root of the variance is the standard deviation:

$SD(X) = \sqrt{Var(X)}$

An equivalent equation for variance is:

$Var(X) = E(X^2) - (E(X))^2$

<h4>Theorem 4.6.2 (Properties of Variance)</h4>

- $Var(X+c) = Var(X)$ for any constant $c$. $c$ shifts the distribution of $X$ to the left or right, but does not affect the level of spread.

- $Var(cX) = c^2 Var(X)$ for any constant $c$.

- If $X$ and $Y$ are independent, then $Var(X+Y) = Var(X) + Var(Y)$

- $Var(X) \ge 0$, with equality if and only if $P(X=a)=1$ for some constant $a$; the only random variables which have zero variance are degenerate random variables, i.e., constants.
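The definition of variance, the equivalent formula $Var(X) = E(X^2) - (E(X))^2$, and the shift and scale properties can be checked exactly on a small example. The sketch below again assumes a fair six-sided die, and the constants `10` and `3` are arbitrary.

```python
from fractions import Fraction

# PMF of a fair six-sided die (a hypothetical example chosen for illustration).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g, pmf):
    """E(g(X)) computed with LOTUS."""
    return sum(g(x) * prob for x, prob in pmf.items())

def variance(g, pmf):
    """Var(g(X)) from the definition E[(g(X) - E(g(X)))^2]."""
    mu = expect(g, pmf)
    return expect(lambda x: (g(x) - mu) ** 2, pmf)

var_def = variance(lambda x: x, pmf)
var_alt = expect(lambda x: x**2, pmf) - expect(lambda x: x, pmf) ** 2
var_shift = variance(lambda x: x + 10, pmf)   # Var(X + c)
var_scale = variance(lambda x: 3 * x, pmf)    # Var(cX)

print(var_def, var_alt)       # the two formulas agree
print(var_shift == var_def)   # shifting does not change the variance
print(var_scale == 9 * var_def)  # scaling by c multiplies the variance by c^2
```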

<h3>4.7 Poisson Distribution</h3>

<h4>Definition 4.7.1 (Poisson Distribution)</h4>

A random variable $X$ has the Poisson distribution with parameter $\lambda$, where $\lambda \gt 0$, if the PMF is:

$P(X=k) = \frac{ e^{-\lambda} \lambda^k }{k!}, k = 0,1,2,\ldots$

This is a valid PMF because of the Taylor series:

$\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{\lambda}$

The mean and variance of the Poisson distribution are both equal to $\lambda$.
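A short numerical check of these facts, as a sketch: the Poisson PMF is summed over a truncated support, and the mean and variance are compared with $\lambda$. The value `lam = 2.5` and the truncation at $k < 100$ are arbitrary choices.

```python
from math import exp, factorial

def pois_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Pois(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2.5        # arbitrary rate for illustration
ks = range(100)  # truncated support; the remaining tail is negligible here

total = sum(pois_pmf(k, lam) for k in ks)
mean = sum(k * pois_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * pois_pmf(k, lam) for k in ks)

print(total)      # close to 1
print(mean, var)  # both close to lam
```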

**get Poisson content from notes**

<h4>Example 4.7.5 (Birthday Problem Continued)</h4>

If we have $m$ people and make the usual assumptions, then each pair of people has probability $p=\frac{1}{365}$ of having the same birthday, and there are $\binom{m}{2}$ pairs. By the Poisson paradigm, the distribution of the number $X$ of birthday matches is approximately $Pois(\lambda)$, where $\lambda = \binom{m}{2} \frac{1}{365}$. Then, the probability of at least one match is:

$P(X \ge 1) = 1 - P(X=0) \approx 1-e^{-\lambda}$

For $m=23$, $\lambda = \frac{253}{365}$ and $1-e^{-\lambda} \approx 0.500002$ (which agrees with our earlier finding that we need $23$ people to have a $50\%$ chance of a match).
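The Poisson approximation can be compared with the exact birthday probability, which under the usual assumptions is $1 - \prod_{i=0}^{m-1} \frac{365-i}{365}$. The sketch below is illustrative only, and the function names are hypothetical.

```python
from math import comb, exp, prod

def match_prob_poisson(m: int, days: int = 365) -> float:
    """Poisson approximation: 1 - exp(-lambda) with lambda = C(m, 2) / days."""
    lam = comb(m, 2) / days
    return 1 - exp(-lam)

def match_prob_exact(m: int, days: int = 365) -> float:
    """Exact probability that at least two of m people share a birthday."""
    no_match = prod((days - i) / days for i in range(m))
    return 1 - no_match

for m in (10, 23, 40):
    print(m, match_prob_poisson(m), match_prob_exact(m))
```

For $m = 23$ the approximation is about $0.500$ while the exact value is about $0.507$, so the two agree closely.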

<h4>Example 4.7.6 (Near-Birthday Problem)</h4>

What if we want to find the number of people required in order to have a $50/50$ chance that two people have birthdays within $1$ day of each other (i.e., on the same day or one day apart)?

<i>Answer:</i>

The probability that any two people have birthdays within one day of each other is $3/365$ (choose a birthday for the first person, and then the second person has to be born on that day, the day before, or the day after). There are $\binom{m}{2}$ possible pairs, so the number of within-one-day matches is approximately $Pois(\lambda)$ where:

$\lambda = \binom{m}{2} \frac{3}{365}$

A calculation similar to the one above tells us that we need $m=14$ or more.
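The calculation referred to above can be sketched as a search for the smallest $m$ with $1 - e^{-\lambda} \ge 0.5$, where $\lambda = \binom{m}{2} \frac{3}{365}$. The function name and the `window` parameter are hypothetical names introduced for illustration.

```python
from math import comb, exp

def near_match_prob(m: int, window: int = 3, days: int = 365) -> float:
    """Poisson approximation to P(at least one pair within one day of each other)."""
    lam = comb(m, 2) * window / days
    return 1 - exp(-lam)

# Increase m until the approximate probability reaches 1/2.
m = 2
while near_match_prob(m) < 0.5:
    m += 1

print(m, near_match_prob(m - 1), near_match_prob(m))
```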

<h4>Example 4.7.7 (Birth-Minute and Birth-Hour)</h4>

There are $1600$ sophomores at a certain college. Throughout this example, make the usual assumptions as in the birthday problem.

<b>Part A:</b>

Find a Poisson approximation for the probability that there are two sophomores born not only on the same day, but also at the same hour and the same minute (but not necessarily in the same year).

<i>Answer:</i>

This is the birthday problem with $c = 365 \cdot 24 \cdot 60 = 525,600$ categories rather than 365.

Creating an indicator random variable for each pair of sophomores, by linearity, the expected number of pairs born on the same day-hour-minute combination is:

$\lambda_1 = \binom{n}{2} \frac{1}{c}$

where $n = 1600$ is the number of sophomores.

By Poisson approximation, the probability of at least one match is approximately:

$1 - exp(-\lambda_1) \approx 0.9122$


<b>Part B:</b>

With assumptions as in Part A, find a Poisson approximation for the probability that there are $4$ sophomores born on the same day and hour.

<i>Answer:</i>

Now there are $b = 365 \cdot 24 = 8760$ categories.

Create an indicator for each set of $4$ sophomores. By linearity, the expected number of sets of $4$ sophomores born on the same day-hour is:

$\lambda_2 = \binom{n}{4} \frac{1}{b^3}$

Poisson approximation gives that the desired probability is approximately:

$1 - exp(-\lambda_2) \approx 0.333$
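Both parts reduce to a few lines of arithmetic. The sketch below recomputes $\lambda_1$ and $\lambda_2$ with $n = 1600$ and evaluates the two Poisson approximations. The variable names are chosen here for illustration.

```python
from math import comb, exp

n = 1600           # number of sophomores
c = 365 * 24 * 60  # day-hour-minute categories (Part A)
b = 365 * 24       # day-hour categories (Part B)

lam1 = comb(n, 2) / c      # expected number of matching pairs
lam2 = comb(n, 4) / b**3   # expected number of matching sets of four

p_a = 1 - exp(-lam1)
p_b = 1 - exp(-lam2)

print(lam1, p_a)
print(lam2, p_b)
```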

<h4>Connections Between Poisson and Binomial</h4>

The Poisson and Binomial distributions are closely connected and their relationship is parallel to that of the Binomial and Hypergeometric distributions. We can get from the Poisson to the Binomial by conditioning and we can get from the Binomial to the Poisson by taking a limit.

<h4>Theorem 4.8.1</h4>

If $X \text{~} Pois(\lambda_1)$, $Y \text{~} Pois(\lambda_2)$, and $X$ is independent of $Y$, then $X+Y \text{~} Pois(\lambda_1 + \lambda_2)$.

<h4>Theorem 4.8.2 (Poisson Given a Sum of Poissons)</h4>

If $X \text{~} Pois(\lambda_1)$, $Y \text{~} Pois(\lambda_2)$, and $X$ is independent of $Y$, then the conditional distribution of $X$ given $X+Y=n$ is $Bin(n, \lambda_1 / (\lambda_1 + \lambda_2))$.
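Both theorems can be verified numerically for particular parameters. The sketch below computes $P(X+Y=n)$ by summing over the ways to split $n$, compares it with the $Pois(\lambda_1 + \lambda_2)$ PMF, and then compares the conditional PMF of $X$ given $X+Y=n$ with the $Bin(n, \lambda_1/(\lambda_1+\lambda_2))$ PMF. The values of `lam1`, `lam2`, and `n` are arbitrary.

```python
from math import comb, exp, factorial

def pois_pmf(k: int, lam: float) -> float:
    return exp(-lam) * lam**k / factorial(k)

def bin_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

lam1, lam2, n = 2.0, 3.0, 6  # arbitrary values for illustration

# P(X + Y = n) by the law of total probability, using independence.
p_sum = sum(pois_pmf(j, lam1) * pois_pmf(n - j, lam2) for j in range(n + 1))
sum_gap = abs(p_sum - pois_pmf(n, lam1 + lam2))

# P(X = k | X + Y = n) compared with the Binomial PMF, for every k.
p = lam1 / (lam1 + lam2)
cond_gap = max(
    abs(pois_pmf(k, lam1) * pois_pmf(n - k, lam2) / p_sum - bin_pmf(k, n, p))
    for k in range(n + 1)
)

print(sum_gap, cond_gap)  # both are zero up to floating-point error
```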

<h4>Definition 4.9.4 (Binary Entropy Function)</h4>

For $0 \lt p \lt 1$, the binary entropy function $H$ is given by:

$H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$

We define $H(0) = H(1) = 0$.

The interpretation in information theory is that it is a measure of how much information we get from observing a $Bern(p)$ random variable. $H(1/2)=1$ means that a fair coin flip provides $1$ bit of information, while $H(1)=0$ says that with a coin that always lands heads, there is no information gained from being told the result of the flip.
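The binary entropy function is easy to evaluate directly. The following is a minimal sketch that handles the endpoints with the convention $H(0) = H(1) = 0$. The function name is chosen here for illustration.

```python
from math import log2

def binary_entropy(p: float) -> float:
    """H(p) = -p log2(p) - (1 - p) log2(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, binary_entropy(p))
```

The values are symmetric about $p = 1/2$, where the function attains its maximum of $1$ bit.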

<h4>Theorem 4.9.5 (Shannon)</h4>

Consider a channel where each transmitted bit gets flipped with probability $p$, independently. Let $0 \lt p \lt \frac{1}{2}$ and $\varepsilon \gt 0$. There exists a code with rate at least $1-H(p) - \varepsilon$ that can be decoded with probability of error less than $\varepsilon$.

# Exercises

<h4>Named Distribution Exercises</h4>

Raindrops are falling at an average rate of $20$ drops per square inch per minute. What would be a reasonable distribution for the number of raindrops hitting a particular region measuring $5$ square inches in $t$ minutes? Compute the probability that the region has no raindrops in a given $3$-second interval.

<i>Answer:</i>

A reasonable choice of distribution is $Pois(\lambda t)$, where $\lambda = 20 \cdot 5 = 100$ (the average number of raindrops per minute hitting the region). Assuming this distribution:

$P(\text{no raindrops in 1/20 of a minute}) = e^{-100/20} (100/20)^0 / 0! = e^{-5} \approx 0.0067$

<h4>Exercise 22</h4>

Alice and Bob have just met, and wonder whether they have a mutual friend. Each has $50$ friends, out of $1000$ other people who live in their town. They think it's unlikely that they have a friend in common.

Assume that Alice's 50 friends are a random sample of the $1000$ people, and similarly for Bob. Assume independence.

<b>Part A:</b>

Compute the expected number of mutual friends.

<i>Answer:</i>

Let $I_j$ be the indicator random variable for the $j^{th}$ person being a mutual friend. Then:

$E \left( \sum_{j=1}^{1000} I_j \right) = 1000 E(I_1) = 1000 P(I_1=1) = 1000 \cdot \left( \frac{5}{100} \right)^2 = 2.5$


<b>Part B:</b>

Let $X$ be the number of mutual friends they have. Find the PMF of $X$.

<i>Answer:</i>

Condition on who Alice's friends are, then count the number of ways that Bob can be friends with exactly $k$ of them. This gives:

$P(X=k) = \frac{ \binom{50}{k} \binom{950}{50-k} }{ \binom{1000}{50} }$

i.e., the Hypergeometric distribution.
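The two parts can be cross-checked exactly: the Hypergeometric PMF from Part B should sum to $1$ and have mean $2.5$, matching Part A. The sketch below uses exact rational arithmetic, and the function name and parameter names are hypothetical.

```python
from fractions import Fraction
from math import comb

def mutual_pmf(k: int, town: int = 1000, friends: int = 50) -> Fraction:
    """P(X = k): Bob's friends include exactly k of Alice's friends."""
    return Fraction(
        comb(friends, k) * comb(town - friends, friends - k),
        comb(town, friends),
    )

total = sum(mutual_pmf(k) for k in range(51))
mean = sum(k * mutual_pmf(k) for k in range(51))

print(total, mean)             # exact values
print(float(1 - mutual_pmf(0)))  # probability of at least one mutual friend
```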

<h4>Exercise 24</h4>

Calvin and Hobbes play a match consisting of a series of games, where Calvin has a probability $p$ of winning each game independently. The first player to win two games more than his opponent wins the match. Find the expected number of games played.

Hint: Consider the first two games as a pair, then the next two as a pair, etc.

<i>Answer:</i>

Think of the first 2 games, the 3rd and 4th, the 5th and 6th, etc., as mini-matches. The match ends after the first mini-match that isn't a tie. The probability of a mini-match not being a tie is $p^2 + q^2$, so the number of mini-matches needed is $1$ plus a $Geom(p^2 + q^2)$ random variable. Thus, the expected number of games is $\frac{2}{(p^2 + q^2)}$.

For $p=0$ or $p=1$, this reduces to $2$. The expected number of games is maximized when $p=\frac{1}{2}$, which makes sense intuitively. Also, it makes sense that the result is symmetric in $p$ and $q$.
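A simulation gives an independent check on the formula $\frac{2}{p^2+q^2}$. This is a sketch in which the function names, the seed, the value `p = 0.7`, and the number of trials are arbitrary choices for illustration.

```python
import random

def expected_games(p: float) -> float:
    """Closed-form expectation 2 / (p^2 + q^2)."""
    return 2 / (p**2 + (1 - p) ** 2)

def play_match(p: float, rng: random.Random) -> int:
    """Play games until one player leads by two; return the number of games."""
    lead = 0  # Calvin's wins minus Hobbes's wins
    games = 0
    while abs(lead) < 2:
        lead += 1 if rng.random() < p else -1
        games += 1
    return games

rng = random.Random(1)  # fixed seed so the run is reproducible
p = 0.7
trials = 50_000
avg = sum(play_match(p, rng) for _ in range(trials)) / trials

print(expected_games(p), avg)  # the two values should be close
```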

<h4>Exercise 26</h4>

Let $X$ and $Y$ be $Pois(\lambda)$ random variables, and $T=X+Y$. Suppose that $X$ and $Y$ are not independent, and in fact $X=Y$. Prove or disprove the claim that $T \text{~} Pois(2 \lambda)$ in this scenario.

<i>Answer:</i>

The random variable $T=2X$ is not Poisson, and can only take even values $0,2,4,6, \ldots$, whereas any Poisson random variable has positive probabilities of being $0,1,2,3, \ldots$. Alternatively, we can compute the PMF of $2X$, or note that $Var(2X) = 4 \lambda \neq 2 \lambda = E(2X)$, whereas for any Poisson random variable the variance equals the mean.

<h4>Exercise 33</h4>

A total of $20$ bags of gummy bears are randomly distributed to $20$ students. Each bag is obtained by a random student, and the outcomes are independent. Find the average number of bags of gummy bears that the first $3$ students get in total, and find the average number of students who get at least one bag.

<i>Answer:</i>

Let $X_j$ be the number of bags of gummy bears that the $j^{th}$ student gets, and let $I_j$ be the indicator of $X_j \ge 1$. Then $X_j \text{~} Bin(20,\frac{1}{20})$, so $E(X_j) = 1$. So $E(X_1 + X_2 + X_3) = 3$ by linearity.

The average number of students who get at least one bag is:

$E(I_1 + \ldots + I_{20}) = 20 E(I_1) = 20 P(I_1 = 1) = 20 \left( 1 - \left( \frac{19}{20} \right)^{20} \right)$
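Both answers can be evaluated exactly from the $Bin(20, \frac{1}{20})$ PMF. The sketch below uses rational arithmetic, and the variable names are chosen here for illustration.

```python
from fractions import Fraction
from math import comb

n_bags = n_students = 20
p = Fraction(1, n_students)  # chance that a given bag goes to a given student

def bin_pmf(k: int) -> Fraction:
    """P(X_j = k) for X_j ~ Bin(20, 1/20)."""
    return comb(n_bags, k) * p**k * (1 - p) ** (n_bags - k)

mean_bags = sum(k * bin_pmf(k) for k in range(n_bags + 1))  # E(X_j)
first_three = 3 * mean_bags                                  # E(X_1 + X_2 + X_3)
at_least_one = n_students * (1 - bin_pmf(0))                 # E(I_1 + ... + I_20)

print(mean_bags, first_three)
print(float(at_least_one))  # roughly 12.8 students
```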


<h4>Exercise 52</h4>

An urn contains red, green, and blue balls. Balls are chosen randomly with replacement. Let $r,g,b$ be the probabilities of drawing a red, green, or blue ball respectively ($r+g+b=1$).

<b>Part A:</b>

Find the expected number of balls chosen before obtaining the first red ball, not including the red ball itself.

<i>Answer:</i>

The distribution is $Geom(r)$, so the expected value is $\frac{1-r}{r}$.


<b>Part B:</b>

Find the expected number of different colors of balls obtained before getting the first red ball.

<i>Answer:</i>

Use indicator random variables: let $I_1$ be $1$ if green is obtained before red, and 0 otherwise, and define $I_2$ similarly for blue. Then:

$E(I_1) = P(\text{green before red}) = \frac{g}{g+r}$

since 'green before red' means that the first nonblue ball is green. Similarly, $E(I_2) = b/(b+r)$, so the expected number of colors obtained before getting red is:

$E(I_1 + I_2) = \frac{g}{g+r} + \frac{b}{b+r}$
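The Part B formula can be checked by simulation. The sketch below draws balls with replacement until the first red ball and counts the distinct non-red colors seen. The probabilities, the seed, and the number of trials are arbitrary, and the function name is hypothetical.

```python
import random

def colors_before_red(r: float, g: float, rng: random.Random) -> int:
    """Count distinct non-red colors drawn before the first red ball.

    Blue has the remaining probability 1 - r - g.
    """
    seen = set()
    while True:
        u = rng.random()
        if u < r:
            return len(seen)
        seen.add("green" if u < r + g else "blue")

r, g, b = 0.2, 0.3, 0.5  # arbitrary probabilities with r + g + b = 1
rng = random.Random(52)  # fixed seed so the run is reproducible
trials = 50_000
avg = sum(colors_before_red(r, g, rng) for _ in range(trials)) / trials
formula = g / (g + r) + b / (b + r)

print(formula, avg)  # the two values should be close
```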


<b>Part C:</b>

Find the probability that at least two of $n$ balls drawn are red, given that at least one is red.

<i>Answer:</i>

By definition of conditional probability:

$P(\text{at least 2 red} \mid \text{at least 1 red}) = \frac{P(\text{at least 2 red})}{P(\text{at least 1 red})} = \frac{ 1-(1-r)^n - nr(1-r)^{n-1} }{ 1-(1-r)^n }$