<table align="left" style="border-style: hidden" class="table"> <tr><td class="col-md-2"><img style="float" src="http://prob140.org/assets/icon_sp22_ugarte.png" alt="Prob140 Logo" style="width: 120px;"/></td><td><div align="left"><h3 style="margin-top: 0;">Probability for Data Science</h3><h4 style="margin-top: 20px;">UC Berkeley, Fall 2022</h4><p>Ani Adhikari</p>CC BY-NC-SA 4.0</div></td></tr></table><!-- not in pdf -->

This content is protected and may not be shared, uploaded, or distributed.

# Homework 8 #

### Instructions

Your homeworks have two components: a written portion and a portion that also involves code. Written work should be completed on paper, and coding questions should be done in the notebook. You are welcome to LaTeX your answers to the written portions, but staff will not be able to assist you with LaTeX-related issues. It is your responsibility to ensure that both components of the homework are submitted completely and properly to Gradescope. Refer to the bottom of the notebook for submission instructions.

In [None]:
# Run this cell to set up your notebook

# These lines make warnings go away
import warnings
warnings.filterwarnings('ignore')

import numpy as np
from scipy import stats
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## 1. Correlation ##
The *correlation coefficient* between random variables $X$ and $Y$ is defined as

$$
r(X, Y) ~ = \frac{Cov(X, Y)}{SD(X)SD(Y)}
$$

It is called the correlation, for short. The definition explains why $X$ and $Y$ are called *uncorrelated* if $Cov(X, Y) = 0$.

**a)** Let $X^*$ be $X$ in standard units and let $Y^*$ be $Y$ in standard units. Check that

$$
r(X, Y) = E(X^*Y^*)
$$

This is the random variable version of the Data 8 definition of the correlation between two data variables: convert each variable to standard units; multiply each pair; take the mean of the products.
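The Data 8 recipe described above can be tried on simulated data. The sketch below is only an empirical illustration, not a proof: the data-generating choices (`x`, `y`, the slope 2, the seed) and the helper `standard_units` are hypothetical. It compares the mean of the products of standard units with NumPy's built-in `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y depends linearly on x plus independent noise
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)

def standard_units(v):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (v - np.mean(v)) / np.std(v)

# Mean of the products of the paired standard units
r_standard_units = np.mean(standard_units(x) * standard_units(y))

# NumPy's built-in correlation coefficient, for comparison
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_standard_units, r_numpy)
```

The two printed numbers should agree up to floating-point error, which is what the identity in Part **a)** leads you to expect.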

**b)** Use the fact that $(X^* + Y^*)^2$ and $(X^* - Y^*)^2$ are non-negative random variables to show that $-1 \le r(X, Y) \le 1$.

[First find the numerical values of $E(X^*)$ and $E\left({X^*}^2\right)$. Then find $E\left((X^* + Y^*)^2\right)$.]

**c)** Show that if $Y = aX+b$ where $a \ne 0$, then $r(X, Y)$ is 1 or $-1$ depending on whether the sign of $a$ is positive or negative.

**d)** Consider a sequence of i.i.d. Bernoulli $(p)$ trials. For any positive integer $k$ let $X_k$ be the number of successes in trials 1 through $k$. **Use bilinearity** to find $Cov(X_n, X_{n+m})$ and hence find $r(X_n, X_{n+m})$.

**e)** Fix $n$ and find the limit of $r(X_n, X_{n+m})$ as $m \to \infty$. Explain why the limit is consistent with intuition.

#newpage

## 2. The Matching Problem ##

In the familiar setting of the matching problem, there are $n$ letters labeled 1 through $n$ and $n$ envelopes labeled 1 through $n$. The letters are distributed at random into the envelopes, one letter per envelope, such that all $n!$ permutations are equally likely.

Let $M$ be the number of letters that fall into envelopes with the corresponding label. That is, $M$ is the number of "matches" or fixed points of the permutation.
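If you would like to see $M$ in action before working on paper, the setting can be simulated. This is a minimal sketch under assumed values: `n = 10`, the number of repetitions, and the helper `count_matches` are all hypothetical choices, and the output is a sanity check rather than a substitute for the derivations requested below.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_matches(n, rng):
    """Number of fixed points in one uniformly random permutation of 0, ..., n-1."""
    permutation = rng.permutation(n)
    return int(np.sum(permutation == np.arange(n)))

# Simulate M many times for one hypothetical choice of n
n = 10
repetitions = 20_000
simulated_M = np.array([count_matches(n, rng) for _ in range(repetitions)])

# Empirical mean and variance of the simulated values of M
print(np.mean(simulated_M), np.var(simulated_M))
```

Rerunning the cell with other values of `n` is a quick way to compare the simulation with your written answers.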

**(a)** Fill in the blank:

$$M = I_1 + I_2 + \ldots + I_n$$ 

where for each $j$ in the range 1 through $n$, $I_j = 1$ if $\underline{~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}$, and $I_j = 0$ otherwise.

**(b)** Use Part **(a)** to show that $E(M)$ has the same numerical value for all $n$.

**(c)** Use Part **(a)** to show that $Var(M)$ has the same numerical value for all $n$.

**(d)** The approximate distribution of $M$ when $n$ is large is given at the bottom of [Section 5.3](http://prob140.org/textbook/content/Chapter_05/03_The_Matching_Problem.html) in the textbook. Are your answers to Parts **(b)** and **(c)** consistent with that approximation?

#newpage

## 3. Collecting Distinct Values ##

In Homework 4 you found the expectation of each of the random variables below. **Go back and see how you did that, and then use the same ideas** to find the variance of each one. 

For one part you will need the fact that the SD of a geometric $(p)$ random variable is $\frac{\sqrt{q}}{p}$ where $q = 1-p$. We haven't proved that, as the algebra takes a bit of work. We will prove it later in the course by conditioning.
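Although the proof of the geometric SD formula comes later, the formula can be checked numerically. The sketch below assumes a hypothetical value `p = 0.3` and uses NumPy's `Generator.geometric`, which counts the number of trials up to and including the first success.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.3          # hypothetical success probability
q = 1 - p

# Each value is the number of trials up to and including the first success
waiting_times = rng.geometric(p, size=200_000)

empirical_sd = np.std(waiting_times)
formula_sd = np.sqrt(q) / p

print(empirical_sd, formula_sd)
```

The two values should be close but not identical, since the first is an estimate from a random sample.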

**(a)** A die is rolled $n$ times. Find the variance of the number of faces that *do not* appear.

**(b)** Use your answer to (a) to find the variance of the number of distinct faces that *do* appear in $n$ rolls of a die.

**(c)** Find the variance of the number of times you have to roll a die till you have seen all of the faces.

#newpage

## 4. Poisson-Binomial Distribution

For this exercise, please refer to the theory in [Section 14.1](http://prob140.org/textbook/content/Chapter_14/01_Exact_Distribution_of_a_Sum.html#) and the code in [Section 14.2](http://prob140.org/textbook/content/Chapter_14/02_PGFs_in_NumPy.html).

In Lab 1B you saw that a *Poisson-binomial* random variable is a sum of independent indicators that are not necessarily identically distributed: 

$X = I_1 + I_2 + \cdots + I_n$ where $I_j$ has the Bernoulli $(p_j)$ distribution and $I_1, I_2, \ldots, I_n$ are independent.
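As a warm-up for working with probability generating functions as `NumPy` polynomials, here is a small sketch on a different example: the sum of two rolls of a fair die. It assumes the `np.poly1d` representation, which matches the `.c` attribute used later in this exercise. The names `die_pgf`, `two_dice_pgf`, and `probs` are hypothetical.

```python
import numpy as np

# PGF of one roll of a fair die: (s + s^2 + ... + s^6) / 6.
# np.poly1d lists coefficients from the highest power down to the constant term.
die_pgf = np.poly1d([1/6] * 6 + [0])

# The PGF of a sum of independent random variables is the product of their PGFs
two_dice_pgf = die_pgf * die_pgf

# Reverse the coefficients so that index k holds P(sum = k)
probs = np.flipud(two_dice_pgf.c)

print(probs[7])       # P(sum of two dice = 7)
print(sum(probs))     # the coefficients of a PGF add up to 1
```

Note the ordering convention: `.c` starts with the coefficient of the highest power, so the array has to be reversed before it lines up with the possible values $0, 1, 2, \ldots$.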

**(a)** What is the probability generating function of a Bernoulli $(p)$ random variable? Provide a formula and then use the code cell below to define a function `indicator_pgf` that takes $p$ as its argument and returns the probability generating function of a Bernoulli $(p)$ random variable as a `NumPy` polynomial. Use as many lines as you need. The last line of the cell is there for you to check that your function is working.

In [None]:

# Answer to 4a

def indicator_pgf(p):
    ...
    return ...

print(indicator_pgf(0.4))

**(b)** For $j = 1, 2, \ldots, 20$, let $p_j = 1/(j+1)$. Let $I_1, I_2, \ldots, I_{20}$ be independent indicators such that $I_j$ has the Bernoulli $(p_j)$ distribution, and let $X = I_1 + I_2 + \cdots + I_{20}$. Complete the code cell below so that `pgf_X` is the probability generating function of $X$ as a `NumPy` polynomial. Use as many lines as you need. The last two lines are there for you to check that your polynomial has the correct degree and that it is indeed a probability generating function.

In [None]:

# Answer to 4b

...

print(pgf_X)
sum(pgf_X.c) # sum of coefficients

**(c)** Complete the cell below to plot the probability histogram of $X$. Do not add any more lines.

In [None]:

# Answer to 4c

vals_X = ...
probs_X = ...
dist_X = Table()...
Plot(dist_X)

**(d)** Complete the cell below to find the expectation, variance, and SD of $X$ using `p_array`. Do not add any more lines. Then run the cell below that to check your answers.

In [None]:

p_array = 1/np.arange(2, 22)
ev_X = ...
var_X = ...
sd_X = ...
ev_X, var_X, sd_X

In [None]:
dist_X.ev(), dist_X.var(), dist_X.sd()

**(e)** Explain why the distribution of $X$ cannot be Poisson. Then show that the distribution of $X$ is not binomial either, as follows. If $X$ were binomial, what would $n$ have to be? Use that and your answer to Part (d) to see what $p$ would have to be. Use the code cell below to find the variance of that binomial distribution, and compare with your answer to Part (d).

In [None]:

n = ...
p = ...
binomial_variance = ... 
binomial_variance

#newpage

## 5. Widths of Confidence Intervals ##
In any part of this question that involves a sample size, you can assume the sample size is big enough for the Central Limit Theorem approximation to be good. You should answer on paper, but you can use the code cell provided below for arithmetic or to find standard normal percentages or percentiles. The appropriate library has been imported at the top of this notebook.
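For reference, standard normal percentiles and percentages can be found with `stats.norm.ppf` and `stats.norm.cdf`. The sketch below uses a hypothetical 90% confidence level, chosen only for illustration; it is not one of the levels in this exercise.

```python
from scipy import stats

# Hypothetical confidence level, chosen only for illustration
confidence_level = 0.90
tail_area = (1 - confidence_level) / 2

# Standard normal percentile that leaves tail_area in the right tail
z = stats.norm.ppf(1 - tail_area)

# Check: the area between -z and z should equal the confidence level
central_area = stats.norm.cdf(z) - stats.norm.cdf(-z)

print(z, central_area)
```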

**a)** A survey organization has used the methods of our class to construct an approximate 95% confidence interval for the mean annual income of households in a county. The interval runs from $\$66,000$ to $\$70,000$. If possible, find an approximate 98% confidence interval for the mean annual income of households in the county. If this is not possible, explain why not.

**b)** A survey organization is going to take a simple random sample of $n$ voters from among all the voters in a state, to construct a 98% confidence interval for the proportion of voters who favor a proposition. Find an $n$ such that the total width of the confidence interval (left end to right end) will be no more than 0.06. Remember that you can bound the [variance of an indicator](http://prob140.org/textbook/content/Chapter_12/01_Definition.html#indicator).

In [None]:
# Code cell for scratch work for Exercise 5

#newpage

## 6. Testing Hypotheses in the Gauss Model

The [Gauss](https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss) model for measurement error says that repeated measurements $X_1, X_2, \ldots, X_n$ of the same quantity have the structure

$$
X_i = \mu + \epsilon_i, ~~~~~ 1 \leq i \leq n
$$

where $\mu$ is an unknown constant called the *true value* and $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are random error terms assumed to be i.i.d. with mean 0 and variance $\sigma^2$. 
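The model can be simulated to see how the average of the measurements behaves. This is a sketch under stated assumptions: the values of `mu`, `sigma`, and `n` are hypothetical, and the errors are drawn from a normal distribution purely for convenience, since the model itself only requires i.i.d. errors with mean 0 and variance $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, chosen only for illustration
mu = 5.0        # true value
sigma = 2.0     # SD of each error
n = 400         # number of measurements per sample
samples = 5_000 # number of simulated samples

# Assumption: normal errors. The model only requires i.i.d. errors
# with mean 0 and variance sigma^2.
errors = rng.normal(0, sigma, size=(samples, n))
measurements = mu + errors

# One average per simulated sample
sample_means = measurements.mean(axis=1)

print(np.mean(sample_means))                       # compare with mu
print(np.std(sample_means), sigma / np.sqrt(n))    # compare with each other
```

The averages should center near `mu`, with an SD close to `sigma / np.sqrt(n)`.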

From a practical perspective, the true value $\mu$ comes from the quantity being measured (for example, the true weight of an object). The error terms come from the measuring process (for example, from the balance being used for weighing). Thus $\sigma$ is sometimes known because of extensive experience with the measuring process (for example, having used the same balance to weigh many different objects). 

So assume that the Gauss model holds with $\sigma = 1$, and let $n = 100$. Suppose a data scientist wants to test the following hypotheses:

- Null hypothesis $H_0$: $\mu = 20$
- Alternative hypothesis $H_A$: $\mu \neq 20$

Suppose the data scientist wants to use the average measurement $\bar{X}$ as the test statistic and reject the null hypothesis if $\vert \bar{X} - 20 \vert > 0.175$.

**(a)** Rewrite the decision rule by filling in the blanks with numbers: $\vert \bar{X} - 20 \vert > 0.175 \iff \bar{X} < \underline{~~~~~~~~} \text{ or } \bar{X} > \underline{~~~~~~~~}$

**(b) Level:** Find the approximate distribution of the test statistic $\bar{X}$ under $H_0$, and use this distribution to find the approximate probability that the test rejects the null hypothesis if the null hypothesis is true. This probability is called the *level* of the test. In Data 8 we called it the cutoff for the p-value. 

Please write out your answer, and use the code cell below for scratch work. Remember that `stats.norm.cdf(x, mean, SD)` evaluates to the cdf of the normal $(\text{mean, } \text{SD}^2)$ distribution at the point $x$. The necessary modules have been imported at the top of this notebook.
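As a reminder of how `stats.norm.cdf` can be combined to get the chance of landing in either tail, here is a sketch with hypothetical numbers (`mean`, `sd`, `lower`, and `upper` are chosen only for illustration and are not the values in this exercise).

```python
from scipy import stats

# Hypothetical normal distribution and cutoffs, chosen only for illustration
mean, sd = 50, 4
lower, upper = 42, 58     # each cutoff is 2 SDs from the mean

# P(X < lower) + P(X > upper)
two_tail_prob = stats.norm.cdf(lower, mean, sd) + (1 - stats.norm.cdf(upper, mean, sd))

print(two_tail_prob)
```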

**(c) Power:** Suppose that in fact $\mu = 20.5$ though the data scientist doesn't know this and is still performing the same test as above. Find the approximate distribution of the test statistic $\bar{X}$ under the condition $\mu = 20.5$, and use this distribution to find the approximate probability that the test rejects the null hypothesis if $\mu = 20.5$. This probability is called the *power of the test against the fixed alternative $\mu = 20.5$*. 

Please write out your answer, and use the code cell below for scratch work.

In [None]:
# Scratch work for Exercises 6b and 6c


**(d)** Complete the code cell below to plot the graph of the power of the test under the fixed alternative $\mu = \mu_A$ for $\mu_A$ in the range `true_mu` below. Do not add any more lines.

Computational note: First study the code below and compare with the output of the cell. 

In [None]:
mu_list = [10, 15, 20]  # It's also fine for this to be an array.

# array of P(X_i < 12)
# for X_i normal with mean = ith element of mu_list
# and SD = 8
stats.norm.cdf(12, mu_list, 8)

In [None]:

# Answer to 6d

true_mu = np.arange(19, 21, 0.05)
power = ...

plt.plot(true_mu, power, color='darkblue', lw=2)
plt.xlabel(r'True value of $\mu$')  # raw string so that \m is not read as an escape sequence
plt.title('Power of the Test');

**(e)** Interpret the graph. What is the test likely to do if the true value of $\mu$ is far from 20, and what does the power converge to (be careful!) when the true value gets close to $20$?

#newpage

## Submission Instructions ##

Many assignments throughout the course will have a written portion and a code portion. Please follow the directions below to properly submit both portions.

### Written Portion ###
*  Scan all the pages into a PDF. You can use any scanner or a phone using applications such as CamScanner. Please **DO NOT** simply take pictures using your phone. 
* Please start a new page for each question. If you have already written multiple questions on the same page, you can crop the image in CamScanner or fold your page over (the old-fashioned way). This helps expedite grading.
* It is your responsibility to check that all the work on all the scanned pages is legible.

### Code Portion ###
* Save your notebook using File > Save and Checkpoint.
* Generate a PDF file using File > Download as > PDF via LaTeX. This might take a few seconds and will automatically download a PDF version of this notebook.
    * If you have issues, please make a follow-up post on the general HW 8 Ed thread.
    
### Submitting ###
* Combine the PDFs from the written and code portions into one PDF.  [Here](https://smallpdf.com/merge-pdf) is a useful tool for doing so. 
* Submit the assignment to Homework 8 on Gradescope. 
* **Make sure to assign each page of your pdf to the correct question.**
* **It is your responsibility to verify that all of your work shows up in your final PDF submission.**

If you have questions about scanning or uploading your work, please post a follow-up to the [Ed thread](https://edstem.org/us/courses/24954/discussion/1695227) on this topic. 