<table align="left" style="border-style: hidden" class="table"> <tr><td class="col-md-2"><img style="float" src="http://prob140.org/assets/icon256.png" alt="Prob140 Logo" style="width: 120px;"/></td><td><div align="left"><h3 style="margin-top: 0;">Probability for Data Science</h3><h4 style="margin-top: 20px;">UC Berkeley, Fall 2018</h4><p>Ani Adhikari and Jim Pitman</p>CC BY-NC 4.0</div></td></tr></table><!-- not in pdf -->

# Homework 4 #

#### Rules for Written Homework ####

- Every answer should contain a calculation or reasoning. For example, a calculation such as $(1/3)(0.8) + (2/3)(0.7)$ is fine without further explanation or simplification. If we want you to simplify, we'll ask you to. But just ${5 \choose 2}$ is not fine; write "we need 2 out of the 5 frogs and they can appear in any order" or whatever reasoning you used. Reasoning can be brief and abbreviated, e.g. "product rule" or "not mut. excl."
- Unless otherwise specified, all infinite sums have to be simplified. Finite sums may be left in summation notation.
- You may consult others but you must write up your own answers using your own words, notation, and sequence of steps.
- In the interest of saving trees, you do not need to *solve* each question on a new piece of paper. Folding the paper to show just the relevant problem will suffice. To ensure the correct page size, we recommend placing the folded part on a blank page before scanning, or adjusting the page settings on your phone scanning app.
- You will submit a scanned PDF to Gradescope. **Each question should *start* on a new PDF page. No page should contain two questions.**

#### Rules for Coding ####

- Do not share, copy, or allow others to copy your code. You may discuss your approach and relevant methods or functions to use.
- A code cell (which may contain starter code) is provided for each question or subpart that requires coding. You are free to add additional cells as needed.
- You will submit a PDF to Gradescope. See the bottom of the notebook for more instructions.
- Here are some code references:
    - [Prob 140 Code Reference Sheet](http://prob140.org/assets/prob140_code_reference.pdf)
    - [Data 8 Python Reference](http://data8.org/fa18/python-reference.html)
    - [`scipy.stats` Documentation](https://docs.scipy.org/doc/scipy/reference/stats.html)

In [None]:
# Run this cell to set up your notebook

import numpy as np
from scipy import stats
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# These lines suppress FutureWarning messages
import warnings
warnings.simplefilter('ignore', FutureWarning)

## 1. Aces and Face Cards ##
A standard deck consists of 52 cards of which 4 are aces, 4 are kings, and 12 (including the four kings) are "face cards" (Jacks, Queens, and Kings).

Cards are dealt at random without replacement from a standard deck till all the cards have been dealt. 

Find the expectation of the following. Each can be done with almost no calculation if you use symmetry. 

**a) [WRITTEN]** The number of aces among the first 5 cards

**b) [WRITTEN]** The number of face cards that *do not* appear among the first 13 cards

**c) [WRITTEN]** The number of aces among the first 5 cards minus the number of kings among the last 5 cards

**d) [WRITTEN]** The number of cards before the first ace 

**e) [WRITTEN]** The number of cards strictly in between the first ace and the last ace

**f) [WRITTEN]** The number of face cards before the first ace
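
A symmetry answer can be sanity-checked by simulation. The sketch below is one possible way to do that and is not part of the assignment: it shuffles a 52-card deck many times and averages a statistic of the shuffled order. The helper names (`make_deck`, `estimate_expectation`) and the example statistic, the number of aces among the first 13 cards, are illustrative choices; the statistic is deliberately not one of the parts above.

```python
import random

def make_deck():
    """Return a 52-card deck as a list of rank labels (suits do not matter here)."""
    ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
    return [rank for rank in ranks for _ in range(4)]

def estimate_expectation(statistic, reps=20000, seed=0):
    """Average statistic(deck) over `reps` independent shuffles of the deck."""
    rng = random.Random(seed)
    deck = make_deck()
    total = 0
    for _ in range(reps):
        rng.shuffle(deck)          # a uniformly random order = dealing without replacement
        total += statistic(deck)
    return total / reps

def aces_in_first_13(deck):
    """Illustrative statistic: number of aces among the first 13 cards dealt."""
    return deck[:13].count('A')

print(estimate_expectation(aces_in_first_13))
```

To check one of the parts, replace `aces_in_first_13` with a function that computes the quantity in question from the shuffled list. A simulated average supports a written answer but does not replace the reasoning.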

## 2. Phone Calls ##
In an hour, a student receives $X$ phone calls from people he knows and $Y$ phone calls from people he doesn't know. Assume that $X$ has the Poisson $(\lambda)$ distribution and $Y$ has the Poisson $(\mu)$ distribution. Also assume that $X$ and $Y$ are independent. Finally, assume that each call has chance 0.1 of being missed by the student, independently of all other calls.

**a) [WRITTEN]** Fill in the blank with the name of a distribution and its parameter or parameters:

The total number of calls that the student receives has the $\underline{~~~~~~~~~~~~}$ distribution.

**b) [WRITTEN]** Fill in the blank with the name of a distribution and its parameter or parameters:

Given that the student receives a total of $k$ calls, the conditional distribution of the total number of missed calls is $\underline{~~~~~~~~~~~~}$.

**c) [WRITTEN]** Fill in the blank with the name of a distribution and its parameter or parameters:

The total number of missed calls has the $\underline{~~~~~~~~~~~~}$ distribution.

**d) [WRITTEN]** For integers $k \ge 0$, find the chance that the student misses at most $k$ calls from people he knows.

**e) [WRITTEN]** Let $n$ and $m$ be non-negative integers. Find the chance that the student misses $n$ calls from people he knows and misses $m$ calls from people he doesn't know.
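
The setup in this problem can also be explored empirically. The following is a minimal sketch, assuming arbitrary illustrative rates $\lambda = 3$ and $\mu = 2$; it simulates many hours of calls and tabulates the observed proportions of the total number of missed calls, which can then be compared with the distribution you name in Part **c**. The function names are hypothetical, and the Poisson draws use Knuth's multiplication method only because the standard `random` module has no Poisson sampler.

```python
import math
import random
from collections import Counter

def poisson_sample(rate, rng):
    """Draw one Poisson(rate) value using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def missed_calls_in_hour(lam, mu, miss_chance, rng):
    """Simulate one hour; return (missed known calls, missed unknown calls)."""
    known = poisson_sample(lam, rng)
    unknown = poisson_sample(mu, rng)
    # Each call is missed independently with chance miss_chance
    missed_known = sum(rng.random() < miss_chance for _ in range(known))
    missed_unknown = sum(rng.random() < miss_chance for _ in range(unknown))
    return missed_known, missed_unknown

def empirical_missed_distribution(lam, mu, miss_chance=0.1, reps=20000, seed=0):
    """Return {total missed calls: observed proportion} over `reps` simulated hours."""
    rng = random.Random(seed)
    counts = Counter(
        sum(missed_calls_in_hour(lam, mu, miss_chance, rng)) for _ in range(reps)
    )
    return {k: counts[k] / reps for k in sorted(counts)}

# lam = 3 and mu = 2 are arbitrary illustrative rates, not values from the problem
for k, proportion in empirical_missed_distribution(3, 2).items():
    print(k, round(proportion, 4))
```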

## 3. Unbiased Estimators ##

**a) [WRITTEN]** A population of known size $N$ contains an unknown number $G$ of good elements. Let $X$ be the number of good elements in a simple random sample of size $n$ drawn from this population. Use $X$ to construct an unbiased estimator of $G$.

See the example in Section 8.2 for a refresher: http://prob140.org/textbook/chapters/Chapter_08/02_Additivity

**b) [WRITTEN]** Would your answer to Part **a** have been different if $X$ had been the number of good elements in a random sample drawn with replacement from the population? Why or why not?

**c) [WRITTEN]** A flattened die lands 1 and 6 with chance $p/2$ each, and the other faces 2, 3, 4, and 5 with chance $(1-p)/4$ each. Here $p \in (0, 1)$ is an unknown number. Let $X_1, X_2, \ldots, X_n$ be the results of $n$ rolls of this die. First find $E(\vert X_1 - 3.5 \vert)$, and use the answer to construct an unbiased estimator of $p$ based on all of $X_1, X_2, \ldots, X_n$.
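
For Part **c**, it can help to generate rolls of the flattened die for a known $p$ and see whether a candidate estimator lands near that $p$. The sketch below assumes an illustrative value $p = 0.4$ and uses hypothetical helper names; it only produces the rolls and the sample average of $\vert X_i - 3.5 \vert$, leaving the estimator itself to you.

```python
import random

def roll_flattened_die(p, n, seed=0):
    """Return n independent rolls of the flattened die with parameter p."""
    rng = random.Random(seed)
    faces = [1, 2, 3, 4, 5, 6]
    # Faces 1 and 6 have chance p/2 each; faces 2, 3, 4, 5 have chance (1-p)/4 each
    weights = [p / 2, (1 - p) / 4, (1 - p) / 4, (1 - p) / 4, (1 - p) / 4, p / 2]
    return rng.choices(faces, weights=weights, k=n)

def mean_abs_deviation(rolls):
    """Sample average of |x - 3.5| over the rolls."""
    return sum(abs(x - 3.5) for x in rolls) / len(rolls)

# p = 0.4 is an arbitrary illustrative value
rolls = roll_flattened_die(p=0.4, n=10000)
print(mean_abs_deviation(rolls))
```

Comparing the printed average with your formula for $E(\vert X_1 - 3.5 \vert)$ evaluated at the same $p$ is a quick consistency check.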

## 4. Classification ##
In a population of four classes of individuals, the proportion of individuals of Class $i$ is $p_i$ for $i = 1, 2, 3, 4$. Suppose you sample independently from this population, and suppose the sample size is a Poisson random variable with parameter $n$ for a fixed positive integer $n$. 

**a) [WRITTEN]** What is the probability that the sample contains at least one individual from each class?

**b) [CODE]** Let $p_i = \frac{i}{10}$ for $i = 1, 2, 3, 4$. Plot a graph of your answer as a function of $n$ for $n$ in the range 1 through 60. For an example of a calculation of probabilities accompanied by a plot, see Section 1.4 of the text.

**c) [CODE]** Still assuming $p_i = \frac{i}{10}$ for $i = 1, 2, 3, 4$, find a decimal value for the chance that you get at least two of each kind of individual if the parameter is $n = 50$.
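
Before trusting a formula for Part **a**, you may want an independent empirical estimate to compare it with. The sketch below is one way to get that, not the intended solution: it draws a Poisson $(n)$ sample size, samples that many class labels, and records whether all four classes appear. All function names are hypothetical, and the Poisson draw is a hand-written Knuth sampler so that the block needs only the standard library.

```python
import math
import random

def poisson_sample(rate, rng):
    """Draw one Poisson(rate) value using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def has_every_class(n, proportions, rng):
    """Draw a Poisson(n)-sized sample and report whether every class appears."""
    size = poisson_sample(n, rng)
    classes = list(range(len(proportions)))
    seen = set(rng.choices(classes, weights=proportions, k=size))
    return len(seen) == len(proportions)

def estimate_all_classes(n, proportions, reps=5000, seed=0):
    """Proportion of simulated samples that contain all classes."""
    rng = random.Random(seed)
    hits = sum(has_every_class(n, proportions, rng) for _ in range(reps))
    return hits / reps

proportions = [0.1, 0.2, 0.3, 0.4]   # p_i = i/10, as in Part b
for n in (5, 20, 60):
    print(n, estimate_all_classes(n, proportions))
```

The estimates should track the values of `p_at_least_one(n)` up to simulation error; a large gap suggests rechecking the formula.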

In [None]:
# Answer to 4b

def p_at_least_one(n):
    ...

In [None]:
# Answer to 4c
...

## 5. Geometric Distribution ##
Consider an infinite sequence of i.i.d. Bernoulli $(p)$ trials, where $0 < p < 1$. To make the problem more concrete, imagine rolling a die over and over again and keeping track of whether or not you see the face with six spots; but solve the problem for a general $p$.

Let $T$ be the index of the first trial that results in a success. Then the possible values of $T$ are $1, 2, 3, \ldots $. In the case of rolling a die till you see a six, if the rolls come out $2,3,3,4,6$ then the value of $T$ is 5.

**a) [WRITTEN]** Let $q = 1-p$. Explain why $P(T = k) = q^{k-1}p$ for $k \ge 1$. This is called the *geometric distribution with parameter $p$ on $\{1, 2, 3, \ldots \}$*.

**b) [WRITTEN]** For a positive integer $k$, let $N_k$ be the number of successes in trials 1 through $k$. Fill in the blank in the identity below and explain your choice.

$$
P(T > k) = P(N_k = \underline{ ~~~~~~~~~~~~~ })
$$

**c) [WRITTEN]** Use **(b)** to find $P(T > k)$ with no calculation.

**d) [WRITTEN]** Use **(c)** and the tail sum formula used in Lab 4 to show that $E(T) = 1/p$.
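
The claim $E(T) = 1/p$ in Part **d** can be checked numerically. A minimal sketch, using the die example with $p = 1/6$ and hypothetical function names: run Bernoulli $(p)$ trials until the first success, repeat many times, and compare the average value of $T$ with $1/p$.

```python
import random

def first_success_index(p, rng):
    """Run Bernoulli(p) trials until the first success; return its index T."""
    t = 1
    while rng.random() >= p:   # a trial fails with chance 1 - p
        t += 1
    return t

def estimate_mean_T(p, reps=20000, seed=0):
    """Average of T over `reps` independent repetitions."""
    rng = random.Random(seed)
    return sum(first_success_index(p, rng) for _ in range(reps)) / reps

p = 1 / 6   # chance of rolling a six
print(estimate_mean_T(p), 1 / p)
```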

## 6. Collecting Distinct Values ##
This problem is a workout in finding expectations by using all the tools at your disposal. Some of them have been developed in this homework. If an answer doesn't appear to fit into a formula that has already been proven, it's a very good idea to try to write the variable as a sum of simpler variables.

**a) [WRITTEN]** A fair die is rolled $n$ times. Find the expected number of times the face with six spots appears.

**b) [WRITTEN]** A fair die is rolled $n$ times. Find the expected number of faces that *do not* appear, and say what happens to this expectation as $n$ increases.

**c) [WRITTEN]** Use your answer to **(b)** to find the expected number of distinct faces that *do* appear in $n$ rolls of a die.

**d) [WRITTEN]** Find the expected number of times you have to roll a die till you have seen all of the faces. This is a version of what is known as the *coupon collector's problem*.
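
Parts **c** and **d** also lend themselves to a simulation check. The sketch below, with hypothetical helper names and an arbitrary illustrative choice of $n = 10$ rolls, estimates the expected number of distinct faces in $n$ rolls and the expected number of rolls needed to see all six faces.

```python
import random

def distinct_faces(n, rng):
    """Number of distinct faces seen in n rolls of a fair die."""
    return len({rng.randint(1, 6) for _ in range(n)})

def rolls_until_all_faces(rng):
    """Number of rolls of a fair die needed to see all six faces."""
    seen = set()
    rolls = 0
    while len(seen) < 6:
        seen.add(rng.randint(1, 6))
        rolls += 1
    return rolls

def average(simulate_once, reps=20000, seed=0):
    """Average of simulate_once(rng) over `reps` independent repetitions."""
    rng = random.Random(seed)
    return sum(simulate_once(rng) for _ in range(reps)) / reps

# n = 10 is an arbitrary illustrative number of rolls
print(average(lambda rng: distinct_faces(10, rng)))
print(average(rolls_until_all_faces))
```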


## Checklist

Your submission should have the following parts:

#### Part A (Written)

- 1a, 1b, 1c, 1d, 1e, 1f
- 2a, 2b, 2c, 2d, 2e
- 3a, 3b, 3c
- 4a
- 5a, 5b, 5c, 5d
- 6a, 6b, 6c, 6d

#### Part B (Code)

- 4b, 4c


## Submission Instructions

#### Part A (Written)
- Make sure you have at least 6 pages of homework. Each problem should start on a new page; for example, Problem 1 on page 1, Problem 2 on page 2, etc.
- Scan all the pages into a PDF. It is your responsibility to check that all the work on the scanned pages is legible. You can use any scanner or a phone using applications such as CamScanner. Save the PDF.
- Upload the scanned PDF of your work onto Gradescope for the assignment "HW04a (Written)". 
Refer to [this guide](http://gradescope-static-assets.s3-us-west-2.amazonaws.com/help/submitting_hw_guide.pdf) for detailed instructions about scanning and submitting, or consult course staff.

#### Part B (Code)

1. **Save your notebook using File > Save and Checkpoint.**
2. Run the cell below to generate a PDF file.
3. Download the PDF file and confirm that none of your work is missing or cut off.
4. Submit the assignment to "HW04b (Code)" on Gradescope.

In [None]:
import gsExport
gsExport.generateSubmission("hw04.ipynb")