<table align="left" style="border-style: hidden" class="table"> <tr><td class="col-md-2"><img style="float" src="../logo.png" alt="Prob140 Logo" style="width: 120px;"/></td><td><div align="left"><h3 style="margin-top: 0;">Probability for Data Science</h3><h4 style="margin-top: 20px;">UC Berkeley, Fall 2023</h4><p>Ani Adhikari and Alexander Strang</p>CC BY-NC-SA 4.0</div></td></tr></table><!-- not in pdf -->

This content is protected and may not be shared, uploaded, or distributed.

In [None]:
import warnings
warnings.filterwarnings('ignore')

from datascience import *
from prob140 import *
import numpy as np
from scipy import stats
from scipy import special

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Homework 3 #

### Instructions

Your homework will generally have two components: a written portion and a portion that also involves code. Written work should be completed on paper, and coding questions should be done in the notebook. Start the work for the written portions of each section on a new page. You are welcome to $\LaTeX$ your answers to the written portions, but staff will not be able to assist you with $\LaTeX$-related issues.

It is your responsibility to ensure that both components of the homework are submitted completely and properly to Gradescope. **Make sure to assign each page of your PDF to the correct question. Refer to the bottom of the notebook for submission instructions.**

Every answer should contain a calculation or reasoning. For example, a calculation such as $(1/3)(0.8) + (2/3)(0.7)$ or `sum([(1/3)*0.8, (2/3)*0.7])` is fine without further explanation or simplification. If we want you to simplify, we'll ask you to. But just ${5 \choose 2}$ by itself is not fine; write "we want any 2 out of the 5 frogs and they can appear in any order" or whatever reasoning you used. Reasoning can be brief and abbreviated, e.g. "product rule" or "not mutually exclusive."

### Code Resources

* [`Data 8` Code Reference](http://data8.org/sp22/python-reference.html)
* [`Data 140` Code Reference](http://prob140.org/assets/references/final_reference_fa18.pdf)

## 1. College Degrees

In the U.S., 38% of adults aged 25 and over have a four-year college degree.

In each part below, write a math expression for the chance and provide a brief justification. Please use the appropriate summation, not "...". Then use the appropriate code cell to find the numerical value of the chance, using `stats.binom.pmf`, `stats.binom.cdf`, and arithmetic. See [the textbook](http://prob140.org/textbook/content/Chapter_06/01_Binomial_Distribution.html#binomial-probabilities-in-python) for a reference. The `stats` module of `scipy` has been imported in the top cell of this notebook.
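
As a minimal sketch of how `stats.binom.pmf` and `stats.binom.cdf` fit together, the cell below uses a made-up setting that is not part of this question: the number of heads in 10 tosses of a fair coin. The names `n_tosses` and `p_heads` and the numbers chosen are illustrative assumptions.

```python
from scipy import stats

# Hypothetical example: number of heads in 10 tosses of a fair coin
n_tosses = 10
p_heads = 0.5

# P(exactly 3 heads)
exactly_3 = stats.binom.pmf(3, n_tosses, p_heads)

# P(at most 3 heads) = P(0) + P(1) + P(2) + P(3)
at_most_3 = stats.binom.cdf(3, n_tosses, p_heads)

# P(at least 4 heads), by the complement rule
at_least_4 = 1 - stats.binom.cdf(3, n_tosses, p_heads)

# The cdf is the running sum of the pmf, so this should agree with at_most_3
sum_of_pmf = sum(stats.binom.pmf(j, n_tosses, p_heads) for j in range(4))

print(exactly_3, at_most_3, at_least_4, sum_of_pmf)
```

Note that "at least 4" is the complement of "at most 3", which is why the argument of `cdf` is 3 rather than 4.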

In what follows, we will use the term *population* to mean US adults aged 25 and over, and a *successful draw* to mean a draw that results in a person who has a four-year college degree.

**a)** Suppose I draw from the population at random with replacement. What is the chance that at least one-third of my first 30 draws are successful?

**b)** Suppose I draw from the population at random with replacement, until 10 of my draws are successful. What is the chance that I draw at most 30 times? How is this answer related to the answer in Part **a**?

**c)** Suppose my friend and I both draw from the population at random with replacement, independently of each other. Suppose I make 30 draws and my friend makes 20 draws. What is the chance that I get more successful draws than my friend?

In [None]:
# Answer to a. You can use more than one line of code.
...

In [None]:
# Answer to b. You can use more than one line of code.
...

In [None]:
# Answer to c. You can use more than one line of code.
...

\newpage

## 2. Poisson Approximation at Both Ends ##
Consider $n$ independent Bernoulli $(p)$ trials.

**a)** Fill in the blanks with names of distributions along with parameters in parentheses: If $n = 1000$ and $p=0.003$, the distribution of the number of successes is exactly $\underline{~~~~~~~~~~~~~~~~~~~~~~~~~~~}$ $(\underline{~~~~~~~~~~~~ })$ and approximately $\underline{~~~~~~~~~~~~~~~~~~~~~~ }$ $(\underline{~~~~~~~~~~~~ })$.

**b)** Let $n$ be large and let $p$ be close to 1. Find a Poisson approximation to $p_k$ (the chance of $k$ successes) by an appropriate use of the Poisson approximation to the binomial derived in the textbook. 

**Note:** Don't try to derive a new limit from scratch. Just use the limit already derived in the textbook, but appropriately.

**c)** Plot the probability histogram of the binomial (1000, 0.997) distribution, and overlay your Poisson approximation from Part **b**. For computing Poisson probabilities, see the [textbook](http://prob140.org/textbook/content/Chapter_06/06_Law_of_Small_Numbers.html#poisson-probabilities-in-python). Please don't plot the entire range of the binomial. Choose an informative range of values on the horizontal axis.
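
For orientation, here is a small sketch of the small-$p$ approximation derived in the textbook, using illustrative parameters ($n = 500$, $p = 0.01$) that are assumptions chosen only for this example. It compares `stats.binom.pmf` with `stats.poisson.pmf` on a short range of values. It does not address the case where $p$ is close to 1.

```python
import numpy as np
from scipy import stats

# Hypothetical parameters (not the ones in this question): large n, small p
n_trials = 500
p_success = 0.01
mu = n_trials * p_success          # Poisson parameter np = 5

values = np.arange(16)             # an informative range around mu
binomial_pmf = stats.binom.pmf(values, n_trials, p_success)
poisson_pmf = stats.poisson.pmf(values, mu)

# Largest pointwise gap between the exact and approximate probabilities
max_gap = np.max(np.abs(binomial_pmf - poisson_pmf))
print(max_gap)
```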

In [None]:
n = 1000
p = 0.997

k = ...                # array of possible values
binomial_probs = ...   # array of exact binomial probabilities

def poisson_approximation_pmf(j):
    """Returns the Poisson approximation to the
    exact binomial probability of j successes"""
    return ...

exact_binomial = Table().values(k).probabilities(...)
poisson_approximation = Table().values(k).probability_function(...)

Plots(...)
plt.xlim(..., ...)

\newpage

## 3. Related Distribution Families ##

As you have seen, the binomial distribution is closely related to the hypergeometric and Poisson distributions. Here are two more connections.

**Note:** Both of the connections arise via conditioning. For an example of how to go about finding the conditional distribution of one random variable given the value of another, study the steps in [Section 6.2.3](http://prob140.org/textbook/content/Chapter_06/02_Examples.html#conditional-distribution-of-the-trial-that-results-in-the-first-success) of the textbook.  

For all of your answers below, remember that when you are finding a distribution, you must first **provide the possible values**.
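
The general recipe (possible values first, then the division rule) can be sketched on an unrelated toy example. The setting below, two independent fair dice with the first die called $X$ and the second $Y$, is an assumption made purely for illustration. It finds the conditional distribution of $X$ given $X + Y = 5$ using exact fractions.

```python
from fractions import Fraction

faces = range(1, 7)   # possible values of each die
target_sum = 5        # the given value of X + Y

# Step 1: possible values of X given X + Y = 5 (Y = 5 - X must also be a face)
possible_x = [x for x in faces if (target_sum - x) in faces]

# Step 2: P(X = x, X + Y = 5) = P(X = x, Y = 5 - x) = (1/6)(1/6) by independence
joint = {x: Fraction(1, 6) * Fraction(1, 6) for x in possible_x}

# Step 3: P(X + Y = 5) is the sum of the joint probabilities over the possible x
p_given_event = sum(joint.values())

# Step 4: division rule
conditional = {x: joint[x] / p_given_event for x in possible_x}
print(conditional)
```

In this toy case the conditional distribution is uniform on 1, 2, 3, 4. The same four steps apply when the joint probabilities come from other families.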

**a)** Let $1 \le n < m$ be integers. Consider $m$ i.i.d. Bernoulli trials with probability $p$ of success on each trial. Let $X$ be the number of successes in the first $n$ trials and let $S$ be the number of successes in all $m$ trials.

For a fixed integer $s$, find the conditional distribution of $X$ given $S=s$. Identify it as one of the famous ones and provide the parameters in terms of $m$, $n$, and $s$.

**b)** Let $X$ and $Y$ be independent random variables such that $X$ has the Poisson $(\mu)$ distribution and $Y$ has the Poisson $(\lambda)$ distribution. 

For a fixed integer $n$, find the conditional distribution of $X$ given  $X+Y = n$. Identify it as one of the famous ones and provide the parameters in terms of $n$, $\mu$, and $\lambda$.

\newpage

## 4. Poissonization ##
The math of Poissonization works out beautifully but the process can seem abstract. In this exercise you will carry out the Poissonization process to help make the results more concrete. 

**a) Fixed Number of Rolls.** Let's start with a more familiar setting. Suppose you roll a die 12 times. Let $X_1$ be the number of times the face with one spot appears. Complete the cell below to plot the distribution of $X_1$. In the last line, enter the name of the distribution and the parameters. [Section 6.6](http://prob140.org/textbook/content/Chapter_06/06_Law_of_Small_Numbers.html#poisson-probabilities-in-python) contains relevant examples of code.

In [None]:
k = np.arange(13)   # possible values of X_1: 0, 1, ..., 12
dist_X1_probs = ...
dist_X1 = Table().values(k).probabilities(dist_X1_probs)
Plot(dist_X1)
plt.title('... (...) Distribution');


**b) Poisson Number of Rolls.** Now begin a simulation study of Poissonization. First, here are some computational notes.

**Poisson:** To simulate a Poisson (`mu`) random variable once, use `stats.poisson.rvs` as in the cell below. You can also just enter the numerical value of `mu` directly in the argument. Run the cell a few times to see how the output changes. Ctrl-Return works well for this.

In [None]:
mu = 5
stats.poisson.rvs(mu)


If you want to generate `n` independent copies of a Poisson (`mu`) random variable, use the `size` argument:

In [None]:
n = 10
stats.poisson.rvs(mu, size=n)
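
If you want the same simulated values every time a cell runs, `stats.poisson.rvs` also accepts a `random_state` argument. The sketch below is only an illustration: the seed `140` and the sample size are arbitrary choices. It checks that the average and variance of many draws are both close to `mu`, as expected for a Poisson distribution.

```python
from scipy import stats

mu = 5
num_draws = 100_000   # arbitrary sample size for this illustration

# Fixing random_state makes the draws reproducible from run to run
draws = stats.poisson.rvs(mu, size=num_draws, random_state=140)

sample_mean = draws.mean()
sample_variance = draws.var()
print(sample_mean, sample_variance)
```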


**Multinomial:** You can use `np.random.multinomial` to simulate the counts in all the categories in multinomial trials. As an example, suppose you want to simulate 50 independent and identically distributed trials such that the result of each one is red with chance 30%, green with chance 60%, and orange with chance 10%. 

The first argument is the number of trials. The second argument can be an array or a list of the probabilities of the categories on a single trial.

In [None]:
np.random.multinomial(50, [0.3, 0.6, 0.1])


The output is an array of the simulated counts in the three categories in the order in which the probabilities were specified (in this case, "red, green, orange"). The sum of the counts equals the first argument.
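
`np.random.multinomial` also takes an optional `size` argument for repeating the whole experiment. As a brief sketch with the same illustrative probabilities, the call below repeats the 50 trials four times (the choice of four is arbitrary) and returns one row of counts per repetition.

```python
import numpy as np

category_probs = [0.3, 0.6, 0.1]   # red, green, orange

# Four independent repetitions of 50 multinomial trials
repeated_counts = np.random.multinomial(50, category_probs, size=4)

print(repeated_counts)             # 4 rows, 3 columns
print(repeated_counts.sum(axis=1)) # each row of counts sums to 50
```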

Data 8 note: The `datascience` library method [`sample_proportions`](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html#comparison-with-panels-selected-at-random) is based on `np.random.multinomial`.


**Appending Rows:** To augment a table with a row, use `tbl.append`. Unlike the `Table` methods commonly used in Data 8, `tbl.append` modifies `tbl`; it does not create a copy of it. This method was used in Step 4 of the [Monty Hall simulation](https://inferentialthinking.com/chapters/09/4/Monty_Hall_Problem.html#simulation) in Data 8.

The cells below create an empty table with column labels and then append rows to it.

In [None]:
my_table = Table(['First Column', 'Second Column', 'Third Column'])
my_table.append([8, 100, 140])

In [None]:
my_table

In [None]:
my_table.append(np.arange(3))

In [None]:
my_table


Now you are ready to Poissonize! Suppose you roll $N$ dice where $N$ has the Poisson $(12)$ distribution. For $1 \le i \le 6$, let $N_i$ be the number of times the face with $i$ spots appears. 

Complete the code cell to simulate the following process independently 100,000 times.

- Generate a value of $N$.
- Roll a die that many times.
- Collect the values of $N_i$ for $1 \le i \le 6$.

Your simulation should result in a table `counts` that has 100,000 rows. But **please don't start with 100,000 repetitions; just test out the code for 2 repetitions until you're sure it's working.** Then you can change the number of repetitions. The full simulation might take some time to run.
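
The starter cell below passes the result of `np.append` to `counts.append`. As a brief illustration of what `np.append` returns, here is a sketch with made-up values. The names `first_value` and `other_values` are hypothetical and are not part of the assignment.

```python
import numpy as np

first_value = 3                      # a single number
other_values = np.array([10, 20])    # an array of further numbers

# np.append returns a new flat array: the first argument followed by the second
combined = np.append(first_value, other_values)
print(combined)

# The original array is left unchanged
print(other_values)
```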

In [None]:
# Array or list of the probabilities of the six faces
fair_die_probs = ...

# Optional line in case you want to generate all the values of N at once; delete if not used
...

counts = Table(['N', 'N1', 'N2', 'N3', 'N4', 'N5', 'N6'])

for ...:
    ... # Use as many lines as you need or none
    counts.append(np.append(...))
    
counts.show(5)


**c) Poissonization: The Number of Rolls.** Complete the two cells below to confirm that your simulated $N$ has the right distribution. We have restricted the range of possible values, for reasons that will be clear from the graphs. Refer to [Section 6.6](http://prob140.org/textbook/content/Chapter_06/06_Law_of_Small_Numbers.html#poisson-probabilities-in-python) for code examples relevant for the first cell.

In [None]:
# Probability distribution of N
j = np.arange(36)
dist_N_probs = ...
dist_N = ...
Plot(...)
plt.title('Poisson (12) Distribution');

In [None]:
# Empirical distribution of N
counts...(..., bins=np.arange(-0.5, 36.6, 1))
plt.title('Empirical Distribution of N');


**d) Poissonization: The Marginals.** Run the cell below to display the empirical distribution of each $N_i$.

In [None]:
counts.drop('N').hist(bins = np.arange(-0.5, 12.6, 1), overlay=False)

Not surprisingly, they resemble each other. Do they also look the same as the distribution you plotted in Part **a**? If not, what should each distribution be? Answer this question by completing the cell below. It's fine to just copy the code from Part **a** if that distribution is valid.

In [None]:
# Probability distribution of each N_i (same for all i)
k = np.arange(12)
dist_Ni_probs = ...
... # Use as many lines as you need
plt.title('... (...) Distribution');


**e) Poissonization: A Conditional Distribution.** What is the conditional distribution of $N_2$ given $N_1 < 4$? 

- First, write the answer based on the theory, with an explanation.
- Next, complete the cell below to plot the empirical approximation to this distribution based on your simulations. The plot should be consistent with your answer based on the theory.

In [None]:
counts_restricted = counts...(...) # Applying the condition
...(..., bins = np.arange(-0.5, 12.6, 1))
plt.title('Empirical Conditional Distribution of N2 given N1 < 4');

\newpage

## 5. Counting Categories

In each part below, write the probabilities as math expressions. You don't have to find any numerical values. But please make sure there are **no infinite sums** in your answers. For distributions, you can just provide names and parameters if they are applicable. 

**a)** 18 dice are rolled. Let $X$ be the number of times the face with 1 spot appears. Let $Y$ be the number of times a multiple of 3 appears. What is the joint distribution of $X$ and $Y$? [Be careful about the possible values of the pair $(X, Y)$.]

**b)** 18 dice are rolled. Find the chance that two of the faces appear 5 times each, another two faces appear 3 times each, and the remaining two faces appear 1 time each.

**c)** Repeat Part **b** in the case when the number of dice rolled isn't 18 but is instead a random number that has the Poisson $(18)$ distribution.

**d)** As in Part **c**, roll a random number of dice, where the random number has a Poisson $(18)$ distribution. Don't touch any die that shows the face with six spots. If there are any other dice (that is, if there are dice that don't show six spots), roll those dice one more time. Then stop.

Let $S$ be the total number of dice that show the face with six spots when you stop. Find the distribution of $S$.

## Submission Instructions ##

Many assignments throughout the course will have a written portion and a code portion. Please follow the directions below to properly submit both portions.

### Written Portion ###
* Scan all the pages into a PDF. You can use any scanner or a phone app such as CamScanner. Please **DO NOT** simply take pictures using your phone.
* Please start a new page for each question. If you have already written multiple questions on the same page, you can crop the image in CamScanner or fold your page over (the old-fashioned way). This helps expedite grading.
* It is your responsibility to check that all the work on all the scanned pages is legible.
* If you used $\LaTeX$ to do the written portions, you do not need to do any scanning; you can just download the whole notebook as a PDF via LaTeX.

### Code Portion ###
* Save your notebook using `File > Save and Checkpoint`.
* Generate a PDF file using `File > Download As > PDF via LaTeX`. This might take a few seconds and will automatically download a PDF version of this notebook.
    * If you have issues, please post a follow-up on the general Homework 3 Ed thread.
    
### Submitting ###
* Combine the PDFs from the written and code portions into one PDF. [Here](https://smallpdf.com/merge-pdf) is a useful tool for doing so. 
* Submit the assignment to Homework 3 on Gradescope. 
* **Make sure to assign each page of your pdf to the correct question.**
* **It is your responsibility to verify that all of your work shows up in your final PDF submission.**

If you are having difficulties scanning, uploading, or submitting your work, please read the [Ed Thread](https://edstem.org/us/courses/43303/discussion/3344107) on this topic and post a follow-up on the general Homework 3 Ed thread.

## **We will not grade assignments which do not have pages selected for each question.** ##