<table align="left" style="border-style: hidden" class="table"> <tr><td class="col-md-2"><img style="float: left; width: 200px; height: 200px;" src="../logo.png" alt="Data 140 Logo" style="width: 120px;"/></td><td><div align="left"><h3 style="margin-top: 0;">Probability for Data Science</h3><h4 style="margin-top: 20px;">UC Berkeley, Fall 2025</h4><p>Alexander Strang</p>CC BY-NC-SA 4.0</div></td></tr></table><!-- not in pdf -->

This content is protected and may not be shared, uploaded, or distributed.

In [None]:
# SETUP
# The main libraries
import numpy as np
from datascience import *
from prob140 import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Lab 2: Right Tails and Sample Minima (Due Monday, September 8th at 5PM)


### Right Hand Tails

To find the distribution of a discrete random variable $X$, you need $P(X = x)$ for each possible value $x$. For some random variables, finding this probability is made easier by first finding a related probability such as $P(X \leq x)$ or $P(X > x)$.

The *cumulative distribution function* or *cdf* of $X$ is the function $F$ given by

$$
F(x) ~ = ~ P(X \le x) ~~~~~~~ \text{for all } x
$$

The *right hand tail* probability at $x$, denoted $S(x)$ below, is the probability of the complementary event:

$$
S(x) ~ = ~ P(X > x) ~ = ~ 1 - P(X \le x)
$$

Though you can calculate $S$ from $F$ and vice versa, in some situations one of them is much easier to think about than the other. This lab is about a random variable for which the right hand tail is the simplest way to start. 
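
As a quick numerical check of the two definitions, here is a minimal sketch that uses only the Python standard library. The uniform distribution on the integers $0$ through $7$ is an assumption chosen for illustration, and the helper names `cdf` and `right_tail` are hypothetical.

```python
from fractions import Fraction

# Assumed for illustration: X is uniform on the integers 0 through 7
probs = {x: Fraction(1, 8) for x in range(8)}

def cdf(x):
    """F(x) = P(X <= x)."""
    return sum(p for v, p in probs.items() if v <= x)

def right_tail(x):
    """S(x) = P(X > x)."""
    return sum(p for v, p in probs.items() if v > x)

print(cdf(4), right_tail(4))   # 5/8 3/8
```

Whatever the distribution, the two quantities at the same $x$ always add up to 1.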

Here is a visualization of a right hand tail probability of a random variable $X$ that has a distribution on the integers $0$ through $7$. The gold area is the cdf at $x = 4$, and the blue area is the right tail at $x = 4$.

<img src="lab2_fig.png" width="360" height="360">

### The Sample Minimum

For random variables $X_1, X_2, \ldots, X_n$, the random variable $M = \min\{X_1, X_2, \ldots, X_n\}$ is called the *minimum* of the $n$ variables. If you think of $X_1, X_2, \ldots, X_n$ as a random sample, then $M$ is the *sample minimum*.

Data scientists are often interested in extrema (minima and maxima) of samples, for example when they are interested in the tails of an underlying population distribution, or worst-case scenarios for the sizes of errors, and so on.

Right hand tail probabilities provide the most straightforward path to finding the distributions of many random variables. Sample minima are a primary example of such variables. 

What you will learn in this lab:
- How to find the distribution of a sample minimum by creating a complete list of the outcome space, in an example where the outcome space is small enough for this to be feasible.
- How to spot errors in a common but flawed *combinatoric* or "counting" argument for finding the distribution of a sample minimum.
- The best way to find the distribution of any sample minimum, by starting with the right hand tails.

### Useful Methods

It's still early in the semester, so here is some helpful code as a refresher. Please consult the one-page [**code reference sheet**](http://prob140.org/assets/references/final_reference_code_fa18.pdf) in [Resources](http://prob140.org/resources/) if you need a reminder of the syntax.

For this lab you will need some or all of:
- Array operations and NumPy functions including `item`, `diff`, `append`, and `cumsum`
- Defining functions: `def`
- Iteration: `for` (or any other Python method for iteration)
- `Table` methods from the `datascience` [library](https://www.data8.org/datascience/tables.html):
    - Creation: `with_columns`
    - Accessing and using values as inputs to a function: `apply`
- Distribution methods from the `prob140` [library](http://prob140.org/prob140/), operating on `Table()`:
    - Specifying the possible values: `values`
    - Specifying the probabilities: `probabilities` or `probability_function`
- Visualization methods from the `prob140` library:
    - Probability histogram of an integer-valued random variable: `Plot`
    - Events: `event`
    - Overlaid probability histograms of two integer-valued random variables: `Plots`

## Identify Your Lab Partner

This is a multiple choice question. Please select **ONE** of the following options that best describes how you are completing Lab 2.

- I am doing this lab by myself and I don't have a partner.
- My partner for this lab is [PARTNER'S NAME] with email [berkeley.edu email address]. [SUBMITTER'S NAME] will submit to Gradescope and add the other partner to the group on Gradescope after submission.

Please copy and paste **ONE** of the above statements and fill in the blanks if needed. If you work with a partner, make sure only one of you submits on Gradescope and that the other member of the group is added to the submission on Gradescope. Refer to the bottom of the notebook for submission instructions.


**Type your answer in this cell.**

\newpage

## Section 1: Complete Enumeration ##

If the sample consists of a small number of draws from a small population, and all the samples are equally likely, then we can find the distribution of the sample minimum by simply listing all the samples, finding the minimum of each one, and counting how many outcomes have the same minimum value.

This method is intractable and cumbersome in almost all situations of interest to data scientists. But it offers us a way to double-check more abstract methods. After we develop a more abstract method, we can apply it in a setting with a small sample and small population, and check that its results agree with the results we get by complete enumeration. You will do that later in this lab.
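
To see the method on a deliberately tiny case, here is a sketch that enumerates 2 rolls of a 4-sided die. The die, the number of rolls, and the variable names are assumptions for illustration, and the sketch uses only the standard library rather than the `datascience` table workflow used in this lab.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 5)                          # hypothetical 4-sided die
outcomes = list(product(faces, repeat=2))    # all 16 equally likely outcomes

# Count how many outcomes share each minimum value
counts = Counter(min(omega) for omega in outcomes)

# Convert the counts to probabilities
dist_min = {m: Fraction(c, len(outcomes)) for m, c in sorted(counts.items())}

for m, p in dist_min.items():
    print(m, p)
```

With only 16 outcomes, the result can be checked by hand against a list of all the pairs.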

In this section you will use complete enumeration to find the distribution of the minimum number of spots you see on 7 rolls of a die. We will call this random variable $M$.

The code will closely follow the code used to find the distribution of the sum of 5 rolls of a die in [Section 3.1](http://prob140.org/textbook/content/Chapter_03/01_Functions_on_an_Outcome_Space.html) and [Section 3.2](http://prob140.org/textbook/content/Chapter_03/02_Distributions.html) of the textbook. **Please have the textbook at hand.**

### 1a) The Outcome Space and the Random Variable
Start by constructing the space of all possible outcomes of 7 rolls of a die, and augmenting it with the probability of each outcome as well as the value of $M$ for each outcome.

As in Section 3.1, we have some importing to do. Run the cell below. 

In [None]:
from itertools import product

What is the total number of outcomes of 7 rolls of a die? Enter the answer as a numerical expression in the cell below.

In [None]:
...

That's a lot of outcomes, but a `datascience` table can hold them all. Complete the cell below to create a table `seven_rolls_min` that has three columns in the following order.
- `omega`: The outcomes of 7 rolls of a die
- `M(omega)`: The minimum value in each outcome
- `P(omega)`: The probability of each outcome

Enter expressions (not multiple lines) to fill in the blanks in the cell below. They should be the appropriate modifications of corresponding expressions in Section 3.1.

In [None]:
die = np.arange(1, 7) # array of numbers of spots on the faces of the die

seven_rolls = ...  # All possible outcomes of 7 rolls

seven_rolls_probs = ...  # Array of the chances of the above outcomes

# Create a table with outcomes and their probabilities
seven_rolls_space = Table().with_columns(
    'omega', seven_rolls,
    'P(omega)', seven_rolls_probs
)

# Add a column M(omega) that contains the minimum of each outcome omega
# and put the columns in the specified order
seven_rolls_min = seven_rolls_space.with_columns(
    'M(omega)', ...
).move_to_end('P(omega)')


seven_rolls_min

### 1b) The Distribution of the Minimum
Fill in the blanks in the cell below to create the distribution object `dist_M`. This is a table that contains the distribution of $M$. It has two columns:
- `Value`: The possible values of $M$
- `Probability`: The probabilities of those values

In [None]:
dist_M = seven_rolls_min.drop('omega')...
dist_M = dist_M.relabeled(0, 'Value').relabeled(1, 'Probability')
dist_M

Add all the probabilities to confirm that you have a distribution.

In [None]:
...

If you computed the distribution correctly, one of your results should be that $P(M = 2)$ is just over $0.22$. We will refer to this result in a later section.

Run the cell below to visualize the distribution. Notice that it is bunched towards the left end of the possible values. This is typical for distributions of sample minima, since the variable is the smallest value observed in the sample.

In [None]:
Plot(dist_M)
plt.title('Minimum of 7 Rolls of a Die');

\newpage

## Section 2: An Approach that is Tempting but <span style="color: #FF0000">Wrong</span>

There is a common *wrong* approach to finding the distribution of a sample minimum. It is so tempting and plausible at first glance that it is worth examining how it fails.

We will apply it to try to find the distribution of $M$, the minimum of 7 rolls of a die. You know what this distribution should be because you found it in Section 1.

### 2a) Identifying the Error

Here are the steps of this proposed approach, being used for finding $P(M = 2)$.

- **Step 1**: Somewhere among the 7 rolls there must be one that shows 2 spots. Each single roll shows 2 spots with chance 1/6.
- **Step 2**: Since the minimum has to be 2, each of the other 6 rolls must show 2 or more spots. For each roll, the chance of this is 5/6.
- **Step 3**: $P(M = 2)$ is the sum: P(Roll 1 shows 2 and all the others show 2 or more) + P(Roll 2 shows 2 and all the others show 2 or more) + ... + P(Roll 7 shows 2 and all the others show 2 or more) $ = 7 \cdot (1/6) \cdot (5/6)^6 $

Something is clearly wrong with this, since in Part **1b** you found $P(M = 2) \approx 0.22$ whereas:

In [None]:
7 * (1/6) * ( (5/6) ** 6 )

Identify where the "method" goes wrong, as follows. For each of the above proposed Steps 1, 2, and 3, say whether it is true or false. If it is true, you don't have to give a reason. If it is false, explain where the logic fails.


**Type your answer in this cell.** 

**Step 1:**

**Step 2:**

**Step 3:**

### 2b) Ways to Spot Over-Counting

Think back to the list of outcomes in Part **1a**. When we calculate $P(M = 2)$, each outcome $\omega$ for which $M(\omega) = 2$ should be counted exactly once. But consider how many times the proposed "method" in Part **2a** counts some outcomes. Fill in the two cells below with the corresponding counts. You should just enter a number or a numerical expression.

In [None]:
# Goal: To find P(M = 2) using the "method" proposed in 2a

# Number of times the "method" counts the outcome [2 2 2 2 2 2 2]
...

In [None]:
# Goal: To find P(M = 2) using the "method" proposed in 2a

# Number of times the "method" counts the outcome [2 3 2 4 2 5 6]
...

Finally, see what happens if you try to apply the "method" to find $P(M = 1)$. What's the worst thing about the result purporting to be $P(M = 1)$?

In the cell below, fill in each blank in the calculation with a numerical expression, and the blank in the comment with a few words.

In [None]:
# Applying the "method" proposed in 2a to find P(M = 1)

... * ... * ...  # The worst thing about this result is ...

**The moral of this story:** Counting isn't always easy. If you're going to count, then count carefully. Great care is required to correctly count outcomes for a particular value of the minimum. 

Fortunately, it is not worth the bother. You will now derive a far better method using right tails.
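
One way to make over-counting visible is to tally, for each outcome, how many terms of a Step 3 style sum include it. The sketch below does this for a hypothetical tiny case: 2 rolls of a 4-sided die with target minimum 2. The setup and the name `times_counted` are assumptions for illustration only.

```python
from itertools import product

faces = range(1, 5)   # hypothetical 4-sided die
n_rolls = 2
target = 2            # we are trying to count outcomes whose minimum is 2

def times_counted(omega):
    """Number of terms of the form 'roll i shows target and every
    other roll shows target or more' that include the outcome omega."""
    return sum(
        1
        for i in range(len(omega))
        if omega[i] == target
        and all(omega[j] >= target for j in range(len(omega)) if j != i)
    )

outcomes = list(product(faces, repeat=n_rolls))
tally = {omega: times_counted(omega) for omega in outcomes}

# Outcomes that the term-by-term sum includes more than once
multiply_counted = {omega: c for omega, c in tally.items() if c > 1}

true_count = sum(1 for omega in outcomes if min(omega) == target)
method_count = sum(tally.values())

print(multiply_counted)           # {(2, 2): 2}
print(true_count, method_count)   # 5 6
```

A careful count has every entry of `tally` equal to 0 or 1, so any larger entry points to an outcome that has been counted more than once.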

\newpage

## Section 3: The Key to Sample Minima – Right Tails

By far the best way to work with a sample minimum, whether you are finding probabilities, distributions, or (appearing soon in this course) expectations, is to use the *right hand tail* probabilities of its distribution.

### 3a) [On Paper] The Right Tails of a Distribution

We will work with positive integer valued random variables, though the method is more general.

Consider a probability distribution on the integers $1, 2, \ldots, N$. Let $p_i$ denote the probability of the integer $i$, so that $\sum_{i=1}^N p_i = 1$.

Let $X$ have the distribution above. For each non-negative integer $k$, find the right tail of $X$ at $k$. That is, find $P(X > k)$.

Your answers should be expressions involving $p_1, p_2, \ldots, p_N$, $N$, and $k$. Be careful; the expression is not the same for all $k$.

### 3b) [On Paper]  Distribution of a Sample Minimum, by Right Tails

Let random variables $X_1, X_2, \ldots, X_n$ be i.i.d. with the distribution in Part **a**, and let $M = \min\{X_1, X_2, \ldots, X_n\}$. We will call $\{X_1, X_2, \ldots, X_n\}$ the *sample*, and $M$ the *sample minimum*.

The goal is to find the distribution of $M$. Find the following and justify your answers.

**(i)** The possible values of $M$.

**(ii)** For each non-negative integer $k$, an expression for the right tail probability $P(M > k)$ in terms of $p_1, p_2, \ldots, p_N$, $N$, $k$, and $n$. Be careful; the expression is not the same for all $k$.

**(iii)** Use (i) and (ii) to find the distribution of $M$. Make sure your probabilities make sense for the smallest and largest possible value.

[Suggestion for (iii): It might help to draw a number line and mark the events $\{M > k\}$ for a few values of $k$.]

### 3c) Computing the Distribution

Define a function `distribution_of_minimum` that takes the underlying distribution of the sample and the i.i.d. sample size, and returns the distribution of the sample minimum. 

In the notation of Parts **a** and **b**, the arguments of `distribution_of_minimum` should be:
- The probabilities $p_1, p_2, \ldots, p_N$ as an array
- The i.i.d. sample size $n$

The function should return the distribution of the sample minimum as a `prob140` library "distribution object", which is just a table that has column labels `Value` and `Probability`.

To define the function, fill in each blank in the cell below with an expression, not multiple lines of code. **Use array operations**. They make your code simple and clear. Do not use loops, list comprehensions, or other iterative methods. 

There is a short guide to code at the start of this lab. Remember that the input array contains all the probabilities in a distribution; `np.cumsum` adds elements cumulatively, starting with Item 0; `np.diff` subtracts successive elements; and so on. 

It is fine to assume that Item 0 of the probability array is $p_1$ and Item $N-1$ is $p_N$. Thus you can infer the possible values from the probability array. Be careful about array lengths and endpoints when using `np.cumsum` and `np.diff`. 
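
As a refresher on the array mechanics only, the sketch below applies `np.cumsum`, `np.diff`, and `np.append` to an arbitrary array that has nothing to do with the lab's probabilities. The array `a` is a made-up example; the point is how the output lengths and endpoints behave.

```python
import numpy as np

a = np.array([5, 3, 2, 1])     # arbitrary example array

running = np.cumsum(a)         # array([ 5,  8, 10, 11]); same length as a
steps = np.diff(a)             # array([-2, -1, -1]); one element shorter than a
padded = np.append(a, 0)       # array([5, 3, 2, 1, 0]); a new array with 0 added at the end

print(len(a), len(running), len(steps), len(padded))   # 4 4 3 5
```

Note that `np.diff` computes `a[i+1] - a[i]`, so both the sign of the result and the missing endpoint need attention.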

In [None]:
def distribution_of_minimum(prob_array, n):
    N = ...      # the value of N
    vals = ... # array of possible values of M, the sample minimum
    
    # Array containing P(X > k) for k = 1, 2, ... N, where X is a generic sampled element
    right_tails_X = ...
    
    # Array containing P(M > k) for the possible values of M
    right_tails_M = ...
    
    # Array containing P(M = k) for the possible values of M
    probs_M = ...
    
    # The rest of the code creates the distribution object and returns it.
    distribution_of_M = Table().values(vals).probabilities(probs_M)
    return distribution_of_M

Run the cell below to test whether your function is returning the correct distribution for the minimum of 7 rolls of a die. The output should be the same as `dist_M` in Part **1b**.

In [None]:
fair_die = np.ones(6)/6  # array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
distribution_of_minimum(fair_die, 7)

### 3d) Minimum Household Size in a Sample
The website Statista provides the [distributions of household sizes](https://www.statista.com/statistics/242189/disitribution-of-households-in-the-us-by-household-size/) in the United States, for various years. Go to the website and notice their choice of stacked bars to represent the proportions of households of different sizes. If you hover your cursor over one of the stacks, you can see the percents in that stack. 

Notice also:
- No household has 0 members, so the bottom bar (blue) represents households with one member.
- The distributions have been *truncated* at 7. That is, the last category of sizes is "7 or more". This truncation is at the high end of the distribution and therefore will not have a noticeable effect on the distribution of the minimum of a random sample. In fact, we will just assume that the category consists of exactly 7 people, as this too won't have much effect on the distribution of the sample minimum.

We will work with the 1970 and 2022 distributions. As discussed above, we'll call the final category "7 persons", not "7 or more".

First, run the two cells below for some cleanup to compensate for the fact that the proportions provided don't quite add up to 1. The array `h_size_1970` is the household size probability array for one random draw from US households in 1970. We have just normalized Statista's proportions by the total and rounded the results.

In [None]:
# US household size distribution, 1970
h_size_1970 = np.array([0.1711, 0.2892, 0.1727, 0.1576, 0.1033, 0.0557, 0.0504])
h_size_1970 = np.round(h_size_1970/sum(h_size_1970), 5)
h_size_1970, sum(h_size_1970)

In [None]:
# US household size distribution, 2022
h_size_2022 = np.array([0.2888, 0.3472, 0.1510, 0.1234, 0.0556, 0.0215, 0.0126])
h_size_2022 = np.round(h_size_2022/sum(h_size_2022), 5)
h_size_2022, sum(h_size_2022)

Run the next cell to see overlaid histograms of the two distributions.

In [None]:
sizes = np.arange(1, 8)
dist_size_1970 = Table().values(sizes).probabilities(h_size_1970)
dist_size_2022 = Table().values(sizes).probabilities(h_size_2022)
Plots('1970', dist_size_1970, '2022', dist_size_2022)
plt.title('US Household Size Distribution (truncated)');

The two distributions reflect changes in the US over half a century. Briefly describe the main difference that you see, and say how you think it will affect the distribution of the minimum of an i.i.d. random sample of US households taken in each of the two years. You can assume the two samples have the same size.


**Type your answer in this cell.**

Run the cell below to see the distribution of the sample minimum for i.i.d. samples of size 10 drawn in each of the two years. Without drawing new histograms, say what the distributions would look like for a sample size of 100 instead of 10.

In [None]:
dist_M_1970 = distribution_of_minimum(h_size_1970, 10)
dist_M_2022 = distribution_of_minimum(h_size_2022, 10)
Plots('1970 Sample Minimum', dist_M_1970, '2022 Sample Minimum', dist_M_2022)


**Type your answer in this cell.**

## Conclusion ##

You now know:

- How to find distributions of random variables by listing all the elements of the outcome space.
- Finding distributions by complete enumeration is not feasible for large outcome spaces, as in situations where the population and sample sizes are large.
- When you count outcomes that satisfy a specified condition, every outcome that satisfies the condition must be counted exactly once. You have to be careful about over-counting and under-counting. 
- To find the distribution of a discrete random variable $X$, it might be complicated to find the chance of all the outcomes that satisfy the condition $\{X = x\}$. For a sample minimum $M$, it is much easier to identify the outcomes that satisfy $\{M > x\}$ and hence find the right tail probability $P(M > x)$ for each $x$. You can then combine those probabilities appropriately to find $P(M = x)$.
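
For an integer-valued random variable, the combination step in the last bullet can be written as the following identity, stated here for a generic $X$:

$$
P(X = x) ~ = ~ P(X > x - 1) ~ - ~ P(X > x)
$$

This holds because, when $X$ takes only integer values, the event $\{X > x - 1\}$ splits into the two disjoint events $\{X = x\}$ and $\{X > x\}$.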

Sample minima are just one example of random variables whose right tails are useful. You'll see a few more as the course progresses.

### <span style="color: #113399">You have completed Lab 2. Congratulations!</span> ###

Please follow the submission instructions below to ensure that you have submitted the lab correctly.

## Submission Instructions ##

Many assignments throughout the course will have a written portion and a code portion. Please follow the directions below to properly submit both portions.

### Written Portion ###
* Scan all the pages into a PDF. You can use any scanner or a phone using applications such as CamScanner. Please **DO NOT** simply take pictures using your phone.
* Please start a new page for each question. If you have already written multiple questions on the same page, you can crop the image in CamScanner or fold your page over (the old-fashioned way). This helps expedite grading.
* It is your responsibility to check that all the work on all the scanned pages is legible.
* If you used $\LaTeX$ to do the written portions, you do not need to do any scanning; you can just download the whole notebook as a PDF via LaTeX.

### Code Portion ###
* Save your notebook using `File > Save Notebook`.
* Generate a PDF file using `File > Save and Export Notebook As > PDF`. This might take a few seconds and will automatically download a PDF version of this notebook.
    * If you have issues, please post a follow-up on the general Lab 2 Ed thread.
    
### Submitting ###
* Combine the PDFs from the written and code portions into one PDF. [Here](https://smallpdf.com/merge-pdf) is a useful tool for doing so. 
* Submit the assignment to Lab 2 on Gradescope. 
* **Make sure to assign each page of your pdf to the correct question.**
* **It is your responsibility to verify that all of your work shows up in your final PDF submission.**

If you are having difficulties scanning, uploading, or submitting your work, please read the [Ed Thread](https://edstem.org/us/courses/83687/discussion/6882870) on this topic and post a follow-up on the general Lab 2 Ed thread.

## **We will not grade assignments which do not have pages selected for each question.** ##