In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab03.ipynb")

# Run the cell below

To run a code cell (that is, to execute the Python code inside it), click the play button on the ribbon underneath the name of the notebook. Before you begin, click the "Run cell" button at the top that looks like ▶| or hold down `Shift` + `Return`.

## Lab 03: Probability Distributions

Welcome to Advanced Topics in Data Science for High School! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

**Collaboration Policy:**

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on questions in labs, so ask a neighbor or an instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** _just_ copy/paste someone else's code, but rather work together to gain understanding of the task you need to complete. 

**Note:** For autograded probability questions, the provided tests will only check that your answer is within a reasonable range.

**Due Date:**

## Today's Assignment 

In today's assignment, you'll learn about:

- the [`seaborn` library](https://seaborn.pydata.org/). 

  **Note:** The `seaborn` library is a Python data visualization library based on `matplotlib`. It provides a high-level interface for drawing attractive and informative statistical graphics.
  
First, set up the imports by running the cell below.

In [None]:
import numpy as np
import seaborn as sns
sns.set()

## Distributions

Visualizing distributions, both categorical and numerical, helps us understand variability. In Foundations of Data Science you visualized numerical distributions by drawing [histograms](https://www.inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html#A-Histogram), which look like bar charts but represent proportions by the *areas* of the bars instead of the heights or lengths. In this exercise you will use the [`histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html) function in `seaborn` instead of the corresponding `Table` method (from Foundations of Data Science) to draw histograms.

[`seaborn`](https://seaborn.pydata.org/index.html) is a Python data visualization library based on [`matplotlib`](https://matplotlib.org/). It provides a high-level interface for drawing statistical graphics. 

To start off, suppose we want to plot the probability distribution of the number of dots on a single roll of a die. That should be a flat histogram since the chance of each of the values 1 through 6 is $\frac{1}{6}$. Here is a first attempt at drawing the histogram.

In [None]:
faces = range(1, 7)
sns.histplot(faces);

This default plot is not helpful. We have to choose some arguments to get a visualization that we can interpret. 

Note that `histplot` chooses the default bins automatically, so the bars are not aligned with the possible values 1 through 6.

Let's redraw the histogram with bins of unit length centered at the possible values. By the end of the exercise you'll see a reason for centering.

In [None]:
unit_bins = np.arange(0.5, 6.6)
sns.histplot(faces, bins=unit_bins);

That's much better, but look at the vertical axis. It is not drawn to the [density scale](https://inferentialthinking.com/chapters/07/2/Visualizing_Numerical_Distributions.html#the-vertical-axis-density-scale) defined in Foundations of Data Science. We want a histogram of a probability distribution, so the total area should be 1. 

To fix this we can use the parameter `stat='density'`.

In [None]:
sns.histplot(faces, bins=unit_bins, stat='density');

That's the probability histogram of the number of spots on one roll of a die. The proportion is $\frac{1}{6}$ in each of the bins.
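
To confirm numerically that the density scale makes the total area equal to 1, here is a minimal sketch using `np.histogram`. The lab does not use this function for plotting; the sketch assumes that its `density=True` option normalizes bar heights the same way, so that height times width gives the proportion in each bin.

```python
import numpy as np

faces = np.arange(1, 7)
unit_bins = np.arange(0.5, 6.6)  # bin edges 0.5, 1.5, ..., 6.5

# With density=True, the heights are scaled so the bar areas sum to 1
heights, edges = np.histogram(faces, bins=unit_bins, density=True)
widths = np.diff(edges)          # every bin has width 1
total_area = np.sum(heights * widths)

print(heights)     # height of each of the six bars
print(total_area)  # total area of the histogram
```

Because each bin has width 1, the height of each bar equals the chance of the corresponding face.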

<!-- BEGIN QUESTION -->

**Question 1.** Define a function `integer_distribution` that takes an array of integers and draws the histogram of the distribution using unit bins centered at the integers and white edges for the bars. The histogram should be drawn to the density scale. The left-most bar should be centered at the smallest integer in the array, and the right-most bar at the largest.

Your function does not have to check that the input is an array consisting only of integers. The display does not need to include the printed proportions and bins.

**Hint:** If you have trouble defining the function, go back and carefully read all the lines of code that resulted in the probability histogram of the number of spots on one roll of a die. Pay special attention to the bins.

In [None]:
def integer_distribution(x):
    ...

integer_distribution(np.arange(1,6))

<!-- END QUESTION -->

## `SciPy` and `special`

Factorials and the binomial coefficients 

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

get large very quickly as $n$ gets large. One way to compute them is to use the `SciPy` module `special`. `SciPy` is a collection of Python-based software for math, probability, statistics, science, and engineering.

In [None]:
from scipy import special

Below are two examples of `special.factorial`.

In [None]:
special.factorial(5)

If `exact=True`, the output will be calculated exactly using long integer arithmetic.

In [None]:
special.factorial(range(1, 10), exact=True)

Traditionally, subsets of $k$ items out of a population of $n$ items are called **combinations**, and so `special.comb(n, k)` evaluates to $\binom{n}{k}$. 

**Note:** We will always use the term **subsets** to mean un-ordered sets. We will use **permutations** in situations where we need to keep track of the order in which the elements appear.
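
The distinction between subsets and permutations can be checked with a small count. The sketch below uses `special.perm`, which is not used elsewhere in this lab; the values $n = 5$ and $k = 3$ are chosen only for illustration.

```python
from math import factorial
from scipy import special

n, k = 5, 3

# Number of un-ordered subsets of size k
num_subsets = special.comb(n, k, exact=True)

# Number of ordered arrangements (permutations) of k items chosen from n
num_ordered = special.perm(n, k, exact=True)

# Each subset of size k can be ordered in k! ways
print(num_subsets, num_ordered, num_subsets * factorial(k))
```

Every subset of size $k$ corresponds to $k!$ permutations, which is why the permutation count is larger by that factor.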

Look carefully at the code and output (including types) from running the next 5 code cells before starting **Question 2**.

In [None]:
special.comb(5, 3)

In [None]:
special.factorial(5)/(special.factorial(3)*special.factorial(2))

In [None]:
special.comb(5, range(6))

In [None]:
special.comb(100, 50)

In [None]:
special.comb(100, 50, exact=True)

Consider a population in which a proportion $p$ of individuals are called *"successes"* (or 1, if you prefer) and the remaining proportion are rudely called *"failures"* (or 0). 

Let $n$ be a positive integer. If you draw a sample of size $n$ at random with replacement from the population then the chance that you get $k$ successes and $n-k$ failures in your sample is 

$$\binom{n}{k}p^k(1-p)^{n-k}$$ for $0 \le k \le n$. 

You saw examples of binomial probabilities in the Permutations and Combinations assignment.
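
As a quick check of the formula above, here is a minimal sketch for a single value of $k$. The scenario (4 tosses of a fair coin, exactly 2 heads) is an illustrative assumption and is not one of the lab questions.

```python
from scipy import special

n, k, p = 4, 2, 0.5  # 4 tosses, 2 heads, chance of heads 0.5

# Binomial formula: (n choose k) * p^k * (1 - p)^(n - k)
prob_two_heads = special.comb(n, k) * p**k * (1 - p)**(n - k)
print(prob_two_heads)
```

There are $\binom{4}{2} = 6$ ways to place the 2 heads, and each specific sequence of 4 tosses has chance $0.5^4$.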

**Question 2.** **Question 5** of the Permutations and Combinations assignment asks

*"What is the probability that, if challenged, Eddy could make at least 3 out of 5 free throws?"*

Given the fact that Eddy shoots 60% from the free-throw line, complete the cell with a Python expression that evaluates to a `NumPy` array whose elements are the chances of $k$ successes for $k = 0, 1, 2, 3, 4, 5$. Assign your array to `all_probs`. Your array should sum to 1.

In [None]:
all_probs = ...
all_probs

In [None]:
grader.check("q2")

**Question 3.** Suppose you sample 100 times at random with replacement from a population in which 26% of the individuals are called "successes" (that's traditional terminology in probability). Write a Python expression that evaluates to the chance that the sample has 20 successes. 

**Computational Note:** Don't import any other libraries; just use the ones already imported and plug into the formula above **Question 2**. This is far from the best approach numerically, but it is fine for the numbers involved in this lab. In this course, we are not focusing on issues of numerical accuracy.


In [None]:
prob_q3 = ...
prob_q3

In [None]:
grader.check("q3")

**Question 4.** Suppose we want to know the theoretical probability that the number of successes in a sample of 100 is in the interval 20 to 30 (inclusive on both sides), where the probability of a success is 0.26. Calculate this probability and assign it to the variable `prob_q4`.


**Hint:** You can use the `special.comb` function to compute binomial probabilities.

In [None]:
prob_q4 = ...
prob_q4

In [None]:
grader.check("q4")

Let $n$ be a positive integer and let $s$ be an integer such that $0 \le s \le n$. Consider a sample of size $n$ drawn at random with replacement from a population in which a proportion $p$ of the individuals are called successes. The probability that the number of successes in the sample is at most $s$ can be written as 

$$\sum_{k=0}^s \binom{n}{k}p^k(1-p)^{n-k}$$

In statistics classes this probability is typically denoted $P(S \le s)$, where $S$ denotes the random number of successes in the sample. Formal definitions of the pieces of this notation aren't particularly helpful for our purposes. Just read it as *"the probability that the number of successes is at most $s$."*
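
As a small sanity check of the summation, here is a worked example with values chosen only for illustration: $n = 2$, $p = \frac{1}{2}$, and $s = 1$.

$$P(S \le 1) = \binom{2}{0}\left(\frac{1}{2}\right)^0\left(\frac{1}{2}\right)^2 + \binom{2}{1}\left(\frac{1}{2}\right)^1\left(\frac{1}{2}\right)^1 = \frac{1}{4} + \frac{1}{2} = \frac{3}{4}$$

This agrees with the complement: the only way to get more than 1 success in 2 draws is to get 2 successes, which has chance $\frac{1}{4}$.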

<!-- BEGIN QUESTION -->

**Question 5.** Fill in the function `prob_at_most`, which takes $n$, $p$, and $s$ and returns $P(S \le s)$ as defined in the summation above.

**Note:** Make sure that the inputs are valid. For example, if $p > 1$ or $s > n$ then return 0. 

**Warning:** Check your function to make sure you are returning the correct probability before continuing with the lab.

In [None]:
def prob_at_most(n, p, s):
    """Returns the probability of S <= s
       n: sample size
       p : proportion of successes
       s: number of successes at most"""
    ...

<!-- END QUESTION -->

## Polling

**Question 6.**  In an election, supporters of Candidate C are in a minority. Only 45% of the voters in the population favor the candidate. Suppose a survey organization takes a sample of 200 voters from this population. Use `prob_at_most` to write an expression that evaluates to the chance that a majority (more than half) of the sampled voters favor Candidate C.

In [None]:
p_majority = ...
p_majority

In [None]:
grader.check("q6")

Suppose five different survey organizations each take a sample of voters at random with replacement from the population of voters in **Question 6**, independently of the samples drawn by the other organizations. 

- Three of the organizations use a sample size of 200
- One organization uses a sample size of 300
- One organization uses a sample size of 400

**Question 7.**  Write an expression that evaluates to the chance that in at least one of the five samples the majority of voters favor Candidate C. You can use any quantity or function defined earlier in this lab.

**Hints:** 

- When two or more events are independent, the probability that all of them will happen is the product of their individual probabilities.

- The event that something occurs at least once is the complement of the event that it never occurs. This means that the probability of the event never occurring and the probability of the event occurring at least once add up to 1.
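
The two hints above can be combined as in the following minimal sketch. The three event probabilities are hypothetical values chosen for illustration and are unrelated to the polling questions.

```python
import numpy as np

# Hypothetical chances of three independent events (illustrative values only)
event_probs = np.array([0.1, 0.2, 0.5])

# Independence: the chance that none of the events happens is the
# product of the chances that each one individually does not happen
prob_none = np.prod(1 - event_probs)

# Complement: "at least one happens" is the opposite of "none happens"
prob_at_least_one = 1 - prob_none
print(prob_none, prob_at_least_one)
```

Working through the complement is usually easier than adding up every way in which one or more of the events could occur.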

In [None]:
prob_candidate_c = ...
prob_candidate_c

In [None]:
grader.check("q7")

# Simulating Risk Probability

Studies indicate that about 90% of the booked passengers actually arrive for their flights. Suppose that a small plane has 75 seats and, for the purpose of this exercise, assume that passengers arrive independently of each other. 

**Note:** This assumption is not realistic (for example, many people do not travel alone).

Most of the time airlines overbook flights (that is, the airline sells more tickets than there are seats on the plane). They do this because some passengers fail to show up, and the plane would otherwise fly with empty seats. However, if they overbook, they run the risk of having more passengers than seats. This will undoubtedly upset passengers. 

To entice overbooked passengers to give up their seats, the airline will give vouchers for free flights or offer money to passengers to take a later flight.

With these risks in mind the airline decides to sell 80 tickets. You and your friends see this as an opportunity to design an application that will help the airlines determine if this is a good strategy.
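
To show what simulating a probability can look like, here is a minimal sketch for a different scenario: estimating the chance of more than 7 heads in 10 tosses of a fair coin. It assumes NumPy's random number generator (`np.random.default_rng` and its `binomial` method), which this lab has not introduced. The seed and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
num_trials = 100_000

# Each entry is the number of heads in one simulated set of 10 tosses
heads = rng.binomial(n=10, p=0.5, size=num_trials)

# Proportion of simulated sets with more than 7 heads
prop_more_than_7 = np.mean(heads > 7)
print(prop_more_than_7)
```

The simulated proportion should be close to the exact binomial probability $\frac{56}{1024}$, and it gets closer as the number of trials grows.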

<!-- BEGIN QUESTION -->

**Question 8.** Read the article [5 things to know about Southwest's disastrous meltdown](https://www.npr.org/2022/12/30/1146377342/5-things-to-know-about-southwests-disastrous-meltdown). 

- What did the article claim to be the major reason behind this debacle? Do you agree? 
- Do you think the company can prevent this from happening again in the future? Why, or why not? 

Write a paragraph that addresses each of the previously mentioned questions.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 9.** Write a function named `number_of_tickets` that simulates selling 80 tickets for a flight on an airplane with 75 available seats. Use it to simulate the probability that the number of passengers who arrive for the flight is more than the number of seats available. Store this proportion in `prop_tickets`. 

**Note:** Make sure that the inputs are valid. For example, if $p > 1$ or $s > n$ then return 0.

In [None]:
def number_of_tickets(n, p, s):
    """Returns the probability of S > s, where S is the number of passengers who arrive
       n: number of tickets sold 
       p: proportion of passengers expected to arrive 
       s: number of available seats"""
    ...
prop_tickets = number_of_tickets(80, 0.9, 75)
prop_tickets

In [None]:
grader.check("q9")

**Question 10.** If the airline wants to keep the probability of having more than 75 passengers show up to get on the flight to around 1%, what is the maximum number of tickets they should sell? Assign this value to `tickets_to_sell`.

In [None]:
...
tickets_to_sell

In [None]:
grader.check("q10")

<!-- BEGIN QUESTION -->

**Question 11.** In 2-3 sentences, explain how you determined your answer for **Question 10**.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)