# CMM 262: Introduction to Statistics, Day 2/2 (Distributions and Hypothesis Testing)

### Contributors

* **Clarence Mah**, Ph.D. (CMM 262, 2020-2021)
* **Michelle Franc Ragsac**, Ph.D. (CMM 262, 2020-2021, 2026)
* **Graham McVicker**, Ph.D. (CMM 262, 2022-2024)

### Importing Packages of Interest

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# The new additions for this week!
import numpy as np
import scipy.stats as stats

## Exercise 1: Discrete Distributions

### Plotting the Probability Mass Function (PMF) for the Binomial Distribution

In this notebook, we'll cover the general notation for working with the `scipy.stats` module in the context of the **binomial distribution**. Luckily, other probability distributions implemented in `scipy.stats` follow similar notation and parameter structure, so what you learn here is transferable to other distributions!

As a refresher, the binomial distribution describes the number of "successes" ($k$) observed from $n$ independent Bernoulli trials (e.g., coin flips), each with success probability $p$. The formula for the binomial **probability mass function (PMF)** can be represented as:

$$ f(k) = {n \choose k} p^k (1-p)^{n-k} $$
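
As a quick sanity check on the formula, the sketch below evaluates the PMF by hand and compares it with `stats.binom.pmf()`. The variable names (`pmf_by_hand`, `pmf_scipy`) are ours and purely illustrative; the parameters are $n=10$ and $p=0.1$.

```python
from math import comb

import numpy as np
from scipy import stats

n, p = 10, 0.1
k = np.arange(n + 1)  # every possible number of successes: 0, 1, ..., n

# PMF written out directly from the formula above
pmf_by_hand = np.array(
    [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
)

# The same values computed by scipy.stats
pmf_scipy = stats.binom.pmf(k, n, p)

print(np.allclose(pmf_by_hand, pmf_scipy))  # do the two approaches agree?
print(pmf_scipy.sum())                      # a PMF sums to 1 over all k
```

The argument order `pmf(k, n, p)` is the pattern to remember: the value(s) to evaluate come first, followed by the distribution's parameters.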

To start off this notebook, let's learn how to model the binomial PMF and plot the values of the binomial PMF with the following parameters: $n=10$ and $p=0.1$.

In [None]:
# TODO
# ...

<div class="alert alert-block alert-warning">
<b>Interactive Exercise #1: Plotting the Binomial Probability Mass Function</b>
    <p></p>
    <p>Generate a plot depicting the binomial probability mass function (PMF) for <font face="Latin Modern">n = 15</font> and <font face="Latin Modern">p = 0.5</font>.</p>
</div>

In [None]:
# TODO
# ...

### Sampling from the Binomial Distribution ($n=15$, $p=0.5$)

In the next code cell, we can try sampling from our distribution 100 times and plotting the results as a histogram. Let's do it multiple times and see how the histogram of samples changes from run to run!
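
A minimal sketch of the sampling step, assuming we only want to inspect the counts numerically rather than plot them. The seed and the choice of three repeats are arbitrary and exist only to make the sketch repeatable.

```python
import numpy as np
from scipy import stats

n, p = 15, 0.5
rng = np.random.default_rng(262)  # arbitrary seed, for repeatability only

for repeat in range(3):
    # Draw 100 values; each one is the number of successes in n trials
    samples = stats.binom.rvs(n, p, size=100, random_state=rng)
    # Count how often each outcome 0..n appeared (the bar heights of a histogram)
    counts = np.bincount(samples, minlength=n + 1)
    print(f"repeat {repeat}: mean = {samples.mean():.2f}, counts = {counts}")
```

Each repeat gives slightly different counts even though the underlying distribution is unchanged. To see this as a figure, `samples` could be passed to a histogram function such as `plt.hist()`.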

In [None]:
# TODO
# ...

### The Binomial Distribution in Practice: The Palmer Penguins Dataset

Now that we know how to use the `scipy.stats` module to model the binomial distribution (i.e., calculate the PMF for different values of $n$, $k$, and $p$ parameters) as well as sample from the distribution (i.e., with the `stats.binom.rvs()` function), let's dive into a practical application of probability distributions.

The Palmer Penguins dataset we used in the previous lecture contains `sex` information (i.e., `male` and `female`) for our penguins! In this portion of the notebook, we will try to model the theoretical distribution of `sex` in our population and compare it to what we observe in our sample.

First, let's see how many `male` penguins we have in our sample.

In [None]:
# TODO
# ...

<div class="alert alert-block alert-warning">
<b>Interactive Exercise #2: Probability of Encountering Male Penguins in the Palmer Penguins Dataset</b>
    <p></p>
    <p>Assuming that the probability of encountering a <code>male</code> penguin is 50%, what is the probability of observing <b>exactly this number</b> of <code>male</code> penguins within our sample? (<i>Hint</i>: You can use the <code>stats.binom.pmf()</code> function with our parameters).</p>
</div>

In [None]:
# TODO
# ...

<div class="alert alert-block alert-warning">
<b>Interactive Exercise #3: Plotting the Binomial Probability Mass Function with the Palmer Penguin Parameters</b>
    <p></p>
    <p>Generate a plot depicting the binomial probability mass function (PMF) for <font face="Latin Modern">n = total number of penguins</font> and <font face="Latin Modern">p = 0.5</font> to model the PMF for our population of penguins.</p>
</div>

In [None]:
# TODO
# ...

---

## Exercise 2: The Central Limit Theorem

### Modeling Body Mass in the Palmer Penguins Dataset

In [None]:
# TODO
# ...

<div class="alert alert-block alert-warning">
<b>Interactive Exercise #3: Modeling Body Mass in the Palmer Penguins Dataset</b>
    <p></p>
    <p>Say we had a larger research budget, allowing us to sample 40 penguins during each expedition! Generate a plot to depict the results of sampling the body mass of penguins over 1,000 expeditions and compare it to the theoretical distribution. How does the standard error change with this increase in sample size?</p>
</div>

In [None]:
# TODO
# ...
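
As a companion to the exercise above, here is a hedged sketch of how the standard error of the mean shrinks with sample size. It uses a simulated, deliberately skewed population (**not** the penguin data), and the sample sizes of 10 and 40 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for repeatability only

# Hypothetical skewed population; these are simulated values, not penguin data
population = rng.exponential(scale=2.0, size=100_000)
sigma = population.std()

for n in (10, 40):  # hypothetical number of individuals per "expedition"
    # 1,000 expeditions, each sampling n individuals with replacement
    samples = rng.choice(population, size=(1000, n), replace=True)
    sample_means = samples.mean(axis=1)

    observed_se = sample_means.std(ddof=1)  # spread of the sample means
    theoretical_se = sigma / np.sqrt(n)     # prediction from sigma / sqrt(n)
    print(f"n = {n}: observed SE = {observed_se:.3f}, "
          f"sigma/sqrt(n) = {theoretical_se:.3f}")
```

Even though the simulated population is far from normal, the spread of the sample means should track $\sigma / \sqrt{n}$, so quadrupling the sample size roughly halves the standard error.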

## Exercise 3: Hypothesis Testing

### Analyzing Body Mass Differences between Species

In this section of the notebook, let's test whether the mean body mass of `Adelie` penguins differs from that of `Chinstrap` penguins.

First, we can compute a z-score for the difference in the means of the two penguin species. The formula for that is:

$$ z = \frac{( \bar{x}_1 - \bar{x}_2 ) - ( \mu_1 - \mu_2 ) }{ \sqrt{ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} } } $$
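
The formula can be sketched as a small helper function. Two assumptions are baked in: the null hypothesis is $\mu_1 - \mu_2 = 0$ (the `null_diff` default), and the sample variances stand in for $\sigma_1^2$ and $\sigma_2^2$, which is reasonable only for fairly large samples. The function name `two_sample_z` and the simulated inputs are hypothetical and are **not** the penguin measurements.

```python
import numpy as np
from scipy import stats


def two_sample_z(x1, x2, null_diff=0.0):
    """z-score for a difference in means, using sample variances for sigma^2."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Standard error of the difference: sqrt(s1^2 / n1 + s2^2 / n2)
    se = np.sqrt(x1.var(ddof=1) / x1.size + x2.var(ddof=1) / x2.size)
    z = ((x1.mean() - x2.mean()) - null_diff) / se
    # Two-sided p-value from the standard normal distribution
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value


# Simulated values with arbitrary parameters, NOT the penguin data
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50.0, scale=5.0, size=150)
group_b = rng.normal(loc=50.0, scale=4.0, size=70)

z, p_value = two_sample_z(group_a, group_b)
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

With real data, `group_a` and `group_b` would be replaced by the body mass values for each species after dropping missing entries.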

In [None]:
# TODO
# ...

---

## Exercise 4: Correlation Tests and Linear Models

### Evaluating the Correlation Between Bill Length and Body Mass

<div class="alert alert-block alert-warning">
<b>Interactive Exercise #4: Visualizing the Relationship Between Two Quantitative Variables</b>
    <p></p>
    <p>Generate a scatter plot to compare the relationship between <code>bill_length_mm</code> and <code>body_mass_g</code>, differentiating the points in the plot by penguin <code>species</code>.</p>
</div>

In [None]:
# TODO
# ...