<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Normal-Distribution" data-toc-modified-id="Normal-Distribution-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Normal Distribution</a></span><ul class="toc-item"><li><span><a href="#Why-a-Normal-Distribution?" data-toc-modified-id="Why-a-Normal-Distribution?-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Why a Normal Distribution?</a></span></li><li><span><a href="#Normal-Curve-==-Awesome-Math--😎" data-toc-modified-id="Normal-Curve-==-Awesome-Math--😎-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Normal Curve == Awesome Math  😎</a></span><ul class="toc-item"><li><span><a href="#🧠-Knowledge-Check" data-toc-modified-id="🧠-Knowledge-Check-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>🧠 Knowledge Check</a></span></li><li><span><a href="#More-Normal-Curves!" data-toc-modified-id="More-Normal-Curves!-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>More Normal Curves!</a></span></li></ul></li><li><span><a href="#Standard-Normal-Distribution" data-toc-modified-id="Standard-Normal-Distribution-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Standard Normal Distribution</a></span><ul class="toc-item"><li><span><a href="#$z$-Score" data-toc-modified-id="$z$-Score-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>$z$-Score</a></span><ul class="toc-item"><li><span><a href="#🧠-Knowledge-Check" data-toc-modified-id="🧠-Knowledge-Check-2.3.1.1"><span class="toc-item-num">2.3.1.1&nbsp;&nbsp;</span>🧠 Knowledge Check</a></span></li></ul></li><li><span><a href="#The-Empirical-Rule" data-toc-modified-id="The-Empirical-Rule-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>The Empirical Rule</a></span><ul class="toc-item"><li><span><a href="#🧠-Knowledge-Check" data-toc-modified-id="🧠-Knowledge-Check-2.3.2.1"><span 
class="toc-item-num">2.3.2.1&nbsp;&nbsp;</span>🧠 Knowledge Check</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Exercises" data-toc-modified-id="Exercises-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exercises</a></span><ul class="toc-item"><li><span><a href="#Height-$z$-score" data-toc-modified-id="Height-$z$-score-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Height $z$-score</a></span></li><li><span><a href="#Height-Empirical-Rule" data-toc-modified-id="Height-Empirical-Rule-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Height Empirical Rule</a></span></li><li><span><a href="#Height-Percentile" data-toc-modified-id="Height-Percentile-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Height Percentile</a></span></li><li><span><a href="#Bonus" data-toc-modified-id="Bonus-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Bonus</a></span>
</li></ul></li></ul></div>

# Objectives

* Describe the normal distribution
* Calculate $z$-scores from a normal distribution through standardization
* Describe the normal distribution's Empirical Rule

In [None]:
from scipy import stats
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

# Normal Distribution

The **normal distribution** (also called the **normal curve** or **bell curve**) is one of the most common distributions, and it's very useful to us in statistics.

![](images/normal_curve_animation.gif)

## Why a Normal Distribution?

It turns out the normal distribution describes many phenomena, at least approximately. Think of anything that clusters around a typical value:

- human body temperatures
- sizes of elephants
- sizes of stars
- populations of cities
- IQ
- heart rate

Among human beings, 98.6 degrees Fahrenheit is an _average_ body temperature. Many folks' temperatures won't measure _exactly_ 98.6 degrees, but most measurements will be _close_. It is much more common to have a body temperature close to 98.6 (whether slightly more or slightly less) than it is to have a body temperature far from 98.6 (whether significantly more or significantly less). This is a hallmark of a normally distributed variable.

Similarly, there are large elephants and there are small elephants, but most elephants are near the average size.


## Normal Curve == Awesome Math  😎

We can describe a normal curve with just **two parameters**: $\mu$ (the mean) and $\sigma^2$ (the variance). You may see the notation $N(\mu, \sigma^2)$, which emphasizes that only two parameters are needed to describe the distribution.


In [None]:
fig, ax = plt.subplots()

# Variables for our mean, 'mu', and our standard deviation, 'sigma'
mu = 0
sigma = 1
# This defines the points along the x-axis
# (ppf = percent point function, the inverse of the CDF)
x = np.linspace(
        stats.norm(mu, sigma).ppf(0.01), # Start plotting at the 1st percentile
        stats.norm(mu, sigma).ppf(0.99), # End plotting at the 99th percentile
        100                              # Number of points
)
# The values at x given by the normal curve (with mu & sigma)
y = stats.norm(mu, sigma).pdf(x)
ax.plot(x, y, 'r-');

### Normal PDF

If you're curious about how we can mathematically define a normal curve, we give this below. (Don't worry, you don't need to recall the mathematical definition.)

<details>

$\Large f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x - \mu)^2}{2\sigma^2}}$

This might look complicated at first, but what you should focus on is that there are really only two parameters, $\mu$ and $\sigma$, that determine $f(x)$ given $x$.
</details>
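As a quick sanity check, the formula above can be coded by hand and compared against `scipy.stats.norm`. The helper name `normal_pdf` is ours, purely for illustration; this is a minimal sketch and not a claim about how SciPy computes the density internally.

```python
import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma):
    """Evaluate the normal density by hand (hypothetical helper, for illustration)."""
    coefficient = 1 / np.sqrt(2 * np.pi * sigma**2)
    exponent = -(x - mu)**2 / (2 * sigma**2)
    return coefficient * np.exp(exponent)

x_values = np.linspace(-3, 3, 7)
by_hand = normal_pdf(x_values, mu=0, sigma=1)
from_scipy = stats.norm(loc=0, scale=1).pdf(x_values)

# The two should agree up to floating-point error
formulas_agree = np.allclose(by_hand, from_scipy)
print(formulas_agree)
```

Changing `mu` or `sigma` in both calls should leave the comparison unchanged, since only those two parameters enter the formula.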

In [None]:
def plot_normal_curve(mu, sigma, ax):
    # This defines the points along the x-axis
    x = np.linspace(
            stats.norm(mu,sigma).ppf(0.01), # Start plotting here
            stats.norm(mu,sigma).ppf(0.99), # End plotting here
            100                             # Number of points
    )
    # The values at x given by the normal curve (with mu & sigma)
    y = stats.norm(mu, sigma).pdf(x)
    ax.plot(x, y, 'r-')
    return ax

In [None]:
fig, ax = plt.subplots()
plot_normal_curve(mu=10, sigma=1, ax=ax)

### 🧠 Knowledge Check

What would the distribution look like if we make $\sigma$ smaller or bigger?

In [None]:
fig, axs = plt.subplots(nrows=3, sharex=True, sharey=True, figsize=(6, 8))

for n, ax in enumerate(axs, start=1):
    # Make sigma bigger each time, keeping mu fixed at 0
    mu = 0
    sigma = n
    plot_normal_curve(mu=mu, sigma=sigma, ax=ax)
    ax.set_title(rf'$\mu$:{mu}, $\sigma$:{sigma}')

plt.tight_layout()

What would the distribution look like if we make $\mu$ smaller or bigger?

In [None]:
fig, axs = plt.subplots(nrows=3, sharex=True, sharey=True, figsize=(6, 8))

for n, ax in enumerate(axs):
    # Make mu bigger each time, keeping sigma fixed at 1
    mu = n
    sigma = 1
    plot_normal_curve(mu=mu, sigma=sigma, ax=ax)
    ax.set_title(rf'$\mu$:{mu}, $\sigma$:{sigma}')

plt.tight_layout()

### More Normal Curves!

We can now describe any normal curve by setting the mean and the variance!

In [None]:
# Function to plot multiple normal curves
def plot_normal_curves(parameters_list, ax):
    ''' Use a list of parameters (in dictionary form) to plot multiple normal 
        curves.
    '''
    for params in parameters_list:
        mu = params.get('mu')
        sigma = params.get('sigma')
        style = params.get('style','')
        # This defines the points along the x-axis
        x = np.linspace(
                stats.norm(mu, sigma).ppf(0.01), # Start plotting here
                stats.norm(mu, sigma).ppf(0.99), # End plotting here
                100                             # Number of points
        )
        # The values at x given by the normal curve (with mu & sigma)
        y = stats.norm(mu, sigma).pdf(x)
        ax.plot(x, y, 
                style, 
                linewidth=4, 
                label=rf'$\mu={mu}$, $\sigma={sigma}$')
    ax.legend()
    return ax

In [None]:
fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(12, 6))
 
# mean, standard deviation, graphing style
normal_curve_parameters = {
    # Normal curves centered at zero
    'center': [
        {'mu': 0, 'sigma': 1, 'style': 'y-'},
        {'mu': 0, 'sigma': 0.5, 'style': 'b-'},
        {'mu': 0, 'sigma': 2, 'style': 'g-'}
    ],
    # Same normal curves but with different means
    'off-center': [
        {'mu': 1, 'sigma': 1, 'style': 'y-'},
        {'mu': 2, 'sigma': 0.5, 'style': 'b-'},
        {'mu': 5, 'sigma': 2, 'style': 'g-'}
    ]
}

ax = plot_normal_curves(normal_curve_parameters['center'], ax0)
ax.set_title('Center')
 
    
ax = plot_normal_curves(normal_curve_parameters['off-center'], ax1)
ax.set_title('Off-Center')

# Neat output
plt.tight_layout()

## Standard Normal Distribution

A special normal distribution called the **standard normal distribution** has a mean of 0 and variance of 1. This is also known as a z distribution.

Since we know that the shape of a normal distribution changes based on its mean and variance, we'll typically convert or **standardize** our normal distribution to the standard normal distribution.

We simply subtract the mean $\mu$ from each value and then divide by the standard deviation $\sigma$:

$$\frac{x - \mu}{\sigma}$$

We call this process **standardization**.

![norm_to_z](images/norm_to_z.png)

In [None]:
# Let's transform a sample from the normal distribution centered on 5
# with a standard deviation of 2 into a standard normal

normal_dist = np.random.normal(loc=5, scale=2, size=1000)
# Vectorized: the mean and standard deviation are each computed only once
z_dist = (normal_dist - np.mean(normal_dist)) / np.std(normal_dist)

fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, figsize=(10, 6))
sns.kdeplot(data=normal_dist, ax=ax0); # "data" works in older seaborn versions;
                                        # newer versions also accept "x"
ax0.set_title('Before Standardization')
sns.kdeplot(data=z_dist, ax=ax1);
ax1.set_title('After Standardization')
plt.tight_layout()

Talking about the standard normal distribution can be very convenient since the values correspond to the number of standard deviations above or below the mean.
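The same standardization is also available as `scipy.stats.zscore`. The sketch below, which uses a seeded random sample purely so that it is reproducible, checks that the manual calculation and the SciPy helper agree, and that the standardized values have mean 0 and standard deviation 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)   # seeded only so the sketch is reproducible
sample = rng.normal(loc=5, scale=2, size=1000)

# Manual standardization: subtract the mean, divide by the standard deviation
manual_z = (sample - sample.mean()) / sample.std()
# SciPy's helper uses the population standard deviation (ddof=0) by default
scipy_z = stats.zscore(sample)

print(np.allclose(manual_z, scipy_z))
print(round(float(manual_z.mean()), 6), round(float(manual_z.std()), 6))
```

Note that both versions standardize with the sample's own mean and standard deviation, not the "true" values of 5 and 2 used to generate it.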

### $z$-Score

A **$z$-score** for a data point $x$ (in a normal distribution) is simply the distance to the mean in units of standard deviations:

$$\large z = \frac{x - \mu}{\sigma}$$

By calculating the $z$-score of an individual point, we can see how unlikely a value is.

Here's a little site with some [interactive Gaussians](https://www.intmath.com/counting-probability/normal-distribution-graph-interactive.php).
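To make "how unlikely" concrete, here is a small sketch that computes a $z$-score and then asks the standard normal distribution how much probability lies above it. The numbers (a mean of 100, a standard deviation of 15, and an observed value of 130) are made up for illustration, as are the variable names.

```python
from scipy import stats

# Made-up example values (assumptions, not data from this notebook)
example_mean = 100
example_sd = 15
example_value = 130

# Distance from the mean in units of standard deviations
z = (example_value - example_mean) / example_sd
# P(Z > z) for a standard normal; sf is the survival function, 1 - cdf
upper_tail = stats.norm.sf(z)

print(f'z = {z:.2f}, P(Z > z) = {upper_tail:.4f}')
```

The larger the absolute value of $z$, the smaller this tail probability, which is what makes the $z$-score a handy measure of how unusual a value is.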

#### 🧠 Knowledge Check

What would the $z$-score be for the middle of a normal curve?

<details>
    <summary>Answer</summary>
    0!
    </details>

### The Empirical Rule

> The Empirical Rule states that about $68\%$ of the values in a normal distribution lie within 1 standard deviation ($\sigma$) of the mean, about $95\%$ within $2\sigma$, and about $99.7\%$ within $3\sigma$.

This makes it really quick to look at a normal distribution and understand where values tend to lie.

<img src='https://github.com/learn-co-students/dsc-0-09-12-gaussian-distributions-online-ds-ft-031119/blob/master/normalsd.jpg?raw=true' width=700/>
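The percentages in the rule are rounded. As a sketch of where they come from, we can ask the standard normal CDF for the probability of landing within $k$ standard deviations of the mean, for $k = 1, 2, 3$. The variable names here are ours.

```python
from scipy import stats

standard_normal = stats.norm(loc=0, scale=1)

# Probability of landing within k standard deviations of the mean
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f'within {k} sigma: {prob:.4f}')
```

Because standardizing maps any normal distribution onto the standard normal, the same probabilities hold for every choice of $\mu$ and $\sigma$.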

#### 🧠 Knowledge Check

About what percentage of the values would be between a $z$-score of $-1$ and a $z$-score of $2$?

<details>
    <summary>Answer</summary>
    About $82\%$
</details>

# Exercises

## Height $z$-score

The distribution of people's heights in the United States has a mean of 66 inches and a standard deviation of 4 inches. **Calculate the z-score of a height of 76 inches.**

<details>
    <summary>Answer</summary>
    <code># z-score: z = (x - mu) / std
(76 - 66) / 4</code>
</details>

## Height Empirical Rule

Use the empirical rule and the information above to determine about what percentage of people are between **62 inches and 74 inches** tall.

<details>
    <summary>Answer 1</summary>
<code># z-scores for 62" and 74":
z_62 = (62 - 66) / 4
z_74 = (74 - 66) / 4
z_62, z_74</code>
    </details>

<details>
    <summary>Answer 2</summary>
    <code>heights = stats.norm(loc=66, scale=4)
heights.cdf(74) - heights.cdf(62)</code>
    </details>

## Height Percentile

Assuming the above distribution of people's heights in the United States is approximately normal, what percent of people have a height less than **75 inches**?

<details>
    <summary>Answer</summary>
    <code>heights.cdf(75)</code>
    </details>

## Bonus

Assuming the above distribution of people's heights in the United States is approximately normal, what range of heights contains the **middle 50% of values**? The width of this range is known as the _interquartile range_ (IQR).

<details>
    <summary>Answer</summary>
    <code>heights.ppf(0.25), heights.ppf(0.75)</code>
    </details>