# The normal distribution

This is a Jupyter Notebook.  It can be run as an interactive demo.

The Notebook contains *cells*.  This is a *text* cell.  The next cell is a *code* cell.

Press the Shift key with the Enter (or Return) key to execute a cell and move to the next cell.

You can also run a cell by clicking the Run icon at the top of the window.

In [None]:
# Execute this cell by pressing Shift and Enter at the same time.
# Libraries for plotting, statistical distributions
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as sps

In [None]:
# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')

We sample one million values from a normal distribution.  The distribution has a mean of 0 and a *standard deviation* of 1.

In [None]:
number_of_values = 1000000
values = np.random.normal(0, 1, size=number_of_values)

The first 10 values:

In [None]:
values[:10]

Plot all the values as a histogram with 250 bins:

In [None]:
plt.hist(values, bins=250);

Which values lie between -1.96 and +1.96?

In [None]:
left_threshold = -1.96
right_threshold = 1.96
betweens = (values >= left_threshold) & (values <= right_threshold)

In [None]:
# Where are the between values?
counts, bins, patches = plt.hist(values, bins=250)
plt.hist(values[betweens], bins=bins, lw=0);

In [None]:
# Proportion of values between -1.96 and 1.96
np.count_nonzero(betweens) / number_of_values

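This proportion, around 95%, is characteristic of the normal distribution: about 95% of the values from any normal distribution lie within 1.96 standard deviations of the mean.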
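The sampled proportion can be compared with the theoretical one. The cell below is a minimal sketch that uses the cumulative distribution function of the standard normal distribution from `scipy.stats`; the names `standard_normal` and `theoretical_proportion` are introduced here for illustration and are not part of the original notebook.

```python
import scipy.stats as sps

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = sps.norm(loc=0, scale=1)

# The cdf gives the probability of a value at or below a threshold, so the
# difference is the probability of landing between -1.96 and +1.96
theoretical_proportion = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)
print(theoretical_proportion)
```

The printed value should be very close to 0.95, and the proportion computed from the one million samples should agree with it to two or three decimal places.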

What if the distribution doesn't have a mean of 0 or a standard deviation of 1?

In [None]:
values_around_10 = np.random.normal(10, 3, size=number_of_values)

In [None]:
plt.hist(values_around_10, bins=250);

Now the center is at 10, and one standard deviation to the left of center is 10 - 3 = 7.

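The rule above generalizes to any mean and standard deviation: multiply the standard deviation by 1.96, then subtract the result from the mean to get the left threshold, and add it to the mean to get the right threshold.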

In [None]:
left_threshold = 10 - 3 * 1.96
right_threshold = 10 + 3 * 1.96
betweens_around_10 = (values_around_10 >= left_threshold) & (values_around_10 <= right_threshold)

In [None]:
# Where are the between values?
counts, bins, patches = plt.hist(values_around_10, bins=250)
plt.hist(values_around_10[betweens_around_10], bins=bins, lw=0);

In [None]:
# Proportion of values within 1.96 standard deviations of the mean
np.count_nonzero(betweens_around_10) / number_of_values
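It may help to see where 1.96 comes from. The sketch below assumes we want the central 95% of the distribution, which leaves 2.5% in each tail, so the right threshold sits at the 97.5th percentile. It uses `sps.norm.ppf` (the inverse of the cumulative distribution function) and `sps.norm.interval`; the variable names are illustrative only.

```python
import scipy.stats as sps

# z value at the 97.5th percentile of the standard normal distribution
z = sps.norm.ppf(0.975)
print(z)

# Apply the rule by hand for mean 10 and standard deviation 3
mean, sd = 10, 3
lower = mean - z * sd
upper = mean + z * sd
print(lower, upper)

# scipy can compute the same central 95% interval directly
interval_lower, interval_upper = sps.norm.interval(0.95, loc=mean, scale=sd)
print(interval_lower, interval_upper)
```

The two pairs of thresholds should match, and `z` should round to 1.96.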

## What's special about the normal distribution?

Let's get some values from some not-normal distributions.

Here's a not-normal distribution, the "chi-squared" distribution (with 2 degrees of freedom):

In [None]:
distribution = np.random.chisquare(2, size=number_of_values)
plt.hist(distribution, bins=100);

In [None]:
# The first 10 values
distribution[:10]

In [None]:
# Draw 50 separate samples from the chi-squared distribution, each of one million values
distributions = []
for i in range(50):
    distribution = np.random.chisquare(2, size=number_of_values)
    distributions.append(distribution)

In [None]:
# Now add the 50 samples together, element by element
added = np.zeros(number_of_values)
for distribution in distributions:
    # Add this sample to the running total
    added = added + distribution

What shape do you think the histogram of the summed values will have?

In [None]:
plt.hist(added, bins=100);
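The histogram of the sums looks much closer to a normal distribution than the original chi-squared histogram, which is what the central limit theorem leads us to expect. One rough way to check is to apply the 1.96 rule to the sums. The sketch below is self-contained and makes some assumptions that are not in the original notebook: it uses a smaller sample (100,000 values instead of one million) to save memory, a seeded generator so the result is repeatable, and its own variable names.

```python
import numpy as np

# Seeded random number generator; the seed value is arbitrary
rng = np.random.default_rng(0)

n_values = 100_000  # smaller than the notebook's one million, to save memory
n_samples = 50

# One row per sample, then add the 50 rows together element by element
samples = rng.chisquare(2, size=(n_samples, n_values))
summed = samples.sum(axis=0)

# Mean and standard deviation of the summed values
summed_mean = summed.mean()
summed_sd = summed.std()

# Proportion of sums within 1.96 standard deviations of their mean
within = ((summed >= summed_mean - 1.96 * summed_sd) &
          (summed <= summed_mean + 1.96 * summed_sd))
proportion_within = np.count_nonzero(within) / n_values
print(summed_mean, summed_sd, proportion_within)
```

A chi-squared distribution with 2 degrees of freedom has mean 2 and variance 4, so the sum of 50 independent values should have a mean near 100 and a standard deviation near the square root of 200, about 14.1. If the sums are approximately normal, the proportion should be close to 0.95, although the match is not exact because the sum keeps a little of the original skew.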