# Lesson 15: Normal Distributions

Welcome to Lesson 15! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on a question, so post to the discussion board or ask your instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** just copy/paste someone else's code, but rather work together to understand the task you need to complete.

To receive credit for this assignment, answer all questions correctly and submit before the deadline.

**Due Date:** 

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

## Today's Lesson

In today's lab, you'll learn about:

- normal distributions. 
- sample means.

Let's get started!

## Words of Caution

Remember to run the cell below. It's for setting up the environment so you can have access to what's needed for this lesson. For now, don't worry about what it means: we'll learn more about what's inside of it in the next few lessons.

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Standard Units

Standard units are a way of putting different kinds of observations on the same scale. The idea is to replace a datum by the number of standard deviations it is above the mean of the data. If a datum is above the mean, its value in standard units is positive; if it is below the mean, its value in standard units is negative. A datum that is above the mean by 2.5 times the SD is 2.5 in standard units.
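
As a small worked illustration of the definition above, the sketch below converts a made-up array to standard units by hand. The numbers are hypothetical and chosen only so the arithmetic is easy to follow; `np.std` is used with its default settings, which give the SD of the data as a whole.

```python
import numpy as np

# Hypothetical data: five made-up measurements, not from the lesson's dataset.
values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

mean = np.mean(values)   # 6.0
sd = np.std(values)      # SD of the data (NumPy's default), about 2.83

# Each entry is the number of SDs the value sits above (+) or below (-) the mean.
in_standard_units = (values - mean) / sd
print(in_standard_units)

# A value that is 2.5 SDs above the mean converts to 2.5 in standard units
# (up to floating-point rounding).
datum = mean + 2.5 * sd
datum_in_standard_units = (datum - mean) / sd
print(datum_in_standard_units)
```

Values below the mean come out negative, the value equal to the mean comes out as 0, and values above the mean come out positive.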

In [None]:
births = Table.read_table('data/baby.csv')
births.show(3)

**Question 1.** Let's write a function to convert an array to standard units.

In [None]:
def standard_units(x):
    """Convert array x to standard units."""
    return (x-np.mean(x))/np.std(x)

**Question 2.** Create an array to hold the maternal ages.

In [None]:
ages = births.column('Maternal Age')
ages

**Question 3.** Convert the maternal ages to standard units.

**Hint:** Use your function from **Question 1.**

In [None]:
ages_standard_units = standard_units(ages)
print(ages_standard_units)

Verify that the mean is 0 and standard deviation is 1.

In [None]:
np.mean(ages_standard_units), np.std(ages_standard_units)

**Question 4.** Create a table with both original and standard units.

In [None]:
both = Table().with_columns('Age in Years', ages, 'Age in Standard Units', ages_standard_units)
both

Display the average and standard deviation in the original units.

In [None]:
np.mean(ages), np.std(ages)

Plot a histogram of the distribution of the ages

In [None]:
both.hist('Age in Years', bins=np.arange(15, 46, 2))

Confirm that the distribution has the same shape in standard units.

In [None]:
both.hist('Age in Standard Units', bins=np.arange(-2.2, 3.4, 0.35))
plt.xlim(-2, 3.1);

## The SD and Bell Shaped Curves

**Question 5.** Plot a histogram of the `Maternal Height`.

In [None]:
births.hist('Maternal Height', bins=np.arange(56.5, 72.6, 1))

**Question 6.** Compute the mean and standard deviation of the maternal heights.

In [None]:
heights = births.column('Maternal Height')
np.mean(heights), np.std(heights)

**Question 7.** Estimate the inflection points of the bell-shaped curve: one SD below the mean and one SD above the mean.

In [None]:
np.mean(heights)-np.std(heights), np.mean(heights)+np.std(heights)
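
The estimate above relies on a property of bell-shaped (normal) curves: the curve switches from bending upward to bending downward one SD on either side of the mean. The sketch below is a rough numerical check of that property, assuming an exact normal curve with a hypothetical mean of 64 and SD of 2.5. These are illustrative values, not the ones computed from the data, and `normal_density` is a helper written only for this example.

```python
import numpy as np

# Hypothetical mean and SD, chosen for illustration (not computed from the data).
mu, sigma = 64.0, 2.5

def normal_density(x, mu, sigma):
    """Height of the normal curve with mean mu and SD sigma at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Evenly spaced grid covering four SDs on each side of the mean.
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 2000)
y = normal_density(x, mu, sigma)

# Second differences approximate the curvature at the interior grid points x[1:-1].
curvature = np.diff(y, n=2)

# An inflection point is where the curvature changes sign.
flips = np.where(np.diff(np.sign(curvature)) != 0)[0]
inflection_points = x[1:-1][flips]

print("Numerical estimate:", inflection_points)
print("mean - SD, mean + SD:", mu - sigma, mu + sigma)
```

The two numerical estimates should land within one grid step of `mu - sigma` and `mu + sigma`. A histogram of real data is only roughly bell shaped, so for the heights the inflection points are an estimate rather than an exact result.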

## Central Limit Theorem

In [None]:
united = Table.read_table('data/united.csv')
united_bins = np.arange(-20, 300, 10)
united

Make a histogram of the delay times.

In [None]:
united.hist('Delay', bins=united_bins)

Determine the mean and SD of the `Delay` column.

In [None]:
delays = united.column('Delay')
delay_mean = np.mean(delays)
delay_sd = np.std(delays)
delay_mean, delay_sd

**Question 8.**  Let's write a function to simulate one sample mean for the delay times.

In [None]:
def one_sample_mean(sample_size):
    """Takes a sample from the population of flights and computes its mean.
    """
    sampled_flights = united.sample(sample_size)
    return np.mean(sampled_flights.column('Delay'))

In [None]:
one_sample_mean(1000)

**Question 9.** Write a function that creates an array of 10,000 sample means of the `Delay` column.

In [None]:
def ten_thousand_sample_means(sample_size):
    """Take 10,000 samples from the population of flights, compute the
       mean of each sample, and return the means in an array.
    """
    means = make_array()
    for i in np.arange(10000):
        mean = one_sample_mean(sample_size)
        means = np.append(means, mean)
    return means

In [None]:
sample_means_100 = ten_thousand_sample_means(100)

In [None]:
sample_means_100

In [None]:
len(sample_means_100)

**Question 10.** Plot a histogram of the 10,000 sample means of flight delays.

In [None]:
Table().with_column('Mean of 100 flight delays', sample_means_100).hist(bins=20)
print('Population Average:', delay_mean)
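
The sketch below illustrates the idea behind this section using simulated data instead of `united.csv`: even when the population is strongly skewed, the means of large random samples pile up in a roughly symmetric bell shape around the population mean. The exponential population, the seed, and the `skewness` helper are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(15)

# Hypothetical right-skewed population standing in for the delay times.
population = rng.exponential(scale=20.0, size=10_000)

def skewness(x):
    """Average cubed value in standard units; roughly 0 for a symmetric shape."""
    z = (x - np.mean(x)) / np.std(x)
    return np.mean(z ** 3)

# 5,000 samples of size 100, drawn with replacement; one mean per sample.
samples = rng.choice(population, size=(5_000, 100), replace=True)
sample_means = samples.mean(axis=1)

population_skew = skewness(population)
sample_means_skew = skewness(sample_means)

print("Population mean:         ", np.mean(population))
print("Average of sample means: ", np.mean(sample_means))
print("Skewness of population:  ", population_skew)
print("Skewness of sample means:", sample_means_skew)
```

The skewness of the sample means should be much closer to 0 than the skewness of the population, and the average of the sample means should sit close to the population mean.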

**Question 11.** How does increasing the sample size impact the distribution? Now let's use 400 for our size.

In [None]:
sample_means_400 = ten_thousand_sample_means(400)
Table().with_column('Mean of 400 flight delays', sample_means_400).hist(bins=20)
print('Population Average:', delay_mean)

Let's try 900.

In [None]:
sample_means_900 = ten_thousand_sample_means(900)
Table().with_column('Mean of 900 flight delays', sample_means_900).hist(bins=20)
print('Population Average:', delay_mean)

## Distribution of the Sample Average

In [None]:
united.hist('Delay', bins=united_bins)

**Question 12.** How many possible samples of size 300 are there? Each sample produces one sample mean, so this count is an upper bound on the number of possible sample means.

In [None]:
united.num_rows

In [None]:
13825**300
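
The count above treats a sample as an ordered sequence of draws made with replacement, so a population of `N` rows gives `N**n` possible samples of size `n`. The sketch below checks that rule by brute force on a toy case, assuming a hypothetical population of 4 rows and samples of size 3.

```python
from itertools import product

# Hypothetical population of 4 rows; two rows happen to share the value 7.
population = [3, 7, 7, 10]
sample_size = 3

# Every ordered sample drawn with replacement, identified by row position.
all_samples = list(product(range(len(population)), repeat=sample_size))
num_samples = len(all_samples)

# Many different samples share the same mean, so distinct means are far fewer.
distinct_means = {sum(population[i] for i in s) / sample_size for s in all_samples}

print("Possible samples:", num_samples)                # 4**3 = 64
print("Distinct sample means:", len(distinct_means))
```

The number of possible samples grows explosively with the sample size, which is why the lesson estimates the distribution of the sample mean by simulation instead of listing every sample.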

In [None]:
united.hist('Delay', bins=united_bins)

In [None]:
means_tbl = Table().with_columns(
    '100', ten_thousand_sample_means(100),
    '256', ten_thousand_sample_means(256),
    '400', ten_thousand_sample_means(400),
    '625', ten_thousand_sample_means(625),
    '900', ten_thousand_sample_means(900)
)

In [None]:
means_tbl

In [None]:
means_tbl.hist(bins = np.arange(5, 31, 0.5))
plt.title('Distribution of Sample Average');

## Variability of the Sample Mean

In [None]:
delay_mean = np.mean(united.column('Delay'))
delay_sd = np.std(united.column('Delay'))

In [None]:
delay_mean

In [None]:
delay_sd

In [None]:
def summarize_sample_means():
    """Print the mean and SD of each column of sample means in means_tbl,
    along with the ratio of the population SD to the SD of the sample means.
    """
    print("Population mean:", delay_mean)
    print("Population SD:", delay_sd)

    for i in np.arange(means_tbl.num_columns):
        sample_means = means_tbl.column(i)
        print('')
        print("Sample size:", means_tbl.labels[i])
        print("Average of sample means:", np.mean(sample_means))
        print("SD of sample means:", np.std(sample_means))
        print("Ratio of population SD to SD of sample means:", delay_sd / np.std(sample_means))

In [None]:
summarize_sample_means()
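
When reading the summary, it helps to compare each ratio with the square root of the sample size: for random samples drawn with replacement, the SD of the sample means is approximately the population SD divided by the square root of the sample size. The sketch below checks this on simulated data, so it runs without `united.csv`. The exponential population, the seed, and the helper `sd_of_sample_means` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed population standing in for the delay times.
population = rng.exponential(scale=20.0, size=10_000)
pop_sd = np.std(population)

def sd_of_sample_means(sample_size, repetitions=2000):
    """SD of the means of `repetitions` random samples drawn with replacement."""
    samples = rng.choice(population, size=(repetitions, sample_size), replace=True)
    return np.std(samples.mean(axis=1))

# Ratio of the population SD to the SD of the sample means, by sample size.
ratios = {}
for n in (100, 400, 900):
    ratios[n] = pop_sd / sd_of_sample_means(n)
    print("Sample size:", n,
          "| ratio:", round(ratios[n], 1),
          "| square root of sample size:", np.sqrt(n))
```

Each ratio should come out close to the square root of its sample size, with small differences due to chance. Quadrupling the sample size therefore only halves the spread of the sample means.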