## Lab 7: Babies

Please complete this lab by providing answers in cells after the question. Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

This assignment is due by **11:59pm on Tuesday, March 29**.

In [None]:
# These lines import the NumPy and datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In this lab, we will look at a dataset of a sample of newborns in a large hospital system. We will treat it as if it were a simple random sample, even though the sampling was done in multiple stages. The table `births` contains the following variables for 1,174 mother-baby pairs: the baby’s birth weight in ounces, the number of gestational days, the mother’s age in completed years, the mother’s height in inches, the mother’s pregnancy weight in pounds, and whether or not the mother smoked during pregnancy.

The key question we want to answer is whether maternal smoking is associated with lower birthweights of babies. 

In [None]:
births = Table.read_table('baby.csv')
births.show(5)

Let's start by taking a look at the dataset. First, we select just the variables we want to look at. Then, since `Maternal Smoker` is a categorical variable, we group by that variable and look at summaries of the `Birth Weight` variable.

In [None]:
smoking_and_birthweight = births.select('Maternal Smoker', 'Birth Weight')
smoking_and_birthweight.group('Maternal Smoker')

In [None]:
smoking_and_birthweight.group('Maternal Smoker', collect = np.mean)

In [None]:
smoking_and_birthweight.group('Maternal Smoker', collect = np.std)

In [None]:
smoking_and_birthweight.hist('Birth Weight', group = 'Maternal Smoker')

The distribution of the weights of the babies born to mothers who smoked appears to be shifted slightly to the left of the distribution corresponding to non-smoking mothers. The weights of the babies of the mothers who smoked seem lower on average than the weights of the babies of the non-smokers.

This raises the question of whether the difference reflects just chance variation or a difference in the distributions in the larger population. Could it be that there is no difference between the two distributions in the population, but we are seeing a difference in the samples just because of the mothers who happened to be selected?

Remember, we are mainly interested in whether maternal smoking is associated with **lower** birthweights of babies. 

<font color = 'red'>**Question 1: What is the null hypothesis? What is the alternative hypothesis?**</font>

*Replace this text with your answer.*



<font color = 'red'>**Question 2: What is the statistic we want to calculate to perform the hypothesis test? Calculate the observed value of this statistic for our data.**</font>

*Hint:* Remember, we want to compare the means of the two groups. Make sure the statistic you calculate is consistent with the alternative hypothesis that we are testing!
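
As a rough illustration of the hint, the sketch below computes a difference of group means on a small made-up dataset using only the standard library. The names `difference_of_means`, `toy_weights`, and `toy_smoker` are hypothetical and the numbers are not taken from `baby.csv`. Subtracting the smokers' mean from the non-smokers' mean is one possible convention; with it, large positive values point toward lower birth weights for smokers.

```python
from statistics import mean

def difference_of_means(values, labels):
    """Mean of the False-labelled values minus mean of the True-labelled values."""
    group_false = [v for v, lab in zip(values, labels) if not lab]
    group_true = [v for v, lab in zip(values, labels) if lab]
    return mean(group_false) - mean(group_true)

# Hypothetical toy data: birth weights in ounces and smoking labels.
toy_weights = [120, 113, 128, 108, 136, 138, 132, 120]
toy_smoker = [False, False, True, True, False, False, False, True]

# Non-smoker mean minus smoker mean for the toy data.
toy_difference = difference_of_means(toy_weights, toy_smoker)
print(toy_difference)
```

Whichever order of subtraction you choose, the direction you later treat as "extreme" has to match it.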

<font color = 'red'>**Question 3: Define the function `statistic`, which takes a Table as its argument and returns the value of the statistic. Check that the function works by calling it on the `smoking_and_birthweight` table and confirming that it returns a single value. Assign that value to `observed_statistic`.**</font>

In [None]:
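def statistic(births_table):
    # Mean birth weight for each value of 'Maternal Smoker'.
    births_grouped = births_table.group('Maternal Smoker', collect = np.mean)
    means = births_grouped.column('Birth Weight mean')
    # Assumes row 0 is the False (non-smoker) group and row 1 is the True (smoker) group.
    return means.item(0) - means.item(1)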

observed_statistic = statistic(smoking_and_birthweight)
observed_statistic

If there were no difference between the two distributions in the underlying population, then whether a birth weight has the label True or False with respect to maternal smoking should make no difference to the average. The idea, then, is to shuffle all the labels randomly among the mothers. This is called random permutation.

Shuffling ensures that the count of True labels does not change, and neither does the count of False labels. This is important for the comparability of the simulated and original statistics.
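
To see why shuffling keeps the group sizes fixed, here is a minimal sketch on made-up labels using Python's standard `random` module rather than the lab's Table methods. The names `toy_labels` and `shuffled_labels` are hypothetical.

```python
import random

# Hypothetical toy labels: True marks a mother who smoked.
toy_labels = [False, False, True, True, False, False, False, True]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Shuffle a copy so the original labels are left untouched.
shuffled_labels = toy_labels[:]
rng.shuffle(shuffled_labels)

# A shuffle only rearranges the labels, so the counts cannot change.
print(sum(toy_labels), sum(shuffled_labels))  # counts of True
print(toy_labels.count(False), shuffled_labels.count(False))  # counts of False
```

Because every shuffle is a rearrangement of the same labels, each simulated table has the same number of smokers and non-smokers as the original.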

<font color = 'red'>**Question 4: Shuffle the `smoking_and_birthweight` table and assign the shuffled table to `shuffled_smoker`. Take the `Maternal Smoker` column from that shuffled table. Create a new table called `simulated_smoker` that contains the original `Birth Weight` variable as well as the new shuffled `Maternal Smoker` variable.**</font>

In [None]:
shuffled_smoker = ...
simulated_smoker = Table().with_columns("Birth Weight", ...,
                                        "Maternal Smoker", ...)

<font color = 'red'>**Question 5: Let's now see what the distribution of statistics is actually like under the null hypothesis.**</font>

Define the function `simulation_and_statistic` that shuffles the table, calculates the statistic, and returns the statistic. Then, create an array called `simulated_statistics` and use a loop to generate 5000 simulated statistics. 
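
The general pattern of "simulate once, then repeat in a loop" can be sketched on made-up data with the standard library. This is not the lab's solution: it uses plain lists instead of Tables, and the names `one_simulated_difference`, `toy_repetitions`, and `toy_simulated` are hypothetical.

```python
import random
from statistics import mean

# Hypothetical toy data, not taken from baby.csv.
toy_weights = [120, 113, 128, 108, 136, 138, 132, 120]
toy_smoker = [False, False, True, True, False, False, False, True]

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def one_simulated_difference():
    """Shuffle the labels once and return non-smoker mean minus smoker mean."""
    shuffled = toy_smoker[:]
    rng.shuffle(shuffled)
    nonsmoker = [w for w, s in zip(toy_weights, shuffled) if not s]
    smoker = [w for w, s in zip(toy_weights, shuffled) if s]
    return mean(nonsmoker) - mean(smoker)

toy_repetitions = 1000
toy_simulated = []  # collects one statistic per repetition

for _ in range(toy_repetitions):
    toy_simulated.append(one_simulated_difference())

print(len(toy_simulated))
```

Since the labels are assigned at random in every repetition, the simulated differences should be centered near zero.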


In [None]:
def simulation_and_statistic():
    '''Shuffles the smoking_and_birthweight table and calculates the statistic.
    Returns one statistic.'''
    ...
    return ...

num_repetitions = 5000

simulated_statistics = ...

for ... in ...:
    ...


We can visualize the resulting simulated statistics by putting the array into a table and using `hist`.

In [None]:
Table().with_column('Simulated Statistic', simulated_statistics).hist()
plt.title('Prediction Under the Null Hypothesis')
plt.scatter(observed_statistic, 0, color='red', s=30);

<font color = 'red'>**Question 6: Calculate the p-value.**</font>

*Hint:* Think about how you set up the alternative hypothesis and what you used for your statistic.
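
As a sketch of the idea behind the hint, an empirical p-value is the proportion of simulated statistics that are at least as extreme as the observed one, in the direction of the alternative. The example below uses made-up numbers and the hypothetical names `empirical_p_value`, `toy_simulated`, and `toy_observed`. It assumes that larger values of the statistic favor the alternative, which only holds if the statistic was defined that way.

```python
def empirical_p_value(simulated, observed):
    """Proportion of simulated statistics at least as large as the observed one."""
    count = sum(1 for s in simulated if s >= observed)
    return count / len(simulated)

# Hypothetical simulated statistics and observed value.
toy_simulated = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
toy_observed = 2.0

# Three of the ten simulated values (2.0, 2.5, 3.0) are >= 2.0.
toy_p_value = empirical_p_value(toy_simulated, toy_observed)
print(toy_p_value)
```

If the statistic had been defined with the subtraction in the opposite order, the comparison would need to be `<=` instead.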