# Lesson 12: A/B Testing: Comparing Two Samples

Welcome to Lesson 12! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on a question, so post to the discussion board or ask your instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** just copy/paste someone else's code, but rather work together to gain an understanding of the task you need to complete.

To receive credit for this assignment, answer all questions correctly and submit before the deadline.

**Due Date:** 

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

## Today's Lesson

In today's lab, you'll learn about:

- A/B testing.

Let's get started!

## Words of Caution

Remember to run the cell below. It sets up the environment and loads the libraries needed for this lesson. For now, don't worry about what it means: we'll learn more about what's inside of it in the next few lessons.

In [None]:
from datascience import *
import numpy as np
# Use the standard warnings module; newer NumPy releases no longer expose np.warnings
import warnings
warnings.filterwarnings('ignore', category = np.VisibleDeprecationWarning)

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight') 

In [None]:
births = Table.read_table('data/baby.csv')
births

**Question 1.** Make a table with the `Maternal Smoker` and `Birth Weight` columns.

In [None]:
smoking_and_birthweight = births.select('Maternal Smoker', 'Birth Weight')

In [None]:
smoking_and_birthweight.show(10)

**Question 2.** Make a histogram of the `Birth Weight` column, grouped by the status of `Maternal Smoker`.

In [None]:
smoking_and_birthweight.hist('Birth Weight', group = 'Maternal Smoker')

## Test Statistic


Use the `group` method to compute the average `Birth Weight` for each group in the `Maternal Smoker` column.

In [None]:
means_table = smoking_and_birthweight.group('Maternal Smoker', np.mean)
means_table

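**Question 3.** Use the table that results from using `.group` to determine the value of the observed test statistic: the average birth weight of the smoker group minus the average birth weight of the non-smoker group.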

In [None]:
means = means_table.column('Birth Weight mean')
observed_difference = means.item(1) - means.item(0)
observed_difference

**Question 4.** Write a function that calculates the statistic but is flexible enough to work on any table. It should take the label of the numerical column to average and the label of the grouping column.

In [None]:
def difference_of_means(table, label, group_label):

    # Create table with only the two relevant columns
    reduced = table.select(label, group_label)  
    
    # Create table containing group means
    means_table = reduced.group(group_label, np.average)
    
    # Pull just the column/array with the group means
    # Use .column(1) because the label of the means column is hard to predict
    means = means_table.column(1)
    
    # Return the difference between the two elements
    return means.item(1) - means.item(0)
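
To make the grouping step less of a black box, here is a minimal plain-Python sketch of the same computation. It does not use the `datascience` library. The function name `difference_of_means_plain` and the toy values are hypothetical (they are not taken from `baby.csv`), and the sketch assumes the grouped table lists the `False` group before the `True` group.

```python
from statistics import mean

def difference_of_means_plain(values, labels):
    """Mean of the values labeled True minus mean of the values labeled False.

    Hypothetical helper for illustration only. It mirrors
    means.item(1) - means.item(0), assuming False is listed before True.
    """
    true_values = [v for v, flag in zip(values, labels) if flag]
    false_values = [v for v, flag in zip(values, labels) if not flag]
    return mean(true_values) - mean(false_values)

# Toy data, made up for illustration
toy_weights = [120, 130, 110, 100]
toy_smoker = [False, False, True, True]
print(difference_of_means_plain(toy_weights, toy_smoker))
```

A negative result means the `True` group has the smaller average.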

**Question 5.** Use the function to calculate the observed statistic.

In [None]:
difference_of_means(births, 'Birth Weight', 'Maternal Smoker')

**Question 6.** Since the function provides flexibility, look at a few other variables.

In [None]:
difference_of_means(births, 'Gestational Days', 'Maternal Smoker')

In [None]:
difference_of_means(births, 'Maternal Age', 'Maternal Smoker')

In [None]:
difference_of_means(births, 'Maternal Height', 'Maternal Smoker')

## Random Permutation (Shuffling)

In [None]:
letters = Table().with_column('Letter', make_array('a', 'b', 'c', 'd', 'e'))
letters

In [None]:
letters.sample()

In [None]:
letters.sample(with_replacement = False)

In [None]:
shuffled_letters = letters.sample(with_replacement = False).column(0)
letters.with_column('Shuffled', shuffled_letters)
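
The two kinds of sampling in the cells above can also be sketched with Python's standard `random` module. This is only an illustration of the idea, not how the `datascience` library is implemented; the names `letters_list`, `with_repl`, and `shuffled` are hypothetical. Sampling with replacement may repeat or omit letters, while sampling every row without replacement yields a permutation of the originals.

```python
import random

letters_list = ['a', 'b', 'c', 'd', 'e']
rng = random.Random(0)  # seeded only so the sketch is repeatable

# With replacement: letters may repeat, and some may be missing
with_repl = rng.choices(letters_list, k=len(letters_list))

# Without replacement, same size: a shuffle (permutation) of the originals
shuffled = rng.sample(letters_list, k=len(letters_list))

print(with_repl)
print(shuffled)
```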

## Simulation Under Null Hypothesis

In [None]:
smoking_and_birthweight

**Question 7.** Shuffle the labels in the `Maternal Smoker` column.

In [None]:
shuffled_labels = smoking_and_birthweight.sample(with_replacement=False).column('Maternal Smoker')
shuffled_labels

In [None]:
original_and_shuffled = smoking_and_birthweight.with_column(
    'Shuffled Label', shuffled_labels
)

In [None]:
original_and_shuffled

In [None]:
original_and_shuffled.group("Shuffled Label")

In [None]:
original_and_shuffled.group("Maternal Smoker")

In [None]:
smoking_and_birthweight.group('Maternal Smoker', np.average)

In [None]:
original_and_shuffled.group('Shuffled Label', np.average)

**Question 8.** Calculate the difference of means using the `Shuffled Label` column.

In [None]:
difference_of_means(original_and_shuffled, 'Birth Weight', 'Shuffled Label')

In [None]:
difference_of_means(original_and_shuffled, 'Birth Weight', 'Maternal Smoker')

## Permutation Test

**Question 9.** Write a function that computes one simulated statistic.

In [None]:
def one_simulated_difference(table, label, group_label):

    # shuffle the rows and extract the group labels as an array
    shuffled_labels = table.sample(with_replacement = False).column(group_label)
    
    # keep only the numerical column and attach the shuffled labels as a new column
    shuffled_table = table.select(label).with_column('Shuffled Label', shuffled_labels)
    
    # return the difference of the means using the shuffled labels
    return difference_of_means(shuffled_table, label, 'Shuffled Label')   

In [None]:
one_simulated_difference(births, 'Birth Weight', 'Maternal Smoker')
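
As a rough illustration of what one simulated difference involves, the sketch below shuffles the labels while leaving the values and the group sizes fixed, then recomputes the difference of means. It uses only the standard library; `one_shuffled_difference` and the toy data are hypothetical and stand in for the table-based function above.

```python
import random
from statistics import mean

def one_shuffled_difference(values, labels, rng):
    """Shuffle the labels, then return mean(True group) - mean(False group)."""
    shuffled = rng.sample(labels, k=len(labels))  # a permutation of the labels
    true_values = [v for v, flag in zip(values, shuffled) if flag]
    false_values = [v for v, flag in zip(values, shuffled) if not flag]
    return mean(true_values) - mean(false_values)

# Toy data, made up for illustration
toy_weights = [120, 130, 125, 110, 100, 105]
toy_smoker = [False, False, False, True, True, True]

rng = random.Random(1)  # seeded only so the sketch is repeatable
simulated = [one_shuffled_difference(toy_weights, toy_smoker, rng) for _ in range(5)]
print(simulated)
```

Because the labels are reassigned at random, each simulated value shows what the statistic could look like if the label had no connection to the measurement.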

**Question 10.** Simulate the statistic 2500 times and store the results in an array.

In [None]:
differences = make_array()

for i in np.arange(2500):
    new_difference = one_simulated_difference(births, 'Birth Weight', 'Maternal Smoker')
    differences = np.append(differences, new_difference)

In [None]:
Table().with_column('Difference Between Group Means', differences).hist(bins=np.arange(-10, 4, 0.5))
print('Observed Difference:', observed_difference)
plt.title('Prediction Under the Null Hypothesis');
plt.scatter(observed_difference, 0.01, color='red', s=40);

### What About Gestational Days?

In [None]:
observed_difference = difference_of_means(births, 'Gestational Days', 'Maternal Smoker')

differences = make_array()

for i in np.arange(2500):
    new_difference = one_simulated_difference(births, 'Gestational Days', 'Maternal Smoker')
    differences = np.append(differences, new_difference)

In [None]:
Table().with_column('Difference Between Group Means', differences).hist(bins=np.arange(-4, 4, 0.5))
print('Observed Difference:', observed_difference)
plt.title('Prediction Under the Null Hypothesis');
plt.scatter(observed_difference, 0.01, color='red', s=40);

In [None]:
# Proportion of simulated differences at or below the observed one
np.count_nonzero(differences <= observed_difference) / len(differences)
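
The final cell computes the proportion of simulated differences that are at or below the observed difference, which is the empirical p-value for the left tail. Here is a minimal sketch of that calculation on made-up numbers; `empirical_p_value` and the toy values are hypothetical, and the direction of the comparison simply matches the `<=` used in the cell above.

```python
def empirical_p_value(simulated, observed):
    """Proportion of simulated statistics at or below the observed one (left tail)."""
    count = sum(1 for d in simulated if d <= observed)
    return count / len(simulated)

# Toy values, made up for illustration
toy_simulated = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
toy_observed = -1.0
print(empirical_p_value(toy_simulated, toy_observed))
```

A small proportion means that few shuffles produced a difference as extreme as the observed one, which counts as evidence against the null hypothesis.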