In [None]:
import numpy as np
from datascience import *

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
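# Fix for datascience plots: restore the collections.Iterable alias,
# which newer Python versions removed in favor of collections.abc.Iterable
import collections
import collections.abc
collections.Iterable = collections.abc.Iterable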

# Python Dictionaries

It is time to learn another data structure built into Python: dictionaries. With lists and arrays, you accessed elements by index number.

In [None]:
my_list = ['Temple University', 2023, np.pi]
my_list

In [None]:
# Accessing the second element in my_list
my_list[1]

In [None]:
my_array = make_array(1, 2, 3, 5, 7, 11, 13)
my_array

In [None]:
# Accessing the second element in my_array
my_array[1]

By contrast, a dictionary stores any number of {key: value, key: value} pairs. You access a value in a dictionary by using its key. Dictionaries are defined using curly brackets. Notice that the values of a dictionary can be other data structures, such as a list.

In [None]:
cst = {
    "name": "College of Science and Technology",
    "departments": ["Biology", "Chemistry", "CIS", "Math", "EES", "Physics"],
    "dean": "Michael Klein",
}
cst

In [None]:
# Access a dictionary element by key
print(cst['dean'])
print(cst['departments'])

In [None]:
cst.keys()

In [None]:
cst.values()

Each key in a dictionary must be unique: no duplicates are allowed. You can change a dictionary value simply by assigning a new value to its key.

In [None]:
cst['dean'] = 'Miguel Mustafa'
cst
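
A few more dictionary behaviors are worth seeing side by side. The sketch below uses a hypothetical `course` dictionary (not part of this notebook's data) to show that assigning to an existing key overwrites its value, assigning to a new key adds a pair, and `in` and `.get` let you check for a key without risking a `KeyError`.

```python
# Hypothetical example dictionary, used only for illustration
course = {"title": "Data Science", "credits": 4}

# Assigning to an existing key replaces the old value
course["credits"] = 3

# Assigning to a new key adds a key-value pair
course["semester"] = "Fall"

# Membership tests look at keys, not values
has_title = "title" in course
has_room = "room" in course

# .get returns a default instead of raising KeyError for a missing key
room = course.get("room", "TBA")

# Repeating a key in a dictionary literal keeps only the last value
duplicate = {"a": 1, "a": 2}

print(course)
print(has_title, has_room, room, duplicate)
```

The last line of the sketch is why keys are described as unique: a second value for the same key replaces the first rather than being stored alongside it.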

## Hypothesis A/B Testing

The following example comes from your textbook. See the section on [A/B testing](https://inferentialthinking.com/chapters/12/1/AB_Testing.html).

"The table births contains the following variables for 1,174 mother-baby pairs: the baby’s birth weight in ounces, the number of gestational days, the mother’s age in completed years, the mother’s height in inches, pregnancy weight in pounds, and whether or not the mother smoked during pregnancy."

In [None]:
births = Table.read_table('data/baby.csv')
births

"One of the aims of the study was to see whether maternal smoking was associated with birth weight. Let’s see what we can say about the two variables."

In [None]:
smoking_and_birthweight = births.select('Maternal Smoker', 'Birth Weight')
smoking_and_birthweight.group('Maternal Smoker')

In [None]:
smoking_and_birthweight.hist('Birth Weight', group = 'Maternal Smoker')

"This raises the question of whether the difference reflects just chance variation or a difference in the distributions in the larger population. Could it be that there is no difference between the two distributions in the population, but we are seeing a difference in the samples just because of the mothers who happened to be selected?"

**Null hypothesis:** In the population, the distribution of birth weights of babies is the same for mothers who don’t smoke as for mothers who do. The difference in the sample is due to chance.

**Alternative hypothesis:** In the population, the babies of the mothers who smoke have a lower birth weight, on average, than the babies of the non-smokers.



In [None]:
smoking_and_birthweight.show(3)

In [None]:
means_table = smoking_and_birthweight.group('Maternal Smoker', np.average)
means_table

In [None]:
means = means_table.column(1)
observed_difference = means.item(1) - means.item(0)
observed_difference

In [None]:
def difference_of_means(table, group_label):
    """Takes: a table with a 'Birth Weight' column,
    and the label of the column that indicates the group to which each row belongs
    Returns: Difference of mean birth weights of the two groups"""
    reduced = table.select('Birth Weight', group_label)
    means_table = reduced.group(group_label, np.average)
    means = means_table.column(1)
    return means.item(1) - means.item(0)

In [None]:
# Test the function
difference_of_means(births, 'Maternal Smoker')
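
The sign of the statistic depends on which group mean is subtracted from which. As a rough sketch of the same calculation in plain Python, assuming the groups come out in the order `False` then `True` (so the result is the smokers' mean minus the non-smokers' mean), the helper below works on two parallel lists. The function name and the data are made up for illustration.

```python
from statistics import mean

def difference_of_group_means(values, labels):
    """Mean of the values labeled True minus mean of the values labeled False."""
    true_values = [v for v, lab in zip(values, labels) if lab]
    false_values = [v for v, lab in zip(values, labels) if not lab]
    return mean(true_values) - mean(false_values)

# Made-up birth weights (ounces) and smoker labels, for illustration only
weights = [120, 110, 130, 100, 125, 105]
smoker = [False, True, False, True, False, True]

diff = difference_of_group_means(weights, smoker)
print(diff)
```

A negative result means the group labeled `True` has the lower average, which is the direction the alternative hypothesis predicts.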

### Predicting the Statistic Under the Null Hypothesis
"To see how the statistic should vary under the null hypothesis, we have to figure out how to simulate the statistic under that hypothesis. A clever method based on random permutations does just that.

If there were no difference between the two distributions in the underlying population, then whether a birth weight has the label True or False with respect to maternal smoking should make no difference to the average. The idea, then, is to shuffle all the labels randomly among the mothers. This is called random permutation.

Shuffling ensures that the count of True labels does not change, and nor does the count of False labels. This is important for the comparability of the simulated differences of means and the original difference of means. We will see later in the course that the sample size affects the variability of a sample mean.

Take the difference of the two new group means: the mean weight of the babies whose mothers have been randomly labeled smokers and the mean weight of the babies of the remaining mothers who have all been randomly labeled non-smokers. This is a simulated value of the test statistic under the null hypothesis."
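
To make the claim about label counts concrete, here is a minimal sketch using only the standard library and made-up labels. It permutes a list of labels and checks that the number of `True` labels is unchanged; `random.sample` with `k` equal to the list length plays the role of sampling without replacement.

```python
import random

random.seed(0)  # fixed seed, only so the sketch is repeatable

# Made-up labels: 3 True (smoker) and 5 False (non-smoker)
labels = [True, True, True, False, False, False, False, False]

# Drawing every label exactly once, without replacement,
# produces a random permutation of the original list
shuffled = random.sample(labels, k=len(labels))

print(shuffled)
print(sum(labels), sum(shuffled))  # the counts of True labels match
```

Sampling with replacement would not have this property, because some labels could be drawn more than once and others not at all.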

In [None]:
# How does shuffling work?
quote = "Light travels faster than sound. This is why some people appear bright until you hear them speak."
quote_table = Table().with_column("Quote Words", quote.split())
quote_table

In [None]:
quote_table.column("Quote Words")

In [None]:
shuffled_words = quote_table.sample(with_replacement = False).column("Quote Words")
shuffled_words

In [None]:
shuffled_labels = smoking_and_birthweight.sample(with_replacement = False).column(0)
original_and_shuffled = smoking_and_birthweight.with_column('Shuffled Label', shuffled_labels)

original_and_shuffled

"Each baby’s mother now has a random smoker/non-smoker label in the column Shuffled Label, while her original label is in Maternal Smoker. If the null hypothesis is true, all the random re-arrangements of the labels should be equally likely.

Let’s see how different the average weights are in the two randomly labeled groups."

In [None]:
shuffled_only = original_and_shuffled.select('Birth Weight','Shuffled Label')
shuffled_group_means = shuffled_only.group('Shuffled Label', np.average)
shuffled_group_means

In [None]:
difference_of_means(original_and_shuffled, 'Shuffled Label')

"But could a different shuffle have resulted in a larger difference between the group averages? To get a sense of the variability, we must simulate the difference many times.

As always, we will start by defining a function that simulates one value of the test statistic under the null hypothesis. This is just a matter of collecting the code that we wrote above.

The function is called one_simulated_difference_of_means. It takes no arguments, and returns the difference between the mean birth weights of two groups formed by randomly shuffling all the labels."

In [None]:
def one_simulated_difference_of_means():
    """Returns: Difference between mean birthweights
    of babies of smokers and non-smokers after shuffling labels"""
    
    # array of shuffled labels
    shuffled_labels = births.sample(with_replacement=False).column('Maternal Smoker')
    
    # table of birth weights and shuffled labels
    shuffled_table = births.select('Birth Weight').with_column(
        'Shuffled Label', shuffled_labels)
    
    return difference_of_means(shuffled_table, 'Shuffled Label')   

In [None]:
one_simulated_difference_of_means()

"Tests based on random permutations of the data are called permutation tests. We are performing one in this example. In the cell below, we will simulate our test statistic – the difference between the average birth weight of the two randomly formed groups – many times and collect the differences in an array."

In [None]:
differences = make_array()

repetitions = 5000
for i in np.arange(repetitions):
    new_difference = one_simulated_difference_of_means()
    differences = np.append(differences, new_difference)    
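
The same simulation loop can be sketched without the `datascience` library. The version below is an illustration under assumptions, not the textbook's code: the data are made up, the helper names are hypothetical, and results are collected in a Python list rather than an array.

```python
import random
from statistics import mean

def diff_of_means(values, labels):
    """Mean of values labeled True minus mean of values labeled False."""
    group_true = [v for v, lab in zip(values, labels) if lab]
    group_false = [v for v, lab in zip(values, labels) if not lab]
    return mean(group_true) - mean(group_false)

def simulate_differences(values, labels, repetitions, rng):
    """Shuffle the labels repeatedly and record the statistic each time."""
    differences = []
    for _ in range(repetitions):
        shuffled = rng.sample(labels, k=len(labels))  # random permutation
        differences.append(diff_of_means(values, shuffled))
    return differences

rng = random.Random(42)  # seeded generator, only for repeatability

# Made-up birth weights (ounces) and smoker labels
values = [120, 113, 128, 108, 136, 138, 132, 120, 143, 140]
labels = [False, False, True, True, False, False, False, True, False, True]

simulated = simulate_differences(values, labels, 1000, rng)
print(len(simulated), mean(simulated))
```

Appending to a list avoids the copy that `np.append` makes on every iteration. For 5,000 repetitions either approach is fast enough.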

In [None]:
Table().with_column('Difference Between Group Means', differences).hist(bins=20)
print('Observed Difference:', observed_difference)
ax = plt.gca()
ax.set_xlim((-10, 10))
ax.plot(observed_difference, 0,  marker='^', markersize=40, mec='red')
ax.set_title('Prediction Under the Null Hypothesis');

Notice how the distribution is centered roughly around 0. This makes sense, because under the null hypothesis the two groups should have roughly the same average. Therefore the difference between the group averages should be around 0.

The observed difference in the original sample is about -9 ounces, which lies far out in the left tail, well away from the bulk of the simulated differences in the histogram. The observed value of the statistic and the predicted behavior of the statistic under the null hypothesis are inconsistent.

The conclusion of the test is that the data favor the alternative over the null. It supports the hypothesis that the average birth weight of babies born to mothers who smoke is less than the average birth weight of babies born to non-smokers.

If you want to compute an empirical p-value, remember that low values of the statistic favor the alternative hypothesis.

In [None]:
empirical_p = np.count_nonzero(differences <= observed_difference) / repetitions
empirical_p
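
As a small worked example of the same calculation, the sketch below uses ten made-up simulated differences and a made-up observed value. The empirical p-value is the proportion of simulated values that are as small as or smaller than the observed one, since small values favor the alternative.

```python
# Made-up simulated differences and observed value, for illustration only
simulated = [-3.0, -1.5, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, -2.0]
observed = -2.0

# Count the simulated values at least as extreme as the observed one
# in the direction of the alternative (here: as small or smaller)
at_least_as_small = sum(1 for d in simulated if d <= observed)
empirical_p = at_least_as_small / len(simulated)

print(empirical_p)
```

With 5,000 repetitions, a result of exactly 0 means that none of the simulated differences were as small as the observed one, so the p-value is best read as "less than 1 in 5,000" rather than as literally zero.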

**NOTE: Even though the difference in weight is only 9 ounces, we had a lot of data:**

In [None]:
smoking_and_birthweight.num_rows

## What if we didn't have as much data?
Let's repeat the whole process, but with data for only 100 mothers.

In [None]:
small_data = smoking_and_birthweight.take(np.arange(100))
small_data

In [None]:
means_table = small_data.group('Maternal Smoker', np.average)
means_table

In [None]:
observed_difference = difference_of_means(small_data, 'Maternal Smoker')
observed_difference

In [None]:
def one_simulated_difference_of_means():
    """Returns: Difference between mean birthweights
    of babies of smokers and non-smokers after shuffling labels"""
    
    # array of shuffled labels
    shuffled_labels = small_data.sample(with_replacement=False).column('Maternal Smoker')
    
    # table of birth weights and shuffled labels
    shuffled_table = small_data.select('Birth Weight').with_column(
        'Shuffled Label', shuffled_labels)
    
    return difference_of_means(shuffled_table, 'Shuffled Label')  

In [None]:
differences = make_array()

repetitions = 5000
for i in np.arange(repetitions):
    new_difference = one_simulated_difference_of_means()
    differences = np.append(differences, new_difference)  

In [None]:
Table().with_column('Difference Between Group Means', differences).hist(bins=20)
print('Observed Difference:', observed_difference)
ax = plt.gca()
ax.set_xlim((-10, 10))
ax.plot(observed_difference, 0,  marker='^', markersize=40, mec='red')
ax.set_title('Prediction Under the Null Hypothesis');

In [None]:
empirical_p = np.count_nonzero(differences <= observed_difference) / repetitions
empirical_p

Notice how much broader the distribution of our simulated weight differences is with the smaller sample!

The result is still significant at the 5% level (95% confidence), though not at the 1% level (99% confidence).
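
The broadening can be reproduced on synthetic data. The sketch below is an illustration under assumptions: the weights are drawn from a made-up normal distribution (mean 120, standard deviation 18), half of the rows are labeled `True`, and the hypothetical helper `null_spread` reports the standard deviation of the shuffled-label differences for a given sample size.

```python
import random
from statistics import stdev

def null_spread(n, repetitions, rng):
    """Standard deviation of shuffled-label differences for a sample of size n."""
    values = [rng.gauss(120, 18) for _ in range(n)]  # made-up weights
    labels = [i < n // 2 for i in range(n)]          # half True, half False
    diffs = []
    for _ in range(repetitions):
        shuffled = rng.sample(labels, k=n)  # random permutation of the labels
        group_true = [v for v, lab in zip(values, shuffled) if lab]
        group_false = [v for v, lab in zip(values, shuffled) if not lab]
        diffs.append(sum(group_true) / len(group_true)
                     - sum(group_false) / len(group_false))
    return stdev(diffs)

rng = random.Random(1)  # seeded generator, only for repeatability
spread_small = null_spread(100, 500, rng)
spread_large = null_spread(1000, 500, rng)

print(spread_small, spread_large)
```

The smaller sample gives a visibly larger spread, so a given observed difference sits closer to the middle of the null distribution and produces a larger p-value.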