# Lesson 11: Comparing Distributions

Welcome to Lesson 11! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on a question, so post to the discussion board or ask your instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** just copy/paste someone else's code, but rather work together to understand the task you need to complete.

To receive credit for this assignment, answer all questions correctly and submit before the deadline.

**Due Date:** 

**Collaboration Policy:** Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others **please include their names below** (it's a good way to learn your classmates' names).

**Collaborators:** 

List collaborators here.

## Today's Lesson

In today's lab, you'll learn about:

- comparing distributions.

Let's get started!

## Words of Caution

Remember to run the cell below. It's for setting up the environment so you can have access to what's needed for this lesson. For now, don't worry about what it means: we'll learn more about what's inside of it in the next few lessons.

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [None]:
for i in np.arange(5):
    print(i)

## Mendel and Pea Flowers

Mendel had 929 plants, of which 709 had purple flowers ([LibreTexts](https://bio.libretexts.org/Bookshelves/Introductory_and_General_Biology/Book%3A_Introductory_Biology_(CK-12)/03%3A_Genetics/3.01%3A_Mendel's_Pea_Plants)).

In [None]:
observed_purples = 709/929
observed_purples

Under Mendel's model, 75% of the plants have purple flowers and 25% do not. Store these predicted proportions in an array.

In [None]:
predicted_proportions = make_array(.75,.25)

Draw one random sample of 929 plants with the `sample_proportions` function and express the proportion of purple-flowering plants as a percent.

In [None]:
sample_proportions(929, predicted_proportions).item(0)*100
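To make the sampling step less of a black box, here is a minimal sketch that uses only NumPy. It assumes that `sample_proportions` behaves like a multinomial draw divided by the sample size; the helper name `sample_proportions_sketch` is hypothetical and is not part of the `datascience` library.

```python
import numpy as np

def sample_proportions_sketch(sample_size, probabilities, rng):
    """Hypothetical stand-in: draw category counts, then turn them into proportions."""
    counts = rng.multinomial(sample_size, probabilities)
    return counts / sample_size

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
proportions = sample_proportions_sketch(929, [0.75, 0.25], rng)
percent_purple = proportions.item(0) * 100  # the first category is purple
print(proportions, percent_purple)
```

The two proportions always add up to 1, and the percent of purple flowers should land near 75 when the model is true.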

A function that returns the percent of purple-flowering plants in one random sample of 929 plants, drawn under the assumption that our model is true.

In [None]:
def purple_flowers():
    return sample_proportions(929, predicted_proportions).item(0)*100

In [None]:
purple_flowers()

In [None]:
# Array to store our simulated percents of purple flowers
purples = make_array()

# Number of trials
trials = 1000

# Loop to run our simulation and collect the percent of purple
# flowering plants sampled under the assumption of our model
for i in np.arange(trials):
    new_percent_of_purple_flowers = purple_flowers()
    purples = np.append(purples, new_percent_of_purple_flowers)
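The loop above grows the array one value at a time. As an alternative sketch, all trials can be drawn in a single NumPy call. This assumes the same multinomial sampling model as the loop; the name `purples_vectorized` is illustrative only and does not appear elsewhere in the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
trials = 1000
sample_size = 929

# Each row holds the (purple, not purple) counts for one simulated sample.
counts = rng.multinomial(sample_size, [0.75, 0.25], size=trials)

# Convert the purple counts to percents, one per trial.
purples_vectorized = counts[:, 0] / sample_size * 100
print(len(purples_vectorized), purples_vectorized.mean())
```

Both approaches give an array of 1000 simulated percents. The loop is easier to read step by step, while the single call avoids copying the array on every iteration.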

In [None]:
purples

In [None]:
Table.interactive_plots()
Table().with_column('Percent of purple flowers in sample of 929', purples).hist()

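The statistic for each simulated sample: the absolute distance between its percent of purple flowers and the 75% predicted by the model.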

In [None]:
statistics = abs(purples-75)

In [None]:
Table.interactive_plots()
Table().with_column('Discrepancy in sample of 929 if the model is true', statistics).hist()

In [None]:
abs(observed_purples*100-75)

## Alameda County Jury Panels

In 2010, the American Civil Liberties Union (ACLU) of Northern California presented a [report](https://www.aclunc.org/sites/default/files/racial_and_ethnic_disparities_in_alameda_county_jury_pools.pdf) on jury selection in Alameda County, California. The report concluded that certain racial and ethnic groups are underrepresented among jury panelists in Alameda County, and suggested some reforms of the process by which eligible jurors are assigned to panels. In this section, we will analyze the data provided by the ACLU. ([Computational and Inferential Thinking: The Foundations of Data Science](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html?highlight=alameda))

In [None]:
jury = Table().with_columns(
    'Ethnicity', make_array('Asian', 'Black', 'Latino', 'White', 'Other'),
    'Eligible', make_array(0.15, 0.18, 0.12, 0.54, 0.01),
    'Panels', make_array(0.26, 0.08, 0.08, 0.54, 0.04)
)

jury

**Question 1.** Make a bar chart to visualize the distribution.

In [None]:
Table.static_plots()
jury.barh('Ethnicity')

**Question 2.** Under the model, make an array that is the true distribution of people from which the jurors are randomly sampled.

In [None]:
model = make_array(0.15, 0.18, 0.12, 0.54, 0.01)
model

**Question 3.** Simulate a random draw of 1423 jurors from this distribution.

In [None]:
simulated = sample_proportions(1423, model)
simulated

## Distance Between Distributions

In the Mendel pea flowers experiment, our statistic was the distance between the observed percent of purple flowers and its expected value of 75%. With only two categories (purple 75%, white 25%), that one number describes the whole discrepancy. In this case, we need to understand how each of the 5 categories differs from its expected value according to the model.

In [None]:
diffs = jury.column('Panels') - jury.column('Eligible')
jury_with_difference = jury.with_column('Difference', diffs)
jury_with_difference

## Total Variation Distance

In [None]:
def tvd(dist1, dist2):
    return sum(abs(dist1-dist2))/2

The TVD between the observed distribution (Panels) and the distribution expected if the model is true (Eligible). ([Computational and Inferential Thinking: The Foundations of Data Science](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html?highlight=total%20variation%20distance#a-new-statistic-the-distance-between-two-distributions))

In [None]:
obsvd_tvd = tvd(jury.column('Panels'), jury.column('Eligible'))
obsvd_tvd
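To see where the observed TVD comes from, the calculation can be traced by hand. The sketch below uses plain Python lists holding the same proportions as the `jury` table; the variable names are illustrative only. Because both distributions add up to 1, the positive differences and the negative differences cancel, which is why the sum of absolute differences is divided by 2.

```python
eligible = [0.15, 0.18, 0.12, 0.54, 0.01]  # proportions expected under the model
panels = [0.26, 0.08, 0.08, 0.54, 0.04]    # proportions observed on the panels

# Difference for each category: observed minus expected.
differences = [p - e for p, e in zip(panels, eligible)]

positive_total = sum(d for d in differences if d > 0)
negative_total = sum(d for d in differences if d < 0)

# Total variation distance: half the sum of the absolute differences.
tvd_by_hand = sum(abs(d) for d in differences) / 2
print(round(positive_total, 2), round(negative_total, 2), round(tvd_by_hand, 2))
```

The positive differences total 0.14, the negative differences total -0.14, and the TVD is 0.14.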

The TVD between one simulated sample drawn from the model and the model's expected proportions.

In [None]:
tvd(sample_proportions(1423, model), jury.column('Eligible'))

**Question 4.** Write a function to find the simulated TVD.

In [None]:
def simulated_tvd():
    return tvd(sample_proportions(1423, model), model)

**Question 5.** Run a simulation of 10000 trials. Plot the results and the observed statistic in a histogram.

In [None]:
tvds = make_array()
num_simulations = 10000

for i in np.arange(num_simulations):
    new_tvd = simulated_tvd()
    tvds = np.append(tvds, new_tvd)

In [None]:
title = 'Simulated TVDs (if model is true)'
bins = np.arange(0, .2, .005)

Table.static_plots()
Table().with_column(title, tvds).hist(bins=bins)
plt.plot(obsvd_tvd, -0.01, 'ro', markersize=10);
print('Observed TVD: ' + str(obsvd_tvd))
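One way to put a number on what the histogram shows is to count how many simulated TVDs are at least as large as the observed one. The sketch below is NumPy-only and assumes that sampling from the model amounts to a multinomial draw of 1423 jurors; it uses 2000 trials to keep it quick, and names such as `share_at_least_observed` are illustrative only.

```python
import numpy as np

model = np.array([0.15, 0.18, 0.12, 0.54, 0.01])   # Eligible proportions
panels = np.array([0.26, 0.08, 0.08, 0.54, 0.04])  # Panels proportions
observed_tvd = np.abs(panels - model).sum() / 2

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Each row holds the category counts for one simulated panel of 1423 jurors.
counts = rng.multinomial(1423, model, size=2000)

# TVD between each simulated panel and the model, one value per row.
simulated_tvds = np.abs(counts / 1423 - model).sum(axis=1) / 2

# Proportion of simulated TVDs that reach the observed TVD.
share_at_least_observed = np.mean(simulated_tvds >= observed_tvd)
print(observed_tvd, simulated_tvds.max(), share_at_least_observed)
```

If the observed TVD sits far to the right of every simulated value, as the red dot in the histogram suggests, this share is 0, meaning random sampling from the eligible population does not explain the panels.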