# TCM 21

Run the cell below to import packages and set plotting options.

In [1]:
from datascience import *

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import otter
grader = otter.Notebook()

plt.style.use('fivethirtyeight')
%matplotlib inline

## 1. Confidence Intervals

### 1.1. Employment Data from the City of Durham

The following code will read in a table that contains a random sample of 200 workers from the City of Durham. 

**Example 1.** Run the cell below to load the file `durham_city_employees.csv` as a table.

In [2]:
sample_employees = Table.read_table('durham_city_employees.csv')
sample_employees

**Example 2.** Run the cell below to generate a histogram that shows the distribution of salaries in the sample.

In [3]:
sample_employees.hist('SALARY', unit = "Dollars")

The `percentile()` function returns the *p*th percentile of the input array (the value that is at least as great as *p*$\%$ of the values in the array).
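
To make the definition concrete, here is a minimal sketch of the rule on a small made-up array. The helper name `percentile_sketch` and the values are illustrative assumptions, not the library's implementation: it sorts the values and returns the smallest one that is at least as great as *p*% of them.

```python
import math

def percentile_sketch(p, values):
    """Return the smallest element that is >= p% of the elements (0 < p <= 100)."""
    ordered = sorted(values)
    # Position of the p-th percentile, counting from 1 and rounding up.
    position = math.ceil(p * len(ordered) / 100)
    return ordered[position - 1]

scores = [9, 1, 7, 3, 5]  # hypothetical data
print(percentile_sketch(50, scores))   # 5
print(percentile_sketch(25, scores))   # 3
print(percentile_sketch(100, scores))  # 9
```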

**Example 3.** Use the `percentile()` function to find the salary that is greater than or equal to 50% of the employees in the sample.

In [4]:
percentile(50, sample_employees.column('SALARY'))

The `median` function from the `numpy` package will return the median of an array of numbers.

**Example 4.** Use the `np.median()` function to find the median salary of the employees in the sample.

In [5]:
np.median(sample_employees.column('SALARY'))

<!-- BEGIN QUESTION -->

**Question 1.** Why do you think the value in **Example 3.** is different from the value in **Example 4.**?

<!--
BEGIN QUESTION
name: q1
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### 1.2. The Goal

We'd like to know the median salary for **all** workers from the City of Durham, not just for our sample of 200 workers. The sample median may vary depending on which 200 workers are selected. We will resample from our sample to measure how variable the median might be, and then build an interval that we expect to contain the true median of the population.

### 1.2.1. Resampling

**Example 5.** By calling the table's `.sample()` method with `with_replacement = True`, we can resample from our sample; the `np.median()` function then gives the median of that resample. Run the cell below a few times, then answer **Question 2.**.

In [6]:
np.median(sample_employees.sample(with_replacement = True).column('SALARY'))
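
To see what one resample looks like, here is a minimal sketch on a small made-up array, using NumPy directly rather than the `sample_employees` table. The salaries and the seed are assumptions for illustration; `rng.choice` with `replace=True` plays the role of `.sample(with_replacement = True)`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
salaries = np.array([31000, 42000, 45000, 52000, 58000, 64000, 90000])  # hypothetical

# One resample: same size as the original, drawn with replacement,
# so some salaries can appear more than once and others not at all.
resample = rng.choice(salaries, size=len(salaries), replace=True)

print(np.sort(resample))
print(np.median(resample))
```

Changing the seed gives a different resample and, usually, a different median, which is the variability the rest of this section measures.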

**Question 2.** Create an array named `medians` that contains the medians of 1000 resamples (drawn with `with_replacement = True`) of the `sample_employees` table.

<!--
BEGIN QUESTION
name: q2
manual: false
-->

In [7]:
medians = make_array()
repetitions = ...

for _ in np.arange(...):
    medians = np.append(medians, np.median(sample_employees.sample(with_replacement = True).column('SALARY')))

In [None]:
grader.check("q2")

**Example 6.** Run the cell below to plot the empirical distribution of the medians of the 1000 resamples of our sample.

In [9]:
Table().with_column('Medians', medians).hist()

**Question 3.** What is the interval that contains 95% of the medians?

<!--
BEGIN QUESTION
name: q3
manual: false
-->

In [10]:
lower_bound = percentile(..., medians)
upper_bound = percentile(..., medians)
print('The 95% confidence interval starts at', lower_bound, 'and goes to', upper_bound)

In [None]:
grader.check("q3")

**Example 7.** Let's visualize our empirical distribution using a histogram with the 95% confidence interval overlaid.

In [13]:
Table().with_column('Resampled Medians', medians).hist()
plt.plot([lower_bound, upper_bound], [0, 0], color = 'gold', lw = 15);
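
The whole procedure can be sketched end to end as follows. This is an illustration under stated assumptions: the data are simulated, the helper name `bootstrap_median_interval` is hypothetical, and it uses NumPy's `np.percentile`, whose default interpolation can differ slightly from the `percentile()` function used above. The example asks for a 90% interval to show how the tail percentiles follow from the confidence level.

```python
import numpy as np

def bootstrap_median_interval(values, level=90, repetitions=1000, seed=0):
    """Percentile-method bootstrap interval for the median (illustrative)."""
    rng = np.random.default_rng(seed)
    resampled_medians = np.empty(repetitions)
    for i in range(repetitions):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choice(values, size=len(values), replace=True)
        resampled_medians[i] = np.median(resample)
    tail = (100 - level) / 2  # percent left out on each side
    lower, upper = np.percentile(resampled_medians, [tail, 100 - tail])
    return lower, upper

# Simulated right-skewed "salaries" (made up, not the Durham data).
rng = np.random.default_rng(1)
simulated = rng.lognormal(mean=10.8, sigma=0.4, size=200)

lower, upper = bootstrap_median_interval(simulated, level=90)
print(lower, upper)
```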

## 2. A/B Testing

The following code will read in a table that contains the birth weights of babies born to smoking and non-smoking mothers.

**Example 8.** Run the cell below to load the file `birth_weights.csv` as a table.

In [14]:
birth_weights = Table.read_table('birth_weights.csv')
birth_weights

<!-- BEGIN QUESTION -->

### 2.1. The Goal

We want to determine whether there is an association between smoking and the birth weight of a baby.

**Question 4.** Do you think the birth weight of the baby was affected by whether or not the mother smoked?

<!--
BEGIN QUESTION
name: q4
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 5.** Make a table named `smoking_and_birthweight` that contains `Maternal Smoker` and `Birth Weight` columns from the `birth_weights` table.

<!--
BEGIN QUESTION
name: q5
manual: true
-->

In [15]:
smoking_and_birthweight = birth_weights.select('...', '...')
smoking_and_birthweight

<!-- END QUESTION -->

**Question 6.** How many smokers and non-smokers are in the study?

<!--
BEGIN QUESTION
name: q6
manual: false
-->

In [16]:
maternal_smoker = smoking_and_birthweight.where('...', True).num_rows
maternal_nonsmoker = smoking_and_birthweight.num_rows - maternal_smoker
print('There are', maternal_smoker, 'smokers and', maternal_nonsmoker, 'non-smokers in the study.')

In [None]:
grader.check("q6")

**Example 9.** Make a histogram of the birth weights of the babies of the smokers.

In [19]:
smoking_and_birthweight.where('Maternal Smoker', True).hist('Birth Weight')

**Example 10.** Make a histogram of the birth weights of the babies of the non-smokers.

In [20]:
smoking_and_birthweight.where('Maternal Smoker', False).hist('Birth Weight')

**Example 11.** Let's overlay the histograms from **Example 9.** and **Example 10.**.

In [21]:
import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.filterwarnings("ignore", message = "Creating an ndarray from ragged")

smoking_and_birthweight.hist('Birth Weight', group = 'Maternal Smoker')

<!-- BEGIN QUESTION -->

**Question 7.** What is the average birth weight for the smokers and the non-smokers?

<!--
BEGIN QUESTION
name: q7
manual: true
-->

In [22]:
avg_birthweight = smoking_and_birthweight.group('...', np.average)
avg_birthweight

In [None]:
grader.check("q7")

<!-- END QUESTION -->

**Question 8.** What is the difference in average birth weight between the smokers and the non-smokers?

<!--
BEGIN QUESTION
name: q8
manual: false
-->

In [25]:
diff_btween_avg_birthweight = avg_birthweight.column('...').item(0)- avg_birthweight.column('...').item(1)
diff_btween_avg_birthweight

In [None]:
grader.check("q8")

<!-- BEGIN QUESTION -->

### 2.2. A Difference in the Mean 

What is the cause of this difference? Would the difference in our sample be the same for the population? Could the difference be due to chance alone? What would be a good test statistic? 

Let's investigate.

**Question 9.** Write the Null and the Alternative Hypotheses.

<!--
BEGIN QUESTION
name: q9
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



Let's do the following:

    
1. Make an array of shuffled weights.
    
2. Make a table with the shuffled weights assigned to the group labels.
    
3. Make an array of means of the two groups (smoker and non-smoker).
    
4. Calculate the difference between the means of the two groups.
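
The four steps above can be sketched on a small made-up data set. This version uses plain NumPy arrays instead of tables, and the labels, weights, and seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical group labels (True = smoker) and birth weights.
toy_is_smoker = np.array([True, False, False, True, False, True, False, False])
toy_weights = np.array([108, 120, 131, 101, 125, 113, 118, 128])

# 1. Shuffle the weights (a random reordering, nothing added or removed).
shuffled = rng.permutation(toy_weights)

# 2. Pair the shuffled weights with the original labels, and
# 3. compute the mean of each group.
nonsmoker_mean = shuffled[~toy_is_smoker].mean()
smoker_mean = shuffled[toy_is_smoker].mean()

# 4. Difference between the two group means.
shuffled_diff = nonsmoker_mean - smoker_mean
print(shuffled_diff)
```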

<!-- BEGIN QUESTION -->

**Question 10.** Create a table named `weights` that has only one column, containing the weight of each baby in the sample.

<!--
BEGIN QUESTION
name: q10
manual: true
-->

In [27]:
weights = birth_weights.select('...')
weights

<!-- END QUESTION -->



If we use the `.sample` method with `with_replacement = False` and do not specify a sample size, every row is drawn exactly once in a random order, so the values in the column are shuffled.
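
A quick way to convince yourself of this is to compare the two kinds of sampling on a small made-up array. The sketch below uses NumPy's `rng.choice` as a stand-in for the table method (an assumption for illustration): sampling every value without replacement only reorders them, while sampling with replacement usually changes which values are present.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_values = np.array([101, 108, 113, 118, 120, 125])  # hypothetical weights

without = rng.choice(toy_values, size=len(toy_values), replace=False)
with_repl = rng.choice(toy_values, size=len(toy_values), replace=True)

# Without replacement: the same values in a new order.
print(np.array_equal(np.sort(without), np.sort(toy_values)))  # True
# With replacement: values may repeat or go missing.
print(np.sort(with_repl))
```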

**Example 11.** Run the cell below to shuffle the weights.

In [28]:
shuffled_weights = weights.sample(with_replacement = False)
shuffled_weights

<!-- BEGIN QUESTION -->

**Question 11.** Create a table named `shuffled_birthweight_table` that has the column names `Maternal Smoker`, `Shuffled Weight`, and `Original Weight`.

<!--
BEGIN QUESTION
name: q11
manual: true
-->

In [29]:
shuffled_weights = weights.sample(with_replacement = False)
shuffled_birthweight_table = Table().with_columns('...', birth_weights.column('Maternal Smoker'),
                                                  '...', shuffled_weights.column('Birth Weight'),
                                                  '...', weights.column('Birth Weight')
                                                 )
shuffled_birthweight_table

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 12.** Make a table named `birth_weight_means` that contains the averages of the two groups (smoker and non-smoker).

<!--
BEGIN QUESTION
name: q12
manual: true
-->

In [30]:
birth_weight_means = shuffled_birthweight_table.group('...', np.average)
birth_weight_means

<!-- END QUESTION -->



Now let's find the difference in the average birth weight between smokers and non-smokers in the table with the shuffled weights.

**Example 12.** Run the cell below.

In [31]:
sample_diff = birth_weight_means.column(1).item(0)-birth_weight_means.column(1).item(1)
sample_diff

**Example 13.** Run a simulation to make an array of 1000 sample differences, make a histogram, and then plot the observed value (i.e. the value of `diff_btween_avg_birthweight`).

In [32]:
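# Start from the original group labels; only the weights get shuffled.
shuffled_weights_table = Table().with_column('Maternal Smoker', birth_weights.column('Maternal Smoker'))

differences = make_array()
repetitions = 1000

for _ in np.arange(repetitions):
    # Shuffle the weights by sampling every row without replacement.
    shuffled_weights = weights.sample(with_replacement = False).column('Birth Weight')
    # Attach the shuffled weights to the fixed group labels.
    shuffled_weights_table = shuffled_weights_table.with_columns('Shuffled Weight', shuffled_weights)
    # Mean shuffled weight of each group, then the difference between the two means.
    birth_weight_means = shuffled_weights_table.group('Maternal Smoker', np.average).column(1)
    new_diff = birth_weight_means.item(0) - birth_weight_means.item(1)
    differences = np.append(differences, new_diff)

# Histogram of the simulated differences, with the observed difference marked in red.
Table().with_column('Mean Difference', differences).hist()
plt.scatter(diff_btween_avg_birthweight, 0.01, color = 'red', s = 50);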
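
One way to summarize such a plot with a number is the proportion of simulated differences that are at least as large as the observed one. The sketch below is an illustration on simulated stand-in values, not the notebook's results: `toy_differences`, `toy_observed`, and the one-sided direction of the comparison are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the simulated differences and the observed statistic.
toy_differences = rng.normal(loc=0, scale=1.1, size=1000)
toy_observed = 3.0

# Fraction of simulated differences at least as large as the observed one.
proportion = np.count_nonzero(toy_differences >= toy_observed) / len(toy_differences)
print(proportion)
```

A small proportion means the observed difference would be unusual if the group labels made no difference to the weights.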

<!-- BEGIN QUESTION -->

**Question 14.** Use the plot from **Example 13.** to determine whether the observed statistic in the sample supports the null or the alternative hypothesis. 

<!--
BEGIN QUESTION
name: q14
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Or, find the .zip file on the left side of the screen, right-click it, and select **Download**. You'll submit this .zip file for the assignment in Canvas to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export("tcm.ipynb")