In [None]:
import otter
grader = otter.Notebook()

# Lab 7: Crime and Penalty

Welcome to Lab 7!

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

import scipy.stats as stats

# These lines load the tests.
#import otter
#grader = otter.Notebook()


## 1. Matched Pairs Testing

Matched Pairs testing is a form of hypothesis testing that allows you to make comparisons between two measurements$^*$ for the same individuals.

You'll almost never be explicitly asked to perform a matched pairs test. Make sure you can identify situations where the test is appropriate and know how to correctly implement each step.

Most often the two measurements are referred to as "pre" and "post" measurements, but there are instances where the two measurements are taken at the same time$^{**}$.

$ ^*$The two measurements must be of the same type, measured in the same units, etc.

$^{**}$Examples where the so-called "pre" and "post" aren't really pre and post measurements include: asking both people in a married couple the same question, measuring the length of a person's left and right foot, etc.
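
To make the pairing concrete, here is a minimal sketch of how the test statistic for a paired test can be computed. The arrays `pre` and `post` hold made-up numbers for five hypothetical individuals; they are not data from this lab.

```python
import numpy as np

# Hypothetical "pre" and "post" measurements for the same five individuals
# (made-up numbers, same units, listed in the same order).
pre = np.array([12.0, 15.5, 9.0, 20.0, 11.5])
post = np.array([13.0, 15.0, 10.5, 22.0, 11.5])

# Pair each individual's two measurements by subtracting element-wise.
differences = post - pre

# The test statistic for a paired test: the mean of the differences.
mean_difference = np.mean(differences)
print(differences)
print(mean_difference)
```

The subtraction only makes sense because position `i` in both arrays refers to the same individual; if the two arrays were sorted independently, the pairing would be lost.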

**Question 1.1:** The following statements are the unordered steps of a hypothesis test:

1. Choose a test statistic (with paired tests it's the mean of the differences between the two measurements)

2. Perform the analysis, whether it's a traditional hypothesis test or a simulation

3. Find the value of the observed test statistic

4. Find the p-value 

5. Define a null and an alternative model

6. Use the p-value and p-value cutoff to draw a conclusion about the null hypothesis

Make an array called `test_order` that contains the correct order of a hypothesis test, where the first item of the array is the first step of a test and the last item of the array is the last step of a test.

<!--
BEGIN QUESTION
name: q1_1
-->

In [None]:
test_order = ...

In [None]:
grader.check("q1_1")

**Question 1.2:** If the null hypothesis of a matched pairs test is correct, should either the pre or the post be systematically greater than the other?

<!--
BEGIN QUESTION
name: q1_2
-->

*Write your answer here, replacing this text.*

## 2. Murder Rates

Punishment for crime has many [philosophical justifications](http://plato.stanford.edu/entries/punishment/#ThePun).  An important one is that fear of punishment may *deter* people from committing crimes.

In the United States, some jurisdictions execute people who are convicted of particularly serious crimes, such as murder.  This punishment is called the *death penalty* or *capital punishment*.  The death penalty is controversial, and deterrence has been one focal point of the debate.  There are other reasons to support or oppose the death penalty, but in this lab we'll focus on deterrence.

The key question about deterrence is:

> Does instituting a death penalty for murder actually reduce the number of murders?

You might have a strong intuition in one direction, but the evidence turns out to be surprisingly complex.  Different sides have variously argued that the death penalty has no deterrent effect and that each execution prevents 8 murders, all using statistical arguments!  We'll try to come to our own conclusion.

#### The data

The main data source for this lab comes from a [paper](http://cjlf.org/deathpenalty/DezRubShepDeterFinal.pdf) by three researchers, Dezhbakhsh, Rubin, and Shepherd.  The dataset contains rates of various violent crimes for every year 1960-2003 (44 years) in every US state.  The researchers compiled the data from the FBI's Uniform Crime Reports.

Since crimes are committed by people, not states, we need to account for the number of people in each state when we're looking at state-level data.  Murder rates are calculated as follows:

$$\text{murder rate for state X in year Y} = \frac{\text{number of murders in state X in year Y}}{\text{population in state X in year Y}} \times 100{,}000$$

(Murder is rare, so we multiply by 100,000 just to avoid dealing with tiny numbers.)
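
As a quick check of the formula, the sketch below computes a rate for a hypothetical state. The function name `murder_rate` and the counts used are made up for illustration and are not taken from the dataset.

```python
def murder_rate(murders, population):
    """Return the number of murders per 100,000 residents."""
    return murders / population * 100_000

# Hypothetical state: 250 murders among 5,000,000 residents.
example_rate = murder_rate(250, 5_000_000)
print(example_rate)
```

With these made-up counts the rate comes out to roughly 5 murders per 100,000 people, which is much easier to read than the raw proportion 0.00005.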

In [None]:
murder_rates = Table.read_table('crime_rates.csv').select('State', 'Year', 'Population', 'Murder Rate')
murder_rates.set_format("Population", NumberFormatter)

Murder rates vary over time, and different states exhibit different trends. The rates in some states change dramatically from year to year, while others are quite stable. Let's plot a couple, just to see the variety.

**Question 2.1.** Draw a line plot with years on the horizontal axis and murder rates on the vertical axis. Include two lines: one for Alaska murder rates and one for Minnesota murder rates. Create this plot using a single call, `ak_mn.plot('Year')`.

*Hint*: To create two lines, you will need to create the table `ak_mn` with two columns of murder rates, in addition to a column of years. This table will have the following structure:

| Year | Murder rate in Alaska | Murder rate in Minnesota |
|------|-----------------------|--------------------------|
| 1960 | 10.2                  | 1.2                      |
| 1961 | 11.5                  | 1                        |
| 1962 | 4.5                   | 0.9                      |

<center>... (41 rows omitted)</center>

<!--
BEGIN QUESTION
name: q2_1
-->

In [None]:
# The next lines are provided for you.  They create a table
# containing only the Alaska information and one containing
# only the Minnesota information.
ak = murder_rates.where('State', 'Alaska').drop('State', 'Population').relabeled(1, 'Murder rate in Alaska')
mn = murder_rates.where('State', 'Minnesota').drop('State', 'Population').relabeled(1, 'Murder rate in Minnesota')

# Fill in this line to make a table like the one pictured above.
ak_mn = ...
ak_mn

In [None]:
grader.check("q2_1")

**Question 2.2:** Using the table `ak_mn`, draw a line plot that compares the murder rate in Alaska and the murder rate in Minnesota over time.

<!--
BEGIN QUESTION
name: q2_2
-->

In [None]:
# Draw your line plot here
...

Now what about the murder rates of other states?  Say, for example, California and New York? Run the cell below to plot the murder rates of different pairs of states.

In [None]:
# Compare the murder rates of any two states by choosing them from the dropdown menus below

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

def state(state1, state2):
    state1_table = murder_rates.where('State', state1).drop('State', 'Population').relabeled(1, 'Murder rate in {}'.format(state1))
    state2_table = murder_rates.where('State', state2).drop('State', 'Population').relabeled(1, 'Murder rate in {}'.format(state2))
    s1_s2 = state1_table.join('Year', state2_table)
    s1_s2.plot('Year')
    plt.show()

states_array = murder_rates.group('State').column('State')

_ = interact(state,
             state1=widgets.Dropdown(options=list(states_array),value='California'),
             state2=widgets.Dropdown(options=list(states_array),value='New York')
            )

## 3. The Death Penalty

Some US states have the death penalty, and others don't, and laws have changed over time. In addition to changes in murder rates, we will also consider whether the death penalty was in force in each state and each year.

Using this information, we would like to investigate how the presence of the death penalty affects the murder rate of a state.

**Question 3.1.** We want to know whether the death penalty *causes* a change in the murder rate.  Why is it not sufficient to compare murder rates in places and times when the death penalty was in force with places and times when it wasn't?

<!--
BEGIN QUESTION
name: q3_1
-->

*Write your answer here, replacing this text.*

### A Natural Experiment

In order to attempt to investigate the causal relationship between the death penalty and murder rates, we're going to take advantage of a *natural experiment*.  A natural experiment happens when something other than experimental design applies a treatment to one group and not to another (control) group, and we have some hope that the treatment and control groups don't have any other systematic differences.

Our natural experiment is this: in 1972, a Supreme Court decision called *Furman v. Georgia* banned the death penalty throughout the US.  Suddenly, many states went from having the death penalty to not having the death penalty.

As a first step, let's see how murder rates changed before and after the court decision.  We'll define the test as follows:

> **Population:** All the states that had the death penalty before the 1972 abolition.  (There is no control group for the states that already lacked the death penalty in 1972, so we should omit them.)  This includes all US states **except** Alaska, Hawaii, Maine, Michigan, Wisconsin, and Minnesota.

> **Post group:** The states in that population, in 1973 (the year after 1972).

> **Pre group:** The states in that population, in 1971 (the year before 1972).

> **Null hypothesis:** Murder rates in 1971 and 1973 come from the same distribution; there was no significant change in murder rates from 1971 to 1973.

> **Alternative hypothesis:** Murder rates were higher in 1973 than they were in 1971.

Our alternative hypothesis is related to our suspicion that murder rates increase when the death penalty is eliminated.  

**Question 3.2:** Should we use a matched pairs test to test these hypotheses? If yes, what is our "pre" group and what is our "post" group?

<!--
BEGIN QUESTION
name: q3_2
-->

*Write your answer here, replacing this text.*

The `death_penalty` table below describes whether each state allowed the death penalty in 1971.

In [None]:
non_death_penalty_states = make_array('Alaska', 'Hawaii', 'Maine', 'Michigan', 'Wisconsin', 'Minnesota')

def had_death_penalty_in_1971(state):
    """Returns True if the argument is the name of a state that had the death penalty in 1971."""
    # The implementation of this function uses a bit of syntax
    # we haven't seen before.  Just trust that it behaves as its
    # documentation claims.
    return state not in non_death_penalty_states

states = murder_rates.group('State').select('State')
death_penalty = states.with_column('Death Penalty', states.apply(had_death_penalty_in_1971, 0))
death_penalty

**Question 3.3:** Use the `death_penalty` and `murder_rates` tables to find the murder rates in 1971. Create a new table `preban_rates` that contains the same information as `murder_rates` for the year 1971, along with a column `Death Penalty` that contains booleans (`True` or `False`) describing whether each state had the death penalty in 1971.

<!--
BEGIN QUESTION
name: q3_3
-->

In [None]:
# Murder rates in 1971, with a column saying whether each state had the death penalty
preban_rates = ...

preban_rates = preban_rates.sort("State")
preban_rates

In [None]:
grader.check("q3_3")

**Question 3.4:** Create a table `postban_rates` that contains the same information as `preban_rates`, but for 1973 instead of 1971. `postban_rates` should only contain the states found in `preban_rates`.

<!--
BEGIN QUESTION
name: q3_4
-->

In [None]:
postban_rates = ...

postban_rates = postban_rates.sort("State")
postban_rates

In [None]:
grader.check("q3_4")

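**Question 3.5:** Use `preban_rates` and `postban_rates` to compute `change_in_death_rates`, an array of the change in each state's murder rate from 1971 to 1973. The rest of the cell uses it to build a table `change_in_death_rates_table` that contains each state's population in 1973, its change in murder rate, and whether or not that state had the death penalty in 1971.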


<!--
BEGIN QUESTION
name: q3_5
-->

In [None]:
change_in_death_rates = ...

State = preban_rates.column("State")
Death_Penalty = preban_rates.column("Death Penalty")
Pop_in_73 = postban_rates.column("Population")
change_in_death_rates_table = Table().with_columns("State",State, 
                                                   "Death Penalty", Death_Penalty, 
                                                   "Population (1973)", Pop_in_73,
                                                  "Change in Murder Rate", change_in_death_rates)
change_in_death_rates_table

Run the cell below to view the distribution of the change in murder rate from 1971 to 1973, grouped by whether each state had the death penalty in 1971.

In [None]:
change_in_death_rates_table.hist('Change in Murder Rate', group = 'Death Penalty')

**Question 3.6:** The code below *almost* creates a table `change_in_rate_means` that contains the average change in murder rate for the states that had the death penalty and for the states that didn't. It should have two columns: one indicating whether the death penalty was in place, and one that contains the **average** change in murder rate for each group.  Finish the code.

<!--
BEGIN QUESTION
name: q3_6
-->

In [None]:
change_in_rate_means = change_in_death_rates_table.drop("State", "Population (1973)").group("Death Penalty", ...)
change_in_rate_means

In states that had the death penalty in 1971, after the ban went into effect, did the murder rate go up or down?

    a) Up

    b) Down

In [None]:
answer3_6 = "..."

answer3_6

In [None]:
grader.check("q3_6")

**Question 3.7:** We want to figure out if there is a difference between the distribution of murder rates in 1971 and 1973. Specifically, we want to test if murder rates were higher in 1973 than they were in 1971.

The code below runs a matched pairs t-test and reports the $t$ statistic and the $p$-value.  Decide if the $p$-value is small enough to conclude that, on average, the murder rates went up a significant amount after the ban.  In the cell below, write your conclusions, including citing the $p$-value.
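
When reading the output, it helps to know how `stats.ttest_rel` treats its arguments. The sketch below uses made-up arrays `toy_pre` and `toy_post` (not the lab's data) to show that the default $p$-value is two-sided, and that a one-sided alternative can be requested with the `alternative` keyword, which assumes SciPy 1.6 or newer.

```python
import numpy as np
import scipy.stats as stats

# Made-up paired measurements (not the lab's data).
toy_pre = np.array([5.0, 7.2, 3.1, 8.4, 6.0])
toy_post = np.array([5.6, 7.9, 3.0, 9.1, 6.8])

# Default: two-sided alternative (the mean difference is not zero).
two_sided = stats.ttest_rel(toy_post, toy_pre)

# One-sided alternative: the mean of (toy_post - toy_pre) is greater than zero.
# The `alternative` keyword requires SciPy 1.6 or newer.
one_sided = stats.ttest_rel(toy_post, toy_pre, alternative='greater')

print(two_sided.statistic, two_sided.pvalue)
print(one_sided.statistic, one_sided.pvalue)
```

The statistic is based on the first argument minus the second, so swapping the arguments flips the sign of $t$. When $t$ points in the direction of the one-sided alternative, the one-sided $p$-value is half the two-sided one.
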
<!--
BEGIN QUESTION
name: q3_7
-->

In [None]:
pre  = preban_rates.column("Murder Rate")
post = postban_rates.column("Murder Rate")

stats.ttest_rel(pre, post)

*Write your answer here, replacing this text.*

**Question 3.8:** To be fair, we did state that we're only interested in the states that had the death penalty in 1971.

The code below *almost* drops the states that did not have the death penalty in 1971 and re-runs the matched pairs t-test.  Finish the code.

After you finish the code, write your conclusion in the cell below.

<!--
BEGIN QUESTION
name: q3_8
-->

In [None]:
pre = preban_rates.where("...", ...).column("Murder Rate")
post = postban_rates.where("...", ...).column("Murder Rate")

stats.ttest_rel(pre, post)

*Write your answer here, replacing this text.*

## 4. Can We Approach This Using a Simulation?

Yes.  If the murder rate were just as likely to go up as down following the ban in 1972, then roughly half the changes in murder rate should be positive and the rest should be negative.  In other words, we assume that fluctuation up or down is random, with up just as likely as down.

If that's the case, we can simulate this using a coin toss.  
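
The coin-toss model also has an exact counterpart: the number of heads in `n` fair tosses follows a binomial distribution. As a point of comparison for the simulation, and not a replacement for it, the sketch below computes the exact chance of seeing `k` or more heads. The helper `prob_at_least` and the toy values (8 heads in 10 tosses) are hypothetical and chosen only for illustration.

```python
from math import comb

def prob_at_least(k, n):
    """Chance of k or more heads in n tosses of a fair coin."""
    # Count the outcomes with k, k+1, ..., n heads, then divide by
    # the total number of equally likely outcomes, 2**n.
    return sum(comb(n, heads) for heads in range(k, n + 1)) / 2 ** n

# Toy example: chance of at least 8 heads in 10 tosses.
toy_chance = prob_at_least(8, 10)
print(toy_chance)
```

A simulated proportion based on many trials should land close to the exact value, which gives a way to sanity-check a simulation.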

First, in the real data set, how often did the murder rate go up?

**Question 4.1:** Set `observed_positive_change` to the observed number of states where the murder rate rose following the ban of their death penalty.  

In [None]:
observed_positive_change = ...
observed_positive_change

In [None]:
grader.check("q3_8")

There were 44 states that had the death penalty in 1971.  


**Question 4.2:** Write a function that is equivalent to flipping a coin 44 times and counting the number of heads that occur.  There are a number of ways to write such a function.

<!--
BEGIN QUESTION
name: q3_9
-->

In [None]:
def flips():
    ...

In [None]:
# Try your function to make sure it works

flips()

In [None]:
grader.check("q3_9")

In [None]:
grader.check("q3_10")

**Question 4.3:** Simulate 5000 trials of our test and store the test statistics in an array called `differences`.

<!--
BEGIN QUESTION
name: q3_11
-->

In [None]:
# This cell might take a couple seconds to run
differences = make_array()

...
                                                 
differences

In [None]:
grader.check("q3_11")

Run the cell below to view a histogram of your simulated test statistics, plotted together with your observed test statistic.

In [None]:
Table().with_column('Difference Between Group Years', differences).hist()
plt.scatter(observed_positive_change, 0, color='red', s=50, zorder=2);

**Question 4.4:** Find the p-value for your test and assign it to `empirical_P`.
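
As a reminder of what an empirical p-value measures, the sketch below computes one for a tiny made-up set of simulated statistics. It assumes an alternative hypothesis that predicts large values of the statistic; the names `toy_simulated` and `toy_observed` and their values are hypothetical.

```python
import numpy as np

# Ten made-up simulated test statistics and a made-up observed value.
toy_simulated = np.array([20, 22, 25, 19, 30, 24, 28, 21, 23, 26])
toy_observed = 26

# Proportion of simulated statistics at least as large as the observed one.
toy_p_value = np.count_nonzero(toy_simulated >= toy_observed) / len(toy_simulated)
print(toy_p_value)
```

Here three of the ten simulated values (30, 28, and 26) are at least as large as the observed value. A real simulation would use far more than ten trials.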

<!--
BEGIN QUESTION
name: q3_12
-->

In [None]:
empirical_P = ...
empirical_P

In [None]:
grader.check("q3_12")

**Question 4.5:** Using a 5% P-value cutoff, draw a conclusion about the null and alternative hypotheses. Describe your findings using simple, non-technical language. What does your analysis tell you about murder rates after the death penalty was suspended? What can you claim about causation from your statistical analysis?

<!--
BEGIN QUESTION
name: q3_13
-->

*Write your answer here, replacing this text.*

## Further Discussion 

Did banning the death penalty *cause* the murder rates to rise?  Probably not, and that's not the claim being made here.  If we look more closely at the average murder rate for the nation, we see the murder rate started to increase in 1962 and continued to increase until 1974.  So the murder rate started to increase years before the death penalty was banned and continued to increase for several years after the ban.  The implication is that since the rates were already increasing before the ban, it's unlikely that the 1972 ban on the death penalty caused the increase from 1971 to 1973.  

In [None]:
rates = death_penalty.join("State", murder_rates, "State").drop("Death Penalty", "Population")
rates.drop("State").group("Year", np.mean).where("Year", are.between(1960, 1980)).plot("Year")
plt.title("National Average (1960-'80)")
plt.ylabel("Murder Rate");

Were the murder rates trending upward over this time period in *each* state?  In many cases, yes.  We can verify this by looking at state-by-state line plots over the years.  Looking at one plot with every state represented is possible, but it's difficult to read.  Instead, let's just examine five randomly chosen states at a time.

Run the cell below several times.  Each time you run it, you'll see the line plot for five different randomly selected states.  

In [None]:
State_list = rates.select("State").group("State").column("State")
years = rates.select("Year").group("Year").column("Year")

five_random_states = np.random.choice(State_list, 5, replace=False)

rate_table = Table().with_column("Year", years)

for state_i in five_random_states:
    state_rate = murder_rates.where('State', state_i).drop('State', 'Population').relabeled(1, '{}'.format(state_i))
    rate_table = rate_table.join("Year", state_rate, "Year")

rate_table.where("Year", are.between(1960, 1980)).plot("Year")
plt.title("Murder Rate in Five Random States")
plt.ylabel("Murder Rate");

**You're done! Congratulations.** 