[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb)


# Results:

By Friday, April 29 at 11:59PM, you must submit a draft of your results for at least one research question (we recommend trying to have a draft of both done by this time). If (and only if) you address all the criteria in the corresponding Results section above, you will receive full credit on the
checkpoint.
- MULTIPLE HYPOTHESIS TESTING: 
  - Summarize and interpret the results from the hypothesis tests themselves.
  - For the two correction methods you chose, clearly explain what kind of error rate is being controlled by each one
- CAUSAL INFERENCE:
  - Summarize and interpret your results, providing a clear statement about causality (or a lack thereof) including any assumptions necessary.
  - Where possible, discuss the uncertainty in your estimate and/or the evidence against the hypotheses you are investigating.
  
You are free to change your results section or add (or remove) content between the checkpoint and your final submission. Course staff will not provide any feedback on the research question checkpoint.

## Multiple Hypothesis Testing

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import statsmodels.api as sm
from statsmodels.graphics.regressionplots import abline_plot
import datetime
import math

import matplotlib.pyplot as plt

In [2]:
import random 
df = pd.read_csv('https://raw.githubusercontent.com/haroldcha/gunviolencedata/master/gun_violence.csv')

In [3]:
# reminder of cleaned dataset
df

Unnamed: 0.1,Unnamed: 0,incident_id,date,state,city_or_county,address,n_killed,n_injured,congressional_district,gun_stolen,...,participant_status,participant_type,state_house_district,state_senate_district,datetime,year,state_pop,city_pop,area (sq. mi),pop_density
0,0,461105,2013-01-01,Pennsylvania,Mckeesport,1506 Versailles Avenue and Coursin Street,0,4,14.0,,...,0::Arrested||1::Injured||2::Injured||3::Injure...,0::Victim||1::Victim||2::Victim||3::Victim||4:...,,,2013-01-01,2013,12776309,,46058,277.396088
1,1,460726,2013-01-01,California,Hawthorne,13500 block of Cerise Avenue,1,3,43.0,,...,0::Killed||1::Injured||2::Injured||3::Injured,0::Victim||1::Victim||2::Victim||3::Victim||4:...,62.0,35.0,2013-01-01,2013,38260787,85863.0,163707,233.715034
2,2,478855,2013-01-01,Ohio,Lorain,1776 East 28th Street,1,3,9.0,0::Unknown||1::Unknown,...,"0::Injured, Unharmed, Arrested||1::Unharmed, A...",0::Subject-Suspect||1::Subject-Suspect||2::Vic...,56.0,13.0,2013-01-01,2013,11576684,63735.0,44828,258.246721
3,3,478925,2013-01-05,Colorado,Aurora,16000 block of East Ithaca Place,4,0,6.0,,...,0::Killed||1::Killed||2::Killed||3::Killed,0::Victim||1::Victim||2::Victim||3::Subject-Su...,40.0,28.0,2013-01-05,2013,5269035,345613.0,104100,50.615130
4,4,478959,2013-01-07,North Carolina,Greensboro,307 Mourning Dove Terrace,2,2,6.0,0::Unknown||1::Unknown,...,0::Injured||1::Injured||2::Killed||3::Killed,0::Victim||1::Victim||2::Victim||3::Subject-Su...,62.0,27.0,2013-01-07,2013,9843336,279244.0,53821,182.890247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239672,239672,1083142,2018-03-31,Louisiana,Rayne,North Riceland Road and Highway 90,0,0,,0::Unknown,...,"0::Unharmed, Arrested",0::Subject-Suspect,,,2018-03-31,2018,4659690,,51843,89.880794
239673,239673,1083139,2018-03-31,Louisiana,Natchitoches,247 Keyser Ave,1,0,4.0,0::Unknown,...,"0::Killed||1::Unharmed, Arrested",0::Victim||1::Subject-Suspect,23.0,31.0,2018-03-31,2018,4659690,,51843,89.880794
239674,239674,1083151,2018-03-31,Louisiana,Gretna,1300 block of Cook Street,0,1,2.0,0::Unknown,...,0::Injured,0::Victim,85.0,7.0,2018-03-31,2018,4659690,,51843,89.880794
239675,239675,1082514,2018-03-31,Texas,Houston,12630 Ashford Point Dr,1,0,9.0,0::Unknown,...,0::Killed,0::Victim,149.0,17.0,2018-03-31,2018,28628666,2318573.0,268601,106.584361


In [161]:
def test_statistic(df, column):
    means_table = df.groupby(column).mean()
    if 1 not in means_table.index:
        diff = 0 - means_table.loc[0].values
    elif 0 not in means_table.index:
        diff = means_table.loc[1].values - 0
    else:
        diff = means_table.loc[1].values - means_table.loc[0].values
    
    return diff

In [138]:
def simulated_test_stat(df, column):
    shuffled_labels = df.sample(frac=1, replace=True)[column]
    df["shuffled_labels"] = shuffled_labels.to_numpy()
    df = df.drop(column,axis=1)
    
    return test_statistic(df, 'shuffled_labels')   

In [162]:
def p_value(df, column, observed_difference, comparison):
    differences = []

    repetitions = 5000
    for i in np.arange(repetitions):
        new_difference = simulated_test_stat(df, column)
        differences = np.append(differences, new_difference) 
    if comparison=="greater":
        #indicates that higher values of the difference favor the alternative hypothesis that the treatment incidents were higher on average
        print(differences, observed_difference)
        empirical_p = np.count_nonzero(differences >= observed_difference) / repetitions
    else:
        #indicates that lower values of the difference favor the alternative hypothesis that the treatment incidents were lower on average
        empirical_p = np.count_nonzero(differences <= observed_difference) / repetitions  
    return empirical_p

### Are more populated cities associated with deadlier instances of gun violence?

Null hypothesis: In the U.S., the distribution of deadly instances of gun violence (according to our definition of deadly above) is the same for cities that are populated and cities that are not. The difference in the sample is due to chance.

Alternative hypothesis: In the U.S., the instances of gun violence are deadlier for cities that are populated vs. cities that are not. 



In [62]:
most_populous = df.loc[df['city_or_county'].isin(['New York','Los Angeles','Chicago','Houston','Phoenix'])]
most_populous['populous?'] = 1
least_populous = df.loc[df['city_or_county'].isin(['Lakewood', 'Troy', 'Saginaw', 'Niagara Falls', 'Chaleston'])]
least_populous['populous?'] = 0
populous = pd.concat([most_populous, least_populous])
populous_df = populous.groupby(['city_or_county', 'populous?']).sum()
populous_df['num_incidents'] = populous.groupby(['city_or_county', 'populous?']).count()['incident_id'].values
populous_df['num_victims'] = populous_df['n_killed'].values + populous_df['n_injured'].values
populous_df = populous_df.loc[:,['num_incidents', 'num_victims']]
populous_df = populous_df.reset_index('populous?')
populous_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  most_populous['populous?'] = 1
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  least_populous['populous?'] = 0


Unnamed: 0_level_0,populous?,num_incidents,num_victims
city_or_county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Chicago,1,10814,12531
Houston,1,2501,2400
Lakewood,0,113,85
Los Angeles,1,1066,1189
New York,1,377,327
Niagara Falls,0,182,75
Phoenix,1,973,886
Saginaw,0,242,218
Troy,0,167,89


In [163]:
victim_populous = populous_df.loc[:,['populous?', 'num_victims']]
test_victim_populous = test_statistic(victim_populous, 'populous?')
p_value(victim_populous, 'populous?', test_victim_populous, 'greater')

[ 1565.66666667  3295.3         2739.65       ...  1279.16666667
 -3230.95       -3445.33333333] [3349.85]


0.0672

### Are more populated cities associated with more instances of gun violence?

Null hypothesis: In the U.S., the distribution of instances of gun violence is the same for cities that are populated and cities that are not. The difference in the sample is due to chance.

Alternative hypothesis: In the U.S., the number of instances of gun violence (according to our definition of deadly above) is higher for cities that are populated vs. cities that are not. 

In [189]:
incidents_populous = populous_df.loc[:,['populous?', 'num_incidents']]
test_incidents_populous = test_statistic(incidents_populous, 'populous?')
p_value(incidents_populous, 'populous?', test_victim_populous, 'greater')

[ 3291.83333333 -3210.83333333  2478.16666667 ...  2031.66666667
  -834.66666667    54.78571429] [2970.2]


0.0586

### Are more densely populated states associated with deadlier instances of gun violence?

Null hypothesis: In the U.S., the distribution of deadly instances of gun violence (according to our definition of deadly above) is the same for cities that are densely populated and cities that are not. The difference in the sample is due to chance.

Alternative hypothesis: In the U.S., the instances of gun violence are deadlier for cities that are densely populated vs. cities that are not. 

In [165]:
df

Unnamed: 0.1,Unnamed: 0,incident_id,date,state,city_or_county,address,n_killed,n_injured,congressional_district,gun_stolen,...,participant_status,participant_type,state_house_district,state_senate_district,datetime,year,state_pop,city_pop,area (sq. mi),pop_density
0,0,461105,2013-01-01,Pennsylvania,Mckeesport,1506 Versailles Avenue and Coursin Street,0,4,14.0,,...,0::Arrested||1::Injured||2::Injured||3::Injure...,0::Victim||1::Victim||2::Victim||3::Victim||4:...,,,2013-01-01,2013,12776309,,46058,277.396088
1,1,460726,2013-01-01,California,Hawthorne,13500 block of Cerise Avenue,1,3,43.0,,...,0::Killed||1::Injured||2::Injured||3::Injured,0::Victim||1::Victim||2::Victim||3::Victim||4:...,62.0,35.0,2013-01-01,2013,38260787,85863.0,163707,233.715034
2,2,478855,2013-01-01,Ohio,Lorain,1776 East 28th Street,1,3,9.0,0::Unknown||1::Unknown,...,"0::Injured, Unharmed, Arrested||1::Unharmed, A...",0::Subject-Suspect||1::Subject-Suspect||2::Vic...,56.0,13.0,2013-01-01,2013,11576684,63735.0,44828,258.246721
3,3,478925,2013-01-05,Colorado,Aurora,16000 block of East Ithaca Place,4,0,6.0,,...,0::Killed||1::Killed||2::Killed||3::Killed,0::Victim||1::Victim||2::Victim||3::Subject-Su...,40.0,28.0,2013-01-05,2013,5269035,345613.0,104100,50.615130
4,4,478959,2013-01-07,North Carolina,Greensboro,307 Mourning Dove Terrace,2,2,6.0,0::Unknown||1::Unknown,...,0::Injured||1::Injured||2::Killed||3::Killed,0::Victim||1::Victim||2::Victim||3::Subject-Su...,62.0,27.0,2013-01-07,2013,9843336,279244.0,53821,182.890247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239672,239672,1083142,2018-03-31,Louisiana,Rayne,North Riceland Road and Highway 90,0,0,,0::Unknown,...,"0::Unharmed, Arrested",0::Subject-Suspect,,,2018-03-31,2018,4659690,,51843,89.880794
239673,239673,1083139,2018-03-31,Louisiana,Natchitoches,247 Keyser Ave,1,0,4.0,0::Unknown,...,"0::Killed||1::Unharmed, Arrested",0::Victim||1::Subject-Suspect,23.0,31.0,2018-03-31,2018,4659690,,51843,89.880794
239674,239674,1083151,2018-03-31,Louisiana,Gretna,1300 block of Cook Street,0,1,2.0,0::Unknown,...,0::Injured,0::Victim,85.0,7.0,2018-03-31,2018,4659690,,51843,89.880794
239675,239675,1082514,2018-03-31,Texas,Houston,12630 Ashford Point Dr,1,0,9.0,0::Unknown,...,0::Killed,0::Victim,149.0,17.0,2018-03-31,2018,28628666,2318573.0,268601,106.584361


In [188]:
most_dense = df.loc[df['city_or_county'].isin(['New York','San Francisco','Boston','Miami','Chicago'])]
most_dense['dense?'] = 1
least_dense = df.loc[df['city_or_county'].isin(['Anchorage','Augusta','Norman','Chesapeake','Columbus'])]
least_dense['dense?'] = 0
dense = pd.concat([most_dense, least_dense])
dense_df = dense.groupby(['city_or_county', 'dense?']).sum()
dense_df['num_incidents'] = dense.groupby(['city_or_county', 'dense?']).count()['incident_id'].values
dense_df['num_victims'] = dense_df['n_killed'].values + dense_df['n_injured'].values
dense_df = dense_df.loc[:,['num_incidents', 'num_victims']]
dense_df = dense_df.reset_index('dense?')
dense_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  most_dense['dense?'] = 1
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  least_dense['dense?'] = 0


Unnamed: 0_level_0,dense?,num_incidents,num_victims
city_or_county,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Anchorage,0,469,298
Augusta,0,362,251
Boston,1,1737,762
Chesapeake,0,141,127
Chicago,1,10814,12531
Columbus,0,2252,1883
Miami,1,846,922
New York,1,377,327
Norman,0,50,36
San Francisco,1,421,413


In [190]:
victims_dense = dense_df.loc[:,['dense?', 'num_victims']]
test_victim_dense = test_statistic(victims_dense, 'dense?')
p_value(victims_dense, 'dense?', test_victim_dense, 'greater')

[-1618.88888889 -3341.25       -2831.66666667 ...  3619.04761905
  1335.23809524  1645.83333333] [2472.]


0.1912

### Are more densely populated states associated with more instances of gun violence?

Null hypothesis: In the U.S., the distribution of instances of gun violence is the same for cities that are densely populated and cities that are not. The difference in the sample is due to chance.

Alternative hypothesis: In the U.S., the number of instances of gun violence (according to our definition of deadly above) is higher for cities that are densely populated vs. cities that are not. 

In [191]:
incidents_dense = dense_df.loc[:,['dense?', 'num_victims']]
test_incidents_dense = test_statistic(incidents_dense, 'dense?')
p_value(incidents_dense, 'dense?', test_incidents_dense, 'greater')

[ 1928.0952381  -2854.58333333 -2681.66666667 ...  1549.52380952
  2802.91666667 -5717.5       ] [2472.]


0.1826

### Are crimes getting less violent over time (number of casualties)?

Null hypothesis: In the U.S., the distribution of instances of deadly gun violence is the same before 2014 as it was after. The difference in the sample is due to chance.

Alternative hypothesis: In the U.S., the number of instances of deadly gun is higher before 2014 than it was after. 

In [192]:
df

Unnamed: 0.1,Unnamed: 0,incident_id,date,state,city_or_county,address,n_killed,n_injured,congressional_district,gun_stolen,...,participant_status,participant_type,state_house_district,state_senate_district,datetime,year,state_pop,city_pop,area (sq. mi),pop_density
0,0,461105,2013-01-01,Pennsylvania,Mckeesport,1506 Versailles Avenue and Coursin Street,0,4,14.0,,...,0::Arrested||1::Injured||2::Injured||3::Injure...,0::Victim||1::Victim||2::Victim||3::Victim||4:...,,,2013-01-01,2013,12776309,,46058,277.396088
1,1,460726,2013-01-01,California,Hawthorne,13500 block of Cerise Avenue,1,3,43.0,,...,0::Killed||1::Injured||2::Injured||3::Injured,0::Victim||1::Victim||2::Victim||3::Victim||4:...,62.0,35.0,2013-01-01,2013,38260787,85863.0,163707,233.715034
2,2,478855,2013-01-01,Ohio,Lorain,1776 East 28th Street,1,3,9.0,0::Unknown||1::Unknown,...,"0::Injured, Unharmed, Arrested||1::Unharmed, A...",0::Subject-Suspect||1::Subject-Suspect||2::Vic...,56.0,13.0,2013-01-01,2013,11576684,63735.0,44828,258.246721
3,3,478925,2013-01-05,Colorado,Aurora,16000 block of East Ithaca Place,4,0,6.0,,...,0::Killed||1::Killed||2::Killed||3::Killed,0::Victim||1::Victim||2::Victim||3::Subject-Su...,40.0,28.0,2013-01-05,2013,5269035,345613.0,104100,50.615130
4,4,478959,2013-01-07,North Carolina,Greensboro,307 Mourning Dove Terrace,2,2,6.0,0::Unknown||1::Unknown,...,0::Injured||1::Injured||2::Killed||3::Killed,0::Victim||1::Victim||2::Victim||3::Subject-Su...,62.0,27.0,2013-01-07,2013,9843336,279244.0,53821,182.890247
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239672,239672,1083142,2018-03-31,Louisiana,Rayne,North Riceland Road and Highway 90,0,0,,0::Unknown,...,"0::Unharmed, Arrested",0::Subject-Suspect,,,2018-03-31,2018,4659690,,51843,89.880794
239673,239673,1083139,2018-03-31,Louisiana,Natchitoches,247 Keyser Ave,1,0,4.0,0::Unknown,...,"0::Killed||1::Unharmed, Arrested",0::Victim||1::Subject-Suspect,23.0,31.0,2018-03-31,2018,4659690,,51843,89.880794
239674,239674,1083151,2018-03-31,Louisiana,Gretna,1300 block of Cook Street,0,1,2.0,0::Unknown,...,0::Injured,0::Victim,85.0,7.0,2018-03-31,2018,4659690,,51843,89.880794
239675,239675,1082514,2018-03-31,Texas,Houston,12630 Ashford Point Dr,1,0,9.0,0::Unknown,...,0::Killed,0::Victim,149.0,17.0,2018-03-31,2018,28628666,2318573.0,268601,106.584361


In [198]:
after = df.loc[df['year'] >= 2014]
after['after?'] = 1
before = df.loc[df['year'] < 2014]
before['after?'] = 0
time = pd.concat([after, before])
time_df = time.groupby(['year', 'after?']).sum()
time_df['num_victims'] = time_df['n_killed'].values + time_df['n_injured'].values
time_df = time_df.loc[:,['num_victims']]
time_df = time_df.reset_index('after?')
time_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after['after?'] = 1
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  before['after?'] = 0


Unnamed: 0_level_0,after?,num_victims
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2013,0,1296
2014,1,35559
2015,1,40451
2016,1,45646
2017,1,46214
2018,1,9704


In [199]:
test_time = test_statistic(time_df, 'after?')
p_value(time_df, 'after?', test_time, 'greater')

[29811.66666667 17076.25       24129.2        ... 24129.2
 29811.66666667 24129.2       ] [34218.8]


0.0774