# Hypothesis 2

## The proportion of incarcerated people to total population differs between states with Republican (R) and Democratic (D) senators

${H_0}: \textrm{There is no difference in the proportion of incarcerated persons to total population between R and D states}$
${H_\alpha}: \textrm{There is a difference in the proportion of incarcerated persons to total population between R and D states}$

$\alpha = 0.05$

### Method



In [466]:
import requests
import pandas as pd
import pickle
import re
import numpy as np
from scipy import stats

In [2]:
# Get Senator information
df = pd.DataFrame()

headers = {'X-API-Key': 'l8xbuedseZHG4JrxvS9u40eyNMOWRTVDL4IfnIb6'}
cong_nums = ['112', '113', '114']
url = 'https://api.propublica.org/congress/v1/112/senate/members.json'
response = requests.get(url, headers=headers)

    
print(response)
print(response.text)

<Response [200]>
{
   "status":"OK",
   "copyright":" Copyright (c) 2019 Pro Publica Inc. All Rights Reserved.",
   "results":[
      {
         "congress": "112",
         "chamber": "Senate",
         
         
         "num_results": 102,
         "offset": 0,
         "members": [
              {
                 "id": "A000069",
                 "title": "Senator, 1st Class",
                 "short_title": "Sen.",
                 "api_uri":"https://api.propublica.org/congress/v1/members/A000069.json",
                 "first_name": "Daniel",
                 "middle_name": "K.",
                 "last_name": "Akaka",
                 "suffix": null,
                 "date_of_birth": "1924-09-11",
                 "gender": "M",
                 "party": "D",
                 "leadership_role": null,
                 "twitter_account": "SenatorAkaka",
                 "facebook_account": null,
                 "youtube_account": "senatorakaka",
                 "govtrack_id": "3
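
The requests above send the API key in an `X-API-Key` header. A minimal sketch of reading that key from the environment rather than from the notebook source; the variable name `PROPUBLICA_API_KEY` and the helper `build_headers` are assumptions for illustration, not part of the original notebook:

```python
import os

def build_headers(env_var="PROPUBLICA_API_KEY"):
    """Build the request headers from an environment variable (hypothetical name)."""
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"X-API-Key": api_key}
```

The returned dictionary can be passed as `headers=` to `requests.get` in the same way as the literal dictionary used in the cells.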

In [57]:
len(response.json()['results'][0]['members'])

102

In [59]:
response.json()['results'][0]['members'][0]['first_name']

'Daniel'

In [60]:
response.json()['results'][0]['members'][0]['last_name']

'Akaka'

In [61]:
response.json()['results'][0]['members'][0]['party']

'D'

In [62]:
response.json()['results'][0]['members'][0]['state']

'HI'

In [212]:
congress_nums = ['112', '113', '114']

def get_senators(cong_nums):
    """Return one row per senator for each Congress number in cong_nums."""
    headers = {'X-API-Key': 'l8xbuedseZHG4JrxvS9u40eyNMOWRTVDL4IfnIb6'}
    base_url = 'https://api.propublica.org/congress/v1/{}/senate/members.json'
    rows = []
    
    for cong in cong_nums:
        response = requests.get(base_url.format(cong), headers=headers)
        # Parse the JSON once per request instead of once per field
        members = response.json()['results'][0]['members']
        for member in members:
            rows.append([cong, member['first_name'], member['last_name'],
                         member['party'], member['state']])
            
    # DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
    df = pd.DataFrame(rows, columns=['congress', 'first_name', 'last_name', 'party', 'state'])
    return df

In [424]:
senators = get_senators(['114'])

# ... or load it from the pickle, if you already ran this, by uncommenting
# with open('senatorsDB', 'rb') as dbfile:
#     senators = pickle.load(dbfile)


# pickle this info ('wb' overwrites the file; 'ab' would append another pickle on every run,
# and pickle.load would only ever return the first, oldest one)
with open('senatorsDB', 'wb') as dbfile:
    pickle.dump(senators, dbfile)

senators.head()

Unnamed: 0,congress,first_name,last_name,party,state
0,114,Lamar,Alexander,R,TN
1,114,Kelly,Ayotte,R,NH
2,114,Tammy,Baldwin,D,WI
3,114,John,Barrasso,R,WY
4,114,Michael,Bennet,D,CO
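
The cell above either fetches the senators or reloads them from a pickle by commenting lines in and out. A minimal sketch of the same idea as a single helper, assuming a hypothetical function name `load_or_build`; it is one way to structure the cache, not the notebook's own method:

```python
import os
import pickle

def load_or_build(path, build):
    """Return the object cached at path, or build it, cache it, and return it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    obj = build()
    # 'wb' overwrites any previous file, so the cache holds exactly one object
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    return obj
```

In this notebook the call would look like `senators = load_or_build('senatorsDB', lambda: get_senators(['114']))`, so the API is only queried when no cache file exists.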


In [425]:
# Convert parties to numeric
# R = 0, D = 1, I = 2

senators['party'] = senators['party'].apply(lambda x : 0 if x == 'R' else 1 if x == 'D' else 2)

In [426]:
senators.head()

Unnamed: 0,congress,first_name,last_name,party,state
0,114,Lamar,Alexander,0,TN
1,114,Kelly,Ayotte,0,NH
2,114,Tammy,Baldwin,1,WI
3,114,John,Barrasso,0,WY
4,114,Michael,Bennet,1,CO


In [236]:
state_abbrevs = sorted(senators['state'].unique())

In [336]:
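# Map state names to abbreviations.
# states_list is ordered to match the alphabetically sorted abbreviations (AK, AL, AR, AZ, ...),
# not the alphabetical order of the names, so that zip() pairs each name with its own abbreviation.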

states_list = ['Alaska','Alabama','Arkansas','Arizona','California','Colorado','Connecticut','Delaware','Florida','Georgia','Hawaii', 'Iowa','Idaho','Illinois','Indiana', 'Kansas','Kentucky','Louisiana', 'Massachusetts', 'Maryland','Maine', 'Michigan','Minnesota', 'Missouri','Mississippi', 'Montana', 'North Carolina', 'North Dakota', 'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada', 'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Virginia', 'Vermont', 'Washington', 'Wisconsin', 'West Virginia', 'Wyoming']
state_dict = dict(zip(states_list, state_abbrevs))
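
Because `state_dict` pairs names and abbreviations purely by position, a single misplaced name would silently shift the mapping. A minimal sketch of a guard, assuming a hypothetical helper `check_state_mapping` and a handful of known pairs supplied by hand:

```python
def check_state_mapping(names, abbrevs, spot_checks):
    """Pair names with abbreviations by position and verify a few known pairs."""
    if len(names) != len(abbrevs):
        raise ValueError(f"{len(names)} names but {len(abbrevs)} abbreviations")
    mapping = dict(zip(names, abbrevs))
    # Collect every spot check whose mapped value disagrees with the known one
    wrong = {n: mapping.get(n) for n, a in spot_checks.items() if mapping.get(n) != a}
    if wrong:
        raise ValueError(f"Misaligned pairs: {wrong}")
    return mapping

# Small demonstration with the first four entries of the list above
names = ['Alaska', 'Alabama', 'Arkansas', 'Arizona']
abbrevs = sorted(['AL', 'AK', 'AZ', 'AR'])  # ['AK', 'AL', 'AR', 'AZ']
mapping = check_state_mapping(names, abbrevs, {'Alaska': 'AK', 'Arizona': 'AZ'})
print(mapping['Alabama'])  # AL
```

Calling it with `states_list`, `state_abbrevs`, and a few hand-checked pairs would raise immediately if the two lists ever fall out of step.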

In [337]:
# Load incarceration data
# Caveat: inmates in federal custody are reported in a single Federal row, so we can't
# assign them to the states where they are physically held.
# For purposes of this analysis I've eliminated the Federal row.

inc_data = pd.read_csv('prison_custody_by_state.csv')
inc_data.head()

Unnamed: 0,jurisdiction,includes_jails,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
0,Federal,0,149852,158216,168144,177600,186364,190844,197285,198414,205087,206968,214774,216915,214989,209561,195622,188311
1,Alabama,0,24741,25100,27614,25635,24315,24103,25253,25363,27241,27345,26813,26768,26825,26145,25212,23745
2,Alaska,1,4570,4351,4472,4534,4798,5052,5151,4997,5472,5369,6216,6308,5081,6323,5247,4378
3,Arizona,0,27710,29359,31084,32384,33345,35752,37700,39455,40544,40130,39949,40013,41031,42136,42204,42248
4,Arkansas,0,11489,11849,12068,12577,12455,12854,13275,13135,13338,14192,14090,14043,14295,15250,15784,15833


In [338]:
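# Drop the Federal row (row 0); .copy() avoids a SettingWithCopyWarning when the column is reassigned below
inc_data = inc_data.iloc[1:,:].copy()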

In [340]:
inc_data['jurisdiction'] = inc_data['jurisdiction'].map(state_dict)

In [341]:
inc_data.head()

Unnamed: 0,jurisdiction,includes_jails,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
1,AL,0,24741,25100,27614,25635,24315,24103,25253,25363,27241,27345,26813,26768,26825,26145,25212,23745
2,AK,1,4570,4351,4472,4534,4798,5052,5151,4997,5472,5369,6216,6308,5081,6323,5247,4378
3,AZ,0,27710,29359,31084,32384,33345,35752,37700,39455,40544,40130,39949,40013,41031,42136,42204,42248
4,AR,0,11489,11849,12068,12577,12455,12854,13275,13135,13338,14192,14090,14043,14295,15250,15784,15833
5,CA,0,157142,159695,161785,163939,168035,172298,171444,171085,168830,162821,147578,132935,134339,134430,127815,129416


In [342]:
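# Keep only the state and its 2016 count
inc2016 = inc_data.filter(['jurisdiction', '2016']).copy()
# Strip thousands separators, if any, and convert the counts to integers
inc2016['2016'] = inc2016['2016'].replace({',':''}, regex=True).astype(int)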

In [343]:
# Total 2016 incarcerated population in this dataset (Federal row excluded); compare with the BJS figure of 6,582,100

inc2016['2016'].sum()

1228706

In [None]:
### Add the senators to inc2016 DF

In [427]:
# Get and compare parties of senators from each state
# Test regex for proof of concept
regex = re.compile("NY")
sens = senators[senators['state'].str.match(regex) == True]

In [429]:
sens

Unnamed: 0,congress,first_name,last_name,party,state
37,114,Kirsten,Gillibrand,1,NY
82,114,Charles,Schumer,1,NY


In [430]:
# create DF for state parties

def determine_party():
    """Classify each state by its senators: 0 = both R, 1 = both D, 2 = mixed."""
    states = []
    parties = []
    
    # for each state:
    for state in state_abbrevs:
        # get both senators for state (exact match on the abbreviation)
        state_sens = senators[senators['state'] == state]
        
        # Set the state to merge with in the main DF
        states.append(state)
        
        first = state_sens['party'].iloc[0]
        second = state_sens['party'].iloc[1]
        if first == second and first in (0, 1):
            # both R (0) or both D (1)
            parties.append(first)
        else:
            # mixed delegation, including any state with an Independent senator
            parties.append(2)
    sen_parties = pd.DataFrame({'state': states, 'party': parties})
    return sen_parties
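
`determine_party` looks only at the first two rows for each state. A minimal sketch of the same classification in plain Python that uses every row for a state, assuming `(state, party)` pairs as input; the function name and the sample rows (including the placeholder state `XX`) are illustrative, not taken from the API:

```python
from collections import defaultdict

def classify_delegations(senator_rows):
    """Map each state to 'R', 'D', or 'mixed' from (state, party) pairs."""
    parties_by_state = defaultdict(set)
    for state, party in senator_rows:
        parties_by_state[state].add(party)

    labels = {}
    for state, parties in parties_by_state.items():
        if parties == {'R'}:
            labels[state] = 'R'
        elif parties == {'D'}:
            labels[state] = 'D'
        else:
            # Split delegations and any delegation with an Independent
            labels[state] = 'mixed'
    return labels

# Illustrative rows only
rows = [('NY', 'D'), ('NY', 'D'), ('CO', 'D'), ('CO', 'R'), ('XX', 'I'), ('XX', 'I')]
labels = classify_delegations(rows)
print(labels)  # {'NY': 'D', 'CO': 'mixed', 'XX': 'mixed'}
```

Using a set per state means a third row (for example a mid-term replacement) is taken into account rather than ignored.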

In [431]:
sen_parties = determine_party()

In [432]:
sen_parties

Unnamed: 0,state,party
0,AK,0
1,AL,0
2,AR,0
3,AZ,0
4,CA,1
5,CO,2
6,CT,1
7,DE,1
8,FL,2
9,GA,0


In [344]:
## State population estimates from the U.S. Census. I have averaged the 2012-2013 estimates and
## the 2016-2017 estimates to account for the discrepancy in measuring periods

pop_data_2016 = pd.read_csv('pop/PEP_2017_PEPANNRES.csv', header=1)
pop_data_2016.head(50)

pop_data_2016.drop(['Id', 'Id2'], axis=1, inplace=True)

In [345]:
pop_data_2016.reset_index(inplace=True, drop=True)
pop_data_2016

Unnamed: 0,Geography,"April 1, 2010 - Census","April 1, 2010 - Estimates Base",Population Estimate (as of July 1) - 2010,Population Estimate (as of July 1) - 2011,Population Estimate (as of July 1) - 2012,Population Estimate (as of July 1) - 2013,Population Estimate (as of July 1) - 2014,Population Estimate (as of July 1) - 2015,Population Estimate (as of July 1) - 2016,Population Estimate (as of July 1) - 2017
0,United States,308745538,308758105,309338421,311644280,313993272,316234505,318622525,321039839,323405935,325719178
1,Alabama,4779736,4780135,4785579,4798649,4813946,4827660,4840037,4850858,4860545,4874747
2,Alaska,710231,710249,714015,722259,730825,736760,736759,737979,741522,739795
3,Arizona,6392017,6392309,6407002,6465488,6544211,6616124,6706435,6802262,6908642,7016270
4,Arkansas,2915918,2916031,2921737,2938640,2949208,2956780,2964800,2975626,2988231,3004279
5,California,37253956,37254518,37327690,37672654,38019006,38347383,38701278,39032444,39296476,39536653
6,Colorado,5029196,5029325,5048029,5116411,5186330,5262556,5342311,5440445,5530105,5607154
7,Connecticut,3574097,3574114,3580171,3591927,3597705,3602470,3600188,3593862,3587685,3588184
8,Delaware,897934,897936,899712,907884,916868,925114,934805,944107,952698,961939
9,District of Columbia,601723,601766,605040,620336,635630,650114,660797,672736,684336,693972


In [346]:
pop_data_2016['Geography'] = pop_data_2016['Geography'].map(state_dict)

In [411]:
pop_data_2016.rename(columns={'Geography':'state'}, inplace=True)

In [412]:
pop_data_2016.head()

Unnamed: 0,state,"April 1, 2010 - Census","April 1, 2010 - Estimates Base",Population Estimate (as of July 1) - 2010,Population Estimate (as of July 1) - 2011,Population Estimate (as of July 1) - 2012,Population Estimate (as of July 1) - 2013,Population Estimate (as of July 1) - 2014,Population Estimate (as of July 1) - 2015,Population Estimate (as of July 1) - 2016,Population Estimate (as of July 1) - 2017
1,AL,4779736,4780135,4785579,4798649,4813946,4827660,4840037,4850858,4860545,4874747
2,AK,710231,710249,714015,722259,730825,736760,736759,737979,741522,739795
3,AZ,6392017,6392309,6407002,6465488,6544211,6616124,6706435,6802262,6908642,7016270
4,AR,2915918,2916031,2921737,2938640,2949208,2956780,2964800,2975626,2988231,3004279
5,CA,37253956,37254518,37327690,37672654,38019006,38347383,38701278,39032444,39296476,39536653


In [348]:
# Drop the United States total (row 0)
pop_data_2016.drop([0,], inplace=True)

In [349]:
# Drop the District of Columbia (row 9), which has no senators
pop_data_2016.drop([9,], inplace=True)

In [350]:
pop_data_2016.drop([52,], inplace=True)

In [357]:
# pickle this info
# 'wb' overwrites the file; 'ab' would append another pickle on every run
with open('populationDB', 'wb') as dbfile:
    pickle.dump(pop_data_2016, dbfile)

In [445]:
### CLEANED STATE POPULATION DATA FROM U.S. CENSUS ###

pop_data_2016.head()

Unnamed: 0,state,"April 1, 2010 - Census","April 1, 2010 - Estimates Base",Population Estimate (as of July 1) - 2010,Population Estimate (as of July 1) - 2011,Population Estimate (as of July 1) - 2012,Population Estimate (as of July 1) - 2013,Population Estimate (as of July 1) - 2014,Population Estimate (as of July 1) - 2015,Population Estimate (as of July 1) - 2016,Population Estimate (as of July 1) - 2017
1,AL,4779736,4780135,4785579,4798649,4813946,4827660,4840037,4850858,4860545,4874747
2,AK,710231,710249,714015,722259,730825,736760,736759,737979,741522,739795
3,AZ,6392017,6392309,6407002,6465488,6544211,6616124,6706435,6802262,6908642,7016270
4,AR,2915918,2916031,2921737,2938640,2949208,2956780,2964800,2975626,2988231,3004279
5,CA,37253956,37254518,37327690,37672654,38019006,38347383,38701278,39032444,39296476,39536653


In [446]:
sen_parties.head()

Unnamed: 0,state,party
0,AK,0
1,AL,0
2,AR,0
3,AZ,0
4,CA,1


In [434]:
master_data = pd.merge(pop_data_2016, sen_parties, on='state')
master_data.head()

Unnamed: 0,state,"April 1, 2010 - Census","April 1, 2010 - Estimates Base",Population Estimate (as of July 1) - 2010,Population Estimate (as of July 1) - 2011,Population Estimate (as of July 1) - 2012,Population Estimate (as of July 1) - 2013,Population Estimate (as of July 1) - 2014,Population Estimate (as of July 1) - 2015,Population Estimate (as of July 1) - 2016,Population Estimate (as of July 1) - 2017,party
0,AL,4779736,4780135,4785579,4798649,4813946,4827660,4840037,4850858,4860545,4874747,0
1,AK,710231,710249,714015,722259,730825,736760,736759,737979,741522,739795,0
2,AZ,6392017,6392309,6407002,6465488,6544211,6616124,6706435,6802262,6908642,7016270,0
3,AR,2915918,2916031,2921737,2938640,2949208,2956780,2964800,2975626,2988231,3004279,0
4,CA,37253956,37254518,37327690,37672654,38019006,38347383,38701278,39032444,39296476,39536653,1


In [439]:
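# inc2016 still uses the column names 'jurisdiction' and '2016'; rename them so the merge key
# matches and the count column is labelled as in the output below
master_data = pd.merge(master_data,
                       inc2016.rename(columns={'jurisdiction': 'state', '2016': '2016_inc_pop'}),
                       on='state')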

In [440]:
master_data.head(15)

Unnamed: 0,state,"April 1, 2010 - Census","April 1, 2010 - Estimates Base",Population Estimate (as of July 1) - 2010,Population Estimate (as of July 1) - 2011,Population Estimate (as of July 1) - 2012,Population Estimate (as of July 1) - 2013,Population Estimate (as of July 1) - 2014,Population Estimate (as of July 1) - 2015,Population Estimate (as of July 1) - 2016,Population Estimate (as of July 1) - 2017,party,2016_inc_pop
0,AL,4779736,4780135,4785579,4798649,4813946,4827660,4840037,4850858,4860545,4874747,0,23745
1,AK,710231,710249,714015,722259,730825,736760,736759,737979,741522,739795,0,4378
2,AZ,6392017,6392309,6407002,6465488,6544211,6616124,6706435,6802262,6908642,7016270,0,42248
3,AR,2915918,2916031,2921737,2938640,2949208,2956780,2964800,2975626,2988231,3004279,0,15833
4,CA,37253956,37254518,37327690,37672654,38019006,38347383,38701278,39032444,39296476,39536653,1,129416
5,CO,5029196,5029325,5048029,5116411,5186330,5262556,5342311,5440445,5530105,5607154,2,19486
6,CT,3574097,3574114,3580171,3591927,3597705,3602470,3600188,3593862,3587685,3588184,1,15040
7,DE,897934,897936,899712,907884,916868,925114,934805,944107,952698,961939,1,6334
8,FL,18801310,18804594,18846461,19097369,19341327,19584927,19897747,20268567,20656589,20984400,2,98010
9,GA,9687653,9688690,9712696,9810595,9911171,9981773,10083850,10199533,10313620,10429379,0,53433


In [449]:
# Average the July 1, 2016 and July 1, 2017 estimates for each state
master_data['avg_US_pop_2016'] = (master_data['Population Estimate (as of July 1) - 2016']
                                  + master_data['Population Estimate (as of July 1) - 2017'])/2

In [459]:
master_data['inc_to_total'] = master_data['2016_inc_pop']/master_data['avg_US_pop_2016']

In [460]:
### THE DATA FOR ANALYSIS ###
master_data.head()

Unnamed: 0,state,"April 1, 2010 - Census","April 1, 2010 - Estimates Base",Population Estimate (as of July 1) - 2010,Population Estimate (as of July 1) - 2011,Population Estimate (as of July 1) - 2012,Population Estimate (as of July 1) - 2013,Population Estimate (as of July 1) - 2014,Population Estimate (as of July 1) - 2015,Population Estimate (as of July 1) - 2016,Population Estimate (as of July 1) - 2017,party,2016_inc_pop,avg_US_pop_2016,inc_to_total
0,AL,4779736,4780135,4785579,4798649,4813946,4827660,4840037,4850858,4860545,4874747,0,23745,4867646.0,0.004878
1,AK,710231,710249,714015,722259,730825,736760,736759,737979,741522,739795,0,4378,740658.5,0.005911
2,AZ,6392017,6392309,6407002,6465488,6544211,6616124,6706435,6802262,6908642,7016270,0,42248,6962456.0,0.006068
3,AR,2915918,2916031,2921737,2938640,2949208,2956780,2964800,2975626,2988231,3004279,0,15833,2996255.0,0.005284
4,CA,37253956,37254518,37327690,37672654,38019006,38347383,38701278,39032444,39296476,39536653,1,129416,39416564.5,0.003283
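
As a check on the `inc_to_total` column, the Alabama row can be reproduced by hand from the figures in the table above. The helper name `incarceration_rate` is an assumption for illustration:

```python
def incarceration_rate(inc_pop, pop_2016, pop_2017):
    """Incarcerated population divided by the mean of two July 1 population estimates."""
    avg_pop = (pop_2016 + pop_2017) / 2
    return inc_pop / avg_pop

# Alabama: 2016 incarcerated population and the 2016/2017 population estimates
al_rate = incarceration_rate(23745, 4860545, 4874747)
print(round(al_rate, 6))        # 0.004878
print(round(al_rate * 100_000)) # 488 (per 100,000 residents)
```

Expressing the ratio per 100,000 residents is only a change of units, but it makes the state-to-state differences easier to read than six-decimal proportions.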


In [461]:
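# Separate out the groups for analysis
# In sen_parties the codes describe the state's delegation: 0 = both senators R, 1 = both D, 2 = mixed.
# group_I therefore holds the states with a split delegation, not only states with Independents.
group_R = master_data[master_data['party'] == 0]
group_D = master_data[master_data['party'] == 1]
group_I = master_data[master_data['party'] == 2]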

## Statistical analysis

In [467]:
F, p = stats.f_oneway(group_R['inc_to_total'], group_D['inc_to_total'], group_I['inc_to_total'])

In [491]:
print(f'The F value is {round(F,2)}.')

The F value is 2.81.


In [472]:
print(f'The p-value is {round(p,2)}, which fails to satisfy the requirements for statistical significance (<= 0.05).')

The p-value is 0.07, which fails to satisfy the requirements for statistical significance (<= 0.05).


## Next step: Increase the power of the hypothesis test with a larger sample (add Congresses)

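The F ratio is the ratio of two mean square values: the variation between the group means divided by the variation within the groups. If the null hypothesis is true, we would expect F to have a value close to 1. A large F ratio means that the variation among group means is greater than chance alone would predict.

An F ratio of 2.81 hints that the null hypothesis may be wrong (that is, the groups are not sampled from populations with the same mean), but with p = 0.07 > $\alpha$ = 0.05 we fail to reject it. Note that the test compared three groups (both senators R, both D, and mixed delegations), not only R and D states.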
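
To make the "ratio of two mean squares" concrete, here is a minimal sketch that computes the one-way ANOVA F statistic by hand on toy numbers. The function name `one_way_f` and the data are assumptions for illustration; the notebook itself uses `stats.f_oneway`:

```python
def one_way_f(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: group size times squared distance of group mean from grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared distance of each value from its own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within

f_stat, df_b, df_w = one_way_f([1, 2, 3], [2, 3, 4], [3, 4, 5])
print(f_stat, df_b, df_w)  # 3.0 2 6
```

In the toy data the group means (2, 3, 4) differ while each group has the same spread, giving a between-group mean square of 3 against a within-group mean square of 1.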


_What next?_

+ Two-way ANOVA, with time and political valence (to increase observations and therefore power)
+ Repeated measures ANOVA (because prison population is expected to be similar over time) 
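
Adding Congresses means pairing each one with the matching year columns of the incarceration data. A minimal sketch of that lookup, assuming the usual numbering in which the 1st Congress began in 1789 and each Congress spans two years; the helper name `congress_years` is hypothetical:

```python
def congress_years(congress):
    """Return (start_year, end_year) for a numbered U.S. Congress."""
    start = 1789 + 2 * (int(congress) - 1)
    return start, start + 2

for cong in ['112', '113', '114']:
    print(cong, congress_years(cong))
# 112 (2011, 2013)
# 113 (2013, 2015)
# 114 (2015, 2017)
```

Under this mapping the three Congresses already listed in `congress_nums` fall within the 2011 to 2016 columns of the incarceration data.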