In [392]:
import pandas as pd
import pyopenstates as ostate
from time import sleep

# About the data we are collecting

So far we have data on which party controls the senate, house, and governorship in each state, data on each bill introduced, data on whether that bill passed, and data on who sponsored it. The sponsorship data gives us each sponsor's name but not their party. The goal of this notebook is to find that information, since knowing which party introduced a bill could be very helpful for building a good model. The data on legislators' parties again comes from Open States, but this time we will use their Python [API](https://openstates.github.io/pyopenstates/). The main benefit of the Open States API is that the legislator ids in our sponsorship data can be matched directly against it, because both come from Open States. One limitation is the cap on the number of requests we can make per day, so collecting all of the data we need will take a few days.

To use the API you need to create an account and then get an API key from Open States.

# Read in sponsorship data

In [372]:
sponsors = pd.read_csv('../../Data/Bills_Data/bill_sponsors_2017_2018.csv.zip')


In [373]:
sponsors.head(1)

Unnamed: 0,id,name,entity_type,organization_id,person_id,bill_id,primary,classification
0,91df26f8-d739-4e27-8a55-5aa541cdab95,Olson,person,,,ocd-bill/6db0cabc-e1ad-4257-8d64-8364ad37733f,False,cosponsor


# Clean up sponsorship data a little

In [374]:
#We only want primary sponsors that are people (rather than organizations)
sponsors = sponsors[(sponsors['primary']) & (sponsors['entity_type'] == 'person')].copy()

In [375]:
#We do not need the following columns, drop them
sponsors.drop(columns = ['id', 'entity_type', 'organization_id', 'primary'], inplace = True)

In [377]:
sponsors.head()

Unnamed: 0,name,person_id,bill_id,classification
1,EDGMON,,ocd-bill/64283615-6347-4dbb-a5ea-b9243f752e17,primary
4,TUCK,,ocd-bill/3c972d33-ba3a-4f49-bf7c-5ae69448c9f0,primary
8,JOHNSTON,,ocd-bill/3c60ed80-6c5b-4d95-b6d0-e512b8a3af8d,primary
10,LEDOUX,,ocd-bill/1b2853b2-7d19-4ff6-8a2c-73dac05e70a6,primary
11,WILSON,,ocd-bill/2ff84009-11b2-4e57-9fe6-b0564116c4e8,primary


In [378]:
sponsors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 486756 entries, 1 to 1067550
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   name            486756 non-null  object
 1   person_id       278989 non-null  object
 2   bill_id         486756 non-null  object
 3   classification  486756 non-null  object
dtypes: object(4)
memory usage: 18.6+ MB


Many values for the sponsor id are empty. However, some names appear multiple times, sometimes with an id and sometimes without one, so I will try to impute some of the missing ids.

In [379]:
#Lowercase all names and get rid of spaces
sponsors['name'] = sponsors['name'].map(lambda x: x.lower().replace(' ',''))

The next block of code loops through every unique name. If every row for that name has an id, we leave it as is. If no row for that name has an id, there is no way for us to find one. If some rows have an id and some do not, we fill the empty values with the id from the other rows, as long as only one id is associated with the name. If there are multiple ids for one name, we leave the values empty.

This imputation technique can introduce errors. For example, suppose there are two legislators named John Doe, but only one has an id associated with him. The code would assign that one id to both, so we would see a single John Doe when in fact there are two. I don't think this is likely to be a significant problem, but it is good to keep in mind.

In [380]:
length = len(sponsors['name'].unique()) #number of unique names, used to show how much of the loop is left

ids_filled = [] #where we will keep track of imputed ids

#Loop through every unique name
for i,name in enumerate(sponsors['name'].unique()):
    print(f'{round((i+1)/ length * 100,2)}%', end = '\r') #Lets us know how far into the runtime we are
    
    #ids for the specific name, nulls and non-nulls included
    names_ids = sponsors[sponsors['name'] == name]['person_id']
    
    if len(names_ids.dropna().unique()) == 1: #if there is only one id for that name, it is probably the same person
        names_ids = names_ids.fillna(names_ids.dropna().unique()[0]) #fill nulls with that id
    
        
    #Add this name's ids to the overall list
    ids_filled = ids_filled + list(names_ids)

100.0%

In [381]:
#Do the lengths match?
print(len(ids_filled))
print(len(sponsors['name']))

486756
486756


In [382]:
#Make a new column for imputed ids
sponsors['imputed_ids'] = ids_filled

In [383]:
sponsors.isnull().sum()

name                   0
person_id         207767
bill_id                0
classification         0
imputed_ids       173920
dtype: int64

The imputation filled 33,847 nulls (from 207,767 down to 173,920).
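The same rule can be sketched on a toy frame. This is an illustration using `groupby(...).transform(...)`, not the loop used in this notebook; the function name `impute_ids_by_name` and the toy data are made up for the example. One property of this form is that the result stays aligned with the original row index.

```python
import pandas as pd

def impute_ids_by_name(frame, name_col="name", id_col="person_id"):
    """Fill missing ids within a name group only if that group has exactly one distinct id."""
    def fill_group(ids):
        known = ids.dropna().unique()
        # Exactly one known id: assume every row with this name is that person.
        return ids.fillna(known[0]) if len(known) == 1 else ids
    # transform returns a Series aligned to the frame's original index.
    return frame.groupby(name_col)[id_col].transform(fill_group)

# Hypothetical data: "lee" has no id at all, "doe" has two conflicting ids.
toy = pd.DataFrame({
    "name": ["smith", "jones", "smith", "lee", "jones", "doe", "doe", "doe"],
    "person_id": ["p1", None, None, None, "p2", "p3", "p4", None],
})
toy["imputed_ids"] = impute_ids_by_name(toy)
print(toy)
```

In this toy frame only the "smith" and "jones" gaps are filled; "lee" and "doe" keep their missing values, matching the rule described above.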

# Collect Legislator Data

In [387]:
length = len(sponsors['person_id'].unique())
print(f'There are {length} unique legislators in our data.')

There are 4218 unique legislators in our data.


In [388]:
#Saving their ids
legislature_ids = list(sponsors['person_id'].unique())

Because there is a limit of 250 requests per day, we cannot retrieve every legislator at once. This code retrieves our first block of legislator data.

In [394]:
legislatures = [] #we will keep the data in this list

for i in range(1,2501,10): #i is the starting index of each batch of legislator ids (index 0 of unique() is NaN here, so starting at 1 skips it)
    print(f'Collecting {i} through {i+10}', end = '\r')
    sleep(6) #We can only make 10 requests per minute, so we wait 6 seconds between requests
    
    #take only ten ids at a time because that is all the api function will give us at once
    legislature_ids = list(sponsors['person_id'].unique()[i:i+10])
    
    #Get the legislators' data from Open States
    new_legislatures = ostate.search_legislators(id_ = legislature_ids)
    
    legislatures += new_legislatures #add our data to the legislatures list

Collecting 2491 through 2501
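The batching pattern used here (ten ids per request, a pause between requests, a daily cap) can be factored into a small helper. This is a minimal sketch, not part of pyopenstates: `fetch_in_batches` and `fake_fetch` are hypothetical names, and the real API call would be passed in as the `fetch` argument.

```python
import time
from typing import Callable, Iterable, List

def fetch_in_batches(ids: Iterable[str],
                     fetch: Callable[[List[str]], list],
                     batch_size: int = 10,
                     pause_seconds: float = 6.0,
                     max_requests: int = 250) -> list:
    """Call `fetch` on consecutive batches of ids, pausing between requests."""
    ids = list(ids)
    results = []
    for request_number, start in enumerate(range(0, len(ids), batch_size)):
        if request_number >= max_requests:
            break  # daily quota reached; the remaining ids wait for another day
        results += fetch(ids[start:start + batch_size])
        time.sleep(pause_seconds)
    return results

# Stand-in for the real call, e.g. lambda batch: ostate.search_legislators(id_=batch)
fake_fetch = lambda batch: [{"id": i} for i in batch]
records = fetch_in_batches([f"id-{n}" for n in range(25)], fake_fetch, pause_seconds=0)
print(len(records))
```

Passing the fetch function as an argument keeps the batching logic testable without spending any of the daily request quota.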

In [396]:
#Saving first round of data to dataframe
leg_df = pd.DataFrame(legislatures)

Round 2 of data collection should retrieve the remaining legislators.

In [400]:
legislatures = [] #we will keep the data in this list

for i in range(2501,4221,10): #i is the starting index of each batch of legislator ids
    print(f'Collecting {i} through {i+10}', end = '\r')
    sleep(6) #We can only make 10 requests per minute, so we wait 6 seconds between requests
    
    #take only ten ids at a time because that is all the api function will give us at once
    legislature_ids = list(sponsors['person_id'].unique()[i:i+10])
    
    #Get the legislators' data from Open States
    new_legislatures = ostate.search_legislators(id_ = legislature_ids)
    
    legislatures += new_legislatures #add our data to the legislatures list

Collecting 4211 through 4221

In [406]:
leg_df = pd.concat([leg_df, pd.DataFrame(legislatures)])

In [407]:
leg_df.shape

(4151, 16)

The two loops requested 4,217 distinct ids (the 4,218 unique values counted earlier include NaN), but we retrieved only 4,151 legislators. It seems we are missing about 66 of them.

In [412]:
missing_ids = [] #store the ids that we did not retrieve data for

for ids in list(sponsors['person_id'].dropna().unique()): #go through all the legislator ids from the sponsorship data
    if ids not in list(leg_df['id']): #check whether the api returned data for it
        missing_ids.append(ids) #if not, append to missing ids list
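When counting ids it helps to remember that `unique()` keeps NaN as a value while `nunique()` ignores it. A small sketch on made-up ids (the names `requested` and `retrieved` are hypothetical) shows the difference, along with a set difference as a compact way to list ids that were never returned:

```python
import numpy as np
import pandas as pd

requested = pd.Series(["a", np.nan, "b", "a", "c", np.nan])  # made-up sponsor ids
retrieved = pd.DataFrame({"id": ["a", "c"]})                  # made-up API results

n_with_nan = len(requested.unique())   # NaN counts as one of the unique values
n_without_nan = requested.nunique()    # NaN is ignored

# ids we asked for that never came back
missing = sorted(set(requested.dropna()) - set(retrieved["id"]))
print(n_with_nan, n_without_nan, missing)
```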

Round 3: can we get data for the ids that did not work the first time through?

In [414]:
ostate.search_legislators(id_ = missing_ids) 

[]

It seems like these ids don't exist in the Open States API.

In [427]:
leg_df 

Unnamed: 0,id,name,party
0,ocd-person/a08f605d-221f-4d26-8e13-4bcfa964cd8e,Becky Nordgren,Republican
1,ocd-person/6144ffa0-24f3-4594-87ab-938fd8fb1d68,Ed Henry,Republican
2,ocd-person/0d7f5989-df17-4c90-8267-4914f5598712,Jim Carns,Republican
3,ocd-person/cbef3e56-a05d-4dca-8b0c-200e4bff9ad5,Kyle South,Republican
4,ocd-person/378065a9-1a28-4511-8b40-cdefe30a66a7,Lee Pittman,Republican
...,...,...,...
4146,ocd-person/eb9af80b-b2e8-4b32-9598-b9e12bb470ef,Clark Stith,Republican
4147,ocd-person/b9d3820d-6368-41c3-8730-b287ff78f49e,Dan Furphy,Republican
4148,ocd-person/f6e46a8f-6a6f-4c91-9ee8-d0c55e826c48,Danny Eyre,Republican
4149,ocd-person/e8bbc8c8-4c0d-4567-bb4c-e3ac635ba178,Evan Simpson,Republican


# Merging

We now have all the data we need to merge the sponsorship data with the party of each sponsor. The end product of this section will be a dataframe with one row per bill, containing the bill id and columns showing the primary party behind the bill.

In [432]:
#Data frame of the legislators with only the needed columns
leg_df = leg_df[['id','name','party']].copy()
leg_df.reset_index(inplace = True, drop = True)

In [434]:
#Merge the data frames on the legislator id: once on the original ids and once on the imputed ids that we made
df = leg_df.merge(sponsors, how = 'inner', left_on = 'id', right_on = 'person_id')
df_imputed = leg_df.merge(sponsors, how = 'inner', left_on = 'id', right_on = 'imputed_ids')

In [435]:
length = len(df.bill_id.unique())
length_imp = len(df_imputed.bill_id.unique())
print('Normal:')
print(f'There are only {length} out of {df.shape[0]}')
print('\nImputed:')
print(f'There are only {length_imp} out of {df_imputed.shape[0]}')

Normal:
There are only 98169 out of 277737

Imputed:
There are only 96707 out of 311468


This is a confusing result. The dataframe merged on the non-imputed ids (which have more missing values) contains more unique bills than the dataframe merged on the imputed ids. Something seems to have gone wrong in the imputation code, so I will table the imputation effort. We will move forward with the normal df.
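One possible cause worth checking: the imputation loop collects ids one name at a time, so `ids_filled` follows name-group order rather than the original row order, and assigning a plain list to a column is positional. The toy example below (made-up names and ids) reproduces that pattern and counts rows whose original id no longer matches the imputed one. If the alignment were right, that count would be zero.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["smith", "jones", "smith", "jones"],
    "person_id": ["p1", "p2", None, None],
})

# Same pattern as the imputation loop: collect ids one name at a time.
ids_filled = []
for name in toy["name"].unique():
    ids = toy.loc[toy["name"] == name, "person_id"]
    known = ids.dropna().unique()
    if len(known) == 1:
        ids = ids.fillna(known[0])
    ids_filled += list(ids)

toy["imputed_ids"] = ids_filled  # positional assignment, in name-group order

# Rows that already had an id should keep it if rows and ids line up.
had_id = toy["person_id"].notna()
mismatches = (toy.loc[had_id, "person_id"] != toy.loc[had_id, "imputed_ids"]).sum()
print(ids_filled, mismatches)
```

Running the same comparison on `sponsors['person_id']` and `sponsors['imputed_ids']` would show whether this is what happened in the real data.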

In [437]:
#save the unique bills to an array
unique_bills = df.bill_id.unique()

The following code builds a list recording whether each bill was sponsored by Democrats, Republicans, or neither. This is necessary because many bills have multiple sponsors.

If more Republicans than Democrats sponsored a bill, it is a Republican bill. If more Democrats than Republicans sponsored it, it is a Democratic bill. If the bill was sponsored only by independents, or by an equal number of Republicans and Democrats, it is neither Republican nor Democratic.

In [440]:
length = len(unique_bills) #used for the progress indicator in the loop
sponsor_party_list = [] #where we will store the party sponsor

for i, bill_id in enumerate(unique_bills): #Loop through every unique bill
    
    print(f'{round(((i+1)/ length) * 100,2)}%', end = '\r') #Lets us know how far into the runtime we are
    
    parties = df[df.bill_id == bill_id]['party'] #Parties of the people sponsoring the bill

    n_dems = list(parties).count('Democratic') #Count Democrats sponsoring bill
    n_reps = list(parties).count('Republican') #Count Republicans sponsoring bill

    if n_dems > n_reps: #If there are more Democrats, it is a Democratic bill
        sponsor_party_list.append('Dem')
    elif n_dems < n_reps: #If there are more Republicans, it is a Republican bill
        sponsor_party_list.append('Rep')
    else:
        sponsor_party_list.append('Neither')

100.0%

In [441]:
#Checking that they are the same length
print(len(unique_bills))
print(len(sponsor_party_list))

98169
98169


In [442]:
bills_sponsor_party = pd.DataFrame() #Dataframe to hold our final product

bills_sponsor_party['bill_id'] = unique_bills #column for the bill ids
bills_sponsor_party['majority_sponsor_party'] = sponsor_party_list #column for the majority sponsor party calculated above
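The majority rule can also be expressed without looping over bills one at a time. The sketch below uses `pd.crosstab` on a toy frame with the same `bill_id` and `party` columns as `df`; the toy rows are made up, and this is an alternative illustration rather than the code that produced the results above.

```python
import pandas as pd

toy = pd.DataFrame({
    "bill_id": ["b1", "b1", "b1", "b2", "b3", "b3", "b4"],
    "party":   ["Democratic", "Democratic", "Republican",
                "Republican", "Democratic", "Republican", "Independent"],
})

# One row per bill, one column per party, cells hold sponsor counts.
counts = pd.crosstab(toy["bill_id"], toy["party"])
n_dems = counts.get("Democratic", 0)  # falls back to 0 if the party never appears
n_reps = counts.get("Republican", 0)

# Start everything at 'Neither', then overwrite where one party has more sponsors.
majority = pd.Series("Neither", index=counts.index)
majority[n_dems > n_reps] = "Dem"
majority[n_reps > n_dems] = "Rep"
print(majority.to_dict())
```

Counting all bills in one pass avoids filtering the full dataframe once per bill, which matters with roughly 98,000 unique bills.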

A second labeling scheme, similar to the one above, adds some groups. If a bill is sponsored by both Republicans and Democrats, it is a bipartisan bill. If it is sponsored only by Democrats, it is a Democratic bill, and likewise for Republicans. If it fits none of these, for example a bill sponsored by an independent, it is labeled as other.
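The finer-grained scheme described above could be sketched as follows. This is an assumption-laden illustration: the function name `label_bill`, the label strings, and the toy data are made up, and "only Democrats" is read strictly, so a bill with a Democrat and an independent sponsor falls under "Other".

```python
import pandas as pd

def label_bill(parties):
    """Label one bill from the parties of its sponsors."""
    present = set(parties)
    if {"Democratic", "Republican"} <= present:
        return "Bipartisan"   # at least one sponsor from each major party
    if present == {"Democratic"}:
        return "Dem"          # every sponsor is a Democrat
    if present == {"Republican"}:
        return "Rep"          # every sponsor is a Republican
    return "Other"            # anything else, e.g. an independent sponsor

toy = pd.DataFrame({
    "bill_id": ["b1", "b1", "b1", "b2", "b3", "b3", "b4", "b5", "b5"],
    "party":   ["Democratic", "Democratic", "Republican", "Republican",
                "Democratic", "Democratic", "Independent",
                "Democratic", "Independent"],
})

# One label per bill, computed from that bill's sponsor parties.
labels = toy.groupby("bill_id")["party"].agg(label_bill)
print(labels.to_dict())
```

The resulting Series is indexed by `bill_id`, so it could be joined onto `bills_sponsor_party` as an additional column.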

# Saving the data

In [465]:
bills_sponsor_party.to_csv('../../Data/Bills_Data/bills_sponsor_party.csv.zip', index = False)