In [1]:
import pandas as pd
import zipfile
import os
import shutil

# About original data

The bills data comes from the Open States [website](https://openstates.org/data/session-csv/). Open States is an organization that aggregates, standardizes, and cleans legislative data for all 50 states. The data used for this study comes from their bulk downloads of bills proposed in each state's legislature. It is stored as one zip file per legislative session per state. In this codebook, we aggregate every state's bills, sponsorship, and abstract data from the zip files and save each table as a CSV file. Open States scrapes its data directly from government websites and appears to be quite reliable.

# Read in Data for Every State

Nebraska is excluded because its state legislature is unique (it is unicameral).

In [2]:
#List of states, will help us when retrieving data from zipfiles
states = [ 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
           'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
           'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NH', 'NJ', 'NM',
           'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
           'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']

The next code block does the following:

    1. Instantiate DataFrames that will hold our bills, sponsorships, and abstracts data.
    2. Loop through each state to extract the wanted data:
       a. Unzip every .zip file for that state
       b. Read the wanted data from the CSVs into the DataFrames
       c. Delete the extracted files after reading everything in, so they do not take up disk space.
       
    Result: We should end up with bills, sponsorships, and abstracts data for every state from the 2017-2018 sessions.
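
As an alternative to the unzip, read, delete cycle in steps a to c, the CSVs can be read straight out of each archive. The following is a minimal sketch, not the code used in this notebook: the helper name `read_csvs_from_zips` and its arguments are hypothetical. It also collects the frames in a list and concatenates once at the end, which avoids copying a growing DataFrame on every file.

```python
import os
import zipfile

import pandas as pd


def read_csvs_from_zips(folder, suffix):
    """Read every CSV whose name ends with `suffix` from all zip files in `folder`.

    Hypothetical helper: nothing is extracted to disk, so no cleanup is needed.
    """
    frames = []
    for file_name in sorted(os.listdir(folder)):
        if not file_name.endswith('.zip'):
            continue  # Skip anything that is not a zip archive
        with zipfile.ZipFile(os.path.join(folder, file_name)) as archive:
            for member in archive.namelist():
                if member.endswith(suffix):
                    # Open the member as a file-like object and parse it directly
                    with archive.open(member) as handle:
                        frames.append(pd.read_csv(handle))
    if not frames:
        return pd.DataFrame()  # No matching CSV found
    # One concat at the end; ignore_index gives a clean 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Under these assumptions, a call such as `read_csvs_from_zips('../../Data/Bills_Data/AK/', 'bill_sponsorships.csv')` would return one DataFrame with the sponsorship rows of every Alaska session archive.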

In [3]:
#1.
bills = pd.DataFrame() #instantiate dataframe to store the bills data for each state
sponsors = pd.DataFrame() #instantiate dataframe to store the sponsors data for each bill
abstracts = pd.DataFrame() #Instantiate abstracts dataframe

#2.
#Loop through each state as we have data for each state in separate folders in the Data/Bills_Data folder
for state in states:
    print(f'Downloading {state} data') #Message showing which state we are reading in (the zip files are already on disk)
    
    state = state + '/' #End of path for location of the zip file (different for each state)
    path = '../../Data/Bills_Data/' + state #The whole path to the zipfiles for that particular state
    
    
    
    #2.a
    zip_end = '.zip' #We only want to extract files from zip files
    
    for file in os.listdir(path): #Go through every file in the state folder
        if file.endswith(zip_end): #If file is a zip file, do the next lines of code
            file_path = path+file #Path to zip file
            with zipfile.ZipFile(file_path) as zip_object: #Open the zip file; it is closed automatically afterwards
                zip_object.extractall('../../Data/Bills_Data/') #Extract relative to Bills_Data (the cleanup in 2.c assumes the files land in the state folder)
          
        
      
    #2.b
    #We now have extracted many files from the zip file
    #We only need information from the files ending in 'bills.csv', 'bill_sponsorships.csv', and 'abstracts.csv'.
    bills_end = 'bills.csv' #Ending used for bills info
    sponsors_end = 'bill_sponsorships.csv' #Ending used for sponsors data
    abstract_end = 'abstracts.csv' #Ending used for abstract data
    
    #Go through each root, directory, and file from this path(mainly just want files extracted from zip)
    for origin,sub,files in os.walk(path): 
        for file in files: #Go through each file found in the os.walk
            if file.endswith(bills_end): #If file is the 'blabla...bills.csv', read in that data (note: any other name ending in 'bills.csv' matches too)
                new_data = pd.read_csv(origin+'/'+file) #Path to file
                bills = pd.concat([bills,new_data]) #Append new data read in to our bill dataframe instantiated above
            elif file.endswith(sponsors_end): #If file is the 'blabla...bill_sponsorships.csv', read in that data
                new_data = pd.read_csv(origin+'/'+file) #Path to file
                sponsors = pd.concat([sponsors,new_data]) #Append new data read in to our sponsor dataframe instantiated above
            elif file.endswith(abstract_end): #If file is the 'blabla...abstracts.csv', read in that data
                new_data = pd.read_csv(origin+'/'+file) #Path to file
                abstracts = pd.concat([abstracts,new_data]) #Append new data read in to our abstracts dataframe instantiated above
      
    #2.c
    #We have the data we want from this state. No need to keep the extracted files taking up disk space.
    #Next few lines delete the folders extracted from the zip files
    for sub in os.listdir(path): #Go through everything directly inside the state folder
        sub_path = os.path.join(path, sub) #Full path to that entry
        #Only remove folders; '.ipynb_checkpoints' sometimes shows up, choosing to ignore it
        if os.path.isdir(sub_path) and not sub.endswith('.ipynb_checkpoints'):
            shutil.rmtree(sub_path) #Remove that folder and everything in it, we still have the zip file though

Downloading AK data
Downloading AL data
Downloading AR data
Downloading AZ data
Downloading CA data
Downloading CO data
Downloading CT data
Downloading DC data
Downloading DE data
Downloading FL data
Downloading GA data
Downloading HI data
Downloading IA data
Downloading ID data
Downloading IL data
Downloading IN data
Downloading KS data
Downloading KY data
Downloading LA data
Downloading MA data
Downloading MD data
Downloading ME data
Downloading MI data
Downloading MN data
Downloading MO data
Downloading MS data
Downloading MT data
Downloading NC data
Downloading ND data
Downloading NH data
Downloading NJ data
Downloading NM data
Downloading NV data
Downloading NY data
Downloading OH data
Downloading OK data
Downloading OR data
Downloading PA data
Downloading RI data
Downloading SC data
Downloading SD data
Downloading TN data
Downloading TX data
Downloading UT data
Downloading VA data
Downloading VT data
Downloading WA data
Downloading WI data
Downloading WV data
Downloading WY data
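
One thing to watch in step 2.b: `str.endswith('bills.csv')` is true for any file name with that suffix, not only for the main bills table. The sketch below uses made-up file names (the real names inside the Open States archives may differ) to show the effect, along with one stricter check. The helper `is_main_bills_file` is hypothetical and assumes that related-bill files end in `related_bills.csv`.

```python
def is_main_bills_file(file_name):
    """Return True for a main bills table and False for a related-bills table.

    Hypothetical helper; assumes related-bill files end in 'related_bills.csv'.
    """
    return (file_name.endswith('bills.csv')
            and not file_name.endswith('related_bills.csv'))


# Made-up file names, for illustration only
examples = [
    'AK_30_bills.csv',
    'AK_30_bill_related_bills.csv',
    'AK_30_bill_sponsorships.csv',
]

# The plain suffix check keeps both the bills and the related-bills file
loose = [name for name in examples if name.endswith('bills.csv')]

# The stricter check keeps only the main bills file
strict = [name for name in examples if is_main_bills_file(name)]

print(loose)
print(strict)
```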


# Data Inspection

In [4]:
bills.head()

Unnamed: 0,id,identifier,title,classification,subject,session_identifier,jurisdiction,organization_classification,bill_id,related_bill_id,legislative_session,relation_type
0,ocd-bill/f1741c6f-c9fc-4811-8a5f-aca07d1ae90c,SB 53,An Act relating to insurance coverage for cont...,['bill'],"['BOARDS & COMMISSIONS', 'DRUGS', 'HEALTH & SO...",30,Alaska,upper,,,,
1,ocd-bill/fc02f0e2-a789-48e2-bb71-6839db4af4a1,SB 33,An Act naming the state ferries built in Ketch...,['bill'],"['BOATS & BOATING', 'MARINE HIGHWAY', 'TRANSPO...",30,Alaska,upper,,,,
2,ocd-bill/995b8a0b-41dd-4918-a1ae-ab32c7b41070,HB 141,An Act relating to allocations of funding for ...,['bill'],"['BUSINESS', 'EDUCATION', 'EMPLOYMENT', 'LABOR...",30,Alaska,lower,,,,
3,ocd-bill/c1da3d60-e3e7-4c3a-a244-31b137690e2c,SB 103,An Act establishing the Alaska education innov...,['bill'],"['BOARDS & COMMISSIONS', 'COMMUNITY COLLEGES',...",30,Alaska,upper,,,,
4,ocd-bill/08bfc0da-90ef-47c0-872b-88701a4f8eaa,HB 77,An Act making corrective amendments to the Ala...,['bill'],"['AIRPORTS', 'APPROPRIATIONS', 'AVIATION', 'BO...",30,Alaska,lower,,,,


In [5]:
sponsors.head()

Unnamed: 0,id,name,entity_type,organization_id,person_id,bill_id,primary,classification
0,91df26f8-d739-4e27-8a55-5aa541cdab95,Olson,person,,,ocd-bill/6db0cabc-e1ad-4257-8d64-8364ad37733f,False,cosponsor
1,f3055917-0abb-4f0c-a890-03b77a085645,EDGMON,person,,,ocd-bill/64283615-6347-4dbb-a5ea-b9243f752e17,True,primary
2,5253bf4a-b11d-4d42-89ef-d90525b52d62,Kopp,person,,ocd-person/7474d172-6c90-47f9-aae1-9f66c9518be2,ocd-bill/8c419c90-85c0-4a08-8c37-182e5c5f8185,False,cosponsor
3,6ba8a513-acfa-4b92-97f2-1f7530a7fae8,Fansler,person,,ocd-person/81dcd595-b1c2-4bf9-80a3-5cd91b56d9b2,ocd-bill/e877320d-2def-458f-91d6-12f050f88a98,False,cosponsor
4,b3e53391-4652-4e7e-96fb-ae4fd4520e63,TUCK,person,,,ocd-bill/3c972d33-ba3a-4f49-bf7c-5ae69448c9f0,True,primary


In [6]:
abstracts.head()

Unnamed: 0,id,bill_id,abstract,note
0,ebe2c219-2c22-48c9-92f4-698cfbf0bf5e,ocd-bill/f62aa35e-8c2e-4230-8f95-9330a2b7a45c,"This measure would recognize June 12, 2017, as...",summary
1,481b9e32-6458-45fa-b0f9-063f648ae0bb,ocd-bill/3b4c223e-8ddc-4568-ade2-3d0a1e7847d2,Existing law requires certain elected officers...,summary
2,c635cdd2-76b7-45a8-bfa6-c8f46e64297e,ocd-bill/59c27e21-419f-4597-8037-923a9499589c,"Existing law, the Mental Health Services Act (...",summary
3,acfb1b86-782a-4739-aeba-ba920995e661,ocd-bill/9e9808aa-b3a3-4db7-861f-968ad9f0f289,Existing law authorizes the Director of the Ca...,summary
4,efa68b0b-c569-4cee-83f8-a4336b99dd4d,ocd-bill/fb033877-3ac4-4334-a776-280fef02b5d6,Existing law establishes the State Public Work...,summary


In [63]:
bills.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 256718 entries, 0 to 329
Data columns (total 12 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           256718 non-null  object 
 1   identifier                   256718 non-null  object 
 2   title                        227105 non-null  object 
 3   classification               227105 non-null  object 
 4   subject                      227105 non-null  object 
 5   session_identifier           227105 non-null  object 
 6   jurisdiction                 227105 non-null  object 
 7   organization_classification  227105 non-null  object 
 8   bill_id                      29613 non-null   object 
 9   related_bill_id              0 non-null       float64
 10  legislative_session          29613 non-null   object 
 11  relation_type                29613 non-null   object 
dtypes: float64(1), object(11)
memory usage: 25.5+ MB


In [64]:
sponsors.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1067553 entries, 0 to 1519
Data columns (total 8 columns):
 #   Column           Non-Null Count    Dtype  
---  ------           --------------    -----  
 0   id               1067553 non-null  object 
 1   name             1067553 non-null  object 
 2   entity_type      1067553 non-null  object 
 3   organization_id  0 non-null        float64
 4   person_id        682714 non-null   object 
 5   bill_id          1067553 non-null  object 
 6   primary          1067553 non-null  bool   
 7   classification   1067553 non-null  object 
dtypes: bool(1), float64(1), object(6)
memory usage: 66.2+ MB


In [7]:
abstracts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 83415 entries, 0 to 196
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        83415 non-null  object
 1   bill_id   83415 non-null  object
 2   abstract  83415 non-null  object
 3   note      36683 non-null  object
dtypes: object(4)
memory usage: 3.2+ MB


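There are some null values, but overall I am happy with this data extraction. We have 227,105 usable bills (rows with a title) and over a million sponsorship records that we will hopefully be able to use. The other 29,613 rows in `bills` have only `bill_id`, `legislative_session`, and `relation_type` filled in beyond `id` and `identifier`, which suggests they come from related-bill files whose names also end in 'bills.csv'. Note also that only 682,714 of the 1,067,553 sponsorship rows carry a `person_id`, and only 36,683 of the 83,415 abstracts have a `note`.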
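
The `bills.info()` output shows two groups of rows: 227,105 with a `title`, and 29,613 where `bill_id`, `legislative_session`, and `relation_type` are filled in instead (227,105 + 29,613 = 256,718). A minimal sketch of separating the two groups, using a tiny made-up frame in place of `bills`; the names `main_bills` and `related_rows` are hypothetical:

```python
import pandas as pd

# Tiny stand-in for `bills`, with made-up values
toy = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'title': ['An Act ...', None, 'An Act ...'],
    'relation_type': [None, 'companion', None],
})

has_title = toy['title'].notna()  # True for rows that look like real bills

# Keep each group separately and drop the columns that are empty within it
main_bills = toy[has_title].dropna(axis='columns', how='all').reset_index(drop=True)
related_rows = toy[~has_title].dropna(axis='columns', how='all').reset_index(drop=True)

print(main_bills)
print(related_rows)
```

Applied to the real data, the same mask would be `bills['title'].notna()`.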

# Saving Data

In [9]:
bills.to_csv('../../Data/Bills_Data/state_bills_2017_2018.csv.zip', index = False)
sponsors.to_csv('../../Data/Bills_Data/bill_sponsors_2017_2018.csv.zip', index = False)
abstracts.to_csv('../../Data/Bills_Data/bill_abstracts_2017_2018.csv.zip', index = False)
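
Because the target names end in `.zip`, pandas infers zip compression from the extension when writing. The round-trip sketch below makes the compression explicit and reads the file back as a quick check. It uses a made-up frame and a temporary directory, and the file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for one of the saved tables
toy = pd.DataFrame({'id': ['a', 'b'], 'primary': [True, False]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy.csv.zip')
    # archive_name sets the name of the CSV stored inside the zip file
    toy.to_csv(out_path, index=False,
               compression={'method': 'zip', 'archive_name': 'toy.csv'})
    restored = pd.read_csv(out_path)  # Compression is inferred from '.zip'

print(restored)
```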