In [51]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Data Cleaning

All of the datasets are now merged. Merging produced two datasets: one with almost 160,000 bills and another with around 85,000 bills. The smaller dataset is a subset of the larger one but carries more information: it includes a column recording which party sponsored the bill. Because we have two datasets, we will clean them separately, following a very similar process for each.

First, we will drop some columns, either because they have too many missing values or because they are not useful for this particular study. Second, we will clean up some of the remaining columns. For example, the subject column holds a list of the subjects a bill relates to; a plain string is more useful than a list, because we can then apply a count vectorizer to it in the modeling phase. Lastly, we will drop a large share of the rows, because we are only interested in types of legislation that can in fact become law. The statute column will help with this step.

# Read in Data

And take a look at it.

In [3]:
df = pd.read_csv('../Data/Merged_Data/merged_data.csv.zip')
df.drop(columns = ['Unnamed: 0'], inplace = True)

df_with_sponsors = pd.read_csv('../Data/Merged_Data/merged_data_with_sponsorship.csv.zip')
df_with_sponsors.drop(columns = 'Unnamed: 0', inplace = True)

#codebook = pd.read_csv('../Data/Legislative_Progression/codebook.csv.zip')
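
The `Unnamed: 0` column dropped above is typically what a saved DataFrame index looks like when the file is read back. As a minimal sketch, using a small in-memory CSV made up for illustration rather than the project files, the same column could instead be read directly as the index:

```python
import io

import pandas as pd

# Toy CSV standing in for the merged files (contents are made up).
# The empty first header is what pandas writes when the index is saved.
csv_text = ",id_x,title\n0,ocd-bill/a,First bill\n1,ocd-bill/b,Second bill\n"

# Default read: the saved index comes back as a column named 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 consumes that column as the index, so no drop is needed.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Either approach leaves the same data columns; dropping the column explicitly, as the notebook does, has the advantage of keeping a fresh `RangeIndex`.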



In [161]:
df_with_sponsors.head(1)

Unnamed: 0,id_x,identifier,title,classification,subject,session_identifier,jurisdiction,organization_classification,bill_id_x,related_bill_id,...,house_dem,house_rep,gov_party,year,id_y,bill_id_x.1,abstract,note,bill_id_y.1,majority_sponsor_party
0,ocd-bill/1c7dc860-88ed-4c66-ac15-07d506b73921,HR 66,"Stroup, William Earl, death mourned",['resolution'],"['Resolutions, Condolence']",2018rs,al,lower,,,...,33,70,Rep,2018,,,,,ocd-bill/1c7dc860-88ed-4c66-ac15-07d506b73921,Dem


In [162]:
df.head(1)

Unnamed: 0,id_x,identifier,title,classification,subject,session_identifier,jurisdiction,organization_classification,bill_id_x,related_bill_id,...,senate_dem,senate_rep,house_dem,house_rep,gov_party,year,id_y,bill_id,abstract,note
0,ocd-bill/f1741c6f-c9fc-4811-8a5f-aca07d1ae90c,SB 53,An Act relating to insurance coverage for cont...,['bill'],"['BOARDS & COMMISSIONS', 'DRUGS', 'HEALTH & SO...",30,ak,upper,,,...,6,14,17,21,Ind,2017,,,,


In [163]:
#created by the authors of the legislative progression dataset
#Helpful guide to understand some columns
codebook

Unnamed: 0,variable name,Description,type
0,ajo_id,Unique identifier,string
1,state,State (abbreviation),string
2,fips,State FIPS code (numeric),float
3,year2,"Biennium (first year of odd-even sessions, sec...",float
4,year1,Year of activity,float
5,sess_str,Session name (string),string
6,special,Special Session?,string
7,sessrank,Session number within biennium,string
8,bill_id,Bill id,string
9,bill_pre,Bill prefix,string


In [164]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159986 entries, 0 to 159985
Data columns (total 65 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id_x                         159986 non-null  object 
 1   identifier                   159986 non-null  object 
 2   title                        159986 non-null  object 
 3   classification               159986 non-null  object 
 4   subject                      159986 non-null  object 
 5   session_identifier           159986 non-null  object 
 6   jurisdiction                 159986 non-null  object 
 7   organization_classification  159986 non-null  object 
 8   bill_id_x                    0 non-null       float64
 9   related_bill_id              0 non-null       float64
 10  legislative_session          0 non-null       float64
 11  relation_type                0 non-null       float64
 12  custom_id                    159986 non-null  object 
 13 

# Deciding which columns to keep

Below is the list of features we will keep for this study. Of the 65 columns in the larger dataset, only 19 will be kept; the smaller dataset keeps the same 19 plus `majority_sponsor_party`. These columns hold all the information we need.

|Feature|Type|Original Source|Description|
|---|---|---|---|
|**id_x**|*object*|Open States|Bill ID used on Open States website|
|**classification**|*object*|Open States|Type of legislation. e.g. "bill" or "resolution"|
|**title**|*object*|Open States|Title of bill|
|**subject**|*object*|Open States|Main subjects bill refers to.|
|**abstract**|*object*|Open States|Abstract of bill.|
|**jurisdiction**|*object*|Open States|State the bill was introduced in.|
|**organization_classification**|*object*|Open States|Which house the bill was introduced in.|
|**year1**|*int*|Garlick|Year the bill was introduced.|
|**bill_pre**|*object*|Garlick|Prefix of the bill.|
|**statute**|*int*|Garlick|Whether the bill is a statute or not.|
|**lpcode**|*float*|Garlick|Code that correlates to where the bill failed.|
|**dum_38_pass**|*int*|Garlick|Passed the first chamber or not.|
|**dum_68_pass**|*int*|Garlick|Passed the second chamber or not.|
|**dum_90_law**|*int*|Garlick|Enacted as law.|
|**senate_dem**|*int*|NCSL|Number of Democratic state senators.|
|**senate_rep**|*int*|NCSL|Number of Republican state senators.|
|**house_dem**|*int*|NCSL|Number of Democrats in the house.|
|**house_rep**|*int*|NCSL|Number of Republicans in the house.|
|**gov_party**|*object*|NCSL|Party of the state governor.|
|**majority_sponsor_party**|*object*|Open States|Party that introduced the bill. (only in our smaller dataset)|

Links to original datasets:
   - Open States: [website](https://openstates.org/data/session-csv/), [API](https://openstates.github.io/pyopenstates/)
   - [Garlick](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8PTHXT)
   - [NCSL](https://www.ncsl.org/about-state-legislatures/state-partisan-composition)

In [4]:
#Filtering the columns we want kept for bigger dataframe
df = df[['id_x', 'classification', 'title', 'subject', 'abstract','jurisdiction', 
    'organization_classification', 'year1', 'bill_pre', 'statute', 'lpcode', 
    'dum_38_pass', 'dum_68_pass', 'dum_90_law', 'senate_dem', 
    'senate_rep', 'house_dem', 'house_rep', 'gov_party']]

#Filtering the columns we want kept for smaller dataframe
df_with_sponsors = df_with_sponsors[['id_x', 'classification', 'title', 'subject', 'abstract', 'jurisdiction', 
    'organization_classification', 'year1', 'bill_pre', 'statute', 'lpcode', 
    'dum_38_pass', 'dum_68_pass', 'dum_90_law', 'senate_dem', 
    'senate_rep', 'house_dem', 'house_rep', 'gov_party', 'majority_sponsor_party']]
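
As a rough illustration of how the keep/drop decision could be checked, the sketch below computes the share of missing values per column on a toy frame and then selects a keep-list. The toy values and the variable names (`toy`, `keep`, `trimmed`) are made up for this example and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data (values are made up).
toy = pd.DataFrame({
    'id_x': ['a', 'b', 'c', 'd'],
    'title': ['t1', 't2', 't3', 't4'],
    'bill_id_x': [np.nan, np.nan, np.nan, np.nan],  # entirely empty
    'abstract': ['text', np.nan, 'text', 'text'],
})

# Share of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical keep-list; the last name is deliberately absent from the toy frame.
keep = ['id_x', 'title', 'abstract', 'majority_sponsor_party']
absent = [col for col in keep if col not in toy.columns]
present = [col for col in keep if col in toy.columns]

# .copy() returns an independent frame, so later column assignments
# do not raise SettingWithCopyWarning.
trimmed = toy[present].copy()
print(absent, list(trimmed.columns))
```

Checking for absent names first avoids the `KeyError` that a plain list selection raises when one requested column does not exist.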

# Cleaning Columns

### Classification

Making the column more readable.

In [5]:
#The brackets and extra quotations are not useful
df['classification'].value_counts()

['bill']                        124447
['resolution']                   23983
['joint resolution']              6842
['concurrent resolution']         3767
['bill', 'appropriation']          402
['memorial']                       270
['joint memorial']                 137
['commemoration']                   82
['concurrent memorial']             49
['concurrent study request']         4
['proclamation']                     2
['petition']                         1
Name: classification, dtype: int64

In [6]:
#Gets rid of unnecessary brackets and quotations to make the column more readable.
df['classification'] = df['classification'].map(lambda x: x.replace('[','').replace(']','').replace('\'','')) \
.map(lambda x: 'appropriation' if x == 'bill, appropriation' else x)
# ^ change bill, appropriation to appropriation to make it more readable


In [7]:
df['classification'].value_counts()

bill                        124447
resolution                   23983
joint resolution              6842
concurrent resolution         3767
appropriation                  402
memorial                       270
joint memorial                 137
commemoration                   82
concurrent memorial             49
concurrent study request         4
proclamation                     2
petition                         1
Name: classification, dtype: int64

In [8]:
#Do the same for our smaller dataset with the extra column
df_with_sponsors['classification'] = df_with_sponsors['classification'].map(lambda x: x.replace('[','').replace(']','').replace('\'','')) \
.map(lambda x: 'appropriation' if x == 'bill, appropriation' else x)
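
Since the same transformation is applied to both DataFrames, it could also be wrapped in a small helper. The sketch below is one possible version using pandas string methods; the function name `clean_classification` is hypothetical and the input values are a toy sample.

```python
import pandas as pd


def clean_classification(series: pd.Series) -> pd.Series:
    """Strip list punctuation and relabel 'bill, appropriation'."""
    # Remove square brackets and single quotes in one regex pass.
    cleaned = series.str.replace(r"[\[\]']", "", regex=True)
    # Series.replace with a dict matches whole values only.
    return cleaned.replace({'bill, appropriation': 'appropriation'})


toy = pd.Series(["['bill']", "['joint resolution']", "['bill', 'appropriation']"])
result = clean_classification(toy)
print(result.tolist())
```

Calling the helper once per DataFrame keeps the two cleaning steps identical by construction.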


### Subjects

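This column describes the overall subject of the bill. Unfortunately, many bills have no subjects. Where subjects are present, they are stored as the string representation of a list rather than as an actual list, so they cannot be iterated over directly. Instead, we remove the spaces inside each subject and separate the subjects from one another with spaces, so that a count vectorizer can use the column later on.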

In [9]:
#example of what we have
df['subject'][0:5]

0    ['BOARDS & COMMISSIONS', 'DRUGS', 'HEALTH & SO...
1    ['BOATS & BOATING', 'MARINE HIGHWAY', 'TRANSPO...
2    ['BUSINESS', 'EDUCATION', 'EMPLOYMENT', 'LABOR...
3    ['BOARDS & COMMISSIONS', 'COMMUNITY COLLEGES',...
4    ['AIRPORTS', 'APPROPRIATIONS', 'AVIATION', 'BO...
Name: subject, dtype: object

In [10]:
#Gets rid of unnecessary brackets and quotations and combines all words for a given subject together.
#This will be useful for using count vectorizer later.
df['subject'] = df['subject'].map(lambda x: x.lower().replace('[','').replace(']','').replace('\'','').replace(' ','').replace(',',' '))

In [11]:
#Do the same with other dataset
df_with_sponsors['subject'] = df_with_sponsors['subject'].map(lambda x: x.lower().replace('[','').replace(']','').replace('\'','').replace(' ','').replace(',',' '))

In [12]:
df['subject'][0:5]

0    boards&commissions drugs health&socialservices...
1    boats&boating marinehighway transportation ves...
2    business education employment labor salaries&a...
3    boards&commissions communitycolleges education...
4    airports appropriations aviation boards&commis...
Name: subject, dtype: object
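
An alternative to chained `replace` calls, shown here only as a sketch, is to parse each string back into a real list with `ast.literal_eval` and then join the subjects. The function name `subjects_to_tokens` is hypothetical, and this is not the method used in the cells above.

```python
import ast


def subjects_to_tokens(raw: str) -> str:
    """Turn a stringified list of subjects into space-separated tokens."""
    subjects = ast.literal_eval(raw)  # "['A', 'B C']" -> ['A', 'B C']
    # Drop inner spaces so each subject stays one token, then join with spaces.
    return ' '.join(s.lower().replace(' ', '') for s in subjects)


example = "['BOARDS & COMMISSIONS', 'DRUGS', 'HEALTH & SOCIAL SERVICES']"
print(subjects_to_tokens(example))
print(repr(subjects_to_tokens("[]")))
```

It could be applied with `df['subject'].map(subjects_to_tokens)`. Unlike the chained replacements, a subject that itself contains a comma stays a single token. One caveat worth checking if scikit-learn's `CountVectorizer` is used with its default `token_pattern`: characters such as `&` act as token boundaries, so `boards&commissions` would be counted as two words unless a custom pattern or tokenizer is supplied.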

### Jurisdiction
Change the column name to state

In [13]:
df.rename(columns = {'jurisdiction': 'state'}, inplace = True)
df_with_sponsors.rename(columns = {'jurisdiction': 'state'}, inplace = True)

### Year

Just change the column name to year instead

In [14]:
df.rename(columns = {'year1': 'year'}, inplace = True)

In [15]:
df_with_sponsors.rename(columns = {'year1': 'year'}, inplace = True)

### Organization Classification

Some values in this column are uninformative.

We are going to replace the `legislature` values with nulls, since they do not tell us which chamber a bill was introduced in.

In [16]:
df['organization_classification'] = df['organization_classification'].map(lambda x: None if x == 'legislature' else x)
df_with_sponsors['organization_classification'] = df_with_sponsors['organization_classification'].map(lambda x: None if x == 'legislature' else x)
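
The same replacement can be written without a lambda. A minimal sketch on a toy Series (values made up), using `Series.mask`, which sets entries to missing wherever the condition is true:

```python
import pandas as pd

chamber = pd.Series(['upper', 'lower', 'legislature', 'upper'])

# mask() blanks out the rows where the condition holds and leaves the rest untouched.
cleaned = chamber.mask(chamber == 'legislature')

print(cleaned.isna().sum())
print(cleaned.dropna().tolist())
```

Counting the nulls before and after is a quick way to confirm that only the intended rows were changed.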

### Statute

This column describes whether a bill is a statute or not. It also contains a third value, though.

In [19]:
df['statute'].value_counts(dropna = False)

1    107233
0     41255
3     11498
Name: statute, dtype: int64

A code of 3 is supposed to mean that the bill is from Nebraska, but that cannot be right, because none of our data is from Nebraska. After some poking around in the original dataset, it appears that the authors used 3 for unobservable and 2 for Nebraska. I am going to replace the 3s with null values.

In [20]:
#Changing all 3s to Nulls
df['statute'] = df['statute'].map(lambda x: None if x == 3 else x)

In [21]:
#Same for other dataset
df_with_sponsors['statute'] = df_with_sponsors['statute'].map(lambda x: None if x == 3 else x)
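
Replacing an integer code with a null changes the column's dtype, which is why the counts further down show `1.0` and `0.0` instead of `1` and `0`. The sketch below illustrates this on a toy Series (values made up) and shows pandas' nullable `Int64` dtype as one optional way to keep integers; the notebook itself does not do this conversion.

```python
import pandas as pd

statute = pd.Series([1, 0, 3, 1, 3])

# Blank out the code 3; the column is upcast to float because NaN is a float.
as_float = statute.mask(statute == 3)
print(as_float.dtype)

# Optional: the nullable Int64 dtype stores integers alongside <NA>.
as_nullable = as_float.astype('Int64')
print(as_nullable.dtype)
print(as_nullable.tolist())
```

Comparisons such as `== 1` behave the same way for filtering in either representation, so the conversion is mostly a matter of readability.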

### Changing some more column names

In [22]:
df.rename(columns = {'id_x': 'id','dum_38_pass': 'pass_1st_chamber', 'dum_68_pass': 'pass_2nd_chamber', 'dum_90_law': 'law_enacted'}, inplace = True)
df_with_sponsors.rename(columns = {'id_x': 'id','dum_38_pass': 'pass_1st_chamber', 'dum_68_pass': 'pass_2nd_chamber', 'dum_90_law': 'law_enacted'}, inplace = True)

In [23]:
df_with_sponsors.columns

Index(['id', 'classification', 'title', 'subject', 'abstract', 'state',
       'organization_classification', 'year', 'bill_pre', 'statute', 'lpcode',
       'pass_1st_chamber', 'pass_2nd_chamber', 'law_enacted', 'senate_dem',
       'senate_rep', 'house_dem', 'house_rep', 'gov_party',
       'majority_sponsor_party'],
      dtype='object')

### Using Statute to filter out unwanted data

Many pieces of legislation can never become laws. For example, simple resolutions pass only one chamber and address matters within the chamber in which they were introduced. They do not need the approval of the other chamber or of the state's governor, and thus do not become laws. We are not interested in this kind of legislation and will drop it.

The statute column lets us know whether a piece of legislation has the potential to become a law.

In [24]:
df['statute'].value_counts()

1.0    107233
0.0     41255
Name: statute, dtype: int64

In [25]:
df_with_sponsors['statute'].value_counts()

1.0    59980
0.0    16490
Name: statute, dtype: int64

In [26]:
#Getting rid of the non-statute pieces of legislation.
df = df[df['statute'] == 1]
df_with_sponsors = df_with_sponsors[df_with_sponsors['statute'] == 1]

In [27]:
df_with_sponsors.shape

(59980, 20)
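
One detail of this filter is that rows whose statute value was set to null earlier are dropped as well, because a comparison with a missing value evaluates to `False`. A small sketch on toy data (values made up) makes that visible:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'statute': [1.0, 0.0, np.nan, 1.0],
    'law_enacted': [1, 0, 0, 0],
})

# NaN == 1 is False, so the row with unknown statute status is removed too.
mask = toy['statute'] == 1
statutes_only = toy[mask].copy()

print(len(toy), '->', len(statutes_only))
print(statutes_only['id'].tolist())
```

This matches the counts above: only the rows counted under `1.0` remain, and both the `0.0` rows and the former 3s are gone.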

# Saving Cleaned Data

In [124]:
df.to_csv('../Data/Merged_Data/cleaned_data.csv.zip', index = False)
df_with_sponsors.to_csv('../Data/Merged_Data/cleaned_data_with_sponsors.csv.zip', index = False)
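
A quick round trip can confirm that a saved file reads back as expected. The sketch below writes a toy frame to a temporary directory instead of the project paths; the file name `cleaned_toy.csv.zip` is made up for the example.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'id': ['a', 'b'], 'statute': [1.0, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cleaned_toy.csv.zip')
    # Compression is inferred from the '.zip' suffix; index=False avoids
    # writing an index column that would reappear as 'Unnamed: 0'.
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.shape)
print(list(restored.columns))
```

Because `index = False` is used when saving, the cleaned files should not need the `Unnamed: 0` drop that the merged files required.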