# Open States Summit - Data Analysis Track

### Introduction
This track is for people who are interested in obtaining data from Open States, understanding it, and performing quantitative analysis on that data. You will receive a tutorial on how to access data via the API or our bulk offerings, and an example of quantitative analysis in Python.


## Public Policy Data
The legislative process at the state and federal level generates significant amounts of data that can be used to understand, predict, and affect future legislation. A few examples of the data it produces are:

- Bills: Bill texts, bills metadata (date introduced, proposer)
- People: Legislator's state, age, chamber
- Votes: Number of votes, number of voting processes
- Committees: Committee members, votes, bills, topics
- Sessions: Start, end, bills introduced
- etc.

There are a few ways to access policy data available in Open States. Some of them are:   
- Bulk data
    - JSON, CSV, and database dumps
    - Free access
    - Bills and People data
- API v3
   - Create a free account at openstates.org
   - Copy and save the API key
   - Swagger UI documentation
- `pyopenstates` Python library
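
As a rough illustration of the API v3 route, the sketch below builds (but does not send) a request for bills. The base URL, the `/bills` path, the `jurisdiction`, `session`, and `per_page` parameters, and the `X-API-KEY` header are assumptions to confirm against the Swagger UI documentation; `build_bills_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL for API v3; confirm it in the Swagger UI documentation.
API_BASE = "https://v3.openstates.org"


def build_bills_request(api_key, jurisdiction, session=None, per_page=10):
    """Build, without sending, a GET request for the (assumed) bills endpoint."""
    params = {"jurisdiction": jurisdiction, "per_page": per_page}
    if session is not None:
        params["session"] = session
    url = f"{API_BASE}/bills?{urlencode(params)}"
    # The key is passed in a header (assumed name) rather than in the URL.
    return Request(url, headers={"X-API-KEY": api_key})


req = build_bills_request("YOUR_API_KEY", "Minnesota", session="2021-2022")
print(req.full_url)
```

To actually fetch data, the request could be passed to `urllib.request.urlopen` and the response decoded with `json.load`.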

### Import libraries

In [None]:
import requests, zipfile, io

### Download data from bulk export

In [None]:
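zip_file_url = 'https://data.openstates.org/csv/latest/MN_2021-2022_csv_ANkWj6NP3kGwnwwxJ3gk6.zip'
r = requests.get(zip_file_url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()  # extracts into the current working directory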

In [None]:
!head MN/2021-2022/MN_2021-2022_bills.csv
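
`!head` only works where a shell is available. A portable way to peek at an export is to list the archive members and read the first lines with `zipfile` itself. The sketch below uses a tiny in-memory archive as a stand-in for the real download; the second member name and all row values are made up for illustration.

```python
import io
import zipfile

# Build a tiny in-memory archive that stands in for the downloaded export.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("MN/2021-2022/MN_2021-2022_bills.csv", "id,title\n1,Example bill\n")
    zf.writestr("MN/2021-2022/MN_2021-2022_votes.csv", "id,result\n")  # hypothetical member

# List the CSV members and read the first two lines of the first one.
with zipfile.ZipFile(buffer) as zf:
    csv_members = [name for name in zf.namelist() if name.endswith(".csv")]
    with zf.open(csv_members[0]) as fh:
        first_lines = fh.read().decode("utf-8").splitlines()[:2]

print(csv_members)
print(first_lines)
```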

## Data Analysis for Public Policy
Public policy affects the lives of every citizen and business in the country, in ways small and large, at the federal, state, or city level. Companies spend millions of dollars trying to influence new legislation. However, since most policy data is public, individuals and smaller groups can use open-source tools and services to obtain policy data and put it to work toward their own goals.

Some interesting policy questions subject to data analysis are:
- Who will support this bill?
- How likely is this bill to pass?
- What’s the legislation regarding this topic?
- What is the relevant information in this bill?

It's important to note that not every policy analysis will be successful. Many factors can get in the way, including low-frequency events (e.g., the number of sessions), unavailable data (e.g., backdoor bill negotiations), and specialized legal jargon. This should not discourage us from carefully analyzing the high-quality data that does exist.

Given the amount and types of data generated, we have multiple opportunities to analyze legislative events: data exploration and visualization, statistical tests on voting data, vote prediction, bill classification, and others.

While there's some structured policy data, some of the most common and relevant resources (like bills and regulations) come in the form of unstructured text. As such, it's a logical step to use NLP methods common in text-heavy fields like law and medicine.

### Why is NLP relevant for Policy Analysis?

Natural language processing studies interactions between computers and humans using natural languages, intersecting the fields of Linguistics and Computer Science. Natural Language Processing has seen significant advances in the last decade, allowing the deployment of automated ML systems to solve tasks like:  

- **Text generation**: Create new text from a given input.
- **Named entity recognition (NER)**: Label each word with the entity it represents (person, date, location, etc.).
- **Question answering**: Extract an answer from the context, given the context and a question.
- **Summarization**: Generate a summary of a long sequence of text or document.
- **Translation**: Translate text into another language.

## Bill Topic Detection

### Objective
A simple way for a person to find the bills they care about is to select those that affect the policy topics they follow. Since not all states provide policy topics for each bill, it would be useful to build a system that identifies a bill's policy topics.  
For the sake of simplicity, we will only use bill titles to find the possible topics.

### Install required libraries

In [None]:
!pip install transformers

### Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans, DBSCAN, MeanShift
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

import torch
from torch import nn
import torch.nn.functional as F

from transformers import pipeline
from ast import literal_eval

### Read data from bulk download

In [None]:
bills_df = pd.read_csv('MN/2021-2022/MN_2021-2022_bills.csv')
# The CSV stores list-valued columns as strings; parse them back into Python lists
bills_df['classification'] = bills_df['classification'].apply(literal_eval)
bills_df['subject'] = bills_df['subject'].apply(literal_eval)
bills_df.head()
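
The `literal_eval` step is needed because a CSV file can only hold text, so list-valued columns arrive as strings that merely look like lists. A minimal sketch with made-up rows (the values are illustrative, not taken from the Minnesota export):

```python
import csv
import io
from ast import literal_eval

# Made-up rows that mimic list-valued columns stored as text.
raw_csv = """id,classification,subject
1,"['bill']","['Education', 'Taxation']"
2,"['resolution']",[]
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))
print(type(rows[0]["subject"]))  # still a string at this point

# literal_eval safely turns the text back into a Python list.
parsed_subjects = [literal_eval(row["subject"]) for row in rows]
print(parsed_subjects)
```

Unlike `eval`, `literal_eval` only accepts Python literals, so it will not execute arbitrary code found in the file.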

### Select only bill data

In [None]:
bills_df['classification'] = bills_df['classification'].apply(lambda x: ';'.join(x))
bills_df = bills_df[bills_df['classification'] == 'bill']
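
Note that joining the list with `;` and comparing to `'bill'` keeps only rows whose sole classification is `bill`. If the goal were instead to keep every row whose classification list includes `bill`, a membership test would behave differently. The sketch below contrasts the two on made-up classification lists; it is an alternative to consider, not the approach used in this notebook.

```python
# Made-up classification lists for three hypothetical rows.
classifications = [["bill"], ["resolution"], ["bill", "appropriation"]]

# Notebook approach: join, then require an exact match.
joined = [";".join(c) for c in classifications]
exact_match = [j == "bill" for j in joined]

# Alternative: keep any row whose list contains 'bill'.
contains_bill = ["bill" in c for c in classifications]

print(exact_match)
print(contains_bill)
```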

### Drop duplicates and extract bill names

In [None]:
bills_df = bills_df.drop_duplicates('id').reset_index(drop=True)
bills_names = bills_df['title'].tolist()
bills_names[:5]

### Load Language Model for vectorization
Language models are models trained to predict a word in a sentence from the words in its context. Combined with large numbers of parameters and large amounts of data (large language models, LLMs), this training task has been used to generate high-quality text embeddings: vector representations of words and sentences that can be used for downstream tasks like NER, QA, and classification.

In [None]:
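# Candidate models: distilbert-base-cased, legalbert, xlm-roberta-base, microsoft/deberta-base
device = 0 if torch.cuda.is_available() else -1  # first GPU if present, otherwise CPU
encoder = pipeline("feature-extraction", model='distilbert-base-cased', device=device)
embs = encoder(bills_names)
# Keep the first-token ([CLS]) vector of each title as its embedding
cls_embs = np.array([emb[0][0] for emb in embs])
cls_embs = normalize(cls_embs)  # scale each vector to unit L2 norm
print(cls_embs[:5])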
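
To make the indexing in that cell concrete: for each input text the feature-extraction pipeline returns a nested list shaped `[1][n_tokens][hidden_size]`, so `emb[0][0]` selects the first token's vector (the `[CLS]` position for BERT-style models). `normalize` then scales every row to unit L2 norm. The sketch below reproduces both steps with NumPy on a made-up output with `hidden_size = 2`; the numbers are illustrative only.

```python
import numpy as np

# Toy stand-in for the pipeline output: one entry per text,
# each shaped [1][n_tokens][hidden_size] with hidden_size = 2.
toy_embs = [
    [[[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]]],  # text with 3 tokens
    [[[0.0, 5.0], [2.0, 2.0]]],              # text with 2 tokens
]

# Keep only the first token's vector for every text.
first_token = np.array([emb[0][0] for emb in toy_embs])

# Row-wise L2 normalization, the default behavior of sklearn's normalize.
unit = first_token / np.linalg.norm(first_token, axis=1, keepdims=True)

print(first_token.shape)
print(unit)
```

With unit-length rows, cosine similarity between two titles reduces to a dot product.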

### Reduce vectors dimensionality with TSNE
t-SNE is a non-linear dimensionality reduction technique that helps us visualize high-dimensional data.

In [None]:
reducer = TSNE(n_components=2, perplexity=10)
plot_embs = reducer.fit_transform(cls_embs)
bills_df[['dim1', 'dim2']] = pd.DataFrame(plot_embs)

### Find policy topic clusters
Use the KMeans clustering algorithm to try to find groups of bills that are similar to each other and different from the rest of the bills.

In [None]:
# best clusters: distilbert, kmeans(9) xlm, kmeans(5)
identifier = KMeans(5)
bills_df['pred_labels'] = identifier.fit_predict(cls_embs)
bills_df['pred_labels'] = bills_df['pred_labels'].astype(str)

### Find the right number of groups
Fit KMeans over a range of cluster counts and plot the inertia of each fit. A bend (an "elbow") in the curve suggests a reasonable number of groups.

In [None]:
# inertias = []
# cluster_sizes = list(range(2,50))
# for k in cluster_sizes:
#     kmeans = KMeans(n_clusters=k)
#     kmeans.fit(cls_embs)
#     inertias.append(kmeans.inertia_)
# plt.plot(cluster_sizes, inertias)
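
For intuition about what the loop plots: inertia is the sum of squared distances from each point to the center of its assigned cluster, so it always shrinks as clusters are added. The sketch below computes it by hand with NumPy on toy points, assuming each center is the mean of its members (which holds for a converged KMeans fit); `inertia` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np


def inertia(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        center = members.mean(axis=0)
        total += ((members - center) ** 2).sum()
    return float(total)


# Two obvious groups of toy points.
toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
two_clusters = inertia(toy_points, [0, 0, 1, 1])
one_cluster = inertia(toy_points, [0, 0, 0, 0])
print(two_clusters, one_cluster)
```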

### Simple plot presenting the groups in the reduced dimensionality

In [None]:
sns.scatterplot(x="dim1", y="dim2", data=bills_df, hue="pred_labels")

### Plot results interactively

In [None]:
fig = px.scatter(
    bills_df, x='dim1', y='dim2', color='pred_labels',
    hover_data=['title'])
fig.show()

### Manually review embeddings
Compare a bill name with the closest bill names available in the vector space using Cosine Similarity

In [None]:
# Support functions
def test_embs(cls_embs, bills_df):    
    embs_sims = cosine_similarity(cls_embs)
    temp_dfs = []
    for i in range(cls_embs.shape[0]):
        bill_name = bills_df.loc[i, 'title']
        bill_subject = bills_df.loc[i, 'subject']
        emb_sims = embs_sims[i]
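        # Five most similar bills, in ascending order; the last index is skipped
        # on the assumption that it is the bill itself (similarity of 1)
        max_sims_ids = np.argsort(emb_sims)[-6:-1]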
        max_sims = emb_sims[max_sims_ids]
        sim_bills_names = bills_df.loc[max_sims_ids, 'title']
        sim_bills_subject = bills_df.loc[max_sims_ids, 'subject']
        temp_df = pd.DataFrame({
            'bill_name': [bill_name]*5,
            'sim_bill_name': sim_bills_names,
            'bill_subject': [bill_subject]*5,
            'sim_bill_subject': sim_bills_subject,
            'sim_score': max_sims,
        })
        temp_dfs.append(temp_df)
    vect_eval_df = pd.concat(temp_dfs)
    return vect_eval_df

In [None]:
vect_eval_df = test_embs(cls_embs, bills_df)
vect_eval_df
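
The neighbor lookup in `test_embs` relies on each bill being its own top match. A small variation is to mask the diagonal of the similarity matrix so that a bill can never be returned as its own neighbor, even when duplicate titles tie at a similarity of 1. A minimal NumPy sketch on toy unit vectors (`top_k_neighbors` is a hypothetical helper, and the vectors are made up):

```python
import numpy as np


def top_k_neighbors(unit_vectors, k):
    """Indices and scores of the k most similar rows, excluding each row itself."""
    # For unit-length rows, the dot product equals cosine similarity.
    sims = unit_vectors @ unit_vectors.T
    np.fill_diagonal(sims, -np.inf)  # a row is never its own neighbor
    order = np.argsort(-sims, axis=1)[:, :k]  # most similar first
    return order, np.take_along_axis(sims, order, axis=1)


vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
neighbors, scores = top_k_neighbors(unit, k=1)
print(neighbors.ravel())
```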

### Finding information about topics per bill

In [None]:
subjects_count = bills_df['subject'].apply(len)
subjects_count.describe()

### Topic prevalence in the state session

In [None]:
subjects = bills_df['subject'].explode(ignore_index=True)
subjects.value_counts()
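
The same two summaries can be sketched without pandas, which makes it easy to see what `explode` and `value_counts` are doing. The subject lists below are made up for illustration and are not taken from the Minnesota data.

```python
from collections import Counter

# Made-up subject lists, one per hypothetical bill.
subjects_per_bill = [
    ["Education", "Taxation"],
    ["Health"],
    [],
    ["Education", "Health", "Transportation"],
]

# Flatten the lists (like explode) and count each subject (like value_counts).
subject_counts = Counter(s for subjects in subjects_per_bill for s in subjects)

# Share of bills tagged with more than one subject.
multi_topic_share = sum(len(s) > 1 for s in subjects_per_bill) / len(subjects_per_bill)

print(subject_counts.most_common())
print(multi_topic_share)
```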

### Conclusions and next steps
- Most bills are about more than one topic
- A clustering algorithm may not be the best way to identify multiple topics inside a bill
- We will likely require a dataset with bills and their corresponding topics
    
### Ideas for next steps
- Train a classification model with MN labels. Does the knowledge transfer to different jurisdictions?
- Try different clustering algorithms and hyperparameters
- Test different dimensionality reduction algorithms
- Replace the LM with a bigger or specialized model (e.g., RoBERTa, LegalBERT)
- Whatever you want to try!
