# Example Topic Modeling and Predictions

In [1]:
# import packages
from joblib import load, dump
import pandas as pd
import numpy as np
import os

os.chdir("../xeval-models")

In [2]:
# load in the topic modeler
vectorizer = load("Notebooks/Topic Modeling/models_test/vectorizer_model_13000_lda.joblib")
lda_model = load("Notebooks/Topic Modeling/models_test/lda_model_13000_lda.joblib")

# load in the prediction model
model = load('Models/Predictions/model/all/random_forest_500k.joblib')

# import test data
test_bills_df = pd.read_csv("../Test-Repo/Data/2019-2020_116th_Congress/csv/bills.csv")
test_people_df = pd.read_csv("../Test-Repo/Data/2019-2020_116th_Congress/csv/people.csv")
test_combined_df = pd.read_csv("Data/test.csv", index_col = 0)

## Topic Model
Latent Dirichlet Allocation (LDA)

We train the LDA model on past bills. It groups bill text into topics based on the distribution of words in each bill. For example, if the word 'gun' appears in many past bills, the model learns a topic in which 'gun' carries a high weight. When a new bill containing the word 'gun' is run through the topic modeler, the model weights that bill toward that topic.
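
As a rough illustration of the idea above, the sketch below fits a tiny LDA model on a made-up corpus with scikit-learn and scores a new document. The corpus, the two-topic setting, and the `toy_*` names are assumptions for illustration only; this is not the saved model loaded in cell 2.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for past bill descriptions.
past_bills = [
    "gun trafficking firearm offense criminal penalty",
    "firearm purchase background check gun dealer",
    "school grant education program student training",
    "student loan higher education grant program",
]

# Count unigrams and bigrams, then fit a two-topic LDA model.
toy_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = toy_vectorizer.fit_transform(past_bills)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_lda.fit(counts)

# Score an unseen document: one row of topic weights that sums to 1.
new_bill = ["gun show firearm sale background check"]
toy_dist = toy_lda.transform(toy_vectorizer.transform(new_bill))
print(toy_dist.shape)
print(toy_dist.round(3))
```

Which topic index ends up holding the firearm-related words depends on the random initialization, so the sketch prints the weights rather than assuming a particular ordering.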

In [3]:
# show example data from the test set: bills whose description starts with 'Gun'
test_bills_df = test_bills_df[test_bills_df['description'].str.match('Gun')]
test_bills_df.head()

Unnamed: 0,bill_id,session_id,bill_number,status,status_desc,status_date,title,description,committee_id,committee,last_action_date,last_action,url,state_link
20,1137428,1658,HB33,1,Introduced,2019-01-03,Gun Trafficking Prohibition Act,Gun Trafficking Prohibition Act This bill esta...,2349,"House Subcommittee on Crime, Terrorism, and Ho...",2019-01-03,"Referred to the Subcommittee on Crime, Terrori...",https://legiscan.com/US/bill/HB33/2019,https://www.congress.gov/bill/116th-congress/h...
144,1137495,1658,HB157,1,Introduced,2019-01-03,Gun Manufacturers Accountability Act,Gun Manufacturers Accountability Act This bill...,4143,"House Subcommittee on the Constitution, Civil ...",2019-01-03,Referred to the Subcommittee on the Constituti...,https://legiscan.com/US/bill/HB157/2019,https://www.congress.gov/bill/116th-congress/h...
661,1161096,1658,HB674,1,Introduced,2019-01-17,Gun Violence Prevention Research Act of 2019,Gun Violence Prevention Research Act of 2019 T...,2355,House Subcommittee on Health,2019-01-25,Referred to the Subcommittee on Health.,https://legiscan.com/US/bill/HB674/2019,https://www.congress.gov/bill/116th-congress/h...
807,1177722,1658,HB820,1,Introduced,2019-01-28,Gun Show Loophole Closing Act of 2019,Gun Show Loophole Closing Act of 2019,2349,"House Subcommittee on Crime, Terrorism, and Ho...",2019-03-25,"Referred to the Subcommittee on Crime, Terrori...",https://legiscan.com/US/bill/HB820/2019,https://www.congress.gov/bill/116th-congress/h...
1732,1238941,1658,HB1745,1,Introduced,2019-03-13,Gun Violence Prevention Act of 2019,Gun Violence Prevention Act of 2019,2349,"House Subcommittee on Crime, Terrorism, and Ho...",2019-05-03,"Referred to the Subcommittee on Crime, Terrori...",https://legiscan.com/US/bill/HB1745/2019,https://www.congress.gov/bill/116th-congress/h...


In [4]:
# print out a description from the top bill
row = 0
title = test_bills_df.iloc[row]['title']
description = test_bills_df.iloc[row]['description']

print("TITLE:" + str(title) + "\n")
print("DESCRIPTION:" + str(description) )

TITLE:Gun Trafficking Prohibition Act

DESCRIPTION:Gun Trafficking Prohibition Act This bill establishes stand-alone criminal offenses for trafficking in firearms and straw purchasing of firearms. The bill expands the categories of prohibited persons (i.e., persons barred from receiving or possessing a firearm or ammunition) to include persons who intend (1) to sell or transfer a firearm or ammunition to a prohibited person, (2) to sell or transfer a firearm to further a crime of violence or drug trafficking offense, or (3) to unlawfully export. It increases the maximum prison term for the sale or transfer of a firearm to or the receipt or possession of a firearm by a prohibited person. The bill revises the existing prohibition on transferring a firearm knowing that it will be used to commit a crime of violence or drug trafficking offense. It broadens the scope of unlawful conduct and increases the maximum prison term for a violator. The bill also revises the existing prohibition on sm

In [5]:
# use the description to topic model
data = [description]

# use term frequency bigram to tokenize
vectorized = vectorizer.transform(data)

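# run through lda topic modeler (data holds one document, so the result has a single row)
doc_topic_dist_unnormalized = np.asarray(lda_model.transform(vectorized))
# divide each row by its sum so the topic weights add up to 1
doc_topic_dist = doc_topic_dist_unnormalized / doc_topic_dist_unnormalized.sum(axis=1, keepdims=True)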
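
The normalization step can be checked in isolation. This is a minimal sketch with plain NumPy arrays, assuming a made-up matrix of unnormalized topic weights (`raw_weights` is a hypothetical name). Passing `keepdims=True` keeps the row sums as a column, so the division broadcasts row by row.

```python
import numpy as np

# Hypothetical unnormalized topic weights: 2 documents x 3 topics.
raw_weights = np.array([[2.0, 1.0, 1.0],
                        [0.5, 0.5, 3.0]])

# Divide each row by its own sum so every row becomes a distribution.
row_sums = raw_weights.sum(axis=1, keepdims=True)
normalized = raw_weights / row_sums

print(normalized)
```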

In [6]:
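# grab the heaviest weighted words from the lda model
# (get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 need it instead of get_feature_names_out())
vocab = vectorizer.get_feature_names_out()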
topic_words = {}
n_top_words = 20

for topic, comp in enumerate(lda_model.components_): 
    word_idx = np.argsort(comp)[::-1][:n_top_words]
    topic_words[topic] = [vocab[i] for i in word_idx]
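
To see what the loop above produces, here is a small sketch on made-up data. The vocabulary and the topic-word weights are hypothetical; only the `argsort` pattern mirrors the cell.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 5 terms).
toy_vocab = np.array(["gun", "school", "firearm", "grant", "law"])
toy_components = np.array([[9.0, 0.1, 7.0, 0.2, 3.0],
                           [0.3, 8.0, 0.1, 6.0, 1.0]])

n_top = 3
# argsort is ascending, so reverse each row and keep the first n_top indices.
top_idx = np.argsort(toy_components, axis=1)[:, ::-1][:, :n_top]
toy_topic_words = {t: toy_vocab[idx].tolist() for t, idx in enumerate(top_idx)}

print(toy_topic_words)
# {0: ['gun', 'firearm', 'law'], 1: ['school', 'grant', 'law']}
```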

In [7]:
# finding top k-topics in the distribution
K = 3
a = doc_topic_dist.tolist()[0]
max_ind = sorted(range(len(a)), key=lambda i: a[i], reverse=True)[:K]
max_vals = list(a[x] for x in max_ind)

# printing info
print("TOPIC DIST",dict(zip(max_ind, max_vals)))
print("\nTITLE:", test_bills_df.iloc[row]['title'])
print("\nFIRST TOPIC WORDS", topic_words[max_ind[0]])
print("\nSECOND TOPIC WORDS", topic_words[max_ind[1]])
print("\nTHIRD TOPIC WORDS", topic_words[max_ind[2]])

TOPIC DIST {21: 0.43949700711774975, 24: 0.2779686632169964, 3: 0.1467344415371722}

TITLE: Gun Trafficking Prohibition Act

FIRST TOPIC WORDS ['enforcement', 'law', 'law enforcement', 'alien', 'criminal', 'justice', 'act', 'doj', 'child', 'trafficking', 'state', 'status', 'federal', 'department justice', 'amends', 'removal', 'person', 'offense', 'victims', 'individual']

SECOND TOPIC WORDS ['act', 'education', 'health', 'program', 'programs', 'grants', 'grant', 'services', 'amends', 'training', 'department', 'school', 'public', 'student', 'act amends', 'care', 'students', 'higher', 'state', 'prevention']

THIRD TOPIC WORDS ['united', 'states', 'united states', 'act', 'foreign', 'rights', 'president', 'government', 'state', 'international', 'department', 'sense', 'consumer', 'congress', 'human', 'civil', 'shall', 'person', 'countries', 'country']
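
The top-K selection in the cell above can also be written with `np.argsort`. A minimal sketch, assuming a made-up five-topic distribution (`toy_topic_dist` is hypothetical and unrelated to the output shown):

```python
import numpy as np

# Hypothetical topic distribution for one document (5 topics).
toy_topic_dist = np.array([0.05, 0.40, 0.10, 0.30, 0.15])

k = 3
# Indices of the k largest weights, largest first.
top_k = np.argsort(toy_topic_dist)[::-1][:k]
top_k_weights = {int(i): float(toy_topic_dist[i]) for i in top_k}

print(top_k_weights)
# {1: 0.4, 3: 0.3, 4: 0.15}
```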


## Prediction on Gun Trafficking Prohibition Act 

We trained a Random Forest model on past voting records to find patterns in how legislators have voted, and we use those patterns to predict how they will vote on future bills.

Now that we have the topic distribution from the topic modeler, we can run the bill through the prediction model and see the simulated votes.

In [8]:
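# get copy
df = test_combined_df.copy(deep=True)

# replace the topic columns of the test set with the gun bill's topic weights
# (loops over every topic in the distribution rather than a hard-coded count,
# and only touches topic columns that already exist in the test set)
topic_weights = doc_topic_dist.tolist()[0]
for i, weight in enumerate(topic_weights):
    col = "topic_" + str(i)
    if col in df.columns:
        df[col] = weight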
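
The cell above relies on pandas broadcasting a single value to every row of a column, so each legislator's row gets the same bill topics. A small sketch of that behavior, assuming a made-up frame and made-up weights (`toy_df` and `new_weights` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the combined test set: one row per legislator.
toy_df = pd.DataFrame({
    "people_id": [1, 2, 3],
    "topic_0": [0.9, 0.9, 0.9],
    "topic_1": [0.1, 0.1, 0.1],
})

# Hypothetical topic weights for the new bill.
new_weights = [0.25, 0.75]

# Assigning a scalar to a column broadcasts it to every row.
for i, weight in enumerate(new_weights):
    toy_df["topic_" + str(i)] = weight

print(toy_df)
```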

In [9]:
# make the prediction
X = df.drop(columns=['vote','bill_id','people_id']).values
Y = df['vote'].values
vote = model.predict(X)
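
For readers unfamiliar with the fit-then-predict pattern behind `model.predict`, here is a toy sketch with scikit-learn's `RandomForestClassifier`. The feature layout (party plus two topic weights), the training votes, and the hyperparameters are all assumptions for illustration; the saved model's actual features and settings are not shown in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: [party_id, topic_0 weight, topic_1 weight].
X_train = np.array([
    [1, 0.9, 0.1], [1, 0.8, 0.2], [2, 0.9, 0.1], [2, 0.7, 0.3],
    [1, 0.1, 0.9], [1, 0.2, 0.8], [2, 0.1, 0.9], [2, 0.3, 0.7],
])
# Hypothetical past votes (1: yea, 2: nay).
y_train = np.array([1, 1, 2, 2, 2, 2, 1, 1])

toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(X_train, y_train)

# Predict one vote per legislator for a new bill weighted toward topic_0.
X_new = np.array([[1, 0.85, 0.15], [2, 0.85, 0.15]])
toy_votes = toy_model.predict(X_new)
print(toy_votes)
```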

In [10]:
# print results
df['prediction'] = vote
df = df[['prediction','people_id']]
pd.merge(df, test_people_df, on=['people_id'])[300:310]

Unnamed: 0,prediction,people_id,name,first_name,middle_name,last_name,suffix,nickname,party_id,party,role_id,role,district,followthemoney_eid,votesmart_id,opensecrets_id,ballotpedia
300,2,18292,Lisa Rochester,Lisa,Blunt,Rochester,,,1,D,1,Rep,HD-DE,38420440,173249,N00038414,Lisa_Blunt_Rochester
301,2,18293,Anthony Brown,Anthony,G.,Brown,,,1,D,1,Rep,HD-MD-4,32184404,19344,N00036999,Anthony_Brown_(Maryland)
302,1,18294,Ted Budd,Ted,,Budd,,,2,R,1,Rep,HD-NC-13,40620470,171489,N00039551,Ted_Budd
303,2,18295,Salud Carbajal,Salud,O.,Carbajal,,,1,D,1,Rep,HD-CA-24,39707235,81569,N00037015,Salud_Carbajal
304,1,18296,Liz Cheney,Liz,,Cheney,,,2,R,1,Rep,HD-WY,15475560,171319,N00035504,Liz_Cheney
305,2,18297,Luis Correa,Luis,,Correa,,,1,D,1,Rep,HD-CA-46,6398747,9732,N00037260,Lou_Correa
306,2,18298,Charlie Crist,Charlie,,Crist,,,1,D,1,Rep,HD-FL-13,1736080,24311,N00002942,Charlie_Crist
307,2,18299,Val Demings,Val,Butler,Demings,,,1,D,1,Rep,HD-FL-10,69035,137637,N00033449,Val_Demings
308,1,18300,Neal Dunn,Neal,P.,Dunn,,,2,R,1,Rep,HD-FL-2,117671,166297,N00037442,Neal_Dunn
309,2,18301,Adriano Espaillat,Adriano,,Espaillat,,,1,D,1,Rep,HD-NY-13,6512516,14379,N00034549,Adriano_Espaillat


## Validation

Let's see how our initial approach performs by testing its accuracy on a past bill that has already been voted on. The bill is chosen at random.

In [11]:
df = test_combined_df.copy(deep=True)

In [12]:
X = df.drop(columns=['vote','bill_id','people_id']).values
Y = df['vote'].values
vote = model.predict(X)

In [13]:
# print results
df['prediction'] = vote
df = df[['vote','prediction','people_id']]
pd.merge(df, test_people_df, on=['people_id'])[300:310]

Unnamed: 0,vote,prediction,people_id,name,first_name,middle_name,last_name,suffix,nickname,party_id,party,role_id,role,district,followthemoney_eid,votesmart_id,opensecrets_id,ballotpedia
300,2,2,18292,Lisa Rochester,Lisa,Blunt,Rochester,,,1,D,1,Rep,HD-DE,38420440,173249,N00038414,Lisa_Blunt_Rochester
301,2,2,18293,Anthony Brown,Anthony,G.,Brown,,,1,D,1,Rep,HD-MD-4,32184404,19344,N00036999,Anthony_Brown_(Maryland)
302,1,1,18294,Ted Budd,Ted,,Budd,,,2,R,1,Rep,HD-NC-13,40620470,171489,N00039551,Ted_Budd
303,2,2,18295,Salud Carbajal,Salud,O.,Carbajal,,,1,D,1,Rep,HD-CA-24,39707235,81569,N00037015,Salud_Carbajal
304,1,1,18296,Liz Cheney,Liz,,Cheney,,,2,R,1,Rep,HD-WY,15475560,171319,N00035504,Liz_Cheney
305,2,2,18297,Luis Correa,Luis,,Correa,,,1,D,1,Rep,HD-CA-46,6398747,9732,N00037260,Lou_Correa
306,2,2,18298,Charlie Crist,Charlie,,Crist,,,1,D,1,Rep,HD-FL-13,1736080,24311,N00002942,Charlie_Crist
307,2,2,18299,Val Demings,Val,Butler,Demings,,,1,D,1,Rep,HD-FL-10,69035,137637,N00033449,Val_Demings
308,1,1,18300,Neal Dunn,Neal,P.,Dunn,,,2,R,1,Rep,HD-FL-2,117671,166297,N00037442,Neal_Dunn
309,2,2,18301,Adriano Espaillat,Adriano,,Espaillat,,,1,D,1,Rep,HD-NY-13,6512516,14379,N00034549,Adriano_Espaillat


In [14]:
print("1:YAY 2:NAY 3:NV \n")

print("Actual Vote Count")
print(df.vote.value_counts())

print("\nPredicted Vote Count")
print(df.prediction.value_counts())

1:YAY 2:NAY 3:NV 

Actual Vote Count
1    226
2    183
3     18
Name: vote, dtype: int64

Predicted Vote Count
1    233
2    193
3      1
Name: prediction, dtype: int64
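
The counts above compare totals only; they do not show whether individual votes were predicted correctly. A minimal sketch of a per-row check, assuming a small made-up frame with the same `vote` and `prediction` columns (the values below are hypothetical, not results from the model):

```python
import pandas as pd

# Hypothetical actual and predicted votes, using the same 1/2/3 coding.
toy_votes = pd.DataFrame({
    "vote":       [1, 1, 2, 2, 3, 1],
    "prediction": [1, 1, 2, 1, 2, 1],
})

# Share of rows where the prediction matches the recorded vote.
accuracy = (toy_votes["vote"] == toy_votes["prediction"]).mean()

# Rows are actual votes, columns are predicted votes.
confusion = pd.crosstab(toy_votes["vote"], toy_votes["prediction"])

print(round(accuracy, 3))
print(confusion)
```

Applied to the real `df` from cell 13, the same two expressions would give the row-level accuracy and show which vote classes get confused with each other.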
