# Natural Language Processing - Bill Text Exploration

**This analysis includes combined text of bill titles and summaries**

Transform the raw bill text into feature vectors. All new features are derived from the existing dataset. The notebook is structured as follows:

Data Exploration
- Word Cloud

Vectorizers
- Custom and spaCy Tokenizer
- Count Vectors as features
- TF-IDF Vectors as features
  - Word level
  - N-Gram level
  - Character level
- Word Embeddings as features
- Text / NLP based features
- Topic Models as features

https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

In [1]:
import mysql.connector 
import numpy as np
import pandas as pd
import config_final
import requests

from sodapy import Socrata
import sqlalchemy as db

import config_final as config
from schema import DbSchema


bills_db = DbSchema(config)

# topics_db = bills_db.query('SELECT * from topics')

In [101]:
passed_bills = bills_db.query("""
    SELECT
        cb.BillID,
        cb.Cong,
        cb.NameFull,
        cb.Party,
        cb.Summary,
        cb.Title,
        tp.dominant_topic
    FROM con_bills.current_bills as cb
    JOIN con_bills.topics as tp
    ON cb.BillID = tp.BillID
    WHERE cb.PassH = 1
    AND cb.Cong >=110
    """)
passed_bills.head(20)

Unnamed: 0,BillID,Cong,NameFull,Party,Summary,Title,dominant_topic
0,110-HR-1,110,Bennie Thompson,100.0,Implementing Recommendations of the 9/11 Commi...,To provide for the implementation of the recom...,5
1,110-HR-1003,110,Diane Watson,100.0,This measure has not been amended since it was...,To amend the Foreign Affairs Reform and Restru...,7
2,110-HR-1006,110,Don Young,200.0,Marine Mammal Rescue Assistance Amendments of ...,To amend the provisions of law relating to the...,0
3,110-HR-1011,110,Frederick Boucher,100.0,Virginia Ridge and Valley Act of 2007 - Design...,To designate additional National Forest System...,11
4,110-HR-1014,110,Lois Capps,100.0,"Heart Disease Education, Analysis Research, an...","To amend the Federal Food, Drug, and Cosmetic ...",12
5,110-HR-1019,110,Luis Fortuno,200.0,(This measure has not been amended since it wa...,To designate the United States customhouse bui...,7
6,110-HR-1021,110,Barney Frank,100.0,(This measure has not been amended since it wa...,To direct the Secretary of the Interior to con...,11
7,110-HR-1025,110,Jerry Moran,200.0,(This measure has not been amended since it wa...,To authorize the Secretary of the Interior to ...,11
8,110-HR-1036,110,Don Young,200.0,Requires the Administrator of the General Serv...,To authorize the Administrator of General Serv...,9
9,110-HR-1045,110,Leonard Boswell,100.0,(This measure has not been amended since it wa...,To designate the Federal building located at 2...,7


In [100]:
not_passed_bills = bills_db.query("""
    SELECT
        cb.BillID,
        cb.Cong,
        cb.NameFull,
        cb.Party,
        cb.Summary,
        cb.Title,
        tp.dominant_topic
    FROM con_bills.current_bills as cb
    JOIN con_bills.topics as tp
    ON cb.BillID = tp.BillID
    WHERE cb.PassH = 0
    AND cb.Cong >=110
    """)
not_passed_bills.head()

Unnamed: 0,BillID,Cong,NameFull,Party,Summary,Title,dominant_topic
0,110-HR-10,110,Nancy Pelosi,100.0,,Reserved for Speaker.,10
1,110-HR-100,110,Susan Davis,100.0,Veterans' Equity in Education Act of 2007 - Am...,To amend the Higher Education Act of 1965 to p...,4
2,110-HR-1000,110,Eleanor Norton,100.0,Edward William Brooke III Congressional Gold M...,To award a congressional gold medal to Edward ...,10
3,110-HR-1001,110,John Spratt,100.0,Amends the Caribbean Basin Economic Recovery A...,To amend the Haitian Hemispheric Opportunity t...,5
4,110-HR-1002,110,John Spratt,100.0,Imposes an additional duty rate of 27.5 % ad v...,To authorize appropriate action if the negotia...,1


In [96]:
passed_bills.shape

(4025, 6)

In [97]:
not_passed_bills.shape

(47042, 7)

**Final Cleaning:**

In [3]:
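def final_clean(df):
    # Fill missing summaries with the string 'None' so the join below never sees NaN
    df['Summary'] = df['Summary'].fillna('None')

    # Concatenate title and summary into a single text field per bill
    df['combined_text'] = df[['Title', 'Summary']].astype(str).apply(' '.join, axis=1)

    return df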

In [43]:
passed_df = final_clean(passed_bills)

In [44]:
not_passed_df = final_clean(not_passed_bills)

In [45]:
passed_df.shape

(4025, 7)

In [46]:
not_passed_df.shape

(47042, 7)

**Combine Title and Summary columns:**

# Review of Text

**Split Training and Testing Data**

In [7]:
from sklearn import preprocessing

In [82]:
# from sklearn.model_selection import train_test_split

X = passed_df['Title']


X1 = not_passed_df['Title']


In [83]:

# X_train, X_test, y_train1, y_test1 = train_test_split(X, y, random_state=2)
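
The split above is commented out and `y` is never defined. A minimal sketch of how the two groups could be labeled and split follows, assuming label 1 for bills that passed the House and 0 for those that did not. The titles are invented placeholders, and the `split` helper is a plain-Python stand-in for `train_test_split`, not the scikit-learn implementation.

```python
import random

# Placeholder titles, invented for illustration
passed_titles = ["designate post office", "extend health program", "reauthorize water grant"]
not_passed_titles = ["amend tax code", "establish energy fund", "prohibit certain fees",
                     "award gold medal", "improve veteran benefits"]

# One text list and one label list, aligned by position
texts = passed_titles + not_passed_titles
labels = [1] * len(passed_titles) + [0] * len(not_passed_titles)

def split(texts, labels, test_size=0.25, seed=2):
    """Shuffle indices with a fixed seed and hold out a test fraction."""
    idx = list(range(len(texts)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(len(idx) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    pick = lambda seq, ids: [seq[i] for i in ids]
    return (pick(texts, train_idx), pick(texts, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

X_train, X_test, y_train, y_test = split(texts, labels)
print(len(X_train), len(X_test))
```

With roughly 4,000 passed and 47,000 not-passed bills, the classes are heavily imbalanced, so a stratified split is probably worth considering on the real data.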

In [84]:
# passed_df['combined_text']

In [85]:
passed_df.loc[10, 'combined_text']

"To authorize the Secretary of the Interior to conduct a study to determine the suitability and feasibility of designating the Soldiers' Memorial Military Museum located in St. Louis, Missouri, as a unit of the National Park System. (This measure has not been amended since it was introduced. The expanded summary of the House passed version is repeated here.)"

## Feature Engineering


**Cleaning Text**

Test both the spaCy tokenizer and the custom tokenizer against the data.

In [86]:
import spacy
from spacy.lang.en import English
import en_core_web_sm
import string
import re

nlp = English()
stop_words = spacy.lang.en.stop_words.STOP_WORDS

nlp.Defaults.stop_words |= {"bill","amend", "purpose", "united", "state", "states", "secretary", "act", "federal", "provide"}

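# Brackets, slashes, and separator characters stripped before tokenizing
replace_with_space = re.compile(r'[/(){}\[\]\|@,;]')

# Anything that is not a letter or whitespace (digits, remaining punctuation)
just_words = re.compile(r'[^a-zA-Z\s]')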


In [87]:
# Create our list of punctuation marks
punctuations = string.punctuation

# Create our list of stopwords
nlp = spacy.load('en_core_web_sm')
stop_words = spacy.lang.en.stop_words.STOP_WORDS

def tokenizer(text):
    
    # lowercase everything
    lower_text = text.lower()
    
    # remove punctuation
#     no_pun_text = lower_text.translate(str.maketrans('', '', string.punctuation))
    
    # strip brackets, slashes, and separators (removed outright, not replaced with a space)
    text = replace_with_space.sub('',lower_text)
    
    # drop every character that is not a letter or whitespace (digits, hyphens, periods)
    just_words_text = just_words.sub('', text)
    
    # run the spaCy pipeline; the parser and NER are not needed for lemmas
    mytokens = nlp(just_words_text, disable=['parser', 'ner'])
#     print(mytokens)
    
    # for POS tagging
#     mytokens = [word for word in mytokens if (word.pos_ == 'NOUN') or (word.pos_ == 'VERB') or (word.pos_ == 'ADJ') or (word.pos_ == 'ADV')]
    
    # lemmatize; spaCy v2 returns "-PRON-" for pronouns, so keep their lowercase form instead
    mytokens = [word.lemma_.strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    
    # TODO: map specific words to others (e.g. veterans -> veteran)

    # remove stop words and punctuation
    mytokens = [word for word in mytokens if word not in stop_words and word not in punctuations]
    
    return mytokens
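
To see what the two regex steps do before spaCy gets involved, here is a small standalone sketch. It reuses the notebook's two patterns, but the `clean` helper and the sample title are made up for illustration, and it skips lemmatization and stop-word removal.

```python
import re

# Same two patterns as in the notebook, written as raw strings
replace_with_space = re.compile(r'[/(){}\[\]\|@,;]')
just_words = re.compile(r'[^a-zA-Z\s]')

def clean(text):
    """Apply only the lowercase and regex steps (no spaCy, no stop words)."""
    lowered = text.lower()
    no_symbols = replace_with_space.sub('', lowered)
    return just_words.sub('', no_symbols)

# Hypothetical title, invented for illustration
sample = "To amend the Leahy-Smith America Invents Act (H.R. 1249); 27.5% duty"
cleaned_tokens = clean(sample).split()
print(cleaned_tokens)
```

Because matches are replaced with an empty string, hyphenated names collapse into one token, which is likely why entries such as 'leahysmith' and 'millendermcdonald' show up in the word lists further down. Substituting a space instead would keep the two halves apart.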
    

In [112]:
test_fun = passed_df.iloc[2233][2:6]
test_fun

Party                                                           200
Summary           FHA Emergency Fiscal Solvency Act of 2012 - (S...
Title             To help ensure the fiscal solvency of the FHA ...
dominant_topic                                                    8
Name: 2233, dtype: object

In [111]:
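# tokenizer expects a single string, so join the title and summary first
tokenizer(' '.join(test_fun[['Title', 'Summary']].astype(str)))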


['help',
 'ensure',
 'fiscal',
 'solvency',
 'fha',
 'mortgage',
 'insurance',
 'program',
 'housing',
 'urban',
 'development',
 'fha',
 'emergency',
 'fiscal',
 'solvency',
 'sec',
 'national',
 'housing',
 'nha',
 'direct',
 'housing',
 'urban',
 'development',
 'hud',
 'currently',
 'authorize',
 'establish',
 'collect',
 'additional',
 'annual',
 'premium',
 'payment',
 'year',
 'term',
 'insured',
 'mortgage',
 'remain',
 'insured',
 'principal',
 'balance',
 'certain',
 'adjustment',
 'certain',
 'period',
 'increase',
 'year',
 'annual',
 'premium',
 'insured',
 'mortgage',
 'original',
 'principal',
 'obligation',
 'exceed',
 'remain',
 'principal',
 'balance']

**CountVectorizer**

Every row represents a document in the corpus, every column represents a term in the corpus vocabulary, and every cell holds the frequency count of that term in that document.
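
As a rough illustration of that layout, the sketch below builds a tiny document-term matrix by hand. The three documents are invented, and whitespace splitting stands in for the real tokenizer; `CountVectorizer` does the same bookkeeping but returns a sparse matrix.

```python
# Toy corpus, invented for illustration
docs = [
    "designate post office",
    "post office building post",
    "veteran health program",
]

# Whitespace split stands in for the notebook's tokenizer
tokenized = [d.split() for d in docs]

# One column per distinct term, in alphabetical order
vocab = sorted({term for doc in tokenized for term in doc})

# One row per document; each cell is the count of that term in that document
matrix = [[doc.count(term) for term in vocab] for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
```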

Tuning: analyzer, tokenizer, max_features, max_df, min_df, ngram_range

Explore:

min_df:

- min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
- min_df = 5 means "ignore terms that appear in fewer than 5 documents".

max_df: Attempt to remove heavily used words.

- max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
- max_df = 25 means "ignore terms that appear in more than 25 documents".
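
The float-versus-integer distinction can be sketched in plain Python. The `filter_vocab` helper and the toy documents below are hypothetical; the helper only mimics the document-frequency rule described above (a float is a proportion of documents, an integer is an absolute count) and is not scikit-learn's code.

```python
from collections import Counter

# Toy tokenized corpus, invented for illustration
docs = [
    ["designate", "post", "office"],
    ["designate", "federal", "building"],
    ["veteran", "health", "program"],
    ["veteran", "education", "program"],
]

def filter_vocab(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Floats are proportions of the corpus; integers are absolute counts
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, count in doc_freq.items() if low <= count <= high)

common_terms = filter_vocab(docs, min_df=2)    # appear in at least 2 documents
rare_terms = filter_vocab(docs, max_df=0.25)   # appear in at most 25% of documents
print(common_terms)
print(rare_terms)
```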

In [90]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(tokenizer = tokenizer, max_df = 0.90, max_features = 10000) # max_df=0.90, min_df=10
transformed = vectorizer.fit_transform(X)
print(len(vectorizer.get_feature_names()))

5903


In [91]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer1 = CountVectorizer(tokenizer = tokenizer, max_df = 0.90, max_features = 10000) # max_df=0.90, min_df=10
transformed1 = vectorizer1.fit_transform(X1)
print(len(vectorizer1.get_feature_names()))

10000


**Exploring Stored Words**

Think about the number of words and how to decrease that list!

Lemmatization deserves further consideration, as does filtering out specific words (pronouns?) that are used often.

In [36]:
len(vectorizer.get_feature_names())

10000

In [115]:
vectorizer.get_feature_names

<bound method CountVectorizer.get_feature_names of CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.9, max_features=10000, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=<function tokenizer at 0x1a42c001e0>,
                vocabulary=None)>

In [113]:
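# vec = CountVectorizer().fit(corpus)
# bag_of_words = vec.transform(corpus)
def get_top_words(transformed_corpus, n=None, vec=None):
    # Look words up in the vectorizer that produced transformed_corpus;
    # falls back to the global `vectorizer` (fit on passed bills) when vec is omitted
    vec = vectorizer if vec is None else vec
    sum_words = transformed_corpus.sum(axis=0)  # total count per vocabulary column
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]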
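
Column indices only mean something relative to the vocabulary of the vectorizer that produced the matrix. The sketch below, with made-up vocabularies and totals and a hypothetical `top_terms` helper, shows how pairing column totals with the wrong vocabulary yields plausible-looking but incorrect labels.

```python
def top_terms(column_totals, vocabulary, n=None):
    """Pair each term with the total stored at its column index, highest first."""
    pairs = [(term, column_totals[idx]) for term, idx in vocabulary.items()]
    return sorted(pairs, key=lambda p: p[1], reverse=True)[:n]

# Two vocabularies from two separately fitted vectorizers (invented values)
vocab_a = {"post": 0, "office": 1, "veteran": 2}
vocab_b = {"health": 0, "tax": 1, "energy": 2}

# Column totals of a matrix produced with vocab_b
totals_b = [5, 40, 12]

right = top_terms(totals_b, vocab_b, n=2)   # labels match the columns
wrong = top_terms(totals_b, vocab_a, n=2)   # same counts, wrong words
print(right)
print(wrong)
```

The counts are identical in both results; only the words attached to them differ, which makes this kind of mismatch easy to miss.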

In [114]:
#Top words in PASS
get_top_words(transformed)[:50]

[('service', 761),
 ('certain', 631),
 ('program', 601),
 ('national', 556),
 ('designate', 549),
 ('code', 536),
 ('title', 515),
 ('locate', 495),
 ('facility', 458),
 ('office', 444),
 ('authorize', 427),
 ('security', 421),
 ('postal', 383),
 ('post', 367),
 ('department', 354),
 ('building', 331),
 ('require', 323),
 ('establish', 276),
 ('land', 257),
 ('extend', 253),
 ('health', 247),
 ('improve', 245),
 ('veterans', 244),
 ('year', 243),
 ('public', 236),
 ('fiscal', 209),
 ('street', 208),
 ('relate', 204),
 ('system', 203),
 ('direct', 201),
 ('homeland', 197),
 ('veteran', 191),
 ('development', 188),
 ('appropriation', 172),
 ('water', 170),
 ('law', 168),
 ('agency', 168),
 ('authority', 165),
 ('interior', 164),
 ('reauthorize', 159),
 ('grant', 156),
 ('protection', 150),
 ('assistance', 149),
 ('small', 149),
 ('use', 149),
 ('energy', 146),
 ('new', 146),
 ('fund', 145),
 ('business', 142),
 ('information', 141)]

In [94]:
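#Top words in NOT PASS
# NOTE: the labels in the output below were looked up in vectorizer.vocabulary_ (fit on passed bills),
# not vectorizer1.vocabulary_, so the words and counts are mismatched for transformed1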
get_top_words(transformed1)[:50]

[('eatonville', 8891),
 ('definition', 7444),
 ('prohibit', 4993),
 ('santiago', 4881),
 ('matching', 4551),
 ('work', 4439),
 ('lake', 4267),
 ('mobilization', 3302),
 ('reorganization', 2911),
 ('carryover', 2879),
 ('leahysmith', 2714),
 ('currently', 2654),
 ('restructure', 2199),
 ('bush', 2170),
 ('holloway', 2167),
 ('fulbright', 2154),
 ('positive', 2128),
 ('m', 2050),
 ('residual', 2050),
 ('instance', 1978),
 ('turnaround', 1858),
 ('payable', 1766),
 ('andean', 1705),
 ('sovereign', 1702),
 ('troy', 1628),
 ('determine', 1590),
 ('managerial', 1551),
 ('polygraph', 1542),
 ('cosigner', 1534),
 ('clearance', 1533),
 ('ildefonso', 1530),
 ('spoil', 1514),
 ('asbury', 1506),
 ('longitudinal', 1417),
 ('researcher', 1368),
 ('organic', 1318),
 ('royal', 1312),
 ('knowledge', 1303),
 ('revolve', 1286),
 ('bountiful', 1214),
 ('acre', 1202),
 ('fashion', 1171),
 ('en', 1155),
 ('morning', 1136),
 ('tropical', 1122),
 ('millendermcdonald', 1106),
 ('purchase', 1072),
 ('successful

In [95]:
# import random

# #get ten random words from each

# for i in range(10):
#     word_id = random.randint(0, 2454) #second should be len of cv
#     print(vectorizer.get_feature_names()[word_id])

**Lemmas for Both Passed and Not Passed**

In [16]:
# No install needed: Counter is part of the standard library (collections)



In [24]:
all_lemmas = df.combined_text.apply(tokenizer)



In [25]:
all_lemmas.head()

0    [implementation, recommendation, national, com...
1                                   [reserve, speaker]
2    [high, education, prevent, veteran, contributi...
3    [award, congressional, gold, medal, edward, wi...
4    [haitian, hemispheric, opportunity, partnershi...
Name: combined_text, dtype: object

In [18]:
total_lemmas = []
for doc in all_lemmas:
    total_lemmas.extend(doc)

In [19]:
len(total_lemmas)

1905442

In [20]:
from collections import Counter

all_words = dict(Counter(total_lemmas))

In [21]:
more_words = {key: value for key, value in all_words.items() if value >= 50}
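
The flatten, count, and threshold steps above can also be written compactly with `Counter` and `itertools.chain`. The lemma lists below are toy data, and the threshold of 2 is chosen only so the example has something to filter.

```python
from collections import Counter
from itertools import chain

# Toy lemma lists standing in for all_lemmas (one list per bill)
toy_lemmas = [
    ["veteran", "education", "benefit"],
    ["veteran", "health"],
    ["education", "grant", "veteran"],
]

# Flatten the per-document lists and count every lemma in one pass
counts = Counter(chain.from_iterable(toy_lemmas))

# Keep only lemmas that reach the threshold (the notebook uses 50)
frequent = {word: n for word, n in counts.items() if n >= 2}

# most_common returns (word, count) pairs sorted by count, highest first
top_two = counts.most_common(2)
print(frequent)
print(top_two)
```

`most_common` also covers the sorted review of top words done later, without a separate `sorted` call.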

In [23]:
more_words

{'implementation': 913,
 'recommendation': 671,
 'national': 13463,
 'commission': 3347,
 'terrorist': 579,
 'attack': 306,
 'implement': 1814,
 'reserve': 1431,
 'speaker': 175,
 'high': 3022,
 'education': 8243,
 'prevent': 1786,
 'veteran': 5667,
 'contribution': 1493,
 'benefit': 4577,
 'reduce': 3077,
 'student': 2992,
 'financial': 2753,
 'assistance': 6533,
 'equity': 709,
 'respect': 4020,
 'calculation': 140,
 'receive': 2941,
 'eligible': 2793,
 'year': 6629,
 'factor': 318,
 'eligibility': 1232,
 'formula': 328,
 'represent': 160,
 'estimate': 337,
 'educational': 2317,
 'military': 2821,
 'pay': 3476,
 'deduction': 1805,
 'behalf': 412,
 'active': 945,
 'duty': 9307,
 'armed': 2266,
 'force': 3307,
 'award': 2728,
 'congressional': 1657,
 'gold': 626,
 'medal': 797,
 'edward': 92,
 'william': 136,
 'iii': 220,
 'recognition': 675,
 'service': 17856,
 'nation': 639,
 'african': 156,
 'american': 2327,
 'elect': 393,
 'vote': 579,
 'senate': 1104,
 'haitian': 56,
 'opportunit

**Putting it back together**

In [None]:
df_document_topic.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)

In [None]:
test = pd.concat([df_document_topic, df], axis=1)

In [None]:
test.head()

In [None]:
#Looking at dominant topic contribution by proportion
# import pandas as pd
# import seaborn as sns
# import matplotlib.pyplot as plt

# %matplotlib inline

# plt.figure(figsize=(20,5))

# x1, y1, hue = "top_word", "proportion", "Cong"
# hue_order = ["1", "0"]
# data=test

# (test[x1]
#  .groupby(test[hue])
#  .value_counts(normalize=True)
#  .rename(y1)
#  .reset_index()
#  .pipe((sns.barplot, "data"), x=x1, y=y1, hue=hue).set_title('Bill Topic Percentage Proposals by Congress'))


In [None]:
#Dataframe of lemmas
lemma_df = pd.DataFrame.from_dict(more_words, orient='index', columns=['count'])
lemma_df = lemma_df.reset_index()
lemma_df.head()

In [None]:
#review of the top words
for word in sorted(more_words, key=more_words.get, reverse=True):
    print(word, more_words[word])

**t-SNE**

# Adding Words to MySQL:

Create a separate table for topics and link it to the bills table through the BillID primary key.

https://stackoverflow.com/questions/53518217/adding-topic-distribution-outcome-of-topic-model-to-pandas-dataframe
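
A minimal sketch of that table layout follows, using an in-memory SQLite database as a stand-in for MySQL so it runs without a server. The table and column names are borrowed from the queries earlier in this notebook; the schema details (column types, the foreign key) are assumptions, and only two rows are inserted.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database (assumption for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Bills table keyed by BillID
conn.execute("CREATE TABLE current_bills (BillID TEXT PRIMARY KEY, Cong INTEGER)")

# Topics table: one row per bill, linked back through BillID
conn.execute("""
    CREATE TABLE topics (
        BillID TEXT PRIMARY KEY REFERENCES current_bills (BillID),
        dominant_topic INTEGER
    )
""")

conn.executemany(
    "INSERT INTO current_bills (BillID, Cong) VALUES (?, ?)",
    [("110-HR-1", 110), ("110-HR-1003", 110)],
)

# Records shaped like the output of to_dict(orient='records')
topic_records = [
    {"BillID": "110-HR-1", "dominant_topic": 5},
    {"BillID": "110-HR-1003", "dominant_topic": 7},
]
conn.executemany(
    "INSERT INTO topics (BillID, dominant_topic) VALUES (:BillID, :dominant_topic)",
    topic_records,
)
conn.commit()

# Same join shape as the queries at the top of the notebook
rows = conn.execute("""
    SELECT cb.BillID, tp.dominant_topic
    FROM current_bills AS cb
    JOIN topics AS tp ON cb.BillID = tp.BillID
    ORDER BY cb.BillID
""").fetchall()
print(rows)
```

Keeping topics in their own table means the topic model can be rerun and the table reloaded without touching the bill records.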

In [None]:
#make sure it is the right shape first!
test.shape

In [None]:
test.head()

In [None]:
test2 = test.drop(columns=['IntrDate', 'PLawNum', 'PLawDate'])


In [None]:
test2.head()

In [None]:
final_topics = test2.to_dict(orient = 'records')
final_topics[0]

In [None]:
# from schema import DbSchema

# bills_db = DbSchema(config_final)


In [None]:
# query = db.insert(bills_db.topics_table)
# bills_db.connection.execute(query, final_topics)