# Natural Language Processing - Bill Text Exploration

**This analysis includes combined text of bill titles and summaries**

Transform the raw bill text into feature vectors; all new features are derived from the existing dataset. The notebook is structured as follows:

Data Exploration
- Word Cloud

Vectorizers
- Custom and spaCy tokenizers
- Count vectors as features
- TF-IDF vectors as features
  - Word level
  - N-gram level
  - Character level
- Word embeddings as features
- Text / NLP based features
- Topic models as features

https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

In [1]:
import mysql.connector 
import numpy as np
import pandas as pd
import config_final
import requests

from sodapy import Socrata
import sqlalchemy as db


In [2]:
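from sqlalchemy.exc import ResourceClosedError

def query(q):
    """Run a SQL query and return a DataFrame, or None if the statement returns no rows."""
    try:
        return pd.read_sql_query(q, engine)
    except ResourceClosedError:
        # Raised when the statement produces no result set
        return None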

def query_list(col, table, distinct = True):
    elts = ['SELECT',
            'DISTINCT' if distinct else '',
            col,
            'FROM',
            table]
    query_str = ' '.join(elts)
    df = query(query_str)
    l = df.iloc[:,0].tolist()
    return l

In [3]:
engine = db.create_engine(f'mysql+mysqlconnector://{config_final.user}:{config_final.password}@{config_final.host}/con_bills')

connection = engine.connect()
metadata=db.MetaData()

In [4]:
df = query('SELECT BillID, Title, Summary, PassH, Cong FROM con_bills.current_bills WHERE Cong >= 110')


In [5]:
df.shape

(51067, 5)

In [6]:
df.tail()

| | BillID | Title | Summary | PassH | Cong |
|---|---|---|---|---|---|
| 51062 | 114-S-995 | A bill to establish congressional trade negoti... | Bipartisan Congressional Trade Priorities and ... | 0 | 114 |
| 51063 | 114-S-996 | A bill to facilitate nationwide availability o... | Volunteer Income Tax Assistance (VITA) Act | 0 | 114 |
| 51064 | 114-S-997 | A bill to extend the authorization for the maj... | Department of Veterans Affairs Construction, A... | 0 | 114 |
| 51065 | 114-S-998 | A bill to establish a process for the consider... | American Manufacturing Competitiveness Act of ... | 0 | 114 |
| 51066 | 114-S-999 | A bill to amend the Small Business Act to prov... | Small Business Development Centers Improvement... | 0 | 114 |


**Final Cleaning:**

In [7]:
df['Summary'].isnull().sum()

50

In [8]:
df['Summary'] = df['Summary'].fillna('None')  # assign back instead of a chained inplace fill

In [9]:
df['Summary'].isnull().sum()

0

In [10]:
df['PassH'].value_counts()

0    47042
1     4025
Name: PassH, dtype: int64

In [11]:
blanks = []

for i, billID, title, summary, PassH, Cong in df.itertuples():  # iterate over the DataFrame rows
    if type(summary)==str:            # skip non-string (NaN) values
        if summary.isspace():         # flag whitespace-only summaries (isspace() is False for '')
            blanks.append(i)
                  
len(blanks)

0

**Combine Title and Summary columns:**

In [12]:
df['combined_text'] = df[['Title', 'Summary']].astype(str).apply(' '.join, axis=1)

In [13]:
df.head()

| | BillID | Title | Summary | PassH | Cong | combined_text |
|---|---|---|---|---|---|---|
| 0 | 110-HR-1 | To provide for the implementation of the recom... | Implementing Recommendations of the 9/11 Commi... | 1 | 110 | To provide for the implementation of the recom... |
| 1 | 110-HR-10 | Reserved for Speaker. | | 0 | 110 | Reserved for Speaker. |
| 2 | 110-HR-100 | To amend the Higher Education Act of 1965 to p... | Veterans' Equity in Education Act of 2007 - Am... | 0 | 110 | To amend the Higher Education Act of 1965 to p... |
| 3 | 110-HR-1000 | To award a congressional gold medal to Edward ... | Edward William Brooke III Congressional Gold M... | 0 | 110 | To award a congressional gold medal to Edward ... |
| 4 | 110-HR-1001 | To amend the Haitian Hemispheric Opportunity t... | Amends the Caribbean Basin Economic Recovery A... | 0 | 110 | To amend the Haitian Hemispheric Opportunity t... |


# Topic Modeling

**Split Training and Testing Data**

In [14]:
from sklearn import preprocessing

In [15]:
from sklearn.model_selection import train_test_split

X = df['combined_text']
y = df['PassH']

X_train, X_test, y_train1, y_test1 = train_test_split(X, y)
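
Only about 8% of the bills pass the House (4,025 of 51,067), so a purely random split can shift the class ratio between the training and test sets, and it changes on every run. The following is a minimal sketch, using a made-up toy label vector rather than the real data, of how the `stratify` and `random_state` arguments of `train_test_split` (neither is set in the cell above) keep the split class-balanced and repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for df['combined_text'] / df['PassH']: 8% positives.
X_toy = np.array([f"bill {i}" for i in range(1000)])
y_toy = np.array([1] * 80 + [0] * 920)

# stratify=y_toy preserves the class ratio in both partitions;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())
```

Whether stratification matters for the downstream models here is an open question; it mainly guards against an unlucky split with very few passed bills in the test set.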

In [16]:
X_test.head()

21098    A bill to maintain the free flow of informatio...
34819    To provide for an exchange of land between the...
6620     To provide tax relief for the victims of Hurri...
41499    To direct the Secretary of the Interior to est...
13076    To address global hunger and improve food secu...
Name: combined_text, dtype: object

In [17]:
df['PassH'].head()

0    1
1    0
2    0
3    0
4    0
Name: PassH, dtype: int64

Encode our target column so that it can be used in machine learning models (may not be necessary since the data is already binary)

In [18]:
encoder = preprocessing.LabelEncoder()

y_train = encoder.fit_transform(y_train1)
y_test = encoder.transform(y_test1)  # reuse the mapping fitted on the training labels

In [19]:
y_train

array([0, 0, 0, ..., 0, 1, 0])

## Feature Engineering


**Cleaning Text**

Test both the spaCy tokenizer and a custom tokenizer against the data.

In [20]:
import spacy
from spacy.lang.en import English
# For part of speech tagging
import en_core_web_sm

nlp = English()
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

In [21]:
#Define the spacy tokenizer
spacy_tokenizer = spacy.load('en_core_web_sm', disable =['tagger', 'parser', 'ner'])

In [22]:
import string
import re

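# Bracket and separator characters to strip (raw strings avoid invalid-escape warnings)
replace_with_space = re.compile(r'[/(){}\[\]\|@,;]')

# Anything that is not a letter or whitespace (digits, punctuation, symbols)
just_words = re.compile(r'[^a-zA-Z\s]')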
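
The replacement string passed to `sub` decides whether neighbouring words stay apart. The sketch below applies the two patterns to a made-up sample string: substituting `''` fuses hyphenated and slash-separated words (which would explain vocabulary entries such as `abovetheline`), while substituting `' '` keeps them as separate tokens. The sample text and the variable names `fused` and `spaced` are illustrative only.

```python
import re

replace_with_space = re.compile(r'[/(){}\[\]\|@,;]')
just_words = re.compile(r'[^a-zA-Z\s]')

sample = "oil-saving and/or above-the-line (2007)"  # made-up example text

# Substituting '' deletes the matched character and joins its neighbours.
fused = just_words.sub('', replace_with_space.sub('', sample))

# Substituting ' ' keeps word boundaries; split/join collapses extra spaces.
spaced = just_words.sub(' ', replace_with_space.sub(' ', sample))
spaced = ' '.join(spaced.split())

print(fused.split())
print(spaced.split())
```

Which behaviour is preferable depends on whether compounds like "above-the-line" should count as one feature or three.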


In [23]:
# Punctuation marks to filter out of the token list
punctuations = string.punctuation

# Load the spaCy English model and its stop-word list
nlp = spacy.load('en_core_web_sm')
stop_words = spacy.lang.en.stop_words.STOP_WORDS

def tokenizer(text):

    # Lowercase everything
    lower_text = text.lower()

    # Strip bracket and separator characters (removed outright, not replaced with a space)
    text = replace_with_space.sub('', lower_text)

    # Drop digits, punctuation, and anything else that is not a letter or whitespace
    just_words_text = just_words.sub('', text)

    # Run the spaCy pipeline (the parser and NER components are not needed here)
    mytokens = nlp(just_words_text, disable=['parser', 'ner'])

    # Optional: keep only content words via POS tags
#     mytokens = [word for word in mytokens if word.pos_ in ('NOUN', 'VERB', 'ADJ', 'ADV')]

    # Lemmatize; spaCy v2 returns "-PRON-" for pronouns, so keep their lowercase form instead
    mytokens = [word.lemma_.strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]

    # Remove stop words, punctuation, and empty strings
    mytokens = [word for word in mytokens if word not in stop_words and word not in punctuations]

    return mytokens

    

In [24]:
test_fun = df.iloc[3486]['combined_text']  # select the column by name rather than by position
test_fun

'To amend title 18 of the United States Code to clarify the scope of the child pornography laws, and for other purposes. Enhancing the Effective Prosecution of Child Pornography Act of 2007 - Amends the federal criminal code to: (1) include child pornography activities and the production of such pornography for importation into the United States as predicate crimes for money laundering prosecutions; and (2) define "possess" with respect to crimes of child sexual exploitation and child pornography to include accessing by computer visual depictions of child pornography with the intent to view.'

In [25]:
tokenizer(test_fun)


['amend',
 'title',
 'united',
 'states',
 'code',
 'clarify',
 'scope',
 'child',
 'pornography',
 'law',
 'purpose',
 'enhance',
 'effective',
 'prosecution',
 'child',
 'pornography',
 'act',
 'amend',
 'federal',
 'criminal',
 'code',
 'include',
 'child',
 'pornography',
 'activity',
 'production',
 'pornography',
 'importation',
 'united',
 'states',
 'predicate',
 'crime',
 'money',
 'laundering',
 'prosecution',
 'define',
 'possess',
 'respect',
 'crime',
 'child',
 'sexual',
 'exploitation',
 'child',
 'pornography',
 'include',
 'accessing',
 'computer',
 'visual',
 'depiction',
 'child',
 'pornography',
 'intent',
 'view']

# TFIDF

Hyper-parameters:
    
- min_df
- max_df
- tokenizer
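
As a rough illustration of the first two hyper-parameters: `max_df` drops terms that appear in too large a share of documents, and `min_df` drops terms that appear in too few. `tokenizer` accepts any callable that maps a string to a list of tokens. The sketch below uses a made-up four-document corpus and scikit-learn's default tokenizer, not the custom one defined above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus; "bill" appears in every document.
docs = [
    "bill amend tax code",
    "bill amend health program",
    "bill establish health program",
    "bill rare provision",
]

# max_df=0.9 drops terms found in more than 90% of documents ("bill");
# min_df=2 drops terms found in fewer than 2 documents.
vec = TfidfVectorizer(min_df=2, max_df=0.9)
vec.fit(docs)

print(sorted(vec.vocabulary_))
```

Only `amend`, `health`, and `program` survive both cut-offs here, since each occurs in exactly two of the four documents.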

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vectorizer = TfidfVectorizer(tokenizer = tokenizer, max_df= 0.90)

tf_transformed = tf_vectorizer.fit_transform(X_train)



In [27]:
print(len(tf_vectorizer.get_feature_names_out()))  # get_feature_names() was removed in scikit-learn 1.2

22474


In [39]:
import random

# Print ten random terms from the fitted vocabulary
vocab = tf_vectorizer.get_feature_names_out()

for i in range(10):
    word_id = random.randrange(len(vocab))  # valid indices are 0 .. len(vocab) - 1
    print(vocab[word_id])

director
hometown
micromanage
fashion
oilsave
sparta
indictment
pinyon
bottletype
authoritys


In [29]:
feature_names = tf_vectorizer.get_feature_names_out().tolist()  # plain Python list of terms

In [30]:
feature_names[45:66]

['aboriginal',
 'abort',
 'abortion',
 'abortionrelate',
 'abortionrelated',
 'aboveground',
 'abovereferenced',
 'abovetheline',
 'abraham',
 'abridge',
 'abridgment',
 'abroad',
 'abrogate',
 'abrogation',
 'abrupt',
 'absence',
 'absent',
 'absentee',
 'abshire',
 'absolute',
 'absolutely']

Further Analysis:
https://buhrmann.github.io/tfidf-analysis.html

In [31]:
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

In [41]:
top_tfidf_feats(tf_transformed, feature_names)

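| | feature | tfidf |
|---|---|---|
| 0 | aa | "(0, 8645)\t0.06472262807752036\n (0, 19913)..." |

Note: `top_tfidf_feats` expects a single dense 1-D row of scores. Passing the whole sparse matrix, as in this call, produces the malformed one-row result shown here; `top_feats_in_doc` in the next cell converts one row to a dense array first.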


In [33]:
def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

In [44]:
top_feats_in_doc(tf_transformed, feature_names, 2223)

| | feature | tfidf |
|---|---|---|
| 0 | frescolat | 0.607325 |
| 1 | mga | 0.607325 |
| 2 | suspension | 0.262195 |
| 3 | temporary | 0.252684 |
| 4 | extend | 0.203472 |
| 5 | duty | 0.199491 |
| 6 | harmonized | 0.107374 |
| 7 | tariff | 0.10656 |
| 8 | schedule | 0.104385 |
| 9 | states | 0.068132 |


Let’s see if this topic is represented also in the overall corpus. For this, we will calculate the average tf-idf score of all words across a number of documents (in this case all documents), i.e. the average per column of a tf-idf matrix:

In [35]:
def top_mean_feats(Xtr, features, grp_ids=None, min_tfidf=0.1, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
        identified by indices in grp_ids. '''
    # Test against None explicitly: the truth value of a NumPy index array is ambiguous
    if grp_ids is not None:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()

    # Zero out weak scores so they do not contribute to the mean
    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)
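
`Xtr.toarray()` materializes the full document-term matrix in dense form, which can require several gigabytes for a corpus of this size. The sketch below shows one possible way to compute the same thresholded column means directly on the sparse matrix. The helper name `sparse_thresholded_means` and the toy matrix are hypothetical and not part of the notebook; the approach assumes all scores are non-negative, which holds for tf-idf.

```python
import numpy as np
from scipy import sparse

def sparse_thresholded_means(Xtr, min_tfidf=0.1):
    """Column means of Xtr after zeroing entries below min_tfidf, without densifying."""
    X = sparse.csr_matrix(Xtr, copy=True)   # work on a copy; leave the input untouched
    X.data[X.data < min_tfidf] = 0          # only stored (non-zero) entries need checking
    return np.asarray(X.mean(axis=0)).ravel()

# Tiny made-up matrix standing in for tf_transformed.
toy = sparse.csr_matrix(np.array([[0.5, 0.05, 0.0],
                                  [0.3, 0.0,  0.2]]))
means = sparse_thresholded_means(toy)

# Dense reference computation, mirroring top_mean_feats.
dense = toy.toarray()
dense[dense < 0.1] = 0
print(np.allclose(means, dense.mean(axis=0)))
```

The resulting 1-D array could be passed to `top_tfidf_feats` in place of `tfidf_means`.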

In [36]:
#we provide a list of row indices which pick out the particular documents we want to inspect. 
#Providing ‘None’ indicates, somewhat counterintuitively, that we’re interested in all documents.

top_mean_feats(tf_transformed, feature_names)

| | feature | tfidf |
|---|---|---|
| 0 | health | 0.017477 |
| 1 | duty | 0.015775 |
| 2 | program | 0.013654 |
| 3 | measure | 0.013497 |
| 4 | security | 0.013026 |
| 5 | revenue | 0.012812 |
| 6 | internal | 0.012625 |
| 7 | service | 0.012597 |
| 8 | tax | 0.012392 |
| 9 | national | 0.011645 |


In [37]:
def top_feats_by_class(Xtr, y, features, min_tfidf=0.1, top_n=25):
    ''' Return a list of dfs, where each df holds top_n features and their mean tfidf value
        calculated across documents with the same class label. '''
    dfs = []
    labels = np.unique(y)
    for label in labels:
        ids = np.where(y==label)
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs

In [38]:
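# Labels must align row-for-row with tf_transformed, so pass y_train rather than the
# full y (which raised "IndexError: index (51066) out of range"). PassH is binary,
# so the returned list has two entries: index 0 (did not pass) and index 1 (passed).
df1 = top_feats_by_class(tf_transformed, y_train, feature_names)[1]
df1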

In [None]:
all_df = top_feats_by_class(tf_transformed, y_train, feature_names)  # y_train matches the fitted rows

**Save the vectorizer!**

https://stackoverflow.com/questions/29788047/keep-tfidf-result-for-predicting-new-content-using-scikit-for-python
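
A minimal sketch of one way to persist a fitted vectorizer, here with the standard-library `pickle` module. The corpus, the file name `tfidf_vectorizer.pkl`, and the temporary directory are placeholders; the example uses scikit-learn's default tokenizer to stay self-contained.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["amend the tax code", "extend the health program"]  # made-up corpus
vec = TfidfVectorizer().fit(docs)

# Hypothetical file name; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "tfidf_vectorizer.pkl")
with open(path, "wb") as f:
    pickle.dump(vec, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored vectorizer maps new text into the same feature space.
new_doc = ["a bill to amend the health code"]
same = np.allclose(vec.transform(new_doc).toarray(),
                   restored.transform(new_doc).toarray())
print(same)
```

One caveat for this notebook: pickle stores a custom `tokenizer` function by reference to its name, so the function must be defined or importable in the session that loads the file.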

# Adding topics to dataframe:

https://stackoverflow.com/questions/53518217/adding-topic-distribution-outcome-of-topic-model-to-pandas-dataframe