# Bills TF-IDF

This Python notebook walks through the TF-IDF analysis of the bills filed in the House of Representatives during the 18th Congress.

## Preliminaries

Install the following libraries before running the cells below:
- `pandas`
- `nltk`
- `tqdm`
- `sklearn` (published on `pip` as `scikit-learn`)

All libraries were installed via `pip` using the command `pip install <library-name>`. If you have `pip` on your machine, you can install each of the libraries listed above the same way.
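
As a quick sanity check before running the notebook, the sketch below reports which of the four libraries cannot be imported. The helper name `missing_packages` and the mapping are illustrative and not part of the original notebook. Note that `sklearn` is the import name, while the package is published on `pip` as `scikit-learn`.

```python
import importlib.util

# Import name -> pip package name (only sklearn differs).
REQUIRED = {
    "pandas": "pandas",
    "nltk": "nltk",
    "tqdm": "tqdm",
    "sklearn": "scikit-learn",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of the libraries that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]

print("Missing:", missing_packages() or "none")
```

Anything listed as missing can then be installed with `pip install <library-name>`.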

# PART 1: Load the DataFrame

Load the CSV file into a pandas DataFrame and print one entry to verify that the file was read correctly.

In [1]:
import pandas as pd

df = pd.read_csv('18th_hor_bills_dataset_2.csv')

In [2]:
df.iloc[1]

ID                                                                                HB00002
Full Title                              AN ACT CREATING THE DEPARTMENT OF OVERSEAS FIL...
Author Count                                                                            3
is_partylist                                                                            0
party_1-pacman                                                                          0
                                                              ...                        
ref_defeat_covid-19_ad-hoc_committee                                                    0
ref_the_whole_house                                                                     0
ref_mindanao_affairs                                                                    0
ref_west_philippine_sea                                                                 0
approved                                                                                1
Name: 1, L

# PART 2: Load the Stop Words and Extract the Bill Title

A list of English stop words is available through the `nltk` library. Download the `stopwords` corpus first, then load the list into the variable `stops`.

In [3]:
#Run this cell only if 'stopwords' has not been downloaded or if the succeeding cell throws an error
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\james\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from nltk.corpus import stopwords
stops = stopwords.words('english')
print(stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Extract the `Full Title` column of the bills into a separate Series to prepare it for TF-IDF analysis.

In [5]:
df_full_title = df['Full Title'].copy(deep = True)
df_full_title.head()

0    AN ACT INSTITUTIONALIZING A NATIONAL VALUES, E...
1    AN ACT CREATING THE DEPARTMENT OF OVERSEAS FIL...
2    AN ACT PROVIDING FOR A NATIONAL PROGRAM TO SUP...
3    AN ACT CREATING THE EMERGENCY RESPONSE DEPARTM...
4    AN ACT INSTITUTIONALIZING MICROFINANCE PROGRAM...
Name: Full Title, dtype: object

To prepare the bills for TF-IDF, we now remove the stop words and punctuation from each full title.

Let us define a function `remove_punctuation` that strips punctuation marks from a string while keeping all other characters, and test it on a sample string.

In [6]:
import string

def remove_punctuation(token):
    # Delete every character listed in string.punctuation
    return token.translate(str.maketrans('', '', string.punctuation))

In [7]:
test_str_1 = 'Shout out to my ex, you are really quite the man; you made my heart break, and that made me who I\'m'
remove_punctuation(test_str_1)

'Shout out to my ex you are really quite the man you made my heart break and that made me who Im'
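
One consequence of deleting punctuation outright is that hyphenated or apostrophised tokens are fused rather than split, as `I'm` becoming `Im` above already suggests. A minimal sketch, reusing the same `str.maketrans` call on a made-up title fragment:

```python
import string

def remove_punctuation(token):
    # Same approach as the notebook: drop every ASCII punctuation character.
    return token.translate(str.maketrans('', '', string.punctuation))

# Made-up fragment, used only to illustrate the behaviour.
sample = "COVID-19 RESPONSE (AD-HOC), 2020"
cleaned = remove_punctuation(sample)
print(cleaned)  # COVID19 RESPONSE ADHOC 2020
```

If the two halves of a hyphenated term should stay separate tokens, replacing punctuation with a space instead of deleting it would be one possible alternative.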

Let us define a function `remove_stop_words` that does the following:
<ul>
    <li> Split the bill title into a list of tokens (words) </li>
    <li> Remove every token that appears in the stop-word list (the variable <code>stops</code> defined above) </li>
    <li> Combine the remaining words into a single string separated by spaces, then strip its punctuation with <code>remove_punctuation</code> </li>
</ul>
We can then test the new function on a few entries before applying it to every bill.

In [8]:
def remove_stop_words(bill_string, stopwords):
    lis = bill_string.split() #split the string according to spaces
    to_return = [] #define a new list of words
    
    for i in lis:
        if i.lower() not in stopwords:
            to_return.append(i)
    
    return remove_punctuation(" ".join(to_return))

In [9]:
remove_stop_words(df_full_title.iloc[1], stops)

'ACT CREATING DEPARTMENT OVERSEAS FILIPINO WORKERS OFW FOREIGN EMPLOYMENT DEFINING POWERS FUNCTIONS APPROPRIATING FUNDS THEREFOR RATIONALIZING ORGANIZATION FUNCTIONS GOVERNMENT AGENCIES RELATED MIGRATION PURPOSES'

In [10]:
remove_stop_words(df_full_title.iloc[103], stops)

'ACT PROVIDING COMPREHENSIVE CIVIL REGISTRATION SYSTEM'
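
Note that `remove_stop_words` filters on the raw tokens and strips punctuation only afterwards, so a stop word that carries trailing punctuation (for example `FOR,`) is not matched by the list. The sketch below is a hypothetical variant, not the notebook's function, that strips punctuation from each token first. `demo_stops` is a tiny made-up stop-word set standing in for `stops`.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses nltk's English list stored in `stops`.
demo_stops = {"an", "the", "of", "and", "for"}

def strip_then_filter(title, stop_words):
    """Remove punctuation from each token first, then drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = (token.translate(table) for token in title.split())
    return " ".join(t for t in tokens if t and t.lower() not in stop_words)

# Made-up title in which two stop words are followed by a comma.
demo_title = "AN ACT PROVIDING FUNDS FOR, AND SUPPORT OF, THE PROGRAM"
print(strip_then_filter(demo_title, demo_stops))
```

In this notebook the later `TfidfVectorizer(stop_words='english')` call filters stop words again, so the effect on the final features is probably small.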

The first run of the function over all bills raised an error at entry 9278, which corresponds to 'HB09290'. The full title of this house bill is an empty string. Cross-checking with the website showed that the website itself does not include the full title of this bill. Fortunately, the full title can be read from the bill's PDF: https://hrep-website.s3.ap-southeast-1.amazonaws.com/legisdocs/basic_18/HB09290.pdf

The title is therefore hardcoded into its corresponding entry, 9278, before running the function over all bills.

In [11]:
df_full_title.iloc[9278] = "AN ACT TO IMPROVE ACCESS TO PRESCHOOL, PRIMARY, AND SECONDARY EDUCATION OF HOMELESS CHILDREN AND YOUTH"
df_full_title.iloc[9278]

'AN ACT TO IMPROVE ACCESS TO PRESCHOOL, PRIMARY, AND SECONDARY EDUCATION OF HOMELESS CHILDREN AND YOUTH'
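
Rather than discovering such an entry through an exception, blank titles can be located before the loop runs. The helper below is an illustrative sketch (`find_blank_titles` is not part of the notebook). It treats `None`, `NaN` and whitespace-only strings as blank and should work on any iterable of titles, including `df_full_title`.

```python
import math

def find_blank_titles(titles):
    """Return positions whose title is missing (None/NaN) or only whitespace."""
    blank = []
    for position, title in enumerate(titles):
        is_nan = isinstance(title, float) and math.isnan(title)
        is_empty = isinstance(title, str) and not title.strip()
        if title is None or is_nan or is_empty:
            blank.append(position)
    return blank

# Made-up sample: positions 1 and 2 are blank.
sample = ["AN ACT CREATING A DEPARTMENT", float("nan"), "", "AN ACT PROVIDING FUNDS"]
print(find_blank_titles(sample))
```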

We now apply the function to all bills and check the result by sampling a few entries.

In [12]:
from tqdm.notebook import tqdm

for i in tqdm(range(len(df_full_title))):
    df_full_title.iloc[i] = remove_stop_words(df_full_title.iloc[i], stops)

  0%|          | 0/10840 [00:00<?, ?it/s]

In [13]:
df_full_title.iloc[9278]

'ACT IMPROVE ACCESS PRESCHOOL PRIMARY SECONDARY EDUCATION HOMELESS CHILDREN YOUTH'

In [14]:
df_full_title.iloc[34]

'ACT ESTABLISHING BENHAM RISE RESEARCH DEVELOPMENT INSTITUTE PROVIDING FUNDS THEREFOR PURPOSES'

# PART 3: Extract the Bag of Words

A <i>bag of words</i> is defined here as the set of unique words generated from the list of full titles. The same idea extends to `n`-grams, so the bag can be extracted for any chosen range of `n`.

Let us define a dictionary of `n`-grams that can be initialized over a range. These can be tweaked whenever we want to adjust the range of `n`-grams that we want to extract from the data.

In [15]:
#you can tweak lower_limit and upper_limit depending on the number of n-grams that you want to be extracted
lower_limit = 2
upper_limit = 4

bag_of_words_per_n_gram = {x: [] for x in range(lower_limit, upper_limit + 1)}

Let us define a function `extract_n_grams` that extracts the `n`-grams from a given string and adds them to `bag_of_words_per_n_gram`, the dictionary of `n`-grams, for every `n` between the lower and upper limits (given by the `lower_limit` and `upper_limit` variables, respectively).

In [16]:
from nltk.util import ngrams

def extract_n_grams(title, bag, lower, upper):
    '''a function that extracts n-grams from the lower to the upper limit.
    Parameters of the function include:
    =>title - input string to be extracted
    =>bag - the bag of words
    =>lower - lower limit
    =>upper - upper limit'''
    
    spl = title.split()
    for i in range(lower, upper+1):
        temp = set(ngrams(spl, i))
        bag[i] = list(set(bag[i]).union(temp))

    return
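
For intuition, the n-grams of a title are just its sliding windows of `n` consecutive tokens. The pure-Python sketch below mimics that behaviour without `nltk`. `sliding_ngrams` is a hypothetical name used only here, and the title is one of the cleaned titles shown earlier.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

title = "ACT PROVIDING COMPREHENSIVE CIVIL REGISTRATION SYSTEM"
tokens = title.split()

for n in range(2, 5):  # same range as lower_limit=2, upper_limit=4
    grams = sliding_ngrams(tokens, n)
    print(n, len(grams), " ".join(grams[0]))
```

A title with six tokens yields five 2-grams, four 3-grams and three 4-grams; a title shorter than `n` yields none.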

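We now iterate the function over all bills. (The cell is currently disabled by being wrapped in a string literal, which is why its output is just the quoted source.)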

In [17]:
'''
for i in tqdm(range(len(df_full_title))):
    extract_n_grams(df_full_title.iloc[i], bag_of_words_per_n_gram, lower_limit, upper_limit)
'''

'\nfor i in tqdm(range(len(df_full_title))):\n    extract_n_grams(df_full_title.iloc[i], bag_of_words_per_n_gram, lower_limit, upper_limit)\n'

To check how many unique `n`-grams were identified for each `n` in the range, you may run the cell below (also currently disabled):

In [18]:
'''for i in range(lower_limit, upper_limit + 1):
    bag_of_words_per_n_gram[i] = list(map(lambda x: " ".join(x), bag_of_words_per_n_gram[i]))
    print("There are {} unique {}-grams identified".format(len(bag_of_words_per_n_gram[i]), i))'''

'for i in range(lower_limit, upper_limit + 1):\n    bag_of_words_per_n_gram[i] = list(map(lambda x: " ".join(x), bag_of_words_per_n_gram[i]))\n    print("There are {} unique {}-grams identified".format(len(bag_of_words_per_n_gram[i]), i))'

# PART 4: Perform TF-IDF

For this portion, we will be using the `TfidfVectorizer` class from `sklearn.feature_extraction.text`. The relevant parameters are:
- `vocabulary`: The bag of words extracted in the previous part, given by `bag_of_words_per_n_gram`
- `stop_words`: The stop-word list to apply; the cell below passes `'english'`
- `ngram_range`: The range of n-grams, given by the `lower_limit` and `upper_limit`

To pass a `vocabulary` to `TfidfVectorizer`, we first need to combine the dictionary `bag_of_words_per_n_gram` into a single list. That cell is currently disabled, and the vectorizer below instead caps the vocabulary with `max_features=3000`.

In [19]:
'''
vocabs = []

for i in range(lower_limit, upper_limit+1):
    vocabs += list(set(bag_of_words_per_n_gram[i]))

vocabs = [i.lower() for i in vocabs]

#transform it into a set to ensure that there are no duplicates
vocabs = list(set(vocabs))

print(len(vocabs))
'''

'\nvocabs = []\n\nfor i in range(lower_limit, upper_limit+1):\n    vocabs += list(set(bag_of_words_per_n_gram[i]))\n\nvocabs = [i.lower() for i in vocabs]\n\n#transform it into a set to ensure that there are no duplicates\nvocabs = list(set(vocabs))\n\nprint(len(vocabs))\n'

Perform the actual TF-IDF:

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=3000, stop_words = 'english', ngram_range=(lower_limit, upper_limit))
tfs = tfidf.fit_transform(df_full_title)

In [21]:
feature_names = tfidf.get_feature_names_out()
corpus_index = list(range(len(df_full_title)))
rows, cols = tfs.nonzero()
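
To make the scores in `tfs` less of a black box, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies under its documented defaults (`smooth_idf=True`, `norm='l2'`): the inverse document frequency is ln((1 + n) / (1 + df)) + 1 and each row is scaled to unit length. It works on single words rather than n-grams and ignores `max_features` and stop words, so it illustrates the formula only. `tfidf_rows` and the two short titles are made up for the example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF rows using idf = ln((1 + n) / (1 + df)) + 1."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenised for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenised:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["act creating department", "act providing funds"])
for term, weight in zip(vocab, rows[0]):
    print(f"{term:12s} {weight:.3f}")
```

In the first row, `act` receives a lower weight than `creating` because it occurs in both titles, which is the behaviour that makes boilerplate phrases less influential than distinctive ones.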

# PART 5: Preparing the Results for Modelling

To better understand the results of the TF-IDF analysis, we prepare them for modelling. The following code cells do the following:
- Convert the variable `tfs`, which contains the actual results of the TF-IDF analysis, into a pandas DataFrame. `tfs.toarray()` returns the TF-IDF matrix as a dense array, and the column names come from the `get_feature_names_out` method
- Concatenate the newly created DataFrame with `df` so that each bill's original features sit alongside its TF-IDF scores. The combined DataFrame keeps the original columns (`ID`, `Full Title`, ...) followed by one column per extracted n-gram

In [22]:
df1 = pd.DataFrame(tfs.toarray(), columns = feature_names)

df1.head()

Unnamed: 0,10 republic,10 republic act,100 beds,1000 beds,10632 republic,10632 republic act,10632 republic act 10656,10656 republic,10656 republic act,10656 republic act 10923,...,zamboanga sibugay,zone appropriating,zone appropriating funds,zone appropriating funds therefor,zone authority,zone authority appropriating,zone authority appropriating funds,zone freeport,zone providing,ꞌan act
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Runtime: about 1.5 minutes.

In [23]:
final = pd.concat([df, df1], axis=1)

In [24]:
#do not run this cell yet! We have to find ways to export this without reaching a 1 GB file size
# final.to_csv('18th_hor_bills_final.csv', encoding='utf-8', chunksize=250000)
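
One possible way around the file-size problem, offered as a suggestion rather than something the notebook does, is compression: a table that is mostly zeros compresses very well. The stdlib-only sketch below builds a small, made-up table of mostly-zero rows and compares the raw and gzip-compressed sizes. pandas' `to_csv` accepts a `compression` argument (for example `compression='gzip'`) that would apply the same idea to `final`.

```python
import csv
import gzip
import io

# Made-up stand-in for the TF-IDF table: 200 rows x 300 columns,
# with a single non-zero value per row.
buffer = io.StringIO()
writer = csv.writer(buffer)
for row_index in range(200):
    row = [0.0] * 300
    row[row_index % 300] = 0.5
    writer.writerow(row)

raw = buffer.getvalue().encode("utf-8")
packed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
```

How much the real export would shrink depends on the data, so this is best treated as something to try rather than a guaranteed fix.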

# PART 6: The Actual Modelling Part

In [25]:
final['approved'].value_counts()

0    7254
1    3586
Name: approved, dtype: int64

In [26]:
final = final.drop(['ID', 'Full Title'], axis=1)

In [27]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
small = shuffle(final)
train_set, test_set, train_stat, test_stat = train_test_split(small.drop('approved', axis=1), small['approved'], test_size=1/5)

In [28]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(train_set)
train_set = scaler.transform(train_set)
test_set = scaler.transform(test_set)
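
The scaler is fitted on the training split only and then applied to both splits, presumably so that no information from the test bills leaks into the preprocessing. The sketch below shows the underlying arithmetic, z = (x - mean) / std, in plain Python on made-up numbers. It uses the population standard deviation, which is what `StandardScaler` uses.

```python
import statistics

train = [1.0, 3.0, 5.0, 7.0]   # made-up training values for one feature
test = [4.0, 9.0]              # made-up test values for the same feature

# Statistics come from the training values only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)

def scale(values):
    """Standardise values with the training mean and standard deviation."""
    return [(v - mean) / std for v in values]

train_scaled = scale(train)
test_scaled = scale(test)
print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction, while the scaled test values generally do not, which is expected.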

In [29]:
from sklearn.decomposition import PCA
# Make an instance of the Model
pca = PCA(.95)
pca.fit(train_set)

In [30]:
pca.n_components_

1142
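
Passing a fraction such as `.95` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction, which is how the 3276 input columns shrink to 1142 components here. A toy illustration of that selection rule, using made-up variance ratios (`components_needed` is a hypothetical helper, not a scikit-learn function):

```python
from itertools import accumulate

def components_needed(variance_ratios, threshold):
    """Smallest k such that the first k ratios sum to more than threshold."""
    for k, total in enumerate(accumulate(variance_ratios), start=1):
        if total > threshold:
            return k
    return len(variance_ratios)

# Made-up explained-variance ratios, sorted in decreasing order and summing to 1.0.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125]
print(components_needed(ratios, 0.95))  # 5
```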

In [31]:
train_set = pca.transform(train_set)
test_set = pca.transform(test_set)

In [32]:
'''from sklearn.linear_model import LogisticRegressionCV
model = LogisticRegressionCV(solver='saga', cv=5, max_iter=5000, n_jobs=4)
model.fit(train_set, train_stat)'''

Instead of retraining, you can simply load the pickled model.

In [81]:
import pickle
with open('logregcvmodel.sav', 'rb') as f:  # closes the file after loading
    model = pickle.load(f)

In [49]:
model.score(test_set,test_stat)

0.8242619926199262

In [50]:
model.scores_

{1: array([[0.78847262, 0.82074928, 0.82074928, 0.81440922, 0.81095101,
         0.80864553, 0.80806916, 0.80806916, 0.80806916, 0.80806916],
        [0.78674352, 0.82536023, 0.82478386, 0.81325648, 0.80634006,
         0.80691643, 0.80691643, 0.80691643, 0.80691643, 0.80691643],
        [0.78373702, 0.81141869, 0.80622837, 0.80046136, 0.79873126,
         0.79930796, 0.79757785, 0.79757785, 0.79757785, 0.79757785],
        [0.79296424, 0.82929642, 0.82525952, 0.80968858, 0.80103806,
         0.79930796, 0.79930796, 0.79930796, 0.79930796, 0.79930796],
        [0.77508651, 0.81314879, 0.81314879, 0.81545559, 0.8160323 ,
         0.81372549, 0.81545559, 0.81430219, 0.81430219, 0.81430219]])}
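
`scores_` holds one row per cross-validation fold (5 here, matching `cv=5` in the commented-out training cell) and one column per candidate regularisation strength `C` (10 by default). Averaging each column shows which candidate did best across folds. The sketch below copies the values printed above into plain lists; the variable names are illustrative.

```python
# Values copied from the model.scores_ output above:
# rows are folds, columns are candidate values of C.
fold_scores = [
    [0.78847262, 0.82074928, 0.82074928, 0.81440922, 0.81095101,
     0.80864553, 0.80806916, 0.80806916, 0.80806916, 0.80806916],
    [0.78674352, 0.82536023, 0.82478386, 0.81325648, 0.80634006,
     0.80691643, 0.80691643, 0.80691643, 0.80691643, 0.80691643],
    [0.78373702, 0.81141869, 0.80622837, 0.80046136, 0.79873126,
     0.79930796, 0.79757785, 0.79757785, 0.79757785, 0.79757785],
    [0.79296424, 0.82929642, 0.82525952, 0.80968858, 0.80103806,
     0.79930796, 0.79930796, 0.79930796, 0.79930796, 0.79930796],
    [0.77508651, 0.81314879, 0.81314879, 0.81545559, 0.8160323,
     0.81372549, 0.81545559, 0.81430219, 0.81430219, 0.81430219],
]

# Mean accuracy per candidate C, then the position of the best one.
mean_per_c = [sum(column) / len(column) for column in zip(*fold_scores)]
best_index = max(range(len(mean_per_c)), key=mean_per_c.__getitem__)
print(best_index, round(mean_per_c[best_index], 4))
```

The second candidate (index 1) has the highest mean accuracy, at roughly 0.82, and the scores flatten out for the larger candidates.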

In [52]:
predictions = model.predict(test_set)

In [53]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(test_stat, predictions)

TN, FP, FN, TP = cm.ravel()

print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

accuracy =  (TP+TN) /(TP+FP+TN+FN)

print('Accuracy of the binary classification = {:0.3f}'.format(accuracy))

True Positive(TP)  =  445
False Positive(FP) =  93
True Negative(TN)  =  1342
False Negative(FN) =  288
Accuracy of the binary classification = 0.824
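
Accuracy alone can hide how the two classes fare, especially since unapproved bills outnumber approved ones roughly two to one. From the four counts printed above, precision, recall and F1 for the approved class follow by simple arithmetic, as does the accuracy of always predicting "not approved":

```python
TP, FP, TN, FN = 445, 93, 1342, 288  # counts printed above

precision = TP / (TP + FP)   # of bills predicted approved, the share that were
recall = TP / (TP + FN)      # of approved bills, the share that were found
f1 = 2 * precision * recall / (precision + recall)
baseline = (TN + FP) / (TP + FP + TN + FN)  # always predicting class 0

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} baseline={baseline:.3f}")
```

Recall comes out near 0.61, so roughly four in ten approved bills in this test split are missed, and the majority-class baseline of about 0.66 puts the 0.824 accuracy in context.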


In [77]:
test_stat

8667    0
3036    0
4449    0
5664    1
8388    0
       ..
9161    0
6096    0
4860    0
7379    0
7547    0
Name: approved, Length: 2168, dtype: int64

In [78]:
model.predict_proba(test_set)[:,1]

array([0.08129334, 0.20051739, 0.14112781, ..., 0.09307892, 0.73290162,
       0.2420878 ])

In [75]:
from sklearn.metrics import brier_score_loss
brier_score = brier_score_loss(test_stat, model.predict_proba(test_set)[:,1])
brier_score

0.12771165185479205

In [76]:
from sklearn.metrics import log_loss
loss = log_loss(test_stat, model.predict_proba(test_set)[:,1])
loss

0.4052578552509612
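
Both metrics score the predicted probabilities rather than the hard labels. As a reminder of what is being computed, the plain-Python sketch below evaluates them on four made-up predictions: the Brier score is the mean squared difference between probability and outcome, and log loss is the mean negative log-likelihood. Lower is better for both.

```python
import math

y_true = [0, 0, 1, 1]            # made-up labels
p_pred = [0.1, 0.4, 0.8, 0.3]    # made-up predicted probabilities of class 1

# Brier score: mean squared error between probability and outcome.
brier = sum((p - y) ** 2 for p, y in zip(p_pred, y_true)) / len(y_true)

# Log loss: mean negative log-likelihood of the true labels.
log_loss_value = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for p, y in zip(p_pred, y_true)
) / len(y_true)

print(round(brier, 4), round(log_loss_value, 4))
```

The last made-up prediction (0.3 for a true positive) dominates both scores, which shows how confident mistakes are penalised.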

In [38]:
small.drop('approved',axis=1).columns

Index(['Author Count', 'is_partylist', 'party_1-pacman', 'party_a teacher',
       'party_aambis-owa', 'party_abang lingkod', 'party_abono',
       'party_act-cis', 'party_act-teachers', 'party_agap',
       ...
       'zamboanga sibugay', 'zone appropriating', 'zone appropriating funds',
       'zone appropriating funds therefor', 'zone authority',
       'zone authority appropriating', 'zone authority appropriating funds',
       'zone freeport', 'zone providing', 'ꞌan act'],
      dtype='object', length=3276)

In [39]:
model.n_features_in_

1142

In [40]:
pca_cols = small.drop('approved', axis=1).columns
strong_rel = pd.DataFrame(pca.components_, columns=pca_cols)
strong_rel

Unnamed: 0,Author Count,is_partylist,party_1-pacman,party_a teacher,party_aambis-owa,party_abang lingkod,party_abono,party_act-cis,party_act-teachers,party_agap,...,zamboanga sibugay,zone appropriating,zone appropriating funds,zone appropriating funds therefor,zone authority,zone authority appropriating,zone authority appropriating funds,zone freeport,zone providing,ꞌan act
0,0.114401,0.100202,0.028969,0.039771,0.046010,0.039054,0.036009,0.063906,0.037448,0.042447,...,0.000642,-0.001110,-0.001110,-0.001087,0.007771,0.009985,0.009985,0.000799,-0.000615,-0.001051
1,0.000456,-0.000998,0.001240,-0.000237,0.001918,0.002554,-0.000149,-0.003343,-0.002213,0.004736,...,-0.000615,-0.000653,-0.000653,-0.000648,-0.000281,-0.000192,-0.000192,-0.000282,-0.000254,-0.000176
2,0.002296,0.001529,0.000503,0.000885,0.000968,0.000633,0.000773,0.000284,0.000262,0.001499,...,-0.000292,-0.000496,-0.000496,-0.000491,-0.000024,0.000062,0.000062,-0.000186,-0.000212,-0.000249
3,-0.048775,-0.048469,-0.024810,-0.030233,-0.037736,-0.031457,-0.028824,-0.023791,-0.012221,-0.033591,...,0.002181,-0.000946,-0.000946,-0.000939,-0.007816,-0.010028,-0.010028,-0.001912,-0.000208,-0.000181
4,-0.003492,-0.001548,-0.002286,0.003718,-0.003389,-0.003723,-0.003060,-0.009880,-0.002160,-0.003603,...,-0.000572,-0.000718,-0.000718,-0.000716,-0.001237,-0.001557,-0.001557,-0.000521,-0.000179,0.000116
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1137,0.000456,-0.000435,0.050618,-0.019117,-0.007898,-0.017610,-0.000049,-0.003585,0.019729,-0.023992,...,0.004021,0.013596,0.013596,0.013338,-0.028250,-0.016412,-0.016412,0.015105,-0.014750,-0.011140
1138,-0.001914,0.002197,-0.053485,-0.003568,-0.025716,0.029120,-0.065895,0.032444,0.011080,0.029750,...,-0.003113,-0.006317,-0.006317,-0.007541,0.019823,-0.000159,-0.000159,-0.007450,0.004226,-0.006240
1139,0.000508,-0.004019,0.031213,-0.025826,-0.061451,0.010958,-0.030905,0.038779,0.017450,-0.033143,...,0.013643,-0.001714,-0.001714,-0.001422,-0.003346,0.003645,0.003645,0.008112,0.006423,-0.004060
1140,-0.000229,0.000121,0.034352,0.046426,-0.022764,-0.008468,-0.011415,0.005593,0.027247,-0.019482,...,0.008886,0.012267,0.012267,0.010022,0.022901,0.002295,0.002295,0.009711,-0.017725,0.000083


In [41]:
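import numpy as np

def get_max_column(row):
    # Skip anything that is not a row of loadings
    if type(row) is float:
        return
    # Position of the loading with the largest magnitude
    max_i = np.argmax(row.abs())
    # Positional access via .iloc; plain row[max_i] relies on a deprecated integer fallback
    return (pca_cols[max_i], row.iloc[max_i])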

contributions = strong_rel.apply(get_max_column, axis=1)
contributions


0                   (Author Count, 0.11440066447174617)
1                       (act 10923, 0.1372788123611265)
2                    (11 12 public, 0.1465452997657302)
3       (judiciary reorganization, 0.12882047050415404)
4                 (building grant, 0.16686654219698946)
                             ...                       
1137                     (act 2016, 0.1043683004787834)
1138                  (ref_health, -0.1245832064352632)
1139                 (known civil, 0.12879892865476758)
1140     (government procurement, -0.13213510499375436)
1141          (overseas filipinos, -0.1064238353044729)
Length: 1142, dtype: object
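
`get_max_column` picks, for every principal component, the original column with the largest absolute loading. Note that these values are PCA loadings, which describe how each component is built from the original columns; they are not the logistic-regression coefficients themselves. A toy version of the selection on made-up loadings (`strongest_column` is a hypothetical name):

```python
columns = ["Author Count", "is_partylist", "zone authority"]  # a few real column names
component = [0.02, -0.31, 0.12]                               # made-up loadings

def strongest_column(names, loadings):
    """Return (name, loading) for the loading with the largest magnitude."""
    index = max(range(len(loadings)), key=lambda i: abs(loadings[i]))
    return names[index], loadings[index]

print(strongest_column(columns, component))  # ('is_partylist', -0.31)
```

The sign is kept in the result, so a negative loading with a large magnitude still wins over a smaller positive one.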

In [42]:
out = pd.DataFrame(contributions.tolist(), columns=['factor','coef'])
out

Unnamed: 0,factor,coef
0,Author Count,0.114401
1,act 10923,0.137279
2,11 12 public,0.146545
3,judiciary reorganization,0.128820
4,building grant,0.166867
...,...,...
1137,act 2016,0.104368
1138,ref_health,-0.124583
1139,known civil,0.128799
1140,government procurement,-0.132135


In [43]:
abs_values = strong_rel.abs().sum().sort_values(ascending=False)

In [44]:
abs_values

party_an waray             22.900045
national health            22.880870
act strengthening          22.838235
cebu known                 22.728858
located barangay           22.705563
                             ...    
ni ani kita store           1.625504
ni ani                      1.625504
ani kita store barangay     1.625504
ani kita                    1.625504
Author Count                1.347907
Length: 3276, dtype: float64

In [45]:
abs_values.to_csv('greatest_abs_value.csv', encoding='utf-8')

In [46]:
non_abs_values = strong_rel.sum().sort_values(ascending=False)
non_abs_values

road network               4.013970
ref_visayas_development    3.430597
pandemic purposes          3.143547
ref_west_philippine_sea    3.032071
national policy            2.499424
                             ...   
presently known           -2.050348
act authorizing           -2.057190
citizen service           -2.116261
government agencies       -2.157544
act defining              -2.309154
Length: 3276, dtype: float64

In [47]:
non_abs_values.to_csv('non_abs_values.csv', encoding='utf-8')

In [48]:
out.to_csv('factors_and_coefs.csv', encoding='utf-8')

In [79]:
import pickle
with open('logregcvmodel.sav', 'wb') as f:  # closes the file after writing
    pickle.dump(model, f)