# Bills TF-IDF

This Python notebook walks through the TF-IDF analysis performed on the bills of the 18th Congress of the House of Representatives.

## Preliminaries

Make sure that the following libraries are installed on your machine before running the cells below, to avoid import errors.
- `pandas`
- `nltk`
- `tqdm`
- `scikit-learn` (imported as `sklearn`)

All libraries were installed via `pip` using the command `pip install <library-name>`. If `pip` is available on your machine, the same command installs each library listed above. Note that the package name to install for `sklearn` is `scikit-learn`.
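As a quick sanity check before running the notebook, the sketch below lists which of the required libraries cannot be found in the current environment. It only uses the standard library; the variable names `required` and `missing` are illustrative and do not appear elsewhere in the notebook.

```python
import importlib.util

# Import names of the libraries used in this notebook.
required = ["pandas", "nltk", "tqdm", "sklearn"]

# find_spec returns None when a top-level module cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

print("Missing libraries:", missing or "none")
```

Any name printed here can then be installed with `pip` as described above.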

## PART 1: Load the DataFrame

Load the CSV file into a pandas DataFrame and print one entry to verify that the file was read correctly.

In [1]:
import pandas as pd

df = pd.read_csv('18th_hor_bills_dataset_2.csv')

In [2]:
df.iloc[1]

ID                                                                                HB00002
Full Title                              AN ACT CREATING THE DEPARTMENT OF OVERSEAS FIL...
Author Count                                                                            3
is_partylist                                                                            0
party_1-pacman                                                                          0
                                                              ...                        
ref_defeat_covid-19_ad-hoc_committee                                                    0
ref_the_whole_house                                                                     0
ref_mindanao_affairs                                                                    0
ref_west_philippine_sea                                                                 0
approved                                                                                1
Name: 1, L

## PART 2: Load the Stop Words and Extract the Bill Title

A list of English stop words can be loaded using the `nltk` library. Download the `stopwords` corpus first, then load it into the variable `stops`.

In [3]:
# Run this cell only if 'stopwords' has not been downloaded yet or if the succeeding cell throws an error
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\james\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from nltk.corpus import stopwords
stops = stopwords.words('english')
print(stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Extract the `Full Title` column of the bills into a separate Series to prepare it for TF-IDF analysis.

In [5]:
df_full_title = df['Full Title'].copy(deep = True)
df_full_title.head()

0    AN ACT INSTITUTIONALIZING A NATIONAL VALUES, E...
1    AN ACT CREATING THE DEPARTMENT OF OVERSEAS FIL...
2    AN ACT PROVIDING FOR A NATIONAL PROGRAM TO SUP...
3    AN ACT CREATING THE EMERGENCY RESPONSE DEPARTM...
4    AN ACT INSTITUTIONALIZING MICROFINANCE PROGRAM...
Name: Full Title, dtype: object

To prepare the list of bills for TF-IDF, we now remove the stop words and punctuation from each of the full titles.

Let us define a function `remove_punctuation` that removes punctuation characters from a string and keeps everything else, and then test the function on a sample string.

In [6]:
import string

def remove_punctuation(token):
    # delete every character listed in string.punctuation
    return token.translate(str.maketrans('', '', string.punctuation))

In [7]:
test_str_1 = 'Shout out to my ex, you are really quite the man; you made my heart break, and that made me who I am'
remove_punctuation(test_str_1)

'Shout out to my ex you are really quite the man you made my heart break and that made me who I am'

Let us define a function `remove_stop_words` that does the following:
<ul>
    <li> Split the bill title into a list of tokens (words) </li>
    <li> Remove every token that appears in the stop word list, such as the `stops` variable defined above </li>
    <li> Combine the remaining words into a single string separated by spaces, then strip its punctuation </li>
</ul>
We can then test the newly created function on a few entries before finally running it over all bills.

In [8]:
def remove_stop_words(bill_string, stopwords):
    tokens = bill_string.split()  # split the string on whitespace
    kept = []  # tokens that are not stop words

    for token in tokens:
        if token.lower() not in stopwords:
            kept.append(token)

    # punctuation is stripped only after the stop word check
    return remove_punctuation(" ".join(kept))

In [9]:
remove_stop_words(df_full_title.iloc[1], stops)

'ACT CREATING DEPARTMENT OVERSEAS FILIPINO WORKERS OFW FOREIGN EMPLOYMENT DEFINING POWERS FUNCTIONS APPROPRIATING FUNDS THEREFOR RATIONALIZING ORGANIZATION FUNCTIONS GOVERNMENT AGENCIES RELATED MIGRATION PURPOSES'

In [10]:
remove_stop_words(df_full_title.iloc[103], stops)

'ACT PROVIDING COMPREHENSIVE CIVIL REGISTRATION SYSTEM'
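
One detail of `remove_stop_words` is worth noting: tokens are compared against the stop word list before punctuation is stripped, so a token such as `OF,` would not match `of` and would be kept. The sketch below shows one possible variant that strips punctuation from each token first. The function name `remove_stop_words_strict`, the abbreviated stop list `SAMPLE_STOPS`, and the example title are all hypothetical and are not part of the notebook.

```python
import string

# Hypothetical, abbreviated stop list; the notebook uses nltk's full English list.
SAMPLE_STOPS = {"an", "the", "of", "and", "to", "for"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_stop_words_strict(bill_string, stopwords):
    """Strip punctuation from each token first, then drop stop words."""
    kept = []
    for token in bill_string.split():
        cleaned = token.translate(PUNCT_TABLE)
        # skip tokens that were pure punctuation or that are stop words
        if cleaned and cleaned.lower() not in stopwords:
            kept.append(cleaned)
    return " ".join(kept)

# Hypothetical title, written to include stop words followed by commas.
example = "AN ACT PROVIDING FOR THE PROTECTION OF, AND ASSISTANCE TO, CHILDREN"
result = remove_stop_words_strict(example, SAMPLE_STOPS)
print(result)
```

This variant is not what produced the outputs shown in this notebook, and stripping punctuation first means that contractions in the stop list, such as `you're`, would no longer match.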

When the function was first run over all bills, it threw an error at entry 9278, which corresponds to 'HB09290'. The full title of this House bill is empty in the dataset. Cross-checking with the website showed that the website itself does not include the full title of this bill. Fortunately, the full title can be read from the bill's PDF: https://hrep-website.s3.ap-southeast-1.amazonaws.com/legisdocs/basic_18/HB09290.pdf

The title is therefore hardcoded into its corresponding entry, 9278, before running the function over all bills.

In [11]:
df_full_title.iloc[9278] = "AN ACT TO IMPROVE ACCESS TO PRESCHOOL, PRIMARY, AND SECONDARY EDUCATION OF HOMELESS CHILDREN AND YOUTH"
df_full_title.iloc[9278]

'AN ACT TO IMPROVE ACCESS TO PRESCHOOL, PRIMARY, AND SECONDARY EDUCATION OF HOMELESS CHILDREN AND YOUTH'

We now apply the function to all bills, then check the cleaned data by sampling a few entries.

In [12]:
from tqdm.notebook import tqdm

for i in tqdm(range(len(df_full_title))):
    df_full_title.iloc[i] = remove_stop_words(df_full_title.iloc[i], stops)

  0%|          | 0/10840 [00:00<?, ?it/s]

In [13]:
df_full_title.iloc[9278]

'ACT IMPROVE ACCESS PRESCHOOL PRIMARY SECONDARY EDUCATION HOMELESS CHILDREN YOUTH'

In [14]:
df_full_title.iloc[34]

'ACT ESTABLISHING BENHAM RISE RESEARCH DEVELOPMENT INSTITUTE PROVIDING FUNDS THEREFOR PURPOSES'

## PART 3: Extract the Bag of Words

A <i>bag of words</i> is defined here as the set of unique terms generated from the list of full titles. Instead of single words, the terms can be `n`-grams, that is, sequences of `n` consecutive words, for a chosen range of `n`.

Let us define a dictionary of `n`-grams that is initialized over a range of `n`. The limits can be tweaked whenever we want to adjust the range of `n`-grams extracted from the data.
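
To make the idea of an `n`-gram concrete, here is a minimal standard-library sketch that lists the 2-grams and 3-grams of one cleaned title from Part 2. The helper `n_grams` is hypothetical and only for illustration; the notebook itself relies on `nltk.util.ngrams`.

```python
def n_grams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Cleaned title taken from one of the sample outputs in Part 2.
title = "ACT PROVIDING COMPREHENSIVE CIVIL REGISTRATION SYSTEM"
tokens = title.split()

bigrams = n_grams(tokens, 2)
trigrams = n_grams(tokens, 3)

print(bigrams)
print(trigrams)
```

A title with six tokens yields five 2-grams and four 3-grams, since each `n`-gram starts one token later than the previous one.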

In [15]:
#you can tweak lower_limit and upper_limit depending on the number of n-grams that you want to be extracted
lower_limit = 2
upper_limit = 3

bag_of_words_per_n_gram = {x: [] for x in range(lower_limit, upper_limit + 1)}

Let us define a function `extract_n_grams` that extracts the `n`-grams of a given string and adds them to `bag_of_words_per_n_gram`, the dictionary of `n`-grams, for every `n` between the lower and upper limits (given by the `lower_limit` and `upper_limit` variables, respectively).

In [16]:
from nltk.util import ngrams

def extract_n_grams(title, bag, lower, upper):
    '''Extract the n-grams of a title for every n from lower to upper.

    Parameters:
    title - input string to extract n-grams from
    bag - dictionary of n-grams (the bag of words), updated in place
    lower - lower limit of n
    upper - upper limit of n'''

    spl = title.split()
    for i in range(lower, upper + 1):
        temp = set(ngrams(spl, i))
        bag[i] = list(set(bag[i]).union(temp))

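We now run the function over all bills. The cell below is wrapped in a string literal, so it is currently disabled and only echoes its own source.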

In [17]:
'''
for i in tqdm(range(len(df_full_title))):
    extract_n_grams(df_full_title.iloc[i], bag_of_words_per_n_gram, lower_limit, upper_limit)
'''

'\nfor i in tqdm(range(len(df_full_title))):\n    extract_n_grams(df_full_title.iloc[i], bag_of_words_per_n_gram, lower_limit, upper_limit)\n'

To check how many unique `n`-grams were identified in the range, you may run the cell below (also currently disabled):

In [18]:
'''for i in range(lower_limit, upper_limit + 1):
    bag_of_words_per_n_gram[i] = list(map(lambda x: " ".join(x), bag_of_words_per_n_gram[i]))
    print("There are {} unique {}-grams identified".format(len(bag_of_words_per_n_gram[i]), i))'''

'for i in range(lower_limit, upper_limit + 1):\n    bag_of_words_per_n_gram[i] = list(map(lambda x: " ".join(x), bag_of_words_per_n_gram[i]))\n    print("There are {} unique {}-grams identified".format(len(bag_of_words_per_n_gram[i]), i))'

## PART 4: Perform TF-IDF

For this portion, we will be using the `TfidfVectorizer` class from `sklearn.feature_extraction.text`. The following input parameters are relevant here:
- `vocabulary`: The bag of words extracted in the previous part, given by `bag_of_words_per_n_gram`
- `stop_words`: The stop words to ignore; the run below passes `'english'`
- `ngram_range`: The range of n-grams, given by `lower_limit` and `upper_limit`

To pass a `vocabulary` to `TfidfVectorizer`, we first need to combine the dictionary `bag_of_words_per_n_gram` into a single list. The cell below does this, but it is currently disabled.

In [19]:
'''
vocabs = []

for i in range(lower_limit, upper_limit+1):
    vocabs += list(set(bag_of_words_per_n_gram[i]))

vocabs = [i.lower() for i in vocabs]

#transform it into a set to ensure that there are no duplicates
vocabs = list(set(vocabs))

print(len(vocabs))
'''

'\nvocabs = []\n\nfor i in range(lower_limit, upper_limit+1):\n    vocabs += list(set(bag_of_words_per_n_gram[i]))\n\nvocabs = [i.lower() for i in vocabs]\n\n#transform it into a set to ensure that there are no duplicates\nvocabs = list(set(vocabs))\n\nprint(len(vocabs))\n'

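Perform the actual TF-IDF. Note that this run does not pass `vocabulary`; it limits the vocabulary to 3,000 n-grams through `max_features=3000` instead.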

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=3000, stop_words = 'english', ngram_range=(lower_limit, upper_limit))
tfs = tfidf.fit_transform(df_full_title)

In [21]:
feature_names = tfidf.get_feature_names_out()
corpus_index = list(range(len(df_full_title)))
rows, cols = tfs.nonzero()
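
To make the weights stored in `tfs` more concrete, the sketch below computes TF-IDF by hand for 2-grams on three tiny titles. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is how scikit-learn documents the default behaviour of `TfidfVectorizer`. The titles and the helper names `bigrams` and `tfidf_row` are hypothetical, and the tokenization here is a plain whitespace split, which is simpler than the vectorizer's own.

```python
import math
from collections import Counter

def bigrams(text):
    """Return the 2-grams of a string as space-joined terms."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

# Hypothetical cleaned titles, not taken from the dataset.
docs = [
    "act creating department health",
    "act creating department energy",
    "act providing funds therefor",
]

counts = [Counter(bigrams(d)) for d in docs]  # term frequency per document
n_docs = len(docs)
doc_freq = Counter(term for c in counts for term in c)  # documents containing each term

def tfidf_row(term_counts):
    # weight = tf * smoothed idf, then scale the row to unit L2 norm
    raw = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
           for t, tf in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

rows_by_doc = [tfidf_row(c) for c in counts]
for term, weight in sorted(rows_by_doc[0].items()):
    print(f"{term:22s} {weight:.3f}")
```

In the first title, `department health` appears in only one document, so it receives a higher weight than `act creating`, which is shared with the second title.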

## PART 5: Preparing the Results for Modelling

To make the results of the TF-IDF analysis easier to inspect, we can prepare them for modelling. The following code cells do the following:
- Convert the variable `tfs`, which contains the actual results of the TF-IDF analysis, into a pandas DataFrame. `tfs.toarray()` gives the output in array form, while the column names come from the `get_feature_names_out` method (stored in `feature_names`)
- Concatenate the newly created DataFrame with `df` so that each bill's original columns, including its full title, sit alongside its TF-IDF results. The new DataFrame has the original columns of `df` followed by one column per n-gram feature

In [22]:
df1 = pd.DataFrame(tfs.toarray(), columns = feature_names)

df1.head()

Unnamed: 0,10 republic,10 republic act,100 beds,1000 beds,10632 republic,10632 republic act,10656 republic,10656 republic act,10868 known,10868 known centenarians,...,zamboanga del norte,zamboanga del sur,zamboanga sibugay,zone appropriating,zone appropriating funds,zone authority,zone authority appropriating,zone freeport,zone providing,ꞌan act
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Runtime: about 1.5 minutes.

In [23]:
final = pd.concat([df, df1], axis=1)

In [24]:
# Do not run this cell yet! We still need a way to export this without producing a 1 GB file
# final.to_csv('18th_hor_bills_final.csv', encoding='utf-8', chunksize=250000)

## PART 6: The Actual Modelling Part

In [25]:
final['approved'].value_counts()

0    7254
1    3586
Name: approved, dtype: int64

In [26]:
final = final.drop(['ID', 'Full Title'], axis=1)

In [27]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
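# shuffle() is called without a random_state, so the row order differs between runs
small = shuffle(final)
# hold out 1/7 of the rows for testing; 'approved' is the target label
train_set, test_set, train_stat, test_stat = train_test_split(small.drop('approved', axis=1), small['approved'], test_size=1/7.0, random_state=0)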

In [28]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(train_set)
train_set = scaler.transform(train_set)
test_set = scaler.transform(test_set)
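
The cell above fits the scaler on the training set only and then reuses those statistics for the test set. The standard-library sketch below illustrates the same idea for a single feature, assuming the usual standardization `z = (x - mean) / std` with the population standard deviation. The helper names `fit_scaler` and `transform` and the sample values are hypothetical.

```python
import statistics

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return mean, std or 1.0  # fall back to 1.0 for constant columns

def transform(column, mean, std):
    """Standardize values using previously fitted statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical values for a single feature, not taken from the dataset.
train_col = [0.0, 0.0, 0.5, 1.5]
test_col = [0.5, 2.0]

mean, std = fit_scaler(train_col)
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Fitting on the training data alone keeps information about the test set from leaking into the preprocessing step.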

In [29]:
from sklearn.decomposition import PCA
# Keep enough principal components to explain 95% of the variance
pca = PCA(.95)
pca.fit(train_set)

In [30]:
pca.components_

array([[ 1.19360479e-01,  1.06842573e-01,  3.24372534e-02, ...,
         1.04900825e-03, -4.17127604e-04, -1.10793483e-03],
       [ 8.00527382e-03,  8.15722115e-03,  5.85734749e-03, ...,
         1.01797087e-04, -2.58933515e-04, -2.79875973e-04],
       [ 8.10777041e-03,  7.75174665e-03,  7.40371352e-03, ...,
         3.39359745e-04,  8.56638779e-06,  1.91406444e-04],
       ...,
       [-5.45022284e-05,  7.51204586e-03,  2.70691919e-02, ...,
         1.63755131e-02,  4.86929521e-02, -3.56172453e-03],
       [ 8.96821406e-04,  6.28620778e-03,  3.26106406e-02, ...,
         1.16554069e-02, -1.08798068e-02,  7.53664609e-03],
       [-2.20666619e-04,  2.05483491e-03,  6.93360570e-03, ...,
        -7.21227358e-04, -2.99473707e-02,  6.20806454e-04]])

In [31]:
pca.n_components_

1329

In [32]:
train_set = pca.transform(train_set)
test_set = pca.transform(test_set)

In [33]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='saga')
model.fit(train_set, train_stat)



In [34]:
predictions = model.predict(test_set)

In [35]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(test_stat, predictions)

# for binary labels 0/1, ravel() yields the counts in the order TN, FP, FN, TP
TN, FP, FN, TP = cm.ravel()

print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

accuracy =  (TP+TN) /(TP+FP+TN+FN)

print('Accuracy of the binary classification = {:0.3f}'.format(accuracy))

True Positive(TP)  =  360
False Positive(FP) =  109
True Negative(TN)  =  922
False Negative(FN) =  158
Accuracy of the binary classification = 0.828
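
Accuracy alone can hide how the model treats each class, especially since non-approved bills outnumber approved ones. As a rough follow-up, the sketch below derives precision, recall, and F1 for the `approved` class from the counts printed above, and compares the accuracy with a majority-class baseline. The baseline uses the class counts of the full dataset from `value_counts()`, not of the test split, so the comparison is only approximate.

```python
# Counts copied from the confusion matrix output above.
TP, FP, TN, FN = 360, 109, 922, 158
# Class counts from final['approved'].value_counts() on the full dataset.
not_approved, approved = 7254, 3586

total = TP + FP + TN + FN
accuracy = (TP + TN) / total
precision = TP / (TP + FP)   # share of predicted approvals that were correct
recall = TP / (TP + FN)      # share of actual approvals that were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the larger class (not approved).
majority_baseline = not_approved / (not_approved + approved)

print('Test set size     = {}'.format(total))
print('Accuracy          = {:0.3f}'.format(accuracy))
print('Precision         = {:0.3f}'.format(precision))
print('Recall            = {:0.3f}'.format(recall))
print('F1 score          = {:0.3f}'.format(f1))
print('Majority baseline = {:0.3f}'.format(majority_baseline))
```

The four counts add up to 1549 bills, which matches one seventh of the 10840 bills rounded up.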
