# CS 180 Machine Project Notebook [18th Congress Bills Analysis]

This Python notebook documents the TF-IDF analysis of the bills filed in the House of Representatives during the 18th Congress, together with the machine learning models trained on the results. The following procedures are covered:
- Term Frequency - Inverse Document Frequency (TF-IDF)
- MODEL 1: Principal Component Analysis (PCA) + Logistic Regression
- MODEL 2: Incremental PCA + Support Vector Machines (SVM)

This notebook was prepared by **GROUP 3 - *Blank Space***:
- James Matthew Borines
- Michael Benjamin Morco
- Kyle Gabriel Reynoso

## Preliminaries

Make sure the following libraries are installed on your machine before running the cells below:
- `pandas`
- `nltk`
- `tqdm`
- `scikit-learn` (imported as `sklearn`)

All libraries were installed via `pip` using the command `pip install <library-name>`. If `pip` is available on your machine, you can install each of the libraries above with the same command.

**NOTE**: Some of the code cells below have a running time of at least 30 minutes, with some even reaching an hour.

## Outline of this Notebook
[PART 1: Prepare the Dataframe](#part1) <br/>
[PART 2: Load the Stop Words and Extract the Bill Title](#part2) <br/>
[PART 3: Perform TF-IDF](#part3) <br/>
[PART 4: Preparing the Results for Modelling](#part4) <br/>
[PART 5: The Actual Modelling Part](#part5) <br/>
&emsp;[PART 5.1: PCA + Logistic Regression](#part51) <br/>
&emsp;[PART 5.2: Incremental PCA + SVM](#part52) <br/>
&emsp;[PART 5.3: Analysis of Two Models](#part53)

<a id = "part1"></a>
# PART 1: Prepare the Dataframe

For this part, we load the preprocessed *18th House of Representatives Bills Dataset*. The dataset contains the following features:
- Important characteristics of a House bill (e.g. ID, Full Title, Number of Authors, etc.)
- One-hot encoding of the *Significance* and *Primary Referral* status

In [1]:
import pandas as pd

df = pd.read_csv('18th_hor_bills_dataset_2.csv')

In [36]:
df.shape

(10840, 279)

In [2]:
#load a particular instance of the dataset
df.iloc[1]

ID                                                                                HB00002
Full Title                              AN ACT CREATING THE DEPARTMENT OF OVERSEAS FIL...
Author Count                                                                            3
is_partylist                                                                            0
party_1-pacman                                                                          0
                                                              ...                        
ref_defeat_covid-19_ad-hoc_committee                                                    0
ref_the_whole_house                                                                     0
ref_mindanao_affairs                                                                    0
ref_west_philippine_sea                                                                 0
approved                                                                                1
Name: 1, L

<a id = "part2"></a>
# PART 2: Load the Stop Words and Extract the Bill Title

A list of English stop words can be loaded using the `nltk` library. We download the `stopwords` corpus first, then load it into the variable `stops`.

In [3]:
#Run this cell only if 'stopwords' has not been downloaded or if the succeeding cell throws an error
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from nltk.corpus import stopwords
stops = stopwords.words('english')
print(stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Extract the `Full Title` column of the bills into a separate Series to prepare it for TF-IDF analysis.

In [5]:
df_full_title = df['Full Title'].copy(deep = True)
df_full_title.head()

0    AN ACT INSTITUTIONALIZING A NATIONAL VALUES, E...
1    AN ACT CREATING THE DEPARTMENT OF OVERSEAS FIL...
2    AN ACT PROVIDING FOR A NATIONAL PROGRAM TO SUP...
3    AN ACT CREATING THE EMERGENCY RESPONSE DEPARTM...
4    AN ACT INSTITUTIONALIZING MICROFINANCE PROGRAM...
Name: Full Title, dtype: object

As we prepare the list of bills for TF-IDF, we remove the stop words and punctuation from each of the full titles.

Let us define a function `remove_punctuation` that strips the punctuation marks from a string and keeps everything else, and test the function on a sample string.

In [6]:
import string

def remove_punctuation(token):
    #remove every character listed in string.punctuation
    return token.translate(str.maketrans('', '', string.punctuation))

In [7]:
test_str_1 = 'Shout out to my ex, you are really quite the man; you made my heart break, and that made me who I\'m'
remove_punctuation(test_str_1)

'Shout out to my ex you are really quite the man you made my heart break and that made me who Im'

Let us define a function `remove_stop_words` that does the following:
<ul>
    <li> Split the bill title into a list of tokens (words) </li>
    <li> Remove every token that appears in the given list of stop words, such as the variable <code>stops</code> defined above </li>
    <li> Combine the remaining words into a string separated by spaces, then strip the punctuation using <code>remove_punctuation</code> </li>
</ul>
We can then test the newly created function on some sample entries before finally applying it to all bills.

In [8]:
def remove_stop_words(bill_string, stopwords):
    lis = bill_string.split() #split the string according to spaces
    to_return = [] #define a new list of words
    
    for i in lis:
        if i.lower() not in stopwords:
            to_return.append(i)
    
    return remove_punctuation(" ".join(to_return))
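One caveat of `remove_stop_words` is that stop words are matched before the punctuation is stripped, so a stop word with punctuation attached (e.g. `TO,`) is not removed. The sketch below compares the two orders on a made-up title. The names `filter_then_strip`, `strip_then_filter`, and the tiny `DEMO_STOPS` set are hypothetical and used only for illustration; the notebook itself keeps the original order.

```python
import string

# Hypothetical, tiny stop word set for illustration (the notebook uses nltk's list)
DEMO_STOPS = {"to", "and", "from", "the"}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def filter_then_strip(text, stops):
    # Same order as remove_stop_words: filter tokens first, strip punctuation last
    kept = [tok for tok in text.split() if tok.lower() not in stops]
    return " ".join(kept).translate(PUNCT_TABLE)

def strip_then_filter(text, stops):
    # Alternative order: strip punctuation first, then filter tokens
    cleaned = text.translate(PUNCT_TABLE)
    return " ".join(tok for tok in cleaned.split() if tok.lower() not in stops)

title = "ROAD TO, AND FROM, THE PORT"
print(filter_then_strip(title, DEMO_STOPS))  # ROAD TO FROM PORT
print(strip_then_filter(title, DEMO_STOPS))  # ROAD PORT
```

Stripping punctuation first has its own trade-off: contractions in the stop word list such as `you're` would become `youre` and no longer match, which matters little for bill titles but may matter for other text.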

In [9]:
remove_stop_words(df_full_title.iloc[1], stops)

'ACT CREATING DEPARTMENT OVERSEAS FILIPINO WORKERS OFW FOREIGN EMPLOYMENT DEFINING POWERS FUNCTIONS APPROPRIATING FUNDS THEREFOR RATIONALIZING ORGANIZATION FUNCTIONS GOVERNMENT AGENCIES RELATED MIGRATION PURPOSES'

In [10]:
remove_stop_words(df_full_title.iloc[103], stops)

'ACT PROVIDING COMPREHENSIVE CIVIL REGISTRATION SYSTEM'

When the function was first run over all bills, it threw an error at entry 9278, which corresponds to 'HB09290'. The full title of this house bill is empty in the dataset. Cross-checking the data with the source website showed that the website does not include the full title of this bill. Fortunately, the full title can be found in the text of the bill itself: https://hrep-website.s3.ap-southeast-1.amazonaws.com/legisdocs/basic_18/HB09290.pdf

The title of the bill is therefore hardcoded into its corresponding entry, 9278, before running the function over all bills.
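Entries like this one can be located before the cleaning step is run. The following is a minimal sketch on a toy Series; `titles` is a hypothetical stand-in for `df_full_title`, and how the real dataset encodes the empty title (missing value or empty string) is an assumption, so both cases are checked.

```python
import pandas as pd

# Toy stand-in for df_full_title; the real Series comes from the dataset
titles = pd.Series(["AN ACT CREATING X", None, "   ", "AN ACT PROVIDING Y"])

# A title is unusable if it is missing or contains only whitespace
is_blank = titles.fillna("").str.strip().eq("")

# Positions (usable with .iloc) of the unusable titles
missing_positions = [i for i, flag in enumerate(is_blank) if flag]
print(missing_positions)  # [1, 2]
```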

In [11]:
df_full_title.iloc[9278] = "AN ACT TO IMPROVE ACCESS TO PRESCHOOL, PRIMARY, AND SECONDARY EDUCATION OF HOMELESS CHILDREN AND YOUTH"
df_full_title.iloc[9278]

'AN ACT TO IMPROVE ACCESS TO PRESCHOOL, PRIMARY, AND SECONDARY EDUCATION OF HOMELESS CHILDREN AND YOUTH'

The function is then applied to all bills, and the result is checked by sampling a few entries.

In [12]:
from tqdm.notebook import tqdm

for i in tqdm(range(len(df_full_title))):
    df_full_title.iloc[i] = remove_stop_words(df_full_title.iloc[i], stops)

  0%|          | 0/10840 [00:00<?, ?it/s]

In [13]:
df_full_title.iloc[10839]

'ACT CONVERTING DARAGAPILAR DIVERSION ROAD MUNICIPALITY DARAGA PROVINCE ALBAY MUNICIPALITY PILAR PROVINCE SORSOGON NATIONAL ROAD APPROPRIATING FUNDS THEREFOR'

In [14]:
df_full_title.iloc[34]

'ACT ESTABLISHING BENHAM RISE RESEARCH DEVELOPMENT INSTITUTE PROVIDING FUNDS THEREFOR PURPOSES'

<a id = "part3"></a>
# PART 3: Perform TF-IDF

For this portion, we will be using the `TfidfVectorizer` class from `sklearn.feature_extraction.text`. We set the following parameters:
- `max_features`: Maximum number of n-grams to be kept as features
- `stop_words`: The stop words to be removed from the text
- `ngram_range`: The range of n-gram lengths, given by the `lower_limit` and `upper_limit`

Only a subset of the n-grams was used for TF-IDF, since the dataframe would have a size of 17.1 GiB if the full vocabulary (~211000 features) were kept, which cannot be processed by machines with 16 GB of memory. The `ngram_range` and `max_features` parameters control the results of the TF-IDF analysis through the following variables:
- `lower_limit`: Minimum length $n$ of the n-grams to be considered
- `upper_limit`: Maximum length $n$ of the n-grams to be considered
- `no_words`: Number of n-grams to be kept as features for TF-IDF. Only the top `no_words` n-grams according to frequency are kept. If your machine has more than 16 GB of RAM, you may want to consider omitting `no_words` and the `max_features` parameter of `TfidfVectorizer`

In [15]:
#you can tweak lower_limit and upper_limit depending on the lengths of the n-grams that you want to be extracted
lower_limit = 2
upper_limit = 4
no_words = 40000
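As a rough check on the memory figures, the size of a dense matrix can be estimated from its shape. The sketch below assumes 8 bytes per value (64-bit floats); `dense_size_gib` is a hypothetical helper, and the actual footprint of a pandas DataFrame may be somewhat larger.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    # A dense float64 array stores 8 bytes per entry
    return n_rows * n_cols * bytes_per_value / 2**30

n_bills = 10840  # number of rows reported by df.shape

print(round(dense_size_gib(n_bills, 40000), 2))   # with no_words = 40000
print(round(dense_size_gib(n_bills, 211000), 2))  # with the approximate full vocabulary
```

Under these assumptions, 40000 features need a little over 3 GiB, while the approximate full vocabulary needs about 17 GiB, in line with the figure quoted above.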

We then perform the actual TF-IDF.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features = no_words, stop_words = 'english', ngram_range=(lower_limit, upper_limit))
tfs = tfidf.fit_transform(df_full_title)

<a id = "part4"></a>
# PART 4: Preparing the Results for Modelling

After performing TF-IDF, we can now prepare the original dataframe and the TF-IDF results for modelling. The following code cells do the following:
- Convert the variable `tfs`, which contains the results of the TF-IDF analysis, into a pandas DataFrame. `tfs.toarray()` gives the output as a dense array, while the column names are obtained using the `get_feature_names_out` method
- Concatenate the newly created dataframe with the original dataframe `df` so that the features of each bill are associated with its TF-IDF results. The new dataframe has the original columns followed by one column per n-gram in the vocabulary
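If memory is a concern, one possible alternative to `tfs.toarray()` is to keep the TF-IDF output sparse inside pandas. This is only a sketch on toy data: `toy_tfs` and `toy_names` are made up, and whether every later step of this notebook accepts sparse-backed columns has not been checked here.

```python
import pandas as pd
from scipy import sparse

# Toy sparse matrix standing in for tfs (the TF-IDF output is a SciPy sparse matrix)
toy_tfs = sparse.csr_matrix([[0.0, 0.5, 0.0], [0.7, 0.0, 0.0]])
toy_names = ["act creating", "act providing", "funds therefor"]

# Build a DataFrame backed by sparse columns instead of calling .toarray()
toy_df = pd.DataFrame.sparse.from_spmatrix(toy_tfs, columns=toy_names)

print(toy_df.shape)
print(toy_df.sparse.density)  # fraction of entries that are stored explicitly
```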

In [17]:
feature_names = tfidf.get_feature_names_out() #get the bag of words used in TF-IDF

In [18]:
df1 = pd.DataFrame(tfs.toarray(), columns = feature_names)

df1.head()

Unnamed: 0,10 11,10 11 12,10 11 12 presidential,10 12,10 12 amending,10 12 amending purpose,10 50,10 50 beds,10 50 beds appropriating,10 additional,...,ꞌthe revised,ꞌtobacco regulation,ꞌtobacco regulation act,ꞌtobacco regulation act 2003ꞌ,ꞌuniversal health,ꞌuniversal health care,ꞌuniversal health care actꞌ,ꞌurban development,ꞌurban development housing,ꞌurban development housing act
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


`df1` now contains the TF-IDF results of all bills, with each column corresponding to an $n$-gram in the *vocabulary*. We then concatenate this dataframe with the existing dataframe `df`. The running time of the concatenation depends on the speed of the machine, but is about 1.5 minutes on average.

In [19]:
final = pd.concat([df, df1], axis=1)

The `final` dataframe now contains all the features from the preprocessing aspect and the TF-IDF analysis.

<a id = "part5"></a>
# PART 5: The Actual Modelling Part

The modelling phase can be divided into two parts:
1. Principal Component Analysis (PCA) + Logistic Regression
2. Incremental PCA + Support Vector Machines

Before modelling, the `final` dataframe will be preprocessed first by:
- Shuffling the dataset to avoid any bias that may come along the process
- Splitting our dataset into training and test sets using `train_test_split` from `sklearn.model_selection`
- Scaling our training and test set using `StandardScaler`

In [20]:
final['approved'].value_counts()

0    7254
1    3586
Name: approved, dtype: int64

Based on the value counts of the `final` dataframe, there are:
- 7254 bills that are treated as "Not Approved" (66.92%)
- 3586 bills that are treated as "Approved" (33.08%)

In [21]:
#drop the ID and Full Title fields from the final dataframe
final = final.drop(['ID', 'Full Title'], axis=1)

We then split our dataset into training and test sets, and scale our data using `StandardScaler`.
- We divide our dataset in an 80:20 ratio: 80% for the training set and the remaining 20% for the test set
- To ensure that the ratio of Unapproved to Approved bills is kept constant throughout the split, we stratify our split based on the `approved` feature.
- `train_set` and `train_stat` contain the features and the status output of the bills for the training portion of the dataset, respectively.
- `test_set` and `test_stat` contain the features and the status output of the bills for the test portion of the dataset, respectively.

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
small = shuffle(final)
train_set, test_set, train_stat, test_stat = train_test_split(small.drop('approved', axis=1), 
                                                              small['approved'], test_size=1/5,
                                                             stratify = small['approved'])
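The split above is not seeded, so each run produces a different partition. A minimal sketch of a reproducible stratified split on toy data follows; the arrays, the `split` helper, and the seed value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 30% of which are labelled 1
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 30 + [0] * 70)

def split(seed):
    # random_state fixes the shuffle; stratify keeps the class ratio
    return train_test_split(X, y, test_size=1/5, stratify=y, random_state=seed)

X_tr_a, X_te_a, y_tr_a, y_te_a = split(seed=0)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(seed=0)

print(np.array_equal(X_te_a, X_te_b))  # same seed, same split
print(y_te_a.mean(), y_tr_a.mean())    # class ratio in the test and training sets
```

Passing the same `random_state` to both `shuffle` and `train_test_split` would make the notebook's own split repeatable in the same way.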

In [23]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaler.fit(train_set)
train_set = scaler.transform(train_set)
test_set = scaler.transform(test_set)

Upon running `value_counts()` for both training and test sets:
- For the training set, the number of Unapproved is 5803 (66.92%) while the number of Approved is 2869 (33.08%)
- For the test set, the number of Unapproved is 1451 (66.93%) while the number of Approved is 717 (33.07%)

Both the training and test sets are consistent in terms of the ratio of *Unapproved* to *Approved* bills.

In [24]:
train_stat.value_counts()

0    5803
1    2869
Name: approved, dtype: int64

In [25]:
test_stat.value_counts()

0    1451
1     717
Name: approved, dtype: int64

<a id = "part51"></a>
## PART 5.1: Principal Component Analysis + Logistic Regression

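For this part, TF-IDF was run with 3000 maximum features (`no_words = 3000`) and an `ngram_range` of $[2, 4]$. This differs from the `no_words = 40000` setting shown in [PART 3](#part3), as the column counts in the outputs indicate (3276 features here versus 40276 in [PART 5.2](#part52)). Principal component analysis (PCA) from `sklearn.decomposition` was used to reduce the dimensionality of the features before fitting a *Logistic Regression* model.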

In [29]:
from sklearn.decomposition import PCA
# Fit PCA on the training set, keeping enough components to explain 95% of the variance
pca = PCA(.95)
pca.fit(train_set)

We then transform both our training and testing set using the `transform` method. 

In [30]:
#determine new number of components/features after fitting the dataset
pca.n_components_

1142
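Passing a fraction such as `.95` to `PCA` asks scikit-learn to keep the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below illustrates this on synthetic data; `toy_X` and `toy_pca` are hypothetical and unrelated to the bill features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 200 samples, 10 features with decreasing variance
toy_X = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

toy_pca = PCA(0.95).fit(toy_X)

# Cumulative share of variance explained by the kept components
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)
print(toy_pca.n_components_)
print(cumulative[-1] >= 0.95)  # the kept components reach the 95% target
```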

In [31]:
#transform both training and test set according to PCA
train_set = pca.transform(train_set)
test_set = pca.transform(test_set)

We import `LogisticRegressionCV` from `sklearn.linear_model` and set up an instance under the variable `model`. We can now feed our training set, represented by `train_set` and `train_stat` to fit into the logistic regression model. However, the code cell block below was commented out since it will take a long time to re-run the model.

In [32]:
# from sklearn.linear_model import LogisticRegressionCV
# model = LogisticRegressionCV(solver='saga', cv=5, max_iter=5000, n_jobs=4)
# model.fit(train_set, train_stat)
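The scale, reduce, and fit steps can also be chained in a single scikit-learn pipeline, which guarantees that the scaler and PCA are fitted on the training data only. This is a sketch on synthetic data, not the configuration used in this notebook: the dataset, `demo_pipeline`, and the default solver are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bill features and the `approved` labels
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling, PCA, and the classifier are all fitted on the training data only
demo_pipeline = make_pipeline(
    StandardScaler(),
    PCA(0.95),
    LogisticRegressionCV(cv=5, max_iter=5000),
)
demo_pipeline.fit(X_train, y_train)

print(demo_pipeline.score(X_test, y_test))  # accuracy on the held-out set
```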

Instead of retraining, you can simply load the pickled model.

In [81]:
import pickle
model = pickle.load(open('logregcvmodel.sav', 'rb'))


### Performance Metrics

In [49]:
model.score(test_set,test_stat)

0.8242619926199262

In [50]:
model.scores_

{1: array([[0.78847262, 0.82074928, 0.82074928, 0.81440922, 0.81095101,
         0.80864553, 0.80806916, 0.80806916, 0.80806916, 0.80806916],
        [0.78674352, 0.82536023, 0.82478386, 0.81325648, 0.80634006,
         0.80691643, 0.80691643, 0.80691643, 0.80691643, 0.80691643],
        [0.78373702, 0.81141869, 0.80622837, 0.80046136, 0.79873126,
         0.79930796, 0.79757785, 0.79757785, 0.79757785, 0.79757785],
        [0.79296424, 0.82929642, 0.82525952, 0.80968858, 0.80103806,
         0.79930796, 0.79930796, 0.79930796, 0.79930796, 0.79930796],
        [0.77508651, 0.81314879, 0.81314879, 0.81545559, 0.8160323 ,
         0.81372549, 0.81545559, 0.81430219, 0.81430219, 0.81430219]])}

In [52]:
predictions = model.predict(test_set)

In [53]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(test_stat, predictions)

#unpack the 2x2 matrix computed above instead of recomputing it
TN, FP, FN, TP = cm.ravel()

print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

accuracy =  (TP+TN) /(TP+FP+TN+FN)

print('Accuracy of the binary classification = {:0.3f}'.format(accuracy))

True Positive(TP)  =  445
False Positive(FP) =  93
True Negative(TN)  =  1342
False Negative(FN) =  288
Accuracy of the binary classification = 0.824
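Precision and recall for this model can be derived from the same counts, using the usual definitions $\frac{TP}{TP + FP}$ and $\frac{TP}{TP + FN}$. The sketch below copies the counts from the output above rather than recomputing them from the model.

```python
# Counts copied from the confusion matrix output above
TP, FP, TN, FN = 445, 93, 1342, 288

precision = TP / (TP + FP)  # share of predicted approvals that were correct
recall = TP / (TP + FN)     # share of actual approvals that were found

print('Precision = {:0.3f}'.format(precision))
print('Recall = {:0.3f}'.format(recall))
```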


In [77]:
test_stat

8667    0
3036    0
4449    0
5664    1
8388    0
       ..
9161    0
6096    0
4860    0
7379    0
7547    0
Name: approved, Length: 2168, dtype: int64

In [78]:
model.predict_proba(test_set)[:,1]

array([0.08129334, 0.20051739, 0.14112781, ..., 0.09307892, 0.73290162,
       0.2420878 ])

In [75]:
from sklearn.metrics import brier_score_loss
brier_score = brier_score_loss(test_stat, model.predict_proba(test_set)[:,1])
brier_score

0.12771165185479205

In [76]:
from sklearn.metrics import log_loss
loss = log_loss(test_stat, model.predict_proba(test_set)[:,1])
loss

0.4052578552509612

### Analyzing PCA Components

In [38]:
small.drop('approved',axis=1).columns

Index(['Author Count', 'is_partylist', 'party_1-pacman', 'party_a teacher',
       'party_aambis-owa', 'party_abang lingkod', 'party_abono',
       'party_act-cis', 'party_act-teachers', 'party_agap',
       ...
       'zamboanga sibugay', 'zone appropriating', 'zone appropriating funds',
       'zone appropriating funds therefor', 'zone authority',
       'zone authority appropriating', 'zone authority appropriating funds',
       'zone freeport', 'zone providing', 'ꞌan act'],
      dtype='object', length=3276)

In [39]:
model.n_features_in_

1142

In [40]:
pca_cols = small.drop('approved', axis=1).columns
strong_rel = pd.DataFrame(pca.components_,columns=small.drop('approved', axis=1).columns)
strong_rel

Unnamed: 0,Author Count,is_partylist,party_1-pacman,party_a teacher,party_aambis-owa,party_abang lingkod,party_abono,party_act-cis,party_act-teachers,party_agap,...,zamboanga sibugay,zone appropriating,zone appropriating funds,zone appropriating funds therefor,zone authority,zone authority appropriating,zone authority appropriating funds,zone freeport,zone providing,ꞌan act
0,0.114401,0.100202,0.028969,0.039771,0.046010,0.039054,0.036009,0.063906,0.037448,0.042447,...,0.000642,-0.001110,-0.001110,-0.001087,0.007771,0.009985,0.009985,0.000799,-0.000615,-0.001051
1,0.000456,-0.000998,0.001240,-0.000237,0.001918,0.002554,-0.000149,-0.003343,-0.002213,0.004736,...,-0.000615,-0.000653,-0.000653,-0.000648,-0.000281,-0.000192,-0.000192,-0.000282,-0.000254,-0.000176
2,0.002296,0.001529,0.000503,0.000885,0.000968,0.000633,0.000773,0.000284,0.000262,0.001499,...,-0.000292,-0.000496,-0.000496,-0.000491,-0.000024,0.000062,0.000062,-0.000186,-0.000212,-0.000249
3,-0.048775,-0.048469,-0.024810,-0.030233,-0.037736,-0.031457,-0.028824,-0.023791,-0.012221,-0.033591,...,0.002181,-0.000946,-0.000946,-0.000939,-0.007816,-0.010028,-0.010028,-0.001912,-0.000208,-0.000181
4,-0.003492,-0.001548,-0.002286,0.003718,-0.003389,-0.003723,-0.003060,-0.009880,-0.002160,-0.003603,...,-0.000572,-0.000718,-0.000718,-0.000716,-0.001237,-0.001557,-0.001557,-0.000521,-0.000179,0.000116
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1137,0.000456,-0.000435,0.050618,-0.019117,-0.007898,-0.017610,-0.000049,-0.003585,0.019729,-0.023992,...,0.004021,0.013596,0.013596,0.013338,-0.028250,-0.016412,-0.016412,0.015105,-0.014750,-0.011140
1138,-0.001914,0.002197,-0.053485,-0.003568,-0.025716,0.029120,-0.065895,0.032444,0.011080,0.029750,...,-0.003113,-0.006317,-0.006317,-0.007541,0.019823,-0.000159,-0.000159,-0.007450,0.004226,-0.006240
1139,0.000508,-0.004019,0.031213,-0.025826,-0.061451,0.010958,-0.030905,0.038779,0.017450,-0.033143,...,0.013643,-0.001714,-0.001714,-0.001422,-0.003346,0.003645,0.003645,0.008112,0.006423,-0.004060
1140,-0.000229,0.000121,0.034352,0.046426,-0.022764,-0.008468,-0.011415,0.005593,0.027247,-0.019482,...,0.008886,0.012267,0.012267,0.010022,0.022901,0.002295,0.002295,0.009711,-0.017725,0.000083


In [41]:
import numpy as np
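def get_max_column(row):
    #return the (feature name, loading) pair with the largest absolute loading in a component
    if type(row) is float:
        return
    max_i = np.argmax(row.abs())
    return (pca_cols[max_i], row.iloc[max_i]) #use .iloc since max_i is a position, not a label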

contributions = strong_rel.apply(get_max_column, axis=1)
contributions


0                   (Author Count, 0.11440066447174617)
1                       (act 10923, 0.1372788123611265)
2                    (11 12 public, 0.1465452997657302)
3       (judiciary reorganization, 0.12882047050415404)
4                 (building grant, 0.16686654219698946)
                             ...                       
1137                     (act 2016, 0.1043683004787834)
1138                  (ref_health, -0.1245832064352632)
1139                 (known civil, 0.12879892865476758)
1140     (government procurement, -0.13213510499375436)
1141          (overseas filipinos, -0.1064238353044729)
Length: 1142, dtype: object
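The same "largest absolute loading per component" lookup can be written without a row-wise `apply`. This is a sketch on a toy loadings matrix; `toy_components` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy loadings matrix: one row per component, one column per original feature
toy_components = pd.DataFrame(
    [[0.1, -0.8, 0.3],
     [0.6, 0.2, -0.1]],
    columns=["Author Count", "act creating", "ref_health"],
)

# Column label with the largest absolute loading in each row
top_factor = toy_components.abs().idxmax(axis=1)

# Signed loading at that position, looked up with NumPy indexing
values = toy_components.to_numpy()
positions = np.abs(values).argmax(axis=1)
top_coef = values[np.arange(len(values)), positions]

toy_out = pd.DataFrame({"factor": top_factor, "coef": top_coef})
print(toy_out)
```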

In [42]:
out = pd.DataFrame(contributions.tolist(), columns=['factor','coef'])
out

| | factor | coef |
|---|---|---|
| 0 | Author Count | 0.114401 |
| 1 | act 10923 | 0.137279 |
| 2 | 11 12 public | 0.146545 |
| 3 | judiciary reorganization | 0.128820 |
| 4 | building grant | 0.166867 |
| ... | ... | ... |
| 1137 | act 2016 | 0.104368 |
| 1138 | ref_health | -0.124583 |
| 1139 | known civil | 0.128799 |
| 1140 | government procurement | -0.132135 |


In [43]:
abs_values = strong_rel.abs().sum().sort_values(ascending=False)

In [44]:
abs_values

party_an waray             22.900045
national health            22.880870
act strengthening          22.838235
cebu known                 22.728858
located barangay           22.705563
                             ...    
ni ani kita store           1.625504
ni ani                      1.625504
ani kita store barangay     1.625504
ani kita                    1.625504
Author Count                1.347907
Length: 3276, dtype: float64

In [45]:
abs_values.to_csv('greatest_abs_value.csv', encoding='utf-8')

In [46]:
non_abs_values = strong_rel.sum().sort_values(ascending=False)
non_abs_values

road network               4.013970
ref_visayas_development    3.430597
pandemic purposes          3.143547
ref_west_philippine_sea    3.032071
national policy            2.499424
                             ...   
presently known           -2.050348
act authorizing           -2.057190
citizen service           -2.116261
government agencies       -2.157544
act defining              -2.309154
Length: 3276, dtype: float64

In [47]:
non_abs_values.to_csv('non_abs_values.csv', encoding='utf-8')

In [48]:
out.to_csv('factors_and_coefs.csv', encoding='utf-8')

In [79]:
import pickle
pickle.dump(model, open('logregcvmodel.sav', 'wb'))

<a id = "part52"></a>
## PART 5.2: Incremental PCA + Support Vector Machines

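For this part, we try out Incremental PCA and SVM to predict the status of the 18th Congress bills. The results and performance of the SVM model are then compared to those of [PART 5.1](#part51). Note that the outputs below (40276 input features) indicate that the cells in this part were run on the scaled `train_set` and `test_set`, not on the PCA-transformed sets from PART 5.1.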

In [26]:
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA()
ipca.fit(train_set)

In [27]:
#determine new number of components after applying Incremental PCA
ipca.n_components_

8672

In [28]:
#transform the features training set and test set using transform method
train_set = ipca.transform(train_set)
test_set = ipca.transform(test_set)

After conducting Incremental PCA, we can now fit a Support Vector Machine using `sklearn.svm.SVC`.

**NOTE**: The estimated running time for this model is ~1 hour. This is to be expected since the output of the SVM is converted to probability estimates through [Platt scaling](https://scikit-learn.org/stable/modules/svm.html#scores-probabilities), which involves an additional internal cross-validation.

In [29]:
from sklearn import svm

classifier_instance = svm.SVC(kernel = "rbf", probability = True) #create instance of SVM
classifier_instance.fit(train_set, train_stat)

In [30]:
classifier_instance.score(test_set, test_stat)

0.8805350553505535

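After fitting the model on the training set, we can now determine the model output for our test set. A bill is predicted as *Approved* when its estimated probability of approval exceeds 0.4. Since `score` relies on `predict` rather than on this threshold, the accuracy computed from these predictions can differ slightly from the score shown above.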

In [31]:
proba = classifier_instance.predict_proba(test_set)
svm_predict = [int(x > 0.4) for x in proba[:, 1]]

[0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 
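The choice of threshold changes which bills are labelled as approved. A small sketch with made-up probabilities shows the effect; `toy_proba` and `classify` are hypothetical names standing in for `proba[:, 1]` and the list comprehension above.

```python
import numpy as np

# Toy probabilities of the positive class, standing in for proba[:, 1]
toy_proba = np.array([0.10, 0.45, 0.55, 0.38, 0.90])

def classify(probabilities, threshold):
    # Label 1 when the probability is strictly above the threshold
    return (probabilities > threshold).astype(int)

print(classify(toy_proba, 0.4))  # [0 1 1 0 1]
print(classify(toy_proba, 0.5))  # [0 0 1 0 1]
```

Lowering the threshold from 0.5 to 0.4 can only turn predictions from 0 to 1, which tends to raise recall at the cost of precision.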

### Performance Measurement

The performance of the SVM can be measured through its accuracy, precision, and recall.
- A confusion matrix is generated to obtain the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts upon feeding the test set to the model.
- Accuracy is computed as: $\frac{TP + TN}{TP + FP + TN + FN}$
- Precision is computed as: $\frac{TP}{TP + FP}$
- Recall is computed as: $\frac{TP}{TP + FN}$

The performance of the SVM can also be measured using the *Brier Score* and the *Log Loss*.
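For reference, both probabilistic metrics can be computed by hand for a binary problem: the Brier score is the mean squared difference between the predicted probability and the outcome, and the log loss is the mean negative log-probability assigned to the true class. The sketch below checks these definitions against scikit-learn on made-up values (`y_true` and `p_pos` are toy data).

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Toy labels and predicted probabilities of the positive class
y_true = np.array([0, 1, 1, 0])
p_pos = np.array([0.1, 0.8, 0.6, 0.3])

# Brier score: mean squared difference between probability and outcome
manual_brier = np.mean((p_pos - y_true) ** 2)

# Log loss: mean negative log-probability assigned to the true class
manual_log_loss = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

print(manual_brier, brier_score_loss(y_true, p_pos))
print(manual_log_loss, log_loss(y_true, p_pos))
```

For both metrics, lower values indicate better-calibrated probability estimates.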

In [32]:
from sklearn.metrics import confusion_matrix

TN, FP, FN, TP = confusion_matrix(test_stat, svm_predict).ravel() #construct confusion matrix

print('True Positive(TP)  = ', TP)
print('False Positive(FP) = ', FP)
print('True Negative(TN)  = ', TN)
print('False Negative(FN) = ', FN)

accuracy =  (TP+TN) /(TP+FP+TN+FN) * 100
precision = TP/(TP + FP) * 100
recall = TP/(TP + FN) * 100

print('Accuracy of the SVM binary classification = {:0.3f}%'.format(accuracy))
print('Precision = {:0.3f}%'.format(precision))
print('Recall = {:0.3f}%'.format(recall))

True Positive(TP)  =  579
False Positive(FP) =  123
True Negative(TN)  =  1328
False Negative(FN) =  138
Accuracy of the SVM binary classification = 87.961%
Precision = 82.479%
Recall = 80.753%


In [34]:
#determine brier score
from sklearn.metrics import brier_score_loss
brier_score = brier_score_loss(test_stat, classifier_instance.predict_proba(test_set)[:,1])
brier_score

0.09422114184656694

In [35]:
#determine log loss score
from sklearn.metrics import log_loss
loss = log_loss(test_stat, proba)
loss

0.33058951411280874

### Analyzing Incremental PCA Components

In [38]:
small.drop('approved',axis=1).columns

Index(['Author Count', 'is_partylist', 'party_1-pacman', 'party_a teacher',
       'party_aambis-owa', 'party_abang lingkod', 'party_abono',
       'party_act-cis', 'party_act-teachers', 'party_agap',
       ...
       'ꞌthe revised', 'ꞌtobacco regulation', 'ꞌtobacco regulation act',
       'ꞌtobacco regulation act 2003ꞌ', 'ꞌuniversal health',
       'ꞌuniversal health care', 'ꞌuniversal health care actꞌ',
       'ꞌurban development', 'ꞌurban development housing',
       'ꞌurban development housing act'],
      dtype='object', length=40276)

In [39]:
classifier_instance.n_features_in_

8672

In [40]:
ipca_cols = small.drop('approved', axis=1).columns
strong_rel = pd.DataFrame(ipca.components_ , columns = small.drop('approved', axis=1).columns)
strong_rel

Unnamed: 0,Author Count,is_partylist,party_1-pacman,party_a teacher,party_aambis-owa,party_abang lingkod,party_abono,party_act-cis,party_act-teachers,party_agap,...,ꞌthe revised,ꞌtobacco regulation,ꞌtobacco regulation act,ꞌtobacco regulation act 2003ꞌ,ꞌuniversal health,ꞌuniversal health care,ꞌuniversal health care actꞌ,ꞌurban development,ꞌurban development housing,ꞌurban development housing act
0,0.004185,0.003649,0.003187,0.004051,0.005035,0.002615,0.000497,0.003020,0.000162,0.000715,...,-0.000047,-0.000034,-0.000034,-0.000034,-0.000047,-0.000047,-0.000047,-0.000082,-0.000082,-0.000082
1,-0.001232,-0.001048,0.000693,-0.000788,0.001836,-0.000727,-0.000328,-0.001124,-0.000438,-0.000385,...,-0.000027,-0.000021,-0.000021,-0.000021,-0.000030,-0.000030,-0.000030,-0.000055,-0.000055,-0.000055
2,-0.002297,-0.002195,-0.000718,-0.000773,-0.001023,-0.000882,-0.000753,-0.001438,-0.000866,-0.000917,...,-0.000003,-0.000010,-0.000010,-0.000010,-0.000023,-0.000023,-0.000023,-0.000091,-0.000091,-0.000091
3,-0.002071,-0.001864,-0.000754,-0.000772,-0.001078,-0.000831,-0.000765,-0.001563,-0.000918,-0.000919,...,-0.000023,-0.000021,-0.000021,-0.000021,-0.000040,-0.000040,-0.000040,-0.000066,-0.000066,-0.000066
4,-0.008511,-0.007613,-0.002241,-0.002870,-0.003636,-0.002990,-0.002752,-0.004987,-0.002840,-0.003449,...,0.000005,-0.000002,-0.000002,-0.000002,-0.000015,-0.000015,-0.000015,-0.000133,-0.000133,-0.000133
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8667,0.000768,0.003182,-0.000295,-0.000089,-0.000174,-0.000143,-0.000093,-0.000359,-0.000199,-0.000100,...,0.002613,0.000422,0.000422,0.000422,0.002900,0.002900,0.002900,0.002188,0.002188,-0.000433
8668,-0.004325,0.007376,-0.000577,-0.000174,-0.000340,-0.000279,-0.000183,-0.000702,-0.000389,-0.000196,...,-0.000840,-0.001208,-0.001208,-0.001208,-0.004090,-0.004090,-0.004090,0.001465,0.001465,0.002383
8669,-0.003505,0.007773,-0.000626,-0.000189,-0.000369,-0.000303,-0.000199,-0.000763,-0.000422,-0.000213,...,0.003945,0.001482,0.001482,0.001482,-0.003126,-0.003126,-0.003126,-0.000973,-0.000973,-0.002545
8670,-0.005394,0.004478,-0.000302,-0.000091,-0.000178,-0.000146,-0.000096,-0.000368,-0.000203,-0.000103,...,-0.001791,0.001748,0.001748,0.001748,-0.004692,-0.004692,-0.004692,0.000314,0.000314,-0.000787


In [42]:
import numpy as np
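def get_max_column(row):
    #return the (feature name, loading) pair with the largest absolute loading in a component
    if type(row) is float:
        return
    max_i = np.argmax(row.abs())
    return (ipca_cols[max_i], row.iloc[max_i]) #use .iloc since max_i is a position, not a label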

contributions = strong_rel.apply(get_max_column, axis=1)
contributions

0                   (204 222 237, 0.07601929338460822)
1                   (180 192 193, 0.08254409874651862)
2                      (act 6734, 0.10214981277679375)
3          (10 11 12 presidential, 0.1016549539082161)
4                 (disability hiv, 0.1094445956144341)
                             ...                      
8667     (act defining penalizing, 0.1568322664708046)
8668         (science technology, 0.11836812789295442)
8669    (act promoting inclusive, 0.12665274000897608)
8670    (accredited bantay dagat, 0.25395096724053934)
8671      (care laws reorganizing, 0.4593307043554901)
Length: 8672, dtype: object

In [43]:
out = pd.DataFrame(contributions.tolist(), columns=['factor','coef'])
out

| | factor | coef |
|---|---|---|
| 0 | 204 222 237 | 0.076019 |
| 1 | 180 192 193 | 0.082544 |
| 2 | act 6734 | 0.102150 |
| 3 | 10 11 12 presidential | 0.101655 |
| 4 | disability hiv | 0.109445 |
| ... | ... | ... |
| 8667 | act defining penalizing | 0.156832 |
| 8668 | science technology | 0.118368 |
| 8669 | act promoting inclusive | 0.126653 |
| 8670 | accredited bantay dagat | 0.253951 |


In [45]:
abs_values = strong_rel.abs().sum().sort_values(ascending=False)

abs_values

4th district                     57.161296
province_baguio_city             56.852322
province_apayao                  56.073755
renewable energy                 56.032275
act mandating                    55.970867
                                   ...    
muñoz national high school        0.000000
universitygeneral santos          0.000000
wealth local                      0.000000
wealth local development          0.000000
explosives incendiary devices     0.000000
Length: 40276, dtype: float64

<a id = "part53"></a>
## PART 5.3: Analysis of Two Models

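Based on the test-set results reported in [PART 5.1](#part51) and [PART 5.2](#part52):
- Accuracy: 82.43% for PCA + Logistic Regression versus 87.96% for Incremental PCA + SVM
- Brier score: 0.1277 for PCA + Logistic Regression versus 0.0942 for Incremental PCA + SVM
- Log loss: 0.4053 for PCA + Logistic Regression versus 0.3306 for Incremental PCA + SVM

The SVM model scores better on all three metrics. Note, however, that the two models were not trained on the same feature set (3276 features before PCA versus 40276 features before Incremental PCA) and that the SVM predictions use a probability threshold of 0.4, so the comparison is indicative rather than exact.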

# PART 6: Determining Factors that Contribute to the Classification

In [None]:
#insert stuff here