# All about classifiers

This notebook picks up where the main [Philippines SONA file](https://github.com/pmagtulis/ph-sona.git) left off. We will be using a CSV file that can be found in that repository.

The purpose of this notebook is to dig deeper into the different State of the Nation Addresses of Philippine presidents, this time by training classifiers.

## Do all your imports

In [11]:
import pandas as pd
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate, train_test_split
import stopwordsiso as stopwords
import altair as alt

## Read CSV

In [2]:
df=pd.read_csv('csv/leaders.csv')
df.head()

Unnamed: 0,president,date,title,link,venue,session,speech
0,Elpidio Quirino,1949-01-24,The Most Urgent Aim of the Administration,http://www.officialgazette.gov.ph/1949/01/24/e...,"Legislative Building, Manila","First Congress, Third Session",\nState-of-the-Nation Message\nof\nHis Excelle...
1,Elpidio Quirino,1950-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1950/01/23/e...,"Delivered via radio broadcast from Baltimore, ...","Second Congress, First Session",\nAddress\nof\nHis Excellency Elpidio Quirino\...
2,Elpidio Quirino,1951-01-22,The State of the Nation,http://www.officialgazette.gov.ph/1951/01/22/e...,"Legislative Building, Manila","Second Congress, Second Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...
3,Elpidio Quirino,1952-01-28,The State of the Nation,http://www.officialgazette.gov.ph/1952/01/28/e...,"Legislative Building, Manila","Second Congress, Third Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...
4,Elpidio Quirino,1953-01-26,The State of the Nation,http://www.officialgazette.gov.ph/1953/01/26/e...,"Legislative Building, Manila","Second Congress, Fourth Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...


## Parameters

We will be using the same parameters as the original notebook.

In [275]:
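def preprocess_text(text):
    """Lowercase the text and remove all digits."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    return text

y_columns = ['president', 'speeches']
BINARY = True         # record whether a bigram appears in a speech, not how often
NGRAM_RANGE = (2, 2)  # bigrams only
MIN_DF = 5            # keep bigrams that appear in at least 5 speeches
# Start from the packaged English and Tagalog stopword lists
STPWORDS = stopwords.stopwords(["en", "tl"])
# Add Tagalog stopwords and transcript annotations not included in the package
STPWORDS.update(['yung', 'iyan', 'yan', 'diyan', 'applause', 'laughter', 'palakpakan', 'rin', 'din', 'po',
                'pong', 'pang', 'pa', 'nang', 'ng', 'pag',
                'kapag', 'nga'])
# CountVectorizer is used here; TfidfVectorizer is imported as an alternative
vectorizer = CountVectorizer(
    stop_words=list(STPWORDS),  # scikit-learn documents stop_words as a list, not a set
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)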

## Training a classifier

Here, we will be comparing **pre-martial law** and **post-martial law** presidents to test how different the contents of their speeches were from each other.

First we begin by cleaning the dataset.

### Convert to datetime

This is crucial since we will be using the dates to create a new column that labels each speech as coming from a **pre-martial law** or a **post-martial law** president.

In [276]:
df.dtypes

president             object
date          datetime64[ns]
title                 object
link                  object
venue                 object
session               object
speech                object
classifier            object
y                      int64
dtype: object

In [277]:
df.date = pd.to_datetime(df.date)
df.head()

Unnamed: 0,president,date,title,link,venue,session,speech,classifier,y
0,Elpidio Quirino,1949-01-24,The Most Urgent Aim of the Administration,http://www.officialgazette.gov.ph/1949/01/24/e...,"Legislative Building, Manila","First Congress, Third Session",\nState-of-the-Nation Message\nof\nHis Excelle...,PE,0
1,Elpidio Quirino,1950-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1950/01/23/e...,"Delivered via radio broadcast from Baltimore, ...","Second Congress, First Session",\nAddress\nof\nHis Excellency Elpidio Quirino\...,PE,0
2,Elpidio Quirino,1951-01-22,The State of the Nation,http://www.officialgazette.gov.ph/1951/01/22/e...,"Legislative Building, Manila","Second Congress, Second Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...,PE,0
3,Elpidio Quirino,1952-01-28,The State of the Nation,http://www.officialgazette.gov.ph/1952/01/28/e...,"Legislative Building, Manila","Second Congress, Third Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...,PE,0
4,Elpidio Quirino,1953-01-26,The State of the Nation,http://www.officialgazette.gov.ph/1953/01/26/e...,"Legislative Building, Manila","Second Congress, Fourth Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...,PE,0
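
As a small illustration of the conversion step, the sketch below applies `pd.to_datetime` to a hypothetical two-row frame (the `toy` frame and its dates are assumptions, not rows from the CSV) and checks that the column can then be compared against a cutoff date.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the `date` column of the CSV.
toy = pd.DataFrame({"date": ["1949-01-24", "1987-07-27"]})

dtype_before = toy["date"].dtype           # plain strings, as read from a CSV
toy["date"] = pd.to_datetime(toy["date"])
dtype_after = toy["date"].dtype            # a datetime64 dtype after conversion

# Once converted, the column can be compared against a date cutoff.
on_or_after_cutoff = (toy["date"] >= "1986-01-01").tolist()
print(dtype_before, dtype_after, on_or_after_cutoff)
```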


### Add a binary identifier column

Each speech is labeled either **PE** (pre-martial law) or **PO** (post-martial law), depending on whether it was delivered before January 1, 1986, or on or after that date.

In [278]:
df['classifier'] = np.where(df['date']>= '1986-01-01', 'PO', 'PE')
df.head(2)

Unnamed: 0,president,date,title,link,venue,session,speech,classifier,y
0,Elpidio Quirino,1949-01-24,The Most Urgent Aim of the Administration,http://www.officialgazette.gov.ph/1949/01/24/e...,"Legislative Building, Manila","First Congress, Third Session",\nState-of-the-Nation Message\nof\nHis Excelle...,PE,0
1,Elpidio Quirino,1950-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1950/01/23/e...,"Delivered via radio broadcast from Baltimore, ...","Second Congress, First Session",\nAddress\nof\nHis Excellency Elpidio Quirino\...,PE,0
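
The labeling rule can be checked on a few dates around the cutoff. This is a minimal sketch; the four dates and the `toy` frame are hypothetical and chosen only to show how the boundary is handled. It also derives the integer target the same way the notebook later builds `y`.

```python
import numpy as np
import pandas as pd

# Hypothetical dates chosen to straddle the 1986-01-01 cutoff.
toy = pd.DataFrame({
    "date": pd.to_datetime(["1972-01-24", "1985-12-31", "1986-01-01", "2016-07-25"])
})

cutoff = pd.Timestamp("1986-01-01")
# 'PO' for speeches on or after the cutoff, 'PE' otherwise.
toy["classifier"] = np.where(toy["date"] >= cutoff, "PO", "PE")
# Integer target for the model: 1 for 'PO', 0 for 'PE'.
toy["y"] = (toy["classifier"] == "PO").astype(int)
print(toy)
```

Because the comparison uses `>=`, a speech dated exactly 1986-01-01 falls in the `PO` group.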


## Tokenize, train and test

In [279]:
X = vectorizer.fit_transform(df['speech'])
df['y'] = (df.classifier == 'PO').astype(int)
y = df['y']
df



Unnamed: 0,president,date,title,link,venue,session,speech,classifier,y
0,Elpidio Quirino,1949-01-24,The Most Urgent Aim of the Administration,http://www.officialgazette.gov.ph/1949/01/24/e...,"Legislative Building, Manila","First Congress, Third Session",\nState-of-the-Nation Message\nof\nHis Excelle...,PE,0
1,Elpidio Quirino,1950-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1950/01/23/e...,"Delivered via radio broadcast from Baltimore, ...","Second Congress, First Session",\nAddress\nof\nHis Excellency Elpidio Quirino\...,PE,0
2,Elpidio Quirino,1951-01-22,The State of the Nation,http://www.officialgazette.gov.ph/1951/01/22/e...,"Legislative Building, Manila","Second Congress, Second Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...,PE,0
3,Elpidio Quirino,1952-01-28,The State of the Nation,http://www.officialgazette.gov.ph/1952/01/28/e...,"Legislative Building, Manila","Second Congress, Third Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...,PE,0
4,Elpidio Quirino,1953-01-26,The State of the Nation,http://www.officialgazette.gov.ph/1953/01/26/e...,"Legislative Building, Manila","Second Congress, Fourth Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...,PE,0
5,Ramon Magsaysay,1954-01-25,Address on the State of the Nation,http://www.officialgazette.gov.ph/1954/01/25/r...,"Legislative Building, Manila","Third Congress, First Session",\nAddress\nof\nHis Excellency Ramon Magsaysay\...,PE,0
6,Ramon Magsaysay,1955-01-24,Address on the State of the Nation,http://www.officialgazette.gov.ph/1955/01/24/r...,"Legislative Building, Manila","Third Congress, Second Session",\nAddress\nof\nHis Excellency Ramon Magsaysay\...,PE,0
7,Ramon Magsaysay,1956-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1956/01/23/r...,"Legislative Building, Manila","Third Congress, Third Session",\nAddress\nof\nHis Excellency Ramon Magsaysay\...,PE,0
8,Ramon Magsaysay,1957-01-28,Address on the State of the Nation,http://www.officialgazette.gov.ph/1957/01/28/r...,"Legislative Building, Manila","Third Congress, Fourth Session",\nAddress\nof\nHis Excellency Ramon Magsaysay\...,PE,0
9,Diosdado Macapagal,1962-01-22,Five-Year Integrated Socio-Economic Program fo...,http://www.officialgazette.gov.ph/1962/01/22/d...,"Legislative Building, Manila","Fifth Congress, First Session",\nMessage\nof\nHis Excellency Diosdado Macapag...,PE,0


In [280]:
# Train Classifier
clf = MultinomialNB(alpha=1.0e-10, class_prior=None, fit_prior=True)
clf.fit(X, y)

MultinomialNB(alpha=1e-10)

In [281]:
# # Redo cross validation in a way that allows us to 
# # better understand what is happening
# train_df, test_df = train_test_split(
#      df, test_size=0.2, random_state=3)

# vectorizer.fit(df['speech'])

# X_test = vectorizer.transform(test_df['speech'])
# X_train = vectorizer.transform(train_df['speech'])
# y_test = test_df['y']
# y_train = train_df['y']

# # Train Classifier
# clf = MultinomialNB(alpha=1.0e-10, class_prior=None, fit_prior=True)
# clf.fit(X_train, y_train)

In [282]:
print(clf.classes_)
print(clf.class_count_)
print(clf.class_log_prior_)

# features
print(clf.feature_count_)
print(clf.feature_log_prob_)  # log P(bigram | class); row 0 is pre-martial law, row 1 is post-martial law
# print(clf.n_features_)
# print(clf.n_features_in_)
# print(clf.feature_names_in_)



[0 1]
[13. 36.]
[-1.32687094 -0.30830136]
[[ 0.  0.  3. ...  0.  0.  1.]
 [ 5.  6.  2. ... 18. 17.  4.]]
[[-30.34106932 -30.34106932  -6.2166061  ... -30.34106932 -30.34106932
   -7.31521839]
 [ -6.79256887  -6.61024731  -7.7088596  ...  -5.51163502  -5.56879344
   -7.01571242]]


In [283]:
clf.feature_log_prob_.shape

(2, 762)
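
The very negative values in `feature_log_prob_` come from the tiny smoothing parameter. The sketch below fits `MultinomialNB` on a hypothetical 3-by-3 binary matrix (the `toy_` data are assumptions, not the SONA features) and recomputes the log-probabilities by hand as log((count + alpha) / (class total + alpha * number of features)).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical binary document-term matrix: 3 documents, 3 features.
toy_X = np.array([
    [1, 0, 1],   # a class 0 document
    [1, 1, 0],   # a class 1 document
    [0, 1, 1],   # a class 1 document
])
toy_y = np.array([0, 1, 1])
alpha = 1.0e-10  # same smoothing value as in the notebook

toy_clf = MultinomialNB(alpha=alpha)
toy_clf.fit(toy_X, toy_y)

# Recompute the smoothed log-probabilities by hand.
counts = toy_clf.feature_count_                 # shape (n_classes, n_features)
totals = counts.sum(axis=1, keepdims=True)      # total count per class
manual = np.log((counts + alpha) / (totals + alpha * counts.shape[1]))

print(np.allclose(manual, toy_clf.feature_log_prob_))
print(toy_clf.feature_log_prob_[0, 1])  # feature never seen in class 0
```

With `alpha=1e-10`, a bigram that never appears in a class gets a log-probability roughly log(1e-10) ≈ -23.03 lower than a bigram that appears once. This matches the notebook's own output, where a count of 0 gives -30.341069 and a count of 1 gives -7.315218 for class 0.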

In [284]:
# get_feature_names() was removed in newer scikit-learn; use get_feature_names_out()
word_count = pd.DataFrame(clf.feature_count_, 
                             columns=vectorizer.get_feature_names_out())

word_log_prob = pd.DataFrame(clf.feature_log_prob_, 
                             columns=vectorizer.get_feature_names_out())

summary_df = pd.concat([word_count, word_log_prob], axis=0)
summary_df = summary_df.T
summary_df.columns = ['count_0', 'count_1', 'log_prob_0', 'log_prob_1']
summary_df



Unnamed: 0,count_0,count_1,log_prob_0,log_prob_1
abu sayyaf,0.0,5.0,-30.341069,-6.792569
access quality,0.0,6.0,-30.341069,-6.610247
active participation,3.0,2.0,-6.216606,-7.708860
address excellency,5.0,16.0,-5.705780,-5.629418
address rodrigo,0.0,6.0,-30.341069,-6.610247
...,...,...,...,...
workers government,0.0,12.0,-30.341069,-5.917100
youtube transcript,0.0,17.0,-30.341069,-5.568793
youtube watch,0.0,18.0,-30.341069,-5.511635
youtube youtube,0.0,17.0,-30.341069,-5.568793


In [285]:
df.y.value_counts()

1    36
0    13
Name: y, dtype: int64

In [287]:
summary_df.sort_values(by='log_prob_1', ascending=False).head(25)

Unnamed: 0,count_0,count_1,log_prob_0,log_prob_1
city july,1.0,36.0,-7.315218,-4.818488
quezon city,4.0,36.0,-5.928924,-4.818488
batasang pambansa,0.0,34.0,-30.341069,-4.875646
senate president,3.0,31.0,-6.216606,-4.96802
local government,2.0,27.0,-6.622071,-5.10617
chief justice,1.0,26.0,-7.315218,-5.14391
nation address,1.0,26.0,-7.315218,-5.14391
delivered batasang,0.0,26.0,-30.341069,-5.14391
diplomatic corps,0.0,26.0,-30.341069,-5.14391
armed forces,8.0,25.0,-5.235777,-5.183131


In [288]:
summary_df.sort_values(by='log_prob_0', ascending=False).head(25)

Unnamed: 0,count_0,count_1,log_prob_0,log_prob_1
president philippines,13.0,25.0,-4.750269,-5.183131
president speaker,12.0,6.0,-4.830312,-6.610247
economic development,12.0,8.0,-4.830312,-6.322565
irrigation systems,10.0,2.0,-5.012633,-7.70886
congress nation,10.0,4.0,-5.012633,-7.015712
central bank,10.0,6.0,-5.012633,-6.610247
national economic,9.0,3.0,-5.117994,-7.303394
industrial development,9.0,2.0,-5.117994,-7.70886
national economy,9.0,2.0,-5.117994,-7.70886
national security,9.0,7.0,-5.117994,-6.456097


In [169]:
train_df.query('speech.str.contains("natin")')

Unnamed: 0,president,date,title,link,venue,session,speech,classifier,y
36,Benigno S. Aquino III,2010-07-26,State of the Nation Address,http://www.officialgazette.gov.ph/2010/07/26/s...,"Batasang Pambansa, Quezon City","Fifteenth Congress, First Session",\nCLICK HERE TO WATCH THE VIDEOS\nState of the...,PO,1
23,Fidel V. Ramos,1997-07-28,The Challenges Still Ahead,http://www.officialgazette.gov.ph/1997/07/28/f...,"Batasang Pambansa, Quezon City","Tenth Congress, Third Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...,PO,1
37,Benigno S. Aquino III,2011-07-25,Second State of the Nation Address,http://www.officialgazette.gov.ph/2011/07/25/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Second Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...,PO,1
25,Joseph Ejercito Estrada,1999-07-26,A Poverty-Free Philippines,http://www.officialgazette.gov.ph/1999/07/26/j...,"Batasang Pambansa, Quezon City","Eleventh Congress, Second Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...,PO,1
16,Corazon C. Aquino,1990-07-23,The State of the Nation,http://www.officialgazette.gov.ph/1990/07/23/c...,"Batasang Pambansa, Quezon City","Eighth Congress, Fourth Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...,PO,1
30,Gloria Macapagal-Arroyo,2004-07-26,Fourth State of the Nation Address,http://www.officialgazette.gov.ph/2004/07/26/g...,"Batasang Pambansa, Quezon City","Thirteenth Congress, First Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...,PO,1
7,Ramon Magsaysay,1956-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1956/01/23/r...,"Legislative Building, Manila","Third Congress, Third Session",\nAddress\nof\nHis Excellency Ramon Magsaysay\...,PE,0
40,Benigno S. Aquino III,2014-07-28,Fifth State of the Nation Address,http://www.officialgazette.gov.ph/2014/07/28/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Second Session",\n[youtube]https://www.youtube.com/watch?v=VJx...,PO,1
27,Gloria Macapagal-Arroyo,2001-07-23,First State of the Nation Address,http://www.officialgazette.gov.ph/2001/07/23/g...,"Batasang Pambansa, Quezon City","Twelfth Congress, First Session",\nFirst State of the Nation AddressBy Gloria M...,PO,1
34,Gloria Macapagal-Arroyo,2008-07-28,Eighth State of the Nation Address,http://www.officialgazette.gov.ph/2008/07/28/g...,"Batasang Pambansa, Quezon City","Fourteenth Congress, Second Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...,PO,1


In [233]:
summary_df.to_csv('summary_df.csv')

In [231]:
# # Test Classifier
# # 4-fold cross-validation (cv=4)
# scoring = ['accuracy', 'precision', 'recall', 'f1']
# scores = cross_validate(clf, X, y, scoring=scoring, cv=4)
# display(pd.DataFrame(scores).round(2))

# pd.DataFrame(scores)[
#     ['test_accuracy','test_precision','test_recall','test_f1']]\
#     .mean().round(2)
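
One possible way to run the cross-validation above is sketched below on a hypothetical eight-sentence corpus (the sentences, labels, and `toy_` names are assumptions, not the SONA data). Unlike the commented-out cell, which scores a matrix already vectorized on all speeches, this sketch wraps the vectorizer and classifier in a pipeline so the vocabulary is re-fit on the training part of each fold. That is a design choice of the sketch, not something the original notebook does.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for df['speech'] and df['y'].
toy_texts = [
    "irrigation systems and central bank",
    "central bank and irrigation systems policy",
    "irrigation systems for rural farms",
    "central bank reserves and irrigation systems",
    "batasang pambansa in quezon city",
    "quezon city hosts batasang pambansa",
    "batasang pambansa session quezon city",
    "address at batasang pambansa quezon city",
]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorizer and classifier in one pipeline, re-fit inside every fold.
toy_pipeline = make_pipeline(
    CountVectorizer(ngram_range=(2, 2), binary=True),
    MultinomialNB(alpha=1.0e-10),
)

scoring = ["accuracy", "precision", "recall", "f1"]
toy_scores = cross_validate(toy_pipeline, toy_texts, toy_y, scoring=scoring, cv=4)

# Average each test metric across the four folds.
toy_summary = pd.DataFrame(toy_scores)[[f"test_{m}" for m in scoring]].mean().round(2)
print(toy_summary)
```

With only 13 pre-martial law speeches against 36 post-martial law ones, accuracy alone can look good even for a model that mostly predicts the larger class, which is why precision, recall, and F1 are reported alongside it.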

In [232]:
# pd.DataFrame(np.concatenate((clf.feature_count_, clf.feature_log_prob_), axis=0),
#             index=['pre-ml_count', 'post-ml_count', 'preml_log_proba', 'postml_log_proba'],
#             columns=vectorizer.get_feature_names_out()
#             )\
#     .T.sort_values(by='postml_log_proba', ascending=False)\
#     .head(10)