# All about classifiers

This notebook picks up from the index [Philippines SONA file](https://github.com/pmagtulis/ph-sona.git). We will be using a CSV file that can be found in that repository.

The purpose of this notebook is to dig deeper into the State of the Nation Addresses (SONA) of Philippine presidents, this time by training a classifier on the speeches of two presidents, **Benigno Aquino** and **Rodrigo Duterte**, selected because they delivered their SONAs in Filipino.

## Do all your imports

In [20]:
import pandas as pd
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate, train_test_split
import stopwordsiso as stopwords
import altair as alt

## Read CSV

In [21]:
df=pd.read_csv('leaders.csv')
df.head()

Unnamed: 0,president,date,title,link,venue,session,speech
0,Elpidio Quirino,1949-01-24,The Most Urgent Aim of the Administration,http://www.officialgazette.gov.ph/1949/01/24/e...,"Legislative Building, Manila","First Congress, Third Session",\nState-of-the-Nation Message\nof\nHis Excelle...
1,Elpidio Quirino,1950-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1950/01/23/e...,"Delivered via radio broadcast from Baltimore, ...","Second Congress, First Session",\nAddress\nof\nHis Excellency Elpidio Quirino\...
2,Elpidio Quirino,1951-01-22,The State of the Nation,http://www.officialgazette.gov.ph/1951/01/22/e...,"Legislative Building, Manila","Second Congress, Second Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...
3,Elpidio Quirino,1952-01-28,The State of the Nation,http://www.officialgazette.gov.ph/1952/01/28/e...,"Legislative Building, Manila","Second Congress, Third Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...
4,Elpidio Quirino,1953-01-26,The State of the Nation,http://www.officialgazette.gov.ph/1953/01/26/e...,"Legislative Building, Manila","Second Congress, Fourth Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...


## Clean the data

We keep only the **Aquino** and **Duterte** speeches for consistency, since these were all delivered in Filipino.

In [22]:
df = df.drop(df.index[0:36]).reset_index(drop=True)
df = df.drop(df.index[12])
df

Unnamed: 0,president,date,title,link,venue,session,speech
0,Benigno S. Aquino III,2010-07-26,State of the Nation Address,http://www.officialgazette.gov.ph/2010/07/26/s...,"Batasang Pambansa, Quezon City","Fifteenth Congress, First Session",\nCLICK HERE TO WATCH THE VIDEOS\nState of the...
1,Benigno S. Aquino III,2011-07-25,Second State of the Nation Address,http://www.officialgazette.gov.ph/2011/07/25/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Second Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...
2,Benigno S. Aquino III,2012-07-23,Third State of the Nation Address,http://www.officialgazette.gov.ph/2012/07/23/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Third Session",\nVIDEO: [youtube]http://www.youtube.com/watch...
3,Benigno S. Aquino III,2013-07-22,Fourth State of the Nation Address,http://www.officialgazette.gov.ph/2013/07/22/b...,"Batasang Pambansa, Quezon City","Sixteenth Congress, First Session",\nVideo:\n[youtube]http://www.youtube.com/watc...
4,Benigno S. Aquino III,2014-07-28,Fifth State of the Nation Address,http://www.officialgazette.gov.ph/2014/07/28/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Second Session",\n[youtube]https://www.youtube.com/watch?v=VJx...
5,Benigno S. Aquino III,2015-07-27,Sixth State of the Nation Address,http://www.officialgazette.gov.ph/2015/07/27/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Third Session",\nThe 2015 State of the Nation Address. (Photo...
6,Rodrigo Roa Duterte,2016-07-25,State of the Nation Address,https://www.officialgazette.gov.ph/2016/07/25/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, First Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
7,Rodrigo Roa Duterte,2017-07-24,Second State of the Nation Address,https://www.officialgazette.gov.ph/2017/07/24/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Second Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
8,Rodrigo Roa Duterte,2018-07-23,Third State of the Nation Address,https://www.officialgazette.gov.ph/2018/07/23/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Third Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
9,Rodrigo Roa Duterte,2019-07-22,Fourth State of the Nation Address,https://www.officialgazette.gov.ph/2019/07/22/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, First Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
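The positional drops above depend on the row order of the CSV. As a minimal sketch of an alternative, assuming the `president` column holds exactly the names shown in the output, the rows could be selected by name instead. The `toy` frame below is a hypothetical stand-in for `leaders.csv`, and this sketch does not reproduce the extra `df.drop(df.index[12])` step.

```python
import pandas as pd

# Hypothetical stand-in for leaders.csv (the real file has more rows and columns).
toy = pd.DataFrame({
    "president": [
        "Elpidio Quirino",
        "Benigno S. Aquino III",
        "Rodrigo Roa Duterte",
    ],
    "date": ["1949-01-24", "2010-07-26", "2016-07-25"],
})

# Keep only the two presidents of interest, matching on the exact name strings.
keep = ["Benigno S. Aquino III", "Rodrigo Roa Duterte"]
subset = toy[toy["president"].isin(keep)].reset_index(drop=True)
print(subset)
```

Selecting by name keeps working if rows are later added to or reordered in the CSV, whereas index-based drops would silently pick different speeches.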


## Parameters

We will be using the same parameters as the original notebook.

In [56]:
def preprocess_text(text):
    # Lowercase the text and strip all digits
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    return text

y_columns = ['president', 'speeches']
BINARY=True
NGRAM_RANGE=(2,2)
MIN_DF=2
STPWORDS=stopwords.stopwords(["en", "tl"]) # English and Tagalog stopwords
STPWORDS.update(['yung', 'iyan', 'yan', 'diyan', 'applause', 'laughter', 'palakpakan', 'rin', 'din', 'po',
                'pong', 'pang', 'pa', 'nang', 'ng', 'pag',
                'kapag', 'nga', 'rodrigo', 'roa', 'benigno',
                'complex', 'congress', 'house', 'representatives',
                'session', 'hall', 'executive', 'secretary',
                'senate', 'president', 'vice',
                'leonor', 'robredo', 'excellency', 'salvador',
                'medialdea', 'belmonte', 'feliciano',
                'chief', 'quezon', 'city', 'jejomar', 'binay',
                'batasang', 'pambansa', 'joseph', 'ejercito',
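                'fidel', 'valdez', 'ramos', 'iii', 'philippines']) # adds filler words, transcript markers and proper names not covered by the package
# Binary bigram counts (TfidfVectorizer is imported as an alternative)
vectorizer = CountVectorizer(
    stop_words=list(STPWORDS),  # newer scikit-learn versions require a list, not a set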
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)

## Training a classifier

Here we compare the speeches of **Aquino** and **Duterte** to test how different their contents are from each other.

First, we clean the dataset.

### Convert to datetime

This is crucial since we will use the dates to create a new column that labels each speech as either an **Aquino** or a **Duterte** speech.

In [57]:
df.dtypes

president             object
date          datetime64[ns]
title                 object
link                  object
venue                 object
session               object
speech                object
classifier            object
y                      int64
dtype: object

In [58]:
df.date = pd.to_datetime(df.date)
df.head()

Unnamed: 0,president,date,title,link,venue,session,speech,classifier,y
0,Benigno S. Aquino III,2010-07-26,State of the Nation Address,http://www.officialgazette.gov.ph/2010/07/26/s...,"Batasang Pambansa, Quezon City","Fifteenth Congress, First Session",\nCLICK HERE TO WATCH THE VIDEOS\nState of the...,A,0
1,Benigno S. Aquino III,2011-07-25,Second State of the Nation Address,http://www.officialgazette.gov.ph/2011/07/25/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Second Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...,A,0
2,Benigno S. Aquino III,2012-07-23,Third State of the Nation Address,http://www.officialgazette.gov.ph/2012/07/23/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Third Session",\nVIDEO: [youtube]http://www.youtube.com/watch...,A,0
3,Benigno S. Aquino III,2013-07-22,Fourth State of the Nation Address,http://www.officialgazette.gov.ph/2013/07/22/b...,"Batasang Pambansa, Quezon City","Sixteenth Congress, First Session",\nVideo:\n[youtube]http://www.youtube.com/watc...,A,0
4,Benigno S. Aquino III,2014-07-28,Fifth State of the Nation Address,http://www.officialgazette.gov.ph/2014/07/28/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Second Session",\n[youtube]https://www.youtube.com/watch?v=VJx...,A,0


### Add a binary identifier column

Each speech is labelled **D** (Duterte) or **A** (Aquino) depending on the date it was delivered. In the numeric target column created later, **Duterte** is **1** and **Aquino** is **0**.

In [59]:
df['classifier'] = np.where(df['date']>= '2016-01-01', 'D', 'A')
df.head(2)

Unnamed: 0,president,date,title,link,venue,session,speech,classifier,y
0,Benigno S. Aquino III,2010-07-26,State of the Nation Address,http://www.officialgazette.gov.ph/2010/07/26/s...,"Batasang Pambansa, Quezon City","Fifteenth Congress, First Session",\nCLICK HERE TO WATCH THE VIDEOS\nState of the...,A,0
1,Benigno S. Aquino III,2011-07-25,Second State of the Nation Address,http://www.officialgazette.gov.ph/2011/07/25/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Second Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...,A,0
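As a small sketch of how the date cutoff separates the two presidents, the snippet below applies the same `np.where` rule to the last Aquino date and the first Duterte date listed in the table. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Last Aquino SONA and first Duterte SONA dates from the table above.
dates = pd.to_datetime(pd.Series(["2015-07-27", "2016-07-25"]))

# Anything on or after 1 January 2016 is labelled Duterte ('D').
labels = np.where(dates >= pd.Timestamp("2016-01-01"), "D", "A")

# Numeric target: Duterte = 1, Aquino = 0.
y_toy = (labels == "D").astype(int)
print(labels, y_toy)
```

Any cutoff between the two dates gives the same labels; labelling directly from the `president` column would be an equivalent option.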


## Tokenize, train and test

In [60]:
X = vectorizer.fit_transform(df['speech'])
df['y'] = (df.classifier == 'D').astype(int)
y = df['y']
df



Unnamed: 0,president,date,title,link,venue,session,speech,classifier,y
0,Benigno S. Aquino III,2010-07-26,State of the Nation Address,http://www.officialgazette.gov.ph/2010/07/26/s...,"Batasang Pambansa, Quezon City","Fifteenth Congress, First Session",\nCLICK HERE TO WATCH THE VIDEOS\nState of the...,A,0
1,Benigno S. Aquino III,2011-07-25,Second State of the Nation Address,http://www.officialgazette.gov.ph/2011/07/25/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Second Session",\nVIDEO:\n[youtube]http://www.youtube.com/watc...,A,0
2,Benigno S. Aquino III,2012-07-23,Third State of the Nation Address,http://www.officialgazette.gov.ph/2012/07/23/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Third Session",\nVIDEO: [youtube]http://www.youtube.com/watch...,A,0
3,Benigno S. Aquino III,2013-07-22,Fourth State of the Nation Address,http://www.officialgazette.gov.ph/2013/07/22/b...,"Batasang Pambansa, Quezon City","Sixteenth Congress, First Session",\nVideo:\n[youtube]http://www.youtube.com/watc...,A,0
4,Benigno S. Aquino III,2014-07-28,Fifth State of the Nation Address,http://www.officialgazette.gov.ph/2014/07/28/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Second Session",\n[youtube]https://www.youtube.com/watch?v=VJx...,A,0
5,Benigno S. Aquino III,2015-07-27,Sixth State of the Nation Address,http://www.officialgazette.gov.ph/2015/07/27/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Third Session",\nThe 2015 State of the Nation Address. (Photo...,A,0
6,Rodrigo Roa Duterte,2016-07-25,State of the Nation Address,https://www.officialgazette.gov.ph/2016/07/25/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, First Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...,D,1
7,Rodrigo Roa Duterte,2017-07-24,Second State of the Nation Address,https://www.officialgazette.gov.ph/2017/07/24/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Second Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...,D,1
8,Rodrigo Roa Duterte,2018-07-23,Third State of the Nation Address,https://www.officialgazette.gov.ph/2018/07/23/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Third Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...,D,1
9,Rodrigo Roa Duterte,2019-07-22,Fourth State of the Nation Address,https://www.officialgazette.gov.ph/2019/07/22/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, First Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...,D,1


In [61]:
# Train Classifier
clf = MultinomialNB(alpha=1.0e-10, class_prior=None, fit_prior=True)
clf.fit(X, y)

In [62]:
# # Redo cross validation in a way that allows us to 
# # better understand what is happening
# train_df, test_df = train_test_split(
#      df, test_size=0.2, random_state=3)

# vectorizer.fit(df['speech'])

# X_test = vectorizer.transform(test_df['speech'])
# X_train = vectorizer.transform(train_df['speech'])
# y_test = test_df['y']
# y_train = train_df['y']

# # Train Classifier
# clf = MultinomialNB(alpha=1.0e-10, class_prior=None, fit_prior=True)
# clf.fit(X_train, y_train)
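The classifier above is fitted on all speeches, so it has no held-out score. One possible way to evaluate it, sketched here on a made-up corpus, is to wrap the vectorizer and classifier in a pipeline and use `cross_validate`, so that the vocabulary is learned only from each training fold. The texts, labels and `cv=3` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: three documents per class.
texts = [
    "bilyong piso para sa pribadong sektor",
    "pribadong sektor at bilyong piso",
    "bilyong piso noong nakaraang taon",
    "illegal drugs and the supreme court",
    "the war on illegal drugs",
    "illegal drugs in metro manila",
]
labels = [0, 0, 0, 1, 1, 1]

# Vectorizer and classifier are refitted together inside every fold.
pipeline = make_pipeline(
    CountVectorizer(ngram_range=(2, 2), binary=True),
    MultinomialNB(alpha=1.0e-10),
)

scores = cross_validate(pipeline, texts, labels, cv=3)
print(scores["test_score"])
```

With only a dozen speeches in the real data, fold scores will be noisy, so they are better read as a rough sanity check than as a precise accuracy estimate.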

In [63]:
print(clf.classes_)
print(clf.class_count_)
print(clf.class_log_prior_)

# features
print(clf.feature_count_)
print(clf.feature_log_prob_)  # log P(bigram | class), one row per president
# print(clf.n_features_)
# print(clf.n_features_in_)
# print(clf.feature_names_in_)



[0 1]
[6. 6.]
[-0.69314718 -0.69314718]
[[2. 2. 2. ... 3. 1. 1.]
 [0. 0. 0. ... 0. 1. 1.]]
[[ -7.51914996  -7.51914996  -7.51914996 ...  -7.11368485  -8.21229714
   -8.21229714]
 [-30.64016308 -30.64016308 -30.64016308 ... -30.64016308  -7.61431215
   -7.61431215]]
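The values near -30 come from the very small smoothing value `alpha=1.0e-10`: a bigram never seen in a class gets probability `alpha` divided by the smoothed class total, and `log(1e-10)` alone is about -23. The sketch below recomputes `feature_log_prob_` by hand on a hypothetical count matrix and checks it against scikit-learn.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical document-term matrix: 4 documents, 3 features.
X_toy = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])
y_toy = np.array([0, 0, 1, 1])
alpha = 1.0e-10

clf_toy = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Manual version: per-class feature totals, smoothed with alpha,
# then normalised by the smoothed total for that class.
counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in (0, 1)])
smoothed = counts + alpha
manual = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))

print(manual)
print(np.allclose(manual, clf_toy.feature_log_prob_))
```

A larger `alpha` (for example the default of 1.0) would pull the zero-count values much closer to the others, which matters when interpreting the size of the gap between classes.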


In [64]:
clf.feature_log_prob_.shape

(2, 2294)

In [65]:
# get_feature_names() was removed in newer scikit-learn; use get_feature_names_out()
word_count = pd.DataFrame(clf.feature_count_, 
                             columns=vectorizer.get_feature_names_out())

word_log_prob = pd.DataFrame(clf.feature_log_prob_, 
                             columns=vectorizer.get_feature_names_out())

summary_df = pd.concat([word_count, word_log_prob], axis=0)
summary_df = summary_df.T
summary_df.columns = ['count_0', 'count_1', 'log_prob_0', 'log_prob_1']
summary_df



Unnamed: 0,count_0,count_1,log_prob_0,log_prob_1
_________________________ read,2.0,0.0,-7.519150,-30.640163
aabot pesos,2.0,0.0,-7.519150,-30.640163
aabot raw,2.0,0.0,-7.519150,-30.640163
aanhin naman,2.0,0.0,-7.519150,-30.640163
aaral philippine,3.0,0.0,-7.113685,-30.640163
...,...,...,...,...
youtube transcript,3.0,0.0,-7.113685,-30.640163
youtube watch,4.0,0.0,-6.826003,-30.640163
youtube youtube,3.0,0.0,-7.113685,-30.640163
yun lang,1.0,1.0,-8.212297,-7.614312


In [66]:
df.y.value_counts()

0    6
1    6
Name: y, dtype: int64

In [67]:
summary_df.sort_values(by='log_prob_1', ascending=False).head(25)

Unnamed: 0,count_0,count_1,log_prob_0,log_prob_1
illegal drugs,0.0,6.0,-31.238148,-5.822553
diplomatic corps,2.0,6.0,-7.51915,-5.822553
supreme court,1.0,6.0,-8.212297,-5.822553
local government,5.0,6.0,-6.602859,-5.822553
metro manila,5.0,6.0,-6.602859,-5.822553
filipino people,1.0,6.0,-8.212297,-5.822553
delivered july,6.0,6.0,-6.420538,-5.822553
address duterte,0.0,6.0,-31.238148,-5.822553
government units,3.0,6.0,-7.113685,-5.822553
duterte delivered,0.0,6.0,-31.238148,-5.822553


In [68]:
summary_df.sort_values(by='log_prob_0', ascending=False).head(25)

Unnamed: 0,count_0,count_1,log_prob_0,log_prob_1
lang naman,6.0,3.0,-6.420538,-6.5157
nakaraang taon,6.0,0.0,-6.420538,-30.640163
bilyong piso,6.0,0.0,-6.420538,-30.640163
taon lang,6.0,0.0,-6.420538,-30.640163
pribadong sektor,6.0,0.0,-6.420538,-30.640163
nation address,6.0,6.0,-6.420538,-5.822553
noong nakaraang,6.0,0.0,-6.420538,-30.640163
aquino delivered,6.0,0.0,-6.420538,-30.640163
susunod taon,6.0,1.0,-6.420538,-7.614312
address aquino,6.0,0.0,-6.420538,-30.640163
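Sorting by a single log-probability column also surfaces bigrams that both presidents use, such as `nation address`. One possible refinement, sketched below with three rows copied from the tables above, is to rank by the difference between the two columns. The `log_odds` column name is an assumption for illustration.

```python
import pandas as pd

# Three rows taken from the summary tables above.
toy_summary = pd.DataFrame(
    {
        "log_prob_0": [-31.238148, -6.420538, -6.420538],
        "log_prob_1": [-5.822553, -5.822553, -30.640163],
    },
    index=["illegal drugs", "nation address", "bilyong piso"],
)

# Positive values lean Duterte (class 1), negative values lean Aquino (class 0).
toy_summary["log_odds"] = toy_summary["log_prob_1"] - toy_summary["log_prob_0"]
ranked = toy_summary.sort_values("log_odds", ascending=False)
print(ranked)
```

Shared boilerplate ends up near zero, while bigrams used by only one president sit at the two extremes of the ranking.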


In [55]:
# The bigrams are stored in the index, so filter on the index's string methods
summary_df[summary_df.index.str.contains("natin")]

In [18]:
# summary_df.to_csv('summary_df.csv')