# All about classifiers

This notebook builds on the [Philippines SONA](https://github.com/pmagtulis/ph-sona.git) repository. We will be using a CSV file that can be found in that repository.

The purpose of this notebook is to dig deeper into the State of the Nation Addresses (SONA) of Philippine presidents, this time by training a classifier on the speeches of two specific presidents: **Benigno Aquino** and **Rodrigo Duterte**. They were selected because they delivered their SONAs in Filipino.

## Do all your imports

In [6]:
import pandas as pd
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_validate, train_test_split
import stopwordsiso as stopwords
import altair as alt

## Read CSV

In [7]:
df=pd.read_csv('csv/merged.csv')
df.head()

Unnamed: 0,president,date,title,link,venue,session,speech
0,Manuel L. Quezon,"November 25, 1935",Message to the First Assembly on National Defense,http://www.officialgazette.gov.ph/1935/11/25/m...,"Legislative Building, Manila","First National Assembly, First Session","Mr. Speaker, gentlemen of the National Assemb..."
1,Manuel L. Quezon,"June 16, 1936",On the Country’s Conditions and Problems,http://www.officialgazette.gov.ph/1936/06/16/m...,"Legislative Building, Manila","First National Assembly, First Session","Mr. Speaker, Gentlemen of the National Assemb..."
2,Manuel L. Quezon,"October 18, 1937","Improvement of Philippine Conditions, Philippi...",http://www.officialgazette.gov.ph/1937/10/18/m...,"Legislative Building, Manila","First National Assembly, Second Session","Mr. Speaker, Gentlemen of the National Assemb..."
3,Manuel L. Quezon,"January 24, 1938",Revision of the System of Taxation,http://www.officialgazette.gov.ph/1938/01/24/m...,"Legislative Building, Manila","First National Assembly, Third Session",Gentlemen of the National Assembly: The state...
4,Manuel L. Quezon,"January 24, 1939",The State of the Nation and Important Economic...,http://www.officialgazette.gov.ph/1939/01/24/m...,"Legislative Building, Manila","Second National Assembly, First Session",Gentlemen of the National Assembly: I take pl...


## Clean the data

We only want the **Aquino** and **Duterte** speeches, for consistency, because they were all delivered in Filipino.

In [8]:
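# Drop the first 71 rows (speeches before Benigno S. Aquino III); reset_index() keeps the old index as a column
df = df.drop(df.index[0:71]).reset_index()
# Drop the row at position 12 of the remaining speeches
df = df.drop(df.index[12])
df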

Unnamed: 0,index,president,date,title,link,venue,session,speech
0,71,Benigno S. Aquino III,"July 26, 2010",State of the Nation Address,http://www.officialgazette.gov.ph/2010/07/26/s...,"Batasang Pambansa, Quezon City","Fifteenth Congress, First Session",Maraming salamat po. Maupo po tayong lahat. S...
1,72,Benigno S. Aquino III,"July 25, 2011",Second State of the Nation Address,http://www.officialgazette.gov.ph/2011/07/25/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Second Session",Senate President Juan Ponce Enrile; Speaker F...
2,73,Benigno S. Aquino III,"July 23, 2012",Third State of the Nation Address,http://www.officialgazette.gov.ph/2012/07/23/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Third Session",Maraming salamat po. Maupo ho tayong lahat. S...
3,74,Benigno S. Aquino III,"July 22, 2013",Fourth State of the Nation Address,http://www.officialgazette.gov.ph/2013/07/22/b...,"Batasang Pambansa, Quezon City","Sixteenth Congress, First Session",Marami pong salamat. Maupo ho tayong lahat. B...
4,75,Benigno S. Aquino III,"July 28, 2014",Fifth State of the Nation Address,http://www.officialgazette.gov.ph/2014/07/28/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Second Session",Bise Presidente Jejomar Binay; dating Pangulo...
5,76,Benigno S. Aquino III,"July 27, 2015",Sixth State of the Nation Address,http://www.officialgazette.gov.ph/2015/07/27/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Third Session",Maraming salamat po. Maupo ho tayo lahat. Bag...
6,77,Rodrigo Roa Duterte,"July 25, 2016",State of the Nation Address,https://www.officialgazette.gov.ph/2016/07/25/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, First Session",Thank you. Please allow me a little bit of ...
7,78,Rodrigo Roa Duterte,"July 24, 2017",Second State of the Nation Address,https://www.officialgazette.gov.ph/2017/07/24/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Second Session",Kindly sit down. Thank you for your courtes...
8,79,Rodrigo Roa Duterte,"July 23, 2018",Third State of the Nation Address,https://www.officialgazette.gov.ph/2018/07/23/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Third Session",Kindly sit down. Thank you for your courtesy....
9,80,Rodrigo Roa Duterte,"July 22, 2019",Fourth State of the Nation Address,https://www.officialgazette.gov.ph/2019/07/22/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, First Session",Thank you. Kindly sit down. Kumusta po kayo...
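Positional slices such as `df.index[0:71]` depend on the row order of the CSV. As a hedged alternative, the same subset could be selected by name. The sketch below uses a small invented DataFrame (`toy`), not the real `merged.csv`, and the name "Corazon C. Aquino" is only an assumed example of another president sharing the surname.

```python
import pandas as pd

# Toy stand-in for the merged CSV; names other than the two targets are illustrative.
toy = pd.DataFrame({
    "president": [
        "Manuel L. Quezon",
        "Corazon C. Aquino",
        "Benigno S. Aquino III",
        "Rodrigo Roa Duterte",
    ],
    "speech": ["...", "...", "...", "..."],
})

# Match the full names exactly so that other presidents named Aquino are excluded.
keep = ["Benigno S. Aquino III", "Rodrigo Roa Duterte"]
subset = toy[toy["president"].isin(keep)].reset_index(drop=True)
print(subset["president"].tolist())
```

Selecting by name keeps working if rows are added to or reordered in the source file.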


## Parameters

We will be using the same parameters as the original notebook.

In [9]:
def preprocess_text(text):
    # Lowercase the text and remove all numbers before tokenizing
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    return text

y_columns = ['president', 'speeches']
BINARY=True
NGRAM_RANGE=(2,2)
MIN_DF=5
STPWORDS=stopwords.stopwords(["en", "tl"]) #English and Tagalog stopwords
STPWORDS.update(['yung', 'iyan', 'yan', 'diyan', 'applause', 'laughter', 'palakpakan', 'rin', 'din', 'po',
                'pong', 'pang', 'pa', 'nang', 'ng', 'pag',
                'kapag', 'nga', 'rodrigo', 'roa', 'benigno',
                'complex', 'congress', 'house', 'representatives',
                'session', 'hall', 'executive secretary',
                'senate president', 'vice president',
                'leonor', 'robredo', 'excellency',
                'medialdea', 'belmonte', 'feliciano',
                'chief justice', 'quezon city']) #adds more stopwords not included in the package
# Note: stop words are matched against single tokens, so the two-word entries
# above (e.g. 'quezon city') do not remove anything from the bigrams.
# CountVectorizer with binary=True records presence/absence of each bigram
vectorizer = CountVectorizer(
    stop_words=list(STPWORDS),  # a list is accepted across scikit-learn versions
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)

## Training a classifier

Here, we will be comparing the speeches of **Aquino** and **Duterte** to test how different the contents of their speeches are from each other.

First we begin by cleaning the dataset.

### Convert to datetime

This is crucial since we will be using the dates to create a new column that will serve as our class label for **Aquino** and **Duterte** speeches.

In [10]:
df.dtypes

index         int64
president    object
date         object
title        object
link         object
venue        object
session      object
speech       object
dtype: object

In [11]:
df.date = pd.to_datetime(df.date)
df.head()

Unnamed: 0,index,president,date,title,link,venue,session,speech
0,71,Benigno S. Aquino III,2010-07-26,State of the Nation Address,http://www.officialgazette.gov.ph/2010/07/26/s...,"Batasang Pambansa, Quezon City","Fifteenth Congress, First Session",Maraming salamat po. Maupo po tayong lahat. S...
1,72,Benigno S. Aquino III,2011-07-25,Second State of the Nation Address,http://www.officialgazette.gov.ph/2011/07/25/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Second Session",Senate President Juan Ponce Enrile; Speaker F...
2,73,Benigno S. Aquino III,2012-07-23,Third State of the Nation Address,http://www.officialgazette.gov.ph/2012/07/23/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Third Session",Maraming salamat po. Maupo ho tayong lahat. S...
3,74,Benigno S. Aquino III,2013-07-22,Fourth State of the Nation Address,http://www.officialgazette.gov.ph/2013/07/22/b...,"Batasang Pambansa, Quezon City","Sixteenth Congress, First Session",Marami pong salamat. Maupo ho tayong lahat. B...
4,75,Benigno S. Aquino III,2014-07-28,Fifth State of the Nation Address,http://www.officialgazette.gov.ph/2014/07/28/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Second Session",Bise Presidente Jejomar Binay; dating Pangulo...
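The conversion above lets pandas infer the date format. A small sketch, assuming all dates follow the "July 26, 2010" pattern seen in the table, shows the same conversion with an explicit format string, which fails loudly if a row deviates from the pattern.

```python
import pandas as pd

# Two sample date strings in the same style as the `date` column.
dates = pd.Series(["July 26, 2010", "July 25, 2016"])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates, format="%B %d, %Y")
print(parsed.dt.year.tolist())
```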


### Add a binary identifier column

This label is either **Duterte** (`D`) or **Aquino** (`A`), depending on the date the speech was delivered. Duterte will later be encoded as **1**.

In [12]:
df['classifier'] = np.where(df['date']>= '2016-01-01', 'D', 'A')
df.head(2)

Unnamed: 0,index,president,date,title,link,venue,session,speech,classifier
0,71,Benigno S. Aquino III,2010-07-26,State of the Nation Address,http://www.officialgazette.gov.ph/2010/07/26/s...,"Batasang Pambansa, Quezon City","Fifteenth Congress, First Session",Maraming salamat po. Maupo po tayong lahat. S...,A
1,72,Benigno S. Aquino III,2011-07-25,Second State of the Nation Address,http://www.officialgazette.gov.ph/2011/07/25/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Second Session",Senate President Juan Ponce Enrile; Speaker F...,A
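Because the label is derived from a date cutoff rather than from the `president` column, it can be worth cross-checking the two. The following is a minimal sketch on an invented two-row frame (`toy` and `by_name` are hypothetical names).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "president": ["Benigno S. Aquino III", "Rodrigo Roa Duterte"],
    "date": pd.to_datetime(["2015-07-27", "2016-07-25"]),
})

# Date-based label, same cutoff as in the notebook
toy["classifier"] = np.where(toy["date"] >= pd.Timestamp("2016-01-01"), "D", "A")

# Name-based label used only as a sanity check
by_name = np.where(toy["president"].str.contains("Duterte"), "D", "A")
assert (toy["classifier"] == by_name).all()
print(toy["classifier"].tolist())
```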


## Tokenize, train and test

In [13]:
X = vectorizer.fit_transform(df['speech'])
df['y'] = (df.classifier == 'D').astype(int)
y = df['y']
df



Unnamed: 0,index,president,date,title,link,venue,session,speech,classifier,y
0,71,Benigno S. Aquino III,2010-07-26,State of the Nation Address,http://www.officialgazette.gov.ph/2010/07/26/s...,"Batasang Pambansa, Quezon City","Fifteenth Congress, First Session",Maraming salamat po. Maupo po tayong lahat. S...,A,0
1,72,Benigno S. Aquino III,2011-07-25,Second State of the Nation Address,http://www.officialgazette.gov.ph/2011/07/25/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Second Session",Senate President Juan Ponce Enrile; Speaker F...,A,0
2,73,Benigno S. Aquino III,2012-07-23,Third State of the Nation Address,http://www.officialgazette.gov.ph/2012/07/23/b...,"Batasang Pambansa, Quezon City","Fifteenth Congress, Third Session",Maraming salamat po. Maupo ho tayong lahat. S...,A,0
3,74,Benigno S. Aquino III,2013-07-22,Fourth State of the Nation Address,http://www.officialgazette.gov.ph/2013/07/22/b...,"Batasang Pambansa, Quezon City","Sixteenth Congress, First Session",Marami pong salamat. Maupo ho tayong lahat. B...,A,0
4,75,Benigno S. Aquino III,2014-07-28,Fifth State of the Nation Address,http://www.officialgazette.gov.ph/2014/07/28/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Second Session",Bise Presidente Jejomar Binay; dating Pangulo...,A,0
5,76,Benigno S. Aquino III,2015-07-27,Sixth State of the Nation Address,http://www.officialgazette.gov.ph/2015/07/27/p...,"Batasang Pambansa, Quezon City","Sixteenth Congress, Third Session",Maraming salamat po. Maupo ho tayo lahat. Bag...,A,0
6,77,Rodrigo Roa Duterte,2016-07-25,State of the Nation Address,https://www.officialgazette.gov.ph/2016/07/25/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, First Session",Thank you. Please allow me a little bit of ...,D,1
7,78,Rodrigo Roa Duterte,2017-07-24,Second State of the Nation Address,https://www.officialgazette.gov.ph/2017/07/24/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Second Session",Kindly sit down. Thank you for your courtes...,D,1
8,79,Rodrigo Roa Duterte,2018-07-23,Third State of the Nation Address,https://www.officialgazette.gov.ph/2018/07/23/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Third Session",Kindly sit down. Thank you for your courtesy....,D,1
9,80,Rodrigo Roa Duterte,2019-07-22,Fourth State of the Nation Address,https://www.officialgazette.gov.ph/2019/07/22/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, First Session",Thank you. Kindly sit down. Kumusta po kayo...,D,1


In [14]:
# Train Classifier
clf = MultinomialNB(alpha=1.0e-10, class_prior=None, fit_prior=True)
clf.fit(X, y)

In [15]:
# # Redo cross validation in a way that allows us to 
# # better understand what is happening
# train_df, test_df = train_test_split(
#      df, test_size=0.2, random_state=3)

# vectorizer.fit(df['speech'])

# X_test = vectorizer.transform(test_df['speech'])
# X_train = vectorizer.transform(train_df['speech'])
# y_test = test_df['y']
# y_train = train_df['y']

# # Train Classifier
# clf = MultinomialNB(alpha=1.0e-10, class_prior=None, fit_prior=True)
# clf.fit(X_train, y_train)
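The commented-out cell fits the vectorizer on the full DataFrame before splitting, which lets vocabulary from the test speeches leak into training. One possible way to avoid that is to wrap the vectorizer and classifier in a pipeline so both are fitted on the training split only. The sketch below uses an invented eight-document corpus and default-like settings, not the notebook's parameters.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy documents; label 0 and 1 stand in for the two presidents.
texts = [
    "daang matuwid reporma",
    "reporma sa budget daang matuwid",
    "daang matuwid para sa lahat",
    "budget reporma ngayon",
    "illegal drugs campaign",
    "campaign against illegal drugs",
    "war on illegal drugs",
    "drugs and corruption campaign",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# stratify keeps one document of each class in the test split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=3
)

# The vectorizer learns its vocabulary from the training texts only.
model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

With only a dozen real speeches, a single split like this gives a very noisy estimate, so it is best read as a sanity check rather than a performance figure.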

In [16]:
print(clf.classes_)
print(clf.class_count_)
print(clf.class_log_prior_)

# features
print(clf.feature_count_)
print(clf.feature_log_prob_)  # log ( prob(bigram | class) ), one row per class
# print(clf.n_features_)
# print(clf.n_features_in_)
# print(clf.feature_names_in_)



[0 1]
[6. 6.]
[-0.69314718 -0.69314718]
[[5. 2. 5. 3. 5. 2. 6. 5. 5. 5. 1. 3. 6. 0. 2. 0. 5. 1. 1. 6. 1. 3. 1. 0.
  5. 3. 1. 5. 0. 3. 6. 5. 3. 5. 5. 5. 5. 5. 5. 3. 3. 5. 2. 0. 6. 6. 3. 3.
  5. 3. 0. 5. 5. 6. 3. 2. 5. 5. 5. 5. 5. 5. 5. 4. 5. 6. 6. 4. 4. 4. 6. 1.
  3. 2. 3. 5. 6. 1. 4. 5. 5. 5. 5. 5. 3. 3. 0. 5. 1. 6. 4. 3. 5. 0. 5. 6.
  5. 0. 6. 0. 0. 5. 1. 6. 6. 3. 5. 5. 5. 2. 5. 1. 6. 2. 1. 3. 3. 2. 3. 5.
  3. 1.]
 [3. 3. 1. 5. 0. 4. 0. 0. 0. 6. 4. 2. 0. 5. 6. 5. 4. 6. 4. 1. 6. 4. 4. 5.
  2. 6. 4. 1. 6. 2. 0. 5. 2. 0. 0. 0. 0. 0. 0. 2. 2. 0. 3. 5. 3. 2. 5. 2.
  6. 2. 5. 1. 0. 1. 2. 4. 0. 0. 0. 0. 6. 0. 0. 2. 0. 0. 1. 4. 1. 1. 0. 4.
  2. 5. 3. 1. 0. 5. 2. 0. 0. 0. 0. 0. 2. 4. 6. 0. 4. 0. 1. 2. 3. 6. 0. 0.
  0. 6. 6. 5. 5. 0. 6. 1. 0. 4. 0. 0. 0. 3. 0. 4. 1. 6. 4. 5. 3. 3. 4. 3.
  4. 4.]]
[[ -4.48413186  -5.40042259  -4.48413186  -4.99495748  -4.48413186
   -5.40042259  -4.3018103   -4.48413186  -4.48413186  -4.48413186
   -6.09356977  -4.99495748  -4.3018103  -29.1194207   -5.40042259


In [17]:
clf.feature_log_prob_.shape

(2, 122)
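The very negative values in the output (around -29) come from the tiny smoothing constant `alpha=1.0e-10`. For a bigram with count `c` in a class whose counts sum to `N` over `V` features, the smoothed log probability is `log((c + alpha) / (N + alpha * V))`. In the printed output, a count of 1 gives about -6.09 and a count of 0 gives about -29.12, which is consistent with `N` being roughly 443 for class 0. The sketch below checks the formula against scikit-learn on a small invented matrix (`X_toy` and `y_toy` are assumptions).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy binary bigram matrix: 2 documents per class, 3 features.
X_toy = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 1, 1]])
y_toy = np.array([0, 0, 1, 1])

alpha = 1.0e-10
nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Recompute the smoothed log probabilities by hand
counts = nb.feature_count_
totals = counts.sum(axis=1, keepdims=True)
manual = np.log((counts + alpha) / (totals + alpha * counts.shape[1]))

print(np.allclose(manual, nb.feature_log_prob_))
# Feature 2 never occurs in class 0, so its log probability is close to log(alpha / N)
print(nb.feature_log_prob_[0, 2])
```

A larger `alpha` (for example 1.0) would pull these zero-count values much closer to the rest, at the cost of blurring the difference between the classes.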

In [18]:
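# get_feature_names() was removed in newer scikit-learn; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()

# Row 0 = class 0 (Aquino), row 1 = class 1 (Duterte)
word_count = pd.DataFrame(clf.feature_count_,
                             columns=feature_names)

word_log_prob = pd.DataFrame(clf.feature_log_prob_,
                             columns=feature_names)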

summary_df = pd.concat([word_count, word_log_prob], axis=0)
summary_df = summary_df.T
summary_df.columns = ['count_0', 'count_1', 'log_prob_0', 'log_prob_1']
summary_df



Unnamed: 0,count_0,count_1,log_prob_0,log_prob_1
alam naman,5.0,3.0,-4.484132,-4.581560
alternative learning,2.0,3.0,-5.400423,-4.581560
araw araw,5.0,1.0,-4.484132,-5.680173
armed forces,3.0,5.0,-4.994957,-4.070735
babes singson,5.0,0.0,-4.484132,-28.706024
...,...,...,...,...
wala kayong,2.0,3.0,-5.400423,-4.581560
wala naman,3.0,4.0,-4.994957,-4.293878
wala tayong,5.0,3.0,-4.484132,-4.581560
west philippine,3.0,4.0,-4.994957,-4.293878


In [19]:
df.y.value_counts()

0    6
1    6
Name: y, dtype: int64

In [20]:
summary_df.sort_values(by='log_prob_1', ascending=False).head(25)

Unnamed: 0,count_0,count_1,log_prob_0,log_prob_1
metro manila,5.0,6.0,-4.484132,-3.888413
diplomatic corps,2.0,6.0,-5.400423,-3.888413
illegal drugs,0.0,6.0,-29.119421,-3.888413
senate president,6.0,6.0,-4.30181,-3.888413
government units,3.0,6.0,-4.994957,-3.888413
secretary salvador,0.0,6.0,-29.119421,-3.888413
salvador cabinet,0.0,6.0,-29.119421,-3.888413
filipino people,1.0,6.0,-6.09357,-3.888413
vice president,2.0,6.0,-5.400423,-3.888413
president maria,0.0,6.0,-29.119421,-3.888413


In [16]:
summary_df.sort_values(by='log_prob_0', ascending=False).head(25)

Unnamed: 0,count_0,count_1,log_prob_0,log_prob_1
address aquino,6.0,0.0,-4.446565,-28.917495
sandatahang lakas,6.0,0.0,-4.446565,-28.917495
quezon city,6.0,6.0,-4.446565,-4.099885
pribadong sektor,6.0,0.0,-4.446565,-28.917495
president philippines,6.0,6.0,-4.446565,-4.099885
fidel valdez,6.0,1.0,-4.446565,-5.891644
philippines delivered,6.0,6.0,-4.446565,-4.099885
pambansa quezon,6.0,6.0,-4.446565,-4.099885
iii president,6.0,0.0,-4.446565,-28.917495
noong nakaraang,6.0,0.0,-4.446565,-28.917495
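Sorting by a single class's log probability surfaces bigrams that are frequent in both classes, such as "metro manila" or "quezon city". One hedged alternative is to rank by the difference between the two log probabilities (the log-odds), which highlights bigrams that separate the classes. The sketch below uses three rows copied from the tables above; the column name `log_odds` is an addition for illustration.

```python
import pandas as pd

# Three rows taken from the summary tables above
summary = pd.DataFrame(
    {
        "log_prob_0": [-4.484132, -29.119421, -4.484132],
        "log_prob_1": [-3.888413, -3.888413, -28.706024],
    },
    index=["metro manila", "illegal drugs", "babes singson"],
)

# Positive values lean toward class 1 (Duterte), negative toward class 0 (Aquino)
summary["log_odds"] = summary["log_prob_1"] - summary["log_prob_0"]
ranked = summary.sort_values("log_odds", ascending=False)
print(ranked)
```

In this small example "illegal drugs" ranks first, "babes singson" last, and the shared bigram "metro manila" sits near zero in between.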


In [17]:
# train_df only exists if the train/test split cell above is uncommented,
# so search the full DataFrame instead
df[df['speech'].str.contains('natin')]

In [18]:
summary_df.to_csv('summary_df.csv')

In [None]:
# # Test Classifier
# # 4-fold cross-validation
# scoring = ['accuracy', 'precision', 'recall', 'f1']
# scores = cross_validate(clf, X, y, scoring=scoring, cv=4)
# display(pd.DataFrame(scores).round(2))

# pd.DataFrame(scores)[
#     ['test_accuracy','test_precision','test_recall','test_f1']]\
#     .mean().round(2)
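The commented-out cell cross-validates `clf` on a matrix `X` that was vectorized from all speeches at once. A possible variant, sketched here on an invented eight-document corpus, passes a pipeline to `cross_validate` so the vocabulary is rebuilt inside each fold, and uses `StratifiedKFold` so every fold contains both classes.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy documents; label 0 and 1 stand in for the two presidents.
texts = [
    "daang matuwid reporma",
    "reporma sa budget daang matuwid",
    "daang matuwid para sa lahat",
    "budget reporma ngayon",
    "illegal drugs campaign",
    "campaign against illegal drugs",
    "war on illegal drugs",
    "drugs and corruption campaign",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=3)

# Each fold refits the vectorizer and the classifier on its own training part
scores = cross_validate(model, texts, labels, scoring=["accuracy", "f1"], cv=cv)
print(pd.DataFrame(scores)[["test_accuracy", "test_f1"]].round(2))
```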

In [None]:
# # Rows follow clf.classes_: class 0 (Aquino) first, then class 1 (Duterte)
# pd.DataFrame(np.concatenate((clf.feature_count_, clf.feature_log_prob_), axis=0),
#             index=['aquino_count', 'duterte_count', 'aquino_log_proba', 'duterte_log_proba'],
#             columns=vectorizer.get_feature_names_out()
#             )\
#     .T.sort_values(by='duterte_log_proba', ascending=False)\
#     .head(10)