# All about classifiers

This notebook picks up where the index [Philippines SONA file](https://github.com/pmagtulis/ph-sona.git) left off. It uses a CSV file that can be found in the same repository.

The purpose of this notebook is to dig deeper into the State of the Nation Addresses (SONAs) of Philippine presidents, this time by training classifiers.

The question we would like to answer is **"How different are the SONAs of pre-martial law and post-martial law presidents?"**

## Do all your imports

In [1]:
import pandas as pd
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, recall_score, precision_score, accuracy_score, f1_score
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
import stopwordsiso as stopwords
import altair as alt

## Read CSV

In [2]:
df=pd.read_csv('leaders.csv')
df.head()

Unnamed: 0,president,date,title,link,venue,session,speech
0,Elpidio Quirino,1949-01-24,The Most Urgent Aim of the Administration,http://www.officialgazette.gov.ph/1949/01/24/e...,"Legislative Building, Manila","First Congress, Third Session",\nState-of-the-Nation Message\nof\nHis Excelle...
1,Elpidio Quirino,1950-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1950/01/23/e...,"Delivered via radio broadcast from Baltimore, ...","Second Congress, First Session",\nAddress\nof\nHis Excellency Elpidio Quirino\...
2,Elpidio Quirino,1951-01-22,The State of the Nation,http://www.officialgazette.gov.ph/1951/01/22/e...,"Legislative Building, Manila","Second Congress, Second Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...
3,Elpidio Quirino,1952-01-28,The State of the Nation,http://www.officialgazette.gov.ph/1952/01/28/e...,"Legislative Building, Manila","Second Congress, Third Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...
4,Elpidio Quirino,1953-01-26,The State of the Nation,http://www.officialgazette.gov.ph/1953/01/26/e...,"Legislative Building, Manila","Second Congress, Fourth Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...


We include SONAs from Elpidio Quirino up to the current president, Ferdinand R. Marcos Jr. We exclude the SONAs of Manuel Quezon and Jose P. Laurel to control for the effects of World War II, which could have dominated their speeches.

## Parameters

We will be using the same parameters as the original notebook.

In [3]:
def preprocess_text(text):
    text = text.lower() #lowercases the text
    text = re.sub(r'\d+', '', text) #removes all numbers
    return text

y_columns = ['president', 'speeches']
BINARY=False
NGRAM_RANGE=(1,1)
MIN_DF=1 #keeps every term, same effect as 0; recent scikit-learn releases reject an integer 0 here
STPWORDS=stopwords.stopwords(["en", "tl"]) #set of English and Tagalog stopwords
STPWORDS.update(['yung', 'iyan', 'yan', 'diyan', 'applause', 'laughter', 'palakpakan', 'rin', 'din', 'po',
                'pong', 'pang', 'pa', 'nang', 'ng', 'pag',
                'kapag', 'nga', 'ang']) #adds Tagalog stopwords and transcript markers not included in the package

vectorizer = CountVectorizer(
    stop_words=list(STPWORDS), #recent scikit-learn releases expect a list here, not a set
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)

## Training a classifier

Here, we compare **pre-martial law** and **post-martial law** presidents to test how different the contents of their speeches are from each other.

First we begin by cleaning the dataset.

### Convert to datetime

This is crucial since we will use the dates to create a new column that labels each speech as coming from a **pre-martial law** or **post-martial law** president.

In [4]:
df.dtypes

president    object
date         object
title        object
link         object
venue        object
session      object
speech       object
dtype: object

In [5]:
df.date = pd.to_datetime(df.date)
df.head()

Unnamed: 0,president,date,title,link,venue,session,speech
0,Elpidio Quirino,1949-01-24,The Most Urgent Aim of the Administration,http://www.officialgazette.gov.ph/1949/01/24/e...,"Legislative Building, Manila","First Congress, Third Session",\nState-of-the-Nation Message\nof\nHis Excelle...
1,Elpidio Quirino,1950-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1950/01/23/e...,"Delivered via radio broadcast from Baltimore, ...","Second Congress, First Session",\nAddress\nof\nHis Excellency Elpidio Quirino\...
2,Elpidio Quirino,1951-01-22,The State of the Nation,http://www.officialgazette.gov.ph/1951/01/22/e...,"Legislative Building, Manila","Second Congress, Second Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...
3,Elpidio Quirino,1952-01-28,The State of the Nation,http://www.officialgazette.gov.ph/1952/01/28/e...,"Legislative Building, Manila","Second Congress, Third Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...
4,Elpidio Quirino,1953-01-26,The State of the Nation,http://www.officialgazette.gov.ph/1953/01/26/e...,"Legislative Building, Manila","Second Congress, Fourth Session",\nMessage\nof\nHis Excellency Elpidio Quirino\...


### Add a binary identifier column

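The column holds **Y** (post-martial law) or **N** (pre-martial law), depending on the date the speech was delivered. The cutoff is January 1, 1986, so any speech delivered before that date, including those from the martial law years, is labeled **N**.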

In [6]:
df['classifier'] = np.where(df['date'] >= '1986-01-01', 'Y', 'N') #Y = post-ml, N = pre-ml
df.head(2)

Unnamed: 0,president,date,title,link,venue,session,speech,classifier
0,Elpidio Quirino,1949-01-24,The Most Urgent Aim of the Administration,http://www.officialgazette.gov.ph/1949/01/24/e...,"Legislative Building, Manila","First Congress, Third Session",\nState-of-the-Nation Message\nof\nHis Excelle...,N
1,Elpidio Quirino,1950-01-23,Address on the State of the Nation,http://www.officialgazette.gov.ph/1950/01/23/e...,"Delivered via radio broadcast from Baltimore, ...","Second Congress, First Session",\nAddress\nof\nHis Excellency Elpidio Quirino\...,N
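
Before training, it can help to confirm how the cutoff splits the rows. The sketch below repeats the same `np.where` labeling on a tiny made-up frame. The president names and dates are placeholders, not rows from `leaders.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "president": ["A", "B", "C"],
    "date": pd.to_datetime(["1965-01-25", "1985-07-22", "1987-07-27"]),
})

cutoff = pd.Timestamp("1986-01-01")
# 'Y' on or after the cutoff, 'N' before it
toy["classifier"] = np.where(toy["date"] >= cutoff, "Y", "N")

# Count how many rows fall in each class
label_counts = toy["classifier"].value_counts().to_dict()
print(toy)
print(label_counts)
```

Running `df['classifier'].value_counts()` on the real data would show whether the two classes are roughly balanced, which matters when reading accuracy scores.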


## Tokenize, train and test

In [7]:
X = vectorizer.fit_transform(df['speech']) # document-term count matrix
y = np.array(df['classifier'])
# Encode the labels as integers: 1 is post-ml, 0 is pre-ml
y = (y == 'Y').astype('int')



In [8]:
# Train Classifier
clf = MultinomialNB(alpha=1.0e-10, class_prior=None, fit_prior=True)
clf.fit(X, y)

In [9]:
# Test Classifier
# 5-fold cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1']
scores = cross_validate(clf, X, y, scoring=scoring, cv=5)
display(pd.DataFrame(scores).round(2))

pd.DataFrame(scores)[
    ['test_accuracy','test_precision','test_recall','test_f1']]\
    .mean().round(2)

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1
0,0.0,0.01,1.0,1.0,1.0,1.0
1,0.0,0.01,1.0,1.0,1.0,1.0
2,0.0,0.01,1.0,1.0,1.0,1.0
3,0.0,0.0,1.0,1.0,1.0,1.0
4,0.0,0.01,1.0,1.0,1.0,1.0


test_accuracy     1.0
test_precision    1.0
test_recall       1.0
test_f1           1.0
dtype: float64
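
Note that `X` was vectorized from every speech before the folds were split, so each fold's vocabulary already includes words from its held-out speeches. One way to keep the vectorizer inside each fold is a scikit-learn `Pipeline`. The sketch below shows the pattern on a made-up corpus. The sentences, labels, `alpha=1.0`, and `cv=2` are assumptions for illustration, and whether this changes the scores on the SONA data is not tested here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: label 0 and label 1 stand in for the two eras
toy_speeches = [
    "rice production and farm credit",
    "farm roads and rice mills",
    "rice tariff and farm prices",
    "farm credit for rice growers",
    "internet jobs and call centers",
    "call centers and internet access",
    "internet access and overseas jobs",
    "overseas jobs and call centers",
]
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# The vectorizer is refit on the training part of every fold
pipe = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
toy_scores = cross_validate(
    pipe, toy_speeches, toy_labels, scoring=["accuracy", "f1"], cv=2
)
print(toy_scores["test_accuracy"])
```

With the notebook's objects, the same idea would be `make_pipeline(vectorizer, clf)` passed to `cross_validate` together with `df['speech']` and `y`.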

In [10]:
# Rows of feature_count_ and feature_log_prob_ follow clf.classes_ (0 = pre-ml, 1 = post-ml)
pd.DataFrame(np.concatenate((clf.feature_count_, clf.feature_log_prob_), axis=0),
            index=['pre-ml_count', 'post-ml_count', 'preml_log_proba', 'postml_log_proba'],
            columns=vectorizer.get_feature_names_out()
            )\
    .T.sort_values(by='preml_log_proba', ascending=False)\
    .head(10)

Unnamed: 0,pre-ml_count,post-ml_count,preml_log_proba,postml_log_proba
government,418.0,595.0,-4.43794,-5.103089
people,371.0,478.0,-4.557219,-5.32204
economic,331.0,232.0,-4.671303,-6.044913
program,283.0,213.0,-4.827974,-6.130359
development,250.0,276.0,-4.95196,-5.87125
national,236.0,336.0,-5.00959,-5.67454
public,221.0,225.0,-5.075259,-6.07555
country,179.0,352.0,-5.286036,-5.62802
projects,151.0,85.0,-5.456142,-7.048999
production,149.0,26.0,-5.469475,-8.233554
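
The log-probability columns can be reproduced by hand from the counts. In `MultinomialNB`, the rows of `feature_count_` and `feature_log_prob_` follow the order of `classes_`, and each log probability is the smoothed share of a word within its class. The sketch below checks this on a small made-up count matrix. The numbers and `alpha=1.0` are assumptions chosen for readability, not values from the notebook.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical document-term counts: 4 documents, 3 words
counts = np.array([
    [3, 0, 1],
    [2, 1, 0],
    [0, 4, 1],
    [1, 3, 2],
])
labels = np.array([0, 0, 1, 1])
alpha = 1.0

nb = MultinomialNB(alpha=alpha).fit(counts, labels)

# Total count of each word per class, in the order of nb.classes_
class_counts = np.vstack([counts[labels == c].sum(axis=0) for c in nb.classes_])

# log P(word | class) = log((count + alpha) / (class total + alpha * n_words))
n_words = counts.shape[1]
manual_log_prob = np.log(
    (class_counts + alpha)
    / (class_counts.sum(axis=1, keepdims=True) + alpha * n_words)
)

# Positive values lean toward class 1, negative values toward class 0
log_ratio = manual_log_prob[1] - manual_log_prob[0]
print(np.allclose(manual_log_prob, nb.feature_log_prob_))
print(log_ratio)
```

Sorting words by the difference between the two log probabilities, rather than by one column alone, is one way to surface the words that separate the two groups of speeches.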


## Linear Support Vector Classification

In [11]:
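# Linear Support Vector Classification
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X, y)
# 4-fold cross-validation this time
scores = cross_validate(clf, X, y, scoring=scoring, cv=4)
# Keep only the mean and standard deviation of each metric
pd.DataFrame(scores).describe().round(2).loc[['mean', 'std']]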

Unnamed: 0,fit_time,score_time,test_accuracy,test_precision,test_recall,test_f1
mean,0.0,0.0,0.98,0.98,1.0,0.99
std,0.0,0.0,0.04,0.05,0.0,0.03
