# Aquino SONAs

This notebook processes all collated Aquino SONAs. Remember to run the [Philippines SONA](https://github.com/pmagtulis/ph-sona.git) scraper first to produce the **merged** CSV file that is read here.

## Do all your imports

In [1]:
import pandas as pd
import numpy as np
import re
import altair as alt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import stopwordsiso as stopwords

## Read CSV

In [2]:
merged = pd.read_csv('merged.csv')
merged

Unnamed: 0.1,Unnamed: 0,president,date,title,link,venue,session,speech
0,0,Manuel L. Quezon,"November 25, 1935",Message to the First Assembly on National Defense,http://www.officialgazette.gov.ph/1935/11/25/m...,"Legislative Building, Manila","First National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
1,1,Manuel L. Quezon,"June 16, 1936",On the Country’s Conditions and Problems,http://www.officialgazette.gov.ph/1936/06/16/m...,"Legislative Building, Manila","First National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
2,2,Manuel L. Quezon,"October 18, 1937","Improvement of Philippine Conditions, Philippi...",http://www.officialgazette.gov.ph/1937/10/18/m...,"Legislative Building, Manila","First National Assembly, Second Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
3,3,Manuel L. Quezon,"January 24, 1938",Revision of the System of Taxation,http://www.officialgazette.gov.ph/1938/01/24/m...,"Legislative Building, Manila","First National Assembly, Third Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
4,4,Manuel L. Quezon,"January 24, 1939",The State of the Nation and Important Economic...,http://www.officialgazette.gov.ph/1939/01/24/m...,"Legislative Building, Manila","Second National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
...,...,...,...,...,...,...,...,...
79,79,Rodrigo Roa Duterte,"July 23, 2018",Third State of the Nation Address,https://www.officialgazette.gov.ph/2018/07/23/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Third Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
80,80,Rodrigo Roa Duterte,"July 22, 2019",Fourth State of the Nation Address,https://www.officialgazette.gov.ph/2019/07/22/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, First Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
81,81,Rodrigo Roa Duterte,"July 27, 2020",Fifth State of the Nation Address,https://www.officialgazette.gov.ph/2020/07/27/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Second Session",\n\n\n\n\n\n\n5TH STATE OF THE NATION ADDRESS ...
82,82,Rodrigo Roa Duterte,"July 26, 2021",Sixth State of the Nation Address,https://www.officialgazette.gov.ph/2021/07/26/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Third Session",\n\n\tState of the Nation Address of \n\tRodri...


## Isolate 'Aquino' speeches

The merged file contains all speeches by Philippine presidents since 1935, so we keep only the rows for Benigno S. Aquino III.

In [3]:
aquino = merged[merged['president'] == 'Benigno S. Aquino III']  # keep only Aquino's speeches
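Later cells label the columns `SONA1` to `SONA6` by position, which assumes the filtered rows are in chronological order. A minimal sketch of how one might check that, using a toy frame rather than the real `merged.csv` (the rows and the names `toy` and `subset` are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the merged file; only the columns needed for the check.
toy = pd.DataFrame({
    "president": ["Benigno S. Aquino III", "Rodrigo Roa Duterte", "Benigno S. Aquino III"],
    "date": ["July 25, 2011", "July 25, 2016", "July 26, 2010"],
})

# Same exact-match filter as in the notebook, copied so we can modify it safely.
subset = toy[toy["president"] == "Benigno S. Aquino III"].copy()

# Parse dates written like "July 26, 2010" and sort oldest first.
subset["date"] = pd.to_datetime(subset["date"], format="%B %d, %Y")
subset = subset.sort_values("date").reset_index(drop=True)

print(len(subset), subset["date"].dt.year.tolist())
```

If the row count is not the expected six on the real data, the president's name is probably spelled differently in some rows.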

## Text analysis

Now we can proceed with the text analysis proper. First, we set the vectorizer parameters in the cells below, most importantly the stopwords we want the analysis to disregard.

In [4]:
def preprocess_text(text):
    text = text.lower()  # lowercase everything
    text = re.sub(r'\d+', '', text)  # remove all numbers
    return text

In [83]:
y_columns = ['president', 'speeches']
BINARY = False
NGRAM_RANGE = (1, 1)
MIN_DF = 1  # keeps every term; recent scikit-learn versions reject an integer min_df of 0
STPWORDS = stopwords.stopwords(['en', 'tl'])  # English and Tagalog stopwords
# add more Tagalog stopwords and transcript markers not included in the package
STPWORDS.update(['yung', 'iyan', 'yan', 'diyan', 'applause', 'laughter', 'palakpakan', 'rin', 'din', 'po',
                'pong', 'pang', 'pa', 'nang', 'ng', 'pag',
                'kapag', 'nga', 'naman', 'natin', 'kayo',
                'nating', 'tayong', 'lang'])
STPWORDS = sorted(STPWORDS)  # recent scikit-learn versions expect a list, not a set

vectorizer = CountVectorizer(
    stop_words=STPWORDS,
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)

## Vectorizing

Here we simply count how many times each word occurs in each speech.

In [84]:
X = vectorizer.fit_transform(aquino['speech'])
X



<6x8110 sparse matrix of type '<class 'numpy.int64'>'
	with 14409 stored elements in Compressed Sparse Row format>

In [85]:
aquino_vectors = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
aquino_vectors.round(2)

Unnamed: 0,____________________,_________________________,aabang,aabot,aabuso,aabusuhin,aabutan,aabutin,aabuting,aagrabyado,...,yumaman,yumaong,yumuko,yumuyuko,yun,yuri,zambales,zamboanga,zone,zte
0,0,0,0,4,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,2,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,3,1,1,0,2,0,1,0,...,0,0,0,1,1,0,0,0,0,0
3,0,0,0,3,0,1,0,1,0,0,...,1,0,1,0,2,0,1,0,0,0
4,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,1,1,1,6,2,0
5,3,1,2,3,3,0,0,1,0,0,...,0,1,0,0,15,0,0,2,0,1


In [86]:
aquino_vectors = aquino_vectors.transpose() #swapping columns and row positions

In [87]:
aquino_vectors.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']
aquino_vectors.sort_values('SONA1', ascending=False).head(20)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6
pesos,21,9,28,40,18,0
noong,13,13,32,53,25,66
taon,12,25,41,39,34,35
taumbayan,11,8,11,2,5,7
mas,11,17,16,33,38,29
pondo,11,7,6,7,10,5
nangyari,9,1,1,6,2,4
buwan,9,7,10,3,5,6
batas,9,5,9,7,6,17
porsyento,9,2,2,4,3,0


## Add a 'total' mention column

This step is optional. It adds up the mentions of each word across all six SONAs.

In [88]:
aquino_vectors['total'] = aquino_vectors[['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']].sum(axis=1)


In [89]:
aquino_vectors = aquino_vectors.sort_values('total', ascending=False)
aquino_vectors.head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6,total
noong,13,13,32,53,25,66,202
taon,12,25,41,39,34,35,186
mas,11,17,16,33,38,29,144
pilipino,7,21,27,29,14,23,121
upang,5,13,25,38,11,26,118
pesos,21,9,28,40,18,0,116
ninyo,3,11,28,28,19,25,114
bansa,2,11,16,24,22,28,103
wala,3,17,9,24,19,18,90
di,0,12,9,27,8,30,86


# TF-IDF

## Aquino speeches

In [90]:
vectorizer = TfidfVectorizer(
    stop_words=STPWORDS, 
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)
X = vectorizer.fit_transform(aquino['speech'])
aquino_idf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
aquino_idf.round(2)



Unnamed: 0,____________________,_________________________,aabang,aabot,aabuso,aabusuhin,aabutan,aabutin,aabuting,aagrabyado,...,yumaman,yumaong,yumuko,yumuyuko,yun,yuri,zambales,zamboanga,zone,zte
0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02
1,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.03,0.01,0.01,0.0,0.03,0.0,0.01,0.0,...,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,...,0.01,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0
4,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.06,0.02,0.0
5,0.03,0.01,0.01,0.01,0.02,0.0,0.0,0.01,0.0,0.0,...,0.0,0.01,0.0,0.0,0.08,0.0,0.0,0.01,0.0,0.01
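To make these scores easier to interpret, here is a minimal pure-Python sketch of the weighting, assuming scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`). The helper `tfidf_rows` and the toy counts are hypothetical and are not part of the notebook.

```python
import math

def tfidf_rows(counts):
    """counts: list of dicts mapping term -> raw count, one dict per document."""
    n_docs = len(counts)
    vocab = sorted({term for doc in counts for term in doc})
    # Document frequency: in how many documents does the term appear at least once?
    df = {t: sum(1 for doc in counts if doc.get(t, 0) > 0) for t in vocab}
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in counts:
        weighted = {t: doc.get(t, 0) * idf[t] for t in vocab}
        # L2-normalize each document so its scores have unit length.
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        rows.append({t: (w / norm if norm else 0.0) for t, w in weighted.items()})
    return rows

# Toy counts: "noong" appears in both documents, "wangwang" only in the second.
toy = [{"noong": 2, "wangwang": 0}, {"noong": 2, "wangwang": 3}]
scores = tfidf_rows(toy)
for row in scores:
    print({t: round(v, 3) for t, v in row.items()})
```

Under these defaults a term that appears in every speech still gets an idf of 1 rather than 0, which is one reason common words such as `noong` and `taon` can keep high scores in the tables that follow.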


In [91]:
aquino_idf2 = aquino_idf.transpose()
aquino_idf2.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']

In [92]:
aquino_idf2.sort_values('SONA4', ascending=False).head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6
noong,0.140109,0.101605,0.17873,0.227114,0.136156,0.253855
pesos,0.26122,0.081185,0.180496,0.197829,0.113144,0.0
taon,0.129332,0.195394,0.228998,0.167122,0.185173,0.13462
upang,0.053888,0.101605,0.139633,0.162836,0.059909,0.100004
mas,0.118554,0.132868,0.089365,0.141411,0.206958,0.111542
yong,0.0,0.0,0.0,0.134572,0.040243,0.0
di,0.0,0.108247,0.058017,0.133535,0.050286,0.133176
pilipino,0.075444,0.164131,0.150804,0.12427,0.076248,0.088465
ninyo,0.032333,0.085973,0.156389,0.119985,0.103479,0.096157
bansa,0.021555,0.085973,0.089365,0.102844,0.119818,0.107696


## Looking for specific words

In this part, we look for specific words that we think made a mark during Aquino's SONAs, either because they were mentioned often or because it is unusual for the Chief Executive to say them.

We also include here words that we think were said because they were the topic at hand at the time the speech was delivered.

In [93]:
aquino_slice = aquino_idf[['boss', 'wangwang', 'mahirap', 'corrupt']] # you can change this
aquino_slice.sort_index().round(decimals=2)

Unnamed: 0,boss,wangwang,mahirap,corrupt
0,0.0,0.0,0.01,0.02
1,0.02,0.19,0.03,0.0
2,0.04,0.0,0.01,0.01
3,0.02,0.02,0.01,0.0
4,0.09,0.0,0.01,0.0
5,0.07,0.0,0.02,0.02
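Selecting columns by name raises a `KeyError` if a word is not in the vocabulary, for example because it was never said or was removed as a stopword. A minimal sketch of a guard, using a toy frame (the values and the extra term `tuwid` are hypothetical):

```python
import pandas as pd

# Toy stand-in for aquino_idf: one column per term, one row per speech.
toy_idf = pd.DataFrame({"boss": [0.0, 0.02], "wangwang": [0.0, 0.19]})

wanted = ["boss", "wangwang", "tuwid"]  # "tuwid" is absent on purpose

# Split the wish list into terms that exist as columns and terms that do not.
present = [t for t in wanted if t in toy_idf.columns]
missing = [t for t in wanted if t not in toy_idf.columns]

toy_slice = toy_idf[present]
print("missing from vocabulary:", missing)
print(toy_slice)
```

Printing the missing terms makes it obvious when a word needs a different spelling or has been filtered out.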


In [94]:
aquino_slice = aquino_slice.stack().reset_index()
aquino_slice = aquino_slice.rename(columns={'level_0': 'sona_no', 'level_1': 'term', 0: 'tfidf'})
aquino_slice.head()

Unnamed: 0,sona_no,term,tfidf
0,0,boss,0.0
1,0,wangwang,0.0
2,0,mahirap,0.010778
3,0,corrupt,0.016809
4,1,boss,0.018041


In [95]:
top_tfidf = aquino_slice.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(10)
top_tfidf.head()

Unnamed: 0,sona_no,term,tfidf
3,0,corrupt,0.016809
2,0,mahirap,0.010778
0,0,boss,0.0
1,0,wangwang,0.0
5,1,wangwang,0.187694


## Chart it

In [96]:
# Terms in this list will get a red dot in the visualization
term_list = ['boss', 'wangwang'] # you can change this

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["sona_no"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600, height=400)
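The `transform_window` step ranks terms within each SONA. As we understand Vega-Lite's `rank()`, tied values share the same rank, so two terms with the same score would land in the same cell; that is what the small random jitter is meant to avoid. A rough pandas sketch of the same ranking on toy values (the frame and the column names `rank_min` and `rank_first` are hypothetical) can help when checking the chart data:

```python
import pandas as pd

toy = pd.DataFrame({
    "sona_no": [0, 0, 0, 1, 1, 1],
    "term": ["boss", "wangwang", "corrupt", "boss", "wangwang", "corrupt"],
    "tfidf": [0.0, 0.0, 0.02, 0.02, 0.19, 0.0],
})

grouped = toy.groupby("sona_no")["tfidf"]

# method="min": tied scores share a rank, so two terms would overlap in one cell.
toy["rank_min"] = grouped.rank(method="min", ascending=False).astype(int)

# method="first": ties are broken deterministically, without random noise.
toy["rank_first"] = grouped.rank(method="first", ascending=False).astype(int)

print(toy)
```

Breaking ties in pandas before charting is one possible alternative to the jitter, since it keeps the plotted `tfidf` values unchanged and gives the same layout on every run.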

## Entire SONAs

Here, we do the same thing for the full vocabulary of every SONA, *without* isolating key words.

In [97]:
aquino_idf = aquino_idf.stack().reset_index()
aquino_idf

Unnamed: 0,level_0,level_1,0
0,0,____________________,0.000000
1,0,_________________________,0.000000
2,0,aabang,0.000000
3,0,aabot,0.043111
4,0,aabuso,0.000000
...,...,...,...
48655,5,yuri,0.000000
48656,5,zambales,0.000000
48657,5,zamboanga,0.014210
48658,5,zone,0.000000


In [98]:
aquino_idf = aquino_idf.rename(columns={'level_0': 'sona_no','level_1': 'term', 0: 'tfidf'})
aquino_idf

Unnamed: 0,sona_no,term,tfidf
0,0,____________________,0.000000
1,0,_________________________,0.000000
2,0,aabang,0.000000
3,0,aabot,0.043111
4,0,aabuso,0.000000
...,...,...,...
48655,5,yuri,0.000000
48656,5,zambales,0.000000
48657,5,zamboanga,0.014210
48658,5,zone,0.000000


In [99]:
all_aquino = aquino_idf.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(10)
all_aquino.head()

Unnamed: 0,sona_no,term,tfidf
6220,0,pesos,0.26122
5414,0,noong,0.140109
7484,0,taon,0.129332
5300,0,natuklasan,0.121397
4187,0,mas,0.118554


In [100]:
# Terms in this list will get a red dot in the visualization
term_list = ['boss', 'wangwang']

# adding a little randomness to break ties in term ranking
all_aquino_plusRand = all_aquino.copy()
all_aquino_plusRand['tfidf'] = all_aquino_plusRand['tfidf'] + np.random.rand(all_aquino.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(all_aquino_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["sona_no"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600, height=400)