# Aquino SONAs

This notebook processes all of the collated Aquino SONAs (State of the Nation Addresses). Remember to run the [Philippines SONA](https://github.com/pmagtulis/ph-sona.git) scraper first to produce the **merged** CSV file that is read here.

## Do all your imports

In [1]:
import pandas as pd
import numpy as np
import re
import altair as alt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import stopwordsiso as stopwords

## Read CSV

In [2]:
merged= pd.read_csv('merged.csv')
merged

Unnamed: 0.1,Unnamed: 0,president,date,title,link,venue,session,speech
0,0,Manuel L. Quezon,"November 25, 1935",Message to the First Assembly on National Defense,http://www.officialgazette.gov.ph/1935/11/25/m...,"Legislative Building, Manila","First National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
1,1,Manuel L. Quezon,"June 16, 1936",On the Country’s Conditions and Problems,http://www.officialgazette.gov.ph/1936/06/16/m...,"Legislative Building, Manila","First National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
2,2,Manuel L. Quezon,"October 18, 1937","Improvement of Philippine Conditions, Philippi...",http://www.officialgazette.gov.ph/1937/10/18/m...,"Legislative Building, Manila","First National Assembly, Second Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
3,3,Manuel L. Quezon,"January 24, 1938",Revision of the System of Taxation,http://www.officialgazette.gov.ph/1938/01/24/m...,"Legislative Building, Manila","First National Assembly, Third Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
4,4,Manuel L. Quezon,"January 24, 1939",The State of the Nation and Important Economic...,http://www.officialgazette.gov.ph/1939/01/24/m...,"Legislative Building, Manila","Second National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
...,...,...,...,...,...,...,...,...
79,79,Rodrigo Roa Duterte,"July 23, 2018",Third State of the Nation Address,https://www.officialgazette.gov.ph/2018/07/23/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Third Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
80,80,Rodrigo Roa Duterte,"July 22, 2019",Fourth State of the Nation Address,https://www.officialgazette.gov.ph/2019/07/22/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, First Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
81,81,Rodrigo Roa Duterte,"July 27, 2020",Fifth State of the Nation Address,https://www.officialgazette.gov.ph/2020/07/27/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Second Session",\n\n\n\n\n\n\n5TH STATE OF THE NATION ADDRESS ...
82,82,Rodrigo Roa Duterte,"July 26, 2021",Sixth State of the Nation Address,https://www.officialgazette.gov.ph/2021/07/26/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Third Session",\n\n\tState of the Nation Address of \n\tRodri...


## Isolate 'Aquino' speeches

The merged file contains all speeches by Philippine presidents since 1935, so we keep only the rows for Benigno S. Aquino III.

In [3]:
aquino = merged[(merged['president'] == 'Benigno S. Aquino III')] #Aquino
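
As a quick sanity check, here is a minimal sketch of the same filter on a toy table. The rows are made-up placeholders, not real speeches, and `toy` and `subset` are hypothetical names. Resetting the index is an optional extra, assumed here so that row positions start again from 0; the length check confirms how many speeches were kept (for the real data we would expect six).

```python
import pandas as pd

# Toy stand-in for the merged SONA table (placeholder rows, not real data).
toy = pd.DataFrame({
    "president": ["Benigno S. Aquino III", "Rodrigo Roa Duterte", "Benigno S. Aquino III"],
    "speech": ["unang talumpati", "ibang talumpati", "ikalawang talumpati"],
})

# Same boolean filter as above, then renumber the rows 0..n-1.
subset = toy[toy["president"] == "Benigno S. Aquino III"].reset_index(drop=True)

n_speeches = len(subset)
print(n_speeches)
```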

## Text analysis

Now we can proceed with the text analysis proper. First, the cells below set the vectorizer parameters, most importantly the stopwords we want our analysis to disregard.

In [4]:
def preprocess_text(text):
    text = text.lower()  # lowercase so counting is case-insensitive
    text = re.sub(r'\d+', '', text)  # remove all digits
    return text
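
To see what the preprocessor does, here is a small sketch that runs an equivalent function on a made-up sample string (the sample is an assumption for illustration). Note that passing a custom `preprocessor` to a scikit-learn vectorizer replaces its built-in lowercasing step, which is why the function lowercases the text itself.

```python
import re

def preprocess_text(text):
    text = text.lower()               # lowercase everything
    return re.sub(r"\d+", "", text)   # strip runs of digits

# Made-up sample sentence, for illustration only.
sample = "Noong 2010, 21 Bilyong PESOS"
cleaned = preprocess_text(sample)
print(repr(cleaned))
```

Punctuation and the extra spaces left behind by removed digits are not a problem, since the vectorizer's tokenizer only picks up word characters.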

In [5]:
y_columns = ['president', 'speeches']
BINARY=False
NGRAM_RANGE=(1,1)
MIN_DF=1 #an integer min_df must be at least 1 in recent scikit-learn; 1 keeps every term
STPWORDS=stopwords.stopwords(["en", "tl"]) #loads English and Tagalog stopwords
STPWORDS.update(['yung', 'iyan', 'yan', 'diyan', 'applause', 'laughter', 'palakpakan', 'rin', 'din', 'po',
                'pong', 'pang', 'pa', 'nang', 'ng', 'pag',
                'kapag']) #adds Tagalog stopwords and transcript annotations not included in the package

vectorizer = CountVectorizer(
    stop_words=list(STPWORDS),  # recent scikit-learn expects a list here, not a set
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)

## Vectorizing

Simple counting of words that occur in a speech.

In [6]:
X = vectorizer.fit_transform(aquino['speech'])
X



<6x8117 sparse matrix of type '<class 'numpy.int64'>'
	with 14451 stored elements in Compressed Sparse Row format>
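
The summary above reads as 6 speeches by 8,117 distinct terms, with 14,451 nonzero counts, so roughly 30 percent of the cells are filled. A minimal sketch of how to read the same figures off a sparse matrix, using a made-up two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: vocabulary will be ['daan', 'na', 'tuwid'].
docs = ["tuwid na daan", "daan daan"]
X_toy = CountVectorizer().fit_transform(docs)

n_docs, n_terms = X_toy.shape                 # rows = documents, columns = terms
density = X_toy.nnz / (n_docs * n_terms)      # share of cells with a nonzero count

print(X_toy.shape, X_toy.nnz, round(density, 2))
print(X_toy.toarray())
```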

In [7]:
aquino_vectors = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
aquino_vectors

Unnamed: 0,____________________,_________________________,aabang,aabot,aabuso,aabusuhin,aabutan,aabutin,aabuting,aagrabyado,...,yumaman,yumaong,yumuko,yumuyuko,yun,yuri,zambales,zamboanga,zone,zte
0,0,0,0,4,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,2,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,3,1,1,0,2,0,1,0,...,0,0,0,1,1,0,0,0,0,0
3,0,0,0,3,0,1,0,1,0,0,...,1,0,1,0,2,0,1,0,0,0
4,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,1,1,1,6,2,0
5,3,1,2,3,3,0,0,1,0,0,...,0,1,0,0,15,0,0,2,0,1


In [8]:
aquino_vectors = aquino_vectors.transpose() #swapping columns and row positions

In [9]:
aquino_vectors.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']
aquino_vectors.sort_values('SONA1', ascending=False).head(20)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6
natin,59,78,90,145,117,135
pesos,21,9,28,40,18,0
lang,19,48,73,84,59,115
naman,18,27,39,86,52,74
nating,18,40,46,67,48,55
noong,13,13,32,53,25,66
taon,12,25,41,39,34,35
taumbayan,11,8,11,2,5,7
pondo,11,7,6,7,10,5
mas,11,17,16,33,38,29
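
The transpose-and-label step can be sketched on a toy count table. The numbers are invented, and generating the column labels from the number of speeches (rather than typing them out) is an assumption that may be convenient when reusing the notebook for a president with a different number of SONAs.

```python
import pandas as pd

# Invented counts: one row per speech, one column per term.
toy_counts = pd.DataFrame({"boss": [1, 3], "pondo": [4, 0]})

# After transposing, each row is a term and each column is a speech.
by_term = toy_counts.transpose()
by_term.columns = [f"SONA{i + 1}" for i in range(by_term.shape[1])]

top = by_term.sort_values("SONA1", ascending=False)
print(top)
```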


## Add a 'total' mention column

This step is optional: it adds a column with each term's total number of mentions across all six SONAs.

In [10]:
aquino_vectors['total'] = aquino_vectors.SONA1 + aquino_vectors.SONA2 + aquino_vectors.SONA3 + aquino_vectors.SONA4 + aquino_vectors.SONA5 + aquino_vectors.SONA6


In [11]:
aquino_vectors = aquino_vectors.sort_values('total', ascending=False)
aquino_vectors.head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6,total
natin,59,78,90,145,117,135,624
lang,19,48,73,84,59,115,398
naman,18,27,39,86,52,74,296
nating,18,40,46,67,48,55,274
nga,2,24,30,57,64,64,241
noong,13,13,32,53,25,66,202
taon,12,25,41,39,34,35,186
mas,11,17,16,33,38,29,144
tayong,9,16,14,35,26,22,122
pilipino,7,21,27,29,14,23,121
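
An alternative way to build the total, sketched on invented numbers, is a row-wise sum restricted to the SONA columns. Selecting the columns by prefix is an assumption made here so that re-running the cell does not add an existing `total` column into itself.

```python
import pandas as pd

# Invented counts: one row per term, one column per speech.
by_term = pd.DataFrame(
    {"SONA1": [1, 4], "SONA2": [3, 2]},
    index=["boss", "pondo"],
)

# Sum only the per-speech columns, never a previously computed total.
sona_cols = [c for c in by_term.columns if c.startswith("SONA")]
by_term["total"] = by_term[sona_cols].sum(axis=1)

# Running the same line again leaves the result unchanged.
by_term["total"] = by_term[sona_cols].sum(axis=1)
print(by_term.sort_values("total", ascending=False))
```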


# TF-IDF

## Aquino speeches

In [14]:
vectorizer = TfidfVectorizer(
    stop_words=list(STPWORDS),  # recent scikit-learn expects a list here, not a set
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)
X = vectorizer.fit_transform(aquino['speech'])
aquino_idf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
aquino_idf.round(2)



Unnamed: 0,____________________,_________________________,aabang,aabot,aabuso,aabusuhin,aabutan,aabutin,aabuting,aagrabyado,...,yumaman,yumaong,yumuko,yumuyuko,yun,yuri,zambales,zamboanga,zone,zte
0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02
1,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.02,0.0,0.01,0.0,0.02,0.0,0.01,0.0,...,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,...,0.01,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0
4,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.04,0.02,0.0
5,0.02,0.01,0.01,0.01,0.01,0.0,0.0,0.01,0.0,0.0,...,0.0,0.01,0.0,0.0,0.06,0.0,0.0,0.01,0.0,0.01
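
For readers wondering where these scores come from, the sketch below reproduces scikit-learn's default TF-IDF by hand on a made-up corpus: raw counts are multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit length. It assumes the default settings (`smooth_idf=True`, `norm='l2'`), which is what the cell above uses since it does not override them.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus, for illustration only.
docs = ["boss boss wangwang", "boss pondo"]

# Step 1: raw term counts (rows = documents, columns = terms).
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Step 2: smoothed inverse document frequency per term.
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Step 3: weight the counts, then scale each row to unit (L2) length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with scikit-learn's own result.
sk = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, sk))
```

With only six speeches, the IDF factor ranges from 1 (a term used in every speech) to about 2.25 (a term used in just one), so very frequent words such as `natin` can still dominate the scores.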


In [15]:
aquino_idf2 = aquino_idf.transpose()
aquino_idf2.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']

In [16]:
aquino_idf2.sort_values('SONA4', ascending=False).head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6
natin,0.513665,0.465613,0.39952,0.459509,0.473314,0.402372
naman,0.156711,0.161174,0.173125,0.272537,0.210362,0.22056
lang,0.165418,0.286531,0.324055,0.266198,0.23868,0.342761
nating,0.156711,0.238776,0.204199,0.212325,0.19418,0.163929
nga,0.017412,0.143265,0.133173,0.180635,0.258907,0.190754
noong,0.113181,0.077602,0.142052,0.167959,0.101135,0.196715
pesos,0.211013,0.062006,0.143455,0.146302,0.084042,0.0
taon,0.104474,0.149235,0.182004,0.123592,0.137544,0.104319
upang,0.043531,0.077602,0.110978,0.120423,0.0445,0.077494
tayong,0.078356,0.09551,0.062148,0.110916,0.105181,0.065572


## Looking for specific words

In this part, we look for specific words that we think made a mark during Aquino's SONAs, either because they are mentioned often or because it is unusual for the Chief Executive to say them.

We also include words that we think were said because they were topical at the time the speech was delivered.

In [20]:
aquino_slice = aquino_idf[['boss', 'wangwang', 'mahirap', 'corrupt']] # you can change this
aquino_slice.sort_index().round(decimals=2)

Unnamed: 0,boss,wangwang,mahirap,corrupt
0,0.0,0.0,0.01,0.01
1,0.01,0.14,0.02,0.0
2,0.03,0.0,0.01,0.01
3,0.02,0.01,0.01,0.0
4,0.07,0.0,0.0,0.0
5,0.06,0.0,0.01,0.02


In [21]:
aquino_slice = aquino_slice.stack().reset_index()
aquino_slice = aquino_slice.rename(columns={'level_0': 'sona_no', 'level_1': 'term', 0: 'tfidf'})
aquino_slice.head()

Unnamed: 0,sona_no,term,tfidf
0,0,boss,0.0
1,0,wangwang,0.0
2,0,mahirap,0.008706
3,0,corrupt,0.013578
4,1,boss,0.013779
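
The wide-to-long reshaping above can also be written so that the columns are named before stacking, which avoids renaming `level_0`, `level_1` and `0` afterwards. This is a sketch on invented scores, offered as one possible alternative rather than a change to the approach used here.

```python
import pandas as pd

# Invented TF-IDF scores: one row per speech, one column per term.
wide = pd.DataFrame({"boss": [0.0, 0.01], "wangwang": [0.0, 0.14]})

long = (
    wide.rename_axis(index="sona_no", columns="term")  # name both axes first
        .stack()                                       # one row per (speech, term)
        .rename("tfidf")                               # name the value column
        .reset_index()
)
print(long)
```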


In [22]:
top_tfidf = aquino_slice.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(10)
top_tfidf.head()

Unnamed: 0,sona_no,term,tfidf
3,0,corrupt,0.013578
2,0,mahirap,0.008706
0,0,boss,0.0
1,0,wangwang,0.0
5,1,wangwang,0.143354
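
The sort-then-`head` pattern keeps the top rows within each group. A minimal sketch with invented scores, keeping the top two terms per speech:

```python
import pandas as pd

# Invented long-format scores, for illustration only.
scores = pd.DataFrame({
    "sona_no": [0, 0, 0, 1, 1, 1],
    "term": ["boss", "pondo", "wangwang"] * 2,
    "tfidf": [0.1, 0.3, 0.2, 0.5, 0.0, 0.4],
})

# Sort by speech, then by score descending, and keep the first 2 rows per speech.
top2 = (
    scores.sort_values(by=["sona_no", "tfidf"], ascending=[True, False])
          .groupby("sona_no")
          .head(2)
)
print(top2)
```

Because only four terms were selected in this section, `head(10)` above simply returns all of them, ordered by score within each SONA.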


## Chart it

In [23]:
# Terms in this list will get a red dot in the visualization
term_list = ['boss', 'wangwang'] # you can change this

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["sona_no"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600, height=400)
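
The small random offset matters because the window transform's `rank()` gives tied values the same rank, so two terms with identical scores would be drawn on top of each other in one cell. The pandas sketch below illustrates the idea with invented scores: `method="min"` mimics the tied ranking, while `method="first"` breaks ties by row order, which is one deterministic alternative to random jitter (seeding NumPy's random generator would be another way to make the chart reproducible).

```python
import pandas as pd

# Invented scores with a tie at 0.0, for illustration only.
scores = pd.DataFrame({
    "sona_no": [0, 0, 0, 0],
    "term": ["corrupt", "mahirap", "boss", "wangwang"],
    "tfidf": [0.0136, 0.0087, 0.0, 0.0],
})

grouped = scores.groupby("sona_no")["tfidf"]

# Tied values share a rank, so 'boss' and 'wangwang' would overlap.
tied_rank = grouped.rank(method="min", ascending=False).astype(int)

# Ties are broken by order of appearance, so every term gets its own cell.
unique_rank = grouped.rank(method="first", ascending=False).astype(int)

print(tied_rank.tolist())
print(unique_rank.tolist())
```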

## Entire SONAs

Here, we do the same thing for the full text of every SONA, *without* isolating key words.

In [24]:
aquino_idf = aquino_idf.stack().reset_index()
aquino_idf

Unnamed: 0,level_0,level_1,0
0,0,____________________,0.000000
1,0,_________________________,0.000000
2,0,aabang,0.000000
3,0,aabot,0.034825
4,0,aabuso,0.000000
...,...,...,...
48697,5,yuri,0.000000
48698,5,zambales,0.000000
48699,5,zamboanga,0.011012
48700,5,zone,0.000000


In [25]:
aquino_idf = aquino_idf.rename(columns={'level_0': 'sona_no','level_1': 'term', 0: 'tfidf'})
aquino_idf

Unnamed: 0,sona_no,term,tfidf
0,0,____________________,0.000000
1,0,_________________________,0.000000
2,0,aabang,0.000000
3,0,aabot,0.034825
4,0,aabuso,0.000000
...,...,...,...
48697,5,yuri,0.000000
48698,5,zambales,0.000000
48699,5,zamboanga,0.011012
48700,5,zone,0.000000


In [26]:
all_aquino = aquino_idf.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(10)
all_aquino.head()

Unnamed: 0,sona_no,term,tfidf
5289,0,natin,0.513665
6226,0,pesos,0.211013
3172,0,lang,0.165418
5088,0,naman,0.156711
5291,0,nating,0.156711


In [28]:
# Terms in this list will get a red dot in the visualization
term_list = ['boss', 'wangwang']

# adding a little randomness to break ties in term ranking
all_aquino_plusRand = all_aquino.copy()
all_aquino_plusRand['tfidf'] = all_aquino_plusRand['tfidf'] + np.random.rand(all_aquino.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(all_aquino_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["sona_no"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600, height=400)