# Aquino SONAs

This notebook processes the collated State of the Nation Addresses (SONAs) of Benigno S. Aquino III. Reminder: run the [Philippines SONA](https://github.com/pmagtulis/ph-sona.git) scraper first to generate the **merged** CSV file read here.

## Do all your imports

In [1]:
import pandas as pd
import numpy as np
import re
import altair as alt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import stopwordsiso as stopwords

## Read CSV

In [2]:
merged= pd.read_csv('../csv/merged.csv')
merged

Unnamed: 0,president,date,title,link,venue,session,speech
0,Manuel L. Quezon,"November 25, 1935",Message to the First Assembly on National Defense,http://www.officialgazette.gov.ph/1935/11/25/m...,"Legislative Building, Manila","First National Assembly, First Session","Mr. Speaker, gentlemen of the National Assemb..."
1,Manuel L. Quezon,"June 16, 1936",On the Country’s Conditions and Problems,http://www.officialgazette.gov.ph/1936/06/16/m...,"Legislative Building, Manila","First National Assembly, First Session","Mr. Speaker, Gentlemen of the National Assemb..."
2,Manuel L. Quezon,"October 18, 1937","Improvement of Philippine Conditions, Philippi...",http://www.officialgazette.gov.ph/1937/10/18/m...,"Legislative Building, Manila","First National Assembly, Second Session","Mr. Speaker, Gentlemen of the National Assemb..."
3,Manuel L. Quezon,"January 24, 1938",Revision of the System of Taxation,http://www.officialgazette.gov.ph/1938/01/24/m...,"Legislative Building, Manila","First National Assembly, Third Session",Gentlemen of the National Assembly: The state...
4,Manuel L. Quezon,"January 24, 1939",The State of the Nation and Important Economic...,http://www.officialgazette.gov.ph/1939/01/24/m...,"Legislative Building, Manila","Second National Assembly, First Session",Gentlemen of the National Assembly: I take pl...
...,...,...,...,...,...,...,...
79,Rodrigo Roa Duterte,"July 23, 2018",Third State of the Nation Address,https://www.officialgazette.gov.ph/2018/07/23/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Third Session",Kindly sit down. Thank you for your courtesy....
80,Rodrigo Roa Duterte,"July 22, 2019",Fourth State of the Nation Address,https://www.officialgazette.gov.ph/2019/07/22/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, First Session",Thank you. Kindly sit down. Kumusta po kayo...
81,Rodrigo Roa Duterte,"July 27, 2020",Fifth State of the Nation Address,https://www.officialgazette.gov.ph/2020/07/27/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Second Session",Kindly… Senate President Vicente Sotto III an...
82,Rodrigo Roa Duterte,"July 26, 2021",Sixth State of the Nation Address,https://www.officialgazette.gov.ph/2021/07/26/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Third Session",Kindly sit down. By far this is the most bea...


# Initial analysis

## regex

We are now ready to run an **initial analysis** of the texts that we have. For this part, I provide some examples below using **regex**.

An important note on this method: **str.contains** only tells us *whether* a speech contains the word, so the tallies below count *the number of speeches* that contain it, *not how many times* the word was mentioned in a speech. **str.extractall** pulls out the matching passages for reading, not for counting. We will look at the word counts per speech later, in the deeper analysis.
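
To make the distinction concrete, here is a minimal sketch on made-up data (the `toy` frame and its speeches are hypothetical, not taken from the SONA corpus). It contrasts the speech tally from `str.contains` with an occurrence tally from `str.count`.

```python
import re
import pandas as pd

# Hypothetical toy data, only to illustrate the difference
toy = pd.DataFrame({
    "president": ["A", "A", "B"],
    "speech": ["The elite and the elites.", "No match here.", "Elite rule."],
})

pattern = r"\belite"

# How many speeches per president contain the pattern at least once
speeches_with_term = toy[
    toy.speech.str.contains(pattern, case=False, regex=True)
].president.value_counts()

# How many times the pattern occurs in total per president
mentions = toy.speech.str.count(pattern, flags=re.IGNORECASE).groupby(toy.president).sum()

print(speeches_with_term)
print(mentions)
```

President A has one matching speech but two mentions, which is exactly the gap the note above warns about.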

The words we ran here are based on peer-reviewed textual studies that gauge **populism.**

### 'elite'

The word "elite" is often used by populist leaders. Based on this initial analysis, three Philippine presidents (one of whom was **dictator** Ferdinand Marcos Sr.) included the word in at least one SONA.

In [3]:
merged[merged.speech.str.contains(r"\belite", case=False, regex=True)].president.value_counts()

Ferdinand E. Marcos        2
Joseph Ejercito Estrada    1
Rodrigo Roa Duterte        1
Name: president, dtype: int64

In [4]:
# # pd.set_option('display.max_colwidth', None)
# merged.speech.str.extractall(r'(.*\belite.+)', re.IGNORECASE)

### 'democracy' and 'demokrasya'

Dictator Ferdinand E. Marcos mentioned the word **"democracy"** in 10 of his SONAs, followed by Gloria Arroyo (7 of 9 SONAs). In Filipino, Benigno Aquino III mentioned **"demokrasya"** in two of his six speeches.

**Joseph Estrada**, whose term was cut short by a popular revolt in 2001, and **Rodrigo Duterte** each mentioned the word in a single SONA.

In [5]:
# merged[merged.speech.str.contains(r"(.*\bdemocracy.+)", case=False, regex=True)].president.value_counts()

In [6]:
# merged[merged.speech.str.contains(r"(.*\bdemokrasya.+)", case=False, regex=True)].president.value_counts()

In [7]:
# merged.speech.str.extractall(r'(.*\bdemocracy.+)', re.IGNORECASE).head(7)

In [8]:
# merged.speech.str.extractall(r'(.*\bdemokrasya.+)', re.IGNORECASE).head()

## Segregating by president

We create separate dataframes for a select number of presidents so that each can be run through the text analysis.
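
As an aside, the same split can be sketched with a single `groupby` instead of one filter per president. This is only an alternative illustration on made-up data; the `toy` frame and the `by_president` name are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame
toy = pd.DataFrame({
    "president": ["X", "Y", "X"],
    "speech": ["first speech", "second speech", "third speech"],
})

# One dataframe per president, keyed by the president's name
by_president = {name: frame for name, frame in toy.groupby("president")}

print(by_president["X"])
```

A dictionary like this avoids repeating the filter expression, at the cost of losing the short variable names used in the cell below.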

In [9]:
#Post-martial law
cory = merged[(merged['president'] == 'Corazon C. Aquino')] #Cory Aquino
ramos = merged[(merged['president'] == 'Fidel V. Ramos')] #Fidel Ramos
aquino = merged[(merged['president'] == 'Benigno S. Aquino III')] #Aquino
duterte = merged[(merged['president'] == 'Rodrigo Roa Duterte')] #Duterte
erap = merged[(merged['president'] == 'Joseph Ejercito Estrada')] #Erap
arroyo = merged[(merged['president'] == 'Gloria Macapagal-Arroyo')] #Arroyo
marcosjr = merged[(merged['president'] == 'Ferdinand R. Marcos Jr.')] #Marcos Jr.

marcos = merged[(merged['president'] == 'Ferdinand E. Marcos')] #Marcos Sr.

# Pre-martial law
macapagal = merged[(merged['president'] == 'Diosdado Macapagal')] #Diosdado Macapagal
garcia = merged[(merged['president'] == 'Carlos P. Garcia')] #Carlos Garcia
magsaysay = merged[(merged['president'] == 'Ramon Magsaysay')] #Ramon Magsaysay
quirino = merged[(merged['president'] == 'Elpidio Quirino')] #Elpidio Quirino

## Isolate 'Aquino' speeches

The merged file contains all SONAs delivered by Philippine presidents since 1935, so we isolate the speeches of Benigno S. Aquino III here.

In [10]:
aquino = merged[(merged['president'] == 'Benigno S. Aquino III')] #Aquino

## Text analysis

Now we can proceed with the text analysis proper. First, we set the parameters in the cells below, most importantly the stopwords we want the analysis to disregard.

In [11]:
def preprocess_text(text):
    text = text.lower() #lowercases the text
    text = re.sub(r'\d+', '', text) #removes all numbers
    return text

In [12]:
y_columns = ['president', 'speeches']
BINARY=False
NGRAM_RANGE=(1,1)
MIN_DF=1 #keeps every term, same effect as 0; newer scikit-learn releases reject an integer min_df below 1
STPWORDS=stopwords.stopwords(["en", 'tl']) #English and Tagalog stopwords
STPWORDS.update(['yung', 'iyan', 'yan', 'diyan', 'applause', 'laughter', 'palakpakan', 'rin', 'din', 'po',
                'pong', 'pang', 'pa', 'nang', 'ng', 'pag',
                'kapag', 'nga', 'naman', 'natin', 'ninyo', 'kayo',
                'nating', 'tayong', 'lang', '____________________', '_________________________']) #adds more Tagalog stopwords and transcript markers not included in the package

vectorizer = CountVectorizer(
    stop_words=list(STPWORDS), #newer scikit-learn releases expect a list here, not a set
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)

## Vectorizing

Simple counting of words that occur in a speech.

In [13]:
X = vectorizer.fit_transform(aquino['speech'])
X



<6x8092 sparse matrix of type '<class 'numpy.int64'>'
	with 14296 stored elements in Compressed Sparse Row format>

In [14]:
aquino_vectors = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
# [print(x) for x in marcosjr.speech]
aquino_vectors.round(2)

Unnamed: 0,aabang,aabot,aabuso,aabusuhin,aabutan,aabutin,aabuting,aagrabyado,aakalain,aakit,...,yumaman,yumaong,yumuko,yumuyuko,yun,yuri,zambales,zamboanga,zone,zte
0,0,4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,2,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,1,1,0,2,0,1,0,0,0,...,0,0,0,1,1,0,0,0,0,0
3,0,3,0,1,0,1,0,0,0,0,...,1,0,1,0,2,0,1,0,0,0
4,1,1,0,0,0,0,0,0,1,1,...,0,0,0,0,1,1,1,6,2,0
5,2,3,3,0,0,1,0,0,0,0,...,0,1,0,0,15,0,0,2,0,1


In [15]:
aquino_vectors = aquino_vectors.transpose() #swapping columns and row positions

In [16]:
aquino_vectors.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']
aquino_vectors.sort_values('SONA1', ascending=False).head(20)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6
pesos,21,9,28,40,18,0
noong,13,13,32,53,25,66
taon,12,25,41,39,34,35
taumbayan,11,8,11,2,5,7
pondo,11,7,6,7,10,5
mas,11,17,16,33,38,29
nangyari,9,1,1,6,2,4
buwan,9,7,10,3,5,6
porsyento,9,2,2,4,3,0
batas,9,5,9,7,6,17


## Add a 'total' mention column

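Totally optional, just in case you want the total number of mentions. Note that the cell below adds up only SONA1 to SONA3, so `total` covers the first three speeches rather than all six.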
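
If a total over every speech is wanted instead, one option is to sum across all SONA columns in a single call. The sketch below is an assumption about how that could look, using only the `taon` and `pesos` rows copied from the count table above; `toy_counts` and `sona_cols` are illustrative names.

```python
import pandas as pd

# Two rows copied from the count table above
toy_counts = pd.DataFrame(
    {
        "SONA1": [12, 21], "SONA2": [25, 9], "SONA3": [41, 28],
        "SONA4": [39, 40], "SONA5": [34, 18], "SONA6": [35, 0],
    },
    index=["taon", "pesos"],
)

# Sum across all six SONA columns, row by row
sona_cols = ["SONA1", "SONA2", "SONA3", "SONA4", "SONA5", "SONA6"]
toy_counts["total"] = toy_counts[sona_cols].sum(axis=1)

print(toy_counts.sort_values("total", ascending=False))
```

Naming the columns explicitly keeps a previously added `total` column from being summed into itself if the cell is run twice.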

In [17]:
aquino_vectors['total'] = aquino_vectors.SONA1 + aquino_vectors.SONA2 + aquino_vectors.SONA3


In [18]:
aquino_vectors = aquino_vectors.sort_values('total', ascending=False)
aquino_vectors.head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6,total
taon,12,25,41,39,34,35,78
pesos,21,9,28,40,18,0,58
noong,13,13,32,53,25,66,58
pilipino,7,21,27,29,14,23,55
mas,11,17,16,33,38,29,44
upang,5,13,25,38,11,26,43
taumbayan,11,8,11,2,5,7,30
bansa,2,11,16,24,22,28,29
wala,3,17,9,24,19,18,29
dating,4,15,10,8,5,11,29


# TF-IDF

## Aquino speeches
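
For readers who want to see what the TF-IDF numbers mean, here is a minimal sketch that reproduces scikit-learn's default weighting by hand on a made-up two-document corpus. It assumes the default `TfidfVectorizer` settings (smoothed IDF and L2 normalisation); the notebook's own vectorizer adds stopwords and a preprocessor on top of that.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus
docs = ["pondo pondo bayan", "bayan batas"]

# Raw term counts, one row per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit length (L2 norm)
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

auto = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, auto))
```

Because each row is normalised, scores are comparable within a speech but depend on how many distinct terms that speech contains.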

In [19]:
vectorizer = TfidfVectorizer(
    stop_words=list(STPWORDS), #newer scikit-learn releases expect a list here, not a set
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)
X = vectorizer.fit_transform(aquino['speech'])
aquino_idf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
#[print(x) for x in speeches.sentence]
aquino_idf.round(2)



Unnamed: 0,aabang,aabot,aabuso,aabusuhin,aabutan,aabutin,aabuting,aagrabyado,aakalain,aakit,...,yumaman,yumaong,yumuko,yumuyuko,yun,yuri,zambales,zamboanga,zone,zte
0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02
1,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.03,0.01,0.01,0.0,0.03,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0
3,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,...,0.01,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0
4,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,...,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.06,0.02,0.0
5,0.01,0.01,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.01,0.0,0.0,0.08,0.0,0.0,0.01,0.0,0.01


In [20]:
aquino_idf2 = aquino_idf.transpose()
aquino_idf2.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']

In [21]:
aquino_idf2.sort_values('SONA1', ascending=False).head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6
pesos,0.26195,0.081545,0.182945,0.199358,0.113802,0.0
noong,0.140501,0.102055,0.181155,0.228869,0.136947,0.255187
taon,0.129693,0.196259,0.232105,0.168413,0.186248,0.135326
natuklasan,0.121737,0.0,0.0,0.0,0.0,0.0
taumbayan,0.118885,0.062803,0.062272,0.008637,0.027389,0.027065
pondo,0.118885,0.054953,0.033967,0.030228,0.054779,0.019332
mas,0.118885,0.133456,0.090578,0.142503,0.20816,0.112127
porsyento,0.112264,0.018121,0.013068,0.019936,0.018967,0.0
buwan,0.09727,0.054953,0.056611,0.012955,0.027389,0.023199
nangyari,0.09727,0.00785,0.005661,0.02591,0.010956,0.015466


## Looking for specific words

In this part, we look for specific words that we think made a mark during Aquino's SONAs, whether because they were mentioned often or because it is unusual for the Chief Executive to say them.

We also include here words that we think were said because they were the topic at hand at the time the speech was delivered.
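
One practical caveat when changing the word list: selecting a column for a word that never appears in the speeches raises a `KeyError`. A hedged workaround, sketched here on a made-up frame, is `reindex`, which fills missing terms with zeros. The names `toy_idf`, `wanted` and `not_in_vocab` are hypothetical.

```python
import pandas as pd

# Hypothetical TF-IDF table with two known terms
toy_idf = pd.DataFrame({"boss": [0.0, 0.02], "wangwang": [0.0, 0.19]})

# A word list that includes a term absent from the vocabulary
wanted = ["boss", "wangwang", "not_in_vocab"]

# reindex keeps the requested order and fills absent terms with 0.0
toy_slice = toy_idf.reindex(columns=wanted, fill_value=0.0)

print(toy_slice)
```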

In [22]:
aquino_slice = aquino_idf[['boss', 'wangwang', 'mahirap', 'corrupt']] # you can change this
aquino_slice.sort_index().round(decimals=2)

Unnamed: 0,boss,wangwang,mahirap,corrupt
0,0.0,0.0,0.01,0.02
1,0.02,0.19,0.03,0.0
2,0.04,0.0,0.01,0.01
3,0.02,0.02,0.01,0.0
4,0.09,0.0,0.01,0.0
5,0.07,0.0,0.02,0.02


In [23]:
aquino_slice = aquino_slice.stack().reset_index()
aquino_slice = aquino_slice.rename(columns={'level_0': 'sona_no', 'level_1': 'term', 0: 'tfidf'})
aquino_slice.head()

Unnamed: 0,sona_no,term,tfidf
0,0,boss,0.0
1,0,wangwang,0.0
2,0,mahirap,0.010808
3,0,corrupt,0.016856
4,1,boss,0.018121


In [24]:
top_tfidf = aquino_slice.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(10)
top_tfidf.head()

Unnamed: 0,sona_no,term,tfidf
3,0,corrupt,0.016856
2,0,mahirap,0.010808
0,0,boss,0.0
1,0,wangwang,0.0
5,1,wangwang,0.188526


## Chart it

In [25]:
# # Terms in this list will get a red dot in the visualization
term_list = ['boss', 'wangwang'] # you can change this

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["sona_no"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600, height=400)

## Entire SONAs

Here, we do the same thing for the entire SONAs, *without* isolating key words.
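
The core of this section is a "top terms per speech" selection on a long table. A compact sketch of that step, on made-up rows with the same `sona_no`, `term` and `tfidf` columns, could look like this (`long_toy` and `top2` are illustrative names, and the scores are invented):

```python
import pandas as pd

# Hypothetical long-format TF-IDF table
long_toy = pd.DataFrame({
    "sona_no": [0, 0, 0, 1, 1, 1],
    "term": ["pesos", "noong", "fidel", "wangwang", "taon", "zte"],
    "tfidf": [0.26, 0.14, 0.01, 0.19, 0.20, 0.0],
})

# Drop zero scores, rank within each speech, keep the two highest
top2 = (
    long_toy[long_toy.tfidf > 0]
    .sort_values(["sona_no", "tfidf"], ascending=[True, False])
    .groupby("sona_no")
    .head(2)
)

print(top2)
```

Sorting by `sona_no` first keeps each speech's terms together, which makes the result easier to read than a single global sort.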

In [26]:
aquino_idf = aquino_idf.stack().reset_index()
aquino_idf

Unnamed: 0,level_0,level_1,0
0,0,aabang,0.000000
1,0,aabot,0.043231
2,0,aabuso,0.000000
3,0,aabusuhin,0.000000
4,0,aabutan,0.000000
...,...,...,...
48547,5,yuri,0.000000
48548,5,zambales,0.000000
48549,5,zamboanga,0.014285
48550,5,zone,0.000000


In [27]:
aquino_idf = aquino_idf.rename(columns={'level_0': 'sona_no','level_1': 'term', 0: 'tfidf'})
aquino_idf

Unnamed: 0,sona_no,term,tfidf
0,0,aabang,0.000000
1,0,aabot,0.043231
2,0,aabuso,0.000000
3,0,aabusuhin,0.000000
4,0,aabutan,0.000000
...,...,...,...
48547,5,yuri,0.000000
48548,5,zambales,0.000000
48549,5,zamboanga,0.014285
48550,5,zone,0.000000


In [28]:
aquino_zero = aquino_idf[aquino_idf.sona_no==0]
aquino_zero

Unnamed: 0,sona_no,term,tfidf
0,0,aabang,0.000000
1,0,aabot,0.043231
2,0,aabuso,0.000000
3,0,aabusuhin,0.000000
4,0,aabutan,0.000000
...,...,...,...
8087,0,yuri,0.000000
8088,0,zambales,0.000000
8089,0,zamboanga,0.000000
8090,0,zone,0.000000


In [29]:
aquino_zero = aquino_zero.assign(tfidf=aquino_zero.tfidf.round(2)) #assign returns a new dataframe, which avoids the SettingWithCopyWarning
aquino_zero


Unnamed: 0,sona_no,term,tfidf
0,0,aabang,0.00
1,0,aabot,0.04
2,0,aabuso,0.00
3,0,aabusuhin,0.00
4,0,aabutan,0.00
...,...,...,...
8087,0,yuri,0.00
8088,0,zambales,0.00
8089,0,zamboanga,0.00
8090,0,zone,0.00


In [30]:
aquino_zero = aquino_zero[aquino_zero.tfidf>0.00]
aquino_zero

Unnamed: 0,sona_no,term,tfidf
1,0,aabot,0.04
10,0,aaklas,0.02
36,0,aasenso,0.02
81,0,additional,0.05
83,0,adhikain,0.03
...,...,...,...
8044,0,welfare,0.02
8048,0,whistleblower,0.02
8055,0,witness,0.05
8059,0,workers,0.02


In [31]:
aquino_zero = aquino_zero.sort_values('tfidf', ascending=False)
aquino_zero

Unnamed: 0,sona_no,term,tfidf
6209,0,pesos,0.26
5403,0,noong,0.14
7470,0,taon,0.13
6464,0,pondo,0.12
5290,0,natuklasan,0.12
...,...,...,...
6570,0,projects,0.01
6564,0,programang,0.01
6562,0,program,0.01
1522,0,fidel,0.01


In [32]:
# aquino_idf.to_csv('aquino_idf_complete.csv',index=False)
aquino_idf.tfidf = aquino_idf.tfidf.round(2)
aquino_idf.head()

Unnamed: 0,sona_no,term,tfidf
0,0,aabang,0.0
1,0,aabot,0.04
2,0,aabuso,0.0
3,0,aabusuhin,0.0
4,0,aabutan,0.0


In [33]:
aquino_idf = aquino_idf[aquino_idf.tfidf>0.00]
aquino_idf.head()

Unnamed: 0,sona_no,term,tfidf
1,0,aabot,0.04
10,0,aaklas,0.02
36,0,aasenso,0.02
81,0,additional,0.05
83,0,adhikain,0.03


In [34]:
# all_aquino = aquino_idf.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(50)
all_aquino = aquino_idf.sort_values(by=['tfidf', 'term'], ascending=[False,True]).groupby(['sona_no']).head(10)
all_aquino.head()

Unnamed: 0,sona_no,term,tfidf
45863,5,noong,0.26
6209,0,pesos,0.26
29679,3,noong,0.23
23654,2,taon,0.23
36545,4,mas,0.21


In [35]:
# term_list is carried over from the earlier chart; this chart draws no red dots
term_list = ['boss', 'wangwang']

# adding a little randomness to the tfidf values, carried over from the earlier chart
all_aquino_plusRand = all_aquino.copy()
all_aquino_plusRand['tfidf'] = all_aquino_plusRand['tfidf'] + np.random.rand(all_aquino.shape[0])*0.0001

# base for all visualizations; rank here follows the alphabetical order of terms within each SONA
base = alt.Chart(all_aquino_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("term", order="ascending")],
    groupby = ["sona_no"],
)

heatmap = base.mark_rect().encode(
    color=alt.Color('tfidf:Q',
        scale=alt.Scale(scheme='blueorange'),
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.15, alt.value('white'), alt.value('black'))
)

# display the two superimposed visualizations
(heatmap + text).properties(width = 720, height=400)

In [36]:
# # # Terms in this list will get a red dot in the visualization
# term_list = ['boss', 'wangwang']

# # adding a little randomness to break ties in term ranking
# aquino_zero_plusRand = aquino_zero.copy()
# aquino_zero_plusRand['tfidf'] = aquino_zero_plusRand['tfidf'] + np.random.rand(aquino_zero.shape[0])*0.0001

# # base for all visualizations, with rank calculation
# base = alt.Chart(aquino_zero_plusRand).encode(
#     x = 'rank:O',
#     y = 'sona_no:N'
# ).transform_window(
#     rank = "rank()",
#     sort = [alt.SortField("tfidf", order="descending")],
# )

# # heatmap specification
# heatmap = base.mark_rect().encode(
#     color = 'tfidf:Q'
# )

# # red circle over terms in above list
# circle = base.mark_circle(size=100).encode(
#     color = alt.condition(
#         alt.FieldOneOfPredicate(field='term', oneOf=term_list),
#         alt.value('red'),
#         alt.value('#FFFFFF00')        
#     )
# )

# # text labels, white for darker heatmap colors
# text = base.mark_text(baseline='middle').encode(
#     text = 'term:N',
#     color = alt.condition(alt.datum.tfidf >= 0.25, alt.value('white'), alt.value('black'))
# )

# # display the three superimposed visualizations
# (heatmap + circle + text).properties(width = 720, height=450)