# Arroyo SONAs

This notebook processes the collated State of the Nation Addresses (SONAs) of Gloria Macapagal-Arroyo. Before running it, run the [Philippines SONA](https://github.com/pmagtulis/ph-sona.git) scraper to generate the **merged** CSV file that is read below.

## Do all your imports

In [1]:
import pandas as pd
import numpy as np
import re
import altair as alt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import stopwordsiso as stopwords

## Read CSV

In [2]:
merged = pd.read_csv('../csv/merged.csv')
merged

Unnamed: 0,president,date,title,link,venue,session,speech
0,Manuel L. Quezon,"November 25, 1935",Message to the First Assembly on National Defense,http://www.officialgazette.gov.ph/1935/11/25/m...,"Legislative Building, Manila","First National Assembly, First Session","Mr. Speaker, gentlemen of the National Assemb..."
1,Manuel L. Quezon,"June 16, 1936",On the Country’s Conditions and Problems,http://www.officialgazette.gov.ph/1936/06/16/m...,"Legislative Building, Manila","First National Assembly, First Session","Mr. Speaker, Gentlemen of the National Assemb..."
2,Manuel L. Quezon,"October 18, 1937","Improvement of Philippine Conditions, Philippi...",http://www.officialgazette.gov.ph/1937/10/18/m...,"Legislative Building, Manila","First National Assembly, Second Session","Mr. Speaker, Gentlemen of the National Assemb..."
3,Manuel L. Quezon,"January 24, 1938",Revision of the System of Taxation,http://www.officialgazette.gov.ph/1938/01/24/m...,"Legislative Building, Manila","First National Assembly, Third Session",Gentlemen of the National Assembly: The state...
4,Manuel L. Quezon,"January 24, 1939",The State of the Nation and Important Economic...,http://www.officialgazette.gov.ph/1939/01/24/m...,"Legislative Building, Manila","Second National Assembly, First Session",Gentlemen of the National Assembly: I take pl...
...,...,...,...,...,...,...,...
79,Rodrigo Roa Duterte,"July 23, 2018",Third State of the Nation Address,https://www.officialgazette.gov.ph/2018/07/23/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Third Session",Kindly sit down. Thank you for your courtesy....
80,Rodrigo Roa Duterte,"July 22, 2019",Fourth State of the Nation Address,https://www.officialgazette.gov.ph/2019/07/22/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, First Session",Thank you. Kindly sit down. Kumusta po kayo...
81,Rodrigo Roa Duterte,"July 27, 2020",Fifth State of the Nation Address,https://www.officialgazette.gov.ph/2020/07/27/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Second Session",Kindly… Senate President Vicente Sotto III an...
82,Rodrigo Roa Duterte,"July 26, 2021",Sixth State of the Nation Address,https://www.officialgazette.gov.ph/2021/07/26/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Third Session",Kindly sit down. By far this is the most bea...


# Initial analysis

## regex

We are now ready to run an **initial analysis** of the texts that we have. For this part, I provide some examples below using **regex**.

An important note on this method: as used here, **str.contains** and **str.extractall** **ONLY** tell us *which speeches* contain the word (and so how many speeches per president), *not how many times* the word was mentioned in each speech. We look at word counts within speeches later, in the deeper analysis.
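To make the distinction concrete, here is a minimal sketch on a made-up dataframe (the `toy` speeches and presidents "A" and "B" are hypothetical, not from the SONA data). It contrasts `str.contains`, which flags each speech at most once, with `str.count`, which counts every match inside a speech.

```python
import re
import pandas as pd

# Toy speeches, invented for illustration only
toy = pd.DataFrame({
    "president": ["A", "A", "B"],
    "speech": [
        "The elite and the elites resist reform.",
        "No mention here.",
        "Elite capture is a problem.",
    ],
})

pattern = r"\belite"

# str.contains gives True/False per speech, so a speech counts at most once
has_word = toy.speech.str.contains(pattern, case=False, regex=True)
speeches_with_word = toy[has_word].president.value_counts()

# str.count gives the number of matches inside each speech
mentions = toy.speech.str.count(pattern, flags=re.IGNORECASE)
mentions_per_president = mentions.groupby(toy.president).sum()

print(speeches_with_word)       # speeches containing the word, per president
print(mentions_per_president)   # total mentions, per president
```

President "A" has one speech containing the word but two mentions in it, which is exactly the gap the note above warns about.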

The words searched here are drawn from peer-reviewed textual studies that gauge **populism.**

### 'elite'

The word "elite" has been found to be used often by populist leaders. Based on this initial analysis, three Philippine presidents (one of whom was **dictator** Ferdinand Marcos Sr.) included the word in their SONAs.

In [3]:
merged[merged.speech.str.contains(r"\belite", case=False, regex=True)].president.value_counts()

Ferdinand E. Marcos        2
Joseph Ejercito Estrada    1
Rodrigo Roa Duterte        1
Name: president, dtype: int64

In [4]:
# pd.set_option('display.max_colwidth', None)
# merged.speech.str.extractall(r'(.*\belite.+)', re.IGNORECASE)

### 'democracy' and 'demokrasya'

Dictator Ferdinand E. Marcos mentioned the word **"democracy"** in 10 of his SONAs, followed by Gloria Arroyo (7 of 9 SONAs). In Filipino, Benigno Aquino III mentioned **"demokrasya"** in two of his six speeches.

**Joseph Estrada**, whose term was cut short by a popular revolt in 2001, and **Rodrigo Duterte** each mentioned the word in a single SONA.

In [5]:
# merged[merged.speech.str.contains(r"(.*\bdemocracy.+)", case=False, regex=True)].president.value_counts()

In [6]:
# merged[merged.speech.str.contains(r"(.*\bdemokrasya.+)", case=False, regex=True)].president.value_counts()

In [7]:
# merged.speech.str.extractall(r'(.*\bdemocracy.+)', re.IGNORECASE).head(7)

In [8]:
# merged.speech.str.extractall(r'(.*\bdemokrasya.+)', re.IGNORECASE).head()

## Segregating by president

We create separate dataframes for a select number of presidents so that each can be run through the text analysis.

In [9]:
#Post-martial law
cory = merged[(merged['president'] == 'Corazon C. Aquino')] #Cory Aquino
ramos = merged[(merged['president'] == 'Fidel V. Ramos')] #Fidel Ramos
aquino = merged[(merged['president'] == 'Benigno S. Aquino III')] #Aquino
duterte = merged[(merged['president'] == 'Rodrigo Roa Duterte')] #Duterte
erap = merged[(merged['president'] == 'Joseph Ejercito Estrada')] #Erap
arroyo = merged[(merged['president'] == 'Gloria Macapagal-Arroyo')] #Arroyo
marcosjr = merged[(merged['president'] == 'Ferdinand R. Marcos Jr.')] #Marcos Jr.

marcos = merged[(merged['president'] == 'Ferdinand E. Marcos')] #Marcos Sr.

# Pre-martial law
macapagal = merged[(merged['president'] == 'Diosdado Macapagal')] #Diosdado Macapagal
garcia = merged[(merged['president'] == 'Carlos P. Garcia')] #Carlos Garcia
magsaysay = merged[(merged['president'] == 'Ramon Magsaysay')] #Ramon Magsaysay
quirino = merged[(merged['president'] == 'Elpidio Quirino')] #Elpidio Quirino

## Isolate 'Arroyo' speeches

The merged file contains all speeches by Philippine presidents since 1935, so we keep only Arroyo's here.

In [10]:
arroyo = merged[(merged['president'] == 'Gloria Macapagal-Arroyo')] #Arroyo

## Text analysis

Now we can proceed with the text analysis proper. First, we set the parameters in the cells immediately below, most importantly the stopwords we want our analysis to disregard.

In [11]:
def preprocess_text(text):
    text = text.lower() #lowercases the speech
    text = re.sub(r'\d+', '', text) #removes all numbers
    return text
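A quick check of what this preprocessor does to a sentence. The `sample` string is made up for illustration.

```python
import re

def preprocess_text(text):
    text = text.lower()               # lowercase everything
    text = re.sub(r'\d+', '', text)   # drop every run of digits
    return text

sample = "In 2001, GDP grew by 3 percent."
cleaned = preprocess_text(sample)
print(repr(cleaned))  # → 'in , gdp grew by  percent.'
```

The lowercasing matters: per the scikit-learn documentation, passing a custom `preprocessor` overrides the vectorizer's built-in preprocessing stage (lowercasing and accent stripping), so the function has to lowercase the text itself. Punctuation and extra spaces are left for the tokenizer to handle.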

In [12]:
y_columns = ['president', 'speeches']
BINARY=False
NGRAM_RANGE=(1,1)
MIN_DF=1 #a term must appear in at least 1 speech (same result as 0, which recent scikit-learn releases reject)
STPWORDS=stopwords.stopwords(["en", 'tl']) #loads English and Tagalog stopwords
STPWORDS.update(['yung', 'iyan', 'yan', 'diyan', 'applause', 'laughter', 'palakpakan', 'rin', 'din', 'po',
                'pong', 'pang', 'pa', 'nang', 'ng', 'pag',
                'kapag', 'nga', 'upang','naman', 'natin', 'kayo',
                'nating', 'tayong', 'lang', 'jayson', 'jomar', 'erwin']) #adds more Tagalog stopwords, transcript cues and names not included in the package

vectorizer = CountVectorizer(
    stop_words=list(STPWORDS), #recent scikit-learn releases expect a list, not a set
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)

## Vectorizing

This is a simple count of how many times each word occurs in each speech.

In [13]:
X = vectorizer.fit_transform(arroyo['speech'])
X



<9x5896 sparse matrix of type '<class 'numpy.int64'>'
	with 10315 stored elements in Compressed Sparse Row format>

In [14]:
arroyo_vectors = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
# [print(x) for x in marcosjr.speech]
arroyo_vectors.round(2)

Unnamed: 0,aabala,aalala,aalisin,aaral,aasikaso,aatras,aatubiling,abandon,abandoned,abated,...,yo,york,youth,yugto,yumayabong,yun,zambales,zamboanga,zone,zubiri
0,0,0,0,2,1,0,0,0,1,0,...,0,0,3,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
2,0,0,0,1,0,0,0,1,0,0,...,1,0,1,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
5,0,0,0,0,0,1,0,0,0,0,...,0,0,1,2,0,0,0,2,2,2
6,0,0,0,2,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,1,1
7,1,4,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,4,0,0
8,0,0,0,0,0,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [15]:
arroyo_vectors = arroyo_vectors.transpose() #swapping columns and row positions

In [16]:
arroyo_vectors.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6', 'SONA7', 'SONA8', 'SONA9']
arroyo_vectors.sort_values('SONA3', ascending=False).head(20)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6,SONA7,SONA8,SONA9
peace,5,4,11,0,4,8,4,3,4
war,3,18,9,3,1,1,0,1,1
congress,12,5,9,13,5,3,14,6,8
percent,5,0,8,0,1,0,7,0,0
giyera,0,0,7,0,0,0,0,0,0
noong,1,6,7,0,0,2,4,5,7
drug,2,11,7,1,1,0,0,0,2
barangay,5,3,6,1,1,2,3,1,3
government,8,8,6,18,12,9,9,10,7
president,7,6,6,9,5,5,5,6,12


## Add a 'total' mention column

This step is optional: it adds up the mentions of each word across all nine SONAs.

In [17]:
arroyo_vectors['total'] = arroyo_vectors.filter(like='SONA').sum(axis=1) #adds up the nine SONA columns for each word


In [18]:
arroyo_vectors = arroyo_vectors.sort_values('total', ascending=False)
arroyo_vectors.head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6,SONA7,SONA8,SONA9,total
government,8,8,6,18,12,9,9,10,7,87
people,9,8,2,11,9,12,9,11,12,83
congress,12,5,9,13,5,3,14,6,8,75
country,9,7,4,7,9,11,1,7,12,67
president,7,6,6,9,5,5,5,6,12,61
nation,8,10,6,4,3,9,2,8,5,55
power,7,11,4,2,2,8,8,3,4,49
national,16,7,2,3,4,3,6,4,2,47
strong,0,20,4,3,4,1,4,2,7,45
mindanao,5,0,3,0,2,9,14,7,4,44


# TF-IDF

## Arroyo speeches

In [19]:
vectorizer = TfidfVectorizer(
    stop_words=list(STPWORDS), #recent scikit-learn releases expect a list, not a set
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)
X = vectorizer.fit_transform(arroyo['speech'])
arroyo_idf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
#[print(x) for x in speeches.sentence]
arroyo_idf.round(2)



Unnamed: 0,aabala,aalala,aalisin,aaral,aasikaso,aatras,aatubiling,abandon,abandoned,abated,...,yo,york,youth,yugto,yumayabong,yun,zambales,zamboanga,zone,zubiri
0,0.0,0.0,0.0,0.03,0.02,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0
2,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.03,0.0,0.0,...,0.03,0.0,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.04,0.0,0.0,0.0,0.03,0.03,0.03
6,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.02,0.02
7,0.02,0.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.07,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
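To show where these decimals come from, here is a sketch that rebuilds TF-IDF by hand on three made-up sentences and compares it with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), which the cell above does not override: each raw count is multiplied by `ln((1 + n) / (1 + df)) + 1`, and each speech's row is then scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three invented "speeches"
toy_speeches = [
    "peace peace congress",
    "congress road",
    "congress airport airport",
]

# Raw term counts: rows are speeches, columns are terms
counts = CountVectorizer().fit_transform(toy_speeches).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of speeches containing each term

# Smoothed inverse document frequency
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit (L2) length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# The library result, for comparison
sk = TfidfVectorizer().fit_transform(toy_speeches).toarray()
print(np.allclose(manual, sk))
```

A word that appears in every speech ("congress" here) gets the lowest weight per mention, while a word concentrated in one speech ("airport") scores high in that speech. That is why TF-IDF surfaces what is distinctive about each SONA rather than what is merely frequent.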


In [20]:
arroyo_idf2 = arroyo_idf.transpose()
arroyo_idf2.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6', 'SONA7', 'SONA8', 'SONA9']

In [21]:
arroyo_idf2.sort_values('SONA7', ascending=False).head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6,SONA7,SONA8,SONA9
airport,0.0,0.0,0.0,0.0,0.0,0.163861,0.245815,0.015229,0.0
road,0.0,0.011449,0.015877,0.0,0.0,0.12919,0.161503,0.0,0.01349
mindanao,0.042273,0.0,0.038561,0.0,0.041179,0.085574,0.122034,0.068042,0.043684
park,0.0,0.0,0.0,0.0,0.0,0.0,0.111577,0.0,0.0
plant,0.0,0.0,0.0,0.0,0.0,0.0,0.109946,0.03503,0.0
mayor,0.0,0.0,0.0,0.0,0.0,0.014896,0.109251,0.015229,0.0
construction,0.026492,0.0,0.0,0.0,0.032257,0.0,0.109251,0.0,0.0
governor,0.0,0.0,0.0,0.0,0.0,0.044689,0.109251,0.0,0.01711
agribusiness,0.011703,0.0,0.0,0.020527,0.0,0.052647,0.108596,0.0,0.0
congress,0.082947,0.037888,0.094578,0.157609,0.084166,0.023321,0.099771,0.047681,0.071429


## Looking for specific words

In this part, we look for specific words that we think made a mark during the Arroyo SONAs, whether because they were often mentioned or because it is unusual for the Chief Executive to say them.

We also include here words that we think were said because they were the topic at hand at the time the speech was delivered.

In [22]:
arroyo_slice = arroyo_idf[['mahirap', 'government']] # you can change this
arroyo_slice.sort_index().round(decimals=2)

Unnamed: 0,mahirap,government
0,0.0,0.06
1,0.0,0.06
2,0.02,0.06
3,0.02,0.22
4,0.0,0.2
5,0.0,0.07
6,0.0,0.06
7,0.0,0.08
8,0.02,0.06


In [23]:
arroyo_slice = arroyo_slice.stack().reset_index()
arroyo_slice = arroyo_slice.rename(columns={'level_0': 'sona_no', 'level_1': 'term', 0: 'tfidf'})
arroyo_slice.head()

Unnamed: 0,sona_no,term,tfidf
0,0,mahirap,0.0
1,0,government,0.055298
2,1,mahirap,0.0
3,1,government,0.060621
4,2,mahirap,0.020138
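The reshaping in this cell turns a wide table (one column per term) into a long one (one row per speech-term pair), which is the layout the chart needs. A minimal sketch with made-up values; the `wide` and `long` names are hypothetical.

```python
import pandas as pd

# Wide layout: one row per SONA, one column per term (invented scores)
wide = pd.DataFrame({
    "mahirap": [0.0, 0.02],
    "government": [0.06, 0.22],
})

# stack() moves the column labels into the index;
# reset_index() turns both index levels back into ordinary columns
long = wide.stack().reset_index()

# The default names are level_0, level_1 and 0, so we rename them
long = long.rename(columns={"level_0": "sona_no", "level_1": "term", 0: "tfidf"})
print(long)
```

Two SONAs times two terms gives four rows, just as nine SONAs times 5896 terms gives the 53064 rows in the full table later on.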


In [24]:
top_tfidf = arroyo_slice.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(10)
top_tfidf.head()

Unnamed: 0,sona_no,term,tfidf
1,0,government,0.055298
0,0,mahirap,0.0
3,1,government,0.060621
2,1,mahirap,0.0
5,2,government,0.063052
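The sort-then-`groupby().head()` pattern used here keeps the highest-scoring terms within each SONA. A small sketch with invented scores, keeping the top 2 instead of the top 10 (the `top2` name is hypothetical):

```python
import pandas as pd

# Long layout with made-up scores for two SONAs
long = pd.DataFrame({
    "sona_no": [0, 0, 0, 1, 1, 1],
    "term":    ["law", "peace", "road", "law", "peace", "road"],
    "tfidf":   [0.30, 0.10, 0.20, 0.05, 0.40, 0.15],
})

# Sort by SONA, then by score (highest first), and keep 2 rows per SONA
top2 = (
    long.sort_values(by=["sona_no", "tfidf"], ascending=[True, False])
        .groupby("sona_no")
        .head(2)
)
print(top2)
```

In the slice above there are only two terms per SONA, so `head(10)` keeps everything. The cutoff only starts to matter when the same pattern is applied to the full vocabulary.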


## Chart it

In [25]:
# Terms in this list will get a red dot in the visualization
term_list = ['mahirap', 'wangwang'] # you can change this; terms that are not in the charted data get no dot

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["sona_no"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600, height=400)
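The chart adds a tiny random amount to each score before ranking. A pandas sketch of why: when two terms have the same score, they can receive the same rank and land on top of each other in the same heatmap cell. Here `rank(method="min")` stands in for the chart's `rank()` window operation, which is my reading of how Vega-Lite treats ties, so take it as an assumption. The scores are made up.

```python
import numpy as np
import pandas as pd

# Two terms tied at 0.0 within the same SONA (invented values)
scores = pd.DataFrame({
    "sona_no": [0, 0, 0],
    "term": ["aaral", "law", "peace"],
    "tfidf": [0.0, 0.0, 0.05],
})

# Without jitter, the tied terms share a rank
tied_rank = scores.groupby("sona_no")["tfidf"].rank(method="min", ascending=False)

# With a small random nudge, every term gets its own rank
rng = np.random.default_rng(0)  # seeded so the example is repeatable
jittered = scores.copy()
jittered["tfidf"] = jittered["tfidf"] + rng.random(len(jittered)) * 0.0001
unique_rank = jittered.groupby("sona_no")["tfidf"].rank(method="min", ascending=False)

print(tied_rank.tolist())
print(unique_rank.tolist())
```

The nudge is at most 0.0001, so it only reorders terms that were tied or nearly tied; the colors and the overall ordering in the chart are effectively unchanged.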

## Entire SONAs

Here, we do the same thing for the entire SONAs, *without* isolating key words.

In [26]:
arroyo_idf = arroyo_idf.stack().reset_index()
arroyo_idf

Unnamed: 0,level_0,level_1,0
0,0,aabala,0.000000
1,0,aalala,0.000000
2,0,aalisin,0.000000
3,0,aaral,0.026492
4,0,aasikaso,0.018037
...,...,...,...
53059,8,yun,0.000000
53060,8,zambales,0.000000
53061,8,zamboanga,0.000000
53062,8,zone,0.000000


In [27]:
arroyo_idf = arroyo_idf.rename(columns={'level_0': 'sona_no','level_1': 'term', 0: 'tfidf'})
arroyo_idf

Unnamed: 0,sona_no,term,tfidf
0,0,aabala,0.000000
1,0,aalala,0.000000
2,0,aalisin,0.000000
3,0,aaral,0.026492
4,0,aasikaso,0.018037
...,...,...,...
53059,8,yun,0.000000
53060,8,zambales,0.000000
53061,8,zamboanga,0.000000
53062,8,zone,0.000000


In [28]:
all_arroyo = arroyo_idf.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(10)
all_arroyo.head()

Unnamed: 0,sona_no,term,tfidf
2811,0,law,0.114608
3615,0,national,0.110596
3675,0,ninyo,0.105331
5509,0,trabaho,0.099327
5317,0,tahanan,0.093628


In [29]:
# Terms in this list will get a red dot in the visualization
term_list = ['boss', 'wangwang'] # you can change this; terms that are not in the charted data get no dot

# adding a little randomness to break ties in term ranking
all_arroyo_plusRand = all_arroyo.copy()
all_arroyo_plusRand['tfidf'] = all_arroyo_plusRand['tfidf'] + np.random.rand(all_arroyo.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(all_arroyo_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["sona_no"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600, height=400)