# Duterte SONAs

This notebook processes all of Duterte's collated State of the Nation Addresses (SONAs). Run the [Philippines SONA](https://github.com/pmagtulis/ph-sona.git) scraper first to produce the **merged** CSV file that is read here.

## Do all your imports

In [1]:
import pandas as pd
import numpy as np
import re
import altair as alt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import stopwordsiso as stopwords

## Read CSV

In [3]:
merged = pd.read_csv('merged.csv')
merged

Unnamed: 0.1,Unnamed: 0,president,date,title,link,venue,session,speech
0,0,Manuel L. Quezon,"November 25, 1935",Message to the First Assembly on National Defense,http://www.officialgazette.gov.ph/1935/11/25/m...,"Legislative Building, Manila","First National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
1,1,Manuel L. Quezon,"June 16, 1936",On the Country’s Conditions and Problems,http://www.officialgazette.gov.ph/1936/06/16/m...,"Legislative Building, Manila","First National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
2,2,Manuel L. Quezon,"October 18, 1937","Improvement of Philippine Conditions, Philippi...",http://www.officialgazette.gov.ph/1937/10/18/m...,"Legislative Building, Manila","First National Assembly, Second Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
3,3,Manuel L. Quezon,"January 24, 1938",Revision of the System of Taxation,http://www.officialgazette.gov.ph/1938/01/24/m...,"Legislative Building, Manila","First National Assembly, Third Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
4,4,Manuel L. Quezon,"January 24, 1939",The State of the Nation and Important Economic...,http://www.officialgazette.gov.ph/1939/01/24/m...,"Legislative Building, Manila","Second National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
...,...,...,...,...,...,...,...,...
79,79,Rodrigo Roa Duterte,"July 23, 2018",Third State of the Nation Address,https://www.officialgazette.gov.ph/2018/07/23/...,"Batasang Pambansa, Quezon City","Seventeenth Congress, Third Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
80,80,Rodrigo Roa Duterte,"July 22, 2019",Fourth State of the Nation Address,https://www.officialgazette.gov.ph/2019/07/22/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, First Session",\n\n\n\nSTATE OF THE NATION ADDRESS OF \nRODRI...
81,81,Rodrigo Roa Duterte,"July 27, 2020",Fifth State of the Nation Address,https://www.officialgazette.gov.ph/2020/07/27/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Second Session",\n\n\n\n\n\n\n5TH STATE OF THE NATION ADDRESS ...
82,82,Rodrigo Roa Duterte,"July 26, 2021",Sixth State of the Nation Address,https://www.officialgazette.gov.ph/2021/07/26/...,"Batasang Pambansa, Quezon City","Eighteenth Congress, Third Session",\n\n\tState of the Nation Address of \n\tRodri...


## Isolate 'Duterte' speeches

The merged file contains speeches by every Philippine president since 1935, so we keep only the rows where `president` is Rodrigo Roa Duterte.

In [4]:
duterte = merged[(merged['president'] == 'Rodrigo Roa Duterte')] #Duterte

## Text analysis

Now we can proceed to the text analysis proper. First, the next two cells define the preprocessing function and the vectorizer parameters, most importantly the stopwords we want the analysis to disregard.

In [5]:
def preprocess_text(text):
    text = text.lower() #lowercases everything
    text = re.sub(r'\d+', '', text) #removes all numbers
    return text
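
To see what this preprocessing does to a string, here is a minimal sketch that repeats the same two steps on a made-up sentence (the sample text is an assumption for illustration, not taken from any SONA). Note that digits are deleted in place, so an ordinal such as "5TH" in the fifth SONA's heading is reduced to "th".

```python
import re

def preprocess_text(text):
    """Same steps as the cell above: lowercase, then delete every run of digits."""
    text = text.lower()
    return re.sub(r'\d+', '', text)

sample = "In 2016, I promised 3 things."   # made-up sentence for illustration
cleaned = preprocess_text(sample)
print(repr(cleaned))

# Digits are removed without touching neighbouring letters
heading = preprocess_text("5TH STATE OF THE NATION ADDRESS")
print(repr(heading))
```

If leftover fragments like "th" matter for your analysis, one option is to add them to the stopword list.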

In [6]:
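y_columns = ['president', 'speeches']
BINARY=False #keeps raw counts instead of 0/1 presence flags
NGRAM_RANGE=(1,1) #single words only
MIN_DF=1 #keeps every term; recent scikit-learn releases reject an integer 0 here
STPWORDS=stopwords.stopwords(["en", "tl"]) #English and Tagalog stopwords
STPWORDS.update(['yung', 'iyan', 'yan', 'diyan', 'applause', 'laughter', 'palakpakan', 'rin', 'din', 'po',
                'pong', 'pang', 'pa', 'nang', 'ng', 'pag',
                'kapag']) #adds Tagalog stopwords and transcript cues not included in the package
STPWORDS=sorted(STPWORDS) #recent scikit-learn releases expect a list here, not a set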

vectorizer = CountVectorizer(
    stop_words=STPWORDS,
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)

## Vectorizing

A simple count of how many times each word occurs in each speech.

In [7]:
X = vectorizer.fit_transform(duterte['speech'])
X



<6x6503 sparse matrix of type '<class 'numpy.int64'>'
	with 11645 stored elements in Compressed Sparse Row format>
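
The matrix above has one row per speech and one column per vocabulary term. As a rough illustration of what that counting involves, the sketch below rebuilds a tiny document-term matrix with only the standard library. It is a stand-in, not `CountVectorizer` itself: the two sentences and the stopword set are made up, and the only detail borrowed from scikit-learn is its default token pattern (words of two or more characters).

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # same pattern as CountVectorizer's default token_pattern

# Made-up mini corpus and stopwords, for illustration only
docs = ["Drugs destroy the nation", "The nation fights drugs and drugs lose"]
stop = {"the", "and"}

# Count the non-stopword tokens in each document
counts = [Counter(t for t in TOKEN.findall(d.lower()) if t not in stop) for d in docs]

# Alphabetical vocabulary, then one row of counts per document
vocab = sorted(set().union(*counts))
matrix = [[c[t] for t in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```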

In [8]:
duterte_vectors = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
duterte_vectors

Unnamed: 0,aabot,aakyat,aalis,aambush,aano,aaway,abandon,abdul,abiding,ability,...,youth,yun,zamboanga,zamora,zeal,zeroing,zhao,zone,zones,zoom
0,0,0,0,0,1,1,0,0,0,1,...,0,3,0,0,0,0,0,1,1,0
1,0,0,0,1,0,2,0,0,0,1,...,3,5,2,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,2,0,0
4,0,0,1,0,0,0,0,0,1,2,...,0,0,0,0,1,1,0,0,1,0
5,0,1,1,0,0,0,1,0,1,2,...,0,3,2,1,0,0,0,1,1,2


In [9]:
duterte_vectors = duterte_vectors.transpose() #swapping columns and row positions

In [10]:
duterte_vectors.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']
duterte_vectors.sort_values('SONA1', ascending=False).head(20)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6
lang,52,42,1,34,15,49
government,32,47,21,29,42,35
kasi,30,21,0,14,9,14
wala,30,24,0,22,11,19
naman,28,20,0,8,7,13
time,22,33,8,26,22,36
people,21,31,15,17,25,29
country,21,22,13,6,28,43
kayo,21,50,4,19,10,22
ninyo,19,39,0,22,7,24


## Add a 'total' mention column

This step is optional. It adds up the counts across the six SONAs to get the total number of mentions of each word.

In [11]:
sona_columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']
duterte_vectors['total'] = duterte_vectors[sona_columns].sum(axis=1) #row-wise sum across the six speeches

In [12]:
duterte_vectors = duterte_vectors.sort_values('total', ascending=False)
duterte_vectors.head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6,total
government,32,47,21,29,42,35,206
lang,52,42,1,34,15,49,193
time,22,33,8,26,22,36,147
people,21,31,15,17,25,29,138
country,21,22,13,6,28,43,133
kayo,21,50,4,19,10,22,126
ninyo,19,39,0,22,7,24,111
wala,30,24,0,22,11,19,106
congress,7,13,12,23,23,26,104
law,8,21,11,15,18,16,89


# TF-IDF

## Duterte speeches

In [13]:
vectorizer = TfidfVectorizer(
    stop_words=STPWORDS, 
    ngram_range=NGRAM_RANGE,
    binary=BINARY,
    min_df=MIN_DF,
    preprocessor=preprocess_text
)
X = vectorizer.fit_transform(duterte['speech'])
duterte_idf = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
duterte_idf.round(2)



Unnamed: 0,aabot,aakyat,aalis,aambush,aano,aaway,abandon,abdul,abiding,ability,...,youth,yun,zamboanga,zamora,zeal,zeroing,zhao,zone,zones,zoom
0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,...,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0
1,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.01,...,0.02,0.04,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0
4,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.02,...,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.0
5,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.01,...,0.0,0.02,0.02,0.01,0.0,0.0,0.0,0.01,0.01,0.02
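
For readers wondering where these scores come from: with its default settings, `TfidfVectorizer` multiplies each raw count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales each speech's row to unit Euclidean length. The sketch below works through that arithmetic by hand on made-up counts (the speech names and numbers are assumptions for illustration), using only the standard library.

```python
import math

# Made-up term counts for two hypothetical speeches
counts = {
    "SONA_A": {"drugs": 2, "nation": 1},
    "SONA_B": {"nation": 1, "pandemic": 3},
}
n_docs = len(counts)
vocab = sorted({t for doc in counts.values() for t in doc})

# Document frequency: in how many speeches does each term appear?
df = {t: sum(t in doc for doc in counts.values()) for t in vocab}

# Smoothed idf, as in scikit-learn's default (smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

tfidf = {}
for name, doc in counts.items():
    raw = {t: doc.get(t, 0) * idf[t] for t in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
    tfidf[name] = {t: v / norm for t, v in raw.items()}

for name, row in tfidf.items():
    print(name, {t: round(v, 3) for t, v in row.items()})
```

A term that appears in every speech ("nation" here) gets the minimum idf of 1, so words concentrated in a single speech end up with relatively higher scores in that speech.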


In [14]:
duterte_idf2 = duterte_idf.transpose()
duterte_idf2.columns = ['SONA1', 'SONA2', 'SONA3', 'SONA4', 'SONA5', 'SONA6']

In [15]:
duterte_idf2.sort_values('SONA4', ascending=False).head(15)

Unnamed: 0,SONA1,SONA2,SONA3,SONA4,SONA5,SONA6
lang,0.292032,0.200921,0.010494,0.213669,0.09278,0.217765
government,0.179712,0.22484,0.220376,0.182247,0.259784,0.155546
time,0.123552,0.157866,0.083953,0.163394,0.136077,0.15999
wala,0.194452,0.13251,0.0,0.159569,0.078527,0.097456
ninyo,0.123153,0.215329,0.0,0.159569,0.049972,0.123102
tsk,0.0,0.0,0.0,0.150919,0.0,0.016419
congress,0.039312,0.06219,0.125929,0.144541,0.142263,0.115549
money,0.005616,0.06219,0.041976,0.125688,0.018556,0.066663
kayo,0.117936,0.239191,0.041976,0.119403,0.061853,0.097772
people,0.117936,0.148299,0.157411,0.106834,0.154633,0.128881


## Looking for specific words

In this part, we look for specific words that we think made a mark during Duterte's SONAs, either because they are mentioned often or because it is unusual for the Chief Executive to say them.

We also include words that we think came up because they were topical at the time the speech was delivered.

In [16]:
duterte_slice = duterte_idf[['drug', 'drugs', 'mining', 'pandemic', 'covid', 'rice']]
duterte_slice.sort_index().round(decimals=2)

Unnamed: 0,drug,drugs,mining,pandemic,covid,rice
0,0.08,0.04,0.03,0.0,0.0,0.02
1,0.0,0.04,0.14,0.0,0.0,0.0
2,0.05,0.05,0.07,0.0,0.0,0.14
3,0.02,0.03,0.0,0.0,0.0,0.02
4,0.01,0.04,0.0,0.21,0.15,0.0
5,0.04,0.08,0.0,0.14,0.16,0.01
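
One caveat when picking words this way: selecting columns with a list raises a `KeyError` if any word never appears in the speeches (or was removed as a stopword). The sketch below shows one possible guard using `DataFrame.reindex`, which fills absent terms with zeros instead. The frame `toy_idf` and the word list are made-up stand-ins for `duterte_idf` and the terms above.

```python
import pandas as pd

# Toy stand-in for duterte_idf: one row per speech, one column per term (made-up values)
toy_idf = pd.DataFrame({"drugs": [0.04, 0.08], "rice": [0.02, 0.01]})

wanted = ["drugs", "rice", "federalism"]  # 'federalism' is absent from the toy vocabulary

# toy_idf[wanted] would raise KeyError; reindex fills absent terms with 0.0 instead
toy_slice = toy_idf.reindex(columns=wanted, fill_value=0.0)

# Report which requested words were not found, so a zero is not mistaken for a real score
missing = [w for w in wanted if w not in toy_idf.columns]

print(toy_slice)
print("not in vocabulary:", missing)
```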


In [17]:
duterte_slice = duterte_slice.stack().reset_index()
duterte_slice = duterte_slice.rename(columns={'level_0': 'sona_no', 'level_1': 'term', 0: 'tfidf'})
duterte_slice.head()

Unnamed: 0,sona_no,term,tfidf
0,0,drug,0.078624
1,0,drugs,0.044928
2,0,mining,0.026276
3,0,pandemic,0.0
4,0,covid,0.0
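
The reshaping above turns a wide table (one column per term) into a long one (one row per speech and term), which is the layout Altair expects. Here is a minimal sketch of the same `stack().reset_index()` pattern on a made-up two-speech frame. The extra `sona_label` column is a hypothetical addition, not part of the original notebook, showing one way to turn the 0-based `sona_no` into the SONA1 to SONA6 labels used earlier.

```python
import pandas as pd

# Made-up wide table: rows are speeches (0-based), columns are terms
wide = pd.DataFrame({"drug": [0.08, 0.0], "rice": [0.02, 0.0]})

# stack() moves the column labels into a second index level;
# reset_index() then turns both index levels into ordinary columns
long = wide.stack().reset_index()
long.columns = ["sona_no", "term", "tfidf"]

# Hypothetical extra: a 1-based label matching the SONA1..SONA6 naming
long["sona_label"] = "SONA" + (long["sona_no"] + 1).astype(str)

print(long)
```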


In [18]:
top_tfidf = duterte_slice.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(10)
top_tfidf.head()

Unnamed: 0,sona_no,term,tfidf
0,0,drug,0.078624
1,0,drugs,0.044928
2,0,mining,0.026276
5,0,rice,0.015011
3,0,pandemic,0.0


## Chart it

In [19]:
# Terms in this list will get a red dot in the visualization
term_list = ['drug', 'drugs']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["sona_no"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600, height=400)
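
The random jitter in the chart cell exists because tied scores (for example, several terms at 0.0) would otherwise share a rank and be drawn on top of each other. It also means the order of tied terms can change on each run. As a possible alternative, and only a sketch on made-up data, ranks can be computed in pandas before charting with `rank(method="first")`, which breaks ties by order of appearance and is therefore reproducible.

```python
import pandas as pd

# Made-up long-format scores with a tie at 0.0 in the first speech
toy = pd.DataFrame({
    "sona_no": [0, 0, 0, 1, 1],
    "term": ["drug", "pandemic", "covid", "rice", "drug"],
    "tfidf": [0.08, 0.0, 0.0, 0.14, 0.05],
})

# Rank within each speech, highest score first; ties keep their row order
toy["rank"] = (
    toy.groupby("sona_no")["tfidf"]
       .rank(method="first", ascending=False)
       .astype(int)
)

print(toy)
```

With a precomputed `rank` column, the chart could encode `x = 'rank:O'` directly and skip both the jitter and the `transform_window` step.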

## Entire SONAs

Here, we do the same thing for the full vocabulary of each SONA, *without* isolating key words.

In [20]:
duterte_idf = duterte_idf.stack().reset_index()
duterte_idf

Unnamed: 0,level_0,level_1,0
0,0,aabot,0.000000
1,0,aakyat,0.000000
2,0,aalis,0.000000
3,0,aambush,0.000000
4,0,aano,0.012652
...,...,...,...
39013,5,zeroing,0.000000
39014,5,zhao,0.000000
39015,5,zone,0.006931
39016,5,zones,0.006931


In [21]:
duterte_idf = duterte_idf.rename(columns={'level_0': 'sona_no','level_1': 'term', 0: 'tfidf'})
duterte_idf

Unnamed: 0,sona_no,term,tfidf
0,0,aabot,0.000000
1,0,aakyat,0.000000
2,0,aalis,0.000000
3,0,aambush,0.000000
4,0,aano,0.012652
...,...,...,...
39013,5,zeroing,0.000000
39014,5,zhao,0.000000
39015,5,zone,0.006931
39016,5,zones,0.006931


In [22]:
all_dutertesona = duterte_idf.sort_values(by=['sona_no','tfidf'], ascending=[True,False]).groupby(['sona_no']).head(10)
all_dutertesona.head()

Unnamed: 0,sona_no,term,tfidf
3140,0,lang,0.292032
3026,0,kasi,0.194452
6370,0,wala,0.194452
3899,0,naman,0.181488
2343,0,government,0.179712


In [23]:
# Terms in this list will get a red dot in the visualization
term_list = ['drug', 'drugs']

# adding a little randomness to break ties in term ranking
all_dutertesona_plusRand = all_dutertesona.copy()
all_dutertesona_plusRand['tfidf'] = all_dutertesona_plusRand['tfidf'] + np.random.rand(all_dutertesona.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(all_dutertesona_plusRand).encode(
    x = 'rank:O',
    y = 'sona_no:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["sona_no"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600, height=400)