# The President's SONA

This notebook scrapes https://www.officialgazette.gov.ph/past-sona-speeches/ for copies of the State of the Nation Addresses (SONAs) of Philippine presidents from 1935 to 2021.

The goal is to be able to use the SONAs for textual analysis. These speeches are delivered before Congress every fourth Monday of July and are widely anticipated for setting the tone of an administration. A sample analysis is provided in the latter part of the notebook.

## Do all your imports

In [None]:
import pandas as pd

import time
import re
import numpy as np

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select

from webdriver_manager.chrome import ChromeDriverManager

import requests
from bs4 import BeautifulSoup

## Allow Selenium to open up Chrome and automatically navigate through the website

In [2]:
from selenium.webdriver.chrome.service import Service

# Selenium 4 expects the driver path to be wrapped in a Service object.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))



Could not get version for google-chrome with the any command: /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version
Current google-chrome version is UNKNOWN
Get LATEST chromedriver version for UNKNOWN google-chrome
Driver [/Users/prinzmagtulis/.wdm/drivers/chromedriver/mac64/96.0.4664.45/chromedriver] found in cache


In [3]:
driver.get("https://www.officialgazette.gov.ph/past-sona-speeches/")

## Scraping proper: table

The first step is to scrape everything in the table itself, leaving the contents of the **linked pages** for later.

In [None]:
rows = driver.find_elements(By.TAG_NAME, "tr")

We arrange the information into a **list of dictionaries** in preparation for transforming it into a **data frame** for pandas analysis later.

In [6]:
dataset = []
for row in rows[1:]:  # skip the header row
    data = {}
    all_tds = row.find_elements(By.TAG_NAME, "td")
    links = row.find_elements(By.TAG_NAME, "a")
    if len(all_tds) == 5:
        # Full row: the first cell holds the president's name.
        prexy = data['president'] = all_tds[0].text
        data['date'] = all_tds[1].text
        data['title'] = all_tds[2].text
        # Use the second link in the row when there is one, otherwise the first.
        data['link'] = (links[1] if len(links) > 1 else links[0]).get_attribute('href')
        data['venue'] = all_tds[3].text
        data['session'] = all_tds[4].text
    else:
        # Shorter row: the president cell is missing, so reuse the last name seen.
        data['president'] = prexy
        data['date'] = all_tds[0].text
        data['title'] = all_tds[1].text
        data['link'] = links[0].get_attribute('href')
        data['venue'] = all_tds[2].text
        data['session'] = all_tds[3].text
    dataset.append(data)
dataset

[{'president': 'Manuel L. Quezon',
  'date': 'November 25, 1935',
  'title': 'Message to the First Assembly on National Defense',
  'link': 'http://www.officialgazette.gov.ph/1935/11/25/message-of-president-quezon-to-the-first-assembly-on-national-defense-november-25-1935/',
  'venue': 'Legislative Building, Manila',
  'session': 'First National Assembly, First Session'},
 {'president': 'Manuel L. Quezon',
  'date': 'June 16, 1936',
  'title': 'On the Country’s Conditions and Problems',
  'link': 'http://www.officialgazette.gov.ph/1936/06/16/manuel-l-quezon-second-state-of-the-nation-address-june-16-1936/',
  'venue': 'Legislative Building, Manila',
  'session': 'First National Assembly, First Session'},
 {'president': 'Manuel L. Quezon',
  'date': 'October 18, 1937',
  'title': 'Improvement of Philippine Conditions, Philippine Independence, and Relations with American High Commissioner',
  'link': 'http://www.officialgazette.gov.ph/1937/10/18/manuel-l-quezon-third-state-of-the-nation-
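
The carry-forward trick in the loop above (reusing the last president seen whenever a row has fewer cells) can be tried without a browser. Below is a minimal sketch that works on plain lists of cell texts instead of Selenium elements. The function name `rows_to_records` and the sample rows are made up for illustration, and the link column is left out to keep it short.

```python
def rows_to_records(rows):
    """Turn lists of cell texts into dicts, carrying the president forward."""
    fields = ["date", "title", "venue", "session"]
    records = []
    president = None  # last president seen so far
    for cells in rows:
        if len(cells) == 5:
            # Full row: the first cell is the president's name.
            president, rest = cells[0], cells[1:]
        else:
            # Shorter row: no president cell, so keep the previous name.
            rest = cells
        record = {"president": president}
        record.update(zip(fields, rest))
        records.append(record)
    return records


# Hypothetical sample rows, not taken from the Official Gazette table.
sample = [
    ["President A", "January 1, 1950", "First address", "Hall X", "Session 1"],
    ["January 1, 1951", "Second address", "Hall X", "Session 2"],
    ["President B", "January 1, 1952", "Third address", "Hall Y", "Session 3"],
]
records = rows_to_records(sample)
print(records[1]["president"])
```

The second row has only four cells, so it inherits the name from the row before it.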

Our **first data frame**

In [14]:
df1 = pd.DataFrame(dataset)
df1.head()

Unnamed: 0,president,date,title,link,venue,session
0,Manuel L. Quezon,"November 25, 1935",Message to the First Assembly on National Defense,http://www.officialgazette.gov.ph/1935/11/25/m...,"Legislative Building, Manila","First National Assembly, First Session"
1,Manuel L. Quezon,"June 16, 1936",On the Country’s Conditions and Problems,http://www.officialgazette.gov.ph/1936/06/16/m...,"Legislative Building, Manila","First National Assembly, First Session"
2,Manuel L. Quezon,"October 18, 1937","Improvement of Philippine Conditions, Philippi...",http://www.officialgazette.gov.ph/1937/10/18/m...,"Legislative Building, Manila","First National Assembly, Second Session"
3,Manuel L. Quezon,"January 24, 1938",Revision of the System of Taxation,http://www.officialgazette.gov.ph/1938/01/24/m...,"Legislative Building, Manila","First National Assembly, Third Session"
4,Manuel L. Quezon,"January 24, 1939",The State of the Nation and Important Economic...,http://www.officialgazette.gov.ph/1939/01/24/m...,"Legislative Building, Manila","Second National Assembly, First Session"


## Scraping proper: actual speeches

We use BeautifulSoup for this one. The process is easier because the links are already in the first df, so all we have to do is **access and grab** the contents of each page one by one.

I'm commenting this part out so the notebook doesn't print a wall of text, but hey, it runs very well, so try it on your own!

In [1]:
# speeches=[]
# for speech in dataset[0:]:
#     href = speech['link']
#     raw_html = requests.get(href).content
#     doc = BeautifulSoup(raw_html, "html.parser")
#     headers = doc.find_all(class_= 'large-9 large-centered columns')[1]
#     text={}
#     text['link']= speech['link']
#     text['speech']= headers.text 
#     speeches.append(text)
# speeches
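
The cell above takes the second block with the class `large-9 large-centered columns` and would stop with an `IndexError` on any page laid out differently. Here is a hedged sketch of a more forgiving version. The helper names `extract_speech` and `fetch_speeches`, the one-second pause, and the sample HTML are my own assumptions, not part of the original run.

```python
import time

import requests
from bs4 import BeautifulSoup


def extract_speech(raw_html):
    """Return the text of the second matching block, or None if it is missing."""
    doc = BeautifulSoup(raw_html, "html.parser")
    blocks = doc.find_all(class_="large-9 large-centered columns")
    if len(blocks) < 2:
        return None
    return blocks[1].get_text().strip()


def fetch_speeches(records, pause=1.0):
    """Download each linked page and collect the speech text (not run here)."""
    speeches = []
    for record in records:
        response = requests.get(record["link"], timeout=30)
        speeches.append({"link": record["link"],
                         "speech": extract_speech(response.content)})
        time.sleep(pause)  # assumed pause, to go easy on the server
    return speeches


# Hypothetical HTML that mimics the structure the original cell relies on.
SAMPLE = (
    '<div class="large-9 large-centered columns">Header</div>'
    '<div class="large-9 large-centered columns"><p>Speech text.</p></div>'
)
print(extract_speech(SAMPLE))
```

Pages that return `None` can then be inspected by hand instead of halting the whole loop.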

Each speech is stored as a **single block** of text per row so that it lines up with its entry in the df. This is, of course, not ideal and could be improved. Below is a **second data frame** containing the links and the speeches themselves.

We then **merge** this information with our earlier df.

In [17]:
df2=pd.DataFrame(speeches)
df2

Unnamed: 0,link,speech
0,http://www.officialgazette.gov.ph/1935/11/25/m...,\nMessage\nof\nHis Excellency Manuel L. Quezon...
1,http://www.officialgazette.gov.ph/1936/06/16/m...,\nMessage\nof\nHis Excellency Manuel L. Quezon...
2,http://www.officialgazette.gov.ph/1937/10/18/m...,\nMessage\nof\nHis Excellency Manuel L. Quezon...
3,http://www.officialgazette.gov.ph/1938/01/24/m...,\nMessage\nof\nHis Excellency Manuel L. Quezon...
4,http://www.officialgazette.gov.ph/1939/01/24/m...,\nMessage\nof\nHis Excellency Manuel L. Quezon...
5,http://www.officialgazette.gov.ph/1940/01/22/m...,\nMessage\nof\nHis Excellency Manuel L. Quezon...
6,http://www.officialgazette.gov.ph/1941/01/31/m...,\nMessage\nof\nHis Excellency Manuel L. Quezon...
7,http://www.officialgazette.gov.ph/1945/06/09/s...,\nMessage\nof\nHis Excellency Sergio Osmeña\nP...
8,http://www.officialgazette.gov.ph/1946/06/03/m...,\nMessage\nof\nHis Excellency Manuel Roxas\nPr...
9,http://www.officialgazette.gov.ph/1947/01/27/m...,\nMessage\nof\nHis Excellency Manuel Roxas\nPr...


Our final df.

In [18]:
merged = df1.merge(df2, on='link')
merged


Unnamed: 0,president,date,title,link,venue,session,speech
0,Manuel L. Quezon,"November 25, 1935",Message to the First Assembly on National Defense,http://www.officialgazette.gov.ph/1935/11/25/m...,"Legislative Building, Manila","First National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
1,Manuel L. Quezon,"June 16, 1936",On the Country’s Conditions and Problems,http://www.officialgazette.gov.ph/1936/06/16/m...,"Legislative Building, Manila","First National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
2,Manuel L. Quezon,"October 18, 1937","Improvement of Philippine Conditions, Philippi...",http://www.officialgazette.gov.ph/1937/10/18/m...,"Legislative Building, Manila","First National Assembly, Second Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
3,Manuel L. Quezon,"January 24, 1938",Revision of the System of Taxation,http://www.officialgazette.gov.ph/1938/01/24/m...,"Legislative Building, Manila","First National Assembly, Third Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
4,Manuel L. Quezon,"January 24, 1939",The State of the Nation and Important Economic...,http://www.officialgazette.gov.ph/1939/01/24/m...,"Legislative Building, Manila","Second National Assembly, First Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
5,Manuel L. Quezon,"January 22, 1940",The State of the Nation,http://www.officialgazette.gov.ph/1940/01/22/m...,"Legislative Building, Manila","Second National Assembly, Second Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
6,Manuel L. Quezon,"January 31, 1941",The State of the Nation,http://www.officialgazette.gov.ph/1941/01/31/m...,"Legislative Building, Manila","Second National Assembly, Third Session",\nMessage\nof\nHis Excellency Manuel L. Quezon...
7,Sergio Osmeña,"June 9, 1945",Message to the First Congress of the Commonwea...,http://www.officialgazette.gov.ph/1945/06/09/s...,"Lepanto Street, Manila",First Congress of the Commonwealth,\nMessage\nof\nHis Excellency Sergio Osmeña\nP...
8,Manuel Roxas,"June 3, 1946",The State of the Nation,http://www.officialgazette.gov.ph/1946/06/03/m...,"Lepanto Street, Manila",Second Congress of the Commonwealth,\nMessage\nof\nHis Excellency Manuel Roxas\nPr...
9,Manuel Roxas,"January 27, 1947",Message on the State of the Nation,http://www.officialgazette.gov.ph/1947/01/27/m...,"Lepanto Street, Manila","First Congress, First Session",\nMessage\nof\nHis Excellency Manuel Roxas\nPr...
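
Since the merge relies on `link` alone, it can be worth checking that no row lost its speech along the way. The sketch below uses two tiny made-up frames (`left` and `right` stand in for `df1` and `df2`, and the links are placeholders) with pandas' `validate` and `indicator` options.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2; the links are placeholders.
left = pd.DataFrame({"president": ["A", "B", "C"],
                     "link": ["u1", "u2", "u3"]})
right = pd.DataFrame({"link": ["u1", "u2"],
                      "speech": ["text one", "text two"]})

# how="left" keeps every table row; validate raises if a link is duplicated
# on either side; indicator adds a _merge column showing where each row came from.
checked = left.merge(right, on="link", how="left",
                     validate="one_to_one", indicator=True)

# Links that exist in the table but have no scraped speech.
missing = checked.loc[checked["_merge"] == "left_only", "link"].tolist()
print(missing)
```

An empty `missing` list would mean every row in the table found its speech.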


## Initial peek: regex

We are now ready to take an **initial look** at the texts that we have. For this part, I provide some examples below using **regex**.

The words searched here are drawn from peer-reviewed textual studies that gauge **populism**.

In [19]:
# Check the data type of each column we are dealing with.
merged.dtypes

president    object
date         object
title        object
link         object
venue        object
session      object
speech       object
dtype: object

### 'elite'

The word "elite" has been found to be used often by populist leaders. This initial analysis finds that three Philippine presidents (one of whom was **dictator** Ferdinand Marcos Sr.) used the word, or a variant such as "elites", in their SONAs, most recently **current president Rodrigo Roa Duterte**.

In [20]:
merged[merged.speech.str.contains(r"\belite", case=False, regex=True)].president.value_counts()

Ferdinand E. Marcos        2
Joseph Ejercito Estrada    1
Rodrigo Roa Duterte        1
Name: president, dtype: int64

In [25]:
merged.speech.str.extractall(r'(.*\belite.+)', re.IGNORECASE)

Unnamed: 0,match,0
31,0,"It is fortunate that the nation will, just two..."
37,0,"Clearly, we face here the danger that our New ..."
60,0,Our war on poverty is in the acceleration of t...
81,0,Great wealth enables economic elites and corpo...


### 'democracy'

Conversely, the three presidents listed above mentioned "democracy" the least in their SONAs, a word that figures prominently in the speeches of other past presidents.

**Joseph Estrada**, whose term was cut short by a popular revolt in 2001, did not mention democracy at all in his three speeches.
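
One possible way to back up a "mentioned the least" comparison is to count mentions per president instead of listing matching lines. The sketch below does that on a toy frame (`toy` stands in for `merged`, and the speeches are made up). Unlike filtering followed by `value_counts`, it keeps speeches with zero mentions in the tally.

```python
import re

import pandas as pd

# Hypothetical toy frame standing in for `merged`; the speeches are made up.
toy = pd.DataFrame({
    "president": ["A", "A", "B"],
    "speech": [
        "Democracy matters. Our democracy endures.",
        "No mention here.",
        "A democracy of equals.",
    ],
})

# Count whole-word occurrences in each speech, ignoring case.
mentions = toy["speech"].str.count(r"\bdemocracy\b", flags=re.IGNORECASE)

# Add up the counts across each president's speeches.
per_president = mentions.groupby(toy["president"]).sum()
print(per_president.to_dict())
```

Dividing by the number of speeches or by word count would make presidents with different numbers of SONAs easier to compare.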

In [1]:
#pd.set_option('display.max_rows', None)
merged.speech.str.extractall(r'(.*\bdemocracy.+)', re.IGNORECASE).head(5)
