# Introduction 

The website scraped is the Official Gazette of the Philippines (https://www.officialgazette.gov.ph/section/executive-orders/), the country's repository of legal documents, and specifically its section on executive orders. This site was chosen because legal documents are typically not available through APIs, while the site itself presents this information in a structured manner across different portals. Timely access to legal texts is key to several undertakings that aim to democratize access to legal information, as well as to a body of research applying natural language processing to the legal domain (Dyevre, 2021; Ibarra & Revilla, 2014; Peramo et al., 2021; Virtucio et al., 2018).

The website was scraped at 8:30 PM on September 25. The site's robots.txt file can be seen here:

# Building and Running the Selenium Scraper 


##### Importing libraries 

In [1]:
# importing the necessary libraries 

from selenium import webdriver 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.chrome.options import Options 
from selenium.webdriver.common.by import By 
from time import sleep 
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager 
import pandas as pd 
from bs4 import BeautifulSoup
import requests 

In [2]:
# error-checking libraries (By and WebDriverWait are already imported above)
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions


##### Initializing Selenium 

In [3]:
# setting target-page
base_url = "https://www.officialgazette.gov.ph/section/executive-orders/"

# Optional: seeing the GUI is fun and helps make sure the script is working, so going headless is not recommended at first

# # window settings - UNCOMMENT after running the notebook fully,
# # then pass options=options to webdriver.Chrome() below
# options = webdriver.ChromeOptions()
# options.binary_location = ""
# options.add_argument("--headless")
# options.add_argument("--start-maximized")
# options.add_argument("--incognito")

# initializing the driver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(base_url)
sleep(3) 


##### Scraping the relevant content 

In [4]:
data_list = []

# iterating through 11 pages of up to 10 EOs each, for just over 100 observations:
for i in range(11): 
    sleep(3)

    # iterate through each executive order collection and extract relevant features

    # this is the main div that contains the page body, which holds the EO articles. There are up to 10 EOs on each page/body.
    main_body = driver.find_element(by=By.XPATH, value="/html/body/div[2]/section/main/div/div[1]")  

    # contains each EO body. Nested underneath this are: 1) headers containing metadata & the EO name, 2) a brief summary of the EO, and 3) a footer containing category and tag links
    eo_articles = main_body.find_elements(by=By.TAG_NAME, value='article') 

    for eo in eo_articles: 
        ##### Header Objects
        # title
        entry_title = eo.find_element(by=By.TAG_NAME, value='h3') # header value to extract subdomains
        title = entry_title.text # E.G. Executive Order No. 5, s. 2022

        # date signed (metadata)
        signed_on = eo.find_element(by=By.TAG_NAME, value='time').text # E.G. September 22, 2022 
        
        # url 
        url = eo.find_element(by=By.TAG_NAME, value='a').get_attribute('href') # E.G. https://www.officialgazette.gov.ph/section/executive-orders/...

        ##### Body Objects 
        # summary 
        summary = eo.find_element(by=By.TAG_NAME, value='p').text # E.G. TRANSFERRING THE ATTACHMENT OF TECHNICAL EDUCATION AND SKILLS DEPT FROM ...

        
        #### Footer Objects 
        footer = eo.find_element(by=By.CLASS_NAME, value='entry-footer')

        # categories - these contain classifications 
        cats_links = footer.find_element(by=By.CLASS_NAME, value='cat-links')
        cats_tags = cats_links.find_elements(by=By.TAG_NAME, value='a') # contains information on what type of legal document/categories this piece is filed under
        cats_list = [cat.text for cat in cats_tags] # E.G. ['Executive Issuances', 'Executive Orders', 'Laws and Issuances']

        # tags 
        tags_links = footer.find_element(by=By.CLASS_NAME, value='tags-links')
        tags_posted = tags_links.find_elements(by=By.TAG_NAME, value='a') # contains info on the relevant tag links tied to the executing president, and document-type
        tags_list = [tags.text for tags in tags_posted] # E.G. ['Executive Issuances', 'Executive Orders', 'Ferdinand R. Marcos Jr.']


        # print(tags_posted)
        # print(tags_list)

        # each row of the data frame will contain the following info:
        data_dict = {'title': title, 
                    'signed_on': signed_on, 
                    'url': url, 
                    'summary': summary, 
                    'categories': cats_list, 
                    'tags': tags_list}

        # appending the row to the list of rows (converted to a data frame later)
        data_list.append(data_dict)

    # clicking on the "Older entries" link to load the next page of results
    previous_results_link = driver.find_element(by=By.PARTIAL_LINK_TEXT, value="Older entries")
    previous_results_link.click()    

In [5]:
# exiting the driver 
driver.quit()

In [6]:
# outputting the data frame 
df = pd.DataFrame(data_list)
df

| | title | signed_on | url | summary | categories | tags |
|---|---|---|---|---|---|---|
| 0 | Executive Order No. 5, s. 2022 | September 16, 2022 | https://www.officialgazette.gov.ph/2022/09/16/... | TRANSFERRING THE ATTACHMENT OF THE TECHNICAL E... | [Executive Issuances, Executive Orders, Laws a... | [Executive Issuances, Executive Orders, Ferdin... |
| 1 | Executive Order No. 4, s. 2022 | September 13, 2022 | https://www.officialgazette.gov.ph/2022/09/13/... | DIRECTING THE IMPLEMENTATION OF A MORATORIUM O... | [Executive Issuances, Executive Orders, Laws a... | [Executive Issuances, Executive Orders, Ferdin... |
| 2 | Executive Order No. 3, s. 2022 | September 12, 2022 | https://www.officialgazette.gov.ph/2022/09/12/... | ALLOWING VOLUNTARY WEARING OF FACEMASKS IN OUT... | [Executive Issuances, Executive Orders, Laws a... | [Executive Issuances, Executive Orders, Ferdin... |
| 3 | Executive Order No. 2, s. 2022 | June 30, 2022 | https://www.officialgazette.gov.ph/2022/06/30/... | REORGANIZING AND RENAMING THE PRESIDENTIAL COM... | [Executive Issuances, Executive Orders, Laws a... | [Executive Issuances, Executive Orders, Ferdin... |
| 4 | Executive Order No. 1, s. 2022 | June 30, 2022 | https://www.officialgazette.gov.ph/2022/06/30/... | REORGANIZING THE OFFICE OF THE PRESIDENT INCLU... | [Executive Issuances, Executive Orders, Laws a... | [Executive Issuances, Executive Orders, Ferdin... |
| ... | ... | ... | ... | ... | ... | ... |
| 105 | Executive Order No. 77, s. 2019 | March 15, 2019 | https://www.officialgazette.gov.ph/2019/03/15/... | PRESCRIBING RULES AND REGULATIONS AND RATES OF... | [Executive Issuances, Executive Orders, Laws a... | [Executive Issuances, Executive Orders, Rodrig... |
| 106 | Executive Order No. 76, s. 2019 | March 15, 2019 | https://www.officialgazette.gov.ph/2019/03/15/... | AMENDING EXECUTIVE ORDER NO. 201 (S. 2016), EN... | [Executive Issuances, Executive Orders, Laws a... | [Executive Issuances, Executive Orders, Rodrig... |
| 107 | Executive Order No. 75, s. 2019 | February 1, 2019 | https://www.officialgazette.gov.ph/2019/02/01/... | DIRECTING ALL DEPARTMENTS, BUREAUS, OFFICES AN... | [Executive Issuances, Executive Orders, Laws a... | [Executive Issuances, Executive Orders, Rodrig... |
| 108 | Executive Order No. 74, s. 2019 | February 1, 2019 | https://www.officialgazette.gov.ph/2019/02/01/... | MALACAÑAN PALACE MANILA BY THE PRESIDENT OF TH... | [Executive Issuances, Executive Orders, Laws a... | [Executive Issuances, Executive Orders, Rodrig... |
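The `title` and `signed_on` values come out of the scraper as plain strings. For later analysis it may help to split them into typed fields. The following is a minimal sketch using only the standard library; the helper names `parse_title` and `parse_signed_on` are hypothetical, and the patterns assume every row follows the "Executive Order No. N, s. YYYY" and "Month D, YYYY" shapes seen in the output above.

```python
import re
from datetime import datetime

# Assumed title shape, e.g. "Executive Order No. 5, s. 2022".
TITLE_PATTERN = re.compile(r"Executive Order No\.\s*(\d+),\s*s\.\s*(\d{4})")


def parse_title(title):
    """Return (order_number, series_year), or None if the title does not match."""
    match = TITLE_PATTERN.search(title)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_signed_on(text):
    """Convert a date such as "September 16, 2022" into a datetime.date.

    %B matches full month names in the current locale, so this assumes
    an English (or default "C") locale.
    """
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


# Example row using values shown in the scraped output.
row = {"title": "Executive Order No. 5, s. 2022", "signed_on": "September 16, 2022"}
order_number, series_year = parse_title(row["title"])
signed_date = parse_signed_on(row["signed_on"])
print(order_number, series_year, signed_date.isoformat())
```

Returning `None` for titles that do not match makes it easy to spot rows that need manual review instead of failing midway through the data set.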


##### Exporting to CSV

In [7]:
df.to_csv('executive_orders.csv')
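Two things are worth knowing about this export. By default, `to_csv` writes the row index as an unnamed first column (passing `index=False` omits it), and list-valued cells such as `categories` and `tags` are stored as their string form, so they come back as strings when the file is read again. The sketch below shows one possible way to keep the lists recoverable. It swaps in the standard-library `csv` and `json` modules instead of pandas so that it runs on its own; the helper names `rows_to_csv` and `csv_to_rows` are hypothetical.

```python
import csv
import io
import json

# Columns whose cells hold Python lists in the scraped rows.
LIST_COLUMNS = ("categories", "tags")


def rows_to_csv(rows, fieldnames):
    """Serialize row dicts to CSV text, JSON-encoding the list columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        encoded = dict(row)
        for column in LIST_COLUMNS:
            encoded[column] = json.dumps(row[column])
        writer.writerow(encoded)
    return buffer.getvalue()


def csv_to_rows(text):
    """Read the CSV text back, decoding the list columns into lists again."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for column in LIST_COLUMNS:
            row[column] = json.loads(row[column])
        rows.append(row)
    return rows


# One sample row built from values shown in the notebook.
sample = [{
    "title": "Executive Order No. 5, s. 2022",
    "signed_on": "September 16, 2022",
    "categories": ["Executive Issuances", "Executive Orders", "Laws and Issuances"],
    "tags": ["Executive Issuances", "Executive Orders", "Ferdinand R. Marcos Jr."],
}]
fields = ["title", "signed_on", "categories", "tags"]
csv_text = rows_to_csv(sample, fields)
restored = csv_to_rows(csv_text)
print(restored[0]["tags"])
```

JSON is used for the list cells because the `csv` module quotes the embedded commas, and `json.loads` restores real lists on the way back in.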

# Difficulties encountered 
- Tracking the elements. Be very particular about when to use `find_element()` (for a single container) and `find_elements()` (for lists of child elements) :)
- Figuring out whether to target elements by XPath, class name, or tag name in `find_element()`, given the page structure. Some selectors wouldn't work inside a standard for-loop.
- The Selenium docs are hard to parse :) and some elements are insufficiently contextualized.
- Making sure the order of operations made sense. You need to add `sleep()`, `refresh()`, and the like so that pages are rendered before you look for elements. Additionally, if you use loops to iterate through multiple pages, you have to re-locate the elements on each new page to avoid `StaleElementReferenceException`, which cropped up often.
- Sometimes the page will throw errors because of your internet connection (relevant during the rainy season). It's important to refresh and check whether it's an internet problem, because this slowed down debugging and was a red herring: an error that happens in the middle of a scraping cycle will not throw a `ConnectionError`.
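One way to soften the stale-element and flaky-connection problems above is to retry a lookup a few times before giving up. The following is a minimal sketch, not the approach used in this notebook: `retry` is a hypothetical helper, and `StaleElementStandIn` is a stand-in exception so the example runs without a browser. With Selenium, the real `StaleElementReferenceException` would be passed instead.

```python
import time


class StaleElementStandIn(Exception):
    """Stand-in for selenium's StaleElementReferenceException (demo assumption)."""


def retry(action, exceptions, attempts=3, delay=0.0):
    """Call action() until it succeeds, retrying only on the given exceptions.

    Re-raises the last exception once `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)  # give the page a moment before trying again


# Simulated lookup that fails twice (as if the element went stale) and then succeeds.
calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise StaleElementStandIn("element went stale")
    return "Executive Order No. 5, s. 2022"


title = retry(flaky_lookup, (StaleElementStandIn,), attempts=5)
print(title, calls["count"])
```

In the scraper, `action` would be a function that re-locates the element on every call, for example `lambda: driver.find_element(By.TAG_NAME, "h3").text`, so each retry works with a fresh reference. As an alternative to fixed `sleep()` calls, Selenium's `WebDriverWait(driver, 10).until(expected_conditions.presence_of_element_located((By.TAG_NAME, "article")))` waits only as long as the page actually needs.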


# References 

Dyevre, A. (2021). Text-mining for lawyers: How machine learning techniques can advance our understanding of legal discourse. Erasmus Law Review, 14, 7. https://heinonline.org/HOL/Page?handle=hein.journals/erasmus14&id=9&div=&collection=

Ibarra, V. C., & Revilla, C. D. (2014). Consumers’ awareness on their eight basic rights: A comparative study of Filipinos in the Philippines and Guam (SSRN Scholarly Paper No. 2655817). Social Science Research Network. https://papers.ssrn.com/abstract=2655817

Peramo, E., Cheng, C., & Cordel, M. (2021). Juris2vec: Building word embeddings from Philippine jurisprudence. 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 121–125. https://doi.org/10.1109/ICAIIC51459.2021.9415251

Virtucio, M. B. L., Aborot, J. A., Abonita, J. K. C., Aviñante, R. S., Copino, R. J. B., Neverida, M. P., Osiana, V. O., Peramo, E. C., Syjuco, J. G., & Tan, G. B. A. (2018). Predicting decisions of the Philippine Supreme Court using natural language processing and machine learning. 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), 02, 130–135. https://doi.org/10.1109/COMPSAC.2018.10348