# Introduction 

The website scraped is the Official Gazette of the Philippines, the country's repository of legal documents, and specifically its section on executive orders (https://www.officialgazette.gov.ph/section/executive-orders/). This site was chosen because legal documents are typically not available through APIs, and the site itself presents this information in a structured manner across different portals. Timely and effective access to jurisprudence is key to several undertakings that aim to democratize access to legal information, as well as to a growing body of research that applies AI-based natural language processing to the legal domain (Dyevre, 2021; Ibarra & Revilla, 2014; Peramo et al., 2021; Virtucio et al., 2018).

The website was scraped on <insert time and date here>. The site's robots.txt file can be seen here: <insert img>
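
Before scraping, it can help to check programmatically what a robots.txt file permits. The sketch below uses the standard-library `urllib.robotparser` on a small sample. The sample rules and the `is_allowed` helper are assumptions made for illustration; they are not the Official Gazette's actual robots.txt, which should be read from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
# The site's real rules may differ and should be checked directly.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


target = "https://www.officialgazette.gov.ph/section/executive-orders/"
print(is_allowed(SAMPLE_ROBOTS_TXT, target))
```

To check the live file instead of a sample string, `RobotFileParser` also offers `set_url()` and `read()`, which download and parse the file in one step.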

# Building and Running the Selenium Scraper 


##### Importing libraries 

In [1]:
# importing the necessary libraries 

from selenium import webdriver 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.chrome.options import Options 
from selenium.webdriver.common.by import By 
from time import sleep 
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager 
import pandas as pd 
from bs4 import BeautifulSoup
import requests 

##### Initializing Selenium 

In [2]:
# setting target-page
base_url = "https://www.officialgazette.gov.ph/section/executive-orders/"

# # window settings - UNCOMMENT after running the notebook fully, then pass options=options to webdriver.Chrome
# options = webdriver.ChromeOptions()
# options.binary_location = ""
# options.add_argument("--headless")
# options.add_argument("--start-maximized")
# options.add_argument("--incognito")

# initializing the driver and loading the target page
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(base_url)
sleep(3) 
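
A fixed `sleep(3)` either waits longer than needed or not long enough on a slow connection. Selenium's `WebDriverWait`, which the notebook imports, addresses this by polling until a condition holds. The library-free sketch below only illustrates that polling idea. `wait_until` and `page_has_articles` are hypothetical names, and the "page" is a stand-in counter rather than a real browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page that "finishes loading" on the third check.
attempts = {"count": 0}


def page_has_articles():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(page_has_articles, timeout=2.0, poll_interval=0.01)
print(attempts["count"])
```

With a live driver, the condition could be something like `lambda: driver.find_elements(By.TAG_NAME, 'article')`, since an empty list is falsy. In practice `WebDriverWait(driver, 10).until(...)` would likely be the more idiomatic choice.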


##### Scraping the relevant content 

In [3]:

data_list = []

# range(11) visits 11 pages of 10 EOs each, which gives the 110 rows reported by df.info(); use range(10) for exactly 100 observations
for i in range(11):

    # this is the main div that contains the body. It is located again on every page, because references to elements of the previous page go stale once the browser navigates
    main_body = driver.find_element(by=By.XPATH, value="/html/body/div[2]/section/main/div/div[1]")

    # list of the EO bodies on the current page (10 per page). Nested underneath each are: 1) headers containing metadata & the EO name, 2) a brief summary of the EO, and 3) a footer containing category and tag links
    eo_articles = main_body.find_elements(by=By.TAG_NAME, value='article')

    # iterate through each executive order on the page and extract relevant features
    for eo in eo_articles:
        # title
        entry_title = eo.find_element(by=By.CLASS_NAME, value='entry-title') # header element that holds the EO name
        title = entry_title.text # E.G. Executive Order No. 5, s. 2022

        # date signed (metadata)
        signed_on = eo.find_element(by=By.TAG_NAME, value='time').text # E.G. September 22, 2022

        # url
        url = eo.find_element(by=By.TAG_NAME, value='a').get_attribute('href') # E.G. https://www.officialgazette.gov.ph/section/executive-orders/...

        # summary
        summary = eo.find_element(by=By.TAG_NAME, value='p').text # E.G. TRANSFERRING THE ATTACHMENT OF TECHNICAL EDUCATION AND SKILLS DEPT FROM ...

        # category and tags.
        categories_posted = eo.find_elements(by=By.CLASS_NAME, value='cat-links') # contains information on what type of legal document/categories this piece is filed under
        cats_list = [cats.text for cats in categories_posted] # E.G. ['Executive Issuances, Executive Orders, Laws and Issuances']

        tags_posted = eo.find_elements(by=By.CLASS_NAME, value='tag-links') # contains info on the relevant tag links for the type of issuance
        tags_list = [tags.text for tags in tags_posted] # E.G. ['Tagged Executive Issuances, Executive Orders, Ferdinand R. Marcos Jr.']

        # each row of the data frame will contain the following info:
        data_dict = {'title': title,
                     'signed_on': signed_on,
                     'url': url,
                     'summary': summary,
                     'categories': cats_list,
                     'tags': tags_list}

        # appending the row to the list that later becomes the dataframe
        data_list.append(data_dict)

        ## testing the loop logic -- comment/uncomment as needed
        # print(title)
        # print(url)
        # print(signed_on)
        # print(summary)
        # print(cats_list)
        # print(tags_list)
        # print('----')

    # once every EO on the page has been collected, click the older-issuances link to load the next page
    older_issuances_button = driver.find_element(by=By.XPATH, value='/html/body/div[2]/section/main/div/div[1]/div/div/nav/div/div')
    older_issuances_button.click()
    sleep(3) # give the next page time to load, as after the initial driver.get


    

In [4]:

df = pd.DataFrame(data_list)
df.shape
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   title       110 non-null    object
 1   signed_on   110 non-null    object
 2   url         110 non-null    object
 3   summary     110 non-null    object
 4   categories  110 non-null    object
 5   tags        110 non-null    object
dtypes: object(6)
memory usage: 5.3+ KB
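
Every column comes back with dtype `object`, so the dates and tag strings are still raw text. The sketch below shows one way these values might be normalized with the standard library. `parse_signed_on` and `split_labels` are hypothetical helpers, and they assume every row follows the example formats given in the scraping cell's comments.

```python
from datetime import date, datetime


def parse_signed_on(text: str) -> date:
    """Convert a string such as 'September 22, 2022' into a date object."""
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


def split_labels(raw_items, prefix=""):
    """Split comma-separated label strings into one flat list of labels.

    prefix is an optional leading word (e.g. 'Tagged ') to strip first.
    """
    labels = []
    for item in raw_items:
        item = item.strip()
        if prefix and item.startswith(prefix):
            item = item[len(prefix):]
        labels.extend(part.strip() for part in item.split(",") if part.strip())
    return labels


# Sample values copied from the examples in the scraping cell's comments.
print(parse_signed_on("September 22, 2022"))
print(split_labels(
    ["Tagged Executive Issuances, Executive Orders, Ferdinand R. Marcos Jr."],
    prefix="Tagged ",
))
```

On the data frame itself, these could be applied with `df['signed_on'].map(parse_signed_on)` and `df['tags'].map(lambda t: split_labels(t, prefix='Tagged '))`.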


##### Clicking on the next page 


In [None]:
# there are 10 documents per page, so a for loop over 10 pages (range(10)) collects our 100 data points

# find the link to older entries and have the driver click on it once each page's content has been scraped
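
As an alternative to clicking through pages, the page addresses could be generated up front and loaded one at a time. The sketch below assumes the listing follows the common WordPress `/page/N/` URL pattern, which has not been verified against the site. `page_url` is a hypothetical helper.

```python
BASE_URL = "https://www.officialgazette.gov.ph/section/executive-orders/"


def page_url(base_url: str, page_number: int) -> str:
    """Build the URL of a listing page, assuming a '/page/N/' suffix (unverified)."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    if page_number == 1:
        return base_url  # the first page is the section URL itself
    return f"{base_url.rstrip('/')}/page/{page_number}/"


# range(1, 11) yields pages 1 through 10, i.e. 10 pages of 10 documents each.
urls = [page_url(BASE_URL, n) for n in range(1, 11)]
print(len(urls))
print(urls[1])
```

Each URL could then be passed to `driver.get`, which avoids depending on the position of the older-entries link in the page layout.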

# References 

Dyevre, A. (2021). Text-mining for lawyers: How machine learning techniques can advance our understanding of legal discourse. Erasmus Law Review, 14, 7. https://heinonline.org/HOL/Page?handle=hein.journals/erasmus14&id=9&div=&collection=

Ibarra, V. C., & Revilla, C. D. (2014). Consumers’ awareness on their eight basic rights: A comparative study of Filipinos in the Philippines and Guam (SSRN Scholarly Paper No. 2655817). Social Science Research Network. https://papers.ssrn.com/abstract=2655817

Peramo, E., Cheng, C., & Cordel, M. (2021). Juris2vec: Building word embeddings from Philippine jurisprudence. 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 121–125. https://doi.org/10.1109/ICAIIC51459.2021.9415251

Virtucio, M. B. L., Aborot, J. A., Abonita, J. K. C., Aviñante, R. S., Copino, R. J. B., Neverida, M. P., Osiana, V. O., Peramo, E. C., Syjuco, J. G., & Tan, G. B. A. (2018). Predicting decisions of the Philippine Supreme Court using natural language processing and machine learning. 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), 02, 130–135. https://doi.org/10.1109/COMPSAC.2018.10348