# **Parsers**

This notebook explores the various methods on how to properly parse links, text files, pdfs, and text itself from the frontend. In this notebook, we will:

- Parse links using `Trafilatura`, pdf's as raw bytes using `PyMuPDF`, and text itself as string.
- Create a preprocessing pipeline (method), based on the `model_dev.ipynb`
- Perform inference on sample data using the Random Forests Classifier with LIME

## **OBJECTIVE 1.1:** Parse links using `Trafilatura`

In [1]:
import trafilatura

url = 'https://www.gmanetwork.com/news/topstories/nation/943939/sara-duterte-hints-2028-run-isko-moreno-rally/story/'
downloaded = trafilatura.fetch_url(url)
result = trafilatura.extract(downloaded, output_format='txt')

In [2]:
import re

cleaned_text = re.sub(r"[^\w\s']", ' ', result) # Step 1: Remove non-alphanumeric characters except for the aphostrophe
cleaned_text = re.sub(r'[\t\n\r\f\v]', ' ', cleaned_text)  # Step 2: Replace tabs/newlines with spaces
cleaned_text = re.sub(r'\s+', ' ', cleaned_text)  # Step 3: Normalize multiple spaces into one

print(cleaned_text.strip())

Sara Duterte hints at 2028 run at Isko rally Vice President Sara Duterte once again teased about her supposed plans for the 2028 presidential elections during the campaign rally of Manila mayoral candidate Isko Moreno on Thursday night While delivering her speech Duterte removed the seal of the Office of the Vice President OVP from the lectern saying that she will replace it with a seal of the Office of the President This drew applause and cheers from the crowd Tinanggal ko lang kasi iba kasi 'yung opisina iba rin 'yung pangangampanya Pero kung sa pangulo lang pumili na lang kayo si Isko Moreno o si Sara Duterte wala kaming problema she said I removed the seal because my work is different from campaigning But if you will choose a president it s either me or Isko Moreno we don t have a problem with that To recall Duterte said last January that she was seriously considering to run in the 2028 presidential elections However when asked for a clarification on Thursday if she was now sure of

In [9]:
# perform inference on provided link
from joblib import load
vectorizer = load('./serialized/tfidf-vectorizer.pkl')
clf = load('./serialized/models/EnsembleSoftVoting.pkl')

In [4]:
from pathlib import Path
def load_stopwords(path: Path = Path("./stopwords-tl.txt")):
    """ Opens the tagalog stopwords file """
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f.readlines()]

In [5]:
import calamancy
import nltk
nlp = calamancy.load("tl_calamancy_md-0.2.0") # load in states in future due to slow loading time
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
tagalog_tl = load_stopwords()

  from .autonotebook import tqdm as notebook_tqdm
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess_text(txt: str, calamancy_model, tl_stopwords, join: bool=True):
    """
    Preprocesses input text by converting it to lowercase,
    removing punctuation, applying Tagalog & English stopword removal,
    and performing Tagalog & English lemmatization.

    Args:
        txt (str): Input text
        calamancy_model: calamanCy model for Tagalog lemmatization
        tl_stopwords (list): Tagalog stopword list
        join (bool): Whether to return a joined string or list of tokens
    """
    # Initialize English tools
    en_stopwords = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    # Step 1: Lowercase
    lower_text = txt.lower()

    # Step 2: Remove punctuation
    no_punc = lower_text.translate(str.maketrans('', '', string.punctuation))

    # Step 3: Tokenize
    tokenized = word_tokenize(no_punc)

    # Step 4: Remove Tagalog and English stopwords
    tokens_no_stop = [
        token for token in tokenized
        if token not in tl_stopwords and token not in en_stopwords
    ]

    # Step 5: Lemmatize with calamancy (Tagalog)
    calamancy_doc = calamancy_model(' '.join(tokens_no_stop))
    calamancy_lemmas = [token.lemma_ for token in calamancy_doc]

    # Step 6: Lemmatize again with English lemmatizer (handles English tokens better)
    final_tokens = [lemmatizer.lemmatize(token) for token in calamancy_lemmas]

    # Step 7: Return
    if join:
        return ' '.join(final_tokens)
    else:
        return final_tokens

In [8]:
preprocessed = preprocess_text(cleaned_text, nlp, tagalog_tl, True)
print(preprocessed)

sara duterte hint 2028 run isko rally vice president sara duterte teased supposed plan 2028 presidential election campaign rally manila mayoral candidate isko moreno thursday night delivering speech duterte removed seal office vice president ovp lectern saying replace seal office president drew applause cheer crowd tanggal lang kasi kasi iyon opisina din iyon kampanya pangulo lang pili lang kayo si isko moreno si sara duterte wala kami na problema said removed seal work different campaigning choose president either isko moreno problem recall duterte said last january seriously considering run 2028 presidential election however asked clarification thursday sure running presidency duterte replied siguro kilala tao siguro mag distinguish iyon na joke iyon joke tagal din sama tagal din kita sambayanan national stage naiintindihan iyon joke iyon joke people already know able distinguish im saying joke weve together long time people seeing national stage since think already understand rally 

In [10]:
input_tfidf = vectorizer.transform([preprocessed])
pred = clf.predict(input_tfidf)

In [11]:
pred # Real News-0, Fake-News-1

array([1])

In [12]:
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# Wrap vectorizer and model into one pipeline
c = make_pipeline(vectorizer, rf)

class_names = ["Real News", "Fake News"] # 0, 1

In [13]:
explainer = LimeTextExplainer(class_names=class_names)
exp = explainer.explain_instance(preprocessed, c.predict_proba, num_features=10)

In [15]:
exp.as_list()

[(np.str_('duterte'), 0.10221146761156928),
 (np.str_('senator'), 0.06921283309831457),
 (np.str_('iyon'), 0.06852191278390994),
 (np.str_('said'), -0.06073342512814338),
 (np.str_('thursday'), -0.04901291096970945),
 (np.str_('joke'), -0.04641129731860557),
 (np.str_('house'), -0.03733824620170919),
 (np.str_('gma'), 0.029988503842908923),
 (np.str_('lang'), 0.027357178085538995),
 (np.str_('pangulo'), 0.022496359780912988)]

In [22]:
exp.predict_proba.max()

np.float64(0.6095946538286927)

In [16]:
word_scores = {}
for word, score in exp.as_list():
    word_scores[str(word)] = round(score, 5)

In [17]:
word_scores

{'duterte': 0.10221,
 'senator': 0.06921,
 'iyon': 0.06852,
 'said': -0.06073,
 'thursday': -0.04901,
 'joke': -0.04641,
 'house': -0.03734,
 'gma': 0.02999,
 'lang': 0.02736,
 'pangulo': 0.0225}

In [36]:
exp.save_to_file('./figures/gma_news.html')

## ## **OBJECTIVE 1.2:** Parse PDF Files

In [37]:
import fitz  # PyMuPDF

with open('sample_file.pdf', 'rb') as f:
    pdf_bytes = f.read() # open pdf as raw bytes

# Load PDF directly from bytes
doc = fitz.open(stream=pdf_bytes, filetype='pdf')

# Extract text from each page
full_text = ""
for page in doc:
    full_text += page.get_text()

print(full_text)

PILIPINAS, NAGTAGUMPAY SA KAUNA-UNAHANG MISYON NG PAGPAPATUBO NG HALAMAN SA 
BUWAN 
Maynila, Pilipinas — Sa isang makasaysayang anunsyo ngayong araw, ipinahayag ng Philippine Space 
and Agriculture Agency (PSAA) ang matagumpay na pagpapalago ng halamang malunggay sa ibabaw 
ng buwan, isang hakbang na tinaguriang "monumental achievement" sa larangan ng agham at 
teknolohiya. 
Ayon kay Dr. Feliciano Robles, punong siyentipiko ng misyon, ang proyekto na tinawag na Project 
Gulaylaktik ay bahagi ng mas malawak na plano ng pamahalaan upang palawakin ang food 
sustainability program sa labas ng Earth. Gamit ang isang makabagong lunar greenhouse na 
inimbento sa loob ng tatlong buwan, nagawa ng mga Pilipinong siyentipiko na pasibulin ang unang 
buto ng malunggay sa mabatong kapaligiran ng buwan. 
"Ang malunggay, na kilala sa taglay nitong mataas na nutrisyon, ay napili dahil sa kakayahan nitong 
mabuhay sa mahihirap na kondisyon. Sa pamamagitan ng teknolohiyang hydroponics at lunar nutrient 


In [38]:
pdf_text = preprocess_text(full_text, nlp, stopwords)
pdf_preprocessed = ' '.join(pdf_text)

In [39]:
pdf_preprocessed

'pilipinas tagumpay kaunaunahang misyon pagpapatubo halaman buwan maynila pilipinas — saysaya na anunsyo ngayon na araw hayag philippine space and agriculture agency psaa tagumpay pagpapalago halamang malunggay buwan hakbang taguri na monumental achievement larang agham teknolohiya ayon kay dr feliciano robles punong siyentipiko misyon proyekto tawag project gulaylaktik bahagi mas lawak plano pamahalaan upang palawakin food sustainability program labas earth gamit bago na lunar greenhouse mbento loob tatlo na buwan gawa pilipino na siyentipiko sibul una na buto malunggay bato na ligid buwan malunggay kilala taglay nito na taas nutrisyon pili kaya nito na buhay hirap kondisyon teknolohiya na hydroponics lunar nutrient infusion tunay natin posible pagtatanim iba na planeta hayag dr robles press briefing tuwa buo na bansa balita agad labas hayag si pangulo na andres dela cruz sabi na tunay kaya na sabay pilipinas larang space exploration maging una na supplier malunggay lawakan samantala 

In [41]:
pdf_tfidf = vectorizer.transform([pdf_preprocessed])
pdf_pred = rf.predict(pdf_tfidf)

In [42]:
pdf_pred

array([1], dtype=int64)

In [44]:
exp = explainer.explain_instance(pdf_preprocessed, c.predict_proba, num_features=10)
exp.save_to_file('./figures/sample_file_lime.html')