# **Parsers**

This notebook explores how to parse the inputs a user can submit from the frontend: links, text files, PDFs, and raw text. In this notebook, we will:

- Parse links using `Trafilatura`, PDFs as raw bytes using `PyMuPDF`, and plain text as a string.
- Create a preprocessing pipeline (method) based on `model_dev.ipynb`.
- Perform inference on sample data using the Random Forest classifier with LIME.
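
Because the frontend can send any of these input types, the first step is to decide which parser a payload should go to. The sketch below is one possible way to do that with the standard library only. The function name `detect_input_kind` and the rule that non-PDF bytes are treated as a text file are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import urlparse


def detect_input_kind(payload):
    """Classify a frontend payload as 'pdf', 'url', or 'text' (hypothetical helper)."""
    if isinstance(payload, (bytes, bytearray)):
        # PDF files normally begin with the magic bytes "%PDF-"
        if bytes(payload[:5]) == b"%PDF-":
            return "pdf"
        # Assumption: any other uploaded bytes are a plain text file
        return "text"

    candidate = payload.strip()
    parsed = urlparse(candidate)
    # Treat a single whitespace-free http(s) string with a host as a link
    is_single_token = len(candidate.split()) == 1
    if is_single_token and parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    return "text"
```

A `'url'` result would be routed to `Trafilatura`, a `'pdf'` result to `PyMuPDF`, and everything else straight to the preprocessing step.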

## **OBJECTIVE 1.1:** Parse links using `Trafilatura`

In [1]:
import trafilatura

url = 'https://www.gmanetwork.com/news/topstories/nation/943939/sara-duterte-hints-2028-run-isko-moreno-rally/story/'
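downloaded = trafilatura.fetch_url(url)  # None if the download fails
result = trafilatura.extract(downloaded, output_format='txt')  # None if no main text is found

# Fail early with a clear message instead of passing None to re.sub below
if result is None:
    raise ValueError(f"Could not extract article text from {url}")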

In [2]:
import re

cleaned_text = re.sub(r"[^\w\s']", ' ', result)  # Step 1: Replace everything except word characters, whitespace, and the apostrophe with a space
cleaned_text = re.sub(r'[\t\n\r\f\v]', ' ', cleaned_text)  # Step 2: Replace tabs/newlines with spaces
cleaned_text = re.sub(r'\s+', ' ', cleaned_text)  # Step 3: Collapse runs of whitespace into a single space

print(cleaned_text.strip())

Sara Duterte hints at 2028 run at Isko rally Vice President Sara Duterte once again teased about her supposed plans for the 2028 presidential elections during the campaign rally of Manila mayoral candidate Isko Moreno on Thursday night While delivering her speech Duterte removed the seal of the Office of the Vice President OVP from the lectern saying that she will replace it with a seal of the Office of the President This drew applause and cheers from the crowd Tinanggal ko lang kasi iba kasi 'yung opisina iba rin 'yung pangangampanya Pero kung sa pangulo lang pumili na lang kayo si Isko Moreno o si Sara Duterte wala kaming problema she said I removed the seal because my work is different from campaigning But if you will choose a president it s either me or Isko Moreno we don t have a problem with that To recall Duterte said last January that she was seriously considering to run in the 2028 presidential elections However when asked for a clarification on Thursday if she was now sure of
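
The output above keeps the apostrophe in `'yung` but splits `it s` and `don t`. A likely cause is that the article uses typographic apostrophes (`’`), which the first regex treats as punctuation. The sketch below wraps the cleaning steps in a reusable function and normalizes curly apostrophes first. The name `clean_text` is hypothetical, and the curly-apostrophe explanation is an assumption based on the output, not something verified against the page source.

```python
import re

# Typographic quote characters that may appear instead of the ASCII apostrophe
_CURLY_APOSTROPHES = str.maketrans({"\u2018": "'", "\u2019": "'"})


def clean_text(raw: str) -> str:
    """Apply the cleaning steps from the cell above as one reusable function."""
    text = raw.translate(_CURLY_APOSTROPHES)  # map curly apostrophes to '
    text = re.sub(r"[^\w\s']", " ", text)     # drop punctuation except the apostrophe
    text = re.sub(r"\s+", " ", text)          # collapse tabs, newlines, repeated spaces
    return text.strip()
```

Since `\s` already matches tabs and newlines, the separate tab/newline replacement is folded into the final whitespace collapse here.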

In [3]:
# load the fitted TF-IDF vectorizer and Random Forest model used for inference on the link
from joblib import load
vectorizer = load('./serialized/tfidf-vectorizer.pkl')
rf = load('./serialized/models/RandomForestClassifier.pkl')

In [4]:
import calamancy
nlp = calamancy.load("tl_calamancy_md-0.2.0")  # TODO: keep the model in application state, since loading is slow

In [5]:
import string
from nltk.tokenize import word_tokenize
from pathlib import Path

# TODO: keep the stopword list in application state so it is read only once
def load_stopwords(path: Path = Path("./stopwords-tl.txt")):
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f]

def preprocess_text(txt: str, calamancy_model, tl_stopwords):
    # convert text to lowercase
    lower_text = txt.lower()

    # remove ASCII punctuation (string.punctuation does not cover Unicode marks such as the em dash or curly quotes)
    no_punc = lower_text.translate(str.maketrans('', '', string.punctuation))

    # tokenize
    tokenized = word_tokenize(no_punc)

    # remove stopwords
    tokens_no_stopword = [token for token in tokenized if token not in tl_stopwords]

    # join for NLP model
    res = ' '.join(tokens_no_stopword)

    # lemmatization using calamancy_model
    doc = calamancy_model(res)
    tokens = [token.lemma_ for token in doc]

    return tokens
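
The comments in the cells above flag the stopword list and the calamancy model as things to keep in state rather than reload. One minimal way to sketch that for the stopwords is `functools.lru_cache`, shown below. The function name `load_stopwords_cached` and the choice of a `frozenset` are assumptions for illustration; the notebook itself uses a plain list.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_stopwords_cached(path: str = "./stopwords-tl.txt") -> frozenset:
    """Read the stopword file once per path and return an immutable set."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # A frozenset gives constant-time membership tests in the stopword filter
    return frozenset(line.strip() for line in lines if line.strip())
```

The result can be passed to `preprocess_text` unchanged, because the filter only relies on `token not in tl_stopwords`. The same decorator could wrap a zero-argument function that loads the calamancy model.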

In [6]:
stopwords = load_stopwords()
tokens = preprocess_text(cleaned_text, nlp, stopwords)
preprocessed = ' '.join(tokens)

In [7]:
preprocessed

'sara duterte hint 2028 run isko rally vice president sara duterte once again teased about her supposed plans for the 2028 presidential election during the campaign rally of manila mayoral candidate isko moreno on thursday night while deliver her speech duterte removed the seal of the office of the vice president ovp from the lectern saying that she will replace it with a seal of the office of the president this drew applause and cheers from the crowd tanggal lang kasi kasi iyon opisina din iyon kampanya pangulo lang pili lang kayo si isko moreno si sara duterte wala kami na problema she said i removed the seal because my work is different from campaigning but if you will choose a president it sa either me or isko moreno we don t have a problem with that to recall duterte said last january that she was seriously considering to run in the 2028 presidential election however when asked for a clarification on thursday if she was now sure of running for presidency duterte replied siguro kil

In [8]:
input_tfidf = vectorizer.transform([preprocessed])
pred = rf.predict(input_tfidf)

In [9]:
pred  # 0 = Real News, 1 = Fake News

array([1], dtype=int64)
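
The raw array is hard to read at a glance. The sketch below maps the predicted index to its label using the 0/1 convention from the comment above. The helper `describe_prediction` and the constant `CLASS_NAMES` are hypothetical names, and the optional `proba` argument is assumed to have the shape returned by `predict_proba` (one row per sample).

```python
CLASS_NAMES = ("Real News", "Fake News")  # index 0, index 1


def describe_prediction(pred, proba=None):
    """Turn a one-sample prediction (and optional probabilities) into a readable string."""
    label = int(pred[0])
    text = CLASS_NAMES[label]
    if proba is not None:
        # proba[0] holds the class probabilities for the single sample
        text += f" (p={float(proba[0][label]):.2f})"
    return text
```

With the objects defined in this notebook, a call would look like `describe_prediction(pred, rf.predict_proba(input_tfidf))`.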

In [10]:
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

# Wrap vectorizer and model into one pipeline
c = make_pipeline(vectorizer, rf)

class_names = ["Real News", "Fake News"] # 0, 1

In [11]:
explainer = LimeTextExplainer(class_names=class_names)
exp = explainer.explain_instance(preprocessed, c.predict_proba, num_features=10)

In [13]:
exp.as_list()

[('duterte', 0.049913386411017995),
 ('rodrigo', 0.04501168477587132),
 ('the', 0.03943759079112869),
 ('kayo', 0.03437595356533199),
 ('pangulo', 0.033658622028086116),
 ('senatorial', 0.0324378127275591),
 ('iyon', 0.03024167572050168),
 ('kita', 0.029089377737817144),
 ('news', 0.027197114232640547),
 ('candidate', 0.0261643034756546)]
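
Each pair in this list is a token and its weight in LIME's local linear model. Assuming the default label for a two-class explainer (class 1, here "Fake News"), a positive weight pushes the prediction toward "Fake News" and a negative weight toward "Real News". The sketch below separates the two groups; `split_by_direction` is a hypothetical helper and the sample values are rounded from the output above.

```python
def split_by_direction(weights):
    """Split LIME (token, weight) pairs by the sign of the weight."""
    toward_fake = [(tok, w) for tok, w in weights if w > 0]
    toward_real = [(tok, w) for tok, w in weights if w < 0]
    # Strongest contributions first in each group
    toward_fake.sort(key=lambda pair: pair[1], reverse=True)
    toward_real.sort(key=lambda pair: pair[1])
    return toward_fake, toward_real


sample = [("duterte", 0.0499), ("rodrigo", 0.0450), ("the", 0.0394)]
toward_fake, toward_real = split_by_direction(sample)
```

In the notebook this would be called as `split_by_direction(exp.as_list())`.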

In [36]:
exp.save_to_file('./figures/gma_news.html')

## **OBJECTIVE 1.2:** Parse PDF Files

In [37]:
import fitz  # PyMuPDF

with open('sample_file.pdf', 'rb') as f:
    pdf_bytes = f.read() # open pdf as raw bytes

# Load the PDF directly from bytes
doc = fitz.open(stream=pdf_bytes, filetype='pdf')

# Extract and concatenate the text of every page
full_text = "".join(page.get_text() for page in doc)

print(full_text)

PILIPINAS, NAGTAGUMPAY SA KAUNA-UNAHANG MISYON NG PAGPAPATUBO NG HALAMAN SA 
BUWAN 
Maynila, Pilipinas — Sa isang makasaysayang anunsyo ngayong araw, ipinahayag ng Philippine Space 
and Agriculture Agency (PSAA) ang matagumpay na pagpapalago ng halamang malunggay sa ibabaw 
ng buwan, isang hakbang na tinaguriang "monumental achievement" sa larangan ng agham at 
teknolohiya. 
Ayon kay Dr. Feliciano Robles, punong siyentipiko ng misyon, ang proyekto na tinawag na Project 
Gulaylaktik ay bahagi ng mas malawak na plano ng pamahalaan upang palawakin ang food 
sustainability program sa labas ng Earth. Gamit ang isang makabagong lunar greenhouse na 
inimbento sa loob ng tatlong buwan, nagawa ng mga Pilipinong siyentipiko na pasibulin ang unang 
buto ng malunggay sa mabatong kapaligiran ng buwan. 
"Ang malunggay, na kilala sa taglay nitong mataas na nutrisyon, ay napili dahil sa kakayahan nitong 
mabuhay sa mahihirap na kondisyon. Sa pamamagitan ng teknolohiyang hydroponics at lunar nutrient 


In [38]:
pdf_text = preprocess_text(full_text, nlp, stopwords)
pdf_preprocessed = ' '.join(pdf_text)

In [39]:
pdf_preprocessed

'pilipinas tagumpay kaunaunahang misyon pagpapatubo halaman buwan maynila pilipinas — saysaya na anunsyo ngayon na araw hayag philippine space and agriculture agency psaa tagumpay pagpapalago halamang malunggay buwan hakbang taguri na monumental achievement larang agham teknolohiya ayon kay dr feliciano robles punong siyentipiko misyon proyekto tawag project gulaylaktik bahagi mas lawak plano pamahalaan upang palawakin food sustainability program labas earth gamit bago na lunar greenhouse mbento loob tatlo na buwan gawa pilipino na siyentipiko sibul una na buto malunggay bato na ligid buwan malunggay kilala taglay nito na taas nutrisyon pili kaya nito na buhay hirap kondisyon teknolohiya na hydroponics lunar nutrient infusion tunay natin posible pagtatanim iba na planeta hayag dr robles press briefing tuwa buo na bansa balita agad labas hayag si pangulo na andres dela cruz sabi na tunay kaya na sabay pilipinas larang space exploration maging una na supplier malunggay lawakan samantala 
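
The preprocessed PDF text still contains a dash between `pilipinas` and `saysaya`, because `string.punctuation` only lists ASCII characters. The sketch below shows one way to also delete Unicode punctuation by checking each character's Unicode category. The helper name `strip_all_punctuation` is hypothetical, and since the vectorizer was fitted on text preprocessed the original way, any such change would be an assumption to validate against the training pipeline in `model_dev.ipynb`.

```python
import string
import unicodedata

_ASCII_PUNCT = set(string.punctuation)


def strip_all_punctuation(text: str) -> str:
    """Delete ASCII punctuation plus any character in a Unicode 'P*' category."""
    return "".join(
        ch
        for ch in text
        # Categories starting with "P" cover dashes, quotes, and other punctuation
        if ch not in _ASCII_PUNCT and not unicodedata.category(ch).startswith("P")
    )
```

Characters are deleted rather than replaced with spaces, which matches the behavior of the `str.translate` call in `preprocess_text`.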

In [41]:
pdf_tfidf = vectorizer.transform([pdf_preprocessed])
pdf_pred = rf.predict(pdf_tfidf)

In [42]:
pdf_pred

array([1], dtype=int64)

In [44]:
exp = explainer.explain_instance(pdf_preprocessed, c.predict_proba, num_features=10)
exp.save_to_file('./figures/sample_file_lime.html')