<a href="https://colab.research.google.com/github/joms-hub/tagalog-fake-news-detection/blob/main/notebooks/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preprocessing

### 1. Install libraries and set up repo

In [2]:
!git clone https://github.com/joms-hub/tagalog-fake-news-detection.git
!pip install pandas transformers datasets scikit-learn torch torchvision torchaudio

Cloning into 'tagalog-fake-news-detection'...
remote: Enumerating objects: 74, done.
remote: Counting objects: 100% (74/74), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 74 (delta 37), reused 55 (delta 30), pack-reused 0 (from 0)
Receiving objects: 100% (74/74), 4.75 MiB | 16.78 MiB/s, done.
Resolving deltas: 100% (37/37), done.


### 2. Basic Data Inspection

In [3]:
import pandas as pd

df = pd.read_csv("/content/tagalog-fake-news-detection/data/full.csv")
print(df.head())
print(df['label'].value_counts())


   label                                            article
0      0  Ayon sa TheWrap.com, naghain ng kaso si Krupa,...
1      0  Kilala rin ang singer sa pagkumpas ng kanyang ...
2      0  BLANTYRE, Malawi (AP) -- Bumiyahe patungong Ma...
3      0  Kasama sa programa ang pananalangin, bulaklak ...
4      0  Linisin ang Friendship Department dahil dadala...
label
0    1603
1    1603
Name: count, dtype: int64
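
Beyond the class counts, it can help to check for missing or duplicated articles before splitting. The sketch below runs such checks on a tiny stand-in DataFrame with the same `label` and `article` columns. The toy rows and the helper name `inspect_frame` are made up for illustration; the real `df` would be passed in instead.

```python
import pandas as pd

def inspect_frame(frame: pd.DataFrame) -> dict:
    """Summarize basic quality checks for a label/article DataFrame."""
    return {
        "rows": len(frame),
        "missing_articles": int(frame["article"].isna().sum()),
        "duplicate_articles": int(frame["article"].duplicated().sum()),
        "label_counts": frame["label"].value_counts().to_dict(),
    }

# Hypothetical toy rows: one repeated article and one missing article.
toy = pd.DataFrame({
    "label": [0, 0, 1, 1],
    "article": ["unang balita", "unang balita", None, "ikaapat na balita"],
})

report = inspect_frame(toy)
print(report)
```

Calling `inspect_frame(df)` on the full dataset would report the same fields for the real data.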


### 3. Train/Validation/Test Split (70/15/15)

In [4]:
from sklearn.model_selection import train_test_split

# First split (70% train, 30% temp)
train, temp = train_test_split(
    df, test_size=0.30, stratify=df['label'], random_state=42
)

# Second split (50/50 of temp → 15% val, 15% test)
val, test = train_test_split(
    temp, test_size=0.50, stratify=temp['label'], random_state=42
)

print("Train size:", len(train))
print("Validation size:", len(val))
print("Test size:", len(test))

Train size: 2244
Validation size: 481
Test size: 481
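
The split sizes above can be reproduced by hand. The sketch below assumes that scikit-learn, given a float `test_size`, rounds the test share up and assigns the remainder to the other side; that is my understanding of its behavior, not something stated in this notebook. The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows: int, test_frac: float) -> tuple[int, int]:
    """Return (n_first, n_second), rounding the second share up (assumed rule)."""
    n_second = math.ceil(test_frac * n_rows)
    return n_rows - n_second, n_second

n_total = 1603 + 1603                          # class counts from the inspection step
n_train, n_temp = split_sizes(n_total, 0.30)   # 70% train, 30% temp
n_val, n_test = split_sizes(n_temp, 0.50)      # temp halved into val and test

print("Train size:", n_train)
print("Validation size:", n_val)
print("Test size:", n_test)
```

Under that assumption the arithmetic gives 2244, 481, and 481 rows, matching the printed sizes.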


### 4. Tokenizer Setup

In [5]:
# Set up HuggingFace Datasets

from datasets import Dataset

train_ds = Dataset.from_pandas(train.reset_index(drop=True))
val_ds = Dataset.from_pandas(val.reset_index(drop=True))
test_ds = Dataset.from_pandas(test.reset_index(drop=True))

# Define Models + Tokenizers

from transformers import AutoTokenizer

model_names = {
    "TinyBERT": "huawei-noah/TinyBERT_General_4L_312D",
    "DistilBERT": "distilbert-base-multilingual-cased",
    "MobileBERT": "google/mobilebert-uncased",
    "MiniLMv2": "nreimers/MiniLMv2-L6-H384-distilled-from-BERT-base",
    "ELECTRA-small": "google/electra-small-discriminator"
}

# Load tokenizers
tokenizers = {name: AutoTokenizer.from_pretrained(path) for name, path in model_names.items()}


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
# Encoding function

def encode(batch, tokenizer):
    """Tokenize a batch of articles, truncating or padding each one to exactly 512 tokens."""
    return tokenizer(
        batch['article'],
        truncation=True,
        padding='max_length',
        max_length=512
    )
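
To illustrate what `truncation=True` combined with `padding='max_length'` does to each article, here is a toy stand-in written in plain Python. It is not the Hugging Face implementation: a real tokenizer also manages special tokens and uses its own padding id, and the function name `pad_or_truncate` and the `pad_id=0` default are assumptions made for this sketch.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy model of fixed-length encoding: cut long inputs, pad short ones."""
    kept = token_ids[:max_length]                   # truncation
    n_pad = max_length - len(kept)                  # how many pad slots remain
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 0 marks padding
    }

short = pad_or_truncate([101, 7592, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

Every output has exactly `max_length` entries, which is why all saved examples share one shape. The trade-off is that short articles carry many padding positions.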


In [7]:
# Loop through models and save

import os

out_dir = "/content/tagalog-fake-news-detection/tokenized"
os.makedirs(out_dir, exist_ok=True)

for name, tok in tokenizers.items():
    print(f"Tokenizing for {name}...")

    train_enc = train_ds.map(lambda b: encode(b, tok), batched=True)
    val_enc   = val_ds.map(lambda b: encode(b, tok), batched=True)
    test_enc  = test_ds.map(lambda b: encode(b, tok), batched=True)

    # Save HuggingFace dataset objects to disk
    train_enc.save_to_disk(f"{out_dir}/{name}_train")
    val_enc.save_to_disk(f"{out_dir}/{name}_val")
    test_enc.save_to_disk(f"{out_dir}/{name}_test")


Tokenizing for TinyBERT...
Tokenizing for DistilBERT...
Tokenizing for MobileBERT...
Tokenizing for MiniLMv2...
Tokenizing for ELECTRA-small...
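
A later notebook can reload the saved splits with `datasets.load_from_disk`. The sketch below rebuilds the `{out_dir}/{name}_{split}` paths used by the save loop. The helpers `tokenized_path` and `load_tokenized` are hypothetical names, not part of the repository, and `datasets` is imported lazily so the path helper works even where the library is not installed.

```python
SPLITS = ("train", "val", "test")

def tokenized_path(out_dir: str, model_name: str, split: str) -> str:
    """Build the directory path written by the save loop."""
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return f"{out_dir}/{model_name}_{split}"

def load_tokenized(out_dir: str, model_name: str) -> dict:
    """Load the three saved splits for one model (hypothetical helper)."""
    from datasets import load_from_disk  # requires the `datasets` package
    return {s: load_from_disk(tokenized_path(out_dir, model_name, s)) for s in SPLITS}

example = tokenized_path(
    "/content/tagalog-fake-news-detection/tokenized", "TinyBERT", "val"
)
print(example)
```

With the directories in place, `load_tokenized(out_dir, "TinyBERT")` would return a dict keyed by `train`, `val`, and `test`.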

### 5. Creating a small sample for documentation

In [8]:
sample = train.head(20)   # first 20 rows of the shuffled training split
sample.to_csv("/content/tagalog-fake-news-detection/data/fake_news_sample.csv", index=False)
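
Because `train_test_split` shuffles by default, `head(20)` returns an arbitrary slice that is not guaranteed to be class-balanced. If a balanced documentation sample is wanted, one option is to draw the same number of rows per label. This is an alternative sketch on toy data, not what the notebook does; `balanced_sample` and the toy rows are hypothetical.

```python
import pandas as pd

def balanced_sample(frame: pd.DataFrame, per_class: int, seed: int = 42) -> pd.DataFrame:
    """Draw `per_class` random rows from each label value."""
    return (
        frame.groupby("label")
        .sample(n=per_class, random_state=seed)
        .reset_index(drop=True)
    )

# Hypothetical stand-in for the training split: 30 rows per class.
toy = pd.DataFrame({
    "label": [0] * 30 + [1] * 30,
    "article": [f"artikulo {i}" for i in range(60)],
})

toy_sample = balanced_sample(toy, per_class=10)
print(toy_sample["label"].value_counts().to_dict())
```

Applied to the real data, `balanced_sample(train, per_class=10)` would give a 20-row sample with ten articles per label.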