# Introduction

This notebook presents the processes and tasks carried out in Python. Its objectives are:
* To use a dataset of politics-related tweets written in Filipino to train the models;
* To train the three (3) selected models and apply hyperparameter tuning and optimization using random search and experimentation (*this was done in separate, duplicate notebooks*);
* To compare the performance of the transformer-based models against the baseline model, Multinomial Naïve Bayes, and identify the best-performing model.

The notebook is split into the following subsections:
1. *Installation and Imports* - This section installs the dependencies and imports the libraries needed to run the notebook, the models, and the techniques used.
2. *Environment Setup and Data Loading* - This section uploads the dataset and loads it into a pandas DataFrame in preparation for modeling.
3. *Train-Validation-Test Splitting* - This section divides the dataset into three parts: training, validation, and testing.
4. *Converting DataFrames into Dataset Format* - This section converts the processed pandas DataFrames into Hugging Face Dataset format.
5. *Setting Up Evaluation Metrics* - This section prepares the evaluation metrics that will be used to measure the model's performance during training and validation.
6. *Training Function for Transformer Models* - This section defines a function that automates the entire training process for a transformer-based text classification model.
7. *Defining the Model Choices* - This section creates a dictionary that stores the names and model IDs of the Transformer models that will be used for experimentation.
8. *Best Hyperparameters per Model* - This section defines the best hyperparameters found for each transformer model in the previous experiments.
9. *Running the Models with Their Best Hyperparameters* - This section runs the training process for all the selected models (RoBERTa, ELECTRA, and BERT-WWM) using the best hyperparameters.
10. *Multinomial Naive Bayes Baseline Model* - This section trains a traditional baseline using CountVectorizer and Multinomial Naive Bayes.
11. *Model Performance Comparison* - This section prints and summarizes the evaluation results of all models.
________________________________________________________________________________

**Dataset Information**

The dataset, obtained from Hugging Face as a secondary source, contains data collected by PhD student Jan Christian Blaise Cruz. The data consists of social media content (specifically tweets) from the 2016 presidential election in the Philippines. The dataset has two columns: one for the extracted text and one for a label indicating whether the text contains hate speech (1) or not (0). According to Cruz, the dataset was released with 4,232 samples each for validation and testing and about 10k samples for the training split.


# Installation and Imports

This section sets up the environment by first installing dependencies and then importing the libraries that will be used for dataset handling, model training, and evaluation.

The code cell below installs the required libraries. It uses the *pip* shell command to install *transformers*, *datasets*, *evaluate*, *scikit-learn*, and *accelerate*, which are used for loading NLP models and datasets, computing metrics, training the baseline model, and running the training loop. No *-U* flag is passed, so packages that are already installed in the environment are kept at their current versions.

In [None]:
!pip install transformers datasets evaluate scikit-learn accelerate



The code cell below imports the libraries needed for handling the data, defining the models, configuring training, and evaluating performance. The following libraries are imported:

1. **pandas as pd** – Used for reading and manipulating tabular data, such as CSV files containing text and labels.
2. **Dataset (from Hugging Face Datasets)** – Converts pandas DataFrames into a dataset format that is ready for tokenization and model training.
3. **CountVectorizer** – A tool from scikit-learn that transforms text into a bag-of-words numerical representation for traditional machine learning models.
4. **MultinomialNB** – The Naive Bayes classifier used for text classification, especially effective for word-frequency data.
5. **train_test_split** – Splits the dataset into training and testing sets to evaluate model performance.
6. **accuracy_score, f1_score** – Performance metrics that measure how well the model classifies the texts.
7. **AutoConfig, AutoTokenizer, AutoModelForSequenceClassification** – Hugging Face tools used to load the model configuration, tokenizer, and pretrained transformer model for sequence classification tasks.
8. **TrainingArguments, Trainer** – Hugging Face utilities that simplify the training loop by handling optimization, logging, evaluation, and saving of model checkpoints.
9. **evaluate** – A library that provides ready-to-use evaluation modules for computing metrics during or after training.
10. **google.colab.files** – Enables uploading files (like CSV datasets) directly from the local machine into the Google Colab environment.



In [None]:
import pandas as pd
from datasets import Dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate
from google.colab import files

# Environment Setup and Data Loading

This section uploads the dataset to the Colab environment and loads it into a pandas DataFrame, keeping only the columns needed for modeling. Splitting the data and converting it into a model-ready format follow in the next sections.

The code cell below uploads the dataset. In this case, the dataset is in CSV format and has the file name *hate_speech.csv*.

To upload the file from the local machine to the Colab environment, the cell calls *files.upload()* and stores the result in the *uploaded* variable. It then reads the CSV file with *pd.read_csv()* from pandas and stores it in the *df* variable. Finally, it keeps only the relevant columns, *text* and *label*. Since the dataset contains only these two columns, all of its columns are kept.

In [None]:
uploaded = files.upload()

df = pd.read_csv("hate_speech.csv", encoding="latin-1")
df = df[["text", "label"]]

Saving hate_speech.csv to hate_speech (2).csv
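
The output above shows Colab saving the upload under a new name (`hate_speech (2).csv`) because a file named `hate_speech.csv` already existed, so `pd.read_csv("hate_speech.csv")` reads the earlier copy. One possible way to avoid depending on the saved file name is to parse the uploaded bytes directly. The sketch below is only an illustration: it assumes `uploaded` maps a file name to raw bytes, as `files.upload()` returns, and uses a made-up stand-in dictionary (with invented rows and the hypothetical name `df_demo`) so that it runs outside Colab.

```python
import io
import pandas as pd

# Hypothetical stand-in for the dict returned by files.upload():
# file name -> raw bytes. The rows below are invented for illustration.
uploaded = {
    "hate_speech.csv": (
        "text,label,extra\n"
        "sample tweet one,1,a\n"
        "sample tweet two,0,b\n"
    ).encode("latin-1")
}

# Take the first (and only) uploaded file, whatever name it was saved under.
file_name, raw_bytes = next(iter(uploaded.items()))

# Parse the bytes in memory with the same encoding used in the notebook.
df_demo = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

# Keep only the two columns the models need.
df_demo = df_demo[["text", "label"]]

print(df_demo.shape)
```

Reading from memory means the DataFrame always reflects the file that was just uploaded, regardless of how many older copies sit in the working directory.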


# Train-Validation-Test Splitting

This section divides the dataset into three parts: **training, validation, and final testing.** Splitting the data helps ensure that the model is trained properly, tuned fairly, and evaluated on data it has never seen before.

The first part of the code performs the initial split between the training set and a temporary set. The temporary set will later be divided into validation and test data.

The first split sets aside 70% of the dataset for training; the remaining 30% forms the temporary set.

The parameter **stratify=df['label']** ensures that each split keeps the same label proportions as the full dataset. This helps avoid issues where one subset accidentally contains more samples of a certain label.

The use of **random_state=42** ensures that the splitting process is reproducible.

The second part of the code further splits the temporary set into validation and test datasets, with 50% going to each. Since the temporary set represents 30% of the data, this results in 15% validation and 15% test data overall.

Lastly, the indices of all resulting datasets are reset. This removes the old row numbers and replaces them with new continuous indices, which keeps each dataset clean and easier to handle during training and evaluation.

In [None]:
# First split: train vs temp (val + test)
train_df, temp_df = train_test_split(
    df, test_size=0.3, stratify=df['label'], random_state=42
)

# Second split: validation vs test
eval_df, unseen_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df['label'], random_state=42
)

# Reset indices
train_df = train_df.reset_index(drop=True)
eval_df = eval_df.reset_index(drop=True)
unseen_df = unseen_df.reset_index(drop=True)
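
As a quick sanity check of the 70/15/15 arithmetic and of what stratification does, the same two-step split can be run on synthetic data. This is a minimal sketch, not part of the original pipeline: the 1,000-row DataFrame, its 30% positive share, and the `demo_*` names are all invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 30% labeled 1 (invented for illustration).
demo_df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(1000)],
    "label": [1 if i % 10 < 3 else 0 for i in range(1000)],
})

# Same two-step split as above: 70% train, then the remaining 30% split in half.
demo_train, demo_temp = train_test_split(
    demo_df, test_size=0.3, stratify=demo_df["label"], random_state=42
)
demo_eval, demo_unseen = train_test_split(
    demo_temp, test_size=0.5, stratify=demo_temp["label"], random_state=42
)

# Report the size and positive-label share of each subset.
for name, part in [("train", demo_train), ("eval", demo_eval), ("unseen", demo_unseen)]:
    print(f"{name}: {len(part)} rows, positive share = {part['label'].mean():.3f}")
```

Each subset should show roughly the same positive share as the full data, which is the property `stratify` is meant to provide.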

# Converting DataFrames into Dataset Format
This section converts the processed pandas DataFrames into the Hugging Face Dataset format. The Dataset class is required for efficient preprocessing, tokenization, and training with transformer models.

This block of code takes each of the previously created DataFrames (train_df, eval_df, and unseen_df) and converts them into Hugging Face Dataset objects using Dataset.from_pandas(). This format is more suitable for machine learning pipelines because it integrates smoothly with Hugging Face tokenizers, trainers, and transformer models. The train_dataset holds the training data that the model will learn from, while the eval_dataset contains the validation data used to monitor the model’s performance during training. Meanwhile, the unseen_dataset is reserved for the final evaluation, ensuring that the model is tested on data it has never encountered before. By converting the data into Dataset format, the subsequent training process becomes more efficient, streamlined, and fully compatible with Hugging Face tools.

In [None]:
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(eval_df)
unseen_dataset = Dataset.from_pandas(unseen_df)

# Setting Up Evaluation Metrics

This section prepares the evaluation metrics that will be used to measure the model's performance during training and validation. The model is evaluated using **accuracy and F1-score**. These metrics are commonly used in classification tasks.

The code loads the accuracy and F1-score metric modules from the evaluate library, which provide built-in functions for computing performance metrics after the model makes predictions. It then defines a compute_metrics function that the Trainer will use during evaluation. The eval_pred input contains both the model’s output logits and the true labels. Inside the function, logits.argmax(axis=-1) selects the class with the highest score, converting the logits into predicted labels. These predictions (preds) and the true labels (labels) are then passed into the metric functions. Finally, the function returns a dictionary containing both the accuracy and F1-score, which will appear in the training logs and help track the model’s performance.

In [None]:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1": f1.compute(predictions=preds, references=labels)["f1"]
    }
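
To make the logits-to-metrics step concrete, the sketch below reproduces it by hand with NumPy on four invented examples. It assumes the F1 metric uses binary averaging with label 1 as the positive class (the default when no averaging argument is passed, as in the cell above); the logits, labels, and `*_demo` names are illustrative only.

```python
import numpy as np

# Invented logits for four examples and two classes (columns: class 0, class 1).
logits = np.array([
    [2.0, 1.0],
    [0.0, 3.0],
    [1.0, 0.5],
    [0.2, 0.9],
])
labels = np.array([0, 1, 1, 1])

# argmax over the last axis picks the highest-scoring class per row.
preds = logits.argmax(axis=-1)

# Accuracy: share of predictions that match the true labels.
accuracy_demo = float((preds == labels).mean())

# Binary F1 for the positive class (label 1), from the confusion counts.
tp = int(((preds == 1) & (labels == 1)).sum())
fp = int(((preds == 1) & (labels == 0)).sum())
fn = int(((preds == 0) & (labels == 1)).sum())
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_demo = 2 * precision * recall / (precision + recall)

print(preds.tolist(), accuracy_demo, round(f1_demo, 4))
```

Here three of four predictions are correct (accuracy 0.75), while the one missed positive lowers recall to 2/3 and gives an F1 of 0.8.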

# Training Function for Transformer Models
This section defines a function that automates the entire training process for a transformer-based text classification model. The function takes several hyperparameters as inputs, such as learning rate, batch size, weight decay, and warmup steps. These settings allow experimentation and optimization during the training process.

The function begins by loading the tokenizer for the specified model. The tokenizer converts raw text data into tokens that the transformer model can understand.

Inside the function, another function called **tokenize()** is created. It applies padding and truncation so that all inputs have the same length of 256 tokens. The training and evaluation datasets are then tokenized using the **.map()** function.

Then, the model is loaded with predefined dropout values to help reduce overfitting.

The training configuration is then created using the TrainingArguments class. This includes all the hyperparameters that control the behavior of the training loop.

Key parameters include:
1. **learning_rate** - controls how fast the model updates its weights.
2. **train_batch_size / eval_batch_size** - number of samples processed per step during training and evaluation.
3. **weight_decay** - regularization to prevent overfitting.
4. **warmup_steps** - number of steps before the full learning rate is used.
5. **fp16** - allows faster training using mixed precision.

The trainer is initialized using the model, arguments, datasets, tokenizer, and metric function. Finally, the model is trained and evaluated. The function returns the evaluation results, which include accuracy and F1-score.

In [None]:
def train_transformer(model_name, learning_rate, train_batch_size, eval_batch_size, logging_steps, warmup_steps, weight_decay, fp16):
    # Tokenizer and tokenization
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=256)

    tokenized_train = train_dataset.map(tokenize, batched=True)
    tokenized_eval = eval_dataset.map(tokenize, batched=True)

    # Load model with dropout overrides directly
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1
    )

    # Training arguments with dynamic hyperparameters
    training_args = TrainingArguments(
        output_dir=f"./results/{model_name.split('/')[-1]}",
        eval_strategy="epoch",
        save_strategy="no",
        learning_rate=learning_rate,
        per_device_train_batch_size=int(train_batch_size),
        per_device_eval_batch_size=int(eval_batch_size),
        num_train_epochs=3,
        weight_decay=weight_decay,
        warmup_steps=int(warmup_steps),
        logging_steps=int(logging_steps),
        fp16=fp16,
        logging_dir="./logs",
        load_best_model_at_end=False
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics
    )

    trainer.train()
    return trainer.evaluate()
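
To see how `warmup_steps` interacts with the length of training, the sketch below estimates the number of optimizer steps and the learning rate at a few points for the RoBERTa settings listed later in this notebook. It is an approximation under stated assumptions: a single device, no gradient accumulation, the last partial batch kept, and a linear warmup followed by linear decay (the Trainer's default schedule type). The helper `linear_warmup_lr` is a hypothetical name written for this illustration, not a library function.

```python
import math

def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# RoBERTa settings from this notebook, after the int() casts in train_transformer.
n_train = 9941          # training examples shown in the Map progress output
batch_size = 19         # int(19.86...)
warmup_steps = 487      # int(487.21...)
base_lr = 2.9553546138231474e-05
epochs = 3

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

for step in (0, 100, warmup_steps, 1000, total_steps):
    print(step, f"{linear_warmup_lr(step, base_lr, warmup_steps, total_steps):.2e}")
```

Under these assumptions there are 524 steps per epoch and 1,572 in total, so a warmup of 487 steps covers most of the first epoch before the learning rate reaches its peak.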

# Defining the Model Choices

This section creates a dictionary that stores the names and model IDs of the Transformer models that will be used for experimentation. These model IDs come from Hugging Face and point to pre-trained Filipino language models.

Each key in the dictionary represents a readable model label, while each value contains the exact model identifier used when loading the model through Hugging Face. This makes it easier to loop through multiple models and compare their performance during training and evaluation.

The three models are:

1. RoBERTa Tagalog Base
2. ELECTRA Tagalog Base
3. BERT Tagalog Base with WWM



In [None]:
# Define model IDs
models = {
    "RoBERTa": "jcblaise/roberta-tagalog-base",
    "ELECTRA": "jcblaise/electra-tagalog-base-cased-discriminator",
    "BERT-WWM": "jcblaise/bert-tagalog-base-cased-wwm"
}

# Best Hyperparameters per Model

This section defines the best hyperparameters found for each transformer model in the previous experiments. Each model has its own set of hyperparameters that gave the best performance during training.

The training logs for each transformer model were generated separately using three different Google Colab notebooks. Running each model in its own environment ensures that the logs remain clean, organized, and independent from one another.

After training, the results from all experiments were manually recorded in an Excel file. This allows for easy comparison across models and provides a clear reference for identifying which model and hyperparameter configuration performed best.

Hyperparameters included:

* learning_rate - how quickly the model updates its weights.
* train_batch_size / eval_batch_size - number of samples processed per step.
* logging_steps - how often training updates are logged.
* warmup_steps - initial steps before reaching the full learning rate.
* weight_decay - regularization strength to prevent overfitting.
* fp16 - allows faster training using mixed precision.

These hyperparameters were the best performing among the trials for each model: they produced the highest evaluation scores during tuning.

In [None]:
# Define dynamic hyperparameters per model (can be customized)
hyperparams = {
    "RoBERTa": {
        "learning_rate": 2.9553546138231474e-05,
        "train_batch_size": 19.86254256629313,
        "eval_batch_size": 25.63606857986822,
        "logging_steps": 116.20903404186195,
        "warmup_steps": 487.21661881260644,
        "weight_decay": 0.03282719334939759,
        "fp16": True
    },
    "ELECTRA": {
        "learning_rate": 3.635707276668329e-05,
        "train_batch_size": 21.13636898944459,
        "eval_batch_size": 16.06928900489984,
        "logging_steps": 125.44853210758701,
        "warmup_steps": 489.35806425112776,
        "weight_decay": 0.0022243522902821063,
        "fp16": True
    },
    "BERT-WWM": {
        "learning_rate": 3.119814329835857e-05,
        "train_batch_size": 22.216040575545613,
        "eval_batch_size": 25.54089700860461,
        "logging_steps": 93.8923411579136,
        "warmup_steps": 415.64319475650296,
        "weight_decay": 0.0018982749277018096,
        "fp16": True
    }
}
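
The step-like values above (batch sizes, logging steps, warmup steps) are stored as floats, presumably because they were sampled from continuous ranges during random search, and `train_transformer` converts them with `int()`. Note that `int()` truncates rather than rounds. The short sketch below illustrates the difference using two values copied from the RoBERTa entry; the names `raw_params` and `effective` are introduced only for this example.

```python
# Two values copied from the RoBERTa entry above.
raw_params = {
    "train_batch_size": 19.86254256629313,
    "eval_batch_size": 25.63606857986822,
}

# Compare truncation (what int() does) with rounding to the nearest integer.
for name, value in raw_params.items():
    print(f"{name}: sampled {value:.2f} -> int() gives {int(value)}, round() gives {round(value)}")

# Values actually passed to TrainingArguments after the int() cast.
effective = {name: int(value) for name, value in raw_params.items()}
print(effective)
```

So the RoBERTa run uses a training batch size of 19 and an evaluation batch size of 25, not 20 and 26.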


# Running the Models with Their Best Hyperparameters

This section runs the training process for all the selected models (RoBERTa, ELECTRA, and BERT-WWM) using the best hyperparameters. By looping through each model, the training process becomes automated and consistent across all experiments.

First, a dictionary named results is created to store the evaluation output for each model. The loop then goes through all the entries in the models dictionary.

For each model, a message is printed indicating which model is currently being trained. The loop then loads that model's best hyperparameters and calls the train_transformer function with the correct model ID and hyperparameter values.

Lastly, the evaluation results, such as accuracy and F1-score, are saved. This ensures that each model is trained using its best configuration from the previous experiments, which makes the performance comparison fair and systematic.

In [None]:
results = {}
for label, model_id in models.items():
    print(f"\n🔧 Training {label}...")
    params = hyperparams[label]
    results[label] = train_transformer(
        model_name=model_id,
        learning_rate=params["learning_rate"],
        train_batch_size=params["train_batch_size"],
        eval_batch_size=params["eval_batch_size"],
        logging_steps=params["logging_steps"],
        warmup_steps=params["warmup_steps"],
        weight_decay=params["weight_decay"],
        fp16=params["fp16"]
    )


🔧 Training RoBERTa...


Map:   0%|          | 0/9941 [00:00<?, ? examples/s]

Map:   0%|          | 0/2130 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at jcblaise/roberta-tagalog-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.524,0.521753,0.729577,0.760399
2,0.392,0.453625,0.793897,0.779064
3,0.1737,0.701576,0.778404,0.770204



🔧 Training ELECTRA...


Map:   0%|          | 0/9941 [00:00<?, ? examples/s]

Map:   0%|          | 0/2130 [00:00<?, ? examples/s]

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at jcblaise/electra-tagalog-base-cased-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.5984,0.581882,0.681221,0.718841
2,0.4838,0.538294,0.735681,0.745594
3,0.2544,0.695076,0.7277,0.718992



🔧 Training BERT-WWM...


Map:   0%|          | 0/9941 [00:00<?, ? examples/s]

Map:   0%|          | 0/2130 [00:00<?, ? examples/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-wwm and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.5914,0.555015,0.710798,0.725979
2,0.4741,0.52541,0.743662,0.732353
3,0.2691,0.643428,0.735211,0.722441
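
Because the training function sets `save_strategy="no"` and `load_best_model_at_end=False`, the scores returned by `trainer.evaluate()` describe the model as it stands after the third epoch. The sketch below, a small illustration using only the per-epoch F1 values copied from the logs above, finds the epoch with the highest validation F1 for each model; the variable names are introduced here for the example.

```python
# Per-epoch validation F1 copied from the training logs above (epochs 1, 2, 3).
f1_by_epoch = {
    "RoBERTa": [0.760399, 0.779064, 0.770204],
    "ELECTRA": [0.718841, 0.745594, 0.718992],
    "BERT-WWM": [0.725979, 0.732353, 0.722441],
}

best_epoch = {}
for name, scores in f1_by_epoch.items():
    # Index of the highest score, shifted by one because epochs are counted from 1.
    best_epoch[name] = max(range(len(scores)), key=scores.__getitem__) + 1
    print(
        f"{name}: best F1 = {max(scores):.4f} at epoch {best_epoch[name]}, "
        f"final epoch F1 = {scores[-1]:.4f}"
    )
```

In these logs every model peaks at epoch 2 and its validation loss rises at epoch 3, so the reported final scores are slightly below each model's best epoch.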


# Multinomial Naive Bayes Baseline Model

This section implements a traditional machine learning baseline using CountVectorizer and Multinomial Naive Bayes. The purpose of this baseline is to compare the performance of traditional NLP methods with the transformer-based models trained earlier.

The CountVectorizer converts the text into numerical features that the model can process. With binary=True, each word is represented as either present (1) or absent (0), regardless of how many times it appears. The ngram_range=(1,2) setting allows the model to capture both unigrams and bigrams, helping it learn short word combinations, while stop_words='english' removes common English stopwords to reduce noise in the data.

After configuring the vectorizer, the text is transformed for both the training and evaluation sets. The fit_transform() method learns the vocabulary from the training data and converts it into numerical vectors, while transform() applies this same vocabulary to the evaluation set to ensure consistency. The corresponding labels are then stored in y_train and y_eval for model training and evaluation.

The Multinomial Naive Bayes model is then created and trained. This model is commonly used for text classification because it works well with high-dimensional word-frequency data. After training, predictions are generated on the evaluation set. Finally, the accuracy and F1-score of the Naive Bayes model are computed and stored in the results dictionary. This allows the baseline model to be directly compared with the transformer models trained earlier. Including a traditional model helps determine whether the advanced models truly outperform simpler approaches.

In [None]:
vectorizer = CountVectorizer(binary=True, ngram_range=(1,2), stop_words='english')
X_train = vectorizer.fit_transform(train_df['text'])
y_train = train_df['label']
X_eval = vectorizer.transform(eval_df['text'])
y_eval = eval_df['label']

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
y_pred = nb_model.predict(X_eval)

results["MultinomialNB"] = {
    "accuracy": accuracy_score(y_eval, y_pred),
    "f1": f1_score(y_eval, y_pred)
}

# Model Performance Comparison

This section prints and summarizes the evaluation results of all models, including both the transformer-based models and the Multinomial Naive Bayes baseline. All scores are computed on the validation set (eval_df); for the transformer models they are taken after the final training epoch.

RoBERTa delivered the strongest performance, with an accuracy of 0.7784 (78%) and an F1-score of 0.7702 (77%). This indicates that RoBERTa handles the patterns in the Tagalog hate speech dataset most effectively among all tested models.

The Multinomial Naive Bayes model also performed well despite being a simple baseline. With 73% accuracy and 73% F1-score (0.7329 and 0.7317), it scored within a fraction of a percentage point of ELECTRA and BERT-WWM on accuracy and above both on F1-score, showing that basic text features still carry strong predictive power in this task.

ELECTRA and BERT-WWM showed moderate performance and did not surpass RoBERTa. ELECTRA achieved about 73% accuracy and 72% F1-score (0.7277 and 0.7190), while BERT-WWM was slightly higher at 74% accuracy and 72% F1-score (0.7352 and 0.7224).

Overall, the results show that RoBERTa is the most effective model for this classification task, while Naive Bayes is a surprisingly strong baseline: its F1-score is higher than those of ELECTRA and BERT-WWM, and its accuracy falls between the two.

In [None]:
print("\nFinal Model Comparison:")
for model, metrics in results.items():
    # Normalize keys for Hugging Face vs scikit-learn outputs
    acc = metrics.get("eval_accuracy", metrics.get("accuracy"))
    # Use f1_val (not f1) so the evaluate metric loaded earlier is not overwritten
    f1_val = metrics.get("eval_f1", metrics.get("f1"))

    if acc is not None and f1_val is not None:
        print(f"{model}: Accuracy = {acc:.4f}, F1 Score = {f1_val:.4f}")
    else:
        print(f"{model}: ❌ Metrics unavailable or improperly formatted.")


Final Model Comparison:
RoBERTa: Accuracy = 0.7784, F1 Score = 0.7702
ELECTRA: Accuracy = 0.7277, F1 Score = 0.7190
BERT-WWM: Accuracy = 0.7352, F1 Score = 0.7224
MultinomialNB: Accuracy = 0.7329, F1 Score = 0.7317
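
The comparison can also be expressed as a ranking. The sketch below is a small illustration that sorts the four models by F1-score using only the numbers printed above; the dictionary `final_scores` and the other names are introduced here for the example and are not part of the original notebook.

```python
# Final validation scores copied from the comparison output above.
final_scores = {
    "RoBERTa": {"accuracy": 0.7784, "f1": 0.7702},
    "ELECTRA": {"accuracy": 0.7277, "f1": 0.7190},
    "BERT-WWM": {"accuracy": 0.7352, "f1": 0.7224},
    "MultinomialNB": {"accuracy": 0.7329, "f1": 0.7317},
}

# Sort model names from highest to lowest F1.
ranking = sorted(final_scores, key=lambda name: final_scores[name]["f1"], reverse=True)

best_f1 = final_scores[ranking[0]]["f1"]
for name in ranking:
    gap = best_f1 - final_scores[name]["f1"]
    print(f"{name}: F1 = {final_scores[name]['f1']:.4f} ({gap:.4f} behind the best)")
```

Ranked by F1-score, RoBERTa comes first, followed by Multinomial Naive Bayes, BERT-WWM, and ELECTRA.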
