# Introduction

This notebook presents the processes and tasks carried out in Python. Its objectives are:
* To use a dataset of politics-related tweets written in Filipino to train the models;
* To train the three (3) selected models and apply hyperparameter tuning and optimization using random search and experimentation (*this was done in separate, duplicate notebooks*);
* To compare the performance of the transformer-based models against the baseline model, Multinomial Naïve Bayes, and find the best-performing model.

The notebook is split into the following subsections:
1. *Installation and Imports* - This section installs the dependencies and imports the libraries needed to run the notebook, the models, the data, and the techniques used.
2. *Environment Setup and Data Loading* - This section loads the dataset, splits it into training and evaluation sets, and prepares it for modeling.
3. *Tokenization* - This section tokenizes the data by loading the tokenizer and defining a tokenization function.
4. *Model and Metrics* - This section defines the functions for model initialization and for computing metrics.
5. *Training Arguments and Trainer Setup* - This section defines the training arguments and the hyperparameter value ranges that are searched for the best set of values.
6. *Traditional Model Setup and Training* - This section handles the setup of the Multinomial Naïve Bayes model to be compared to the BERT models.
________________________________________________________________________________

**Dataset Information**

The dataset, obtained from Hugging Face as a secondary source, contains data collected by PhD student Jan Christian Blaise Cruz. The data consists of social media content (specifically tweets) from the 2016 presidential election in the Philippines. The dataset has 2 columns: one for the extracted text and one for a label indicating whether the text contains hate speech (1) or not (0). According to Cruz, the dataset was released with 4232 samples each for validation and testing, and about 10k rows/samples for the training split.


# Installation and Imports

This section sets up the environment by first installing the dependencies and then importing the libraries used for model training, dataset handling, and hyperparameter tuning.

The code cell below installs the latest versions of the required libraries. It uses the *pip* shell command to install *transformers*, *datasets*, *accelerate*, *ray[tune]*, and *optuna*, which are used for NLP models, loading datasets, and performing hyperparameter tuning. The *-U* (upgrade) flag at the end of the line tells *pip* to upgrade each listed package to the newest available version.

In [None]:
# Install dependencies
!pip install transformers datasets accelerate ray[tune] optuna -U

Collecting datasets
  Downloading datasets-4.4.1-py3-none-any.whl.metadata (19 kB)
Collecting optuna
  Downloading optuna-4.6.0-py3-none-any.whl.metadata (17 kB)
Collecting ray[tune]
  Downloading ray-2.51.1-cp312-cp312-manylinux2014_x86_64.whl.metadata (21 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Downloading pyarrow-22.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.2 kB)
Collecting click!=8.3.0,>=7.0 (from ray[tune])
  Downloading click-8.3.1-py3-none-any.whl.metadata (2.6 kB)
Collecting tensorboardX>=1.9 (from ray[tune])
  Downloading tensorboardx-2.6.4-py3-none-any.whl.metadata (6.2 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.10.1-py3-none-any.whl.metadata (11 kB)
Downloading datasets-4.4.1-py3-none-any.whl (511 kB)
Downloading optuna-4.6.0-py3-none-any.whl (404 kB)

The code cell below imports the libraries needed for hyperparameter tuning with automated optimization techniques. These provide the tools for handling the data, defining the model, configuring training, and evaluating performance. The following are imported:

1. *torch* - The PyTorch library, which handles tensor operations and GPU acceleration.
2. *os* - This library gives access to operating system functions such as file handling and environment variables.
3. *numpy* - This is used for numerical operations, especially on predictions and metrics.
4. *pandas* - This is mainly used for reading and manipulating tabular data. In this case, it is used for reading the CSV file.
5. *accuracy_score, f1_score* - These are the scikit-learn evaluation metrics used for checking the performance of the model.
6. *files* - This Google Colab module allows file uploads from the local machine to the Colab environment.
7. *Dataset* - This is the Hugging Face class that wraps pandas DataFrames as model-ready datasets.
8. *AutoTokenizer* - Hugging Face class for automatically loading the right tokenizer for the model.
9. *AutoModelForSequenceClassification* - Hugging Face class for loading a model for text classification.
10. *Trainer* - Hugging Face class that simplifies training and evaluation.
11. *TrainingArguments* - Hugging Face class for configuring the training settings.
12. *set_seed* - Hugging Face function that supports reproducibility by fixing the random seeds.

In [None]:
# Import necessary libraries
import torch
import os
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from google.colab import files
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    set_seed
)

# Environment Setup and Data Loading

This section sets up the environment by setting a seed for reproducibility, detecting whether a GPU is present, uploading the dataset, splitting the data into training and evaluation sets, and converting them into a suitable format. This prepares the data for tokenization and model training.

The code cell below ensures that the results are consistent every time the notebook is run. This is done by setting a fixed random seed using *set_seed()* with the value of *42*.

In [None]:
# Set seed for reproducibility
set_seed(42)
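
As a small illustration of what fixing a seed buys, the sketch below uses only Python's built-in *random* module rather than *set_seed()* itself. The helper name *draw_numbers* and the second seed value are made up for this example.

```python
import random

def draw_numbers(seed, count=5):
    """Return `count` pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.random() for _ in range(count)]

first_run = draw_numbers(42)
second_run = draw_numbers(42)
other_seed = draw_numbers(7)   # arbitrary different seed, for contrast

print(first_run == second_run)  # same seed gives the same sequence
print(first_run == other_seed)  # a different seed gives a different sequence
```

The same idea applies to the notebook: with the seed fixed, random choices such as weight initialization and data shuffling should repeat from run to run, although some GPU operations may still introduce small differences.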

The code cell below detects whether the system has a GPU and sets the appropriate device for training the model, so that the model uses the most efficient hardware available. Using higher-performance hardware (in this case, a GPU instead of a CPU) affects how long the model takes to train and how much memory is available to it.

*if torch.cuda.is_available()* checks whether a GPU is available. If it is, the GPU is assigned as the computation device, stored in the *device* variable, and its name is printed to the user. If it is not, the CPU is set as the computation device, stored in the *device* variable, and the user is informed that no GPU is available.

In [None]:
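# Check device
if torch.cuda.is_available():
  device = torch.device("cuda")
  # Report which GPU will be used
  print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
  device = torch.device("cpu")
  print("GPU not available, using CPU.")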

The code cell below uploads the dataset. In this case, the dataset is in CSV format and has the file name *hate_speech.csv*.

To upload the file from the local machine to the Colab environment, the cell calls *files.upload()* and stores the result in the *uploaded* variable. It then reads the CSV file with *pd.read_csv()* from pandas and stores it in the *df* variable. Finally, it keeps only the relevant columns (those named *text* and *label*). Since the dataset contains only these two columns, all of its columns are kept.

In [None]:
# Upload and load dataset
uploaded = files.upload()
df = pd.read_csv("hate_speech.csv", encoding="latin-1")[["text", "label"]]

Saving hate_speech.csv to hate_speech.csv


The code cell below splits the data into two parts: the first 1000 samples for training and the next 200 samples for evaluation. It uses *.iloc[]* to select the rows of each set by position.

In [None]:
# Split into train and eval
train_data = df.iloc[:1000].reset_index(drop=True)
eval_data = df.iloc[1000:1200].reset_index(drop=True)
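
Positional slices such as *[:1000]* and *[1000:1200]* exclude their end position, so the two sets do not share any row. The sketch below checks this with a plain Python list standing in for the row positions of the DataFrame; the list length of 1500 is an arbitrary assumption for the illustration.

```python
# Stand-in for the row positions of a DataFrame (length chosen arbitrarily).
row_positions = list(range(1500))

train_rows = row_positions[:1000]      # positions 0 to 999
eval_rows = row_positions[1000:1200]   # positions 1000 to 1199

print(len(train_rows), len(eval_rows))
print(set(train_rows) & set(eval_rows))  # empty set: no overlap between the splits
```

Because the split takes rows in file order, the class balance of each set depends on how the CSV file is ordered.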

The code cell below converts the data into a compatible format. The pandas DataFrames are converted into Hugging Face datasets by calling *Dataset.from_pandas()* on *train_data* and *eval_data*, and the results are stored in the *train_dataset* and *eval_dataset* variables respectively.

In [None]:
# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_data)
eval_dataset = Dataset.from_pandas(eval_data)

# Tokenization

This section focuses on turning the raw text into a numerical format that the model can understand. Here, the tokenizer is loaded, and a tokenization function is defined and applied to the data.

The code cell below initializes the tokenizer used for converting the Tagalog text into tokens.

The model name, *jcblaise/bert-tagalog-base-cased-WWM*, is stored in the variable *MODEL_NAME*. The tokenizer is then loaded using *AutoTokenizer.from_pretrained()* with the model name as its argument and stored in the *tokenizer* variable.

In [None]:
# Load tokenizer
MODEL_NAME = "jcblaise/bert-tagalog-base-cased-WWM"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/55.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/624 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

The code cell below creates a function that tokenizes the text in the dataset. As seen on the first line of the code cell, the function is named *tokenize_function* and takes *examples* as its argument. It calls *tokenizer()*, which applies truncation and padding so that every input has the same length. This ensures that the text from the dataset undergoes consistent preprocessing prior to training and evaluation.

As seen in the code cell below, the maximum length (*max_length*), to which every input is padded or truncated, is set to *256*. This is the same value that gave the best results in the experiments of one of our group members.

In [None]:
# Tokenization function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)
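
To make the effect of truncation and padding concrete, here is a simplified sketch in plain Python. It is not the tokenizer's actual implementation: the helper *pad_or_truncate*, the token ids, and the padding id of 0 are assumptions for illustration, and special tokens are ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token ids to exactly `max_length` entries."""
    clipped = token_ids[:max_length]                  # truncation: drop the overflow
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad            # padding on the right
    attention_mask = [1] * len(clipped) + [0] * n_pad  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# A short input gets padded, a long input gets cut (made-up ids).
short_example = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_example = pad_or_truncate(list(range(10)), max_length=6)

print(short_example)
print(long_example)
```

In the notebook, the same roles are played by the *input_ids* and *attention_mask* columns that *tokenizer()* returns, with *max_length=256* instead of 6.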

The code cell below applies the tokenization function by applying *map()* on both the *train_dataset* and *eval_dataset* and storing them on the variables *tokenized_train* and *tokenized_eval* respectively.

In [None]:
# Apply tokenization
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

The code cell below renames the label column to labels to ensure that it is compatible with the Trainer class during training and evaluation. As seen in the code cell below, it uses the *rename_column()* function to match the expected format of Hugging Face's Trainer.

The code cell below also prepares the tokenized data for PyTorch-based training. It also ensures proper format of the input for training and evaluation. As seen in the code cell below, the *set_format("torch",...)* part ensures that the dataset columns are converted into PyTorch tensors.

In [None]:
# Rename 'label' to 'labels' and set format to PyTorch tensors
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_eval = tokenized_eval.rename_column("label", "labels")

tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_eval.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Model and Metrics

This section involves the definition of functions for modelling and computing the metrics to ensure consistent model setup and gauge the performance of the models.

The code cell below initializes a fresh model for every hyperparameter search trial, so that no trial starts from weights already fine-tuned by a previous one. As seen in the cell below, the function is defined as *model_init()*.

This function loads the model (in this case, the BERT Tagalog base model, *jcblaise/bert-tagalog-base-cased-WWM*). When loading the model, *num_labels=2* configures it for binary classification. The dropout rate is assigned to *hidden_dropout_prob* and *attention_probs_dropout_prob*; the code cell below sets both to *0.1*, although the results of one of our group members indicated *0.2* as the best-performing dropout rate.

In [None]:
# Function to initialize a fresh model for each hyperparameter search trial
def model_init():
  return AutoModelForSequenceClassification.from_pretrained(
      MODEL_NAME,
      num_labels=2,
      hidden_dropout_prob=0.1,
      attention_probs_dropout_prob=0.1
  ).to(device)

The code cell below defines the metrics used for evaluating the performance of the model. A function named *compute_metrics* takes the predictions, picks the class with the highest score for each sample, and calculates the accuracy and the binary F1 score using *accuracy_score()* and *f1_score()*, storing them in the variables *acc* and *f1* respectively. Both metrics are then returned as a dictionary.

In [None]:
# Metrics function
def compute_metrics(p):
    # Uses F1-Score as the primary metric for comparison
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    f1 = f1_score(p.label_ids, preds, average="binary")
    return {"accuracy": acc, "f1": f1}
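
The arithmetic behind these two metrics can be sketched without scikit-learn. The helper names and the scores below are made up for illustration; the notebook itself relies on *accuracy_score()* and *f1_score()*.

```python
def argmax_rows(logits):
    """Index of the largest score in each row, i.e. the predicted class."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy_and_f1(labels, preds, positive=1):
    """Accuracy over all samples and binary F1 for the `positive` class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return {"accuracy": accuracy, "f1": f1}

logits = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.9], [1.5, 1.0]]  # made-up model scores
labels = [0, 1, 0, 1]

preds = argmax_rows(logits)
metrics = accuracy_and_f1(labels, preds)
print(preds)
print(metrics)
```

Here two of the four predictions are correct, and there is one true positive, one false positive, and one false negative, so both the accuracy and the F1 score come out to 0.5.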

# Random Search

The code cell below defines the **tune_hp** function, which declares the search space for the hyperparameters. The **trial** object samples a value for each key parameter that influences training behavior and performance. Each trial therefore uses random hyperparameter values drawn from the given ranges (such as in line 17, where a *logging_steps* value between 50 and 200 is drawn). The function returns the sampled values in a dictionary, which is passed to the model's training loop, so that Optuna can log, evaluate, and compare the different values of each hyperparameter.

In [None]:
def tune_hp(trial):
    """
    This function defines the hyperparameter space to be explored.
    The `trial` object allows us to suggest different values.
    """
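    # The search space uses the trial.suggest_float and trial.suggest_categorical
    # methods provided by Optuna.
    # Note: batch sizes, logging steps, and warmup steps are sampled as floats here,
    # so the logged values are fractional; trial.suggest_int would sample integers.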

    # 1. Learning Rate (Critical for performance)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)

    # 2. Batch Size (Affects VRAM and stability)
    train_batch_size = trial.suggest_float("per_device_train_batch_size", 16, 32, log=True)
    eval_batch_size = trial.suggest_float("per_device_eval_batch_size", 16, 32, log=True)

    # 3. Logging Steps
    logging_steps = trial.suggest_float("logging_steps", 50, 200, log=True)

    # 4. Warmup Steps
    warmup_steps = trial.suggest_float("warmup_steps", 250, 1000, log=True)

    # 5. Weight Decay
    weight_decay = trial.suggest_float("weight_decay", 0, 0.05)

    # 6. FP16
    fp16 = trial.suggest_categorical("fp16", [True, False])

    return {
        "learning_rate": learning_rate,
        "per_device_train_batch_size": train_batch_size,
        "per_device_eval_batch_size": eval_batch_size,
        "logging_steps": logging_steps,
        "warmup_steps": warmup_steps,
        "weight_decay": weight_decay,
        "fp16": fp16,
    }
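
The *log=True* option asks for values to be sampled in the log domain. A plain log-uniform draw can be sketched as follows; this is an illustration with the standard library, not Optuna's own sampler, and the helper *sample_log_uniform* and the seed are assumptions.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # arbitrary seed, only to make the illustration repeatable

# Same ranges as in tune_hp: learning rate and train batch size.
learning_rates = [sample_log_uniform(rng, 1e-5, 5e-5) for _ in range(1000)]
# Truncating to int is one possible way to obtain a usable batch size.
batch_sizes = [int(sample_log_uniform(rng, 16, 32)) for _ in range(1000)]

print(min(learning_rates), max(learning_rates))
print(min(batch_sizes), max(batch_sizes))
```

Sampling in the log domain spreads the draws evenly across orders of magnitude, which suits the learning rate. For a narrow integer range such as 16 to 32, a categorical choice among a few fixed batch sizes would be a common alternative.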

# Training Arguments and Trainer Setup

This section configures how the model is trained and evaluated. It defines the training settings that stay fixed across trials and initializes the Trainer.

The code cell below configures how the model is trained, evaluated, and logged. It fixes only the settings shared by all runs: the output directory, per-epoch evaluation and checkpointing, loading of the best model at the end, the metric used to pick the best checkpoint (*f1*), and the reporting target. For the purposes of this exercise, the evaluation settings were not changed. *num_train_epochs* is left at its default of 3, while *fp16*, *warmup_steps*, and *weight_decay* are not fixed in this cell; they are sampled for each trial by *tune_hp*.

In [None]:
# Training arguments (fixed for all runs)
training_args = TrainingArguments(
    output_dir="./grid_search_results",
    # Evaluation settings (fixed)
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1", # Optimize for F1-Score
    report_to="none",
)

The code cell below initializes the Hugging Face Trainer with all the necessary components, including the model initialization function, the training and evaluation datasets, the metrics, and the tokenizer. As seen in the code cell below, the Trainer is stored in the *trainer* variable and receives the arguments *model_init*, *args*, *train_dataset*, *eval_dataset*, *compute_metrics*, and *tokenizer*. This prepares the training engine that fine-tunes the model and evaluates its performance.

In [None]:
# Initialize the Trainer
trainer = Trainer(
    model_init=model_init, # We pass the function, not the object, for fresh initialization
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)



pytorch_model.bin:   0%|          | 0.00/439M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


model.safetensors:   0%|          | 0.00/439M [00:00<?, ?B/s]

The code cell below starts the hyperparameter search using the Random Search technique. With **Optuna** as the backend and Hugging Face's **trainer.hyperparameter_search()** API, the search for the best set of hyperparameters runs automatically. Each trial draws its hyperparameters from the value ranges declared in the **tune_hp** space, and Optuna handles the sampling and the pruning of unpromising trials. The **direction="maximize"** setting instructs Optuna to prefer the trial with the highest objective value. Note that the objective values reported in the log are the sum of the accuracy and the F1-score at the last epoch (for trial 0, 0.705 + 0.748936 ≈ 1.4539), so the search ranks trials by that sum rather than by the F1-score alone. The result, stored in **best_trial**, contains the best-performing hyperparameters discovered during the search.

Pruned trials are those that Optuna cut off because their intermediate results compared poorly with those of earlier trials. A total of 18 trials was set (numbered 0 to 17).

In [None]:
# Execution of Random Search
# We use Optuna backend for efficient searching. The 'hp_space' provides the search definition.
print("\n--- Starting Random Search (Total Runs: 18) ---")
print("Optimizing for 'f1' score...")

best_trial = trainer.hyperparameter_search(
    # We use 'Optuna' as the backend for the hyperparameter search
    backend="optuna",
    # Pass the function that defines the search space
    hp_space=tune_hp,
    # Maximize the F1 score (higher is better)
    direction="maximize",
    # Set the total number of experiments to run
    n_trials=18,
)

[I 2025-11-17 09:56:45,686] A new study created in memory with name: no-name-477ce9d4-7ee1-408c-83b5-907b1ac09bca



--- Starting Random Search (Total Runs: 5) ---
Optimizing for 'f1' score...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.692285,0.54,0.402597
2,No log,0.647925,0.63,0.614583
3,0.652600,0.599203,0.705,0.748936


[I 2025-11-17 09:59:34,171] Trial 0 finished with value: 1.4539361702127658 and parameters: {'learning_rate': 3.119814329835857e-05, 'per_device_train_batch_size': 22.216040575545613, 'per_device_eval_batch_size': 25.54089700860461, 'logging_steps': 93.8923411579136, 'warmup_steps': 415.64319475650296, 'weight_decay': 0.0018982749277018096, 'fp16': True}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.682862,0.555,0.453988
2,No log,0.664374,0.585,0.541436
3,No log,0.620269,0.665,0.685446


[I 2025-11-17 10:02:16,566] Trial 1 finished with value: 1.3504460093896715 and parameters: {'learning_rate': 3.846419178243955e-05, 'per_device_train_batch_size': 20.725170992635007, 'per_device_eval_batch_size': 16.586704614894728, 'logging_steps': 190.32810905957712, 'warmup_steps': 920.0507509923899, 'weight_decay': 0.009496233860959115, 'fp16': False}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.682147,0.55,0.476744
2,No log,0.679208,0.57,0.494118
3,No log,0.637918,0.67,0.673267


[I 2025-11-17 10:05:11,132] Trial 2 finished with value: 1.3432673267326734 and parameters: {'learning_rate': 1.1918563265027827e-05, 'per_device_train_batch_size': 28.645602924190555, 'per_device_eval_batch_size': 17.794607540879348, 'logging_steps': 161.26884135177738, 'warmup_steps': 335.6436865178559, 'weight_decay': 0.022099922761508514, 'fp16': True}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.678811,0.555,0.548223
2,No log,0.688458,0.56,0.45
3,0.676100,0.671187,0.575,0.514286


[I 2025-11-17 10:09:57,677] Trial 3 finished with value: 1.0892857142857142 and parameters: {'learning_rate': 1.6265771257800996e-05, 'per_device_train_batch_size': 31.277495634886247, 'per_device_eval_batch_size': 29.773688639700662, 'logging_steps': 86.86626287154515, 'warmup_steps': 738.226746466736, 'weight_decay': 0.005000020533281319, 'fp16': False}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.689665,0.56,0.45
2,No log,0.649448,0.625,0.60733
3,No log,0.622738,0.665,0.691244


[I 2025-11-17 10:15:26,643] Trial 4 finished with value: 1.3562442396313363 and parameters: {'learning_rate': 3.752274025561575e-05, 'per_device_train_batch_size': 26.20684625856315, 'per_device_eval_batch_size': 25.217063806680848, 'logging_steps': 157.66217746223063, 'warmup_steps': 582.1574387207886, 'weight_decay': 0.041633798082186306, 'fp16': False}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.678783,0.54,0.510638
2,No log,0.687288,0.555,0.447205
3,0.674800,0.655521,0.6,0.565217


[I 2025-11-17 10:20:23,489] Trial 5 finished with value: 1.1652173913043478 and parameters: {'learning_rate': 1.2443346451572982e-05, 'per_device_train_batch_size': 24.102137879996874, 'per_device_eval_batch_size': 16.306139804083667, 'logging_steps': 87.80648396792736, 'warmup_steps': 595.359345998454, 'weight_decay': 0.03302524169158571, 'fp16': False}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.680585,0.53,0.477778
2,0.687800,0.682407,0.555,0.460606
3,0.665000,0.646251,0.635,0.617801


[I 2025-11-17 10:24:41,664] Trial 6 finished with value: 1.252801047120419 and parameters: {'learning_rate': 1.1025772108948546e-05, 'per_device_train_batch_size': 21.580212715500394, 'per_device_eval_batch_size': 25.779101074797726, 'logging_steps': 54.389884201755386, 'warmup_steps': 586.6309827359133, 'weight_decay': 0.017573692264069057, 'fp16': False}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.678902,0.54,0.520833
2,0.679000,0.684851,0.55,0.444444


[I 2025-11-17 10:27:20,360] Trial 7 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.683442,0.555,0.447205


[I 2025-11-17 10:28:02,064] Trial 8 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.691155,0.555,0.440252
2,No log,0.661587,0.595,0.552486
3,No log,0.614018,0.675,0.70852


[I 2025-11-17 10:33:29,810] Trial 9 finished with value: 1.3835201793721974 and parameters: {'learning_rate': 1.3787418324743608e-05, 'per_device_train_batch_size': 19.58831115699904, 'per_device_eval_batch_size': 27.280152244587445, 'logging_steps': 165.69812453385438, 'warmup_steps': 380.16701687621475, 'weight_decay': 0.0002832674414789183, 'fp16': True}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.691898,0.545,0.420382
2,0.641000,0.651945,0.645,0.646766
3,0.641000,0.586,0.69,0.743802


[I 2025-11-17 10:38:09,467] Trial 10 finished with value: 1.433801652892562 and parameters: {'learning_rate': 2.650460343540157e-05, 'per_device_train_batch_size': 16.562927303659883, 'per_device_eval_batch_size': 19.704975715735657, 'logging_steps': 123.0538396837022, 'warmup_steps': 273.85554277232467, 'weight_decay': 0.049747707220039766, 'fp16': True}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.691894,0.545,0.420382
2,0.640900,0.651946,0.645,0.646766
3,0.640900,0.586399,0.69,0.741667


[I 2025-11-17 10:42:13,794] Trial 11 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.694345,0.54,0.402597


[I 2025-11-17 10:42:59,439] Trial 12 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.685175,0.545,0.448485
2,No log,0.644632,0.685,0.701422
3,0.610100,0.590675,0.695,0.728889


[I 2025-11-17 10:49:18,805] Trial 13 finished with value: 1.423888888888889 and parameters: {'learning_rate': 4.85487045257914e-05, 'per_device_train_batch_size': 18.144590004215853, 'per_device_eval_batch_size': 20.491154504765507, 'logging_steps': 123.19192025700454, 'warmup_steps': 250.51827248815133, 'weight_decay': 0.03757280210414869, 'fp16': True}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.686718,0.56,0.45
2,No log,0.646423,0.645,0.632124
3,0.656400,0.611486,0.7,0.736842


[I 2025-11-17 10:54:38,548] Trial 14 finished with value: 1.4368421052631577 and parameters: {'learning_rate': 2.0264344017065016e-05, 'per_device_train_batch_size': 23.015836894754138, 'per_device_eval_batch_size': 23.306155645600267, 'logging_steps': 90.92243549622225, 'warmup_steps': 300.5197551178641, 'weight_decay': 0.04962077705072338, 'fp16': True}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.684104,0.555,0.447205


[I 2025-11-17 10:55:23,691] Trial 15 pruned. 
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.692264,0.55,0.423077
2,No log,0.650218,0.63,0.614583
3,0.661000,0.617444,0.68,0.711712


[I 2025-11-17 11:00:59,586] Trial 16 finished with value: 1.3917117117117117 and parameters: {'learning_rate': 2.0237348021496265e-05, 'per_device_train_batch_size': 25.25974625169003, 'per_device_eval_batch_size': 22.999708169066125, 'logging_steps': 91.31820260954274, 'warmup_steps': 327.86469718156144, 'weight_decay': 0.027035705529987, 'fp16': True}. Best is trial 0 with value: 1.4539361702127658.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at jcblaise/bert-tagalog-base-cased-WWM and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.692051,0.54,0.402597
2,0.665800,0.64793,0.63,0.614583
3,0.665800,0.601161,0.7,0.74359


[I 2025-11-17 11:04:24,539] Trial 17 pruned. 
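
The objective values in the log above can be reproduced by adding the final-epoch accuracy and F1 score of a trial. The sketch below checks this for trial 0 and shows a hypothetical alternative that ranks trials by F1 alone. The function names are made up, and the *eval_accuracy* and *eval_f1* keys assume that the Trainer prefixes evaluation metrics with *eval_*.

```python
def summed_objective(metrics):
    """Add accuracy and F1, which matches the values reported in the log."""
    return metrics["eval_accuracy"] + metrics["eval_f1"]

def f1_only_objective(metrics):
    """Hypothetical alternative: rank trials by the F1 score alone."""
    return metrics["eval_f1"]

# Final-epoch metrics of trial 0, copied from the log above.
trial_0 = {"eval_accuracy": 0.705, "eval_f1": 0.748936}

print(summed_objective(trial_0))   # close to the logged 1.4539361702127658
print(f1_only_objective(trial_0))
```

If ranking by F1 alone is preferred, a function like *f1_only_objective* could be passed through the *compute_objective* argument of *trainer.hyperparameter_search()*.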


The code cell below prints the set of hyperparameters from the best trial. An if-else statement checks whether the search returned a best trial; if it did not, a message is printed indicating that the search failed or that no best trial was found. The hyperparameters of the best trial are then listed, with values drawn from the provided ranges.

In [None]:
print("\n--- Random Search Complete ---")
print("\nBEST HYPERPARAMETERS FOUND:")

# Extract and print the best configuration
if best_trial:
    print(best_trial)
    best_hps = best_trial.hyperparameters
    print("\nBest Hyperparameters:")
    for key, value in best_hps.items():
        print(f"  {key}: {value}")
else:
    print("Search failed or no best trial found.")

print("\nTo run the final model, use the best_hps found in a new TrainingArguments instance.")


--- Random Search Complete ---

BEST HYPERPARAMETERS FOUND:
BestRun(run_id='0', objective=1.4539361702127658, hyperparameters={'learning_rate': 3.119814329835857e-05, 'per_device_train_batch_size': 22.216040575545613, 'per_device_eval_batch_size': 25.54089700860461, 'logging_steps': 93.8923411579136, 'warmup_steps': 415.64319475650296, 'weight_decay': 0.0018982749277018096, 'fp16': True}, run_summary=None)

Best Hyperparameters:
  learning_rate: 3.119814329835857e-05
  per_device_train_batch_size: 22.216040575545613
  per_device_eval_batch_size: 25.54089700860461
  logging_steps: 93.8923411579136
  warmup_steps: 415.64319475650296
  weight_decay: 0.0018982749277018096
  fp16: True

To run the final model, use the best_hps found in a new TrainingArguments instance.
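
Since the best values for the batch sizes, *logging_steps*, and *warmup_steps* were logged as fractional numbers, one possible way to prepare them for a final run is to truncate them to integers first. The sketch below is an assumption about how that step could look: the helper *to_training_kwargs*, the choice of truncation over rounding, and the output directory in the closing comment are all hypothetical.

```python
# Hyperparameters that are expected to be whole numbers.
INTEGER_KEYS = (
    "per_device_train_batch_size",
    "per_device_eval_batch_size",
    "logging_steps",
    "warmup_steps",
)

def to_training_kwargs(hps):
    """Copy the hyperparameters, truncating the integer-valued ones to int."""
    kwargs = dict(hps)
    for key in INTEGER_KEYS:
        if key in kwargs:
            kwargs[key] = int(kwargs[key])
    return kwargs

# Values copied from the BestRun printed above.
logged_best_hps = {
    "learning_rate": 3.119814329835857e-05,
    "per_device_train_batch_size": 22.216040575545613,
    "per_device_eval_batch_size": 25.54089700860461,
    "logging_steps": 93.8923411579136,
    "warmup_steps": 415.64319475650296,
    "weight_decay": 0.0018982749277018096,
    "fp16": True,
}

final_kwargs = to_training_kwargs(logged_best_hps)
print(final_kwargs)

# Hypothetical next step (not run here):
# final_args = TrainingArguments(output_dir="./final_model", **final_kwargs)
```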
