<a href="https://colab.research.google.com/github/joms-hub/tagalog-fake-news-detection/blob/main/notebooks/distilbert_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DistilBERT Training/Fine-tuning

### 1. Colab Setup for **Training**

In [1]:
!pip install transformers datasets evaluate



### 2. Model Training

#### 2.1 Load the preprocessed dataset from Day 2

In [2]:
from datasets import load_from_disk

train_dataset = load_from_disk("/content/tagalog-fake-news-detection/tokenized/DistilBERT_train")
val_dataset   = load_from_disk("/content/tagalog-fake-news-detection/tokenized/DistilBERT_val")
test_dataset  = load_from_disk("/content/tagalog-fake-news-detection/tokenized/DistilBERT_test")

print(train_dataset, val_dataset, test_dataset)


Dataset({
    features: ['label', 'article', 'input_ids', 'attention_mask'],
    num_rows: 2244
}) Dataset({
    features: ['label', 'article', 'input_ids', 'attention_mask'],
    num_rows: 481
}) Dataset({
    features: ['label', 'article', 'input_ids', 'attention_mask'],
    num_rows: 481
})
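
As a quick sanity check on the splits printed above, the row counts can be turned into proportions. This is a minimal sketch that only uses the three `num_rows` values from the output; the notebook does not state the ratio used during preprocessing, so the roughly 70/15/15 split is inferred from the counts rather than confirmed.

```python
# Row counts copied from the printed Dataset summaries above.
split_sizes = {"train": 2244, "val": 481, "test": 481}

total = sum(split_sizes.values())
fractions = {name: rows / total for name, rows in split_sizes.items()}

# Print each split with its share of the full dataset.
for name, rows in split_sizes.items():
    print(f"{name:>5}: {rows:4d} rows ({fractions[name]:.1%})")
print(f"total: {total} rows")
```

The validation and test splits are the same size, so metrics on the two should be comparably noisy.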


#### 2.2 Set up the model

In [3]:
from transformers import DistilBertForSequenceClassification

model_name = "distilbert-base-multilingual-cased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=2)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
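
The warning above is expected: the four listed parameters belong to the classification head, which does not exist in the pretrained checkpoint and therefore starts from random values. The sketch below imitates that head in NumPy to show the shapes involved. It is an illustration under assumptions, not the library's code: the hidden size of 768, the initialization scale of 0.02, and the random stand-in for the encoder output are all assumed here, and dropout is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_dim = 768   # assumed hidden size of the base DistilBERT encoder
num_labels = 2     # matches num_labels=2 in the cell above
batch_size = 4     # arbitrary, for illustration only

# Random stand-in for the encoder output at the first ([CLS]) token position.
cls_hidden = rng.normal(size=(batch_size, hidden_dim))

# Freshly initialized head parameters, mirroring the four names in the warning.
# The 0.02 scale is an assumption made for this sketch.
pre_classifier_weight = rng.normal(scale=0.02, size=(hidden_dim, hidden_dim))
pre_classifier_bias = np.zeros(hidden_dim)
classifier_weight = rng.normal(scale=0.02, size=(hidden_dim, num_labels))
classifier_bias = np.zeros(num_labels)

# Head forward pass: linear -> ReLU -> linear (dropout omitted).
hidden = np.maximum(cls_hidden @ pre_classifier_weight + pre_classifier_bias, 0.0)
logits = hidden @ classifier_weight + classifier_bias

print(logits.shape)  # one row of two logits per example
```

Until fine-tuning updates these weights, the two logits per article carry no information about whether it is fake or real, which is why the message recommends training before inference.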


#### 2.3 Fine-tune the model with early stopping

In [16]:
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
from evaluate import load
import numpy as np

# Load the metrics once so they are not re-created on every evaluation pass
f1_metric = load("f1")
acc_metric = load("accuracy")

# Define a function to compute metrics
def compute_metrics(eval_pred):
    # eval_pred unpacks to (logits, labels); argmax picks the predicted class
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    f1_result = f1_metric.compute(predictions=predictions, references=labels, average="weighted")
    acc_result = acc_metric.compute(predictions=predictions, references=labels)
    # Combine results into one dictionary for logging
    return {
        "f1": f1_result["f1"],
        "accuracy": acc_result["accuracy"]
    }

# Define training arguments with early stopping settings
training_args = TrainingArguments(
    output_dir='./tagalog-fake-news-detection/results',
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    greater_is_better=True,
    report_to="none"  # the string "none" turns off logging integrations such as W&B
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Start the training process
trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


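| Epoch | Training Loss | Validation Loss | F1 | Accuracy |
|-------|---------------|-----------------|----------|----------|
| 1 | No log | 0.261184 | 0.950013 | 0.950104 |
| 2 | 0.032400 | 0.181206 | 0.968815 | 0.968815 |
| 3 | 0.022300 | 0.252172 | 0.943793 | 0.943867 |
| 4 | 0.022300 | 0.213946 | 0.964646 | 0.964657 |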
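
Training stopped after epoch 4 even though `num_train_epochs=10`, because validation F1 did not improve for two consecutive evaluations. The sketch below replays the logged F1 values through a simplified patience counter to show why. The helper `epochs_until_stop` is hypothetical and written for this illustration; it approximates the behaviour of `EarlyStoppingCallback(early_stopping_patience=2)` with the default threshold and is not the library's implementation.

```python
def epochs_until_stop(metric_per_epoch, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if best is None or value > best:
            # New best score: remember it and reset the counter.
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return None


# Validation F1 per epoch, copied from the training log above.
validation_f1 = [0.950013, 0.968815, 0.943793, 0.964646]

stop_epoch = epochs_until_stop(validation_f1, patience=2)
best_epoch = validation_f1.index(max(validation_f1)) + 1
print(f"stopped after epoch {stop_epoch}, best epoch was {best_epoch}")
```

Because `load_best_model_at_end=True` and `metric_for_best_model="f1"` are set, the model kept at the end should be the epoch 2 checkpoint, not the final epoch 4 weights.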


TrainOutput(global_step=284, training_loss=0.024735616546281626, metrics={'train_runtime': 560.9752, 'train_samples_per_second': 40.002, 'train_steps_per_second': 1.266, 'total_flos': 1189027370336256.0, 'train_loss': 0.024735616546281626, 'epoch': 4.0})
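
The `global_step=284` figure and the "No log" entry for epoch 1 both follow from the batch size and `logging_steps`. The arithmetic below is a sketch that assumes a single device and no gradient accumulation (neither is set in the training arguments, so the defaults are assumed).

```python
import math

train_rows = 2244      # num_rows of train_dataset
batch_size = 32        # per_device_train_batch_size
logging_steps = 100    # from TrainingArguments
epochs_run = 4         # epochs completed before early stopping

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs_run

# Steps at which a training loss is written to the log.
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps, logged_at)
```

With 71 steps per epoch, the first training loss is logged at step 100, which falls in epoch 2, so epoch 1 shows "No log". The second log at step 200 falls in epoch 3, and no further log occurs before step 284, which is why epochs 3 and 4 report the same training loss.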

In [None]:
!git add .