<a href="https://colab.research.google.com/github/lymoelopez/filipino-fake-news-detection/blob/main/preliminaryWork/modelFinetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Libraries

In [None]:
import numpy as np
import pandas as pd

from google.colab import files
import io


# Load Dataset

In [None]:
!pip install datasets


In [None]:
uploadedTest = files.upload()

Saving test.csv to test.csv


In [None]:
from datasets import Dataset

# files.upload() returns a dict mapping each uploaded filename to its raw bytes
df = pd.read_csv(io.BytesIO(uploadedTest['test.csv']))
testDataset = Dataset.from_pandas(df)

In [None]:
testDataset[10]

{'label': '1',
 'article': 'Usap-usapan ngayon sa social media ang umano\'y panggagaya ng kinatawan ng Thailand na si Ingchanok Prasart sa ginanap na Miss Intercontinental pageant 2018 sa trademark ni Miss Universe 2018 Catriona Gray na "lava walk." Isang netizen sa Twitter na nagngangalang Marnie Raro ang nakapansin sa paraan ng paglalakad at pag-project ni Miss Thailand na parang kinokopya ang "lava walk" ni Catriona. Bukod kay Miss Thailand, napansin din ng mga netizens ang pagkopya umano sa "Mayon Volcano gown" ni Miss Vietnam. Aminado naman ang dalawa na labis nilang hinahangaan at inspirasyon nila si Miss Universe 2018 Catriona Gray. Halo-halo naman ang naging reaksyon ng mga netizens hinggil dito. "So now everyone looks like Catriona Gray. From hair to stance to gowns. The influence. Only legends do that." "Catriona Gray\'s walk can never be perfect without her walking it." "I think so? Even the hand gesture of Miss Thailand seems copied too. Catriona Gray is indeed a queen." "Y

In [None]:
uploadedTrain = files.upload()
df = pd.read_csv(io.BytesIO(uploadedTrain['train.csv']))
trainDataset = Dataset.from_pandas(df)

Saving train.csv to train.csv


In [None]:
uploadedVal = files.upload()
df = pd.read_csv(io.BytesIO(uploadedVal['validation.csv']))
valDataset = Dataset.from_pandas(df)

Saving validation.csv to validation.csv


# Preprocessing

In [None]:
!pip install transformers


In [None]:
from transformers import AutoTokenizer

# Cap inputs at 256 tokens; truncation=True in preprocess_function enforces this limit
tokenizer = AutoTokenizer.from_pretrained("jcblaise/electra-tagalog-small-cased-discriminator", model_max_length=256)

In [None]:
def preprocess_function(examples):
    return tokenizer(examples['article'], truncation=True)

In [None]:
tokenized_test = testDataset.map(preprocess_function, batched=True)

In [None]:
tokenized_train = trainDataset.map(preprocess_function, batched=True)
tokenized_val = valDataset.map(preprocess_function, batched=True)

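
With `batched=True`, `Dataset.map` passes the function a dictionary of columns, each holding a list of values, rather than a single row. The sketch below mimics that contract in plain Python. `toy_tokenizer` and `toy_preprocess` are hypothetical stand-ins (a whitespace splitter with made-up ids), used only so the example runs without downloading the real tokenizer.

```python
def toy_tokenizer(texts, truncation=True, max_length=4):
    """Hypothetical stand-in for the real tokenizer: whitespace split, fake ids."""
    input_ids = []
    for text in texts:
        ids = [len(token) for token in text.split()]  # fake id = token length
        if truncation:
            ids = ids[:max_length]  # keep at most max_length tokens
        input_ids.append(ids)
    return {"input_ids": input_ids}


def toy_preprocess(examples):
    # Same shape as preprocess_function: a dict of lists in, a dict of lists out
    return toy_tokenizer(examples["article"], truncation=True)


# A batch as Dataset.map would pass it when batched=True
batch = {"article": ["isang maikling balita", "a b c d e f"], "label": [0, 1]}
encoded = toy_preprocess(batch)
print(encoded)
```

The returned columns are added alongside the existing ones, which is why `article` is still present in the tokenized datasets.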

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
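
`DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is assembled, instead of padding the whole dataset to one fixed length up front. The following is a minimal plain-Python sketch of that idea, not the library's implementation; the helper name `pad_batch` and the pad id `0` are assumptions made for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch (illustrative helper)."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length depends only on the current batch, batches of short articles stay short, which saves compute compared with padding everything to 256 tokens.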

# Evaluation

In [None]:
!pip install transformers datasets evaluate


In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
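
To make the steps inside `compute_metrics` concrete, here is a small worked example with made-up logits. It reproduces the argmax step and computes accuracy by hand with NumPy rather than calling `evaluate`, so the numbers are illustrative only and `toy_accuracy` is a hypothetical name.

```python
import numpy as np

# Made-up logits for four examples; columns are the scores for class 0 and class 1
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])

# Pick the highest-scoring class in each row, as compute_metrics does
predicted = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the reference labels
toy_accuracy = float((predicted == labels).mean())

print(predicted.tolist(), toy_accuracy)
```

Three of the four predictions match here, so the accuracy is 0.75. In the notebook, `accuracy.compute` returns the same kind of value wrapped in a dictionary under the key `accuracy`.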

# Finetuning

In [None]:
id2label = {0: "Real", 1: "Fake"}
label2id = {"Real": 0, "Fake": 1}

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "jcblaise/electra-tagalog-small-cased-discriminator", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of the model checkpoint at jcblaise/electra-tagalog-small-cased-discriminator were not used when initializing ElectraForSequenceClassification: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias']
- This IS expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at jcblaise/electra-tagalog-small-cased-discriminator and are

In [None]:
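training_args = TrainingArguments(
    output_dir="model",               # checkpoints are written to model/checkpoint-<step>
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    weight_decay=0.1,
    evaluation_strategy="epoch",      # evaluate on the validation set after each epoch
    save_strategy="epoch",            # must match evaluation_strategy for load_best_model_at_end
    load_best_model_at_end=True,      # "best" defaults to the lowest validation loss
    warmup_ratio=0.006,
)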

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: article. If article are not expected by `ElectraForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 4488
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 141
  Number of trainable parameters = 13738498
You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.642694 | 0.735967 |


The following columns in the evaluation set don't have a corresponding argument in `ElectraForSequenceClassification.forward` and have been ignored: article. If article are not expected by `ElectraForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 481
  Batch size = 32
Saving model checkpoint to model/checkpoint-141
Configuration saved in model/checkpoint-141/config.json
Model weights saved in model/checkpoint-141/pytorch_model.bin
tokenizer config file saved in model/checkpoint-141/tokenizer_config.json
Special tokens file saved in model/checkpoint-141/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from model/checkpoint-141 (score: 0.6426935791969299).


TrainOutput(global_step=141, training_loss=0.6647864808427527, metrics={'train_runtime': 3494.7396, 'train_samples_per_second': 1.284, 'train_steps_per_second': 0.04, 'total_flos': 66017674027008.0, 'train_loss': 0.6647864808427527, 'epoch': 1.0})
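
The figures in the training log can be cross-checked with a little arithmetic. The sketch below recomputes the step count and throughput from the logged values. The warmup calculation assumes the `Trainer` rounds `warmup_ratio * total_steps` up to a whole step, which is an assumption about library behaviour rather than something shown in the log.

```python
import math

num_examples = 4488        # "Num examples" in the training log
batch_size = 32            # per_device_train_batch_size, single device
warmup_ratio = 0.006       # from TrainingArguments
train_runtime = 3494.7396  # seconds, from TrainOutput

# One optimization step per batch; the last partial batch still counts
steps = math.ceil(num_examples / batch_size)

# Assumption: warmup steps are rounded up to a whole step
warmup_steps = math.ceil(steps * warmup_ratio)

samples_per_second = round(num_examples / train_runtime, 3)
steps_per_second = round(steps / train_runtime, 2)

print(steps, warmup_steps, samples_per_second, steps_per_second)
```

The 141 steps match `Total optimization steps` and `global_step`, and the throughput values match `train_samples_per_second` and `train_steps_per_second`. With only 141 steps, the run likely never reaches the default logging interval of 500 steps, which would explain the `No log` entry in the Training Loss column.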

## References

[1] https://huggingface.co/docs/transformers/tasks/sequence_classification