# WangchanBERTa on LimeSoda (🤗Trainer API)
This notebook implements WangchanBERTa with a linear classification head on the [LimeSoda](https://ieeexplore.ieee.org/document/9678187) dataset. The dataset consists of news articles labeled as either fake or fact, gathered from a variety of online sources. We use the Trainer API from the Transformers library to fine-tune the WangchanBERTa model.

# Implementation

## Installation

In [None]:
!pip install -q --upgrade transformers datasets tokenizers 
!pip install -q emoji pythainlp sklearn-pycrfsuite seqeval 
!rm -rf thai2transformers thai2transformers_parent
!git clone -b dev https://github.com/vistec-AI/thai2transformers/
!mv thai2transformers thai2transformers_parent
!mv thai2transformers_parent/thai2transformers .
!apt install git-lfs
!pip install sentencepiece
!pip install huggingface_hub

## Importing Libraries

In [None]:
import pandas as pd
import numpy as np
from datasets import load_dataset, DatasetDict, Dataset, load_from_disk
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
from transformers import EarlyStoppingCallback
from thai2transformers.preprocess import process_transformers
from thai2transformers.metrics import classification_metrics

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading the Dataset

First, we load the dataset and clean it using the `process_transformers` function from thai2transformers. This function applies a series of text-cleaning rules: it fixes HTML entities, removes brackets, replaces newlines, and removes useless spaces. Cleaning the text before fine-tuning keeps formatting noise out of the model's input.
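
As a rough illustration of what cleaning rules of this kind do, here is a minimal sketch using only the Python standard library. The function `toy_clean` and its four rules are hypothetical stand-ins written for this example; they are not the thai2transformers implementation, which may behave differently.

```python
import html
import re

def toy_clean(text: str) -> str:
    """Hypothetical stand-in for a rule-based text cleaner."""
    text = html.unescape(text)                            # "&amp;" -> "&"
    text = re.sub(r"\(\s*\)|\[\s*\]|\{\s*\}", "", text)   # drop empty brackets
    text = re.sub(r"[\r\n]+", " ", text)                  # newlines -> one space
    text = re.sub(r" {2,}", " ", text)                    # collapse runs of spaces
    return text.strip()

sample = "Tom &amp; Jerry ()\n\nnew   line"
cleaned = toy_clean(sample)
print(cleaned)  # Tom & Jerry new line
```

Applying the rules in a fixed order matters: newlines are turned into spaces before repeated spaces are collapsed, so no double spaces survive.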

In [None]:
split_dataset = load_from_disk('/content/drive/MyDrive/Fake news/News-Dataset/dataset')

In [None]:
def clean_function(examples):
    examples['text'] = process_transformers(examples['text'])
    return examples

cleaned_dataset = split_dataset.map(clean_function)

The cell below shows ten randomly sampled rows from the cleaned training split. A label of 1 marks fake news and 0 marks fact news.

In [None]:
pd.DataFrame(cleaned_dataset['train'].shuffle()[:10])[['labels','text']]

Unnamed: 0,labels,text
0,1,ย้อนรอย“เหตุการณ์ที่เทย์กิน”การปล้นธนาคารที่ได...
1,0,การแพร่กระจายของเชื้อไวรัสโคโรนา.การแพร่กระจาย...
2,0,เรื่องสำคัญ!อ่านฉลากให้ชัดก่อนใช้”ผ้าอนามัยแบบ...
3,1,ผักชีลาว..ผักพื้นบ้านมากคุณค่าช่วยต้านสารพัดโร...
4,0,ผู้ป่วยโควิด-19เสียชีวิตรายแรกในไทย|||ผู้เสียช...
5,0,แพทย์ยืนยันน้ำยาบ้วนปากอันตรายสรรพคุณเกินจริง|...
6,1,วิธีรักษามะเร็งง่ายๆด้วยตนเอง!!มันเป็นสิ่งที่ค...
7,1,น้ำมะกรูดผสมโซดา|||สูตรดื่มน้ำมะกรูดผสมโซดาลดไ...
8,1,วิธีบอกลาไขมันหน้าท้องพุงยุบน้ำหนักลดง่ายๆโดยใ...
9,1,จะเกิดอะไรขึ้นถ้าคุณดื่มน้ำมะพร้าวติดต่อกัน7วั...


Then, we load the tokenizer and map the encoding function over every split of the dataset. The tokenizer splits each text into subword tokens and replaces each token with its ID; inputs longer than 416 tokens are truncated.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('airesearch/wangchanberta-base-att-spm-uncased')
def encode_function(examples):
    return tokenizer(examples['text'], max_length=416, truncation=True)
encoded_dataset = cleaned_dataset.map(encode_function, batched=True)

{'input_ids': [5, 10, 19775, 303, 86, 732, 4973, 143, 4098, 33, 1979, 972, 23370, 5252, 18496, 19775, 303, 86, 732, 4973, 143, 4098, 33, 1979, 972, 23370, 303, 86, 732, 4973, 143, 4098, 33, 1979, 972, 23370, 7173, 94, 3268, 185, 84, 2040, 52, 21, 142, 303, 86, 732, 4973, 143, 4098, 268, 303, 86, 732, 32, 4973, 15470, 13610, 207, 59, 303, 4178, 598, 21, 5198, 21392, 7196, 16691, 76, 22, 3620, 33, 143, 510, 207, 22081, 140, 6018, 1258, 280, 143, 4090, 4535, 3817, 2754, 59, 303, 21184, 148, 125, 356, 393, 17, 14876, 1258, 22, 111, 75, 56, 9, 1258, 3442, 4155, 37, 2556, 533, 3189, 96, 56, 17839, 3313, 13390, 2672, 56, 9, 44, 5611, 3620, 1258, 21184, 148, 125, 1178, 204, 4898, 23370, 11993, 13, 796, 15, 151, 3620, 1258, 15, 2221, 30, 2110, 83, 974, 176, 709, 30, 15618, 563, 14876, 3620, 1258, 64, 125, 51, 2556, 1178, 382, 4898, 2672, 15762, 3159, 75, 56, 9, 103, 1979, 44, 78, 1051, 3620, 1258, 1766, 1979, 2727, 1979, 2235, 890, 17, 1108, 906, 197, 17, 14876, 3278, 99, 159, 1289, 118, 6084, 
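
To make the token-to-ID step concrete, here is a toy sketch. The vocabulary, the whitespace splitting, and the `toy_encode` helper are all hypothetical; the real WangchanBERTa tokenizer uses a trained SentencePiece model and its own special tokens, so the IDs below have no relation to the ones printed above.

```python
# Hypothetical toy vocabulary, invented for this example only.
vocab = {"<s>": 0, "</s>": 1, "<unk>": 2, "fake": 3, "news": 4, "is": 5, "bad": 6}

def toy_encode(text, max_length):
    """Map whitespace-separated tokens to IDs, truncating to max_length."""
    tokens = text.lower().split()
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    ids = ids[: max_length - 2]  # leave room for the two special tokens
    return [vocab["<s>"]] + ids + [vocab["</s>"]]

short_ids = toy_encode("Fake news is bad", max_length=8)
truncated_ids = toy_encode("Fake news is really bad", max_length=5)
print(short_ids)      # [0, 3, 4, 5, 6, 1]
print(truncated_ids)  # [0, 3, 4, 5, 1]
```

In the second call, the out-of-vocabulary word and the final word are cut off by truncation, which is the same effect `max_length=416, truncation=True` has on long articles.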

## Fine-tuning

In [None]:
num_labels = len(set(encoded_dataset['train']['labels']))

model = AutoModelForSequenceClassification.from_pretrained(
    'airesearch/wangchanberta-base-att-spm-uncased', 
    num_labels=num_labels)

Some weights of the model checkpoint at airesearch/wangchanberta-base-att-spm-uncased were not used when initializing CamembertForSequenceClassification: ['lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.bias', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at airesearch/wa

Here, we fine-tune the WangchanBERTa model using the Hugging Face Trainer API, which handles the training loop, periodic evaluation, and checkpointing based on a single `TrainingArguments` configuration. We also add an `EarlyStoppingCallback` to the Trainer so that training stops once the validation metric stops improving, which helps limit overfitting.

In [None]:
# Set up the training arguments
train_args = TrainingArguments(
    output_dir = 'wisesight_sentiment_wangchanberta',
    evaluation_strategy = "steps",  # evaluate every `eval_steps` optimizer steps
    eval_steps = 50,
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=20,  # upper bound; early stopping may end training sooner
    warmup_steps = int(len(encoded_dataset['train']) * 1 // 8 * 0.1),  # about 10% of one epoch's steps
    weight_decay=1e-2,
    save_total_limit=5,
    metric_for_best_model='f1_micro',  # metric used for early stopping and best-model selection
    seed = 125,
    load_best_model_at_end=True
)

trainer = Trainer(
    model,
    train_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['valid'],
    tokenizer=tokenizer,
    compute_metrics=classification_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=7)]
)
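
The idea behind patience-based early stopping can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: `first_stop_index` is a hypothetical helper, the scores are made up, and the real `EarlyStoppingCallback` also supports an improvement threshold and relies on the Trainer's own bookkeeping of the best metric, so its stopping point on a real run can differ.

```python
def first_stop_index(scores, patience):
    """Return the index of the evaluation at which training would stop,
    or None if the patience budget is never exhausted."""
    best = None
    bad_evals = 0
    for i, score in enumerate(scores):
        if best is None or score > best:
            best = score       # new best score: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

# Hypothetical validation scores, one per evaluation.
hypothetical_f1 = [0.70, 0.74, 0.73, 0.72, 0.75, 0.74, 0.74, 0.73]
stop_at = first_stop_index(hypothetical_f1, patience=3)
print(stop_at)  # 7
```

A larger patience tolerates longer plateaus at the cost of extra training time; with `patience=2` the same sequence would stop at index 3, before the later improvement to 0.75 is ever seen.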

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `CamembertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `CamembertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 2430
  Num Epochs = 20
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 6080


Step,Training Loss,Validation Loss,Accuracy,F1 Micro,Precision Micro,Recall Micro,F1 Macro,Precision Macro,Recall Macro,Nb Samples
50,No log,0.468164,0.833333,0.833333,0.833333,0.833333,0.833276,0.833516,0.833242,270
100,No log,0.467115,0.796296,0.796296,0.796296,0.796296,0.793575,0.810823,0.795489,270
150,No log,0.483646,0.774074,0.774074,0.774074,0.774074,0.767619,0.805298,0.772882,270
200,No log,0.510373,0.77037,0.77037,0.77037,0.77037,0.761525,0.813748,0.768986,270
250,No log,0.5108,0.796296,0.796296,0.796296,0.796296,0.789931,0.833789,0.79505,270
300,No log,0.432362,0.803704,0.803704,0.803704,0.803704,0.803572,0.804994,0.80394,270
350,No log,0.739026,0.792593,0.792593,0.792593,0.792593,0.791862,0.797781,0.793075,270
400,No log,0.434099,0.803704,0.803704,0.803704,0.803704,0.798095,0.838449,0.802513,270
450,No log,0.466131,0.82963,0.82963,0.82963,0.82963,0.827202,0.847081,0.828797,270
500,0.502500,0.615956,0.825926,0.825926,0.825926,0.825926,0.822951,0.847003,0.825011,270


The following columns in the evaluation set don't have a corresponding argument in `CamembertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `CamembertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 270
  Batch size = 8

TrainOutput(global_step=1350, training_loss=0.4268484949182581, metrics={'train_runtime': 612.7154, 'train_samples_per_second': 79.319, 'train_steps_per_second': 9.923, 'total_flos': 2303672956314720.0, 'train_loss': 0.4268484949182581, 'epoch': 4.44})
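
The step counts in the training log can be reproduced with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (both match the log), and that the final partial batch of each epoch is kept rather than dropped (an assumption about the data loader settings).

```python
import math

num_train_examples = 2430   # "Num examples" in the training log
batch_size = 8              # per-device batch size, single device assumed
num_epochs = 20
global_step_at_stop = 1350  # from TrainOutput

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Same expression as in TrainingArguments: about 10% of one epoch's steps.
warmup_steps = int(num_train_examples * 1 // batch_size * 0.1)
epochs_completed = round(global_step_at_stop / steps_per_epoch, 2)

print(steps_per_epoch)   # 304
print(total_steps)       # 6080
print(warmup_steps)      # 30
print(epochs_completed)  # 4.44
```

So early stopping ended the run after roughly 4.4 of the 20 scheduled epochs.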

## Test Results

In this cell, we evaluate the fine-tuned model on the test split of the LimeSoda dataset.

In [None]:
preds  = trainer.predict(encoded_dataset['test'])
pd.DataFrame.from_dict(preds[2],orient='index')

The following columns in the test set don't have a corresponding argument in `CamembertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `CamembertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 676
  Batch size = 8


Unnamed: 0,0
test_loss,0.398809
test_accuracy,0.89497
test_f1_micro,0.89497
test_precision_micro,0.89497
test_recall_micro,0.89497
test_f1_macro,0.893731
test_precision_macro,0.894418
test_recall_macro,0.893145
test_nb_samples,676.0
test_runtime,10.0878
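
The accuracy, micro-F1, micro-precision, and micro-recall rows above are identical. For single-label classification this is expected: every misclassified example counts as one false positive (for the predicted class) and one false negative (for the true class), so the micro-averaged scores all reduce to accuracy. The sketch below checks this on a small set of hypothetical labels; these are not the model's predictions, and `f1_scores` is a helper written for this example.

```python
def f1_scores(y_true, y_pred, labels=(0, 1)):
    """Return (micro_f1, macro_f1) computed from per-class counts."""
    per_class_f1 = []
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class_f1.append(2 * tp / denom if denom else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)  # pooled counts
    macro = sum(per_class_f1) / len(per_class_f1)         # unweighted mean
    return micro, macro

# Hypothetical labels, invented for illustration.
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 0]

micro, macro = f1_scores(y_true_toy, y_pred_toy)
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(micro, accuracy)  # 0.75 0.75
print(round(macro, 4))  # 0.7333
```

Macro-F1 averages the per-class scores without weighting by class size, which is why it can differ slightly from the micro scores when the classes are imbalanced.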


In [None]:
import torch
from sklearn.metrics import classification_report

# preds[0] holds the raw logits; convert them to class probabilities
logits = torch.tensor(preds[0])
probabilities = torch.nn.functional.softmax(logits, dim=-1)
# The predicted class is the one with the highest probability
predictions = torch.argmax(probabilities, dim=-1)

y_true = encoded_dataset['test']['labels']
target_names = ['fact news', 'fake news']
print(classification_report(y_true, predictions, target_names=target_names))

              precision    recall  f1-score   support

   fact news       0.89      0.88      0.88       304
   fake news       0.90      0.91      0.91       372

    accuracy                           0.89       676
   macro avg       0.89      0.89      0.89       676
weighted avg       0.89      0.89      0.89       676
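
Softmax is a monotonic transformation of the logits within each row, so taking the argmax of the probabilities gives the same class as taking the argmax of the raw logits; the softmax step is only needed when the probabilities themselves are of interest. A minimal pure-Python check, using hypothetical logits rather than the model's output:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical logits for three examples and two classes.
hypothetical_logits = [[2.0, -1.0], [-0.5, 0.3], [0.2, 1.7]]

from_probs = [argmax(softmax(row)) for row in hypothetical_logits]
from_logits = [argmax(row) for row in hypothetical_logits]
print(from_probs)   # [0, 1, 1]
print(from_logits)  # [0, 1, 1]
```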



# Upload to Hugging Face Hub

In [None]:
!huggingface-cli login



In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
model.push_to_hub("soda-berta")