## 1 Importing the Pre-Processed Dataset

The pre-processed dataset was saved as four separate files, so we read each one back in:

- X_train (training variables of the dataset)
- X_val (validation variables of the dataset)
- y_train (training labels of the dataset)
- y_val (validation labels of the dataset)

In [None]:
import pandas as pd

In [None]:
X_train = pd.read_csv('X_train.csv')
X_val = pd.read_csv('X_val.csv')
y_train = pd.read_csv('y_train.csv')
y_val = pd.read_csv('y_val.csv')

Because we only classify each example as `True` or `False` from its text, we select the "cleaned_text" column from X_train and X_val, and the "target" column from y_train and y_val.

In [None]:
train_text = X_train['cleaned_text'].to_list()
train_label = y_train['target'].to_list()
val_text = X_val['cleaned_text'].to_list()
val_label = y_val['target'].to_list()

In [None]:
train_text[:5]

['  jimmyfallon crush squirrel bone mortar pestl school  bio dept  realli sure whi worstsummerjob',
 ' mccainenl think spectacular look stonewal riot obliter white house ',
 'can t bloodi wait   soni set date stephen king       the dark tower    stephenk thedarktow    bdisgust',
 'protest ralli stone mountain  atleast they r burn build loot store like individu  protest ',
 ' rbcinsur quot websit   disaster  tri 3 browser  amp  3 machines  alway get  miss info  error due non exist drop down ']

In [None]:
train_label[:5]

[0, 1, 0, 0, 0]

## 2 Tokenizing the Text with the Pre-Trained XLNet Tokenizer

In this notebook, we use the tokenizer of the pre-trained model xlnet-base-cased to tokenize the text from the dataset.

First, we install the libraries the tokenizer depends on:

In [None]:
!pip install transformers
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from transformers import XLNetTokenizer
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')


Apply the tokenizer to the training and validation text:

In [None]:
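# Note: xlnet-base-cased defines no maximum length, so truncation=True has no
# effect unless max_length is also passed; padding=True pads to the longest text.
train_encodings = tokenizer(train_text, truncation=True, padding=True)
val_encodings = tokenizer(val_text, truncation=True, padding=True)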

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## 3 Defining the Dataset Class

Before feeding the data to the neural network, we define a `Dataset` subclass and instantiate it to hold the encodings and labels.

In [None]:
import torch

class DatasetClass(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = DatasetClass(train_encodings, train_label)
val_dataset = DatasetClass(val_encodings, val_label)
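
The indexing logic in `__getitem__` can be illustrated without PyTorch. The sketch below is a simplified stand-in, not the class above: `toy_encodings`, `toy_labels`, and `get_item` are hypothetical names, the token ids are made up, and plain lists take the place of tensors. It shows that each field of the encodings holds one row per example, and that an item is the `idx`-th row of every field plus the matching label.

```python
# Toy stand-in for tokenizer output: one list per field, one row per example.
# The token ids are made up for illustration.
toy_encodings = {
    "input_ids": [[5, 5, 17, 98, 4, 3], [17, 98, 153, 254, 4, 3]],
    "attention_mask": [[0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]],
}
toy_labels = [0, 1]


def get_item(encodings, labels, idx):
    """Mirror DatasetClass.__getitem__, but return plain lists instead of tensors."""
    item = {key: rows[idx] for key, rows in encodings.items()}
    item["labels"] = labels[idx]
    return item


example = get_item(toy_encodings, toy_labels, 1)
print(example)
```

Because every item is a dictionary whose keys match the model's keyword arguments, the Trainer can batch the items and pass them to the model directly.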

In [None]:
train_dataset[1]

{'input_ids': tensor([    5,     5,     5,     5,     5,     5,     5,     5,     5,     5,
             5,     5,     5,     5,     5,     5,     5,     5,     5,     5,
             5,     5,     5,     5,     5,     5,     5,     5,     5,     5,
             5,     5,     5,     5,     5,     5,     5,     5,     5,     5,
             5,     5,     5,     5,     5,     5,     5,     5,    17,    98,
          8664,   153,   254,   368,   232,  8073,   338,  3085,  9760, 10666,
          6837,  9803,   817,   480,     4,     3]),
 'token_type_ids': tensor([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]),
 'attention_mask': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 1, 1, 1, 1, 1, 1, 1, 1,
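
The item printed above starts with a run of 5s in `input_ids`, 3s in `token_type_ids`, and 0s in `attention_mask`, which is consistent with the sequence being padded on the left. The sketch below reproduces that layout in plain Python. It is an illustration only, not the tokenizer's implementation: `pad_left` is a hypothetical helper, and the padding id 5 and padding segment id 3 are read off the output above.

```python
PAD_ID = 5          # padding token id, read off the output above
PAD_SEGMENT_ID = 3  # token_type_id used for padding, read off the output above


def pad_left(input_ids, token_type_ids, target_len):
    """Left-pad one encoded example to target_len (hypothetical helper)."""
    n_pad = target_len - len(input_ids)
    if n_pad < 0:
        raise ValueError("target_len is shorter than the sequence")
    return {
        "input_ids": [PAD_ID] * n_pad + input_ids,
        "token_type_ids": [PAD_SEGMENT_ID] * n_pad + token_type_ids,
        # 0 marks padding positions the model should ignore, 1 marks real tokens.
        "attention_mask": [0] * n_pad + [1] * len(input_ids),
    }


# The last four ids and segment ids of the example shown above.
ids = [817, 480, 4, 3]
segments = [0, 0, 0, 2]
padded = pad_left(ids, segments, target_len=7)
print(padded)
```

The attention mask is what lets the model skip the padded positions, so the long run of padding does not change the prediction, although it does add to the compute cost of every batch.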

## 4 Defining Evaluation Metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluation_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
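
To make the metric definitions concrete, the following sketch recomputes them by hand on a tiny example. The logits and labels are hypothetical values chosen for illustration, and the code uses plain Python rather than scikit-learn, so it does not handle the zero-division cases that the library covers.

```python
# Hypothetical logits for six examples (columns: class 0, class 1) and true labels.
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [0.3, 1.2], [0.2, 0.4], [0.8, -0.2]]
labels = [0, 1, 0, 1, 1, 0]

# Equivalent of predictions.argmax(-1): index of the largest logit in each row.
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]

# With average='binary', class 1 is treated as the positive class.
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # true positives
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))  # false positives
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

print(preds, precision, recall, f1, accuracy)
```

Here one negative example is misclassified as positive, so precision drops to 0.75 while recall stays at 1.0, and F1, their harmonic mean, falls between the two.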

## 5 Fine-tuning with Huggingface Trainer

The Hugging Face `Trainer` wraps the whole training loop, so we only need to define the following arguments before training:

- **args** -> TrainingArguments (holds the hyperparameters; most training-related settings are configured here)
- **model** -> Model (a model that inherits from *transformers.PreTrainedModel* or *torch.nn.Module*; the documentation notes that Trainer is optimised for transformers.PreTrainedModel)
- **compute_metrics** -> Evaluation Metrics (defines how the fine-tuned model's predictions are evaluated)
- **train_dataset** -> Train Dataset
- **eval_dataset** -> Validation Dataset

In [None]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
device

'cuda'

In [None]:
from transformers import XLNetForSequenceClassification, Trainer, TrainingArguments
import os
# Disable Weights & Biases logging. This variable is deprecated; the library
# suggests report_to="none" in TrainingArguments instead.
os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir='./RoBERTa-results',  # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
)

model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased")

trainer = Trainer(
    model=model,                         # the instantiated pre-trained model to be trained
    args=training_args,                  # training arguments, defined above
    compute_metrics=evaluation_metrics,
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,             # validation dataset
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_
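
With `warmup_steps=500` and the 1143 optimization steps reported in the training log, the learning rate is still ramping up for roughly 44% of the run. The sketch below shows the shape of such a schedule. It rests on assumptions: the notebook sets neither the learning rate nor the scheduler type, so the code assumes a base rate of 5e-5 and a linear warmup followed by linear decay, which are believed to be the `TrainingArguments` defaults. `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
BASE_LR = 5e-5        # assumption: TrainingArguments default, not set in the notebook
WARMUP_STEPS = 500    # from training_args above
TOTAL_STEPS = 1143    # "Total optimization steps" reported in the training log


def linear_warmup_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    return BASE_LR * max(0, TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 1143):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the learning rate at step 100 is only a fifth of its peak, which may partly explain why the logged loss stays close to 0.7 over the first 100 steps. A shorter warmup could be worth trying for a run of this length.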

In [None]:
trainer.train()

***** Running training *****
  Num examples = 6090
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 1143
  Number of trainable parameters = 117310466


| Step | Training Loss |
|------|---------------|
| 10   | 0.7154        |
| 20   | 0.6934        |
| 30   | 0.7364        |
| 40   | 0.7132        |
| 50   | 0.7054        |
| 60   | 0.6772        |
| 70   | 0.7058        |
| 80   | 0.7025        |
| 90   | 0.669         |
| 100  | 0.7038        |


Saving model checkpoint to ./RoBERTa-results/checkpoint-500
Configuration saved in ./RoBERTa-results/checkpoint-500/config.json
Model weights saved in ./RoBERTa-results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./RoBERTa-results/checkpoint-1000
Configuration saved in ./RoBERTa-results/checkpoint-1000/config.json
Model weights saved in ./RoBERTa-results/checkpoint-1000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1143, training_loss=0.4717025792296269, metrics={'train_runtime': 320.4815, 'train_samples_per_second': 57.008, 'train_steps_per_second': 3.567, 'total_flos': 670926442752720.0, 'train_loss': 0.4717025792296269, 'epoch': 3.0})
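
The figures in the training log and in `TrainOutput` follow from the dataset size, batch size, and epoch count. As a quick consistency check, the sketch below recomputes them from the values printed above, assuming a single device and no gradient accumulation, as the log indicates.

```python
import math

num_examples = 6090        # "Num examples" in the training log
batch_size = 16            # per_device_train_batch_size, single device
num_epochs = 3
train_runtime = 320.4815   # seconds, from TrainOutput

# The last batch of each epoch is smaller, so the step count is rounded up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

This gives 381 steps per epoch and 1143 steps in total, matching `global_step`, and the throughput values round to the reported 57.008 samples and 3.567 steps per second.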

## 6 Results of Evaluating the Validation Dataset

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1523
  Batch size = 64


{'eval_loss': 0.5068272352218628,
 'eval_accuracy': 0.7951411687458962,
 'eval_f1': 0.7539432176656151,
 'eval_precision': 0.7599364069952306,
 'eval_recall': 0.7480438184663537,
 'eval_runtime': 5.8328,
 'eval_samples_per_second': 261.111,
 'eval_steps_per_second': 4.115,
 'epoch': 3.0}
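
The reported precision, recall, and accuracy are ratios of integer counts over the 1523 validation examples, so the underlying confusion counts can be worked out backwards. The counts below are inferred from the metrics above rather than printed by the notebook, and should be read as an assumption that the sketch then checks against the reported values.

```python
# Confusion counts inferred from the reported metrics (not printed by the notebook).
tp, fp, fn, tn = 478, 151, 161, 733

total = tp + fp + fn + tn
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / total

# Values reported by trainer.evaluate() above.
reported = {
    "accuracy": 0.7951411687458962,
    "f1": 0.7539432176656151,
    "precision": 0.7599364069952306,
    "recall": 0.7480438184663537,
}
computed = {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}

for name, value in reported.items():
    print(name, value, abs(computed[name] - value) < 1e-12)
```

Read this way, the model misses 161 of the 639 positive examples and raises 151 false alarms among the 884 negative ones, so the errors are split fairly evenly between the two types.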

Parts of the code are adapted from:
1. https://huggingface.co/transformers/v3.2.0/custom_datasets.html
2. https://huggingface.co/docs/transformers/v4.24.0/en/model_doc/xlnet