# CZ4042: Neural Networks and Deep Learning Group Project
Chua Yong Xuan, Daniel Yang, Jiang Zixing

# Introduction
## Sentiment Analysis with XLNet
In this notebook, we aim to perform sentiment analysis with [XLNet](https://arxiv.org/abs/1906.08237).

We have already preprocessed our data in the data preprocessing notebook and split it into three datasets: training, validation, and test. Our focus here is on fine-tuning an XLNet model on the training dataset and evaluating its performance on the validation and test sets.

Class imbalance and a small dataset are common hurdles in model training. We explore several data augmentation methods to overcome these constraints.

This notebook explores two key factors that affect model performance:
1. The impact of hyperparameter tuning
2. The impact of data augmentation (back translation, `nlpaug` augmentation, GPT-2 augmentation)


## Content
The outline of this notebook is as follows:
1. Set up
2. Preprocess and tokenize data
3. Develop the optimal model
4. Evaluate data augmentation


## About XLNet
XLNet is an extension of the Transformer-XL model pre-trained using an autoregressive method to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input sequence factorization order.

XLNet is a BERT-like model; the major difference lies in its approach to pre-training. BERT is an autoencoding (AE) model, while XLNet is an autoregressive (AR) model. BERT is pre-trained on the masked language modeling (MLM) task, in which the model predicts randomly masked tokens, whereas XLNet is trained with the permutation language modeling objective described above. The XLNet paper reports that it outperforms BERT on 20 NLP tasks.
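
To make the idea of a factorization order concrete, here is a minimal sketch in plain Python. It is not the XLNet implementation: the helper name `factorization_contexts` and the three example tokens are illustrative assumptions. For every permutation of the token positions, it lists which tokens each prediction is allowed to condition on.

```python
from itertools import permutations


def factorization_contexts(tokens, order):
    """For one factorization order, pair each target token with its visible context.

    `order` is a permutation of token positions. The token at order[t] is
    predicted from the tokens at order[:t], listed here in sentence order.
    """
    steps = []
    for t, pos in enumerate(order):
        context_positions = sorted(order[:t])
        steps.append((tokens[pos], [tokens[p] for p in context_positions]))
    return steps


tokens = ["profit", "rose", "sharply"]
for order in permutations(range(len(tokens))):
    print(order, factorization_contexts(tokens, order))
```

Across the six orders, every token is at some point predicted from tokens on its left and from tokens on its right, which is how an autoregressive objective can still learn bidirectional context.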


# 1. Set up

## Install dependencies and import libraries.

In [1]:
!pip install transformers sentencepiece optuna

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting optuna
  Downloading optuna-3.4.0-py3-none-any.whl (409 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import torch
import random
from torch.utils.data import Dataset, DataLoader
# The optimizer used below is torch.optim.AdamW, so AdamW is not imported from transformers
from transformers import XLNetTokenizer, XLNetForSequenceClassification
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import f1_score, accuracy_score
from tqdm import tqdm
import optuna

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


To make our results reproducible, we set the random seeds of the libraries used in this notebook, so that their random functions generate the same values on every run.

In [5]:
# Seed Python random
random_seed = 42
random.seed(random_seed)

# Seed NumPy random
np.random.seed(random_seed)

# Seed PyTorch random
torch.manual_seed(random_seed)

# If you are using CUDA, you should also seed the CUDA random number generator
torch.cuda.manual_seed(random_seed)
torch.cuda.manual_seed_all(random_seed)

# Set the cuDNN backend to be deterministic
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
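
As a small illustration of what seeding buys us, the sketch below uses only Python's `random` module; the helper `draw` is a hypothetical name introduced for this example. NumPy and PyTorch generators follow the same principle once seeded, although some GPU operations may still behave nondeterministically.

```python
import random


def draw(seed, n=3):
    """Seed Python's RNG, then draw n pseudo-random integers in [0, 99]."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(n)]


first = draw(42)
second = draw(42)
print(first, second)
print("Identical sequences:", first == second)
```

Re-seeding with the same value restarts the generator from the same state, so both calls return the same list.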

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Device: {device}')

Device: cuda


## Import Data

In [7]:
path = '/content/drive/MyDrive/CZ4042 NNDL/datasets'

# vanilla datasets
train_df = pd.read_csv(os.path.join(path, 'train.csv'))
test_df = pd.read_csv(os.path.join(path, 'test.csv'))
val_df = pd.read_csv(os.path.join(path, 'validation.csv'))

# augmented datasets
aug_train_df = pd.read_csv(os.path.join(path, 'nlpaug-train.csv'))
trans_train_df = pd.read_csv(os.path.join(path, 'google-translate-train.csv'))
gpt_train_df = pd.read_csv(os.path.join(path, 'gpt-augment-train.csv'))

In [8]:
train_df.head()

Unnamed: 0,sentiment,statement
0,1,The operations to be sold include manufacturin...
1,1,L&T has also made a commitment to redeem the r...
2,1,The deal was worth about EUR 1.2 mn .
3,1,FinancialWire tm is not a press release servic...
4,1,The share of the share capital of both above m...


# 2. Preprocess and tokenize data

In [9]:
# Set params
num_labels = train_df.sentiment.nunique()

Load the pre-trained XLNet tokenizer and model

In [10]:
# Define the XLNet model and tokenizer
model_name = 'xlnet-base-cased'
tokenizer = XLNetTokenizer.from_pretrained(model_name)
model = XLNetForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

Downloading (…)ve/main/spiece.model:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.38M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/760 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/467M [00:00<?, ?B/s]

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'logits_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Define dataset
We define a custom dataset class, `CustomDataset`, which inherits from PyTorch's `Dataset` class. This class handles the loading and tokenization of our text data.

In [11]:
# Define a custom dataset class
class CustomDataset(Dataset):
    def __init__(self, data, tokenizer, max_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]['statement']
        sentiment = self.data.iloc[idx]['sentiment']

        encoding = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=self.max_length,
            return_tensors='pt',
            return_attention_mask=True,
            add_special_tokens=True
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(sentiment, dtype=torch.long)
        }

# Define data loaders for training, validation, and test sets
def create_data_loaders(train_df, val_df, test_df, batch_size, max_length):
    tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
    train_dataset = CustomDataset(train_df, tokenizer, max_length)
    val_dataset = CustomDataset(val_df, tokenizer, max_length)
    test_dataset = CustomDataset(test_df, tokenizer, max_length)

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)

    return train_loader, val_loader, test_loader
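
The effect of `padding='max_length'` and `truncation=True` can be sketched without the tokenizer. The helper `pad_or_truncate` below is a simplified, hypothetical stand-in: it pads on the right with an assumed pad id of 0, whereas the real tokenizer also adds special tokens and chooses the pad id and padding side itself.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    pad_id=0 and right-side padding are assumptions made for this sketch.
    """
    ids = list(token_ids[:max_length])   # truncation: drop tokens beyond max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding


short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every example ends up with the same length, the `DataLoader` can stack them into a batch, and the attention mask tells the model which positions to ignore.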

## Define train and evaluate loops
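Both loops iterate over the data in mini-batches and return the `average_loss`, `accuracy`, and weighted `f1` score for one pass over the dataloader. The loss is the one the model itself returns when `labels` is passed, so the `criterion` argument is not used inside the loops.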

### Train loop
For each batch from the training dataloader, a forward pass computes the loss and logits, a backward pass computes the gradients, and the optimizer step updates the model's parameters.

### Evaluate loop
For each batch from the dataloader being evaluated (validation or test), we use the `torch.no_grad()` context manager to disable gradient calculation, since the model's parameters are not updated during evaluation. The rest of the process matches the training loop, without the backward pass or optimizer step.

In [12]:
def train(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    total_predictions = []
    total_targets = []

    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        predictions = torch.argmax(outputs.logits, dim=1)
        total_predictions.extend(predictions.tolist())
        total_targets.extend(labels.tolist())

        loss.backward()
        optimizer.step()

    average_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(total_targets, total_predictions)
    f1 = f1_score(total_targets, total_predictions, average='weighted')

    return average_loss, accuracy, f1

def evaluate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    total_predictions = []
    total_targets = []

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()

            predictions = torch.argmax(outputs.logits, dim=1)
            total_predictions.extend(predictions.tolist())
            total_targets.extend(labels.tolist())

    average_loss = total_loss / len(dataloader)
    accuracy = accuracy_score(total_targets, total_predictions)
    f1 = f1_score(total_targets, total_predictions, average='weighted')

    return average_loss, accuracy, f1
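
To show what the weighted F1 score measures, here is a minimal pure-Python sketch. The functions `accuracy` and `weighted_f1` and the toy labels are illustrative assumptions, not the scikit-learn implementation; the sketch assumes every class of interest appears in the targets.

```python
from collections import Counter


def accuracy(targets, predictions):
    """Fraction of predictions that match their target."""
    return sum(t == p for t, p in zip(targets, predictions)) / len(targets)


def weighted_f1(targets, predictions):
    """Per-class F1, averaged with weights equal to each class's share of the targets."""
    support = Counter(targets)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(targets, predictions))
        n_pred = sum(p == cls for p in predictions)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true   # weight each class by its number of true samples
    return total / len(targets)


targets = [1, 1, 1, 1, 2, 2, 0, 0]
predictions = [1, 1, 1, 2, 2, 2, 1, 0]
print(f"accuracy    = {accuracy(targets, predictions):.4f}")
print(f"weighted F1 = {weighted_f1(targets, predictions):.4f}")
```

With imbalanced classes, the majority class dominates the weighted average, so it is worth reading the score alongside the class distribution.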

## Hyperparameter Tuning
We use Optuna to tune the following hyperparameters: `batch_size`, `learning_rate`, and `num_epochs`. Each trial fine-tunes a fresh model, and the study minimizes the validation loss.

In [13]:
def objective(trial):
    # Define hyperparameters to be optimized
    batch_size = trial.suggest_categorical("batch_size", [32, 48, 64])
    learning_rate = trial.suggest_categorical("learning_rate", [3e-5, 2e-5, 1e-5])
    num_epochs = trial.suggest_categorical("num_epochs", [2, 3, 4])

    # Initialize the model, optimizer, and data loaders with selected hyperparameters
    model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=num_labels)
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    train_loader, val_loader, _ = create_data_loaders(train_df, val_df, test_df, batch_size, max_length)

    for epoch in tqdm(range(num_epochs)):
        train_loss, train_accuracy, train_f1 = train(model, train_loader, optimizer, criterion, device)
        val_loss, val_accuracy, val_f1 = evaluate(model, val_loader, criterion, device)

        trial.report(val_loss, epoch)

        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_loss

In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_length = 128
criterion = torch.nn.CrossEntropyLoss()

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=12, gc_after_trial=True)

best_params = study.best_params
best_loss = study.best_value

[I 2023-11-09 15:17:58,378] A new study created in memory with name: no-name-6f94fa0e-284a-4865-9eb1-75f5d3ea3e3d
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'logits_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 3/3 [04:29<00:00, 89.90s/it]
[I 2023-11-09 15:22:36,633] Trial 0 finished with value: 0.373175950050354 and parameters: {'batch_size': 32, 'learning_rate': 1e-05, 'num_epochs': 3}. Best is trial 0 with value: 0.373175950050354.
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'logits_proj.weight']
You should probably TRAIN 

In [15]:
print("Best Hyperparameters:", best_params)
print("Best Validation Loss:", best_loss)

Best Hyperparameters: {'batch_size': 32, 'learning_rate': 2e-05, 'num_epochs': 3}
Best Validation Loss: 0.3448593947291374
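
For context on how much of the search space the study covered, the sketch below enumerates every combination of the three categorical hyperparameters. It uses only the standard library; the variable names `search_space` and `grid` are introduced for this example.

```python
from itertools import product

# The same candidate values as in the objective function
search_space = {
    "batch_size": [32, 48, 64],
    "learning_rate": [3e-5, 2e-5, 1e-5],
    "num_epochs": [2, 3, 4],
}

# Cartesian product of all candidate values
grid = [dict(zip(search_space, combo)) for combo in product(*search_space.values())]

n_trials = 12
print("Combinations in the full grid:", len(grid))
print(f"Upper bound on coverage with {n_trials} trials: {n_trials / len(grid):.0%}")
```

Since the sampler may also revisit a combination, 12 trials explore at most 12 of the 27 combinations, so the reported best hyperparameters are the best among those tried rather than a guaranteed optimum of the grid.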


# 3. Develop the Optimal Model

## Set optimal hyperparameters

In [16]:
optimal_batch_size = best_params["batch_size"]
optimal_learning_rate = best_params["learning_rate"]
optimal_num_epochs = best_params["num_epochs"]

# Get optimal dataloaders
optimal_train_loader, optimal_val_loader, optimal_test_loader = create_data_loaders(train_df, val_df, test_df, optimal_batch_size, max_length)

## Train, validate and test optimal model

In [17]:
optimal_model = XLNetForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)
optimal_optimizer = torch.optim.AdamW(optimal_model.parameters(), lr=optimal_learning_rate)
print(f'\n{"Epoch" : <6} | {"Train loss" : <10} | {"Val loss" : <10} | {"Train acc" : <10} | {"Val acc" : <10} | {"Train F1" : <10} | {"Val F1" : <10}')

for epoch in (range(optimal_num_epochs)):
    train_loss, train_accuracy, train_f1 = train(optimal_model, optimal_train_loader, optimal_optimizer, criterion, device)
    val_loss, val_accuracy, val_f1 = evaluate(optimal_model, optimal_val_loader, criterion, device)
    print(f'{epoch+1}{"" : <4}\
    {train_loss:.4f}{"" : <7}{val_loss:.4f}{"" : <3}\
    {train_accuracy:.4f}{"" : <7}{val_accuracy:.4f}{"" : <3}\
    {train_f1:.4f}{"" : <7}{val_f1:.4f}')

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'logits_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch  | Train loss | Val loss   | Train acc  | Val acc    | Train F1   | Val F1    
1        0.6324       0.4332       0.7317       0.8181       0.7157       0.8200
2        0.3454       0.3840       0.8654       0.8387       0.8646       0.8384
3        0.2511       0.4129       0.9041       0.8542       0.9038       0.8492


In [18]:
test_loss, test_accuracy, test_f1 = evaluate(optimal_model, optimal_test_loader, criterion, device)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"Test F1 Score: {test_f1:.4f}")

Test Loss: 0.4220
Test Accuracy: 0.8554
Test F1 Score: 0.8502


In [19]:
model

XLNetForSequenceClassification(
  (transformer): XLNetModel(
    (word_embedding): Embedding(32000, 768)
    (layer): ModuleList(
      (0-11): 12 x XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (layer_1): Linear(in_features=768, out_features=3072, bias=True)
          (layer_2): Linear(in_features=3072, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (activation_function): GELUActivation()
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (sequence_summary): SequenceSummary(
    (summary): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
    (first_dropout): Identity()
    (last

# 4. Evaluate Data Augmentation
## Tackling class imbalance in a small dataset
A small dataset with an imbalanced distribution of classes (i.e. sentiment) may skew the model's results. We therefore retrain the model with the optimal hyperparameters on three augmented datasets to evaluate model performance and the effectiveness of each augmentation method.

## Google Translate Augmentation
Through back translation <i>(i.e. translating from language A to language B, then back to language A)</i>, certain words may be replaced with synonyms.

This method was applied to `train_df`, and the resulting duplicates were removed. It upsamples the underrepresented class, `sentiment = 0`.
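
The procedure can be sketched as follows. This is a minimal illustration, not the code used to build `google-translate-train.csv`: `back_translate`, `augment_class`, and especially `toy_translate` (a hypothetical stand-in for a real translation service) are assumptions, as is the pivot language.

```python
def back_translate(text, translate, source="en", pivot="fr"):
    """Translate source -> pivot -> source using the supplied `translate` callable."""
    pivot_text = translate(text, src=source, dest=pivot)
    return translate(pivot_text, src=pivot, dest=source)


def augment_class(rows, target_label, translate):
    """Append back-translated copies of `target_label` rows, skipping exact duplicates.

    `rows` is a list of (label, text) pairs.
    """
    seen = set(rows)
    augmented = list(rows)
    for label, text in rows:
        if label != target_label:
            continue
        candidate = (label, back_translate(text, translate))
        if candidate not in seen:   # drop round trips that changed nothing
            seen.add(candidate)
            augmented.append(candidate)
    return augmented


def toy_translate(text, src, dest):
    """Hypothetical translator: only the trip back to English changes the
    wording, by swapping one word for a synonym."""
    return text.replace("dropped", "fell") if dest == "en" else text


rows = [(0, "Profit dropped sharply"), (1, "Sales were flat"), (0, "Loss widened")]
for label, text in augment_class(rows, target_label=0, translate=toy_translate):
    print(label, text)
```

Only the first class-0 row gains a new variant here; the round trip leaves "Loss widened" unchanged, so its copy is discarded as a duplicate.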

In [20]:
train_df.sentiment.value_counts() / len(train_df) * 100

1    59.509202
2    27.704230
0    12.786568
Name: sentiment, dtype: float64

In [21]:
trans_train_df.sentiment.value_counts() / len(trans_train_df) * 100

1    52.898967
2    24.626866
0    22.474168
Name: sentiment, dtype: float64

The proportion of `sentiment = 0` increased from 12.8% to 22.5%.
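
The two distributions also let us estimate how many rows the augmentation added. The sketch below assumes that only `sentiment = 0` rows were added; the function name `added_fraction` is introduced for this example. If the class share before is p and the number of added rows is a (as a fraction of the original dataset size), the share after is (p + a) / (1 + a).

```python
def added_fraction(p_before, p_after):
    """Rows added, as a fraction of the original dataset size, assuming only
    one class was upsampled. Solves (p_before + a) / (1 + a) = p_after for a."""
    return (p_after - p_before) / (1 - p_after)


# Class shares taken from the two value_counts outputs above
a = added_fraction(0.12786568, 0.22474168)
print(f"Added rows: about {a:.1%} of the original training set")

# Under the same assumption, the other classes' shares shrink by 1 / (1 + a)
class1_after = 59.509202 / (1 + a)
class2_after = 27.704230 / (1 + a)
print(f"Predicted share of class 1: {class1_after:.2f}%")
print(f"Predicted share of class 2: {class2_after:.2f}%")
```

The predicted shares for classes 1 and 2 agree with the reported 52.90% and 24.63%, which is consistent with the assumption that only class 0 was augmented.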

### Train, validate and test model on Google Translate Augmented data

In [22]:
trans_train_loader, _, _ = create_data_loaders(trans_train_df, val_df, test_df, optimal_batch_size, max_length)

trans_model = XLNetForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)
trans_optimizer = torch.optim.AdamW(trans_model.parameters(), lr=optimal_learning_rate)
print(f'\n{"Epoch" : <6} | {"Train loss" : <10} | {"Val loss" : <10} | {"Train acc" : <10} | {"Val acc" : <10} | {"Train F1" : <10} | {"Val F1" : <10}')

for epoch in (range(optimal_num_epochs)):
    train_loss, train_accuracy, train_f1 = train(trans_model, trans_train_loader, trans_optimizer, criterion, device)
    val_loss, val_accuracy, val_f1 = evaluate(trans_model, optimal_val_loader, criterion, device)
    print(f'{epoch+1}{"" : <4}\
    {train_loss:.4f}{"" : <7}{val_loss:.4f}{"" : <3}\
    {train_accuracy:.4f}{"" : <7}{val_accuracy:.4f}{"" : <3}\
    {train_f1:.4f}{"" : <7}{val_f1:.4f}')

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'logits_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch  | Train loss | Val loss   | Train acc  | Val acc    | Train F1   | Val F1    
1        0.5744       0.3801       0.7592       0.8245       0.7524       0.8266
2        0.2952       0.7526       0.8881       0.7303       0.8875       0.7372
3        0.2016       0.3597       0.9242       0.8671       0.9238       0.8685


In [23]:
trans_test_loss, trans_test_accuracy, trans_test_f1 = evaluate(trans_model, optimal_test_loader, criterion, device)
print(f"Test Loss: {trans_test_loss:.4f}")
print(f"Test Accuracy: {trans_test_accuracy:.4f}")
print(f"Test F1 Score: {trans_test_f1:.4f}")

Test Loss: 0.3997
Test Accuracy: 0.8533
Test F1 Score: 0.8537


## `nlpaug` Augmentation
The [`nlpaug` library](https://github.com/makcedward/nlpaug) is used here for data augmentation.

This helped to upsample the underrepresented class, `sentiment = 0`, by replacing words with synonyms.

In [24]:
train_df.sentiment.value_counts() / len(train_df) * 100

1    59.509202
2    27.704230
0    12.786568
Name: sentiment, dtype: float64

In [25]:
aug_train_df.sentiment.value_counts() / len(aug_train_df) * 100

1    52.762668
2    24.563413
0    22.673919
Name: sentiment, dtype: float64

The proportion of `sentiment = 0` increased from 12.8% to 22.7%.

### Train, validate and test model on `nlpaug` Augmented data

In [26]:
nlpaug_train_loader, _, _ = create_data_loaders(aug_train_df, val_df, test_df, optimal_batch_size, max_length)

nlpaug_model = XLNetForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)
nlpaug_optimizer = torch.optim.AdamW(nlpaug_model.parameters(), lr=optimal_learning_rate)
print(f'\n{"Epoch" : <6} | {"Train loss" : <10} | {"Val loss" : <10} | {"Train acc" : <10} | {"Val acc" : <10} | {"Train F1" : <10} | {"Val F1" : <10}')

for epoch in (range(optimal_num_epochs)):
    train_loss, train_accuracy, train_f1 = train(nlpaug_model, nlpaug_train_loader, nlpaug_optimizer, criterion, device)
    val_loss, val_accuracy, val_f1 = evaluate(nlpaug_model, optimal_val_loader, criterion, device)
    print(f'{epoch+1}{"" : <4}\
    {train_loss:.4f}{"" : <7}{val_loss:.4f}{"" : <3}\
    {train_accuracy:.4f}{"" : <7}{val_accuracy:.4f}{"" : <3}\
    {train_f1:.4f}{"" : <7}{val_f1:.4f}')

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'logits_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch  | Train loss | Val loss   | Train acc  | Val acc    | Train F1   | Val F1    
1        0.6188       0.4809       0.7378       0.8168       0.7299       0.8176
2        0.3135       0.3667       0.8798       0.8684       0.8793       0.8654
3        0.2161       0.5071       0.9196       0.8206       0.9193       0.8252


In [27]:
nlpaug_test_loss, nlpaug_test_accuracy, nlpaug_test_f1 = evaluate(nlpaug_model, optimal_test_loader, criterion, device)
print(f"Test Loss: {nlpaug_test_loss:.4f}")
print(f"Test Accuracy: {nlpaug_test_accuracy:.4f}")
print(f"Test F1 Score: {nlpaug_test_f1:.4f}")

Test Loss: 0.4314
Test Accuracy: 0.8368
Test F1 Score: 0.8399


## GPT-2 Augmentation

Drop the original `statement` column and replace it with the `gpt_output` column.

In [28]:
gpt_train_df = gpt_train_df.drop('statement', axis=1)
gpt_train_df = gpt_train_df.rename(columns={'gpt_output': 'statement'})
gpt_train_df.head()

Unnamed: 0,sentiment,statement
0,1,The operations to be sold include manufacturin...
1,1,L&T has also made a commitment to redeem the r...
2,1,The deal was worth about EUR 1.2 mn. Given thi...
3,1,FinancialWire tm is not a press release servic...
4,1,The share of the share capital of both above m...


### Train, validate and test model on GPT-2 Augmented data

In [29]:
gpt_train_loader, _, _ = create_data_loaders(gpt_train_df, val_df, test_df, optimal_batch_size, max_length)

gpt_model = XLNetForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)
gpt_optimizer = torch.optim.AdamW(gpt_model.parameters(), lr=optimal_learning_rate)
print(f'\n{"Epoch" : <6} | {"Train loss" : <10} | {"Val loss" : <10} | {"Train acc" : <10} | {"Val acc" : <10} | {"Train F1" : <10} | {"Val F1" : <10}')

for epoch in (range(optimal_num_epochs)):
    train_loss, train_accuracy, train_f1 = train(gpt_model, gpt_train_loader, gpt_optimizer, criterion, device)
    val_loss, val_accuracy, val_f1 = evaluate(gpt_model, optimal_val_loader, criterion, device)
    print(f'{epoch+1}{"" : <4}\
    {train_loss:.4f}{"" : <7}{val_loss:.4f}{"" : <3}\
    {train_accuracy:.4f}{"" : <7}{val_accuracy:.4f}{"" : <3}\
    {train_f1:.4f}{"" : <7}{val_f1:.4f}')

Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['sequence_summary.summary.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'logits_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch  | Train loss | Val loss   | Train acc  | Val acc    | Train F1   | Val F1    
1        0.6942       0.4674       0.7029       0.8039       0.6803       0.8083
2        0.3407       0.4324       0.8624       0.8516       0.8618       0.8469
3        0.2352       0.4106       0.9089       0.8542       0.9088       0.8542


In [30]:
gpt_test_loss, gpt_test_accuracy, gpt_test_f1 = evaluate(gpt_model, optimal_test_loader, criterion, device)
print(f"Test Loss: {gpt_test_loss:.4f}")
print(f"Test Accuracy: {gpt_test_accuracy:.4f}")
print(f"Test F1 Score: {gpt_test_f1:.4f}")

Test Loss: 0.3331
Test Accuracy: 0.8647
Test F1 Score: 0.8642


# Conclusion
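Of the three augmentation methods, `GPT-2` augmentation was the most useful.

Compared with the model trained on the original `train_df`, test accuracy improved from 0.8554 to 0.8647, test F1 from 0.8502 to 0.8642, and test loss fell from 0.4220 to 0.3331. Google Translate augmentation left test accuracy roughly unchanged (0.8533) with a slightly higher F1 (0.8537), while `nlpaug` augmentation lowered both (0.8368 and 0.8399).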
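
The test metrics reported in this notebook can be collected in one place for comparison. The sketch below hard-codes the values printed by the four test cells; the dictionary layout and run names are choices made for this example.

```python
# Test metrics copied from the outputs of the four evaluation cells
results = {
    "Baseline (train_df)": {"loss": 0.4220, "accuracy": 0.8554, "f1": 0.8502},
    "Google Translate": {"loss": 0.3997, "accuracy": 0.8533, "f1": 0.8537},
    "nlpaug": {"loss": 0.4314, "accuracy": 0.8368, "f1": 0.8399},
    "GPT-2": {"loss": 0.3331, "accuracy": 0.8647, "f1": 0.8642},
}

baseline = results["Baseline (train_df)"]
for name, m in results.items():
    d_acc = m["accuracy"] - baseline["accuracy"]
    d_f1 = m["f1"] - baseline["f1"]
    print(f"{name:<20} acc={m['accuracy']:.4f} ({d_acc:+.4f})  "
          f"f1={m['f1']:.4f} ({d_f1:+.4f})  loss={m['loss']:.4f}")

best_by_accuracy = max(results, key=lambda k: results[k]["accuracy"])
best_by_loss = min(results, key=lambda k: results[k]["loss"])
print("Best by test accuracy:", best_by_accuracy)
print("Best by test loss:", best_by_loss)
```

Each configuration was trained once, so differences of about one percentage point are best read as indicative rather than conclusive.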