<a class="anchor" id="10.">

# 1. GPT-2 Decoder Model
<a>

In this section, we implement a transformer **decoder model** using GPT-2 for sentiment classification.

Unlike encoder-based models such as BERT or FinBERT, GPT-2 uses an autoregressive decoder architecture in which each token attends only to the tokens before it. Although GPT-2 is designed primarily for text generation, we fine-tune it here for classification using Hugging Face's `GPT2ForSequenceClassification` class, which adds a linear classification head on top of the decoder.

<a class="anchor" id="10.1.">

## 1.1. Imports & Setup
<a>

In [3]:
# Install dependencies if needed (if running on a fresh Colab instance)
!pip install transformers
!pip install scikit-learn
!pip install pandas

# Import libraries
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, GPT2Config, Trainer, TrainingArguments
import os

# Disable wandb to avoid unnecessary API prompts
os.environ["WANDB_DISABLED"] = "true"



<a class="anchor" id="10.2.">

## 1.2. Load Preprocessed CSV Files
<a>

We load the training and test datasets that were preprocessed in the main project notebook and exported for transformer models. This allows us to work independently of the previous preprocessing pipelines.

In [4]:
# Load training and test data from exported files
train_df = pd.read_csv("train_df.csv")
test_df = pd.read_csv("test_df.csv")

# Reset indices to avoid alignment issues
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# Extract features and labels for GPT-2
X_train_text = train_df["text"]
y_train = train_df["label"]

X_test_text = test_df["text"]  # test set has no labels; its "id" column is used later for the submission file

<a class="anchor" id="10.2.">

## 1.3. GPT-2 Tokenizer & Model Setup
<a>

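We load a tiny GPT-2 variant (`sshleifer/tiny-gpt2`) for fast training on limited hardware. GPT-2 does not define a padding token by default, so we add a `[PAD]` token to the tokenizer, pass its ID to the model configuration, and resize the embedding matrix to include the new token. Note that this checkpoint is extremely small: the resized embedding layer printed below has shape `Embedding(50258, 2)`, i.e. a hidden size of 2, which limits how much the model can learn.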

In [5]:
# Load tiny GPT-2 model for fast fine-tuning
model_name = "sshleifer/tiny-gpt2"
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained(model_name)
tokenizer_gpt2.add_special_tokens({'pad_token': '[PAD]'})

# Build model config with pad token correctly handled
config = GPT2Config.from_pretrained(model_name, num_labels=3, pad_token_id=tokenizer_gpt2.pad_token_id)
model_gpt2 = GPT2ForSequenceClassification.from_pretrained(model_name, config=config)
model_gpt2.resize_token_embeddings(len(tokenizer_gpt2))


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at sshleifer/tiny-gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50258, 2)
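The pad token matters for more than batching: according to the Hugging Face documentation, `GPT2ForSequenceClassification` classifies from the hidden state of the last non-padding token in each row, which it locates via `pad_token_id`. The plain-Python sketch below illustrates that lookup only; it is not the library's implementation. The pad ID of 50257 is an assumption based on the original GPT-2 vocabulary occupying IDs 0 to 50256 (consistent with the 50258 rows printed above), and the other token IDs are arbitrary placeholders.

```python
# Illustrative sketch: locate the last non-padding token in right-padded rows.
PAD_ID = 50257  # assumed ID of the added '[PAD]' token (first ID after the original vocab)


def last_non_pad_index(input_ids, pad_id=PAD_ID):
    """Return the index of the last token that is not padding, or -1 if none."""
    for idx in range(len(input_ids) - 1, -1, -1):
        if input_ids[idx] != pad_id:
            return idx
    return -1  # the row contains only padding


batch = [
    [15496, 995, PAD_ID, PAD_ID],  # two real tokens followed by two pads
    [40, 588, 428, 4283],          # no padding at all
]
positions = [last_non_pad_index(row) for row in batch]
print(positions)
```

Without a configured `pad_token_id`, the model cannot tell real tokens from padding, which is why the configuration above sets it explicitly.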

<a class="anchor" id="10.2.">

## 1.4. Custom Dataset Class
<a>

We define a custom PyTorch `Dataset` that tokenizes all texts once, up front, and returns each example in the format the GPT-2 model expects. Sequences longer than `max_len` (64 tokens by default) are truncated, and shorter ones are padded to the length of the longest sequence in the dataset, so every input in a dataset has the same size.

In [6]:
# Custom Dataset class for GPT-2
class GPT2TweetDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=64):
        self.texts = texts.tolist()
        self.labels = labels.tolist()
        self.encodings = tokenizer(self.texts, truncation=True, padding=True, max_length=max_len)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
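To make the effect of `truncation=True, padding=True, max_length=max_len` concrete, here is a small sketch that mimics the behaviour on lists of token IDs. The helper `pad_and_truncate` and the toy IDs are hypothetical and only illustrate the idea; the real work is done by the tokenizer.

```python
def pad_and_truncate(sequences, max_len, pad_id=0):
    """Truncate each sequence to max_len, then right-pad to the longest one left."""
    truncated = [seq[:max_len] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_sequences = [[5, 6, 7, 8, 9, 10], [11, 12], [13, 14, 15]]
encodings = pad_and_truncate(toy_sequences, max_len=4)
for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"]):
    print(ids, mask)
```

Note that padding goes to the longest sequence after truncation, not necessarily to `max_len`: if every text is shorter than `max_len`, the padded length is shorter too.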

<a class="anchor" id="10.2.">

## 1.5. Train/Validation Split
<a>

We split the training data into training and validation sets to evaluate model performance during training. A stratified 90/10 split is used to preserve class distribution.

In [7]:
# Optional train-validation split (90/10 split for simplicity)
from sklearn.model_selection import train_test_split

X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train_text, y_train, test_size=0.1, random_state=42, stratify=y_train)

# Build datasets
train_dataset_gpt2 = GPT2TweetDataset(X_train_split, y_train_split, tokenizer_gpt2)
val_dataset_gpt2 = GPT2TweetDataset(X_val_split, y_val_split, tokenizer_gpt2)
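The idea behind `stratify=y_train` can be sketched in plain Python: split each class separately so that the validation set keeps the same class proportions as the full data. This is a simplified illustration with a hypothetical helper and a made-up label distribution, not scikit-learn's actual algorithm.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_fraction)  # hold out ~10% of each class
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical imbalanced label distribution (not the project's real counts)
toy_labels = [0] * 150 + [1] * 200 + [2] * 650
train_idx, val_idx = stratified_split(toy_labels)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), dict(val_counts))
```

With an unstratified random split, a minority class could by chance be under-represented in the validation set, which would make per-class metrics noisier.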


<a class="anchor" id="10.2.">

## 1.6. Metrics Function
<a>

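We define the evaluation metrics passed to the `Trainer`: accuracy plus support-weighted precision, recall, and F1-score. Weighted averaging works for multi-class classification, but it is dominated by the majority class, so it can mask poor performance on minority classes.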

In [8]:
# Define evaluation metrics for Trainer
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    # zero_division=0 scores classes that are never predicted as 0 without emitting a warning
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='weighted', zero_division=0
    )
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
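The sketch below walks through the two steps inside `compute_metrics` on made-up data: taking the argmax of the logits, then computing a per-class precision. It also shows why precision is undefined (0/0) for a class the model never predicts. The logits, labels, and helper names are hypothetical, and the code is a hand-rolled illustration rather than scikit-learn's implementation.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def precision_for(cls, labels, preds, zero_division=0.0):
    """Precision for one class; falls back to zero_division when it is never predicted."""
    predicted = sum(1 for p in preds if p == cls)
    correct = sum(1 for y, p in zip(labels, preds) if p == cls and y == cls)
    return correct / predicted if predicted else zero_division


# Hypothetical logits where class 2 always has the highest score
toy_logits = [[0.1, 0.2, 0.9], [0.3, 0.1, 0.8], [0.2, 0.1, 0.7], [0.0, 0.4, 0.6]]
toy_labels = [0, 1, 2, 2]

toy_preds = [argmax(row) for row in toy_logits]
precisions = [precision_for(c, toy_labels, toy_preds) for c in range(3)]
print(toy_preds, precisions)
```

A model that always predicts one class therefore gets a precision of 0 for every other class by convention, which drags the macro average down much more than the weighted one.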

<a class="anchor" id="10.2.">

## 1.7. Training Arguments
<a>

We define the training configuration using Hugging Face's `TrainingArguments`, specifying the number of epochs, batch sizes, warm-up steps, weight decay, and logging interval. The learning rate and its schedule are left at the library defaults. A single epoch is used to keep training time short, at the cost of model performance.

In [9]:
training_args_gpt2 = TrainingArguments(
    output_dir='./results_gpt2',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=50,
    weight_decay=0.01,
    logging_dir='./logs_gpt2',
    logging_steps=10,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
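For intuition about `warmup_steps=50`: with the `Trainer` defaults (base learning rate `5e-5`, linear scheduler), the learning rate ramps up linearly during warm-up and then decays linearly to zero. The function below is a minimal sketch of that shape. The value `total_steps=2000` is an assumption chosen for illustration; the real number depends on the dataset size and batch size.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=50, total_steps=2000):
    """Learning rate at a given step for linear warm-up followed by linear decay."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warm-up phase
        return base_lr * step / warmup_steps
    # Decay from base_lr down to 0 over the remaining steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 25, 50, 1025, 2000):
    print(step, linear_warmup_lr(step))
```

Warm-up avoids large updates while the newly initialized classification head still produces essentially random outputs.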


<a class="anchor" id="10.2.">

## 1.8. Train the GPT-2 Model
<a>

We initialize the Hugging Face Trainer, fine-tune GPT-2 on the prepared dataset, and evaluate its performance on the validation set.

In [25]:
# Build Trainer
trainer_gpt2 = Trainer(
    model=model_gpt2,
    args=training_args_gpt2,
    train_dataset=train_dataset_gpt2,
    eval_dataset=val_dataset_gpt2,
    compute_metrics=compute_metrics
)

# Train GPT-2
trainer_gpt2.train()

# Evaluate GPT-2
results_gpt2 = trainer_gpt2.evaluate()
print(results_gpt2)



| Step | Training Loss |
|------|---------------|
| 10 | 1.0296 |
| 20 | 1.0283 |
| 30 | 1.0460 |
| 40 | 1.0726 |
| 50 | 1.0659 |
| 60 | 1.0403 |
| 70 | 1.0231 |
| 80 | 1.0602 |
| 90 | 1.0532 |
| 100 | 1.0346 |


{'eval_loss': 1.0054112672805786, 'eval_accuracy': 0.6471204188481675, 'eval_f1': 0.5084811428457311, 'eval_precision': 0.4187648364902278, 'eval_recall': 0.6471204188481675, 'eval_runtime': 1.4144, 'eval_samples_per_second': 675.196, 'eval_steps_per_second': 168.976, 'epoch': 1.0}




<a class="anchor" id="10.2.">

## 1.9. Generate Report Metrics (Macro & Weighted)
<a>

We generate a detailed classification report to obtain both macro and weighted averages for precision, recall, and F1-score. This provides a better understanding of model performance across all classes, especially for imbalanced datasets.

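**Model Performance Summary**

After one epoch of training, the GPT-2 decoder model achieved the following on the validation set:
- Weighted F1-score: ~50.8%
- Macro F1-score: ~26.2%
- Near-perfect recall on the Neutral class (~99.8%), but zero recall on Bearish and Bullish.

The model predicts Neutral for almost every example, so its accuracy (~64.6%) is essentially the share of Neutral examples in the validation set (618 of 955, ~64.7%). In other words, it collapsed onto the majority class instead of learning to separate the three sentiments.

Nevertheless, the fine-tuning pipeline ran end to end and predictions were generated.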

In [11]:
# Generate predictions on validation set for report metrics
val_predictions = trainer_gpt2.predict(val_dataset_gpt2)
val_predicted_labels = np.argmax(val_predictions.predictions, axis=1)

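# Full classification report
# (zero_division=0 reports undefined precision as 0 without emitting a warning)
report = classification_report(
    val_dataset_gpt2.labels,
    val_predicted_labels,
    target_names=['Bearish', 'Bullish', 'Neutral'],
    digits=4,
    zero_division=0,
)
print(report)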



              precision    recall  f1-score   support

     Bearish     0.0000    0.0000    0.0000       144
     Bullish     0.0000    0.0000    0.0000       193
     Neutral     0.6468    0.9984    0.7850       618

    accuracy                         0.6461       955
   macro avg     0.2156    0.3328    0.2617       955
weighted avg     0.4185    0.6461    0.5080       955
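The macro and weighted averages in this report can be reproduced by hand from the per-class rows, which makes the gap between them easier to interpret. The short check below uses the rounded F1 values and supports printed above; because the inputs are rounded, the results agree with the report only to about four decimals.

```python
# Per-class F1 and support, copied from the classification report above
per_class = {
    "Bearish": {"f1": 0.0000, "support": 144},
    "Bullish": {"f1": 0.0000, "support": 193},
    "Neutral": {"f1": 0.7850, "support": 618},
}

total = sum(c["support"] for c in per_class.values())

# Macro average: every class counts equally, regardless of size
macro_f1 = sum(c["f1"] for c in per_class.values()) / len(per_class)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(c["f1"] * c["support"] for c in per_class.values()) / total

# Accuracy of a trivial classifier that always predicts the majority class
majority_share = per_class["Neutral"]["support"] / total

print(f"macro F1       = {macro_f1:.4f}")
print(f"weighted F1    = {weighted_f1:.4f}")
print(f"majority share = {majority_share:.4f}")
```

The weighted F1 looks moderate only because Neutral makes up roughly 65% of the validation set; the macro F1 exposes that two of the three classes are never predicted correctly.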





<a class="anchor" id="10.2.">

## 1.10. Predict on Final Test Set
<a>

We prepare the unseen test set for inference, generate predictions using the trained GPT-2 model, and export the final submission file in the required format for evaluation.

**Test Set Predictions Summary**

The GPT-2 decoder model generated predictions for the unseen test set.
The output file `pred_gpt2.csv` is ready for submission and contains:
- an `id` column taken from test.csv
- a `label` column with the predicted class (0 = Bearish, 1 = Bullish, 2 = Neutral)

In [12]:
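# Prepare test dataset for submission.
# The dataset class requires labels, so we pass placeholder zeros; any metrics
# computed against them during predict() are meaningless and should be ignored.
dummy_labels = pd.Series([0] * len(X_test_text))
test_dataset_gpt2 = GPT2TweetDataset(X_test_text, dummy_labels, tokenizer_gpt2)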

# Generate predictions
test_predictions = trainer_gpt2.predict(test_dataset_gpt2)
test_predicted_labels = np.argmax(test_predictions.predictions, axis=1)

# Build submission dataframe
submission_df = pd.DataFrame({
    'id': test_df['id'],
    'label': test_predicted_labels
})

# Export to CSV
submission_df.to_csv('pred_gpt2.csv', index=False)



### Final Remarks

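The GPT-2 decoder model was fine-tuned end to end and achieved a weighted F1-score of ~50.8% and a macro F1-score of ~26.2% on the validation set.

The model showed a strong bias towards the majority Neutral class: it assigned almost every validation example to Neutral and never correctly identified a Bearish or Bullish example. Class imbalance is one likely cause. The very small capacity of `sshleifer/tiny-gpt2` (hidden size 2), the single training epoch, and a training loss that stayed between about 1.02 and 1.07 over the logged steps suggest that the model also underfit.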

Despite these challenges, valid predictions were generated and exported for submission.