<div class="alert alert-block alert-info">
<h1> Text Mining Project: Stock Sentiment <br>
Predicting Market Behavior from Tweets</h1><br>
 Text Mining 2025<br>
NOVA IMS MDSAA

<div class="alert alert-block alert-warning">
[NOTE] tm_tests consists of 3 notebooks: <br>
- tm_tests_01_12.ipynb: Pipeline 1 for EDA, ML models, LSTM, and DistilBERT  <br>
- tm_tests_02_12.ipynb: Pipeline 2 for GPT-2<br>
- tm_tests_03_12.ipynb: Pipeline 3 for FinBERT<br>

**This is Pipeline 2: "tm_tests_02_12.ipynb"**

# Group 12

|   | Student Name          | Student ID |
|---|-----------------------|------------|
| 1 | Hassan Bhatti         | 20241023   |
| 2 | Moeko Mitani          | 20240670   |
| 3 | Oumayma Ben Hfaiedh   | 20240699   |
| 4 | Rute D'Alva Teixeira  | 20240667   |
| 5 | Sarah Leuthner        | 20240581   |


# Table of Contents

* [<font color='#52b69a'>1. Data Integration</font>](#1.) <br>
    - [1.1. Import Libraries ](#1.1.)<br>
    - [1.2. Load Preprocessed CSV Files from tm_tests_01_12.ipynb](#1.2.)<br>  

* [<font color='#52b69a'>2. GPT-2 Tokenizer & Model Setup </font>](#2.)<br>

* [<font color='#52b69a'>3. Custom Dataset Class </font>](#3.)<br>

* [<font color='#52b69a'>4. Train-Validation Split </font>](#4.)<br>

* [<font color='#52b69a'>5. Metrics Function </font>](#5.)<br>

* [<font color='#52b69a'>6. Training Arguments </font>](#6.)<br>

* [<font color='#52b69a'>7. Train The GPT-2 Model </font>](#7.)<br>

* [<font color='#52b69a'>8. Generate Report Metrics (Macro & Weighted) </font>](#8.)<br>

In this notebook, we implement a **Transformer Decoder Model** using **GPT-2** for sentiment classification.

Unlike encoder-based models like BERT or FinBERT, GPT-2 uses an autoregressive decoder architecture. Although GPT-2 is primarily designed for text generation, we fine-tune it here for a classification task using HuggingFace's `GPT2ForSequenceClassification` class.

<a class="anchor" id="1.">

# 1. Data Integration
<a>

<a class="anchor" id="1.1.">

## 1.1. Import Libraries
<a>

In [1]:
# Install dependencies
#!pip install transformers
#!pip install scikit-learn
#!pip install pandas

# Import libraries
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification, GPT2Config, Trainer, TrainingArguments
import os

# Disable wandb to avoid unnecessary API prompts
os.environ["WANDB_DISABLED"] = "true"

<a class="anchor" id="1.2.">

## 1.2. Load Preprocessed CSV Files from tm_tests_01_12.ipynb
<a>

We are going to load the training and test datasets that were preprocessed in *tm_tests_01_12.ipynb* and exported for transformer models. This allows us to work independently of the previous preprocessing pipelines.

In [2]:
# Load training and test data from exported files
train_df = pd.read_csv("train_df.csv")
test_df = pd.read_csv("test_df.csv")

# Reset indices to avoid alignment issues
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# Extract features and labels for GPT-2
X_train_text = train_df["text"]
y_train = train_df["label"]

X_test_text = test_df["text"]  # test set only has text column

<a class="anchor" id="2.">

# 2. GPT-2 Tokenizer & Model Setup
<a>

We load a small GPT-2 checkpoint (`sshleifer/tiny-gpt2`) for faster training on limited hardware. The GPT-2 tokenizer does not define a padding token by default, so we add a `[PAD]` token to the tokenizer, pass its id to the model configuration, and resize the embedding matrix to cover the new token.

In [3]:
# Load tiny GPT-2 model for fast fine-tuning
model_name = "sshleifer/tiny-gpt2"
tokenizer_gpt2 = GPT2Tokenizer.from_pretrained(model_name)
tokenizer_gpt2.add_special_tokens({'pad_token': '[PAD]'})

# Build model config with pad token correctly handled
config = GPT2Config.from_pretrained(model_name, num_labels=3, pad_token_id=tokenizer_gpt2.pad_token_id)
model_gpt2 = GPT2ForSequenceClassification.from_pretrained(model_name, config=config)
model_gpt2.resize_token_embeddings(len(tokenizer_gpt2))


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at sshleifer/tiny-gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50258, 2)
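
The pad token id matters because `GPT2ForSequenceClassification` classifies from the last non-padding token of each row, which it can only locate when `pad_token_id` is set in the configuration. The sketch below illustrates that lookup in plain Python. The helper `last_non_pad_index` and the token ids are hypothetical, `PAD_ID = 50257` is an assumption consistent with the `Embedding(50258, 2)` output above, and this is not the library's implementation.

```python
# Illustrative only: mimics the idea of finding the last real token in a
# right-padded sequence; it is not the transformers implementation.
PAD_ID = 50257  # assumed id of '[PAD]' after appending it to the vocabulary


def last_non_pad_index(input_ids, pad_id=PAD_ID):
    """Return the index of the last token that is not padding."""
    for idx in range(len(input_ids) - 1, -1, -1):
        if input_ids[idx] != pad_id:
            return idx
    return -1  # the row contains only padding


# Hypothetical token ids for two tweets in one batch
batch = [
    [15496, 995, PAD_ID, PAD_ID],  # two real tokens, two pads
    [40, 588, 428, 4283],          # no padding
]
positions = [last_non_pad_index(row) for row in batch]
print(positions)  # [1, 3]
```

Without a pad token id, the classifier would have no way to tell real tokens from padding in a batch of unequal-length tweets.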

<a class="anchor" id="3.">

# 3. Custom Dataset Class
<a>

We define a custom PyTorch Dataset that tokenizes the text data and prepares it in the format expected by the GPT-2 model. Sequences are truncated to at most `max_len` tokens and, because `padding=True` is used, padded to the length of the longest sequence in the dataset, so every item in a dataset has the same input size.

In [4]:
# Custom Dataset class for GPT-2
class GPT2TweetDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=64):
        self.texts = texts.tolist()
        self.labels = labels.tolist()
        self.encodings = tokenizer(self.texts, truncation=True, padding=True, max_length=max_len)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
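
To make the effect of `truncation=True, padding=True, max_length=max_len` concrete, here is a minimal plain-Python sketch of the same idea on lists of token ids. The helper `pad_and_truncate`, the pad id `0`, and the toy ids are hypothetical; the real tokenizer does this internally and pads on the right by default.

```python
def pad_and_truncate(sequences, max_len, pad_id=0):
    """Truncate each sequence to max_len, then right-pad to the longest one."""
    truncated = [seq[:max_len] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [
        [1] * len(seq) + [0] * (target_len - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy = [[11, 12, 13], [21], [31, 32, 33, 34, 35, 36]]
enc = pad_and_truncate(toy, max_len=4)
print(enc["input_ids"])
# [[11, 12, 13, 0], [21, 0, 0, 0], [31, 32, 33, 34]]
print(enc["attention_mask"])
# [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
```

The padded length is the longest truncated sequence, not necessarily `max_len`; it only reaches `max_len` when at least one input is that long.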

<a class="anchor" id="4.">

# 4. Train-Validation Split
<a>

We are going to split the training data into training and validation sets to evaluate model performance during training. A stratified 90/10 split is used to preserve class distribution.

In [5]:
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train_text, y_train, test_size=0.1, random_state=42, stratify=y_train)

# Build datasets
train_dataset_gpt2 = GPT2TweetDataset(X_train_split, y_train_split, tokenizer_gpt2)
val_dataset_gpt2 = GPT2TweetDataset(X_val_split, y_val_split, tokenizer_gpt2)
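
As a rough illustration of what `stratify=y_train` achieves, the sketch below splits each class separately so that the validation set keeps the class proportions of the full data. The function `stratified_split` and the 20/30/50 toy label mix are hypothetical, and this is a simplified stand-in rather than scikit-learn's algorithm.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction=0.1, seed=42):
    """Split indices so each class contributes val_fraction of its samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy labels with an imbalanced 20/30/50 class mix (hypothetical data)
toy_labels = [0] * 20 + [1] * 30 + [2] * 50
train_idx, val_idx = stratified_split(toy_labels)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(dict(val_counts))  # {0: 2, 1: 3, 2: 5}
```

Each class keeps its share in the validation set, which is what makes validation metrics comparable to the training distribution.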

model.safetensors:   0%|          | 0.00/2.51M [00:00<?, ?B/s]

<a class="anchor" id="5.">

# 5. Metrics Function
<a>

We are going to define weighted evaluation metrics (accuracy, precision, recall, and F1-score) that are compatible with multi-class classification and handle potential class imbalance.

In [6]:
# Define evaluation metrics for Trainer
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    # zero_division=0 scores classes that are never predicted as 0 without raising a warning
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='weighted', zero_division=0)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
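
To show what `average='weighted'` means, the following sketch computes per-class F1 by hand and then averages it two ways: weighted by class support and unweighted (macro). The helper names and the toy labels are hypothetical; undefined ratios are scored as 0, which is also the value scikit-learn uses when it emits its undefined-metric warning.

```python
from collections import Counter


def per_class_f1(labels, preds, cls):
    """Precision, recall and F1 for one class; undefined ratios count as 0."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def weighted_f1(labels, preds):
    """Average of per-class F1, weighted by each class's share of the labels."""
    support = Counter(labels)
    return sum(per_class_f1(labels, preds, c) * n / len(labels)
               for c, n in support.items())


def macro_f1(labels, preds):
    """Plain average of per-class F1, every class counted equally."""
    classes = sorted(set(labels))
    return sum(per_class_f1(labels, preds, c) for c in classes) / len(classes)


# Toy case: the model predicts class 2 for everything
toy_labels = [0, 1, 2, 2, 2, 2]
toy_preds = [2, 2, 2, 2, 2, 2]
print(round(weighted_f1(toy_labels, toy_preds), 4))  # 0.5333
print(round(macro_f1(toy_labels, toy_preds), 4))     # 0.2667
```

The weighted score stays moderate even when two of three classes are never predicted, which is why the macro average is reported alongside it later in the notebook.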

<a class="anchor" id="6.">

# 6. Training Arguments
<a>

We define the training configuration using Hugging Face's `TrainingArguments`, specifying the batch size, warm-up steps, weight decay, and logging interval. The learning rate and its schedule are left at the library defaults. A single epoch is used to keep training time short.

In [7]:
training_args_gpt2 = TrainingArguments(
    output_dir='./results_gpt2',
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=50,
    weight_decay=0.01,
    logging_dir='./logs_gpt2',
    logging_steps=10,
    report_to="none",  # disable external loggers such as wandb without the deprecated env variable
)
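
For intuition about `warmup_steps=50`, here is a small sketch of a linear warm-up followed by linear decay, which is the default schedule shape in `Trainer`. The base learning rate `5e-5` is the library default, while `total_steps=1000` is a hypothetical value chosen only for illustration, since the real number of steps depends on the training set size.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=50, total_steps=1000):
    """Learning rate at a given optimizer step: ramp up, then decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr over the warm-up phase
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 25, 50, 525, 1000):
    print(step, linear_warmup_lr(step))
```

The rate peaks exactly at the end of warm-up, so with only one epoch a large share of training happens while the rate is already decaying.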

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


<a class="anchor" id="7.">

# 7. Train The GPT-2 Model
<a>

We are going to initialize the Hugging Face Trainer, fine-tune GPT-2 on the prepared dataset, and evaluate its performance on the validation set.

In [8]:
# Build Trainer
trainer_gpt2 = Trainer(
    model=model_gpt2,
    args=training_args_gpt2,
    train_dataset=train_dataset_gpt2,
    eval_dataset=val_dataset_gpt2,
    compute_metrics=compute_metrics
)

# Train GPT-2
trainer_gpt2.train()

# Evaluate GPT-2
results_gpt2 = trainer_gpt2.evaluate()
print(results_gpt2)

Step,Training Loss
10,1.1063
20,1.1003
30,1.11
40,1.0937
50,1.0942
60,1.1011
70,1.0908
80,1.0883
90,1.0927
100,1.0827


{'eval_loss': 1.039266586303711, 'eval_accuracy': 0.6471204188481675, 'eval_f1': 0.5084811428457311, 'eval_precision': 0.4187648364902278, 'eval_recall': 0.6471204188481675, 'eval_runtime': 1.0877, 'eval_samples_per_second': 877.964, 'eval_steps_per_second': 219.721, 'epoch': 1.0}
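
A useful reference point for the logged losses is the cross-entropy of a model that assigns equal probability to all three classes, which is ln(3). The short check below compares the logged training losses (copied from the output above) with that value; it is only a sanity check on the numbers, not part of the training pipeline.

```python
import math

num_classes = 3
# Cross-entropy of a uniform prediction over 3 classes: -ln(1/3) = ln(3)
uniform_loss = math.log(num_classes)

# Training losses copied from the log above (steps 10 to 100)
logged_losses = [1.1063, 1.1003, 1.11, 1.0937, 1.0942,
                 1.1011, 1.0908, 1.0883, 1.0927, 1.0827]

print(round(uniform_loss, 4))  # 1.0986
gaps = [round(loss - uniform_loss, 4) for loss in logged_losses]
print(gaps)
```

All logged values lie within about 0.02 of ln(3), which suggests that during these first 100 steps the model was still close to predicting the classes uniformly.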




<a class="anchor" id="8.">

# 8. Generate Report Metrics (Macro & Weighted)
<a>

We generate a detailed classification report to obtain both macro and weighted averages for precision, recall, and F1-score. This provides a better understanding of model performance across all classes, especially for imbalanced datasets.

### Model Performance Summary

After training, the GPT-2 decoder model achieved the following on the validation set:
- Weighted F1 score: ~50.8%
- Macro F1 score: ~26.2%
- Recall of 100% on the Neutral class, but 0% recall for Bearish and Bullish.

This shows that GPT-2 struggled with class imbalance and predicted the Neutral class for every validation example.

Nevertheless, the model was successfully fine-tuned and predictions were generated.

In [9]:
# Generate predictions on validation set for report metrics
val_predictions = trainer_gpt2.predict(val_dataset_gpt2)
val_predicted_labels = np.argmax(val_predictions.predictions, axis=1)

# Full classification report
report = classification_report(val_dataset_gpt2.labels, val_predicted_labels, target_names=['Bearish', 'Bullish', 'Neutral'], digits=4)
print(report)



              precision    recall  f1-score   support

     Bearish     0.0000    0.0000    0.0000       144
     Bullish     0.0000    0.0000    0.0000       193
     Neutral     0.6471    1.0000    0.7858       618

    accuracy                         0.6471       955
   macro avg     0.2157    0.3333    0.2619       955
weighted avg     0.4188    0.6471    0.5085       955
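
The averages in the report can be reproduced by hand from the per-class supports, since only the Neutral class has a non-zero F1. The snippet below is a worked check using the numbers printed above; the variable names are ours.

```python
# Per-class supports from the classification report
support = {"Bearish": 144, "Bullish": 193, "Neutral": 618}
total = sum(support.values())  # 955

# Every sample was predicted Neutral, so Neutral recall is 1 and its
# precision equals the Neutral share of the validation set.
neutral_precision = support["Neutral"] / total
neutral_recall = 1.0
neutral_f1 = (2 * neutral_precision * neutral_recall
              / (neutral_precision + neutral_recall))

# Bearish and Bullish F1 are 0 because neither class is ever predicted
macro_f1 = (0.0 + 0.0 + neutral_f1) / 3
weighted_f1 = neutral_f1 * support["Neutral"] / total

print(round(neutral_f1, 4))   # 0.7858
print(round(macro_f1, 4))     # 0.2619
print(round(weighted_f1, 4))  # 0.5085
```

The accuracy of 0.6471 is therefore exactly the majority-class share, 618 / 955.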





### Final Remarks

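The GPT-2 decoder model was fine-tuned end to end and achieved a weighted F1-score of ~50.8% and a macro F1-score of ~26.2% on the validation set.

The model predicted the majority Neutral class for every validation tweet, so its accuracy (64.7%) equals the share of Neutral examples. Class imbalance likely contributes to this bias. The very small checkpoint (embedding dimension 2, as shown by `Embedding(50258, 2)` above) and the single training epoch probably also limit what the model can learn.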