<a href="https://colab.research.google.com/github/levina-ai/financial-tweet-sentiment/blob/main/02_final_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 02 — Final Model

This notebook contains the **final end-to-end pipeline** used for the project’s submission.

It implements the best-performing approach from Notebook 01:
- **Minimal tweet cleaning** suitable for transformers (lowercasing; URLs, RT markers, and @mentions removed; `#` stripped from hashtags; punctuation preserved)
- **Fine-tuning** of `cardiffnlp/twitter-roberta-base-sentiment-latest` on the full training set
- **Feature extraction** using the fine-tuned RoBERTa encoder (mean-pooled embeddings)
- Training a tuned **LinearSVC** classifier on top of the embeddings
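- **Model validation** using 5-fold stratified CV (out-of-fold predictions via `cross_val_predict`); because the encoder is fine-tuned on the full training set first, these scores are likely optimistic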
- Generating test-set predictions and exporting the final submission file `pred_07.csv`

## Outputs
- Submission file: `pred_07.csv` with columns `id` and `label`


## Imports & Load Data

In [None]:
# basics
import os, re, string, random
from dataclasses import dataclass
from typing import Dict, List, Tuple, Optional
from itertools import chain
from collections import Counter

# data
import numpy as np
import pandas as pd

# plotting
import matplotlib.pyplot as plt

# sklearn
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, balanced_accuracy_score, confusion_matrix, classification_report
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# transformers
import torch
from datasets import Dataset
from transformers import (
    AutoConfig, AutoTokenizer, AutoModelForSequenceClassification, AutoModel,
    TrainingArguments, Trainer, DataCollatorWithPadding, pipeline,
)

# utils
from tqdm.notebook import tqdm

# seed
SEED = 42
random.seed(SEED)
np.random.seed(SEED)


In [None]:
train_df = pd.read_csv("train.csv")
test_df  = pd.read_csv("test.csv")

In [None]:
X_train = train_df["text"]
y_train = train_df["label"]

X_test = test_df["text"]
test_ids = test_df["id"].values

## Preprocessor

In [None]:
def base_clean(text: str) -> str:
    """
    Base cleaning shared by all feature types.
    - lowercase
    - remove URLs
    - remove RT
    - remove @mentions
    - remove leading '#', keep hashtag word
    - normalize whitespace
    NOTE: does NOT remove punctuation globally (depends on downstream method).
    """
    text = str(text).lower()
    text = re.sub(r"http\S+", " ", text)   # URLs
    text = re.sub(r"\brt\b", " ", text)    # RT
    text = re.sub(r"@\w+", " ", text)      # @ mentions
    text = re.sub(r"#", "", text)          # keep hashtag words
    text = re.sub(r"\s+", " ", text).strip()
    return text
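
To make the effect of these rules concrete, the sketch below re-declares `base_clean` so that it runs on its own and applies it to a made-up tweet (the sample text is an assumption, not a row from the dataset). Note that removing the mention leaves the colon that followed it behind, which is a side effect of the `@\w+` pattern.

```python
import re

def base_clean(text: str) -> str:
    # Same steps as the notebook's base_clean, repeated here to be self-contained
    text = str(text).lower()
    text = re.sub(r"http\S+", " ", text)   # URLs
    text = re.sub(r"\brt\b", " ", text)    # standalone "rt" only
    text = re.sub(r"@\w+", " ", text)      # @mentions
    text = re.sub(r"#", "", text)          # drop '#', keep the word
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Hypothetical example tweet
sample = "RT @TraderJoe: $AAPL beats Q3 estimates! #earnings https://t.co/abc123"
print(base_clean(sample))
# : $aapl beats q3 estimates! earnings

# The word boundary in \brt\b keeps words that merely contain "rt"
print(base_clean("Smart money rotates"))
# smart money rotates
```

Cashtags such as `$aapl` and punctuation such as `!` survive, which matches the stated goal of minimal cleaning.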

In [None]:
def transformer_preprocessor(text: str) -> str:
    """
    Preprocessor for transformer encoders.
    Minimal cleaning only: keep punctuation and stopwords.
    - uses base_clean
    - does NOT strip punctuation (transformers were trained with it)
    """
    text = base_clean(text)
    return text

In [None]:
# Minimal cleaning for transformers
X_train_tx = X_train.apply(transformer_preprocessor)
X_test_tx  = X_test.apply(transformer_preprocessor)

## Feature Engineering

### Fine-tune RoBERTa

In [None]:
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# 1) Load model
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
label2id = model.config.label2id  # {"negative":0,"neutral":1,"positive":2}

# 2) Map dataset ids -> model ids
DATA_ID2LABEL = {0: "negative", 1: "positive", 2: "neutral"}
y_train_m = [label2id[DATA_ID2LABEL[int(y)]] for y in y_train]

# 3) Build HF dataset
train_ds = Dataset.from_dict({"text": X_train_tx.tolist(), "label": y_train_m})

# 4) Tokenize + dynamic padding
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tok(batch):
    return tokenizer(batch["text"], truncation=True, max_length=96)

train_ds = train_ds.map(tok, batched=True)
collator = DataCollatorWithPadding(tokenizer)

# 5) Train
args = TrainingArguments(
    output_dir="twroberta_ft",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    report_to="none",
    dataloader_pin_memory=False,
    seed=SEED,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=collator,
)

trainer.train()

# 6) Save
SAVE_DIR = "finetuned_roberta_final"
trainer.save_model(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)


| Step | Training Loss |
|------|---------------|
| 500  | 0.4869        |
| 1000 | 0.4029        |


('finetuned_roberta_final/tokenizer_config.json',
 'finetuned_roberta_final/special_tokens_map.json',
 'finetuned_roberta_final/vocab.json',
 'finetuned_roberta_final/merges.txt',
 'finetuned_roberta_final/added_tokens.json',
 'finetuned_roberta_final/tokenizer.json')
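
The remapping in step 2 matters because the dataset and the checkpoint disagree on which integer means "positive" and which means "neutral". The sketch below spells out the mapping in both directions. It hard-codes `label2id` from the comment in the cell above rather than loading the model, and the names `data2model` and `model2data` are hypothetical helpers that do not appear in the notebook.

```python
# Assumed from the comment in the fine-tuning cell, not loaded from the model
label2id = {"negative": 0, "neutral": 1, "positive": 2}
DATA_ID2LABEL = {0: "negative", 1: "positive", 2: "neutral"}

# dataset id -> model id (what step 2 applies to y_train)
data2model = {d: label2id[name] for d, name in DATA_ID2LABEL.items()}
# model id -> dataset id (needed only if the classification head is used directly)
model2data = {m: d for d, m in data2model.items()}

print(data2model)  # {0: 0, 1: 2, 2: 1}
print(model2data)  # {0: 0, 2: 1, 1: 2}

y_example = [0, 1, 2, 2]                      # dataset ids
y_model = [data2model[y] for y in y_example]  # ids the model is trained on
print(y_model)  # [0, 2, 1, 1]
```

In this notebook the inverse mapping is not needed, since the final `LinearSVC` is trained on the original dataset labels and predicts in that label space.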

In [None]:
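@torch.inference_mode()
def encode_texts(texts, batch_size=16, max_length=64, pooling="cls", show_progress=True):
    """
    Encode texts into an array of shape (n_texts, hidden_size).
    Relies on the globals `tokenizer`, `encoder`, and `device`, which are set in the next cell.
    pooling: "cls" (first-token state) or "mean" (attention-masked average).
    NOTE: max_length defaults to 64, while fine-tuning truncated at 96 tokens.
    """
    texts = texts.tolist() if hasattr(texts, "tolist") else list(texts)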

    n = len(texts)
    n_batches = (n + batch_size - 1) // batch_size

    all_vecs = []
    it = range(0, n, batch_size)
    if show_progress:
        it = tqdm(it, total=n_batches, desc="Encoding", unit="batch")

    for i in it:
        batch = texts[i:i+batch_size]

        enc = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )

        # move inputs to same device as model
        enc = {k: v.to(device) for k, v in enc.items()}

        out = encoder(**enc)
        last = out.last_hidden_state  # (B, T, H)

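        if pooling == "cls":
            vecs = last[:, 0, :]
        elif pooling == "mean":
            mask = enc["attention_mask"].unsqueeze(-1).to(last.dtype)  # float mask
            vecs = (last * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        else:
            # fail loudly instead of hitting a NameError on `vecs` below
            raise ValueError(f"Unknown pooling: {pooling!r} (expected 'cls' or 'mean')")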

        all_vecs.append(vecs.detach().float().cpu().numpy())  # float() -> stable numpy

    return np.vstack(all_vecs)
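
The `"mean"` branch averages only over real tokens, so padding does not dilute the embedding. The sketch below reproduces that arithmetic in NumPy on a tiny hand-made batch. The function name `masked_mean_pool` and the toy values are assumptions for illustration; the notebook itself does this in PyTorch.

```python
import numpy as np

def masked_mean_pool(last_hidden, attention_mask):
    """Average token vectors over positions where attention_mask == 1."""
    mask = attention_mask[..., None].astype(last_hidden.dtype)  # (B, T, 1)
    summed = (last_hidden * mask).sum(axis=1)                   # (B, H)
    counts = np.clip(mask.sum(axis=1), 1, None)                 # avoid divide-by-zero
    return summed / counts

# Batch of 2 sequences, 3 token positions, hidden size 2
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
# [[2. 3.]
#  [4. 2.]]
```

A plain `hidden.mean(axis=1)` would give roughly `[34.67, 35.33]` for the first row, because the padding vector would be counted as a token.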


### Generate embeddings using the fine-tuned RoBERTa encoder

In [None]:
# Get model embeddings
MODEL_NAME = "finetuned_roberta_final"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME).eval()

# Pick GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)

# Use half precision (float16) on GPU to speed up inference
# and reduce memory; stay in float32 on CPU
use_fp16 = (device.type == "cuda")
if use_fp16:
    encoder.half()

X_train_emb_finetuned = encode_texts(X_train_tx, pooling="mean")
X_test_emb_finetuned  = encode_texts(X_test_tx, pooling="mean")

## Final Model

In [None]:
final_model = Pipeline([
    ("clf", LinearSVC(
        class_weight="balanced",
        random_state=SEED,
        C=0.003359818286283781,
        loss="squared_hinge",
        tol=0.0007411299781083245,
        max_iter=10000
    ))
])

final_model.fit(X_train_emb_finetuned, y_train)
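
With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so rarer classes count more in the loss. As a rough illustration, the sketch below evaluates that formula by hand using the class supports shown in this notebook's classification report (1442, 1923, 6178). It does not call scikit-learn, and the variable names are hypothetical.

```python
# Class supports as reported for the 9543 training tweets
supports = {0: 1442, 1: 1923, 2: 6178}

n_samples = sum(supports.values())
n_classes = len(supports)

# "balanced" heuristic: n_samples / (n_classes * count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in supports.items()
}

for label, weight in balanced_weights.items():
    print(label, round(weight, 3))
# 0 2.206
# 1 1.654
# 2 0.515
```

Under this weighting an error on class 0 costs roughly four times as much as an error on the majority class 2.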

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

oof_pred = cross_val_predict(final_model, X_train_emb_finetuned, y_train, cv=cv, n_jobs=-1)

print("CV Accuracy:", accuracy_score(y_train, oof_pred))
print("CV Macro-F1 :", f1_score(y_train, oof_pred, average="macro"))

CV Accuracy: 0.922351461804464
CV Macro-F1 : 0.9016573202568677
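
For readers unfamiliar with out-of-fold predictions, the sketch below writes out by hand roughly what `cross_val_predict` does: each sample is predicted exactly once, by a model clone that was not fitted on it. It uses synthetic data from `make_classification` instead of the RoBERTa embeddings, and the untuned `LinearSVC` settings are assumptions, so the printed score says nothing about this project's model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic 3-class stand-in for the embedding matrix
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)

clf = LinearSVC(class_weight="balanced", random_state=42, max_iter=10000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

oof_pred = np.full(len(y), -1)                 # -1 marks "not yet predicted"
times_predicted = np.zeros(len(y), dtype=int)

for train_idx, val_idx in cv.split(X, y):
    # Fit a fresh clone on 4 folds, predict the held-out fold
    fold_model = clone(clf).fit(X[train_idx], y[train_idx])
    oof_pred[val_idx] = fold_model.predict(X[val_idx])
    times_predicted[val_idx] += 1

print("Macro-F1 on synthetic data:", f1_score(y, oof_pred, average="macro"))
```

Only the classifier is refitted per fold here, which mirrors the notebook: the encoder that produced the embeddings is not part of the cross-validation loop.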


In [None]:
print(classification_report(y_train, oof_pred))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87      1442
           1       0.88      0.90      0.89      1923
           2       0.96      0.93      0.95      6178

    accuracy                           0.92      9543
   macro avg       0.89      0.91      0.90      9543
weighted avg       0.92      0.92      0.92      9543



## Predictions

In [None]:
y_test_pred = final_model.predict(X_test_emb_finetuned)

pred_df = pd.DataFrame({
    "id": test_ids,
    "label": y_test_pred
})

pred_df.to_csv("pred_07.csv", index=False)
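
Before uploading, it can help to sanity-check the submission frame. The sketch below is one possible check, not part of the original pipeline: `check_submission` is a hypothetical helper, and the allowed label set `(0, 1, 2)` is an assumption based on the dataset ids used in `DATA_ID2LABEL`.

```python
import pandas as pd

def check_submission(pred_df, expected_ids, allowed_labels=(0, 1, 2)):
    """Return a list of problems; an empty list means the frame looks fine."""
    problems = []
    if list(pred_df.columns) != ["id", "label"]:
        problems.append("columns must be exactly ['id', 'label']")
        return problems
    if len(pred_df) != len(expected_ids):
        problems.append("row count differs from the test set")
    elif list(pred_df["id"]) != list(expected_ids):
        problems.append("ids are missing, duplicated, or reordered")
    if not pred_df["label"].isin(list(allowed_labels)).all():
        problems.append("labels outside the allowed set")
    return problems

# Toy frames standing in for pred_df
good = pd.DataFrame({"id": [10, 11, 12], "label": [2, 0, 1]})
bad = pd.DataFrame({"id": [10, 11, 12], "label": [2, 0, 5]})

print(check_submission(good, [10, 11, 12]))  # []
print(check_submission(bad, [10, 11, 12]))   # ['labels outside the allowed set']
```

In the notebook this would be called as `check_submission(pred_df, test_ids)` before writing `pred_07.csv`.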