
# 🔁 RoBERTa (manual loop) for Sentiment Analysis — Google Colab

This notebook mirrors the **Part 2 (BERT)** lab style (manual training/validation loops) but uses **RoBERTa** instead of BERT.
It includes:
- A short **intro to RoBERTa** and **metric definitions** (Accuracy, Precision, Recall, F1)
- **IMDB 50K** loading/cleaning (via Kaggle `kagglehub`)
- **Tokenization** with `RobertaTokenizerFast`
- **Manual training loop** (AdamW + linear scheduler)
- **Evaluation** (accuracy/precision/recall/F1 + classification report)
- **Comparison table** vs your **BERT** results (paste your BERT scores)



## 1) What is RoBERTa (high level)?
**RoBERTa (Robustly Optimized BERT Approach)** keeps BERT's encoder architecture but changes the pre-training recipe:
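- More data and longer training
- **Dynamic masking**: mask positions are re-sampled each time a sequence is fed to the model, instead of being fixed once during preprocessing
- Drops the **Next Sentence Prediction** objective
- Larger batches and tuned hyperparameters

**Key takeaway:** same architecture as BERT, **stronger pre-training** → often **better downstream performance**.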

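To make the static vs. dynamic masking distinction concrete, here is a minimal sketch in plain Python. It is a simplification for illustration only: the helper `mask_tokens` is hypothetical, the 15% masking rate is the usual BERT-style default, and real pre-training also leaves some selected tokens unchanged or swaps in random tokens, which is omitted here.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with <mask> independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

tokens = "the movie was surprisingly good and well acted".split()

# Static masking: one pattern is sampled once and reused in every epoch.
static_rng = random.Random(0)
static_view = mask_tokens(tokens, 0.15, static_rng)
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh pattern is sampled every time the sequence is seen.
dynamic_rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, 0.15, dynamic_rng) for _ in range(3)]

for epoch, view in enumerate(dynamic_epochs, start=1):
    print(f"epoch {epoch}:", " ".join(view))
```

With static masking the model sees the same masked positions on every pass; with dynamic masking the positions can change from pass to pass, so the same text yields more varied training signal.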


## 2) Metrics — Definitions & When to Use

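For binary classification, all four metrics are defined from the confusion-matrix counts: true positives (**TP**), false positives (**FP**), false negatives (**FN**), and true negatives (**TN**).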

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]

\[
\text{Precision} = \frac{TP}{TP + FP}
\]

\[
\text{Recall} = \frac{TP}{TP + FN}
\]

\[
\text{F1} = 2 \times \frac{\text{Precision}\times \text{Recall}}{\text{Precision} + \text{Recall}}
\]

| Metric | Use when | Why |
|---|---|---|
| Accuracy | Classes roughly balanced | Global performance |
| Precision | False positives are costly | Fewer negatives mislabeled positive |
| Recall | False negatives are costly | Catch more positives |
| F1 | Need balance of P/R | Single balanced score |

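As a worked example of the four formulas, the sketch below evaluates them on a made-up confusion matrix. The counts and the helper name `binary_metrics` are illustrative assumptions, not results from this notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 reviews
m = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Here precision (0.80) is lower than recall (about 0.89) because the hypothetical model produces more false positives than false negatives; F1 lands between the two.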


## 3) Setup (Colab)
Run the following cell to install requirements.


In [None]:

!pip -q install transformers datasets scikit-learn kagglehub accelerate -U



## 4) Load & Clean IMDB 50K


In [None]:

import pandas as pd
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import kagglehub, shutil, os

# Download the dataset from Kaggle with kagglehub
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
print("Dataset downloaded to:", path)

# Copy to Drive so the dataset persists across sessions (the CSV below is read from this copy)
drive_path = '/content/drive/MyDrive/KaggleDatasets/IMDB_50K/'
os.makedirs(drive_path, exist_ok=True)
shutil.copytree(path, drive_path, dirs_exist_ok=True)
print("Dataset copied to Google Drive at:", drive_path)

# Load CSV
df = pd.read_csv(os.path.join(drive_path, 'IMDB Dataset.csv'))
print("Raw shape:", df.shape)
df.head()


In [None]:

# Clean reviews and encode labels
df['review_cleaned'] = (
    df['review']
      .str.replace('<br />', ' ', regex=False)
      .str.replace(r'\s+', ' ', regex=True)  # raw string avoids an invalid-escape warning
      .str.strip()
)
df['label'] = df['sentiment'].map({'negative': 0, 'positive': 1}).astype(int)
df[['review_cleaned','sentiment','label']].head(3)

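The same three cleaning steps can be checked on a single string without pandas. This is a small sketch using the standard `re` module; the function name `clean_review` and the sample text are made up for illustration.

```python
import re

def clean_review(text):
    """Mirror the pandas cleaning chain on one string."""
    text = text.replace('<br />', ' ')   # drop HTML line breaks
    text = re.sub(r'\s+', ' ', text)     # collapse runs of whitespace
    return text.strip()                  # trim leading/trailing spaces

sample = "  Great film.<br /><br />Loved   the\tending!  "
print(repr(clean_review(sample)))  # 'Great film. Loved the ending!'
```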


## 5) Train/Validation Split


In [None]:

from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(
    df['review_cleaned'].tolist(),
    df['label'].tolist(),
    test_size=0.2,
    random_state=42,
    stratify=df['label'].tolist()
)

len(train_texts), len(val_texts)

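To show what `stratify` buys you, here is a rough pure-Python sketch of a stratified split. It is not scikit-learn's actual algorithm; the helper `stratified_split` and the toy labels are assumptions for illustration. The idea is simply to split each class separately so both subsets keep the original class ratio.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, val_idx) with the class ratio preserved in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * test_size)  # per-class validation share
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

labels = [0] * 50 + [1] * 50  # toy balanced dataset
train_idx, val_idx = stratified_split(labels, test_size=0.2, seed=42)
print(len(train_idx), len(val_idx))                 # 80 20
print(sum(labels[i] for i in val_idx), "positives in validation")
```

On the toy data, the validation set ends up with exactly 10 positives and 10 negatives, matching the 50/50 ratio of the full set.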


## 6) Tokenization (RoBERTa) → Tensors
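We use `RobertaTokenizerFast` with `max_length=128`: shorter sequences are padded to 128 tokens and longer reviews are truncated. This speeds up training at the cost of discarding the tail of long reviews.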


In [None]:

import torch
from transformers import RobertaTokenizerFast

MAX_LEN = 128
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

def encode_texts(texts):
    enc = tokenizer(
        texts,
        padding='max_length',
        truncation=True,
        max_length=MAX_LEN,
        return_tensors='pt'
    )
    return enc['input_ids'], enc['attention_mask']

train_input_ids, train_attention = encode_texts(train_texts)
val_input_ids,   val_attention   = encode_texts(val_texts)

train_labels_t = torch.tensor(train_labels)
val_labels_t   = torch.tensor(val_labels)

train_input_ids.shape, val_input_ids.shape

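The effect of `padding='max_length'` and `truncation=True` can be sketched on plain lists of token ids. The special-token ids below (`<s>`=0, `<pad>`=1, `</s>`=2) are the values assumed for `roberta-base`; the content ids and the helper `pad_or_truncate` are hypothetical, and a tiny `max_len` is used so the output stays readable.

```python
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # assumed special-token ids for roberta-base

def pad_or_truncate(content_ids, max_len):
    """Wrap content ids in <s> ... </s>, then truncate or pad to max_len."""
    body = content_ids[: max_len - 2]        # leave room for <s> and </s>
    ids = [BOS_ID] + body + [EOS_ID]
    mask = [1] * len(ids)                    # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 = padding

short_ids, short_mask = pad_or_truncate([100, 200, 300], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(100, 120)), max_len=8)

print(short_ids, short_mask)  # padded on the right, mask marks padding with 0
print(long_ids, long_mask)    # truncated, mask is all 1s
```

The attention mask is what lets the model ignore padding positions, which is why it is passed alongside `input_ids` in the training loop.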


## 7) DataLoaders


In [None]:

from torch.utils.data import TensorDataset, DataLoader

train_ds = TensorDataset(train_input_ids, train_attention, train_labels_t)
val_ds   = TensorDataset(val_input_ids,   val_attention,   val_labels_t)

BATCH_SIZE = 16
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
val_loader   = DataLoader(val_ds,   batch_size=BATCH_SIZE, shuffle=False)

len(train_loader), len(val_loader)

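The two lengths printed above follow from simple arithmetic. A quick sketch, assuming the full 50,000-row dataset and the 80/20 split (so 40,000 training and 10,000 validation examples); the helper `num_batches` is illustrative:

```python
import math

def num_batches(n_examples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per pass over the data."""
    if drop_last:
        return n_examples // batch_size
    return math.ceil(n_examples / batch_size)

print(num_batches(40_000, 16))  # 2500 training batches per epoch
print(num_batches(10_000, 16))  # 625 validation batches
```

Because both sizes divide evenly by 16, every batch is full, so averaging per-batch accuracy in the training loop gives the same result as averaging over individual examples.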


## 8) Model, Optimizer, Scheduler


In [None]:

import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
if device.type == 'cuda':
    torch.cuda.manual_seed_all(seed)

model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
model.to(device)

EPOCHS = 2
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

total_steps = EPOCHS * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

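# Kept for reference only: the loop below uses `outputs.loss`, which the model
# computes as cross-entropy when `labels` is passed.
loss_fn = nn.CrossEntropyLoss()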

def batch_accuracy(logits, labels):
    preds = torch.argmax(logits, dim=1)
    return (preds == labels).float().mean().item()

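The linear schedule is easy to reason about by hand. The sketch below reimplements the learning-rate rule in plain Python as I understand it (linear warmup, then linear decay to zero); `linear_lr` is a hypothetical helper, and `total_steps=5000` assumes 2 epochs of 2,500 batches each.

```python
def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup + decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

BASE_LR, TOTAL = 2e-5, 5000
for step in (0, 1250, 2500, 3750, 5000):
    print(f"step {step:>4}: lr = {linear_lr(step, BASE_LR, TOTAL):.2e}")
```

With `num_warmup_steps=0` the rate starts at its peak of `2e-5`, is halved by the end of the first epoch, and reaches zero at the final step.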


## 9) Training & Validation Loop


In [None]:

from tqdm.auto import tqdm

for epoch in range(EPOCHS):
    print(f"\nEpoch {epoch+1}/{EPOCHS}")
    # ---- Train ----
    model.train()
    train_loss, train_acc = 0.0, 0.0
    for input_ids, attn, labels in tqdm(train_loader, desc="Training", leave=False):
        input_ids = input_ids.to(device)
        attn      = attn.to(device)
        labels    = labels.to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attn, labels=labels, return_dict=True)
        loss = outputs.loss
        logits = outputs.logits

        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

        train_loss += loss.item()
        train_acc  += batch_accuracy(logits.detach(), labels)

    train_loss /= len(train_loader)
    train_acc  /= len(train_loader)

    # ---- Validate ----
    model.eval()
    val_loss, val_acc = 0.0, 0.0
    all_logits = []
    all_labels = []
    with torch.no_grad():
        for input_ids, attn, labels in tqdm(val_loader, desc="Validating", leave=False):
            input_ids = input_ids.to(device)
            attn      = attn.to(device)
            labels    = labels.to(device)
            outputs = model(input_ids, attention_mask=attn, labels=labels, return_dict=True)
            val_loss += outputs.loss.item()
            val_acc  += batch_accuracy(outputs.logits, labels)
            all_logits.append(outputs.logits.cpu())
            all_labels.append(labels.cpu())

    val_loss /= len(val_loader)
    val_acc  /= len(val_loader)
    print(f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f} | Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")
    
# all_logits / all_labels hold the validation outputs of the final epoch only
logits = torch.cat(all_logits, dim=0)
labels = torch.cat(all_labels, dim=0)
probs = F.softmax(logits, dim=1).numpy()
preds = np.argmax(probs, axis=1)

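The last lines of the cell turn logits into probabilities and then into class predictions. A minimal plain-Python sketch of that step, using made-up logits, shows that softmax preserves ordering, so the predicted class is the same whether `argmax` is taken over logits or probabilities:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value (first index on ties)."""
    return max(range(len(row)), key=row.__getitem__)

logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, -0.5]]  # hypothetical [neg, pos] logits
probs = [softmax(r) for r in logits]

preds_from_probs = [argmax(p) for p in probs]
preds_from_logits = [argmax(r) for r in logits]
print(preds_from_probs, preds_from_logits)
```

The softmax is therefore only needed when you want the probabilities themselves, for example to report confidence.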

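The training loop clips gradients with `clip_grad_norm_(model.parameters(), 1.0)`. The sketch below illustrates the idea of clipping by global norm on plain lists; `clip_by_global_norm` is a hypothetical helper, and the small `eps` mirrors the constant I believe PyTorch adds to the denominator.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale up
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # two "parameters", global norm = 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))
```

All gradients are scaled by the same factor, so the update direction is unchanged; only its length is capped. Gradients whose norm is already below the threshold pass through untouched.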

## 10) Metrics (Accuracy, Precision, Recall, F1) + Report


In [None]:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
y_true = labels.numpy()
y_pred = preds

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average='binary')
rec = recall_score(y_true, y_pred, average='binary')
f1 = f1_score(y_true, y_pred, average='binary')

print(f"RoBERTa — Accuracy: {acc:.4f}, Precision: {prec:.4f}, Recall: {rec:.4f}, F1: {f1:.4f}\n")
print(classification_report(y_true, y_pred, target_names=['negative','positive']))

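To connect the scikit-learn calls back to the definitions in section 2, here is a small sketch that derives the confusion counts directly from label lists. The labels are made up and `confusion_counts` is an illustrative helper; with `average='binary'`, scikit-learn treats label `1` as the positive class by default.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print("precision:", tp / (tp + fp), "recall:", tp / (tp + fn))
```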


## 11) Overview Comparison Table (Paste your BERT metrics)
Enter your **BERT (base-uncased)** metrics from Lab Part 2, and we'll build a side-by-side table.


In [None]:

import pandas as pd

# === ENTER YOUR BERT METRICS HERE (percentages); replace the values below with your own results ===
BERT_accuracy   = 91.8
BERT_precision  = 91.8
BERT_recall     = 91.8
BERT_f1         = 91.8

overview = pd.DataFrame({
    'Model': ['BERT (base-uncased)', 'RoBERTa (base)'],
    'Accuracy':  [BERT_accuracy,  round(acc*100, 1)],
    'Precision': [BERT_precision, round(prec*100,1)],
    'Recall':    [BERT_recall,    round(rec*100, 1)],
    'F1':        [BERT_f1,        round(f1*100,  1)],
})
overview



## 12) Save Outputs


In [None]:

csv_path = '/content/roberta_vs_bert_overview_loop.csv'
overview.to_csv(csv_path, index=False)
print(f"Saved comparison CSV to: {csv_path}")



## 13) Quick Inference on Custom Sentences


In [None]:

test_sentences = [
    "I love this movie! It was fantastic.",
    "The product broke after one use, terrible experience.",
    "Not bad, but could be better."
]

enc = tokenizer(
    test_sentences,
    padding='max_length',
    truncation=True,
    max_length=MAX_LEN,
    return_tensors='pt'
)

model.eval()
with torch.no_grad():
    out = model(enc['input_ids'].to(device), attention_mask=enc['attention_mask'].to(device), return_dict=True)
    p = F.softmax(out.logits, dim=1).cpu().numpy()
    pred = p.argmax(axis=1)

for s, pp, pr in zip(test_sentences, p, pred):
    print(f"Text: {s}\nProb [neg,pos]: {pp} -> Pred: {['negative','positive'][pr]}\n")
