## Cross-domain vs. in-domain evaluation (what it means)

**Domain** = the “type” of text data: writing style, vocabulary, source, topic, and label patterns.

- **In-domain** evaluation: train and test on the same dataset distribution.
  Example: `train on DATA_REVIEWS_REAL → test on DATA_REVIEWS_REAL`.
- **Cross-domain** evaluation: train on one dataset, test on a different dataset distribution.
  Example: `train on DATA_REVIEWS_REAL → test on DATA_REVIEWS`.

Cross-domain results are a practical way to measure **generalization** and detect **domain shift** (performance drop caused by differences in data distribution).

---

## How the LSTM model works (very short)

Architecture used in the notebook:

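`Embedding → Dropout → LSTM → Dropout → Linear`

The model returns raw logits. Softmax is applied implicitly inside `CrossEntropyLoss` during training, and predictions take the `argmax` of the logits.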

1. **Embedding** maps token IDs to dense vectors.
2. **LSTM** processes the sequence step-by-step and maintains a memory state using gates:
   - **Forget gate**: what to discard
   - **Input gate**: what to store
   - **Output gate**: what to expose as output
   This helps capture word order and longer dependencies (e.g., negations like “not good”).
3. **Linear layer** takes the final hidden state (sentence representation) and outputs logits for classes `{0, 1}`.
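
The gate mechanics in step 2 can be made concrete with a toy cell. This is a minimal sketch, assuming the standard LSTM update equations and a hidden size of 1. The function name `lstm_step` and the weight layout are illustrative and not taken from the notebook, which relies on `nn.LSTM`.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, weights):
    """One time step of a scalar LSTM cell (hidden size 1, illustrative only).

    `weights` maps each gate name to a (w_x, w_h, bias) triple.
    """
    def pre(name):
        w_x, w_h, b = weights[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the candidate to store
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # new candidate content

    c = f * c_prev + i * g           # updated cell (memory) state
    h = o * math.tanh(c)             # updated hidden state (the step's output)
    return h, c

# Hypothetical weights: a large forget bias and a very negative input bias
# make the cell keep its previous memory almost unchanged.
weights = {
    "forget": (0.0, 0.0, 5.0),
    "input": (0.0, 0.0, -5.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, weights=weights)
```

With these made-up weights the new cell state stays close to the previous value of 2.0, which is the behaviour that lets an LSTM carry a cue such as "not" forward to a later word.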

Training setup (high level):
- `CrossEntropyLoss`, `Adam`, gradient clipping
- stratified split into train/validation/test
- evaluation on both in-domain and cross-domain sets
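
The loss listed above can be written out by hand for a single example. This is a minimal pure-Python sketch, assuming the usual definition of cross-entropy on logits (log-softmax followed by the negative log-likelihood of the target class). The helper name `cross_entropy_from_logits` is illustrative.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Equal logits mean a 50/50 prediction, so the loss is log(2) ≈ 0.693.
loss_uncertain = cross_entropy_from_logits([0.0, 0.0], target=1)

# A confident, correct prediction gives a loss close to 0.
loss_confident = cross_entropy_from_logits([-4.0, 4.0], target=1)
```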

---

## LSTM vs Transformer (quick comparison)

### Sequence processing
- **LSTM** reads tokens **sequentially** (step-by-step).
- **Transformer** reads tokens **in parallel** using **self-attention**.

### Capturing context
- **LSTM** stores information in a hidden state; long-range dependencies are harder to capture because the relevant signal must survive many sequential updates.
- **Transformer** uses self-attention to directly relate any token to any other token, which typically improves handling of:
  - long sentences
  - negation/sarcasm
  - complex context interactions
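
To illustrate how self-attention relates every token to every other token in one step, here is a small sketch of scaled dot-product attention in plain Python. It is a simplification: real Transformers first apply learned query/key/value projections and use multiple heads, both of which are omitted here. The embeddings are made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    d = len(tokens[0])
    outputs, all_weights = [], []
    for q in tokens:
        # Similarity of this token to every token in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all token vectors
        out = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-d embeddings for three tokens
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1 and is computed independently of the others, which is why the rows can be evaluated in parallel rather than step by step.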

### Generalization across domains
- **LSTM (with simple tokenization + small vocab)** is often more sensitive to domain shift because it relies heavily on surface-level patterns and vocabulary overlap.
- **Transformer (pretrained on massive text corpora)** usually generalizes better cross-domain because it starts with rich language representations learned during pretraining.
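
One rough way to quantify vocabulary overlap between two domains is the out-of-vocabulary (OOV) rate: the share of test tokens that the training vocabulary has never seen. The sketch below is an illustration only. `simple_tokenize` and `oov_rate` are hypothetical helpers, and the sentences are toy data, not samples from the notebook's datasets.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Lowercase and keep runs of letters, digits, and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def oov_rate(train_texts, test_texts, min_freq=1):
    """Fraction of test tokens that appear fewer than `min_freq` times in training."""
    counts = Counter(tok for t in train_texts for tok in simple_tokenize(t))
    known = {w for w, c in counts.items() if c >= min_freq}
    test_tokens = [tok for t in test_texts for tok in simple_tokenize(t)]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in known)
    return unseen / len(test_tokens)

# Toy, made-up sentences standing in for two domains
train = ["the food was great", "the service was slow"]
same_domain = ["the food was slow"]
other_domain = ["battery life is great"]

in_domain_oov = oov_rate(train, same_domain)      # 0.0
cross_domain_oov = oov_rate(train, other_domain)  # 0.75
```

Every unseen token is mapped to `<unk>` by a word-level model, so a high OOV rate means much of the input carries no usable signal.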

### Compute / practicality
- **LSTM**: lightweight, fast, easier to deploy offline; a good baseline and a good model to learn from.
- **Transformer**: heavier, usually better accuracy, but more compute/memory.

---

## Why this notebook matters
By training on one dataset and evaluating on both, we explicitly measure:
- **in-domain performance** (how well the model fits the training distribution)
- **cross-domain performance** (how well it generalizes to different text distributions)

A large drop in cross-domain metrics is a strong indicator of **domain shift**.


In [1]:
# ============================================
# 1) Imports & reproducibility
# ============================================
import re
import random
from dataclasses import dataclass

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


device(type='cpu')

In [2]:
# ============================================
# 2) Load datasets (TSV)
# ============================================
REAL_PATH  = "data/reviews_dataset_real.tsv"
SMALL_PATH = "data/reviews_dataset.tsv"

df_real  = pd.read_csv(REAL_PATH,  sep="\t")
df_small = pd.read_csv(SMALL_PATH, sep="\t")

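for df in (df_real, df_small):
    # Drop rows with a missing label BEFORE casting: astype(int) fails on NaN
    df.dropna(subset=["Liked"], inplace=True)
    # Missing review text becomes an empty string
    df["Review"] = df["Review"].fillna("").astype(str)
    df["Liked"]  = df["Liked"].astype(int)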

print("REAL :", df_real.shape)
print("SMALL:", df_small.shape)

print("\nLabel distribution:")
print("REAL :", df_real["Liked"].value_counts().to_dict())
print("SMALL:", df_small["Liked"].value_counts().to_dict())


REAL : (68221, 3)
SMALL: (6000, 2)

Label distribution:
REAL : {1: 38013, 0: 30208}
SMALL: {1: 3000, 0: 3000}


In [3]:
# ============================================
# 3) Tokenization & vocabulary
# ============================================
# Matches runs of characters that are NOT letters, digits, whitespace, or apostrophes
TOKEN_RE = re.compile(r"[^a-zA-Z0-9\s']+")

def tokenize(text: str) -> list[str]:
    text = text.lower().strip()
    text = TOKEN_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text)
    return text.split() if text else []

def build_vocab(texts, min_freq: int = 2) -> dict[str,int]:
    vocab = {"<pad>": 0, "<unk>": 1}
    counts = {}
    for t in texts:
        for w in tokenize(t):
            counts[w] = counts.get(w, 0) + 1
    idx = 2
    for w, c in sorted(counts.items(), key=lambda x: (-x[1], x[0])):
        if c >= min_freq:
            vocab[w] = idx
            idx += 1
    return vocab


In [4]:
# ============================================
# 4) Dataset, collate (padding), config
# ============================================
@dataclass
class Config:
    max_len: int = 120
    batch_size: int = 64
    emb_dim: int = 128
    hid_dim: int = 128
    lr: float = 1e-3
    epochs: int = 5

cfg = Config()

class ReviewsDataset(Dataset):
    def __init__(self, texts, labels, vocab):
        self.texts = list(texts)
        self.labels = list(labels)
        self.vocab = vocab

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        tokens = tokenize(self.texts[idx])[:cfg.max_len]
        ids = [self.vocab.get(w, 1) for w in tokens]  # 1 = <unk>
        if len(ids) == 0:
            ids = [1]
        return torch.tensor(ids, dtype=torch.long), torch.tensor(int(self.labels[idx]), dtype=torch.long)

def collate_fn(batch, pad_id: int = 0):
    xs, ys = zip(*batch)
    lengths = torch.tensor([len(x) for x in xs], dtype=torch.long)
    max_len = int(lengths.max().item())
    padded = torch.full((len(xs), max_len), fill_value=pad_id, dtype=torch.long)
    for i, x in enumerate(xs):
        padded[i, :len(x)] = x
    return padded, lengths, torch.stack(ys)
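
To make the padding step concrete without tensors, here is a plain-Python sketch of what `collate_fn` produces for a toy batch. The helper `pad_batch` is hypothetical and only mirrors the logic above using lists. The token IDs are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad variable-length ID lists to the longest one in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

# Three tokenized reviews of different lengths
batch = [[5, 9, 2], [7], [4, 1]]
padded, lengths = pad_batch(batch)
# padded  -> [[5, 9, 2], [7, 0, 0], [4, 1, 0]]
# lengths -> [3, 1, 2]
```

Returning the true lengths alongside the padded batch is what lets the model skip the pad positions later.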


In [5]:
# ============================================
# 5) Model
# ============================================
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, cfg.emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(cfg.emb_dim, cfg.hid_dim, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        self.fc = nn.Linear(cfg.hid_dim, 2)

    def forward(self, x, lengths):
        emb = self.dropout(self.embedding(x))
        packed = nn.utils.rnn.pack_padded_sequence(
            emb, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h, _) = self.lstm(packed)
        logits = self.fc(self.dropout(h[-1]))
        return logits


In [6]:
# ============================================
# 6) Split helper (train/val/test)
# ============================================
def split_train_val_test(df: pd.DataFrame, test_size=0.2, val_size=0.1):
    # First split off TEST
    X = df["Review"].tolist()
    y = df["Liked"].tolist()
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=test_size, random_state=SEED, stratify=y
    )
    # Then split TRAIN vs VAL from the remaining
    # val_size is relative to trainval, not the original dataset
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=val_size, random_state=SEED, stratify=y_trainval
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
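
Because `val_size` applies to what remains after the test split, the defaults give roughly 72% train, 8% validation, and 20% test. The sketch below reproduces that arithmetic. It assumes the held-out part of each stage is rounded up, which is my understanding of how scikit-learn handles a float `test_size`. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n, test_size=0.2, val_size=0.1):
    """Sizes of (train, val, test) for the two-stage split."""
    n_test = math.ceil(test_size * n)         # stage 1: hold out the test set
    n_trainval = n - n_test
    n_val = math.ceil(val_size * n_trainval)  # stage 2: hold out validation
    n_train = n_trainval - n_val
    return n_train, n_val, n_test

print(split_sizes(68221))  # (49118, 5458, 13645)
print(split_sizes(6000))   # (4320, 480, 1200)
```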


In [7]:
# ============================================
# 7) Training & evaluation (test-only metrics)
# ============================================
def make_loader(texts, labels, vocab, shuffle: bool):
    ds = ReviewsDataset(texts, labels, vocab)
    return DataLoader(ds, batch_size=cfg.batch_size, shuffle=shuffle, collate_fn=collate_fn)

def train_on_split(X_train, y_train, X_val, y_val):
    # Build vocab ONLY from training text (no peeking)
    vocab = build_vocab(X_train, min_freq=2)

    model = SentimentLSTM(len(vocab)).to(device)
    optimizer = optim.Adam(model.parameters(), lr=cfg.lr)
    criterion = nn.CrossEntropyLoss()

    train_loader = make_loader(X_train, y_train, vocab, shuffle=True)
    val_loader   = make_loader(X_val,   y_val,   vocab, shuffle=False)

    best_val_acc = -1.0
    best_state = None

    for epoch in range(1, cfg.epochs + 1):
        # ---- train ----
        model.train()
        for x, lengths, y in train_loader:
            x, lengths, y = x.to(device), lengths.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x, lengths)
            loss = criterion(logits, y)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

        # ---- validate (sanity) ----
        model.eval()
        all_p, all_t = [], []
        with torch.no_grad():
            for x, lengths, y in val_loader:
                x, lengths = x.to(device), lengths.to(device)
                logits = model(x, lengths)
                preds = torch.argmax(logits, dim=1).cpu().numpy()
                all_p.extend(preds)
                all_t.extend(y.numpy())
        val_acc = accuracy_score(all_t, all_p)
        print(f"Epoch {epoch}/{cfg.epochs} | val_acc={val_acc:.4f}")

        # Keep best model by validation accuracy
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

    # Restore best weights
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, vocab, best_val_acc

@torch.no_grad()
def evaluate_on_test(model, vocab, X_test, y_test, title: str):
    loader = make_loader(X_test, y_test, vocab, shuffle=False)
    model.eval()

    all_preds, all_true = [], []
    for x, lengths, y in loader:
        x, lengths = x.to(device), lengths.to(device)
        logits = model(x, lengths)
        preds = torch.argmax(logits, dim=1).cpu().numpy()
        all_preds.extend(preds)
        all_true.extend(y.numpy())

    acc = accuracy_score(all_true, all_preds)
    print(f"=== {title} ===")
    print("Test accuracy:", acc)
    print(classification_report(all_true, all_preds, digits=4))
    print("Confusion matrix:\n", confusion_matrix(all_true, all_preds))
    return acc
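
The training loop calls `clip_grad_norm_` with a maximum norm of 1.0. The idea behind it can be sketched in plain Python as follows, assuming a flat list of gradient values and L2-norm clipping. `clip_by_global_norm` is an illustrative name, not a PyTorch function.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale gradient values so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale is capped at 1.0, so small gradients are left unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
# norm_before is 5.0; the clipped vector keeps its direction with norm ≈ 1.0
```

Clipping changes only the length of the gradient vector, not its direction, which helps keep recurrent models stable when gradients spike.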


## 8) Create splits for both datasets

In [8]:
# REAL splits
(real_train, real_y_train), (real_val, real_y_val), (real_test, real_y_test) = split_train_val_test(df_real)

# SMALL splits
(small_train, small_y_train), (small_val, small_y_val), (small_test, small_y_test) = split_train_val_test(df_small)

print("REAL  :", len(real_train), len(real_val), len(real_test))
print("SMALL :", len(small_train), len(small_val), len(small_test))


REAL  : 49118 5458 13645
SMALL : 4320 480 1200


## 9) Experiment A — Train on REAL, test on REAL and SMALL (test splits only)

In [9]:
model_real, vocab_real, best_val_real = train_on_split(real_train, real_y_train, real_val, real_y_val)

acc_rr = evaluate_on_test(model_real, vocab_real, real_test,  real_y_test,  "Train REAL → Test REAL")
acc_rs = evaluate_on_test(model_real, vocab_real, small_test, small_y_test, "Train REAL → Test SMALL")


Epoch 1/5 | val_acc=0.8318
Epoch 2/5 | val_acc=0.8835
Epoch 3/5 | val_acc=0.9007
Epoch 4/5 | val_acc=0.9095
Epoch 5/5 | val_acc=0.9093
=== Train REAL → Test REAL ===
Test accuracy: 0.9057530230853793
              precision    recall  f1-score   support

           0     0.8940    0.8931    0.8935      6042
           1     0.9151    0.9158    0.9155      7603

    accuracy                         0.9058     13645
   macro avg     0.9045    0.9045    0.9045     13645
weighted avg     0.9057    0.9058    0.9057     13645

Confusion matrix:
 [[5396  646]
 [ 640 6963]]
=== Train REAL → Test SMALL ===
Test accuracy: 0.7066666666666667
              precision    recall  f1-score   support

           0     0.7116    0.6950    0.7032       600
           1     0.7020    0.7183    0.7100       600

    accuracy                         0.7067      1200
   macro avg     0.7068    0.7067    0.7066      1200
weighted avg     0.7068    0.7067    0.7066      1200

Confusion matrix:
 [[417 183]
 [169 431]]

## 10) Experiment B — Train on SMALL, test on SMALL and REAL (test splits only)

In [10]:
model_small, vocab_small, best_val_small = train_on_split(small_train, small_y_train, small_val, small_y_val)

acc_ss = evaluate_on_test(model_small, vocab_small, small_test, small_y_test, "Train SMALL → Test SMALL")
acc_sr = evaluate_on_test(model_small, vocab_small, real_test,  real_y_test,  "Train SMALL → Test REAL")


Epoch 1/5 | val_acc=1.0000
Epoch 2/5 | val_acc=1.0000
Epoch 3/5 | val_acc=1.0000
Epoch 4/5 | val_acc=1.0000
Epoch 5/5 | val_acc=1.0000
=== Train SMALL → Test SMALL ===
Test accuracy: 0.9966666666666667
              precision    recall  f1-score   support

           0     1.0000    0.9933    0.9967       600
           1     0.9934    1.0000    0.9967       600

    accuracy                         0.9967      1200
   macro avg     0.9967    0.9967    0.9967      1200
weighted avg     0.9967    0.9967    0.9967      1200

Confusion matrix:
 [[596   4]
 [  0 600]]
=== Train SMALL → Test REAL ===
Test accuracy: 0.4881641626969586
              precision    recall  f1-score   support

           0     0.4305    0.4831    0.4553      6042
           1     0.5451    0.4922    0.5173      7603

    accuracy                         0.4882     13645
   macro avg     0.4878    0.4876    0.4863     13645
weighted avg     0.4944    0.4882    0.4898     13645

Confusion matrix:
 [[2919 3123]
 [3861 3742]]

## 11) Summary

In [11]:
summary = pd.DataFrame([
    {"train":"REAL",  "test":"REAL",  "test_accuracy":acc_rr, "best_val_acc":best_val_real},
    {"train":"REAL",  "test":"SMALL", "test_accuracy":acc_rs, "best_val_acc":best_val_real},
    {"train":"SMALL", "test":"SMALL", "test_accuracy":acc_ss, "best_val_acc":best_val_small},
    {"train":"SMALL", "test":"REAL",  "test_accuracy":acc_sr, "best_val_acc":best_val_small},
])

summary


   train   test  test_accuracy  best_val_acc
0   REAL   REAL       0.905753      0.909491
1   REAL  SMALL       0.706667      0.909491
2  SMALL  SMALL       0.996667      1.000000
3  SMALL   REAL       0.488164      1.000000
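
The size of the domain shift can be read off the summary as the in-domain accuracy minus the cross-domain accuracy. The sketch below does that arithmetic using the test accuracies reported above. `domain_gap` is a hypothetical helper, and the majority-class baseline is derived from the REAL test support counts (7603 of 13645 examples have label 1).

```python
# Test accuracies copied from the summary table
results = {
    ("REAL", "REAL"): 0.905753,
    ("REAL", "SMALL"): 0.706667,
    ("SMALL", "SMALL"): 0.996667,
    ("SMALL", "REAL"): 0.488164,
}

def domain_gap(results, train_name):
    """In-domain accuracy minus cross-domain accuracy for one training set."""
    in_domain = results[(train_name, train_name)]
    cross = next(acc for (tr, te), acc in results.items()
                 if tr == train_name and te != train_name)
    return in_domain - cross

gap_real = domain_gap(results, "REAL")    # about 0.199
gap_small = domain_gap(results, "SMALL")  # about 0.509

# Accuracy of always predicting label 1 on the REAL test split
majority_real = 7603 / 13645              # about 0.557
```

The model trained on SMALL reaches 0.488 on REAL, which is below the majority-class baseline of about 0.557. In other words, what it learned on SMALL does not transfer to REAL at all.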
