
In [None]:
import pandas as pd

csv_url = "https://raw.githubusercontent.com/lorenzospolti/DL.19.06.35/main/input_data.csv"
df = pd.read_csv(csv_url)

# Hotel Review Multi‑Task Pipeline
**Author:** Lorenzo Spolti (ID: 535467)



## 0  Load Data

* `Review`: raw review text
* `Review_Type`: `Good_review` / `Bad_review`, mapped to 1 / 0 during pre-processing
* `Review_Score`: numeric score from 1 to 10
* `Hotel_Name`, `Reviewer_Nationality`: categorical
* `Hotel_number_reviews`: numeric
* `Review_Date`: date, converted to days since the earliest review

In [None]:
import pandas as pd
df = pd.read_csv(csv_url)


## 1  Model Overview
We follow the exact design described in **“Answer to the exam.rtf”**:

1. **Pre‑trained lightweight Transformer** (BERT‑tiny) provides language features  
2. **WordPiece tokenisation** is inherited from the same BERT model  
3. **Two additional branches** on top of the pooled `[CLS]` representation:
   * **Head A** – binary classifier (`sigmoid`) → review type  
   * **Head B** – regression (`linear`) → review score  
4. **Structured features** are exploited by a small MLP and concatenated to the pooled text vector

All pre‑trained weights are **frozen** by default; only the new layers learn.



## 2  Input Pre‑processing
### 2.1  WordPiece Tokeniser

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')
```

### 2.2  Categorical → Embeddings  
We factorise each categorical column and keep the integer codes.
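
As a dependency-free illustration of what factorising means, the sketch below assigns integer codes in order of first appearance, which is the behaviour `pd.factorize` has for non-missing values (pandas additionally assigns `-1` to missing values, which an embedding layer cannot index). The function name and the hotel labels are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def factorize(values: Iterable[Hashable]) -> Tuple[List[int], List[Hashable]]:
    """Map each distinct value to an integer code in order of first appearance."""
    index: Dict[Hashable, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(index)  # next unused code
        codes.append(index[value])
    return codes, list(index)


# Hypothetical hotel labels, for illustration only.
hotels = ["Hotel A", "Hotel B", "Hotel A", "Hotel C"]
codes, uniques = factorize(hotels)
print(codes)         # [0, 1, 0, 2]
print(len(uniques))  # 3, the cardinality an embedding table would need
```

The number of distinct values is what later sizes each embedding table, so codes must stay within `0 .. cardinality - 1`.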

### 2.3  Numeric → Standard Scaler  
`Hotel_number_reviews` and a **days-since** version of `Review_Date` (days elapsed since the earliest review date) are z-scored.
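
The two numeric steps can be sketched in plain Python, assuming the same definitions the notebook relies on: days counted from the earliest date, and z-scoring with the population standard deviation (the convention `StandardScaler` uses). The helper names and the sample dates are hypothetical.

```python
from datetime import date
from statistics import fmean, pstdev
from typing import List, Sequence


def days_since_first(dates: Sequence[date]) -> List[int]:
    """Days elapsed between each date and the earliest date in the sequence."""
    first = min(dates)
    return [(d - first).days for d in dates]


def z_score(values: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


# Hypothetical review dates, for illustration only.
review_dates = [date(2017, 1, 1), date(2017, 1, 11), date(2017, 1, 21)]
days = days_since_first(review_dates)
print(days)  # [0, 10, 20]
scaled = z_score(days)
print([round(s, 4) for s in scaled])
```

After scaling, the column has zero mean and unit variance, so the two numeric features enter the model on comparable scales.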


In [None]:
from transformers import AutoTokenizer
from sklearn.preprocessing import StandardScaler
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')

# --- categorical ---
cat_cols = ['Hotel_Name', 'Reviewer_Nationality']
cat_maps = {c: pd.factorize(df[c])[0] for c in cat_cols}
cat_tensors = [torch.tensor(v, dtype=torch.long) for v in cat_maps.values()]

# --- numeric ---
# make 'Review_Date' a numeric (days since first date)
df['Review_Date'] = pd.to_datetime(df['Review_Date'])
df['days_since'] = (df['Review_Date'] - df['Review_Date'].min()).dt.days
num_cols = ['Hotel_number_reviews', 'days_since']

scaler = StandardScaler()
num_array = scaler.fit_transform(df[num_cols]).astype('float32')
num_tensor = torch.tensor(num_array)

# --- text ---
# Add max_length and return_token_type_ids=False to truncate and prevent token_type_ids
enc = tokenizer(df['Review'].tolist(), padding=True, truncation=True, max_length=512, return_tensors='pt', return_token_type_ids=False)
input_ids = enc['input_ids']
attention = enc['attention_mask']
# token_type_ids will not be returned by the tokenizer now

# --- targets ---
# Map 'Bad_review' to 0 and 'Good_review' to 1, fill any resulting NaNs, and convert to int64
df['Review_Type'] = df['Review_Type'].map({'Bad_review': 0, 'Good_review': 1}).fillna(0).astype('int64')
y = torch.tensor(df['Review_Type'].values, dtype=torch.float32)
scores = torch.tensor(df['Review_Score'].values, dtype=torch.float32)
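
One consequence of the `.map(...).fillna(0)` step above is that any label other than the two expected strings is silently counted as negative. The plain-Python sketch below mirrors that behaviour and shows a simple check for unexpected labels; `encode_labels`, `unmapped` and the sample values are hypothetical.

```python
from typing import List, Optional

LABELS = {"Bad_review": 0, "Good_review": 1}


def encode_labels(raw: List[Optional[str]]) -> List[int]:
    """Mirror map + fillna(0): unknown or missing labels become 0."""
    return [LABELS.get(value, 0) for value in raw]


def unmapped(raw: List[Optional[str]]) -> List[Optional[str]]:
    """Return the values that the mapping does not recognise."""
    return [value for value in raw if value not in LABELS]


# Hypothetical raw labels, including a casing variant and a missing value.
raw = ["Good_review", "Bad_review", "good_review", None]
print(encode_labels(raw))  # [1, 0, 0, 0]
print(unmapped(raw))       # ['good_review', None]
```

Running a check like `unmapped` before encoding makes it visible when rows are being relabelled as negative by default.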


## 4  Dataset & DataLoaders

In [None]:
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split

# Stack categorical codes into a single tensor [N, C]
cats = torch.stack(cat_tensors, dim=1)
# Numeric is [N, num_features]
nums = num_tensor

idx = torch.arange(len(df))
train_idx, val_idx = train_test_split(idx, test_size=0.2, stratify=y, random_state=42)

train_ds = TensorDataset(
    input_ids[train_idx], attention[train_idx],
    cats[train_idx], nums[train_idx],
    y[train_idx], scores[train_idx]
)
val_ds = TensorDataset(
    input_ids[val_idx], attention[val_idx],
    cats[val_idx], nums[val_idx],
    y[val_idx], scores[val_idx]
)

BATCH = 8 # Reduced batch size to try and prevent CUDA out of memory
train_loader = DataLoader(train_ds, batch_size=BATCH, shuffle=True, num_workers=0)
val_loader   = DataLoader(val_ds,   batch_size=BATCH*2, num_workers=0)
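
To illustrate what `stratify=y` asks for, here is a minimal plain-Python sketch of a stratified split: indices are grouped by class and the same fraction is held out from each class. It is a simplified stand-in, not scikit-learn's implementation, and `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[int], test_size: float, seed: int) -> Tuple[List[int], List[int]]:
    """Hold out `test_size` of the indices of every class."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train: List[int] = []
    val: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


# Toy labels: 70 positive and 30 negative reviews.
labels = [1] * 70 + [0] * 30
train_idx, val_idx = stratified_split(labels, test_size=0.2, seed=42)
print(len(train_idx), len(val_idx))                     # 80 20
print(sum(labels[i] for i in val_idx) / len(val_idx))   # 0.7
```

The validation set keeps the 70/30 class ratio of the full data, which is the property the split above relies on for comparable metrics.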


## 5  Model Definition
The diagram matches the RTF answer exactly:

```
Input text        →  Transformer  →  [CLS]
Input categorical →  Embeddings –┐
Input numerical   →  MLP         ├─ concat → Dense → ReLU →    HEAD A  (sigmoid)
                                 └─────────┤
                                           └→ Dense → Dropout → HEAD B (linear)
```
All Transformer layers are frozen by default. The top **N** layers are unfrozen only when `UNFREEZE_TOP_N > 0`; in the cross-validation pipeline further down, this setting is `cfg["unfreeze"]`.
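
The "top N layers" rule comes down to a list slice, with one pitfall worth spelling out: a slice with `-0` selects every layer, so the `> 0` guard is required. The sketch below uses plain strings in place of real Transformer layers; `layers_to_unfreeze` and the layer names are hypothetical.

```python
from typing import List


def layers_to_unfreeze(layer_names: List[str], top_n: int) -> List[str]:
    """Return the last `top_n` layers, or an empty list when top_n <= 0."""
    if top_n <= 0:
        # Guard needed: layer_names[-0:] would return every layer.
        return []
    return layer_names[-top_n:]


# Hypothetical 4-layer encoder.
layers = ["layer.0", "layer.1", "layer.2", "layer.3"]
print(layers_to_unfreeze(layers, 0))  # []
print(layers_to_unfreeze(layers, 2))  # ['layer.2', 'layer.3']
print(layers[-0:])                    # all four layers, hence the guard
```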


In [None]:
import torch.nn as nn
from transformers import AutoModel

class ReviewModel(nn.Module):
    def __init__(self,
                 pretrained='prajjwal1/bert-tiny',
                 cat_cardinals=None,
                 num_features=2,
                 cat_dim=32,
                 proj_dim=32,
                 head_dim=128):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(pretrained)
        self.hidden = self.text_encoder.config.hidden_size

        # Store the number of hidden layers for unfreezing logic
        self.num_backbone_layers = self.text_encoder.config.num_hidden_layers

        # Freeze everything
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)

        # Embeddings for categorical cols
        self.cat_embeddings = nn.ModuleDict({
            name: nn.Embedding(card, cat_dim)
            for name, card in cat_cardinals.items()
        })
        total_cat = cat_dim * len(cat_cardinals)

        # Projection for numeric
        self.num_proj = nn.Sequential(
            nn.Linear(num_features, proj_dim),
            nn.ReLU()
        )

        fused_dim = self.hidden + total_cat + proj_dim
        self.shared = nn.Sequential(
            nn.Linear(fused_dim, head_dim),
            nn.ReLU()
        )

        # Heads
        self.cls_head = nn.Sequential(nn.Linear(head_dim, 1))  # sigmoid later
        self.reg_head = nn.Linear(head_dim, 1)

        # backbone alias for training script (keep original config)
        self.backbone = self.text_encoder


    def forward(self, input_ids, attention_mask, cats, nums):
        # Explicitly create token_type_ids with zeros
        token_type_ids = torch.zeros_like(input_ids)
        out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        pooled = out.pooler_output if hasattr(out, 'pooler_output') and out.pooler_output is not None else out.last_hidden_state[:,0]

        cat_vecs = [self.cat_embeddings[name](cats[:, i]) for i, name in enumerate(self.cat_embeddings)]
        cat_concat = torch.cat(cat_vecs, dim=-1)

        num_vec = self.num_proj(nums)

        x = torch.cat([pooled, cat_concat, num_vec], dim=-1)
        h = self.shared(x)

        logit = self.cls_head(h).squeeze(-1)   # BCEWithLogitsLoss will apply sigmoid
        score = self.reg_head(h).squeeze(-1)
        return logit, score

# Instantiate
cat_cardinals = {c: int(df[c].nunique()) for c in cat_cols}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ReviewModel(cat_cardinals=cat_cardinals).to(device)
print('Model instantiated. Hidden size:', model.hidden)

Model instantiated. Hidden size: 128
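
As a worked check of the dimensions, the arithmetic below combines the hidden size of 128 printed above with the `ReviewModel` defaults (`cat_dim=32`, `proj_dim=32`, `head_dim=128`) and the two categorical columns. The helper `linear_params` is a hypothetical name; it counts the weights plus biases of a fully connected layer.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


hidden_size = 128   # printed by the cell above
cat_dim = 32        # ReviewModel default
n_cat_cols = 2      # Hotel_Name, Reviewer_Nationality
proj_dim = 32       # ReviewModel default
head_dim = 128      # ReviewModel default

# Pooled text vector + one embedding per categorical column + numeric projection.
fused_dim = hidden_size + cat_dim * n_cat_cols + proj_dim
print(fused_dim)  # 224

shared_params = linear_params(fused_dim, head_dim)
print(shared_params)  # 28800

head_params = 2 * linear_params(head_dim, 1)  # classification + regression head
print(head_params)  # 258
```

The categorical embedding tables are not counted here because their size depends on the number of distinct hotels and nationalities in the data.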


# Hotel-Review Pipeline

The pipeline is split into two blocks:
1. **Training**: returns the trained model and its raw predictions on the validation set
2. **Evaluation**: compares those predictions with the ground truth to compute
   * accuracy, precision, recall, F1 (Head A)
   * MSE and RMSE (Head B)

Both blocks are configured through a small `cfg` dict.
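
For reference, the metrics named above can be written out in plain Python. This is a minimal sketch assuming a 0.5 threshold on the predicted probability and the usual convention that a metric with an empty denominator is reported as 0; the function names and toy values are hypothetical.

```python
import math
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_prob: Sequence[float],
                           threshold: float = 0.5) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for thresholded probabilities."""
    y_pred = [int(p > threshold) for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Mean squared error and its square root."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"mse": mse, "rmse": math.sqrt(mse)}


# Toy values, for illustration only.
cls = classification_metrics([1, 0, 1, 1], [0.9, 0.6, 0.4, 0.8])
reg = regression_metrics([9.0, 5.0], [8.0, 7.0])
print(cls["accuracy"])  # 0.5
print(reg["mse"])       # 2.5
```

RMSE is expressed in the same units as the review score, which makes it easier to read than MSE.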



In [None]:

from __future__ import annotations
import math, random
from typing import Dict, Any, List, Tuple

import numpy as np, pandas as pd, torch, torch.nn as nn, torch.nn.functional as F
from sklearn.model_selection import GroupKFold
from sklearn import metrics as skm
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import StandardScaler # Import StandardScaler

device = "cuda" if torch.cuda.is_available() else "cpu"
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# ────────────────────────── Dataset ──────────────────────────
class ReviewDS(torch.utils.data.Dataset):
    def __init__(self, df: pd.DataFrame, tok: AutoTokenizer, max_len: int = 128):
        self.df, self.tok, self.max_len = df.reset_index(drop=True), tok, max_len
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        enc = self.tok(row.Review, truncation=True, padding="max_length", max_length=self.max_len, return_tensors="pt")
        item = {k: v.squeeze(0) for k, v in enc.items()}
        # Keep cat_feats as Long tensors for embedding layers
        item["cat"]   = torch.tensor(np.array(row.cat_feats), dtype=torch.long)
        item["num"]   = torch.tensor(np.array(row.num_feats), dtype=torch.float32)
        item["label"] = torch.tensor(row.Review_Type, dtype=torch.float32)
        item["score"] = torch.tensor(row.Review_Score, dtype=torch.float32)
        return item

# ─────────────────────────── Model ───────────────────────────
class MTModel(nn.Module):
    def __init__(self, backbone: str, cat_cardinals: Dict[str, int], n_num: int, dropout: float, unfrozen: int, cat_dim: int = 32):
        super().__init__()
        self.tfm = AutoModel.from_pretrained(backbone)
        for p in self.tfm.parameters(): p.requires_grad_(False)
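        # BERT exposes its blocks as `encoder.layer`, DistilBERT as `transformer.layer`
        stack = getattr(self.tfm, "encoder", None)
        if stack is None:
            stack = getattr(self.tfm, "transformer", None)
        if unfrozen > 0 and stack is not None:
            for lyr in stack.layer[-unfrozen:]:
                for p in lyr.parameters(): p.requires_grad_(True)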
        dim = self.tfm.config.hidden_size

        # Embeddings for categorical cols
        self.cat_embeddings = nn.ModuleDict({
            name: nn.Embedding(card, cat_dim)
            for name, card in cat_cardinals.items()
        })
        total_cat_emb_dim = cat_dim * len(cat_cardinals)


        self.fc_cat = nn.Linear(dim + total_cat_emb_dim, 128)
        self.out_a  = nn.Linear(128, 1)
        self.fc_num = nn.Linear(dim + n_num, 128)
        self.drop   = nn.Dropout(dropout)
        self.out_b  = nn.Linear(128, 1)
        self.relu, self.sig = nn.ReLU(), nn.Sigmoid()
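        # Initialise only the new layers: self.apply(...) would also visit self.tfm
        # and overwrite the pre-trained Linear weights of the backbone.
        for m in (self.fc_cat, self.out_a, self.fc_num, self.out_b):
            nn.init.kaiming_normal_(m.weight)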

    def forward(self, ids, mask, cat, num):
        cls = self.tfm(input_ids=ids, attention_mask=mask).last_hidden_state[:,0]

        # Process categorical features through embedding layers
        cat_vecs = [self.cat_embeddings[name](cat[:, i]) for i, name in enumerate(self.cat_embeddings)]
        cat_concat = torch.cat(cat_vecs, dim=-1)

        # Head A (using concatenated text and categorical embeddings)
        x_a  = self.relu(self.fc_cat(torch.cat([cls, cat_concat], 1)))
        outA = self.sig(self.out_a(x_a)).squeeze(1)

        # Head B (using concatenated text and numerical features)
        x_b  = self.drop(self.relu(self.fc_num(torch.cat([cls, num], 1))))
        outB = self.out_b(x_b).squeeze(1)
        return outA, outB

# ─────────────────────── Block 1 – Training ───────────────────

def train_model(model: MTModel, train_dl: torch.utils.data.DataLoader, val_dl: torch.utils.data.DataLoader, cfg: Dict[str, Any]) -> Tuple[Dict[str, np.ndarray], MTModel]:
    """Train for *epochs* with early stopping; return predictions on *val_dl*."""
    opt_cls = torch.optim.SGD if cfg["optim"]=="SGD" else torch.optim.AdamW
    opt = opt_cls(model.parameters(), lr=cfg["lr"], weight_decay=cfg["wd"])
    best, patience = math.inf, 0
    for _ in range(cfg["epochs"]):
        # –– training loop
        model.train()
        for b in train_dl:
            opt.zero_grad()
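            # MTModel already applies the sigmoid, so `logits` holds probabilities and plain BCE is the matching loss
            logits, reg = model(b["input_ids"].to(device), b["attention_mask"].to(device), b["cat"].to(device), b["num"].to(device))
            # Note: despite its name, cfg["lambda_reg"] weights the classification (BCE) term, not the MSE term
            loss = cfg["lambda_reg"]*F.binary_cross_entropy(logits, b["label"].to(device)) + F.mse_loss(reg, b["score"].to(device))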
            loss.backward()
            if cfg["clip"]: torch.nn.utils.clip_grad_norm_(model.parameters(), cfg["clip"])
            opt.step()
        # –– validation loss for early‑stop
        model.eval(); val_loss=[]
        with torch.no_grad():
            for b in val_dl:
                l, r = model(b["input_ids"].to(device), b["attention_mask"].to(device), b["cat"].to(device), b["num"].to(device))
                vloss = cfg["lambda_reg"]*F.binary_cross_entropy(l, b["label"].to(device)) + F.mse_loss(r, b["score"].to(device))
                val_loss.append(vloss.item())
        cur = np.mean(val_loss)
        if cur < best: best, patience = cur, 0
        else: patience += 1
        if patience >= cfg["early"]: break
    # –– collect predictions on val set
    logits_all, reg_all, lab_all, score_all = [], [], [], []
    model.eval()
    with torch.no_grad():
        for b in val_dl:
            l, r = model(b["input_ids"].to(device), b["attention_mask"].to(device), b["cat"].to(device), b["num"].to(device))
            logits_all.extend(l.cpu().numpy());   lab_all.extend(b["label"].numpy())
            reg_all.extend(r.cpu().numpy());      score_all.extend(b["score"].numpy())
    preds = {
        "cls_pred": np.array(logits_all),
        "cls_true": np.array(lab_all),
        "reg_pred": np.array(reg_all),
        "reg_true": np.array(score_all),
    }
    return preds, model
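
The early-stopping rule in `train_model` can be isolated as a small function over a list of per-epoch validation losses. This is a sketch of the same patience logic with hypothetical loss values; `stopping_epoch` is not part of the notebook.

```python
import math
from typing import List


def stopping_epoch(val_losses: List[float], patience: int) -> int:
    """Return the 1-based epoch after which training stops."""
    best, waited = math.inf, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, waited = loss, 0   # improvement: reset the counter
        else:
            waited += 1              # no improvement this epoch
        if waited >= patience:
            return epoch
    return len(val_losses)           # ran through every epoch


# Hypothetical per-epoch validation losses.
losses = [2.9, 2.5, 2.6, 2.7, 2.4]
print(stopping_epoch(losses, patience=2))  # 4
```

With `patience=2` the run stops after epoch 4, so the lower loss at epoch 5 is never reached. Note also that `train_model` as written keeps the weights of the last epoch it ran rather than restoring those of the best epoch.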


## 6  Model Evaluation

In [None]:
# ─────────────────────── Block 2 – Evaluation ─────────────────

def evaluate(preds: Dict[str, np.ndarray]) -> Dict[str, float]:
    y_hat = (preds["cls_pred"] > 0.5).astype(int)
    metrics = {
        "accuracy":  skm.accuracy_score(preds["cls_true"], y_hat),
        "precision": skm.precision_score(preds["cls_true"], y_hat, zero_division=0),
        "recall":    skm.recall_score(preds["cls_true"], y_hat, zero_division=0),
        "f1":        skm.f1_score(preds["cls_true"], y_hat, zero_division=0),
        "mse":       skm.mean_squared_error(preds["reg_true"], preds["reg_pred"]),
        "rmse":      math.sqrt(skm.mean_squared_error(preds["reg_true"], preds["reg_pred"])),
    }
    return metrics

# ───────────────────────── Cross‑val 5× ───────────────────────

def cross_validate(df: pd.DataFrame, cfg: Dict[str, Any]):
    tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    # Prepare cat_feats and num_feats columns
    cat_cols = ['Hotel_Name', 'Reviewer_Nationality']
    num_cols = ['Hotel_number_reviews', 'days_since']

    # Ensure cat_feats and num_feats are stored as lists of numerical values
    # Use factorize directly for cat_feats to get integer codes
    for col in cat_cols:
        df[col], _ = pd.factorize(df[col])
    df['cat_feats'] = df[cat_cols].values.tolist()

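    # Note: the scaler is fitted on all rows before the folds are created,
    # so validation rows contribute to the mean and variance used for scaling.
    scaler = StandardScaler()
    df['num_feats'] = list(scaler.fit_transform(df[num_cols]).astype('float32'))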

    # CHANGE: folds are grouped by hotel, so GroupKFold replaces plain 5-fold
    # cross-validation; no hotel appears in both the training and the validation split.
    gkf = GroupKFold(n_splits=5)
    results: List[Dict[str, float]] = []
    # Group by 'hotel_id' when that column exists; otherwise fall back to 'Hotel_Name'.
    group_col = 'hotel_id' if 'hotel_id' in df.columns else 'Hotel_Name'

    # Get categorical cardinalities for the model
    cat_cardinals = {col: df[col].nunique() for col in cat_cols}

    for idx, (tr, va) in enumerate(gkf.split(df, groups=df[group_col])):
        # Get the number of numerical features from the prepared data
        n_num = len(df['num_feats'].iloc[0])

        # Pass cat_cardinals to the model
        mdl = MTModel("distilbert-base-uncased", cat_cardinals, n_num, cfg["dropout"], cfg["unfreeze"]).to(device)
        tr_dl = torch.utils.data.DataLoader(ReviewDS(df.iloc[tr], tok), batch_size=cfg["bs"], shuffle=True)
        # Set validation batch size to be the same as training batch size
        va_dl = torch.utils.data.DataLoader(ReviewDS(df.iloc[va], tok), batch_size=cfg["bs"])
        preds, _ = train_model(mdl, tr_dl, va_dl, cfg)
        res = evaluate(preds)
        results.append(res)
        print(f"Fold {idx+1}:", {k:f"{v:.4f}" for k,v in res.items()})
    # aggregate
    mean = {k: np.mean([r[k] for r in results]) for k in results[0]}
    std  = {k: np.std( [r[k] for r in results]) for k in results[0]}
    print("\nMean ± SD across folds:")
    for k in mean: print(f"{k}: {mean[k]:.4f} ± {std[k]:.4f}")

# ──────────────────  config & runner ───────────────────
if __name__ == "__main__":
    cfg = dict(lr=2e-5, wd=1e-2, dropout=0.2, unfreeze=0, bs=16, optim="AdamW",
               lambda_reg=1.0, clip=1.0, epochs=5, early=2)
    cross_validate(df, cfg)


Fold 1: {'accuracy': '0.8653', 'precision': '0.8246', 'recall': '0.9077', 'f1': '0.8642', 'mse': '2.4903', 'rmse': '1.5781'}
Fold 2: {'accuracy': '0.8780', 'precision': '0.8472', 'recall': '0.9232', 'f1': '0.8836', 'mse': '2.2662', 'rmse': '1.5054'}
Fold 3: {'accuracy': '0.8784', 'precision': '0.8463', 'recall': '0.9264', 'f1': '0.8845', 'mse': '2.3252', 'rmse': '1.5248'}
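
The grouped split used in `cross_validate` guarantees that all reviews of one hotel land on the same side of each fold. The sketch below illustrates that property with a simplified greedy assignment (largest groups first, each into the currently lightest fold); it is an illustration of the idea, and scikit-learn's `GroupKFold` may differ in detail. The function name and hotel labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def group_k_fold(groups: Sequence[Hashable], n_splits: int) -> List[Tuple[List[int], List[int]]]:
    """Return (train, val) index lists such that no group spans both sides."""
    fold_of: Dict[Hashable, int] = {}
    fold_load = [0] * n_splits
    # Largest groups first, each placed into the currently lightest fold.
    for group, size in Counter(groups).most_common():
        target = fold_load.index(min(fold_load))
        fold_of[group] = target
        fold_load[target] += size

    splits = []
    for fold in range(n_splits):
        val = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        splits.append((train, val))
    return splits


# Hypothetical hotel label per review.
hotels = ["A", "A", "A", "B", "B", "C", "D", "D", "E"]
splits = group_k_fold(hotels, n_splits=3)
for train, val in splits:
    train_hotels = {hotels[i] for i in train}
    val_hotels = {hotels[i] for i in val}
    assert train_hotels.isdisjoint(val_hotels)  # no hotel on both sides
    print(sorted(val_hotels), len(val))
```

A side effect of grouping by hotel is that every validation hotel is unseen during training, so its row in the hotel embedding table receives no gradient updates in that fold.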
