# XLM-RoBERTa Inference Pipeline — Process Explained

This notebook explains, step by step, how XLM-RoBERTa (xlm-roberta-base) processes text for multi-class classification — from raw text, through tokenization and tensors, into the Transformer, and finally to probabilities and predicted labels.

It focuses on concepts and shapes rather than heavy computations or training.

## What you'll learn
- How text is normalized and tokenized (SentencePiece subwords, unigram language model)
- How inputs become tensors: input_ids and attention_mask
- How batches are formed and fed to the model
- What happens inside the model (embeddings → Transformer blocks)
- How the classification head produces logits and probabilities
- How this maps to your Phase 2 training/evaluation notebook

## End-to-end flow (high level)
1) Raw text → normalized (basic unicode normalization).
2) Tokenized with SentencePiece (unigram language model) → subword pieces (e.g., `▁support`, `s`; the `▁` marks a preceding space).
3) Special tokens are added: `<s>` (start), `</s>` (end), and padding if needed.
4) Convert tokens → integer IDs (input_ids).
5) Create attention_mask (1 for real tokens, 0 for padding).
6) Batch these tensors and feed to XLM-R encoder (Transformer stack).
7) Take the hidden state at position 0 (the `<s>` token) as the sequence representation.
8) Classification head (Dropout + Linear layers; the Hugging Face head adds a Linear + tanh step before the final projection) → logits → Softmax → probabilities → predicted label.

## Tokenization and normalization
- XLM-R uses a SentencePiece tokenizer (unigram language model, not BPE) with a shared multilingual vocabulary of roughly 250k pieces.
- It operates directly on raw text (no language-specific pre-tokenization needed).
- Text is split into subword units; uncommon words become multiple pieces.
- Special tokens used (RoBERTa family):
  - `<s>`: start of sequence (equivalent to [CLS])
  - `</s>`: end of sequence ([SEP])
  - `<pad>`: padding token
  - `<unk>`: unknown token
  - `<mask>`: used for masked language modeling (pretraining and fill-mask inference); not needed for classification
- Unlike BERT, RoBERTa/XLM-R does not use token_type_ids (segment IDs), even for sentence pairs. Separation is done with `</s>` tokens.

```python
# Illustrative only: how raw text becomes tokens and IDs
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')  # or your fine-tuned path
text = "This model supports many languages!"
enc = tokenizer(text, padding='max_length', truncation=True, max_length=16, return_tensors='pt')
print('Tokens:', tokenizer.convert_ids_to_tokens(enc['input_ids'][0]))
print('input_ids shape:', enc['input_ids'].shape)
print('attention_mask shape:', enc['attention_mask'].shape)
# Note: XLM-R does not use token_type_ids (it may be absent or all zeros)
```

Output:

```
Tokens: ['<s>', '▁This', '▁model', '▁support', 's', '▁many', '▁language', 's', '!', '</s>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
input_ids shape: torch.Size([1, 16])
attention_mask shape: torch.Size([1, 16])
```


## From text to tensors
- input_ids: shape (batch_size, seq_len), integer IDs for each token.
- attention_mask: same shape, 1 for tokens to attend to, 0 for padding.
- Padding/truncation: sequences are padded to the longest in the batch or to a fixed max_length; longer sequences are truncated. XLM-R accepts at most 512 tokens per sequence, including `<s>` and `</s>`.
- For sentence pairs: format is `<s> sentence1 </s> </s> sentence2 </s>`; still no token_type_ids.
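
The layout rules above can be sketched in plain Python. This is a minimal illustration, not the tokenizer's actual code: `build_inputs` is a hypothetical helper, the content IDs (10, 11, ...) are made up, and the truncation rule (trim the longer segment one token at a time) is a simplified stand-in for the tokenizer's own truncation strategies. The special-token IDs are those of the `xlm-roberta-base` vocabulary.

```python
# Special-token IDs in the xlm-roberta-base vocabulary.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # <s>, <pad>, </s>


def build_inputs(ids_a, ids_b=None, max_length=16):
    """Add special tokens, truncate, pad, and build the attention mask.

    ids_a / ids_b are lists of subword IDs WITHOUT special tokens.
    Hypothetical helper for illustration only.
    """
    ids_a = list(ids_a)
    ids_b = list(ids_b) if ids_b is not None else None
    n_special = 2 if ids_b is None else 4  # <s> A </s>  or  <s> A </s> </s> B </s>
    budget = max_length - n_special
    if budget < 0:
        raise ValueError("max_length is too small for the special tokens")

    # Simplified truncation: drop tokens from the end of the longer segment.
    while len(ids_a) + (len(ids_b) if ids_b else 0) > budget:
        if ids_b and len(ids_b) >= len(ids_a):
            ids_b.pop()
        else:
            ids_a.pop()

    input_ids = [BOS_ID] + ids_a + [EOS_ID]
    if ids_b is not None:
        input_ids += [EOS_ID] + ids_b + [EOS_ID]

    n_real = len(input_ids)
    n_pad = max_length - n_real
    return {
        "input_ids": input_ids + [PAD_ID] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


single = build_inputs([10, 11, 12], max_length=8)
pair = build_inputs([10, 11], [20, 21, 22], max_length=8)
print(single)
print(pair)
```

In practice you would let the tokenizer do all of this (as in the cell above); the sketch only makes the resulting shapes and mask values concrete.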

## Batching and ingestion
- Datasets are tokenized into dicts with keys like `input_ids`, `attention_mask`, `labels`.
- A DataLoader (or Hugging Face Trainer) batches these tensors.
- On each step, a batch of tensors with shape (B, L), where B is the batch size and L the padded sequence length, is moved to the target device (CPU/GPU) and passed to the model.
- The attention_mask ensures the model does not attend to padding positions.
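
To make the batching step concrete, here is a minimal plain-Python sketch of dynamic padding, where each batch is padded only to its own longest sequence. `iter_batches` is a hypothetical stand-in for what a DataLoader with a padding collator (for example `DataCollatorWithPadding` from `transformers`) does for you; the example IDs are made up, and real batches would be tensors rather than lists.

```python
PAD_ID = 1  # <pad> ID in the xlm-roberta-base vocabulary


def iter_batches(examples, batch_size):
    """Yield batches padded to the longest sequence in each batch.

    `examples` is a list of dicts with `input_ids` (list of ints, special
    tokens already added) and `labels` (int). Hypothetical helper.
    """
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        longest = max(len(ex["input_ids"]) for ex in chunk)
        batch = {"input_ids": [], "attention_mask": [], "labels": []}
        for ex in chunk:
            n_real = len(ex["input_ids"])
            n_pad = longest - n_real
            batch["input_ids"].append(ex["input_ids"] + [PAD_ID] * n_pad)
            batch["attention_mask"].append([1] * n_real + [0] * n_pad)
            batch["labels"].append(ex["labels"])
        yield batch


examples = [
    {"input_ids": [0, 10, 11, 2], "labels": 0},
    {"input_ids": [0, 10, 11, 12, 13, 2], "labels": 1},
    {"input_ids": [0, 20, 2], "labels": 2},
]
batches = list(iter_batches(examples, batch_size=2))
for b in batches:
    print(len(b["input_ids"]), "x", len(b["input_ids"][0]), b["attention_mask"])
```

Dynamic padding keeps short batches short, which saves compute compared with padding everything to a fixed `max_length`.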

## Inside XLM-R (encoder) — what happens
1) Embeddings layer:
   - Token embeddings: lookup vectors for each input_id.
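   - Positional embeddings: learned absolute position vectors, one per position, added to the token embeddings. In the Hugging Face implementation the position indices of real tokens start just after the padding index rather than at 0.
   - No segment embeddings in practice: the model has a single token type, so token_type embeddings carry no sentence A/B information.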
2) Transformer encoder stack (repeated N times, 12 in base):
   - Multi-Head Self-Attention (MHSA) uses attention_mask to ignore pads.
   - Add & LayerNorm (residual connection).
   - Feed-Forward Network (GELU activation) per position.
   - Add & LayerNorm again.
3) Sequence representation:
   - For classification, use the hidden state at position 0 (the `<s>` token) as a pooled representation.
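
The embedding lookup and the effect of `attention_mask` inside self-attention can be sketched with NumPy. This is a toy, single-head illustration under stated assumptions: the sizes and random weights are made up, and multi-head splitting, residual connections, LayerNorm, dropout, and the feed-forward network are all omitted. The position-ID rule mirrors how the Hugging Face RoBERTa-style embeddings number positions (real tokens start at `pad_id + 1`, padding keeps `pad_id`).

```python
import numpy as np

PAD_ID = 1  # <pad> ID in the xlm-roberta-base vocabulary


def position_ids_from_input_ids(input_ids, pad_id=PAD_ID):
    """Real tokens get pad_id+1, pad_id+2, ...; padding positions get pad_id."""
    mask = (input_ids != pad_id).astype(np.int64)
    return np.cumsum(mask, axis=1) * mask + pad_id


def masked_self_attention(hidden, attention_mask, w_q, w_k, w_v):
    """Single-head scaled dot-product attention that ignores padded keys."""
    q, k, v = hidden @ w_q, hidden @ w_k, hidden @ w_v           # each (B, L, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])     # (B, L, L)
    # Large negative bias on padded key positions -> ~0 weight after softmax.
    scores = scores + (1.0 - attention_mask[:, None, :]) * -1e9
    scores = scores - scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
vocab_size, max_positions, d = 50, 16, 8          # toy sizes, not the real model's
input_ids = np.array([[0, 10, 11, 2, 1, 1]])      # <s> x x </s> <pad> <pad>
attention_mask = (input_ids != PAD_ID).astype(np.float64)

tok_emb = rng.normal(size=(vocab_size, d))
pos_emb = rng.normal(size=(max_positions, d))
position_ids = position_ids_from_input_ids(input_ids)
hidden = tok_emb[input_ids] + pos_emb[position_ids]              # (1, 6, 8)

w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
context, weights = masked_self_attention(hidden, attention_mask, w_q, w_k, w_v)

print("position_ids:", position_ids.tolist())
print("attention from <s>:", weights[0, 0].round(3))
seq_repr = context[:, 0, :]                                      # position 0 = <s>
print("sequence representation shape:", seq_repr.shape)
```

The last two attention weights in each row are effectively zero, which is exactly what "ignoring pads" means here.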

## Classification head → logits → probabilities
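- Head: in the Hugging Face sequence-classification model, Dropout → Linear (hidden_size → hidden_size) → tanh → Dropout → Linear (hidden_size → num_labels), applied to the `<s>` hidden state. A custom head may be as simple as Dropout → Linear.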
- Output: logits (shape (batch_size, num_labels)).
- Training: CrossEntropyLoss compares raw logits against the true `labels`; it applies log-softmax internally, so do not apply Softmax first. Class weights or focal loss can be used (as in your Phase 2 code).
- Inference: apply Softmax to logits → probabilities per class.
- Predicted label: argmax over probabilities.
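
A minimal NumPy sketch of the head at inference time, where dropout is the identity. It follows the layout of the Hugging Face sequence-classification head for RoBERTa-style models (a dense + tanh layer before the final projection), which is an assumption about the implementation you use; a plain Dropout → Linear head would skip the dense/tanh step. All sizes and weights below are toy values.

```python
import numpy as np


def classification_head(last_hidden_state, w_dense, b_dense, w_out, b_out):
    """Map encoder output (B, L, H) to logits (B, num_labels).

    Dropout is omitted because it does nothing at inference time.
    """
    cls = last_hidden_state[:, 0, :]         # hidden state of <s>, shape (B, H)
    x = np.tanh(cls @ w_dense + b_dense)     # dense + tanh, shape (B, H)
    return x @ w_out + b_out                 # final projection, (B, num_labels)


rng = np.random.default_rng(0)
batch_size, seq_len, hidden_size, num_labels = 2, 6, 8, 3   # toy sizes
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))
w_dense = rng.normal(size=(hidden_size, hidden_size))
b_dense = np.zeros(hidden_size)
w_out = rng.normal(size=(hidden_size, num_labels))
b_out = np.zeros(num_labels)

logits = classification_head(last_hidden_state, w_dense, b_dense, w_out, b_out)
pred = logits.argmax(axis=-1)   # same result as argmax over softmax(logits)
print("logits shape:", logits.shape)
print("predicted class indices:", pred.tolist())
```

Because Softmax preserves ordering, taking argmax directly on the logits gives the same predicted label; Softmax is only needed when you want the probabilities themselves.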

```python
# Illustrative only: logits to probabilities
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.2, -0.3, 0.5]])  # pretend output for 3 classes
probs = F.softmax(logits, dim=-1)
pred = torch.argmax(probs, dim=-1).item()
print('probs:', probs.tolist())
print('predicted class index:', pred)
```

Output:

```
probs: [[0.5814915299415588, 0.12974829971790314, 0.28876012563705444]]
predicted class index: 0
```
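
The training-time losses mentioned in this section can be sketched in NumPy as well. This is an illustration of the general formulas, not your Phase 2 implementation: the function names, class weights, and `gamma` value are assumptions chosen for the example. The weighted version normalizes by the sum of the per-sample weights, which matches the mean reduction of `torch.nn.CrossEntropyLoss` when a `weight` tensor is given.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def weighted_cross_entropy(logits, labels, class_weights=None):
    """Mean cross-entropy; optionally weight each sample by its class weight."""
    nll = -log_softmax(logits)[np.arange(len(labels)), labels]
    if class_weights is None:
        return nll.mean()
    w = class_weights[labels]
    return (w * nll).sum() / w.sum()


def focal_loss(logits, labels, gamma=2.0):
    """Focal loss: down-weights examples the model already classifies well."""
    logp_true = log_softmax(logits)[np.arange(len(labels)), labels]
    p_true = np.exp(logp_true)
    return (-((1.0 - p_true) ** gamma) * logp_true).mean()


logits = np.array([[1.2, -0.3, 0.5], [0.1, 2.0, -1.0]])
labels = np.array([0, 2])
class_weights = np.array([1.0, 1.0, 3.0])   # made-up weights: class 2 is rare

ce = weighted_cross_entropy(logits, labels)
wce = weighted_cross_entropy(logits, labels, class_weights)
fl = focal_loss(logits, labels, gamma=2.0)
print("cross-entropy:", ce)
print("weighted cross-entropy:", wce)
print("focal loss (gamma=2):", fl)
```

With `gamma=0` the focal loss reduces to plain cross-entropy; larger `gamma` shifts the training signal toward hard, misclassified examples.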


## Mapping to your Phase 2 notebook
- Pre-processing: you clean text, drop nulls/duplicates, engineer labels.
- Label encoding: you map label strings ↔ integers (`label_to_id`, `id_to_label`).
- Tokenization: your `tokenize_dataset(...)` builds `input_ids` and `attention_mask` for each split.
- Data format: `dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])` prepares tensors.
- Trainer loop: batches tensors, feeds to model, computes loss (optionally weighted or focal), evaluates metrics.
- Evaluation: you compute Accuracy, Macro-F1, and confusion matrices from predictions vs. labels.
- Inference helpers: your `predict_text` / `predict_text_probs` wrap tokenizer → model → softmax → label mapping.
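
As a small worked example of the label-mapping and evaluation steps, the sketch below builds `label_to_id` / `id_to_label` and computes accuracy, macro-F1, and a confusion matrix in plain Python. The label names and predictions are invented for illustration, and the helper functions are hypothetical; your notebook may well use a metrics library instead.

```python
def confusion_matrix(y_true, y_pred, num_labels):
    """cm[t][p] counts examples with true class t predicted as class p."""
    cm = [[0] * num_labels for _ in range(num_labels)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def accuracy_and_macro_f1(cm):
    """Accuracy and unweighted mean of per-class F1 from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    f1_scores = []
    for i in range(n):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp
        fn = sum(cm[i]) - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / total, sum(f1_scores) / n


# Hypothetical label set; replace with the labels from your data.
labels = ["negative", "neutral", "positive"]
label_to_id = {name: i for i, name in enumerate(sorted(labels))}
id_to_label = {i: name for name, i in label_to_id.items()}

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]   # made-up model predictions

cm = confusion_matrix(y_true, y_pred, num_labels=len(labels))
acc, macro_f1 = accuracy_and_macro_f1(cm)
print("confusion matrix:", cm)
print("accuracy:", acc, "macro-F1:", macro_f1)
print("first prediction as text:", id_to_label[y_pred[0]])
```

Macro-F1 averages the per-class F1 scores without weighting by class frequency, which is why it is a useful companion to accuracy on imbalanced data.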

## Practical notes and gotchas
- Max length and truncation: texts longer than max_length are truncated and the tail is lost; consider summarizing the text or using sliding windows when context is critical.
- Tokenizer path: use your fine-tuned tokenizer/model paths for consistent vocabulary and special tokens.
- Class imbalance: adjust loss (weights, focal loss) or rebalance data.
- Multilingual inputs: XLM-R's shared vocabulary handles many languages, but domain-specific slang may be split into many subwords.
- No token_type_ids: RoBERTa family ignores segment embeddings; rely on `</s>` separators for pairs.
- Attention mask: always pass it to avoid attending to padding.
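
The sliding-window idea for long texts can be sketched as follows. This is one simple option under stated assumptions, not a prescribed method: `sliding_windows` and `average_probs` are hypothetical helpers, the default window of 510 content tokens leaves room for `<s>` and `</s>` within a 512-token limit, and averaging per-window probabilities is only one of several ways to combine window-level predictions.

```python
def sliding_windows(token_ids, window=510, overlap=128):
    """Split content token IDs (no special tokens) into overlapping chunks."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += step
    return chunks


def average_probs(chunk_probs):
    """Average per-chunk class probabilities into one vector per document."""
    n = len(chunk_probs)
    return [sum(col) / n for col in zip(*chunk_probs)]


# Toy example with tiny sizes so the overlap is easy to see.
chunks = sliding_windows(list(range(10)), window=4, overlap=2)
print(chunks)

# Pretend the model returned these probabilities for two chunks.
doc_probs = average_probs([[0.6, 0.4], [0.2, 0.8]])
print(doc_probs)
```

Each chunk would then get its own `<s>` and `</s>` tokens and attention mask before being passed to the model.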

## Summary
Raw text → SentencePiece subwords + special tokens → IDs and masks → Transformer encoder → `<s>` representation → classification head → logits → Softmax probabilities.
This is exactly how your Phase 2 pipeline turns text into final class probabilities and labels.