# Transformers Complete Guide

**Summary:**
This notebook provides a comprehensive guide to transformer models (BERT, GPT, etc.) for text analysis, including practical examples, comparisons, and applications in literary studies.

This comprehensive notebook covers transformer models using the [Hugging Face Transformers](https://huggingface.co/docs/transformers/) library, from basic pre-trained pipelines to fine-tuning for custom classification tasks.

**What you'll learn:**
1. Text generation with GPT-2
2. Text embeddings and semantic similarity with BERT
3. Pre-trained pipelines (sentiment analysis, NER, zero-shot)
4. Fine-tuning BERT for classification tasks
5. Advanced zero-shot prompting with FLAN-T5

**Authors:** Maria Antoniak, Melanie Walsh, and the [AI for Humanists](https://aiforhumanists.com/) Team  
**Updated:** 2025-02-08

---
## Part 1: Introduction & Setup

### Prerequisites

This notebook requires the following packages:
```bash
pip install transformers torch torchvision torchaudio
pip install scikit-learn pandas numpy matplotlib seaborn
pip install requests gdown sentencepiece
```

### What are Transformers?

Transformer models use self-attention mechanisms to process text, enabling them to:
- Understand context and relationships between words
- Generate coherent, contextually-appropriate text
- Transfer learning from large pre-trained models to specific tasks
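To make self-attention a little more concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions: a single attention head, no learned query/key/value projections, and made-up 2-dimensional token vectors. The helper names `softmax` and `self_attention` are illustrative and not part of any library.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product attention with no learned weights.

    Real transformers first project each vector into separate query, key,
    and value vectors; in this sketch the input plays all three roles.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # How strongly this token attends to every token (itself included)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs

# Three toy 2-dimensional "token" vectors (made-up numbers)
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
contextual = self_attention(tokens)
for before, after in zip(tokens, contextual):
    print(before, "->", [round(x, 3) for x in after])
```

Each output vector is a blend of all input vectors, weighted by similarity to the query token. This is the sense in which a token's representation comes to depend on its context.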

**Popular transformer models:**
- **BERT** (Bidirectional Encoder Representations from Transformers): Best for understanding text (classification, NER, Q&A)
- **GPT** (Generative Pre-trained Transformer): Best for generating text
- **T5** (Text-to-Text Transfer Transformer): Frames all NLP tasks as text generation
- **DistilBERT**: Smaller, faster version of BERT that retains ~97% of its language-understanding performance

### Installation and Imports

In [None]:
# Uncomment to install required packages
# !pip install transformers torch scikit-learn pandas numpy matplotlib seaborn requests gdown sentencepiece

In [None]:
# Basic Python modules
from collections import defaultdict
import random
import pickle
import os
import gzip
import json

# For downloading files
import requests
try:
    import gdown
except ImportError:
    print("Warning: gdown not installed. Some dataset downloads may not work.")

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Machine learning and evaluation
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

# Deep learning
import torch
from torch.nn.functional import cosine_similarity

# Transformers
os.environ.setdefault("TRANSFORMERS_NO_TF", "1")
from transformers import (
    pipeline,
    GPT2LMHeadModel, GPT2Tokenizer,
    BertModel, BertTokenizer,
    DistilBertTokenizerFast, DistilBertForSequenceClassification,
    T5Tokenizer, T5ForConditionalGeneration,
    Trainer, TrainingArguments
)

# Visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import ticker
sns.set(style='ticks', font_scale=1.2)

# Progress bars and display
from tqdm.auto import tqdm
from IPython.display import display, Markdown

print("All imports successful!")

### Device Setup (CUDA/MPS/CPU)

This notebook automatically detects your available hardware:
- **CUDA**: NVIDIA GPUs (fastest)
- **MPS**: Apple Silicon GPUs (M1/M2/M3)
- **CPU**: Fallback (slowest)

Fine-tuning on CPU can be very slow. If you don't have GPU access, consider:
- Using Google Colab (free GPU)
- Reducing dataset size
- Using smaller models (DistilBERT instead of BERT)

In [None]:
# Detect available device
if torch.cuda.is_available():
    device = "cuda"
    device_name = torch.cuda.get_device_name(0)
    print(f"Using CUDA GPU: {device_name}")
elif torch.backends.mps.is_available():
    device = "mps"
    print("Using Apple Silicon GPU (MPS)")
else:
    device = "cpu"
    print("Using CPU (this will be slow for fine-tuning)")

print(f"PyTorch version: {torch.__version__}")
print(f"Device: {device}")

---
## Part 2: Text Generation with GPT-2

GPT-2 is an autoregressive language model that generates text by predicting the next token. It was trained on a large corpus of internet text and can generate surprisingly coherent and contextually appropriate text.

**Available GPT-2 models:**

| Model | Parameters | HF Name |
|-------|-----------|----------|
| Small | 124M | `gpt2` |
| Medium | 355M | `gpt2-medium` |
| Large | 774M | `gpt2-large` |
| XL | 1.5B | `gpt2-xl` |

We'll start with the small model for speed.

### Loading GPT-2

In [None]:
# Load GPT-2 model and tokenizer
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)

print(f"Loaded GPT-2 ({gpt2_model.num_parameters():,} parameters)")

### Generation Parameters

Understanding these parameters helps you control the quality and creativity of generated text:

- **max_new_tokens**: Maximum number of tokens to generate
- **temperature**: Controls randomness (around 0.3 = focused and nearly deterministic, 1.0+ = creative/random)
- **top_k**: Only consider the k most likely next tokens
- **top_p**: Nucleus sampling: only consider the smallest set of most likely tokens whose cumulative probability reaches p
- **num_samples**: How many different completions to generate
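The effect of `temperature`, `top_k`, and `top_p` can be sketched on a toy distribution. This is a simplified illustration in plain Python, not the library's implementation: the logits are made up, and the helper names `softmax_with_temperature` and `filter_top_k_top_p` are hypothetical. It assumes top-k filtering is applied first and top-p is then computed over the surviving candidates.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=50, top_p=0.9):
    """Return {token index: renormalized probability} after top-k, then top-p."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    top_k_mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p / top_k_mass  # probability within the top-k candidates
        if cumulative >= top_p:       # smallest set whose mass reaches top_p
            break
    kept_mass = sum(p for _, p in kept)
    return {index: p / kept_mass for index, p in kept}

logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # hypothetical scores for a 5-token vocabulary
sharp = softmax_with_temperature(logits, temperature=0.3)
flat = softmax_with_temperature(logits, temperature=1.5)
print("T=0.3:", [round(p, 3) for p in sharp])
print("T=1.5:", [round(p, 3) for p in flat])
print("kept :", filter_top_k_top_p(softmax_with_temperature(logits), top_k=3, top_p=0.9))
```

At low temperature almost all probability mass sits on the top token. At high temperature the mass spreads out, which is why high-temperature samples read as more surprising.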

In [None]:
def generate(model, tokenizer, prompt, max_new_tokens=100, temperature=0.7,
             top_k=50, top_p=0.9, num_samples=1):
    """Generate text from a prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        do_sample=True,
        num_return_sequences=num_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    for i, output in enumerate(outputs):
        text = tokenizer.decode(output, skip_special_tokens=True)
        if num_samples > 1:
            print(f"--- Sample {i + 1} ---")
        print(text)
        print()

In [None]:
# Simple generation
generate(gpt2_model, gpt2_tokenizer, "The secret of life is", max_new_tokens=50)

In [None]:
# Generate multiple creative samples
generate(gpt2_model, gpt2_tokenizer, "Once upon a time",
         max_new_tokens=80, temperature=0.9, num_samples=3)

In [None]:
# Try your own prompts!
# Experiment with different temperature values:
# - Low (0.3-0.5): More focused, coherent, deterministic
# - Medium (0.7-0.8): Balanced creativity and coherence
# - High (0.9-1.2): More creative, unpredictable

generate(gpt2_model, gpt2_tokenizer, "In the future, artificial intelligence will",
         max_new_tokens=100, temperature=0.7)

---
## Part 3: Text Embeddings with BERT

BERT produces **contextual embeddings** — vector representations where the meaning of a word depends on its surrounding context. This is in contrast to older word embeddings (like Word2Vec) where each word has a single fixed vector.

**Example:** The word "bank" has different meanings in:
- "I deposited money at the **bank**" (financial institution)
- "We sat by the river **bank**" (land alongside water)

BERT will produce different embeddings for "bank" in each context.

### The [CLS] Token

BERT adds special tokens to the input:
- `[CLS]`: Start of sequence token — its embedding is used to represent the entire sequence
- `[SEP]`: Separator between sentences
- `[PAD]`: Padding to make all sequences the same length

For classification and similarity tasks, we use the `[CLS]` token's embedding.

### Loading BERT

In [None]:
# Load BERT model and tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased").to(device)

print(f"Loaded BERT ({bert_model.num_parameters():,} parameters)")

### Getting Embeddings

In [None]:
def get_embeddings(texts, tokenizer, model):
    """Get [CLS] token embeddings for a list of texts."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] token is at position 0
    return outputs.last_hidden_state[:, 0, :]

In [None]:
# Example sentences with different semantic meanings
sentences = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "Stock prices rose sharply today.",
    "The financial markets surged.",
]

embeddings = get_embeddings(sentences, bert_tokenizer, bert_model)
print(f"Embedding shape: {embeddings.shape}")
print(f"(batch_size={embeddings.shape[0]}, hidden_size={embeddings.shape[1]})")

### Semantic Similarity with Cosine Similarity

Cosine similarity measures how similar two vectors are, ranging from -1 to 1:
- **1.0**: Identical vectors (same direction)
- **0.0**: Orthogonal vectors (no similarity)
- **-1.0**: Opposite vectors

For text embeddings, higher cosine similarity indicates more semantically similar sentences.
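As a quick check of the three reference values above, cosine similarity can be computed by hand. This is a minimal plain-Python sketch with made-up 2-dimensional vectors; the `cosine` helper is illustrative and separate from the `torch` function used in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 2.0], [2.0, 4.0]))    # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Because the formula divides by both vector lengths, only the direction of the embeddings matters, not their magnitude.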

In [None]:
# Calculate pairwise cosine similarity
n = len(sentences)
sim_matrix = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        sim_matrix[i, j] = cosine_similarity(embeddings[i].unsqueeze(0), embeddings[j].unsqueeze(0))

# Create a nice labeled dataframe
labels = [s[:30] + "..." if len(s) > 30 else s for s in sentences]
sim_df = pd.DataFrame(sim_matrix.cpu().numpy(), index=labels, columns=labels)
sim_df.style.background_gradient(cmap="YlOrRd", vmin=0.8, vmax=1.0).format("{:.3f}")

**Observations:**
- Sentences 1 and 2 (about cats) should have high similarity
- Sentences 3 and 4 (about finance) should have high similarity
- Cross-topic pairs should have lower similarity

In [None]:
# Visualize similarity matrix as a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(sim_df, annot=True, fmt=".3f", cmap="YlOrRd", vmin=0.8, vmax=1.0)
plt.title("Semantic Similarity Matrix")
plt.tight_layout()
plt.show()

---
## Part 4: Pre-trained Pipelines

Hugging Face provides a high-level `pipeline` API for common NLP tasks. These pipelines use pre-trained models fine-tuned for specific tasks, so you can use them immediately without any training.

**Available pipelines:**
- Sentiment analysis
- Named Entity Recognition (NER)
- Zero-shot classification
- Question answering
- Summarization
- Translation
- And many more!

### Sentiment Analysis

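Classifies text as positive or negative. The default model is DistilBERT fine-tuned on SST-2, a dataset of movie-review sentences, so it has no neutral label and will force mixed reviews into one of the two classes.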

In [None]:
# Load sentiment analysis pipeline
sentiment = pipeline("sentiment-analysis", device=0 if device == "cuda" else -1)

reviews = [
    "This movie was absolutely wonderful! The acting was superb.",
    "Terrible film. I walked out after 30 minutes.",
    "It was okay, nothing special but not bad either.",
    "A masterpiece of modern cinema. Truly breathtaking.",
    "The plot made no sense and the dialogue was awful.",
]

results = sentiment(reviews)

for review, result in zip(reviews, results):
    print(f"{result['label']:8} ({result['score']:.3f})  {review}")

### Named Entity Recognition (NER)

Identifies and classifies named entities in text:
- **PER**: Person
- **ORG**: Organization
- **LOC**: Location
- **MISC**: Miscellaneous

In [None]:
# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple", device=0 if device == "cuda" else -1)
entities = ner("Barack Obama graduated from Harvard Law School and served as President of the United States.")

for ent in entities:
    print(f"{ent['entity_group']:10} {ent['word']:25} (score: {ent['score']:.3f})")

### Zero-Shot Classification

Classify text into categories **without any training examples**. You just provide:
1. The text to classify
2. A list of candidate labels

The model uses natural language inference to determine which label best fits the text.
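The natural language inference step can be sketched as follows. Each candidate label is turned into a hypothesis sentence, the text serves as the premise, and the per-label entailment scores are normalized into probabilities. The template string is assumed here to be the pipeline's default, the logits are made-up numbers rather than real model output, and `zero_shot_scores` is a hypothetical helper.

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits (single-label setting)."""
    peak = max(entailment_logits.values())
    exps = {label: math.exp(x - peak) for label, x in entailment_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

text = "The new iPhone features a faster processor and improved camera system."
candidate_labels = ["technology", "politics", "sports"]

# Each label becomes a hypothesis that is paired with the text as the premise
hypotheses = [f"This example is {label}." for label in candidate_labels]
for hypothesis in hypotheses:
    print(f"premise: {text!r} | hypothesis: {hypothesis!r}")

# Hypothetical entailment logits an NLI model might return (made-up numbers)
logits = {"technology": 3.2, "politics": -1.5, "sports": -2.0}
scores = zero_shot_scores(logits)
print(max(scores, key=scores.get))
```

Because labels are read as ordinary language, their wording matters: "sports" and "athletics" can yield different scores for the same text.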

In [None]:
# Zero-shot classification
classifier = pipeline("zero-shot-classification", device=0 if device == "cuda" else -1)

result = classifier(
    "The new iPhone features a faster processor and improved camera system.",
    candidate_labels=["technology", "politics", "sports", "science"]
)

for label, score in zip(result["labels"], result["scores"]):
    print(f"{label:15} {score:.3f}")

In [None]:
# Try zero-shot classification on your own text!
# Example: classify news headlines, tweets, or book reviews

texts = [
    "The Federal Reserve announced an interest rate cut.",
    "Scientists discovered a new species in the Amazon rainforest.",
    "The Lakers won in overtime against the Celtics.",
]

labels = ["business", "science", "sports", "entertainment"]

for text in texts:
    result = classifier(text, candidate_labels=labels)
    print(f"\nText: {text}")
    print(f"Top prediction: {result['labels'][0]} ({result['scores'][0]:.3f})")

---
## Part 5: Fine-Tuning for Classification

While pre-trained pipelines are powerful, you often need to fine-tune a model on your specific dataset. This section demonstrates fine-tuning DistilBERT to classify Goodreads book reviews by genre.

**The fine-tuning process:**
1. Download and prepare the dataset
2. Split into training and test sets
3. Encode texts for BERT (tokenization, padding, special tokens)
4. Create PyTorch datasets
5. Load pre-trained model
6. Fine-tune on training data
7. Evaluate on test data
8. Analyze results and errors

### Dataset: Goodreads Reviews by Genre

We'll use the [UCSD Book Graph dataset](https://mengtingwan.github.io/data/goodreads.html) which contains millions of book reviews. We'll classify reviews into genres:

- poetry
- comics & graphic
- fantasy & paranormal
- history & biography
- mystery, thriller, & crime
- romance
- young adult
- children

In [None]:
# URLs for Goodreads review data by genre
genre_url_dict = {
    'poetry':                 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_poetry.json.gz',
    'children':               'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_children.json.gz',
    'comics_graphic':         'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_comics_graphic.json.gz',
    'fantasy_paranormal':     'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_fantasy_paranormal.json.gz',
    'history_biography':      'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_history_biography.json.gz',
    'mystery_thriller_crime': 'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_mystery_thriller_crime.json.gz',
    'romance':                'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_romance.json.gz',
    'young_adult':            'https://datarepo.eng.ucsd.edu/mcauley_group/gdrive/goodreads/byGenre/goodreads_reviews_young_adult.json.gz'
}

### Download and Sample Data

We'll stream the data to avoid downloading huge files. For each genre:
1. Stream the first 10,000 reviews
2. Randomly sample 2,000 reviews

This gives us a manageable dataset while maintaining diversity.

In [None]:
def load_reviews(url, head=10000, sample_size=2000):
    """Stream reviews from URL and collect a subset."""
    reviews = []
    count = 0
    
    try:
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()
        
        with gzip.open(response.raw, 'rt', encoding='utf-8') as file:
            for line in file:
                try:
                    d = json.loads(line)
                    if 'review_text' in d:
                        reviews.append(d['review_text'])
                        count += 1
                except json.JSONDecodeError:
                    continue
                
                if head is not None and count >= head:
                    break
    except Exception as e:
        print(f"Error loading from {url}: {e}")
        return []
    
    # Return random sample of reviews
    return random.sample(reviews, min(sample_size, len(reviews)))

# Load reviews for each genre
genre_reviews_dict = {}

for genre, url in genre_url_dict.items():
    print(f'Loading reviews for genre: {genre}')
    genre_reviews_dict[genre] = load_reviews(url, head=10000, sample_size=2000)
    print(f"  Loaded {len(genre_reviews_dict[genre])} reviews")

In [None]:
# Preview a random review from each genre
for genre, reviews in genre_reviews_dict.items():
    if reviews:
        print(f"\n{genre.upper()}:")
        print(random.choice(reviews)[:200] + "...")

In [None]:
# Save for later use
with open('genre_reviews_dict.pickle', 'wb') as f:
    pickle.dump(genre_reviews_dict, f)
print("Saved genre_reviews_dict.pickle")

# To reload later:
# with open('genre_reviews_dict.pickle', 'rb') as f:
#     genre_reviews_dict = pickle.load(f)

### Split Data into Training and Test Sets

**Important:** When training machine learning models, we MUST split data into:
- **Training set**: Used to train the model (80% of data)
- **Test set**: Used to evaluate performance on unseen data (20% of data)

**Never** train and test on the same data — this would give falsely high accuracy!

For production systems, you should also have a **validation set** for hyperparameter tuning.
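A three-way split might look like the sketch below. The 80/10/10 proportions, the fixed seed, and the helper name `split_three_ways` are assumptions chosen for illustration; the rest of this notebook uses a simpler 80/20 train/test split.

```python
import random

def split_three_ways(items, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle a copy of `items` and cut it into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

reviews = [f"review {i}" for i in range(1000)]  # placeholder texts
train, val, test = split_three_ways(reviews)
print(len(train), len(val), len(test))
```

With this layout, hyperparameters are tuned against the validation list, and the test list is used only once for the final score.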

In [None]:
train_texts = []
train_labels = []

test_texts = []
test_labels = []

for genre, reviews in genre_reviews_dict.items():
    # Sample 1000 reviews per genre for this example
    reviews = random.sample(reviews, min(1000, len(reviews)))
    
    # 80/20 split
    split_idx = int(len(reviews) * 0.8)
    
    for review in reviews[:split_idx]:
        train_texts.append(review)
        train_labels.append(genre)
    
    for review in reviews[split_idx:]:
        test_texts.append(review)
        test_labels.append(genre)

print(f"Training samples: {len(train_texts):,}")
print(f"Test samples: {len(test_texts):,}")
print(f"\nExample training sample:")
print(f"Label: {train_labels[0]}")
print(f"Text: {train_texts[0][:200]}...")

### Baseline: Logistic Regression with TF-IDF

Before using BERT, let's establish a baseline with a simpler model:
- **TF-IDF**: Represents text as weighted word frequencies
- **Logistic Regression**: Simple, fast classifier

This baseline helps us understand:
1. How difficult the classification task is
2. Whether BERT provides improvement over simpler methods
3. If our data is good quality

**Random baseline:** With 8 genres, random guessing would give ~12.5% accuracy.
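To see what "weighted word frequencies" means, here is a small hand-rolled TF-IDF in plain Python. It is a sketch under assumptions: the three documents are invented, and although the smoothed IDF formula resembles scikit-learn's default, this version skips the vector normalization that `TfidfVectorizer` applies, so the numbers will not match the library's output.

```python
import math
from collections import Counter

docs = [
    "the dragon guarded castle walls",
    "the detective searched castle grounds",
    "the poem praised moon light",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Term frequency times a smoothed inverse document frequency."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }

weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:10} {weight:.3f}")
```

Words that appear in every document ("the") receive the lowest weight, while words unique to one document ("dragon") receive the highest. This is what lets a linear classifier pick up genre-specific vocabulary.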

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

print("Training baseline model (TF-IDF + Logistic Regression)...")

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Train logistic regression
baseline_model = LogisticRegression(max_iter=1000, random_state=42)
baseline_model.fit(X_train, train_labels)

# Evaluate
baseline_predictions = baseline_model.predict(X_test)
print("\nBaseline Results:")
print(classification_report(test_labels, baseline_predictions))

**Interpreting the baseline:**
- **Precision**: Of all predictions for a genre, what % were correct?
- **Recall**: Of all true instances of a genre, what % did we find?
- **F1-score**: Harmonic mean of precision and recall
- **Support**: Number of true instances of each class
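These definitions can be checked by hand on a toy example. The labels below are made up, and `precision_recall_f1` is an illustrative helper rather than the scikit-learn function used in this notebook.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Per-class precision, recall, and F1 for the class `target`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)  # true positives
    fp = sum(1 for t, p in pairs if t != target and p == target)  # false positives
    fn = sum(1 for t, p in pairs if t == target and p != target)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for six reviews
y_true = ["poetry", "poetry", "poetry", "romance", "romance", "children"]
y_pred = ["poetry", "poetry", "romance", "romance", "poetry", "children"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "poetry")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three "poetry" predictions are correct and two of the three true poetry reviews are found, so precision, recall, and F1 all come out to 2/3.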

If baseline accuracy is well above the 12.5% chance level, the task is learnable from word frequencies alone. Let's see if BERT can do better!

### Encoding Data for BERT

BERT requires text to be processed in a specific way:

1. **Tokenization**: Split text into subword tokens (common words usually stay whole; a rarer word such as "tokenization" may become "token", "##ization")
2. **Special tokens**: Add `[CLS]`, `[SEP]`, and `[PAD]`
3. **Truncation**: Limit to 512 tokens (BERT's maximum)
4. **Padding**: Add `[PAD]` tokens to make all sequences the same length
5. **Label encoding**: Convert genre names to integers

Fortunately, Hugging Face handles most of this automatically!

#### Special Tokens in BERT

| Token | Purpose |
|-------|----------|
| `[CLS]` | Classification token — placed at start, its embedding represents the whole sequence |
| `[SEP]` | Separator — marks boundaries between sentences |
| `[PAD]` | Padding — fills sequences to the same length |
| `##` | Word-piece continuation prefix (not a special token): marks a token that continues the previous word |
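The steps above can be sketched with a toy tokenizer. This is an illustration under assumptions, not the Hugging Face implementation: the vocabulary is tiny and invented, `wordpiece` uses a simple greedy longest-match rule, and `encode` is a hypothetical helper that adds special tokens, truncates, and pads.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Tiny hypothetical vocabulary; a real BERT vocabulary has roughly 30,000 entries
vocab = {"read", "##able", "##ing", "un", "##read", "book"}

def encode(text, max_length=8):
    """Add [CLS]/[SEP], then truncate and pad to `max_length` tokens."""
    tokens = [piece for word in text.lower().split() for piece in wordpiece(word, vocab)]
    tokens = ["[CLS]"] + tokens[:max_length - 2] + ["[SEP]"]
    attention_mask = [1] * len(tokens) + [0] * (max_length - len(tokens))
    tokens += ["[PAD]"] * (max_length - len(tokens))
    return tokens, attention_mask

tokens, mask = encode("unreadable book")
print(tokens)
print(mask)
```

The attention mask is 1 for real tokens and 0 for padding, which tells the model to ignore the `[PAD]` positions.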

In [None]:
# Fine-tuning parameters
model_name = 'distilbert-base-cased'
max_length = 512
cached_model_directory_name = 'distilbert-reviews-genres'

# Load tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
print(f"Loaded tokenizer: {model_name}")

In [None]:
# Create label mappings
unique_labels = sorted(set(train_labels))
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print("Label mappings:")
for label, idx in label2id.items():
    print(f"  {idx}: {label}")

In [None]:
# Encode texts
print("Encoding training data...")
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
print("Encoding test data...")
test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=max_length)

# Encode labels
train_labels_encoded = [label2id[y] for y in train_labels]
test_labels_encoded = [label2id[y] for y in test_labels]

print("\nEncoding complete!")
print(f"Training samples: {len(train_encodings['input_ids'])}")
print(f"Test samples: {len(test_encodings['input_ids'])}")

In [None]:
# Examine an encoded example
print("Example of BERT tokenization:")
print("\nOriginal text:")
print(train_texts[0][:200])
print("\nTokenized (first 50 tokens):")
print(' '.join(train_encodings.tokens(0)[:50]))

### Create PyTorch Datasets

PyTorch uses `Dataset` objects to handle data loading and batching. We'll create a custom dataset that:
1. Stores our encoded texts and labels
2. Returns individual examples in the format BERT expects
3. Handles conversion to PyTorch tensors

In [None]:
class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = ReviewDataset(train_encodings, train_labels_encoded)
test_dataset = ReviewDataset(test_encodings, test_labels_encoded)

print(f"Created training dataset with {len(train_dataset)} examples")
print(f"Created test dataset with {len(test_dataset)} examples")

### Load Pre-trained Model

We'll load DistilBERT for sequence classification. The model has:
- Pre-trained weights from general language modeling
- A classification head (randomly initialized) for our 8 genres

**Why DistilBERT?**
- 40% smaller than BERT
- 60% faster
- Retains ~97% of BERT's language-understanding performance
- Perfect for learning and experimentation

In [None]:
# Load pre-trained DistilBERT for sequence classification
model = DistilBertForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=len(id2label)
).to(device)

print(f"Loaded {model_name} with {len(id2label)} output classes")
print(f"Total parameters: {model.num_parameters():,}")

### Set Training Arguments

These hyperparameters control the fine-tuning process. For your own projects, you should experiment with these values:

| Parameter | Purpose | Typical Range |
|-----------|---------|---------------|
| `num_train_epochs` | How many times to iterate through full dataset | 2-5 |
| `per_device_train_batch_size` | Training examples per GPU batch | 8-32 |
| `learning_rate` | Step size for weight updates | 2e-5 to 5e-5 |
| `warmup_steps` | Gradual learning rate increase at start | 100-1000 |
| `weight_decay` | Regularization to prevent overfitting | 0.01-0.1 |
| `logging_steps` | How often to print progress | 50-500 |

**Warning:** Training on CPU can take hours! Use a GPU if possible.
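To get a feel for how `warmup_steps` and `learning_rate` interact, the sketch below computes a linear warmup followed by linear decay to zero, which is assumed here to match the scheduler the `Trainer` uses by default. The step counts assume 8 genres with 800 training reviews each, a batch size of 16, 3 epochs, and a single device; `linear_schedule` is a hypothetical helper.

```python
import math

def linear_schedule(step, base_lr=5e-5, warmup_steps=100, total_steps=1200):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

# 8 genres x 800 training reviews, batch size 16, 3 epochs (single device assumed)
steps_per_epoch = math.ceil(8 * 800 / 16)
total_steps = steps_per_epoch * 3
print(f"{steps_per_epoch} steps per epoch, {total_steps} total")
for step in (0, 50, 100, 650, 1200):
    print(step, f"{linear_schedule(step, total_steps=total_steps):.2e}")
```

A warmup of 100 steps is a small fraction of the run under these assumptions. With a much smaller dataset, the same setting could consume most of training, so it is worth rescaling.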

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=20,
    learning_rate=5e-5,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy='epoch',  # must match save_strategy when load_best_model_at_end=True
    save_strategy='epoch',
    load_best_model_at_end=True,
    report_to=[],  # Disable wandb logging
)

print("Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Device: {device}")

### Define Evaluation Metrics

In [None]:
def compute_metrics(pred):
    """Calculate accuracy from predictions."""
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

### Fine-Tune the Model

Now we create a `Trainer` object and start fine-tuning!

The Trainer will:
1. Feed batches of training data through the model
2. Calculate loss (how wrong the predictions are)
3. Update model weights to reduce loss
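4. Periodically evaluate on the evaluation set (here, the test data; for a rigorous estimate, evaluate on a separate validation set during training and reserve the test set for the final check)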
5. Log progress

**What to watch for:**
- **Training loss** should decrease steadily
- **Eval accuracy** should increase
- If eval loss plateaus or accuracy stops improving, further training is unlikely to help
- If eval loss rises (or eval accuracy falls) while training loss keeps decreasing, the model is probably overfitting
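The "stop when evaluation stops improving" rule can be written down as a small patience check. This is a plain-Python sketch with made-up loss values; `best_epoch_with_patience` is a hypothetical helper and is separate from the library's own `EarlyStoppingCallback`, which serves a similar purpose.

```python
def best_epoch_with_patience(eval_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a list of per-epoch eval losses."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(eval_losses) - 1

# Made-up eval losses: improving at first, then rising as the model overfits
history = [1.20, 0.85, 0.70, 0.74, 0.81, 0.90]
best, stopped = best_epoch_with_patience(history)
print(f"best epoch: {best}, stop at epoch: {stopped}")
```

In this example the lowest eval loss is at epoch 2 (counting from 0), and training would stop two epochs later once no improvement is seen.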

In [None]:
# Disable wandb logging
os.environ["WANDB_DISABLED"] = "true"

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

print("Starting fine-tuning...")
print("This may take several minutes (or hours on CPU).")
print("Watch for decreasing loss and increasing accuracy.\n")

In [None]:
# Train the model
trainer.train()

### Save the Fine-Tuned Model

Save your fine-tuned model so you can use it later without retraining.

In [None]:
trainer.save_model(cached_model_directory_name)
print(f"Model saved to {cached_model_directory_name}/")

# To reload later:
# model = DistilBertForSequenceClassification.from_pretrained(cached_model_directory_name)

### Evaluate the Fine-Tuned Model

Let's see how well our model performs on the test set and compare to the baseline.

In [None]:
# Evaluate on test set
eval_results = trainer.evaluate()

print("\nFinal Evaluation Results:")
print(f"  Test Accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"  Test Loss: {eval_results['eval_loss']:.4f}")

In [None]:
# Get detailed predictions
predicted_results = trainer.predict(test_dataset)
predicted_labels_encoded = predicted_results.predictions.argmax(-1)
predicted_labels = [id2label[label_id] for label_id in predicted_labels_encoded]

print("\nDetailed Classification Report:")
print(classification_report(test_labels, predicted_labels))

### Confusion Matrix and Error Analysis

A confusion matrix shows which genres are most often confused with each other. This can reveal:
- Which genres are easy to distinguish
- Which genres have similar review language
- Where the model needs improvement
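Before plotting, it can help to list the most frequently confused pairs directly. The sketch below uses made-up labels and only the standard library; the variable names are illustrative and do not refer to the notebook's own data.

```python
from collections import Counter

# Made-up labels for illustration
true_genres = ["romance", "romance", "young_adult", "young_adult", "poetry", "romance"]
pred_genres = ["romance", "young_adult", "romance", "young_adult", "poetry", "young_adult"]

# Count every (true, predicted) pair; off-diagonal pairs are errors
pair_counts = Counter(zip(true_genres, pred_genres))
errors = Counter({pair: n for pair, n in pair_counts.items() if pair[0] != pair[1]})

for (true_genre, pred_genre), n in errors.most_common():
    print(f"{true_genre} -> {pred_genre}: {n}")
```

A ranked list like this gives the same information as the off-diagonal cells of the heatmap, in a form that is easy to scan when there are many classes.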

In [None]:
# Create confusion matrix data
genre_classifications_dict = defaultdict(int)
for true_label, predicted_label in zip(test_labels, predicted_labels):
    genre_classifications_dict[(true_label, predicted_label)] += 1

# Convert to dataframe for visualization
dicts_to_plot = []
for (true_genre, predicted_genre), count in genre_classifications_dict.items():
    dicts_to_plot.append({
        'True Genre': true_genre,
        'Predicted Genre': predicted_genre,
        'Number of Classifications': count
    })

df_to_plot = pd.DataFrame(dicts_to_plot)
df_wide = df_to_plot.pivot_table(
    index='True Genre',
    columns='Predicted Genre',
    values='Number of Classifications'
)

# Plot full confusion matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df_wide, annot=True, fmt='g', cmap='Purples', linewidths=1)
plt.title('Confusion Matrix: All Predictions')
plt.xlabel('Predicted Genre')
plt.ylabel('True Genre')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Plot misclassifications only (remove diagonal)
genre_misclassifications_dict = defaultdict(int)
for true_label, predicted_label in zip(test_labels, predicted_labels):
    if true_label != predicted_label:
        genre_misclassifications_dict[(true_label, predicted_label)] += 1

dicts_to_plot = []
for (true_genre, predicted_genre), count in genre_misclassifications_dict.items():
    dicts_to_plot.append({
        'True Genre': true_genre,
        'Predicted Genre': predicted_genre,
        'Number of Misclassifications': count
    })

if dicts_to_plot:
    df_to_plot = pd.DataFrame(dicts_to_plot)
    df_wide = df_to_plot.pivot_table(
        index='True Genre',
        columns='Predicted Genre',
        values='Number of Misclassifications'
    )
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(df_wide, annot=True, fmt='g', cmap='Reds', linewidths=1)
    plt.title('Confusion Matrix: Misclassifications Only')
    plt.xlabel('Predicted Genre')
    plt.ylabel('True Genre')
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()
else:
    print("Perfect classification — no misclassifications to plot!")

### Examine Correct and Incorrect Predictions

Looking at individual examples helps us understand:
- What the model learned
- What kinds of errors it makes
- Whether errors are understandable (ambiguous cases) or concerning

In [None]:
# Show correctly classified examples
print("CORRECTLY CLASSIFIED EXAMPLES:\n" + "="*80)
correct_count = 0
for true_label, predicted_label, text in random.sample(list(zip(test_labels, predicted_labels, test_texts)), 50):
    if true_label == predicted_label and correct_count < 10:
        print(f"\nGenre: {true_label}")
        print(f"Review: {text[:200]}...")
        correct_count += 1
        if correct_count >= 10:
            break

In [None]:
# Show misclassified examples
print("\n\nMISCLASSIFIED EXAMPLES:\n" + "="*80)
error_count = 0
for true_label, predicted_label, text in random.sample(list(zip(test_labels, predicted_labels, test_texts)), 50):
    if true_label != predicted_label and error_count < 10:
        print(f"\nTrue Genre: {true_label}")
        print(f"Predicted Genre: {predicted_label}")
        print(f"Review: {text[:200]}...")
        error_count += 1
        if error_count >= 10:
            break

---
## Part 6: Advanced Topics - FLAN-T5 Zero-Shot

**Optional:** This section demonstrates advanced zero-shot classification using FLAN-T5, an encoder-decoder model fine-tuned on instruction-following tasks.

Unlike the pipeline-based zero-shot classifier, FLAN-T5:
- Uses **prompt engineering** to guide the model
- Evaluates choices by **comparing loss values**
- Can be more flexible but requires more code

**Use cases:**
- When you want fine control over prompts
- When you need to classify with complex instructions
- For research on prompt engineering

**Requirements:** GPU recommended for larger FLAN-T5 models.

### Load FLAN-T5 Model

In [None]:
# Load FLAN-T5 (instruction-tuned encoder-decoder)
# Options: flan-t5-small, flan-t5-base, flan-t5-large, flan-t5-xl
model_id = "google/flan-t5-large"

print(f'Loading {model_id} on {device}...')
try:
    flan_model = T5ForConditionalGeneration.from_pretrained(model_id).to(device)
    flan_tokenizer = T5Tokenizer.from_pretrained(model_id)
    print("FLAN-T5 loaded successfully!")
except Exception as e:
    print(f"Could not load FLAN-T5: {e}")
    print("Skipping advanced zero-shot section.")
    flan_model = None
    flan_tokenizer = None

### Prompt Templates

Different prompts can significantly affect model performance. Let's define several templates to experiment with.

In [None]:
def apply_prompt_1(text, possible_choices):
    return f'Which genre of book is the following review about?\nReview: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'

def apply_prompt_2(text, possible_choices):
    return f'Review: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nGenre:'

def apply_prompt_3(text, possible_choices):
    return f'Review: {text}\nGenre:'

def apply_prompt_4(text, possible_choices):
    return f'\nReview: {text}\nWhich genre of book is the review about?'

def apply_prompt_5(text, possible_choices):
    return f'Review: {text}\nChoices: {possible_choices[0]} or {possible_choices[1]}\nAnswer:'

# Test a prompt
test_text = "This novel has amazing world-building and magical creatures."
test_choices = ["fantasy", "history"]
print("Example prompt:")
print(apply_prompt_1(test_text, test_choices))
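
The templates above hard-code exactly two choices. A minimal sketch of a variant that accepts two or more choices is shown below; `apply_prompt_n_choices` is a hypothetical helper, not part of the original notebook, and for two choices it produces the same text as `apply_prompt_1`.

```python
def apply_prompt_n_choices(text, possible_choices):
    """Hypothetical variant of apply_prompt_1 that accepts two or more choices."""
    if len(possible_choices) < 2:
        raise ValueError("Provide at least two choices.")
    # Join all but the last choice with commas, then attach the last with "or"
    *first_choices, last_choice = possible_choices
    choices_text = f"{', '.join(first_choices)} or {last_choice}"
    return (
        "Which genre of book is the following review about?\n"
        f"Review: {text}\n"
        f"Choices: {choices_text}\n"
        "Answer:"
    )


print(apply_prompt_n_choices(
    "This novel has amazing world-building and magical creatures.",
    ["fantasy", "history", "poetry"],
))
```

Whether a longer list of choices helps or hurts accuracy is an empirical question; this only removes the two-choice restriction from the template.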

### Loss-by-Choice Classification

This approach works as follows:
1. For each possible choice, compute the model's loss for generating that choice as the answer to the prompt
2. Predict the choice with the lowest loss (that is, the highest likelihood)
3. No training is required, so the classification is purely zero-shot
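
To make the scoring rule concrete, here is a toy sketch in plain Python. It assumes the loss is the mean negative log-likelihood of the answer's tokens, which is how the cross-entropy loss behaves when labels are passed to the model. The per-token probabilities are invented for illustration and do not come from FLAN-T5.

```python
import math


def mean_negative_log_likelihood(token_probs):
    """Average of -log(p) over the tokens of one candidate answer."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities a model might assign to each token of each choice
toy_token_probs = {
    "fantasy": [0.60, 0.90],
    "history": [0.05, 0.80, 0.70],
}

toy_losses = {
    choice: mean_negative_log_likelihood(probs)
    for choice, probs in toy_token_probs.items()
}
toy_prediction = min(toy_losses, key=toy_losses.get)

for choice, loss in toy_losses.items():
    print(f"{choice}: loss = {loss:.4f}")  # roughly 0.31 for fantasy, 1.19 for history
print(f"Predicted: {toy_prediction}")
```

Because the loss is averaged over tokens, choices that tokenize into different numbers of pieces are compared on a per-token basis rather than by total log-probability.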

In [None]:
def classify_example(text, label, possible_choices, model, tokenizer, verbose=False):
    """Classify one prompted example by loss-by-choice; return True if the prediction equals `label`."""
    # Tokenize the prompt once; truncation=True shortens prompts that exceed the tokenizer's maximum length
    inputs = tokenizer(text, return_tensors='pt', truncation=True).to(device)
    
    losses_and_targets = []
    for target_text in possible_choices:
        # Tokenize the candidate answer so it can serve as the decoder target
        target = tokenizer(target_text, return_tensors='pt', truncation=True).to(device)
        target_ids = target.input_ids
        
        # With labels supplied, the model returns the mean cross-entropy over the target tokens
        with torch.no_grad():
            outputs = model(
                input_ids=inputs.input_ids,
                attention_mask=inputs.attention_mask,
                labels=target_ids,
            )
        
        loss = outputs.loss.item()
        losses_and_targets.append((loss, target_text))
        
        if verbose:
            print(f"  {target_text}: loss = {loss:.4f}")
    
    # The lowest loss marks the choice the model considers most likely
    _, best_choice = min(losses_and_targets)
    
    if verbose:
        print(f"  → Predicted: {best_choice} (True: {label})")
    
    return best_choice == label

def classify_dataset(prompted_examples, labels, possible_choices, model, tokenizer, verbose=False):
    """Classify a dataset and return accuracy."""
    num_examples = len(prompted_examples)
    correct = 0
    
    for i in tqdm(range(num_examples), desc="Classifying"):
        prompted_example = prompted_examples[i]
        label = labels[i]
        is_correct = classify_example(
            prompted_example, label, possible_choices, model, tokenizer,
            verbose=(i < 5 and verbose)
        )
        correct += int(is_correct)
    
    return correct / num_examples
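
One caveat with `truncation=True` in `classify_example`: when a prompted review exceeds the tokenizer's maximum length, tokens are dropped from the end, which is where most of these templates place the choices and the `Answer:` cue. A possible workaround, sketched below, is to shorten the review itself before applying a prompt. `shorten_review` and its `max_words` default are hypothetical and not part of the original notebook. Word counts only approximate token counts, so the limit may need adjusting.

```python
def shorten_review(text, max_words=300):
    """Hypothetical helper: keep only the first max_words whitespace-separated words."""
    words = text.split()
    if len(words) <= max_words:
        return text
    # Mark the cut so the shortened text is recognizable
    return " ".join(words[:max_words]) + " ..."


long_review = "word " * 1000
print(len(shorten_review(long_review).split()))  # 301: 300 words plus the "..." marker
```

The shortened text can then be passed to any template, for example `apply_prompt_1(shorten_review(t), possible_choices)`.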

### Example: Binary Classification

Let's test FLAN-T5 on a binary classification task: distinguishing history/biography reviews from poetry reviews.

In [None]:
if flan_model is not None and genre_reviews_dict:
    # Helper function to subsample two classes
    def subsample_two_classes(all_texts, all_labels, label_1, label_2, n):
        all_texts = np.array(all_texts)
        all_labels = np.array(all_labels)
        idxs_label_1 = np.where(all_labels == label_1)[0].tolist()
        idxs_label_2 = np.where(all_labels == label_2)[0].tolist()
        n_each_class = n // 2  # take the first n/2 examples of each class
        idxs_label_1 = idxs_label_1[:n_each_class]
        idxs_label_2 = idxs_label_2[:n_each_class]
        subset_idxs = idxs_label_1 + idxs_label_2
        random.shuffle(subset_idxs)
        subset_texts = list(all_texts[subset_idxs])
        subset_labels = list(all_labels[subset_idxs])
        return subset_texts, subset_labels
    
    # Prepare binary classification task
    task_texts, task_labels = subsample_two_classes(
        test_texts, test_labels, 
        'history_biography', 'poetry', 
        n=100
    )
    
    # Simplify labels for prompting
    original_label_to_new_name = {
        'history_biography': 'history/biography', 
        'poetry': 'poetry'
    }
    possible_choices = list(original_label_to_new_name.values())
    task_labels = [original_label_to_new_name[l] for l in task_labels]
    
    # Apply prompt
    task_texts_prompted = [apply_prompt_1(t, possible_choices) for t in task_texts]
    
    print("\nExample prompted text:")
    print(task_texts_prompted[0][:300])
    print("\n" + "="*80)
    
    # Classify
    accuracy = classify_dataset(
        task_texts_prompted, task_labels, possible_choices,
        flan_model, flan_tokenizer, verbose=True
    )
    
    print(f"\n\nZero-shot accuracy (history/biography vs poetry): {accuracy*100:.2f}%")
else:
    print("Skipping zero-shot example (model not loaded or data unavailable)")

### Prompt Engineering Experiments

Try different prompts and see how they affect accuracy!

In [None]:
if flan_model is not None and 'task_texts' in locals():
    prompts_to_test = [
        ("Prompt 1: Full question", apply_prompt_1),
        ("Prompt 2: Choices then genre", apply_prompt_2),
        ("Prompt 3: Minimal", apply_prompt_3),
        ("Prompt 5: Review choices answer", apply_prompt_5),
    ]
    
    results = []
    for name, prompt_func in prompts_to_test:
        print(f"\nTesting {name}...")
        prompted = [prompt_func(t, possible_choices) for t in task_texts]
        acc = classify_dataset(prompted, task_labels, possible_choices, flan_model, flan_tokenizer)
        results.append({'Prompt': name, 'Accuracy': acc})
        print(f"{name}: {acc*100:.2f}%")
    
    # Show results
    results_df = pd.DataFrame(results)
    display(results_df.sort_values('Accuracy', ascending=False))
else:
    print("Skipping prompt experiments")
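
When comparing prompts on a small sample such as the 100 examples used here, differences of a few percentage points can be within sampling noise. The sketch below computes an approximate 95% interval for an accuracy estimate using the normal approximation to the binomial. `accuracy_interval` is a hypothetical helper, and the approximation is rough when accuracy is close to 0 or 1.

```python
import math


def accuracy_interval(accuracy, n_examples, z=1.96):
    """Approximate 95% normal-approximation interval for an accuracy estimate."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_examples)
    lower = max(0.0, accuracy - z * standard_error)
    upper = min(1.0, accuracy + z * standard_error)
    return lower, upper


# Hypothetical accuracy value, used only to illustrate the calculation
lower, upper = accuracy_interval(0.90, 100)
print(f"Approximate 95% interval: {lower:.2f} to {upper:.2f}")
```

For a hypothetical accuracy of 0.90 on 100 examples, the interval spans roughly 0.84 to 0.96, so two prompts whose accuracies both fall in that range are hard to tell apart without more data.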

---
## Summary and Next Steps

Congratulations! You've completed a comprehensive tour of transformer models.

**What you learned:**
1. Text generation with GPT-2 and parameter tuning
2. Text embeddings and semantic similarity with BERT
3. Pre-trained pipelines for common NLP tasks
4. Fine-tuning BERT for custom classification
5. Advanced zero-shot classification with FLAN-T5

**Next steps:**
- Try fine-tuning on your own dataset
- Experiment with different model sizes
- Explore other Hugging Face models (RoBERTa, ALBERT, DeBERTa)
- Use embeddings for clustering and visualization
- Build applications with the Transformers library

**Resources:**
- [Hugging Face Documentation](https://huggingface.co/docs/transformers/)
- [Model Hub](https://huggingface.co/models)
- [AI for Humanists](https://aiforhumanists.com/)
- [BERT Paper](https://arxiv.org/abs/1810.04805)
- [GPT-2 Paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)