# Chapter 4: Text Classification - Hard Tasks

This notebook tackles advanced classification challenges. You'll implement hierarchical multi-level classifiers for complex taxonomies, use active learning to minimize labeling costs, build ensemble classifiers to improve robustness, and apply transfer learning across domains. These techniques are essential for production-level NLP systems.


---

## Setup

Run all cells in this section to set up the environment and load necessary data.


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are viewing this notebook on Google Colab (or another cloud platform), run the following code block to install the dependencies for this chapter:

---

**NOTE**: The examples in this notebook run best on a GPU. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

### Data Loading


In [2]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

### Helper Functions


In [6]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

## Your Turn - Text Classification Experiments

Run each task first to see the baseline results. Follow the instructions to modify and experiment.

This section is divided into EASY, MEDIUM, & HARD.

---

## Hard Tasks


### Hard Tasks - Advanced Classification Challenges

These tasks require significant modifications and deeper understanding. Take your time and experiment.

#### Hard Task 1: Hierarchical Multi-Level Classifier

Run the 2-level classifier first. Then try adding a 3rd level.


In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

test_reviews = [
    # Positive - Quality
    "Brilliant performances and stunning cinematography",
    "Exceptional directing and beautiful visuals",

    # Positive - Entertainment
    "So much fun! Had a great time watching",
    "Really entertaining and enjoyable",

    # Negative - Boring
    "Incredibly dull and slow-paced",
    "Boring, nothing happens for two hours",

    # Negative - Quality issues
    "Poor acting and terrible script",
    "Awful production values and bad directing",
]

# TODO: Add reviews for 3rd level
# test_reviews.extend([...])

Define the hierarchy:


In [None]:
# Level 1: Sentiment
level1_labels = [
    "negative sentiment review",
    "positive sentiment review"
]

# Level 2: Aspects
level2_negative = [
    "review criticizing entertainment value and pacing",
    "review criticizing technical quality and production"
]

level2_positive = [
    "review praising technical quality and artistry",
    "review praising entertainment value and enjoyment"
]

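# TODO: Add Level 3 (one list per Level 2 branch)
# level3_positive_quality = [...]
# level3_positive_entertainment = [...]
# level3_negative_entertainment = [...]
# level3_negative_quality = [...]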

Classification function:


In [None]:
def hierarchical_classify_2level(text):
    """Two-level classification: Sentiment -> Aspect"""
    text_embedding = model.encode([text])

    # Level 1: Determine sentiment
    level1_embeddings = model.encode(level1_labels)
    level1_sim = cosine_similarity(text_embedding, level1_embeddings)[0]
    level1_pred = np.argmax(level1_sim)
    level1_conf = level1_sim[level1_pred]

    # Level 2: Determine specific aspect based on Level 1
    if level1_pred == 0:  # Negative
        level2_labels = level2_negative
        sentiment = "Negative"
    else:  # Positive
        level2_labels = level2_positive
        sentiment = "Positive"

    level2_embeddings = model.encode(level2_labels)
    level2_sim = cosine_similarity(text_embedding, level2_embeddings)[0]
    level2_pred = np.argmax(level2_sim)
    level2_conf = level2_sim[level2_pred]

    return {
        'level1_pred': level1_pred,
        'level1_label': level1_labels[level1_pred],
        'level1_conf': level1_conf,
        'level2_pred': level2_pred,
        'level2_label': level2_labels[level2_pred],
        'level2_conf': level2_conf,
        'sentiment': sentiment,
        'path': f"{sentiment} -> {level2_labels[level2_pred]}"
    }

# TODO: Uncomment to implement 3-level classification
# def hierarchical_classify_3level(text):
#     """Three-level classification: Sentiment -> Aspect -> Specific"""
#     # Start with levels 1 and 2
#     result = hierarchical_classify_2level(text)
#
#     text_embedding = model.encode([text])
#
#     # Level 3: Even more specific based on Level 2
#     if result['sentiment'] == "Positive":
#         if result['level2_pred'] == 0:  # Quality
#             level3_labels = level3_positive_quality
#         else:  # Entertainment
#             level3_labels = level3_positive_entertainment
#     else:  # Negative
#         if result['level2_pred'] == 0:  # Entertainment
#             level3_labels = level3_negative_entertainment
#         else:  # Quality
#             level3_labels = level3_negative_quality
#
#     level3_embeddings = model.encode(level3_labels)
#     level3_sim = cosine_similarity(text_embedding, level3_embeddings)[0]
#     level3_pred = np.argmax(level3_sim)
#     level3_conf = level3_sim[level3_pred]
#
#     result['level3_pred'] = level3_pred
#     result['level3_label'] = level3_labels[level3_pred]
#     result['level3_conf'] = level3_conf
#     result['path'] = f"{result['sentiment']} -> L2 -> {level3_labels[level3_pred]}"
#
#     return result

# TODO: Implement 3-level version
# def hierarchical_classify_3level(text):
#     ...

Classify the reviews:


In [None]:
print("="*80)
print("HIERARCHICAL CLASSIFICATION (2 LEVELS)")
print("="*80)

for i, review in enumerate(test_reviews):
    result = hierarchical_classify_2level(review)

    print(f"\nReview {i+1}: '{review}'")
    print(f"\n  Level 1 (Sentiment):")
    print(f"     {result['level1_label']}")
    print(f"     Confidence: {result['level1_conf']:.3f}")

    print(f"\n  Level 2 (Specific Aspect):")
    print(f"     {result['level2_label']}")
    print(f"     Confidence: {result['level2_conf']:.3f}")

    print(f"\n  Final Classification Path:")
    print(f"     {result['path']}")
    print("-"*80)

# Compare with flat classification
print("\n" + "="*80)
print("COMPARISON: Hierarchical vs Flat Classification")
print("="*80)

# Flat: All 4 categories at once
flat_labels = [
    "review criticizing entertainment value and pacing",      # 0
    "review criticizing technical quality and production",    # 1
    "review praising technical quality and artistry",         # 2
    "review praising entertainment value and enjoyment"       # 3
]

flat_embeddings = model.encode(flat_labels)
review_embeddings = model.encode(test_reviews)
flat_sim = cosine_similarity(review_embeddings, flat_embeddings)

print("\nShowing first 3 reviews:")
for i in range(min(3, len(test_reviews))):
    hier_result = hierarchical_classify_2level(test_reviews[i])
    flat_pred = np.argmax(flat_sim[i])
    flat_conf = flat_sim[i][flat_pred]

    print(f"\nReview: '{test_reviews[i][:50]}...'")
    print(f"  Hierarchical: {hier_result['level2_label']}")
    print(f"     Confidence: {hier_result['level2_conf']:.3f}")
    print(f"  Flat:         {flat_labels[flat_pred]}")
    print(f"     Confidence: {flat_conf:.3f}")
    print(f"  Confidence Diff: {hier_result['level2_conf'] - flat_conf:+.3f}")

# TODO: After implementing 3-level, uncomment to test it:
# print("\n" + "="*80)
# print("TESTING 3-LEVEL HIERARCHICAL CLASSIFICATION")
# print("="*80)
#
# for i, review in enumerate(test_reviews):
#     result = hierarchical_classify_3level(review)
#     print(f"\n{i+1}. '{review[:60]}...'")
#     print(f"   Path: {result['path']}")
#     print(f"   Level 3 confidence: {result['level3_conf']:.3f}")

### Questions

1. Compare confidence scores for hierarchical vs flat. Does breaking into steps help?

2. Can the classifier get Level 2 right even if Level 1 is wrong?

3. After adding 3 levels: Did the extra granularity help or hurt?
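
For Question 2, it can help to see the hierarchy as a tree walk. The sketch below is a minimal illustration, not the notebook's implementation: it uses hand-picked 2-D vectors in place of real sentence embeddings, and the names `classify_path`, `label_vecs`, and `tree` are hypothetical. It shows that once Level 1 picks a branch, Level 2 only ever compares against the children of that branch.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_path(text_vec, tree, label_vecs):
    """Walk a nested dict of labels, picking the most similar child at each level."""
    path = []
    node = tree
    while node:  # an empty dict marks a leaf
        best = max(node, key=lambda label: cosine(text_vec, label_vecs[label]))
        path.append(best)
        node = node[best]  # descend: siblings of other branches are never scored
    return path

# Toy 2-D "embeddings" (assumed values, for illustration only)
label_vecs = {
    "negative": np.array([-1.0, 0.0]),
    "positive": np.array([1.0, 0.0]),
    "neg/pacing": np.array([-1.0, 1.0]),
    "neg/production": np.array([-1.0, -1.0]),
    "pos/artistry": np.array([1.0, -1.0]),
    "pos/enjoyment": np.array([1.0, 1.0]),
}
tree = {
    "negative": {"neg/pacing": {}, "neg/production": {}},
    "positive": {"pos/artistry": {}, "pos/enjoyment": {}},
}

review_vec = np.array([0.9, 0.4])
path = classify_path(review_vec, tree, label_vecs)
print(path)  # ['positive', 'pos/enjoyment']
```

Because the walk is driven by the nested dict, adding a third level only means nesting one more layer of labels; the function itself does not change.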


#### Hard Task 2: Active Learning to Minimize Labeling

Compare active learning (picking uncertain samples) vs random selection.


In [36]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from datasets import load_dataset
import numpy as np

data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Prepare datasets
pool_size = 2000
test_size = 300

train_pool = data["train"].shuffle(seed=42).select(range(pool_size))
test_set = data["test"].shuffle(seed=42).select(range(test_size))

# Generate embeddings upfront (faster)
print("Generating embeddings for 2000 training pool and 300 test samples...")
pool_embeddings = model.encode(train_pool["text"], show_progress_bar=True)
test_embeddings = model.encode(test_set["text"], show_progress_bar=False)
print("✓ Embeddings ready\n")

def uncertainty_sampling(clf, unlabeled_embeddings, n_samples=50):
    """
    Select samples where model is most uncertain.
    Strategy: Pick samples with lowest confidence (closest to 50-50)
    """
    probs = clf.predict_proba(unlabeled_embeddings)
    # Uncertainty = 1 - max(prob) = how close to 50-50 the prediction is
    uncertainties = 1 - np.max(probs, axis=1)

    # Get indices of most uncertain samples
    most_uncertain_indices = np.argsort(uncertainties)[-n_samples:]
    return most_uncertain_indices

# TODO: Uncomment to implement alternative selection strategy
# def uncertainty_sampling(clf, unlabeled_embeddings, n_samples=50):
#     """
#     Alternative strategy: Margin sampling
#     Select samples where top two predictions are closest
#     """
#     probs = clf.predict_proba(unlabeled_embeddings)
#     # Sort probabilities for each sample
#     sorted_probs = np.sort(probs, axis=1)
#     # Margin = difference between top two
#     margins = sorted_probs[:, -1] - sorted_probs[:, -2]
#
#     # Get indices of smallest margins (most uncertain)
#     most_uncertain_indices = np.argsort(margins)[:n_samples]
#     return most_uncertain_indices

def random_sampling(n_available, n_samples=50):
    """Baseline: Random selection"""
    return np.random.choice(n_available, size=min(n_samples, n_available), replace=False)

# Active Learning Simulation
print("="*80)
print("ACTIVE LEARNING SIMULATION")
print("="*80)
print("Strategy: Start with 100 labeled, then iteratively add 50 most uncertain samples")
print("Compare: Active Learning vs Random Sampling\n")

# Configuration
initial_size = 100
samples_per_iteration = 50
n_iterations = 10

# Initialize
labeled_indices = list(range(initial_size))
unlabeled_indices = list(range(initial_size, pool_size))

active_scores = []
random_scores = []
iteration_labeled_sizes = []

for iteration in range(n_iterations):
    current_size = len(labeled_indices)
    iteration_labeled_sizes.append(current_size)

    print(f"{'='*80}")
    print(f"Iteration {iteration + 1}/{n_iterations} - Labeled samples: {current_size}")
    print('='*80)

    # Get current labeled data
    labeled_embeddings = pool_embeddings[labeled_indices]
    labeled_labels = [train_pool["label"][i] for i in labeled_indices]

    # Train classifier
    clf = LogisticRegression(random_state=42, max_iter=1000)
    clf.fit(labeled_embeddings, labeled_labels)

    # Evaluate
    test_pred = clf.predict(test_embeddings)
    active_f1 = f1_score(test_set["label"], test_pred, average='weighted')
    active_scores.append(active_f1)

    print(f"Active Learning F1: {active_f1:.4f}")

    # Compare with random sampling (same number of samples).
    # A per-iteration seed keeps the random baseline reproducible across runs.
    rng = np.random.default_rng(42 + iteration)
    random_indices = list(range(initial_size)) + list(
        rng.choice(np.arange(initial_size, pool_size),
                   size=min(current_size - initial_size, pool_size - initial_size),
                   replace=False)
    )
    random_embeddings = pool_embeddings[random_indices]
    random_labels = [train_pool["label"][i] for i in random_indices]

    clf_random = LogisticRegression(random_state=42, max_iter=1000)
    clf_random.fit(random_embeddings, random_labels)
    random_pred = clf_random.predict(test_embeddings)
    random_f1 = f1_score(test_set["label"], random_pred, average='weighted')
    random_scores.append(random_f1)

    print(f"Random Sampling F1:  {random_f1:.4f}")
    print(f"Improvement:         {active_f1 - random_f1:+.4f}")

    # Select next batch using active learning
    if len(unlabeled_indices) < samples_per_iteration:
        print(f"\nStopping: only {len(unlabeled_indices)} samples left")
        break

    unlabeled_embeddings = pool_embeddings[unlabeled_indices]
    uncertain_local_indices = uncertainty_sampling(clf, unlabeled_embeddings, samples_per_iteration)

    # Convert to global indices
    newly_labeled = [unlabeled_indices[i] for i in uncertain_local_indices]

    # Show examples of selected samples
    print(f"\nExamples of selected UNCERTAIN samples:")
    for i, idx in enumerate(newly_labeled[:3]):
        probs = clf.predict_proba(pool_embeddings[idx].reshape(1, -1))[0]
        uncertainty = 1 - np.max(probs)
        print(f"  {i+1}. '{train_pool['text'][idx][:60]}...'")
        print(f"     Uncertainty: {uncertainty:.3f}")
        print(f"     Probs: [neg={probs[0]:.3f}, pos={probs[1]:.3f}]")

    # Update sets (a set makes the membership test O(1) per index)
    labeled_indices.extend(newly_labeled)
    newly_labeled_set = set(newly_labeled)
    unlabeled_indices = [idx for idx in unlabeled_indices if idx not in newly_labeled_set]
    print()

# Results Summary
print("="*80)
print("FINAL RESULTS - LEARNING CURVES")
print("="*80)

print(f"\n{'Labeled':<10s} {'Active F1':<12s} {'Random F1':<12s} {'Difference':<12s}")
print("-"*50)
for size, active, random in zip(iteration_labeled_sizes, active_scores, random_scores):
    diff = active - random
    marker = " ✓" if diff > 0.01 else ""
    print(f"{size:<10d} {active:.4f}       {random:.4f}       {diff:+.4f}{marker}")

avg_improvement = np.mean(np.array(active_scores) - np.array(random_scores))
print(f"\nAverage Improvement: {avg_improvement:+.4f}")

# Find when active learning reaches target F1
target_f1 = 0.85
active_reached = next((size for size, f1 in zip(iteration_labeled_sizes, active_scores) if f1 >= target_f1), None)
random_reached = next((size for size, f1 in zip(iteration_labeled_sizes, random_scores) if f1 >= target_f1), None)

if active_reached or random_reached:
    print(f"\nTo reach F1={target_f1}:")
    if active_reached:
        print(f"  Active Learning: {active_reached} labeled samples")
    else:
        print(f"  Active Learning: Did not reach {target_f1}")

    if random_reached:
        print(f"  Random Sampling: {random_reached} labeled samples")
    else:
        print(f"  Random Sampling: Did not reach {target_f1}")

    if active_reached and random_reached:
        savings = random_reached - active_reached
        print(f"   Active Learning saved {savings} labeled samples ({savings/random_reached:.1%})")

print("\n" + "="*80)
print("TODO: Try different selection strategies and record results:")
print("="*80)
print("| Strategy              | Samples to F1=0.85 | Avg Improvement | Notes |")
print("|----------------------|-------------------|-----------------|-------|")
print("| Uncertainty (1-max)  | ???               | ???             |       |")
print("| Margin sampling      | ???               | ???             |       |")
print("| YOUR STRATEGY        | ???               | ???             |       |")

Generating embeddings for 2000 training pool and 300 test samples...


✓ Embeddings ready

ACTIVE LEARNING SIMULATION
Strategy: Start with 100 labeled, then iteratively add 50 most uncertain samples
Compare: Active Learning vs Random Sampling

Iteration 1/10 - Labeled samples: 100
Active Learning F1: 0.7535
Random Sampling F1:  0.7535
Improvement:         +0.0000

Examples of selected UNCERTAIN samples:
  1. 'it uses the pain and violence of war as background material ...'
     Uncertainty: 0.495
     Probs: [neg=0.505, pos=0.495]
  2. '. . . the tale of her passionate , tumultuous affair with mu...'
     Uncertainty: 0.495
     Probs: [neg=0.505, pos=0.495]
  3. 'inventive , fun , intoxicatingly sexy , violent , self-indul...'
     Uncertainty: 0.495
     Probs: [neg=0.505, pos=0.495]

Iteration 2/10 - Labeled samples: 150
Active Learning F1: 0.8023
Random Sampling F1:  0.8222
Improvement:         -0.0199

Examples of selected UNCERTAIN samples:
  1. 'a vile , incoherent mess . . . a scummy ripoff of david cron...'
     Uncertainty: 0.492
     Probs: [ne

### Questions

1. What makes the uncertain samples uncertain? Are they using hedging language, mixed sentiment, or ambiguous wording?

2. At what point did active learning pull ahead of random sampling? How much can active learning reduce labeling costs?

3. Why are samples with probabilities like [0.52, 0.48] more valuable for training than confident samples?
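
To fill in the strategy table, it may help to compare the scoring functions on a few hand-written probabilities first. The sketch below is an illustration under assumptions: the `probs` values are made up, and `entropy_score` is one possible candidate for the "YOUR STRATEGY" row, not something the notebook prescribes.

```python
import numpy as np

def least_confidence(probs):
    """1 - max probability: higher means more uncertain."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """Gap between the two largest probabilities: lower means more uncertain."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy_score(probs, eps=1e-12):
    """Shannon entropy of the predicted distribution: higher means more uncertain."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

# Made-up predicted probabilities for four unlabeled samples: [neg, pos]
probs = np.array([
    [0.52, 0.48],
    [0.95, 0.05],
    [0.30, 0.70],
    [0.60, 0.40],
])

# Rank samples from most to least uncertain under each strategy
rank_lc = np.argsort(-least_confidence(probs))
rank_margin = np.argsort(margin(probs))
rank_entropy = np.argsort(-entropy_score(probs))

print(rank_lc, rank_margin, rank_entropy)
```

With only two classes, the margin equals `2 * max(prob) - 1` and the entropy falls as `max(prob)` rises, so all three strategies rank samples in the same order (apart from how ties are broken). Expect the strategies to select different samples only when there are three or more classes.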

#### Hard Task 3: Ensemble Classifier

**About This Task:**

Ensemble methods combine multiple classifiers to improve robustness and accuracy. Because different models or training strategies tend to make different errors, combining them can offset individual model biases and yield more reliable predictions.

Compare 3 different models individually vs combined as an ensemble.
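
One detail worth checking before adding a fourth model: with an even number of voters, ties become possible, and `np.bincount(x).argmax()` resolves them in favor of the lowest class index (here, negative). The sketch below uses made-up votes to show the effect, plus one possible tie-break rule; deferring to the model in row 0 is an assumption for illustration, not a recommendation from the notebook.

```python
import numpy as np

# Made-up votes: rows = 4 models, columns = 3 test items (0 = negative, 1 = positive)
votes = np.array([
    [1, 0, 1],  # model A
    [1, 0, 1],  # model B
    [0, 1, 1],  # model C
    [0, 1, 0],  # model D
])

# Same majority rule as the notebook; argmax returns the first maximum,
# so a 2-2 split silently becomes class 0.
majority = np.apply_along_axis(
    lambda x: np.bincount(x, minlength=2).argmax(), 0, votes
)

# Detect ties explicitly: exactly half of the models voted positive
ties = votes.sum(axis=0) * 2 == votes.shape[0]

# Hypothetical tie-break: fall back to the model in row 0
resolved = np.where(ties, votes[0], majority)

print(majority)  # [0 0 1]
print(ties)      # [ True  True False]
print(resolved)  # [1 0 1]
```

Counting how often `ties` is true on the real test set is a quick way to see how much of the four-model result depends on the tie-break rule.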


In [37]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
from datasets import load_dataset
import numpy as np

data = load_dataset("rotten_tomatoes")

# Configuration
train_size = 1500
test_size = 300

train_subset = data["train"].shuffle(seed=42).select(range(train_size))
test_subset = data["test"].shuffle(seed=42).select(range(test_size))

print("="*80)
print("BUILDING ENSEMBLE OF CLASSIFIERS")
print("="*80)

# Model 1: Task-Specific (Twitter RoBERTa)
print("\n[1/3] Loading Task-Specific Model (Twitter RoBERTa)...")
task_model = pipeline(
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    return_all_scores=True,
    device=-1
)

# Model 2: Embedding + Logistic Regression
print("[2/3] Training Embedding Classifier...")
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_embeddings = embedding_model.encode(train_subset["text"], show_progress_bar=True)
test_embeddings = embedding_model.encode(test_subset["text"], show_progress_bar=False)

clf_embedding = LogisticRegression(random_state=42, max_iter=1000)
clf_embedding.fit(train_embeddings, train_subset["label"])

# Model 3: Zero-Shot
print("[3/3] Setting up Zero-Shot Classifier...")
zero_shot_labels = ["A very negative movie review", "A very positive movie review"]
zero_shot_label_embeddings = embedding_model.encode(zero_shot_labels)

# TODO: Uncomment to add Model 4 - Different embedding model
# print("[4/4] Training with alternative embedding model...")
# embedding_model_alt = SentenceTransformer('all-MiniLM-L6-v2')  # Smaller, faster
# train_embeddings_alt = embedding_model_alt.encode(train_subset["text"], show_progress_bar=True)
# test_embeddings_alt = embedding_model_alt.encode(test_subset["text"], show_progress_bar=False)
# clf_embedding_alt = LogisticRegression(random_state=42, max_iter=1000)
# clf_embedding_alt.fit(train_embeddings_alt, train_subset["label"])

print("\n✓ All models ready")

# Get predictions from all models
print("\n" + "="*80)
print("GENERATING PREDICTIONS")
print("="*80)

print("\nModel 1: Task-Specific...")
pred_task = []
conf_task = []
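for text in test_subset["text"]:
    # With return_all_scores=True, scores come back in label-id order;
    # this checkpoint is a 3-class model (0 = negative, 1 = neutral, 2 = positive).
    output = task_model(text)[0]
    neg_score = output[0]["score"]
    pos_score = output[2]["score"]
    # Ignore the neutral class and keep whichever polar score is larger
    pred_task.append(1 if pos_score > neg_score else 0)
    conf_task.append(max(neg_score, pos_score))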

print("Model 2: Embedding Classifier...")
pred_embedding = clf_embedding.predict(test_embeddings)
conf_embedding = np.max(clf_embedding.predict_proba(test_embeddings), axis=1)

print("Model 3: Zero-Shot...")
from sklearn.metrics.pairwise import cosine_similarity
zero_shot_sim = cosine_similarity(test_embeddings, zero_shot_label_embeddings)
pred_zero_shot = np.argmax(zero_shot_sim, axis=1)
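# Note: these are raw cosine similarities, not probabilities, so they are on a
# different scale than the other models' confidences in the weighted vote below.
conf_zero_shot = np.max(zero_shot_sim, axis=1)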

# TODO: Uncomment if you added Model 4
# print("Model 4: Alternative Embedding...")
# pred_alt = clf_embedding_alt.predict(test_embeddings_alt)
# conf_alt = np.max(clf_embedding_alt.predict_proba(test_embeddings_alt), axis=1)

# Evaluate individual models
print("\n" + "="*80)
print("INDIVIDUAL MODEL PERFORMANCE")
print("="*80)

models = [
    ("Task-Specific", pred_task),
    ("Embedding + LR", pred_embedding),
    ("Zero-Shot", pred_zero_shot),
]

# TODO: Uncomment if Model 4 added
# models.append(("Alternative Embedding", pred_alt))

individual_scores = []
for name, predictions in models:
    f1 = f1_score(test_subset["label"], predictions, average='weighted')
    acc = accuracy_score(test_subset["label"], predictions)
    individual_scores.append(f1)
    print(f"\n{name}:")
    print(f"  F1 Score:  {f1:.4f}")
    print(f"  Accuracy:  {acc:.4f}")

# Ensemble Methods
print("\n" + "="*80)
print("ENSEMBLE METHODS")
print("="*80)

# Method 1: Simple Majority Voting
ensemble_votes = np.array([pred_task, pred_embedding, pred_zero_shot])
# TODO: Add Model 4 if available
# ensemble_votes = np.array([pred_task, pred_embedding, pred_zero_shot, pred_alt])

pred_majority = np.apply_along_axis(lambda x: np.bincount(x).argmax(), 0, ensemble_votes)

maj_f1 = f1_score(test_subset["label"], pred_majority, average='weighted')
maj_acc = accuracy_score(test_subset["label"], pred_majority)

print(f"\n1. Simple Majority Voting:")
print(f"   F1 Score:  {maj_f1:.4f}")
print(f"   Accuracy:  {maj_acc:.4f}")

# Method 2: Confidence-Weighted Voting
weights = np.array([conf_task, conf_embedding, conf_zero_shot])
# TODO: Add Model 4 weights if available
# weights = np.array([conf_task, conf_embedding, conf_zero_shot, conf_alt])

weighted_votes = np.zeros((len(test_subset), 2))
for i in range(len(test_subset)):
    for model_idx in range(len(models)):
        vote = ensemble_votes[model_idx, i]
        weight = weights[model_idx, i]
        weighted_votes[i, vote] += weight

pred_weighted = np.argmax(weighted_votes, axis=1)

weight_f1 = f1_score(test_subset["label"], pred_weighted, average='weighted')
weight_acc = accuracy_score(test_subset["label"], pred_weighted)

print(f"\n2. Confidence-Weighted Voting:")
print(f"   F1 Score:  {weight_f1:.4f}")
print(f"   Accuracy:  {weight_acc:.4f}")

# TODO: Uncomment to implement Method 3: Performance-Weighted Voting
# Method 3: Weight models by their F1 scores
# print(f"\n3. Performance-Weighted Voting:")
# model_weights = np.array(individual_scores)  # Use F1 scores as weights
# model_weights = model_weights / model_weights.sum()  # Normalize
#
# perf_weighted_votes = np.zeros((len(test_subset), 2))
# for i in range(len(test_subset)):
#     for model_idx in range(len(models)):
#         vote = ensemble_votes[model_idx, i]
#         weight = model_weights[model_idx]
#         perf_weighted_votes[i, vote] += weight
#
# pred_perf_weighted = np.argmax(perf_weighted_votes, axis=1)
# perf_f1 = f1_score(test_subset["label"], pred_perf_weighted, average='weighted')
# perf_acc = accuracy_score(test_subset["label"], pred_perf_weighted)
# print(f"   F1 Score:  {perf_f1:.4f}")
# print(f"   Accuracy:  {perf_acc:.4f}")

# Comparison Table
print("\n" + "="*80)
print("PERFORMANCE COMPARISON")
print("="*80)

results = [
    ("Task-Specific (Model 1)", individual_scores[0]),
    ("Embedding (Model 2)", individual_scores[1]),
    ("Zero-Shot (Model 3)", individual_scores[2]),
    ("─" * 30, None),
    ("Ensemble: Majority Vote", maj_f1),
    ("Ensemble: Confidence-Weighted", weight_f1),
]

# TODO: Add Model 4 and performance-weighted if implemented
# results.insert(3, ("Alternative Embedding (Model 4)", individual_scores[3]))
# results.append(("Ensemble: Performance-Weighted", perf_f1))

best_individual = max(individual_scores)

print(f"\n{'Method':<35s} {'F1 Score':<12s} {'vs Best Individual':<20s}")
print("-"*70)
for name, score in results:
    if score is None:
        print(name)
    else:
        diff = score - best_individual
        improvement = "✓" if diff > 0.001 else ""
        print(f"{name:<35s} {score:.4f}       {diff:+.4f}  {improvement}")

# Analyze disagreements
print("\n" + "="*80)
print("ANALYZING MODEL DISAGREEMENTS")
print("="*80)

disagreements = []
unanimous_correct = 0
unanimous_wrong = 0

for i in range(len(test_subset)):
    votes = ensemble_votes[:, i]
    unique_votes = len(set(votes))

    if unique_votes > 1:  # Disagreement
        disagreements.append({
            'index': i,
            'text': test_subset["text"][i],
            'true': test_subset["label"][i],
            'votes': votes,
            'ensemble': pred_majority[i],
            'models': [models[j][0] for j in range(len(models))]
        })
    else:  # Unanimous
        if votes[0] == test_subset["label"][i]:
            unanimous_correct += 1
        else:
            unanimous_wrong += 1

print(f"\nVoting Patterns:")
print(f"  Unanimous Correct: {unanimous_correct} ({100*unanimous_correct/len(test_subset):.1f}%)")
print(f"  Unanimous Wrong:   {unanimous_wrong} ({100*unanimous_wrong/len(test_subset):.1f}%)")
print(f"  Disagreements:     {len(disagreements)} ({100*len(disagreements)/len(test_subset):.1f}%)")

print(f"\n" + "-"*80)
print(f"Examples of Disagreements (first 5):")
print("-"*80)

for i, case in enumerate(disagreements[:5]):
    true_label = "Positive" if case['true'] == 1 else "Negative"
    ensemble_label = "Positive" if case['ensemble'] == 1 else "Negative"
    ensemble_correct = "✓" if case['ensemble'] == case['true'] else "✗"

    print(f"\n{i+1}. '{case['text'][:60]}...'")
    print(f"   True: {true_label}")

    for j, model_name in enumerate(case['models']):
        vote_label = "Positive" if case['votes'][j] == 1 else "Negative"
        vote_correct = "✓" if case['votes'][j] == case['true'] else "✗"
        print(f"   {model_name:20s}: {vote_label:8s} {vote_correct}")

    print(f"   Ensemble Decision:   {ensemble_label:8s} {ensemble_correct}")

BUILDING ENSEMBLE OF CLASSIFIERS

[1/3] Loading Task-Specific Model (Twitter RoBERTa)...


Device set to use cpu


[2/3] Training Embedding Classifier...



[3/3] Setting up Zero-Shot Classifier...

✓ All models ready

GENERATING PREDICTIONS

Model 1: Task-Specific...
Model 2: Embedding Classifier...
Model 3: Zero-Shot...

INDIVIDUAL MODEL PERFORMANCE

Task-Specific:
  F1 Score:  0.7709
  Accuracy:  0.7733

Embedding + LR:
  F1 Score:  0.8699
  Accuracy:  0.8700

Zero-Shot:
  F1 Score:  0.8255
  Accuracy:  0.8267

ENSEMBLE METHODS

1. Simple Majority Voting:
   F1 Score:  0.8667
   Accuracy:  0.8667

2. Confidence-Weighted Voting:
   F1 Score:  0.8632
   Accuracy:  0.8633

PERFORMANCE COMPARISON

Method                              F1 Score     vs Best Individual  
----------------------------------------------------------------------
Task-Specific (Model 1)             0.7709       -0.0990  
Embedding (Model 2)                 0.8699       +0.0000  
Zero-Shot (Model 3)                 0.8255       -0.0444  
──────────────────────────────
Ensemble: Majority Vote             0.8667       -0.0032  
Ensemble: Confidence-Weighted       0.8632 

### Questions

1. Did the ensemble beat the best individual model?

2. When models disagree, which one is usually correct?

3. Compare majority voting vs confidence-weighted voting. Which performed better?
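
When comparing the two voting schemes, keep in mind that the zero-shot "confidence" is a cosine similarity while the other two are probabilities. One way to put them on a common scale is soft voting: convert every model's scores to a probability distribution and average. The sketch below uses made-up numbers; the `softmax` helper and the `temperature=0.05` value are assumptions that would need tuning on validation data.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    """Row-wise softmax; a lower temperature gives sharper distributions."""
    z = scores / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Made-up outputs for 3 test items, columns = [negative, positive]
probs_a = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])   # already probabilities
probs_b = np.array([[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]])   # already probabilities
cos_sim = np.array([[0.30, 0.25], [0.20, 0.28], [0.22, 0.35]])  # raw similarities

# Map similarities to pseudo-probabilities before mixing them with the others
probs_c = softmax(cos_sim, temperature=0.05)

# Soft voting: average the per-class probabilities, then pick the larger one
avg_probs = (probs_a + probs_b + probs_c) / 3
pred_soft = avg_probs.argmax(axis=1)

print(pred_soft)  # [0 1 1]
```

Unlike hard voting, this lets a single very confident model outweigh two lukewarm ones, which can help or hurt depending on how well calibrated each model is.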


#### Hard Task 4: Cross-Domain Transfer Learning

Train on movie reviews, test on restaurant/product/book reviews. See which domains transfer well.


In [38]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report
from datasets import load_dataset
import numpy as np

# Source domain: Movie reviews
movie_data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

movie_train = movie_data["train"].shuffle(seed=42).select(range(2000))
movie_test = movie_data["test"].shuffle(seed=42).select(range(200))

# Target domains with labeled examples (1 = positive, 0 = negative).
# The two lukewarm examples at the end of each list are labeled 0.
restaurant_reviews = {
    'text': [
        "Amazing food and excellent service!",
        "Best restaurant in town, highly recommend",
        "Delicious meals and great atmosphere",
        "Outstanding cuisine and friendly staff",
        "Terrible food, very disappointing",
        "Awful service and poor quality",
        "Not worth the money, mediocre at best",
        "Disgusting food and rude waiters",
        "The pasta was okay but nothing special",
        "Decent place for a quick meal"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

product_reviews = {
    'text': [
        "This product is amazing! Works perfectly",
        "Excellent quality, very satisfied",
        "Great value for money, highly recommend",
        "Perfect! Exactly what I needed",
        "Terrible product, broke immediately",
        "Waste of money, very poor quality",
        "Doesn't work as advertised, disappointed",
        "Awful, don't buy this",
        "It's okay, does the job",
        "Average product, nothing special"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

book_reviews = {
    'text': [
        "Brilliant book! Couldn't put it down",
        "Masterfully written, highly engaging",
        "One of the best books I've read",
        "Fantastic story and great characters",
        "Boring and poorly written",
        "Terrible book, waste of time",
        "Disappointing, not worth reading",
        "Awful plot and weak characters",
        "Decent read but nothing groundbreaking",
        "It was fine, not great not terrible"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

# TODO: Add your own domain - try something different!
# YOUR_DOMAIN_reviews = {
#     'text': [
#         "Positive example 1",
#         "Positive example 2",
#         "Positive example 3",
#         "Positive example 4",
#         "Negative example 1",
#         "Negative example 2",
#         "Negative example 3",
#         "Negative example 4",
#         "Neutral example 1",
#         "Neutral example 2",
#     ],
#     'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# }

print("="*80)
print("CROSS-DOMAIN TRANSFER LEARNING EXPERIMENT")
print("="*80)

# Train on source domain (movies)
print("\nTraining classifier on MOVIE REVIEWS (source domain)...")
train_embeddings = model.encode(movie_train["text"], show_progress_bar=True)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, movie_train["label"])

# Test on source domain (baseline)
print("\n" + "-"*80)
print("BASELINE: Performance on Source Domain (Movies)")
print("-"*80)

movie_test_embeddings = model.encode(movie_test["text"], show_progress_bar=False)
movie_test_pred = clf.predict(movie_test_embeddings)
source_f1 = f1_score(movie_test["label"], movie_test_pred, average='weighted')

print(f"Source Domain F1: {source_f1:.4f}")
print("This is how well the classifier does on its training domain")

# Zero-shot transfer to target domains
print("\n" + "="*80)
print("ZERO-SHOT TRANSFER TO TARGET DOMAINS")
print("="*80)

target_domains = {
    "Restaurant Reviews": restaurant_reviews,
    "Product Reviews": product_reviews,
    "Book Reviews": book_reviews,
}

# TODO: Add your domain if created
# target_domains["YOUR DOMAIN"] = YOUR_DOMAIN_reviews

transfer_results = {}

for domain_name, domain_data in target_domains.items():
    print(f"\n{domain_name}:")
    print("-" * 60)

    # Test without adaptation
    domain_embeddings = model.encode(domain_data['text'])
    domain_pred = clf.predict(domain_embeddings)

    domain_f1 = f1_score(domain_data['label'], domain_pred, average='weighted')
    domain_acc = accuracy_score(domain_data['label'], domain_pred)

    print(f"F1 Score: {domain_f1:.4f}")
    print(f"Accuracy: {domain_acc:.4f}")
    print(f"Performance Drop: {source_f1 - domain_f1:.4f} ({(source_f1-domain_f1)/source_f1*100:.1f}%)")

    # Show some predictions
    print(f"\nExample predictions:")
    for i in range(3):
        true_label = "Positive" if domain_data['label'][i] == 1 else "Negative"
        pred_label = "Positive" if domain_pred[i] == 1 else "Negative"
        correct = "✓" if domain_pred[i] == domain_data['label'][i] else "✗"

        print(f"  '{domain_data['text'][i][:50]}...'")
        print(f"  True: {true_label} | Pred: {pred_label} {correct}")

    transfer_results[domain_name] = {
        'zero_shot_f1': domain_f1,
        'zero_shot_acc': domain_acc,
        'embeddings': domain_embeddings,
        'predictions': domain_pred
    }

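# Few-shot domain adaptation
print("\n" + "="*80)
print("FEW-SHOT DOMAIN ADAPTATION")
print("="*80)
print("Strategy: Add 4 examples (2 positive, 2 negative) from each target domain to training set")

# The first four examples in every domain are all positive, so slicing [:4]
# would adapt on one class and test only on the other. Take two per class instead.
# This assumes each domain follows the layout above: 4 positive, then 6 negative.
adapt_idx = [0, 1, 4, 5]
adaptation_size = len(adapt_idx)

for domain_name, domain_data in target_domains.items():
    print(f"\n{domain_name}:")
    print("-" * 60)

    # Split domain data into adaptation examples and held-out test examples
    adapt_texts = [domain_data['text'][i] for i in adapt_idx]
    adapt_labels = [domain_data['label'][i] for i in adapt_idx]

    test_idx = [i for i in range(len(domain_data['text'])) if i not in adapt_idx]
    test_texts = [domain_data['text'][i] for i in test_idx]
    test_labels = [domain_data['label'][i] for i in test_idx]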

    # Combine source + adaptation examples
    adapt_embeddings = model.encode(adapt_texts)
    combined_embeddings = np.vstack([train_embeddings, adapt_embeddings])
    combined_labels = list(movie_train["label"]) + adapt_labels

    # Retrain
    clf_adapted = LogisticRegression(random_state=42, max_iter=1000)
    clf_adapted.fit(combined_embeddings, combined_labels)

    # Test
    test_embeddings = model.encode(test_texts)
    adapted_pred = clf_adapted.predict(test_embeddings)
    adapted_f1 = f1_score(test_labels, adapted_pred, average='weighted')

    zero_shot_f1 = transfer_results[domain_name]['zero_shot_f1']
    improvement = adapted_f1 - zero_shot_f1

    print(f"Zero-shot F1:  {zero_shot_f1:.4f}")
    print(f"Adapted F1:    {adapted_f1:.4f}")
    print(f"Improvement:   {improvement:+.4f}")

    if improvement > 0.05:
        print("Clear improvement: domain adaptation helped")
    elif improvement > 0:
        print("Slight improvement from adaptation")
    else:
        print("No improvement or slight degradation")

    transfer_results[domain_name]['adapted_f1'] = adapted_f1
    transfer_results[domain_name]['improvement'] = improvement

# Analyze domain similarity
print("\n" + "="*80)
print("DOMAIN SIMILARITY ANALYSIS")
print("="*80)

# Calculate domain centroids (average embedding)
source_centroid = np.mean(train_embeddings, axis=0)

print("\nDomain distances from source (movie reviews):")
print("-"*60)

domain_distances = []
for domain_name, domain_data in target_domains.items():
    domain_embeddings = transfer_results[domain_name]['embeddings']
    domain_centroid = np.mean(domain_embeddings, axis=0)

    distance = np.linalg.norm(source_centroid - domain_centroid)
    zero_shot_f1 = transfer_results[domain_name]['zero_shot_f1']
    drop = source_f1 - zero_shot_f1

    domain_distances.append((domain_name, distance, drop))

    print(f"\n{domain_name}:")
    print(f"  Embedding distance: {distance:.4f}")
    print(f"  Performance drop:   {drop:.4f}")
    print(f"  Zero-shot F1:       {zero_shot_f1:.4f}")

    if 'improvement' in transfer_results[domain_name]:
        improvement = transfer_results[domain_name]['improvement']
        print(f"  Adaptation gain:    {improvement:+.4f}")

# Correlation analysis
print("\n" + "-"*80)
print("Correlation: Distance vs Performance")
print("-"*80)

domain_distances.sort(key=lambda x: x[1])
print("\nRanked by distance to source:")
for name, dist, drop in domain_distances:
    print(f"  {name:20s}: distance={dist:.3f}, drop={drop:.3f}")

print("\nCheck against the ranking above:")
print("   Do domains closer to movies in embedding space transfer better?")
print("   With only 10 examples per domain, treat any trend as a hint, not a result")

# Summary table
print("\n" + "="*80)
print("TRANSFER LEARNING SUMMARY")
print("="*80)

print(f"\n{'Domain':<20s} {'Zero-Shot F1':<15s} {'Adapted F1':<15s} {'Improvement':<12s}")
print("-"*65)
for domain_name in target_domains.keys():
    zero_f1 = transfer_results[domain_name]['zero_shot_f1']
    adapted_f1 = transfer_results[domain_name].get('adapted_f1', 0)
    improvement = transfer_results[domain_name].get('improvement', 0)

    marker = "✓" if improvement > 0.05 else ""
    print(f"{domain_name:<20s} {zero_f1:.4f}          {adapted_f1:.4f}          {improvement:+.4f}     {marker}")

print(f"\nSource (Movies):      {source_f1:.4f}          N/A             N/A")

CROSS-DOMAIN TRANSFER LEARNING EXPERIMENT

Training classifier on MOVIE REVIEWS (source domain)...


Batches:   0%|          | 0/63 [00:00<?, ?it/s]


--------------------------------------------------------------------------------
BASELINE: Performance on Source Domain (Movies)
--------------------------------------------------------------------------------
Source Domain F1: 0.8497
This is how well the classifier does on its training domain

ZERO-SHOT TRANSFER TO TARGET DOMAINS

Restaurant Reviews:
------------------------------------------------------------
F1 Score: 0.9010
Accuracy: 0.9000
Performance Drop: -0.0513 (-6.0%)

Example predictions:
  'Amazing food and excellent service!...'
  True: Positive | Pred: Positive ✓
  'Best restaurant in town, highly recommend...'
  True: Positive | Pred: Positive ✓
  'Delicious meals and great atmosphere...'
  True: Positive | Pred: Positive ✓

Product Reviews:
------------------------------------------------------------
F1 Score: 1.0000
Accuracy: 1.0000
Performance Drop: -0.1503 (-17.7%)

Example predictions:
  'This product is amazing! Works perfectly...'
  True: Positive | Pred: Positiv

### Questions

1. Which domain transferred best from movies? Which worst?

2. Do you see patterns in what transfers well vs what fails?

3. After few-shot adaptation: Which domains benefited most from just 4 examples?
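
When answering question 3, it helps to make sure the handful of adaptation examples covers both classes; otherwise the classifier adapts on one label and is scored on the other. The sketch below is one possible way to do that, not part of the original exercise: `stratified_few_shot_split` and its `k_per_class` parameter are hypothetical names introduced here, and the toy texts are placeholders.

```python
import random
from collections import defaultdict


def stratified_few_shot_split(texts, labels, k_per_class=2, seed=42):
    """Pick k_per_class examples of every label for adaptation; keep the rest for testing.

    Hypothetical helper for illustration. Returns
    (adapt_texts, adapt_labels, test_texts, test_labels).
    """
    # Group example indices by their label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    adapt_idx = set()
    for label, indices in by_label.items():
        # Require at least one example of each class to remain for testing
        if len(indices) <= k_per_class:
            raise ValueError(f"Label {label!r} has too few examples to hold any out")
        adapt_idx.update(rng.sample(indices, k_per_class))

    adapt = sorted(adapt_idx)
    test = [i for i in range(len(labels)) if i not in adapt_idx]
    return (
        [texts[i] for i in adapt],
        [labels[i] for i in adapt],
        [texts[i] for i in test],
        [labels[i] for i in test],
    )


# Toy domain with the same layout as the dictionaries above: 4 positive, 6 negative
toy_texts = [f"positive example {i}" for i in range(4)] + [
    f"negative example {i}" for i in range(6)
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

adapt_texts, adapt_labels, test_texts, test_labels = stratified_few_shot_split(
    toy_texts, toy_labels, k_per_class=2
)
print("Adaptation labels:", sorted(adapt_labels))
print("Held-out labels:  ", sorted(test_labels))
```

With `k_per_class=2` this still uses 4 adaptation examples per domain, so the results stay comparable to the question as posed, while the held-out set keeps examples of both classes.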
