# Chapter 4: Text Classification - Hard Tasks

This notebook tackles advanced classification challenges. You'll implement hierarchical multi-level classifiers for complex taxonomies, use active learning to minimize labeling costs, build ensemble classifiers to improve robustness, and apply transfer learning across domains. These techniques are essential for production-level NLP systems.


---

## Setup

Run all cells in this section to set up the environment and load necessary data.


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter:

---

**NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

### Data Loading


In [2]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

### Helper Functions


In [6]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

## Your Turn - Text Classification Experiments

Run each task first to see the baseline results. Follow the instructions to modify and experiment.

This section is divided into EASY, MEDIUM, & HARD.

---

## Hard Tasks


### Hard Tasks - Advanced Classification Challenges

These tasks require significant modifications and deeper understanding. Take your time and experiment.

#### Hard Task 1: Hierarchical Multi-Level Classifier

Instead of flat classification (choosing from all categories at once), hierarchical classification makes decisions in steps: first broad categories, then fine-grained ones. This mirrors how humans often reason.

**What to do:**
1. Run the 2-level classifier (Sentiment → Specific Aspect)
2. Compare with flat classification to see confidence differences
3. Try adding a 3rd level by uncommenting the code
4. Analyze whether breaking decisions into steps helps or hurts

In [None]:
# Level 1: Broad sentiment
level1_labels = [
    "negative sentiment review",
    "positive sentiment review"
]

# Level 2: Specific aspects (conditional on Level 1)
level2_negative = [
    "review criticizing entertainment value and pacing",
    "review criticizing technical quality and production"
]

level2_positive = [
    "review praising technical quality and artistry",
    "review praising entertainment value and enjoyment"
]

# TODO: Add Level 3 for even finer granularity
# level3_positive_quality = [...]
# level3_positive_entertainment = [...]

Classification function:
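The cell that defines `hierarchical_classify_2level` is not shown here. Below is a minimal sketch of what such a function might look like. It assumes labels are scored by cosine similarity between sentence embeddings (as the flat comparison later in this task does) and that the result uses the dictionary keys read by the printing code (`level1_label`, `level1_conf`, `level2_label`, `level2_conf`, `path`). The `encode` parameter and the `LEVEL1`/`LEVEL2` names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Labels copied from the cell above so that this sketch is self-contained.
LEVEL1 = ["negative sentiment review", "positive sentiment review"]
LEVEL2 = {
    0: ["review criticizing entertainment value and pacing",
        "review criticizing technical quality and production"],
    1: ["review praising technical quality and artistry",
        "review praising entertainment value and enjoyment"],
}

def _best_label(text_vec, label_vecs):
    """Return (index, cosine similarity) of the label closest to the text."""
    text_vec = text_vec / (np.linalg.norm(text_vec) + 1e-12)
    label_vecs = label_vecs / (np.linalg.norm(label_vecs, axis=1, keepdims=True) + 1e-12)
    sims = label_vecs @ text_vec
    idx = int(np.argmax(sims))
    return idx, float(sims[idx])

def hierarchical_classify_2level(review, encode=None):
    """Pick a sentiment first, then an aspect among that sentiment's children.

    `encode` maps a list of strings to a 2-D array of embeddings. If it is
    omitted, the sketch assumes a global SentenceTransformer named `model`.
    """
    if encode is None:
        encode = model.encode  # assumption: `model` was created in an earlier cell
    text_vec = np.asarray(encode([review]), dtype=float)[0]

    # Level 1: broad sentiment
    l1_idx, l1_conf = _best_label(text_vec, np.asarray(encode(LEVEL1), dtype=float))

    # Level 2: only the aspects that belong to the chosen sentiment
    children = LEVEL2[l1_idx]
    l2_idx, l2_conf = _best_label(text_vec, np.asarray(encode(children), dtype=float))

    return {
        "level1_label": LEVEL1[l1_idx],
        "level1_conf": l1_conf,
        "level2_label": children[l2_idx],
        "level2_conf": l2_conf,
        "path": f"{LEVEL1[l1_idx]} -> {children[l2_idx]}",
    }
```

In the notebook, calling `hierarchical_classify_2level(review)` with no `encode` argument would use `model.encode`. The cells below also expect a list of strings named `test_reviews`, which has to be defined separately.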



Classify the reviews:


In [None]:
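import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assumes `model` (a SentenceTransformer), `test_reviews` (a list of review strings),
# and `hierarchical_classify_2level` are already defined in earlier cells.
print("="*80)
print("HIERARCHICAL CLASSIFICATION (2 LEVELS)")
print("="*80)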

for i, review in enumerate(test_reviews):
    result = hierarchical_classify_2level(review)

    print(f"\nReview {i+1}: '{review}'")
    print(f"\n  Level 1 (Sentiment):")
    print(f"     {result['level1_label']}")
    print(f"     Confidence: {result['level1_conf']:.3f}")

    print(f"\n  Level 2 (Specific Aspect):")
    print(f"     {result['level2_label']}")
    print(f"     Confidence: {result['level2_conf']:.3f}")

    print(f"\n  Final Classification Path:")
    print(f"     {result['path']}")
    print("-"*80)

# Compare with flat classification
print("\n" + "="*80)
print("COMPARISON: Hierarchical vs Flat Classification")
print("="*80)

# Flat: All 4 categories at once
flat_labels = [
    "review criticizing entertainment value and pacing",      # 0
    "review criticizing technical quality and production",    # 1
    "review praising technical quality and artistry",         # 2
    "review praising entertainment value and enjoyment"       # 3
]

flat_embeddings = model.encode(flat_labels)
review_embeddings = model.encode(test_reviews)
flat_sim = cosine_similarity(review_embeddings, flat_embeddings)

print("\nShowing first 3 reviews:")
for i in range(min(3, len(test_reviews))):
    hier_result = hierarchical_classify_2level(test_reviews[i])
    flat_pred = np.argmax(flat_sim[i])
    flat_conf = flat_sim[i][flat_pred]

    print(f"\nReview: '{test_reviews[i][:50]}...'")
    print(f"  Hierarchical: {hier_result['level2_label']}")
    print(f"     Confidence: {hier_result['level2_conf']:.3f}")
    print(f"  Flat:         {flat_labels[flat_pred]}")
    print(f"     Confidence: {flat_conf:.3f}")
    print(f"  Confidence Diff: {hier_result['level2_conf'] - flat_conf:+.3f}")

# TODO: After implementing 3-level, uncomment to test it:
# print("\n" + "="*80)
# print("TESTING 3-LEVEL HIERARCHICAL CLASSIFICATION")
# print("="*80)
#
# for i, review in enumerate(test_reviews):
#     result = hierarchical_classify_3level(review)
#     print(f"\n{i+1}. '{review[:60]}...'")
#     print(f"   Path: {result['path']}")
#     print(f"   Level 3 confidence: {result['level3_conf']:.3f}")

**As you can see,** the hierarchical approach makes decisions in stages:
1. **Level 1:** Determines if the review is positive or negative
2. **Level 2:** Based on that sentiment, classifies the specific aspect

Notice the **confidence scores** at each level. Each step chooses between only two options, so every individual decision is simpler than picking one of four labels at once. Whether the final score comes out higher than the flat score depends on how confidence is computed: if both approaches report raw cosine similarity, the best of all four flat labels can never score lower than the best of a two-label subset, so a hierarchical advantage only appears when scores are normalized over the candidates at each level.

**Looking at the comparison** between hierarchical and flat approaches, you may notice:
- Hierarchical makes easier per-step decisions, which can yield higher confidence when scores are normalized per level
- But if Level 1 is wrong, Level 2 has no chance to correct it
- Flat classification considers all options at once but may be less confident
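How the confidence score is computed matters when comparing the two approaches. The sketch below uses made-up similarity values and an assumed softmax temperature of 0.1 (neither comes from the notebook) to contrast raw cosine similarity with a softmax-normalized score.

```python
import numpy as np

def softmax(scores, temperature=0.1):
    """Turn raw similarity scores into probabilities that sum to 1."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Made-up cosine similarities between one review and the four flat labels
# (order: neg/entertainment, neg/technical, pos/technical, pos/entertainment).
flat_sims = np.array([0.10, 0.15, 0.40, 0.35])
positive_children = flat_sims[2:]  # what Level 2 sees after Level 1 picks "positive"

# Raw cosine similarity: the flat maximum is never below the subset maximum.
raw_flat = flat_sims.max()
raw_hier = positive_children.max()

# Softmax-normalized: the winning label competes with fewer alternatives in
# the hierarchical case, so its share of the probability mass is larger.
norm_flat = softmax(flat_sims).max()
norm_hier = softmax(positive_children).max()

print(f"raw:        flat={raw_flat:.3f}  hierarchical={raw_hier:.3f}")
print(f"normalized: flat={norm_flat:.3f}  hierarchical={norm_hier:.3f}")
```

A higher normalized score at Level 2 reflects the smaller candidate set, not necessarily a more accurate prediction, so it is worth checking accuracy alongside confidence.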

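#### Hard Task 2: Active Learning to Minimize Labeling Costs

Labeling data is expensive. Active learning strategically selects the most informative samples to label, potentially saving 50% or more of the labeling effort compared to random selection. This task compares active learning (picking uncertain samples) against random selection.

**What to do:**
1. Run the simulation to see active learning vs random sampling
2. Observe which samples the model finds "uncertain"
3. Track how many labeled samples each approach needs to reach F1 = 0.85
4. Try implementing the alternative selection strategy (margin sampling) and compare the labeling cost savings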


**As you can see,** active learning iteratively:
1. Trains on currently labeled data
2. Finds the most uncertain unlabeled samples  
3. "Labels" those samples (adds them to training set)
4. Repeats

Notice the **uncertain samples** being selected - they typically have prediction probabilities close to 50-50. These are the most informative because they lie near the decision boundary.

**Looking at the learning curves,** compare how quickly each approach improves. Active learning often reaches target performance with fewer labeled samples, saving significant labeling costs.
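The selection step can be sketched as follows. This is a minimal illustration, assuming class probabilities come from something like `clf.predict_proba` on the unlabeled pool; the function names and example probabilities are made up for this sketch. It shows least-confidence sampling next to margin sampling.

```python
import numpy as np

def least_confidence(proba):
    """Uncertainty = 1 - probability of the predicted class (higher = more uncertain)."""
    return 1.0 - np.max(proba, axis=1)

def margin(proba):
    """Gap between the two most likely classes (smaller = more uncertain)."""
    top_two = np.sort(proba, axis=1)[:, -2:]
    return top_two[:, 1] - top_two[:, 0]

def select_queries(proba, k, strategy="least_confidence"):
    """Return the indices of the k unlabeled samples to send for labeling."""
    proba = np.asarray(proba, dtype=float)
    if strategy == "least_confidence":
        order = np.argsort(-least_confidence(proba), kind="stable")
    elif strategy == "margin":
        order = np.argsort(margin(proba), kind="stable")
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return order[:k].tolist()

# Made-up class probabilities for four unlabeled reviews.
proba = np.array([[0.90, 0.10], [0.55, 0.45], [0.50, 0.50], [0.20, 0.80]])
print(select_queries(proba, k=2))
print(select_queries(proba, k=2, strategy="margin"))
```

With only two classes, both strategies rank samples identically, because each is a monotonic function of the predicted-class probability. They start to differ when there are three or more classes.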


#### Hard Task 3: Ensemble Classifier for Improved Robustness

**Ensemble methods** combine multiple models to reduce individual biases and improve reliability. The wisdom of crowds principle: multiple imperfect models together often beat any single model.

**What to do:**
1. Run to see 3 individual models compared to ensemble methods
2. Compare simple majority voting vs confidence-weighted voting
3. Examine disagreement cases - when models disagree, which is usually right?
4. Optionally add a 4th model and performance-weighted voting
5. Determine if the ensemble beats the best individual model
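The two voting schemes can be sketched as below. This is a minimal illustration with made-up probabilities; "confidence-weighted voting" is interpreted here as summing each model's class probabilities (soft voting), which is an assumption about what the task intends.

```python
import numpy as np

def majority_vote(preds):
    """preds: (n_models, n_samples) integer labels -> most common label per sample."""
    preds = np.asarray(preds)
    n_classes = int(preds.max()) + 1
    # counts[i, c] = number of models that predicted class c for sample i
    counts = np.stack([(preds == c).sum(axis=0) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)  # ties go to the lower class index

def confidence_weighted_vote(probas):
    """probas: (n_models, n_samples, n_classes) -> label with the highest summed probability."""
    return np.asarray(probas, dtype=float).sum(axis=0).argmax(axis=1)

# Made-up probabilities from three models for two reviews.
probas = np.array([
    [[0.1, 0.9], [0.8, 0.2]],   # model A: very sure review 0 is positive
    [[0.6, 0.4], [0.7, 0.3]],   # model B: mildly negative on both
    [[0.6, 0.4], [0.4, 0.6]],   # model C: mildly negative, then mildly positive
])
hard_preds = probas.argmax(axis=2)  # each model's own predicted label

print("majority:           ", majority_vote(hard_preds))
print("confidence-weighted:", confidence_weighted_vote(probas))
```

For the first review the two schemes disagree: two models lean mildly negative, but one model is strongly positive, and summing probabilities lets that confidence outweigh the head count.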

Looking at the voting patterns, most samples have unanimous agreement (all models concur) - these are easy cases. The interesting cases are disagreements where models learned different patterns. Notice the ensemble performance: sometimes it beats all individual models by leveraging their diverse strengths.

#### Hard Task 4: Cross-Domain Transfer Learning

Can a classifier trained on movie reviews work on restaurant, product, or book reviews?

**What to do:**
1. Observe zero-shot transfer (no adaptation) performance on each domain
2. See which domains transfer well and which don't
3. Try few-shot adaptation (adding just 4 examples from the target domain)
4. Analyze domain similarity using embedding distances
5. Optionally add your own custom domain to test

**What to look for:**

**Zero-shot transfer results** show how well the model generalizes without any target-domain examples. Domains closer to movies (like books) often transfer better than very different domains. In the run shown below, the restaurant and product sets actually score higher than the movie baseline (F1 of 0.9010 and 1.0000 vs 0.8497). Each target set contains only 10 short reviews, so these scores are noisy and not directly comparable with the 200-review movie test set.

**Few-shot adaptation** shows the effect of adding just a handful of target-domain examples. Even 4 labeled samples can improve performance, although on evaluation sets this small the difference may not be meaningful.

**Domain similarity analysis** compares embedding distance with transfer performance. Domains with a smaller embedding distance from movies tend to have better zero-shot performance, which can help you predict which domains will transfer well before running experiments. With only three target domains, treat any apparent correlation as suggestive.
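The domain-similarity step can be sketched in isolation. The example below assumes the centroid distance is the Euclidean distance between mean embeddings (as in the cell that follows) and adds a Spearman rank correlation as one possible way to quantify "distance vs performance drop". The helper names and all numbers are made up for illustration.

```python
import numpy as np

def centroid_distance(source_emb, target_emb):
    """Euclidean distance between the mean embeddings of two domains."""
    source_centroid = np.mean(source_emb, axis=0)
    target_centroid = np.mean(target_emb, axis=0)
    return float(np.linalg.norm(source_centroid - target_centroid))

def spearman_rank_corr(x, y):
    """Spearman correlation for values without ties: Pearson correlation of the ranks."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Made-up 2-D "embeddings" for a source domain and one target domain.
source = np.array([[0.0, 0.0], [2.0, 0.0]])
target = np.array([[1.0, 3.0], [1.0, 5.0]])
print("centroid distance:", centroid_distance(source, target))

# Made-up (distance, performance drop) pairs for three hypothetical domains.
distances = [0.10, 0.50, 0.30]
drops = [0.00, 0.20, 0.10]
print("rank correlation:", spearman_rank_corr(distances, drops))
```

A rank correlation close to +1 would mean that more distant domains lose more performance. With only three domains the statistic can take very few distinct values, so it is a sanity check at best.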

In [38]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
from datasets import load_dataset
import numpy as np

# Source domain: Movie reviews
movie_data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

movie_train = movie_data["train"].shuffle(seed=42).select(range(2000))
movie_test = movie_data["test"].shuffle(seed=42).select(range(200))

# Target domains with labeled examples
restaurant_reviews = {
    'text': [
        "Amazing food and excellent service!",
        "Best restaurant in town, highly recommend",
        "Delicious meals and great atmosphere",
        "Outstanding cuisine and friendly staff",
        "Terrible food, very disappointing",
        "Awful service and poor quality",
        "Not worth the money, mediocre at best",
        "Disgusting food and rude waiters",
        "The pasta was okay but nothing special",
        "Decent place for a quick meal"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

product_reviews = {
    'text': [
        "This product is amazing! Works perfectly",
        "Excellent quality, very satisfied",
        "Great value for money, highly recommend",
        "Perfect! Exactly what I needed",
        "Terrible product, broke immediately",
        "Waste of money, very poor quality",
        "Doesn't work as advertised, disappointed",
        "Awful, don't buy this",
        "It's okay, does the job",
        "Average product, nothing special"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

book_reviews = {
    'text': [
        "Brilliant book! Couldn't put it down",
        "Masterfully written, highly engaging",
        "One of the best books I've read",
        "Fantastic story and great characters",
        "Boring and poorly written",
        "Terrible book, waste of time",
        "Disappointing, not worth reading",
        "Awful plot and weak characters",
        "Decent read but nothing groundbreaking",
        "It was fine, not great not terrible"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

# TODO: Add your own domain - try something different!
# YOUR_DOMAIN_reviews = {
#     'text': [
#         "Positive example 1",
#         "Positive example 2",
#         "Positive example 3",
#         "Positive example 4",
#         "Negative example 1",
#         "Negative example 2",
#         "Negative example 3",
#         "Negative example 4",
#         "Neutral example 1",
#         "Neutral example 2",
#     ],
#     'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# }

print("="*80)
print("CROSS-DOMAIN TRANSFER LEARNING EXPERIMENT")
print("="*80)

# Train on source domain (movies)
print("\nTraining classifier on MOVIE REVIEWS (source domain)...")
train_embeddings = model.encode(movie_train["text"], show_progress_bar=True)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, movie_train["label"])

# Test on source domain (baseline)
print("\n" + "-"*80)
print("BASELINE: Performance on Source Domain (Movies)")
print("-"*80)

movie_test_embeddings = model.encode(movie_test["text"], show_progress_bar=False)
movie_test_pred = clf.predict(movie_test_embeddings)
source_f1 = f1_score(movie_test["label"], movie_test_pred, average='weighted')

print(f"Source Domain F1: {source_f1:.4f}")
print("This is how well the classifier does on its training domain")

# Zero-shot transfer to target domains
print("\n" + "="*80)
print("ZERO-SHOT TRANSFER TO TARGET DOMAINS")
print("="*80)

target_domains = {
    "Restaurant Reviews": restaurant_reviews,
    "Product Reviews": product_reviews,
    "Book Reviews": book_reviews,
}

# TODO: Add your domain if created
# target_domains["YOUR DOMAIN"] = YOUR_DOMAIN_reviews

transfer_results = {}

for domain_name, domain_data in target_domains.items():
    print(f"\n{domain_name}:")
    print("-" * 60)

    # Test without adaptation
    domain_embeddings = model.encode(domain_data['text'])
    domain_pred = clf.predict(domain_embeddings)

    domain_f1 = f1_score(domain_data['label'], domain_pred, average='weighted')
    domain_acc = accuracy_score(domain_data['label'], domain_pred)

    print(f"F1 Score: {domain_f1:.4f}")
    print(f"Accuracy: {domain_acc:.4f}")
    print(f"Performance Drop: {source_f1 - domain_f1:.4f} ({(source_f1-domain_f1)/source_f1*100:.1f}%)")

    # Show some predictions
    print(f"\nExample predictions:")
    for i in range(3):
        true_label = "Positive" if domain_data['label'][i] == 1 else "Negative"
        pred_label = "Positive" if domain_pred[i] == 1 else "Negative"
        correct = "✓" if domain_pred[i] == domain_data['label'][i] else "✗"

        print(f"  '{domain_data['text'][i][:50]}...'")
        print(f"  True: {true_label} | Pred: {pred_label} {correct}")

    transfer_results[domain_name] = {
        'zero_shot_f1': domain_f1,
        'zero_shot_acc': domain_acc,
        'embeddings': domain_embeddings,
        'predictions': domain_pred
    }

# Few-shot domain adaptation: retrain with a few labeled target-domain examples
print("\n" + "="*80)
print("FEW-SHOT DOMAIN ADAPTATION")
print("="*80)
print("Strategy: Add 4 examples (2 positive, 2 negative) from each target domain to training set")

# The first 4 examples in each domain are all positive, so slicing [:4] would
# adapt on one class only and test on the other. Pick a balanced set instead.
adapt_idx = [0, 1, 4, 5]
adaptation_size = len(adapt_idx)

for domain_name, domain_data in target_domains.items():
    print(f"\n{domain_name}:")
    print("-" * 60)

    # Split domain data into adaptation examples and held-out test examples
    n_examples = len(domain_data['text'])
    test_idx = [i for i in range(n_examples) if i not in adapt_idx]
    adapt_texts = [domain_data['text'][i] for i in adapt_idx]
    adapt_labels = [domain_data['label'][i] for i in adapt_idx]
    test_texts = [domain_data['text'][i] for i in test_idx]
    test_labels = [domain_data['label'][i] for i in test_idx]

    # Combine source + adaptation examples
    adapt_embeddings = model.encode(adapt_texts)
    combined_embeddings = np.vstack([train_embeddings, adapt_embeddings])
    combined_labels = list(movie_train["label"]) + adapt_labels

    # Retrain
    clf_adapted = LogisticRegression(random_state=42, max_iter=1000)
    clf_adapted.fit(combined_embeddings, combined_labels)

    # Test on the held-out examples
    test_embeddings = model.encode(test_texts)
    adapted_pred = clf_adapted.predict(test_embeddings)
    adapted_f1 = f1_score(test_labels, adapted_pred, average='weighted')

    # Score the zero-shot predictions on the same held-out examples,
    # so both F1 values are computed on identical data
    zero_shot_pred = transfer_results[domain_name]['predictions'][test_idx]
    zero_shot_f1 = f1_score(test_labels, zero_shot_pred, average='weighted')
    improvement = adapted_f1 - zero_shot_f1

    print(f"Zero-shot F1:  {zero_shot_f1:.4f}")
    print(f"Adapted F1:    {adapted_f1:.4f}")
    print(f"Improvement:   {improvement:+.4f}")

    if improvement > 0.05:
        print("Significant improvement! Domain adaptation helped a lot")
    elif improvement > 0:
        print("Slight improvement from adaptation")
    else:
        print("No improvement or slight degradation")

    transfer_results[domain_name]['adapted_f1'] = adapted_f1
    transfer_results[domain_name]['improvement'] = improvement

# Analyze domain similarity
print("\n" + "="*80)
print("DOMAIN SIMILARITY ANALYSIS")
print("="*80)

# Calculate domain centroids (average embedding)
source_centroid = np.mean(train_embeddings, axis=0)

print("\nDomain distances from source (movie reviews):")
print("-"*60)

domain_distances = []
for domain_name, domain_data in target_domains.items():
    domain_embeddings = transfer_results[domain_name]['embeddings']
    domain_centroid = np.mean(domain_embeddings, axis=0)

    distance = np.linalg.norm(source_centroid - domain_centroid)
    zero_shot_f1 = transfer_results[domain_name]['zero_shot_f1']
    drop = source_f1 - zero_shot_f1

    domain_distances.append((domain_name, distance, drop))

    print(f"\n{domain_name}:")
    print(f"  Embedding distance: {distance:.4f}")
    print(f"  Performance drop:   {drop:.4f}")
    print(f"  Zero-shot F1:       {zero_shot_f1:.4f}")

    if 'improvement' in transfer_results[domain_name]:
        improvement = transfer_results[domain_name]['improvement']
        print(f"  Adaptation gain:    {improvement:+.4f}")

# Correlation analysis
print("\n" + "-"*80)
print("Correlation: Distance vs Performance")
print("-"*80)

domain_distances.sort(key=lambda x: x[1])
print("\nRanked by distance to source:")
for name, dist, drop in domain_distances:
    print(f"  {name:20s}: distance={dist:.3f}, drop={drop:.3f}")

print("\nThings to check:")
print("   Do domains closer to movies in embedding space transfer better?")
print("   Does a larger embedding distance go with a larger performance drop?")

# Summary table
print("\n" + "="*80)
print("TRANSFER LEARNING SUMMARY")
print("="*80)

print(f"\n{'Domain':<20s} {'Zero-Shot F1':<15s} {'Adapted F1':<15s} {'Improvement':<12s}")
print("-"*65)
for domain_name in target_domains.keys():
    zero_f1 = transfer_results[domain_name]['zero_shot_f1']
    adapted_f1 = transfer_results[domain_name].get('adapted_f1', 0)
    improvement = transfer_results[domain_name].get('improvement', 0)

    marker = "✓" if improvement > 0.05 else ""
    print(f"{domain_name:<20s} {zero_f1:.4f}          {adapted_f1:.4f}          {improvement:+.4f}     {marker}")

print(f"\nSource (Movies):      {source_f1:.4f}          N/A             N/A")

CROSS-DOMAIN TRANSFER LEARNING EXPERIMENT

Training classifier on MOVIE REVIEWS (source domain)...




--------------------------------------------------------------------------------
BASELINE: Performance on Source Domain (Movies)
--------------------------------------------------------------------------------
Source Domain F1: 0.8497
This is how well the classifier does on its training domain

ZERO-SHOT TRANSFER TO TARGET DOMAINS

Restaurant Reviews:
------------------------------------------------------------
F1 Score: 0.9010
Accuracy: 0.9000
Performance Drop: -0.0513 (-6.0%)

Example predictions:
  'Amazing food and excellent service!...'
  True: Positive | Pred: Positive ✓
  'Best restaurant in town, highly recommend...'
  True: Positive | Pred: Positive ✓
  'Delicious meals and great atmosphere...'
  True: Positive | Pred: Positive ✓

Product Reviews:
------------------------------------------------------------
F1 Score: 1.0000
Accuracy: 1.0000
Performance Drop: -0.1503 (-17.7%)

Example predictions:
  'This product is amazing! Works perfectly...'
  True: Positive | Pred: Positiv

### Questions

1. Which domain transferred best from movies? Which worst?

2. Do you see patterns in what transfers well vs what fails?

3. After few-shot adaptation: Which domains benefited most from just 4 examples?
