# Chapter 4: Text Classification - Hard Tasks

This notebook tackles advanced classification challenges. You'll implement hierarchical multi-level classifiers for complex taxonomies, use active learning to minimize labeling costs, build ensemble classifiers to improve robustness, and apply transfer learning across domains. These techniques are essential for production-level NLP systems.


---

## Setup

Run all cells in this section to set up the environment and load necessary data.


### [OPTIONAL] - Installing Packages on Google Colab


If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter (uncomment it first if it is commented out):

---

**NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

### Data Loading


In [2]:
from datasets import load_dataset
# Load our data
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

### Helper Functions


In [3]:
from sklearn.metrics import classification_report
def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

## Your Turn - Text Classification Experiments

Run each task first to see the baseline results. Follow the instructions to modify and experiment.

This section is divided into Easy, Medium, and Hard tasks. This notebook covers the Hard tasks.

---

## Hard Tasks


### Hard Tasks - Advanced Classification Challenges

These tasks require significant modifications and deeper understanding. Take your time and experiment.

#### Hard Task 1: Hierarchical Multi-Level Classifier
Instead of flat classification (choosing from all categories at once), hierarchical classification makes decisions in steps: first broad categories, then fine-grained ones. This mirrors how humans often reason.

Instructions:
1. Run the 2-level classifier (Sentiment → Specific Aspect)
2. Compare with flat classification to see confidence differences
3. Try adding a 3rd level: fill in the Level 3 label lists, write a `hierarchical_classify_3level` function, then uncomment the test code
4. Analyze whether breaking decisions into steps helps or hurts

In [4]:
# Level 1: Broad sentiment
level1_labels = [
    "negative sentiment review",
    "positive sentiment review"
]
# Level 2: Specific aspects (conditional on Level 1)
level2_negative = [
    "review criticizing entertainment value and pacing",
    "review criticizing technical quality and production"
]
level2_positive = [
    "review praising technical quality and artistry",
    "review praising entertainment value and enjoyment"
]
# TODO: Add Level 3 for even finer granularity
# level3_positive_quality = [...]
# level3_positive_entertainment = [...]

In [5]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

test_reviews = [
    "Amazing cinematography and brilliant direction!",
    "Terrible pacing, very boring throughout",
    "Excellent acting but weak storyline",
    "Poor production quality, disappointing visuals"
]

In [6]:
def hierarchical_classify_2level(review):
    """
    Classify review using 2-level hierarchy:
    Level 1: Sentiment (positive/negative)
    Level 2: Specific aspect (quality/entertainment)
    """
    # Level 1: Determine sentiment
    level1_emb = model.encode(level1_labels)
    review_emb = model.encode([review])
    level1_sim = cosine_similarity(review_emb, level1_emb)[0]

    level1_pred = np.argmax(level1_sim)
    level1_conf = level1_sim[level1_pred]
    level1_label = level1_labels[level1_pred]

    # Level 2: Conditional on Level 1
    if level1_pred == 0:  # Negative
        level2_labels = level2_negative
        path = "Negative → "
    else:  # Positive
        level2_labels = level2_positive
        path = "Positive → "

    level2_emb = model.encode(level2_labels)
    level2_sim = cosine_similarity(review_emb, level2_emb)[0]

    level2_pred = np.argmax(level2_sim)
    level2_conf = level2_sim[level2_pred]
    level2_label = level2_labels[level2_pred]
    path += level2_label

    return {
        "level1_label": level1_label,
        "level1_conf": level1_conf,
        "level2_label": level2_label,
        "level2_conf": level2_conf,
        "path": path
    }

Classify the reviews:


In [7]:
print("HIERARCHICAL CLASSIFICATION (2 LEVELS)")
for i, review in enumerate(test_reviews):
    result = hierarchical_classify_2level(review)
    print(f"\nReview {i+1}: '{review}'")
    print(f"\n  Level 1 (Sentiment):")
    print(f"     {result['level1_label']}")
    print(f"     Confidence: {result['level1_conf']:.3f}")
    print(f"\n  Level 2 (Specific Aspect):")
    print(f"     {result['level2_label']}")
    print(f"     Confidence: {result['level2_conf']:.3f}")
    print(f"\n  Final Classification Path:")
    print(f"     {result['path']}")
# Compare with flat classification
print()
print("COMPARISON: Hierarchical vs Flat Classification")
# Flat: All 4 categories at once
flat_labels = [
    "review criticizing entertainment value and pacing",      # 0
    "review criticizing technical quality and production",    # 1
    "review praising technical quality and artistry",         # 2
    "review praising entertainment value and enjoyment"       # 3
]
flat_embeddings = model.encode(flat_labels)
review_embeddings = model.encode(test_reviews)
flat_sim = cosine_similarity(review_embeddings, flat_embeddings)
print("\nShowing first 3 reviews:")
for i in range(min(3, len(test_reviews))):
    hier_result = hierarchical_classify_2level(test_reviews[i])
    flat_pred = np.argmax(flat_sim[i])
    flat_conf = flat_sim[i][flat_pred]
    print(f"\nReview: '{test_reviews[i][:50]}...'")
    print(f"  Hierarchical: {hier_result['level2_label']}")
    print(f"     Confidence: {hier_result['level2_conf']:.3f}")
    print(f"  Flat:         {flat_labels[flat_pred]}")
    print(f"     Confidence: {flat_conf:.3f}")
    print(f"  Confidence Diff: {hier_result['level2_conf'] - flat_conf:+.3f}")
# TODO: After implementing 3-level, uncomment to test it:
# print()
# print("TESTING 3-LEVEL HIERARCHICAL CLASSIFICATION")
# for i, review in enumerate(test_reviews):
#     result = hierarchical_classify_3level(review)
#     print(f"\n{i+1}. '{review[:60]}...'")
#     print(f"   Path: {result['path']}")
#     print(f"   Level 3 confidence: {result['level3_conf']:.3f}")

HIERARCHICAL CLASSIFICATION (2 LEVELS)

Review 1: 'Amazing cinematography and brilliant direction!'

  Level 1 (Sentiment):
     positive sentiment review
     Confidence: 0.148

  Level 2 (Specific Aspect):
     review praising technical quality and artistry
     Confidence: 0.306

  Final Classification Path:
     Positive → review praising technical quality and artistry

Review 2: 'Terrible pacing, very boring throughout'

  Level 1 (Sentiment):
     negative sentiment review
     Confidence: 0.295

  Level 2 (Specific Aspect):
     review criticizing entertainment value and pacing
     Confidence: 0.651

  Final Classification Path:
     Negative → review criticizing entertainment value and pacing

Review 3: 'Excellent acting but weak storyline'

  Level 1 (Sentiment):
     negative sentiment review
     Confidence: 0.275

  Level 2 (Specific Aspect):
     review criticizing entertainment value and pacing
     Confidence: 0.507

  Final Classification Path:
     Negative → review c

The hierarchical approach makes decisions in stages:
1. Level 1: Determines if the review is positive or negative
2. Level 2: Based on that sentiment, classifies the specific aspect
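
The staged decisions above generalize to any depth if the labels are stored as a nested dictionary and the classifier walks it level by level. The following is a minimal sketch of that idea, not the notebook's own implementation. `toy_encode` is a hypothetical bag-of-words stand-in for `model.encode` so the example runs without downloading a model, and the Level 3 labels in `label_tree` are invented for illustration. Because the toy encoder only matches shared words, the demo review is worded to overlap with the labels.

```python
import numpy as np

def toy_encode(texts, vocab):
    """Hypothetical stand-in for model.encode: word counts over a fixed vocabulary."""
    rows = []
    for text in texts:
        words = [w.strip(".,!?'\"") for w in text.lower().split()]
        rows.append([words.count(v) for v in vocab])
    return np.array(rows, dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a_unit @ b_unit.T

def collect_labels(node):
    """Flatten every label in the nested tree into one list."""
    labels = []
    for label, child in node.items():
        labels.append(label)
        labels.extend(collect_labels(child))
    return labels

def hierarchical_classify(review, tree, encode):
    """Walk a nested {label: subtree} dict, picking the most similar label per level."""
    review_emb = encode([review])
    path, confs = [], []
    node = tree
    while node:  # an empty dict marks a leaf
        labels = list(node.keys())
        sims = cosine_sim(review_emb, encode(labels))[0]
        best = int(np.argmax(sims))
        path.append(labels[best])
        confs.append(float(sims[best]))
        node = node[labels[best]]
    return {"path": path, "confs": confs}

# Invented 3-level taxonomy: only the "praising artistry" branch has a third level.
label_tree = {
    "negative review": {
        "criticizing pacing": {},
        "criticizing production": {},
    },
    "positive review": {
        "praising artistry": {
            "praising cinematography visuals": {},
            "praising acting performances": {},
        },
        "praising enjoyment": {},
    },
}

vocab = sorted({w for label in collect_labels(label_tree) for w in label.split()})
encode = lambda texts: toy_encode(texts, vocab)

result = hierarchical_classify(
    "A positive take praising the artistry and cinematography", label_tree, encode
)
print(" → ".join(result["path"]))
print([round(c, 3) for c in result["confs"]])
```

In the notebook itself, the same walk should work with `encode=model.encode` and scikit-learn's `cosine_similarity` in place of the toy helpers.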

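Notice the scores at each level. They are raw cosine similarities, not calibrated probabilities, so a Level 1 score of 0.148 and a Level 2 score of 0.306 come from different label sets and are not directly comparable.

Looking at the comparison between hierarchical and flat approaches:
- Both score the review against the same label embeddings, so the hierarchical Level 2 score equals the flat score when both pick the same label and is lower otherwise; it cannot be higher
- If Level 1 is wrong, Level 2 has no chance to correct it (Review 3, "Excellent acting but weak storyline", is routed to the negative branch and can then only receive a "criticizing" label)
- Flat classification considers all four options at once, which avoids this error propagation but must compare against every leaf label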
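
The relationship between the two confidence values can be checked with a few lines of arithmetic. Level 2 only sees the labels in the branch chosen by Level 1, and those labels are the same strings as in `flat_labels`, so its best score is a maximum over a subset of the flat scores. The sketch below uses made-up similarity values (an assumption, not output from the model) to show that the "Confidence Diff" printed in the comparison is zero when Level 1 routes to the branch containing the flat winner and negative otherwise.

```python
import numpy as np

# Hypothetical cosine similarities of one review to the four flat labels, ordered as
# [criticizing entertainment, criticizing technical, praising technical, praising entertainment].
flat_sim = np.array([0.51, 0.32, 0.55, 0.40])

# Index sets mirroring level2_negative and level2_positive.
branches = {"negative": [0, 1], "positive": [2, 3]}

def level2_confidence(sim, branch_indices):
    """Best similarity among the labels reachable from the chosen branch."""
    return float(np.max(sim[branch_indices]))

flat_conf = float(np.max(flat_sim))
conf_by_branch = {name: level2_confidence(flat_sim, idx) for name, idx in branches.items()}
diff_by_branch = {name: conf - flat_conf for name, conf in conf_by_branch.items()}

for name in branches:
    print(f"Level 1 = {name}: level2_conf={conf_by_branch[name]:.3f}, "
          f"flat_conf={flat_conf:.3f}, diff={diff_by_branch[name]:+.3f}")
```

A more meaningful comparison would look at which label each approach selects, and how often Level 1 routing errors change the final label, rather than at the raw similarity values.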

#### Hard Task 2: Active Learning to Minimize Labeling Costs
Labeling data is expensive. Active learning strategically selects the most informative samples to label, potentially saving 50%+ of labeling effort compared to random selection.

Instructions:
1. Run the simulation to compare active learning with random sampling
2. Observe which samples the model finds "uncertain"
3. Track how many samples each approach needs to reach F1 = 0.85
4. Try implementing the alternative selection strategy (margin sampling) and compare labeling cost savings

In [8]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
import numpy as np

# Simulate active learning vs random sampling
print("ACTIVE LEARNING SIMULATION")
print("="*60)

# Setup
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
train_data = data["train"].shuffle(seed=42).select(range(1000))
test_data = data["test"].shuffle(seed=42).select(range(200))

# Start with small labeled set
labeled_size = 50
target_f1 = 0.85

print(f"Starting with {labeled_size} labeled samples")
print(f"Target F1: {target_f1}")
print(f"\nNote: This is a simplified simulation")
print("In real active learning, you'd query an oracle (human) for labels\n")

ACTIVE LEARNING SIMULATION
Starting with 50 labeled samples
Target F1: 0.85

Note: This is a simplified simulation
In real active learning, you'd query an oracle (human) for labels



Active learning iteratively:
1. Trains on currently labeled data
2. Finds the most uncertain unlabeled samples
3. "Labels" those samples (adds them to training set)
4. Repeats

The uncertain samples being selected typically have prediction probabilities close to 50-50. These are the most informative because they lie near the decision boundary.

Compare the learning curves to see how quickly each approach improves. Active learning often reaches target performance with fewer labeled samples, saving significant labeling costs.
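
The four-step loop described above can be sketched as follows. This is an illustrative version under stated assumptions, not the notebook's simulation: it uses synthetic 16-dimensional features in place of sentence embeddings, and the function names (`uncertainty_scores`, `run_loop`), batch size, and number of rounds are all hypothetical choices. It compares random sampling with two uncertainty strategies, least-confidence and margin sampling.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def uncertainty_scores(proba, strategy):
    """Higher score = more uncertain. proba has shape (n_samples, n_classes)."""
    if strategy == "least_confident":
        return 1.0 - proba.max(axis=1)
    if strategy == "margin":
        top2 = np.sort(proba, axis=1)[:, -2:]
        return -(top2[:, 1] - top2[:, 0])  # small margin -> high score
    raise ValueError(f"unknown strategy: {strategy}")

def run_loop(X_pool, y_pool, X_test, y_test, strategy,
             n_init=50, batch=25, rounds=6, seed=42):
    """Train, score the unlabeled pool, 'label' a batch, repeat."""
    rng = np.random.default_rng(seed)
    labeled = rng.choice(len(X_pool), size=n_init, replace=False).tolist()
    curve = []
    for _ in range(rounds):
        # 1. Train on the currently labeled data
        clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
        f1 = f1_score(y_test, clf.predict(X_test), average="weighted")
        curve.append((len(labeled), float(f1)))
        # 2. Choose the next batch from the unlabeled pool
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        if strategy == "random":
            picked = rng.choice(unlabeled, size=batch, replace=False)
        else:
            scores = uncertainty_scores(clf.predict_proba(X_pool[unlabeled]), strategy)
            picked = unlabeled[np.argsort(scores)[-batch:]]
        # 3. "Label" the batch (the simulated oracle is simply y_pool)
        labeled.extend(picked.tolist())
    return curve, labeled

# Synthetic stand-in data: a noisy linear decision boundary.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(1200, 16))
w = data_rng.normal(size=16)
y = (X @ w + data_rng.normal(scale=1.0, size=1200) > 0).astype(int)
X_pool, y_pool, X_test, y_test = X[:1000], y[:1000], X[1000:], y[1000:]

curves, labeled_sets = {}, {}
for strategy in ["random", "least_confident", "margin"]:
    curves[strategy], labeled_sets[strategy] = run_loop(
        X_pool, y_pool, X_test, y_test, strategy
    )
    print(strategy, [(n, round(f1, 3)) for n, f1 in curves[strategy]])
```

One thing to keep in mind when comparing strategies on this dataset: with only two classes, the margin equals `2 * p_max - 1`, so margin sampling and least-confidence rank samples identically (up to ties and rounding). The two strategies only diverge when there are three or more classes.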

#### Hard Task 3: Ensemble Classifier for Improved Robustness

Ensemble methods combine multiple models to reduce individual biases and improve reliability. This is the wisdom-of-crowds principle: several imperfect models together often beat any single model, provided their errors are not strongly correlated.

Instructions:
1. Run to see 3 individual models compared to ensemble methods
2. Compare simple majority voting vs confidence-weighted voting
3. Examine disagreement cases - when models disagree, which is usually right?
4. Optionally add a 4th model and performance-weighted voting
5. Determine if the ensemble beats the best individual model

In [9]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
import numpy as np

print("ENSEMBLE CLASSIFIER COMPARISON")
print("="*60)

# Load 3 different models
model1 = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
model2 = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
model3 = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L3-v2')

models = [model1, model2, model3]
model_names = ["MPNet", "MiniLM-L6", "MiniLM-L3"]

print(f"Using {len(models)} models:\n")
for name in model_names:
    print(f"  - {name}")

# Train each model
train_subset = data["train"].shuffle(seed=42).select(range(1000))
test_subset = data["test"].shuffle(seed=42).select(range(200))

print(f"\nTraining on {len(train_subset)} samples...")
print("Note: In practice, ensemble diversity comes from:")
print("  - Different model architectures")
print("  - Different training data subsets")
print("  - Different hyperparameters\n")

ENSEMBLE CLASSIFIER COMPARISON


Using 3 models:

  - MPNet
  - MiniLM-L6
  - MiniLM-L3

Training on 1000 samples...
Note: In practice, ensemble diversity comes from:
  - Different model architectures
  - Different training data subsets
  - Different hyperparameters
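
The cell above loads the three encoders but stops before the voting step. As a rough sketch of how the two voting schemes from the instructions differ, the example below assumes each model's classifier has already produced class probabilities (for instance via `predict_proba`) and that they are stacked into one array of shape `(n_models, n_samples, n_classes)`. The probability values and the per-model weights are made up for illustration.

```python
import numpy as np

def majority_vote(probas):
    """Each model casts one vote (its argmax class); the most common class wins."""
    votes = probas.argmax(axis=2)                      # (n_models, n_samples)
    n_classes = probas.shape[2]
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)], axis=1)
    return counts.argmax(axis=1)

def confidence_weighted_vote(probas, model_weights=None):
    """Average class probabilities across models, optionally weighting each model."""
    if model_weights is None:
        model_weights = np.ones(probas.shape[0])
    w = np.asarray(model_weights, dtype=float)
    avg = np.tensordot(w / w.sum(), probas, axes=1)    # (n_samples, n_classes)
    return avg.argmax(axis=1)

# Hypothetical probabilities for 3 models x 3 reviews x 2 classes (negative, positive).
probas = np.array([
    [[0.10, 0.90], [0.45, 0.55], [0.80, 0.20]],   # model 1
    [[0.40, 0.60], [0.48, 0.52], [0.30, 0.70]],   # model 2
    [[0.30, 0.70], [0.95, 0.05], [0.35, 0.65]],   # model 3
])

majority = majority_vote(probas)
averaged = confidence_weighted_vote(probas)
# Performance-weighted variant: weights could be each model's validation F1 (assumed values).
perf_weighted = confidence_weighted_vote(probas, model_weights=[0.85, 0.80, 0.75])

# Disagreement cases: samples where the models do not all vote for the same class.
votes = probas.argmax(axis=2)
disagreement = np.where(votes.min(axis=0) != votes.max(axis=0))[0]

print("majority:      ", majority)
print("averaged:      ", averaged)
print("perf-weighted: ", perf_weighted)
print("disagreements: ", disagreement)
```

The second review shows where the schemes part ways: two models lean slightly positive while one is strongly negative, so majority voting and probability averaging reach different answers.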



#### Hard Task 4: Cross-Domain Transfer Learning

Can a classifier trained on movie reviews work on restaurant, product, or book reviews? Observe zero-shot transfer (no adaptation) performance on each domain, see which domains transfer well and which don't, then try few-shot adaptation (adding just 4 examples from the target domain). Analyze domain similarity using embedding distances, and optionally add your own custom domain to test.

Load dependencies and set up source/target domains:

In [10]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
import numpy as np

Define source domain (movies) and target domains:

In [11]:
movie_data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

movie_train = movie_data["train"].shuffle(seed=42).select(range(2000))
movie_test = movie_data["test"].shuffle(seed=42).select(range(200))

In [12]:
restaurant_reviews = {
    'text': [
        "Amazing food and excellent service!",
        "Best restaurant in town, highly recommend",
        "Delicious meals and great atmosphere",
        "Outstanding cuisine and friendly staff",
        "Terrible food, very disappointing",
        "Awful service and poor quality",
        "Not worth the money, mediocre at best",
        "Disgusting food and rude waiters",
        "The pasta was okay but nothing special",
        "Decent place for a quick meal"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

product_reviews = {
    'text': [
        "This product is amazing! Works perfectly",
        "Excellent quality, very satisfied",
        "Great value for money, highly recommend",
        "Perfect! Exactly what I needed",
        "Terrible product, broke immediately",
        "Waste of money, very poor quality",
        "Doesn't work as advertised, disappointed",
        "Awful, don't buy this",
        "It's okay, does the job",
        "Average product, nothing special"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

book_reviews = {
    'text': [
        "Brilliant book! Couldn't put it down",
        "Masterfully written, highly engaging",
        "One of the best books I've read",
        "Fantastic story and great characters",
        "Boring and poorly written",
        "Terrible book, waste of time",
        "Disappointing, not worth reading",
        "Awful plot and weak characters",
        "Decent read but nothing groundbreaking",
        "It was fine, not great not terrible"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

Train classifier on source domain (movies):

In [13]:
print("Training on movie reviews...")
train_embeddings = model.encode(movie_train["text"], show_progress_bar=True)

clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, movie_train["label"])

# Test on source domain
movie_test_embeddings = model.encode(movie_test["text"], show_progress_bar=False)
movie_test_pred = clf.predict(movie_test_embeddings)
source_f1 = f1_score(movie_test["label"], movie_test_pred, average='weighted')

print(f"\nSource domain F1: {source_f1:.4f}")

Training on movie reviews...



Source domain F1: 0.8497


Test zero-shot transfer to target domains (no adaptation):

In [14]:
target_domains = {
    "Restaurant": restaurant_reviews,
    "Product": product_reviews,
    "Book": book_reviews,
}

transfer_results = {}

print("\nZero-shot transfer results:")
for domain_name, domain_data in target_domains.items():
    domain_embeddings = model.encode(domain_data['text'])
    domain_pred = clf.predict(domain_embeddings)
    domain_f1 = f1_score(domain_data['label'], domain_pred, average='weighted')
    domain_acc = accuracy_score(domain_data['label'], domain_pred)

    print(f"\n{domain_name}: F1={domain_f1:.4f}, Acc={domain_acc:.4f}")
    print(f"  Drop from source: {source_f1 - domain_f1:.4f}")

    transfer_results[domain_name] = {
        'zero_shot_f1': domain_f1,
        'embeddings': domain_embeddings
    }


Zero-shot transfer results:

Restaurant: F1=0.9010, Acc=0.9000
  Drop from source: -0.0513

Product: F1=1.0000, Acc=1.0000
  Drop from source: -0.1503

Book: F1=0.9010, Acc=0.9000
  Drop from source: -0.0513


Few-shot adaptation (add 4 examples from target domain):

In [15]:
adaptation_size = 4

print("\nFew-shot adaptation (4 examples):")
for domain_name, domain_data in target_domains.items():
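    # Split: first 4 for adaptation, rest for testing.
    # Caveat: in the toy domains above the first 4 reviews are all positive (label 1),
    # so the adaptation set is single-class and the 6 held-out reviews are all negative.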
    adapt_texts = domain_data['text'][:adaptation_size]
    adapt_labels = domain_data['label'][:adaptation_size]
    test_texts = domain_data['text'][adaptation_size:]
    test_labels = domain_data['label'][adaptation_size:]

    # Combine source + adaptation
    adapt_embeddings = model.encode(adapt_texts)
    combined_embeddings = np.vstack([train_embeddings, adapt_embeddings])
    combined_labels = list(movie_train["label"]) + adapt_labels

    # Retrain
    clf_adapted = LogisticRegression(random_state=42, max_iter=1000)
    clf_adapted.fit(combined_embeddings, combined_labels)

    # Test
    test_embeddings = model.encode(test_texts)
    adapted_pred = clf_adapted.predict(test_embeddings)
    adapted_f1 = f1_score(test_labels, adapted_pred, average='weighted')

    zero_shot_f1 = transfer_results[domain_name]['zero_shot_f1']
    improvement = adapted_f1 - zero_shot_f1

    print(f"\n{domain_name}:")
    print(f"  Zero-shot: {zero_shot_f1:.4f}")
    print(f"  Adapted:   {adapted_f1:.4f}")
    print(f"  Gain:      {improvement:+.4f}")

    transfer_results[domain_name]['adapted_f1'] = adapted_f1
    transfer_results[domain_name]['improvement'] = improvement


Few-shot adaptation (4 examples):

Restaurant:
  Zero-shot: 0.9010
  Adapted:   0.9091
  Gain:      +0.0081

Product:
  Zero-shot: 1.0000
  Adapted:   1.0000
  Gain:      +0.0000

Book:
  Zero-shot: 0.9010
  Adapted:   0.9091
  Gain:      +0.0081
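
Two details make the gains above hard to interpret. The slice `[:adaptation_size]` takes the first four reviews of each toy domain, which are all positive, and the zero-shot F1 is measured on all ten reviews while the adapted F1 is measured on the remaining six. One possible alternative is to draw the adaptation examples per class and evaluate both classifiers on the same held-out reviews. The helper below is a hypothetical sketch of such a split (`stratified_few_shot_split` is not part of the notebook), using placeholder texts and the same label layout as the toy domains.

```python
import numpy as np

def stratified_few_shot_split(texts, labels, n_per_class=2, seed=42):
    """Pick n_per_class adaptation examples from each class; hold out the rest."""
    rng = np.random.default_rng(seed)
    labels_arr = np.asarray(labels)
    adapt_idx = []
    for cls in np.unique(labels_arr):
        cls_idx = np.flatnonzero(labels_arr == cls)
        adapt_idx.extend(rng.choice(cls_idx, size=n_per_class, replace=False).tolist())
    adapt_set = set(adapt_idx)
    keep = sorted(adapt_set)
    held_out = [i for i in range(len(texts)) if i not in adapt_set]
    adapt = ([texts[i] for i in keep], [labels[i] for i in keep])
    test = ([texts[i] for i in held_out], [labels[i] for i in held_out])
    return adapt, test

# Placeholder texts with the same label layout as the toy domain dictionaries.
texts = [f"review {i}" for i in range(10)]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

(adapt_texts, adapt_labels), (test_texts, test_labels) = stratified_few_shot_split(texts, labels)
print("adaptation labels:", adapt_labels)
print("held-out labels:  ", test_labels)
```

With this split, the zero-shot classifier `clf` and the retrained `clf_adapted` can both be scored on `test_texts`, so the reported gain compares like with like. With only six held-out reviews per domain, any difference should still be read as anecdotal.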


Analyze domain similarity (embedding distance):

In [16]:
source_centroid = np.mean(train_embeddings, axis=0)

print("\nDomain distances from source:")
for domain_name in target_domains.keys():
    domain_embeddings = transfer_results[domain_name]['embeddings']
    domain_centroid = np.mean(domain_embeddings, axis=0)
    distance = np.linalg.norm(source_centroid - domain_centroid)

    zero_f1 = transfer_results[domain_name]['zero_shot_f1']
    drop = source_f1 - zero_f1

    print(f"\n{domain_name:12s}: distance={distance:.3f}, drop={drop:.3f}")

print("\nObservation: Smaller distance = better transfer")


Domain distances from source:

Restaurant  : distance=0.716, drop=-0.051

Product     : distance=0.678, drop=-0.150

Book        : distance=0.534, drop=-0.051

Observation: Smaller distance = better transfer
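
It is worth checking the printed observation against the printed numbers before accepting it. The sketch below does two things. First, it defines a small helper (`centroid_distances`, a hypothetical name) that reports cosine distance alongside the Euclidean distance used above, since cosine distance ignores differences in vector length. Second, it ranks the three domains using the distance and drop values copied from the output above.

```python
import numpy as np

def centroid_distances(source_emb, target_emb):
    """Euclidean and cosine distance between the mean embeddings of two domains."""
    s, t = source_emb.mean(axis=0), target_emb.mean(axis=0)
    euclid = float(np.linalg.norm(s - t))
    cosine = 1.0 - float(s @ t / (np.linalg.norm(s) * np.linalg.norm(t)))
    return euclid, cosine

# Tiny synthetic check: two domains whose centroids are orthogonal unit vectors.
source_emb = np.array([[1.0, 0.0], [1.0, 0.0]])
target_emb = np.array([[0.0, 1.0], [0.0, 1.0]])
euclid, cosine = centroid_distances(source_emb, target_emb)
print(f"euclidean={euclid:.3f}, cosine={cosine:.3f}")

# Values copied from the output above (drop = source F1 minus zero-shot F1).
distances = {"Restaurant": 0.716, "Product": 0.678, "Book": 0.534}
drops = {"Restaurant": -0.051, "Product": -0.150, "Book": -0.051}

closest = min(distances, key=distances.get)
best_transfer = min(drops, key=drops.get)
for name in sorted(distances, key=distances.get):
    print(f"{name:12s} distance={distances[name]:.3f} drop={drops[name]:+.3f}")
print(f"closest domain: {closest}, smallest drop: {best_transfer}")
```

In this run the closest domain (Book) is not the one with the smallest drop (Product), so the three data points do not by themselves support "smaller distance = better transfer". With ten reviews per domain, the result is better treated as a hypothesis to test on larger target sets.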


### Questions

1. Which domain transferred best from movies? Which worst?

2. Do you see patterns in what transfers well vs what fails?

3. After few-shot adaptation: Which domains benefited most from just 4 examples?
