# Chapter 4: Text Classification - Medium Tasks

This notebook focuses on building practical text classifiers. You'll create custom multi-class sentiment classifiers, evaluate performance with limited training data, implement confidence-based classification with uncertainty handling, and perform systematic failure analysis. These skills matter in real-world NLP applications, where labeled data is often scarce and perfect accuracy is rarely attainable.


---

## Setup

Run all cells in this section to set up the environment and load necessary data.


### [Optional] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter:

---

**Note**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [None]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

### Data Loading


In [None]:
from datasets import load_dataset
# Load our data
data = load_dataset("rotten_tomatoes")
data

### Helper Functions


In [None]:
from sklearn.metrics import classification_report
def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

## Your Turn - Text Classification Experiments

Run each task first to see the baseline results. Follow the instructions to modify and experiment.

---

## Medium Tasks


### Medium Tasks - Building Real Classifiers

These tasks require more modification and experimentation. You'll build complete classification systems.

Run the code to see how 5-level classification works. Then try adding a 6th category.


#### Medium Task 1: Multi-Class Sentiment Classification

In this task, you'll build a sentiment classifier with 5 categories (from extremely negative to extremely positive) instead of just binary positive/negative.

**What to do:**

1. Run the cells below to see the baseline 5-level classification
2. Observe which reviews are uncertain (low margin between the top predictions)
3. Try adding a 6th category to `sentiment_labels` for more granularity
4. Compare how predictions change with more categories

Load the embedding model:

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

Set up the 5 sentiment categories and compute embeddings:

In [None]:
# Define 5 sentiment categories
sentiment_labels = [
    "extremely negative review",
    "somewhat negative review",
    "neutral review",
    "somewhat positive review",
    "extremely positive review"
]

# Create embeddings for each category
label_embeddings = model.encode(sentiment_labels)

print("Sentiment categories:")
for i, label in enumerate(sentiment_labels):
    print(f"  {i}: {label}")

In [None]:
# Test reviews
test_reviews = [
    "This is the best movie I have ever seen! Absolute masterpiece!",
    "Pretty good film, I enjoyed it",
    "It was okay, nothing special",
    "Not very good, quite boring",
    "Terrible movie, waste of time"
]

review_embeddings = model.encode(test_reviews)
similarities = cosine_similarity(review_embeddings, label_embeddings)

In [None]:
import numpy as np

# Classify each review
print("Classification results:")

for i, review in enumerate(test_reviews):
    scores = similarities[i]

    # The category with the highest similarity is the prediction
    predicted_idx = int(np.argmax(scores))
    confidence = scores[predicted_idx]  # max similarity score

    # Margin between the top 2 predictions (np.sort is ascending)
    top_two = np.sort(scores)[-2:]
    margin = top_two[1] - top_two[0]

    print(f"\n'{review}'")
    print(f"  -> {sentiment_labels[predicted_idx]}")
    print(f"  Confidence: {confidence:.3f} | Margin: {margin:.3f}")

In [None]:
# Analyze category confusion
print("\nCategory Confusion Analysis")
label_similarity = cosine_similarity(label_embeddings)

# Find most similar category pairs
confusions = []
for i in range(len(sentiment_labels)):
    for j in range(i+1, len(sentiment_labels)):
        sim = label_similarity[i][j]
        confusions.append((i, j, sim))

# Show top 5 confusions
for i, j, sim in sorted(confusions, key=lambda x: x[2], reverse=True)[:5]:
    print(f"{sentiment_labels[i]} <-> {sentiment_labels[j]}: {sim:.3f}")

The classifier assigns each review to one of the 5 sentiment categories. The **margin** (the difference between the top 2 similarity scores) indicates confidence: as a rule of thumb, a large margin (>0.15) means the model is confident, while a small margin (<0.05) signals uncertainty. Reviews with extreme language ("best movie I have ever seen", "terrible") tend to have higher confidence, while moderate reviews ("pretty good", "quite boring") show more uncertainty.

The category confusion analysis shows which categories are most similar to each other. Adjacent categories (like "somewhat negative" and "neutral") tend to have higher similarity scores, which explains why the model sometimes confuses them. Category pairs with similarity > 0.7 are particularly prone to confusion.
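The margin rule of thumb can be written as a small helper. This is a minimal sketch in plain Python: the name `margin_bucket`, the "borderline" middle bucket, and the example score rows are invented for illustration, and the 0.15 and 0.05 cut-offs are the heuristics from the text, not tuned values.

```python
def margin_bucket(scores, confident=0.15, uncertain=0.05):
    """Return (best_index, margin, bucket) for one row of similarity scores."""
    if len(scores) < 2:
        raise ValueError("need at least two category scores")
    # Rank category indices from highest to lowest score
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = scores[best] - scores[runner_up]
    if margin > confident:
        bucket = "confident"
    elif margin < uncertain:
        bucket = "uncertain"
    else:
        bucket = "borderline"
    return best, margin, bucket


# Hypothetical similarity rows (not real model output)
clear_row = [0.10, 0.15, 0.20, 0.30, 0.55]
close_row = [0.20, 0.41, 0.43, 0.30, 0.25]

for row in (clear_row, close_row):
    best, margin, bucket = margin_bucket(row)
    print(f"best index={best}  margin={margin:.3f}  bucket={bucket}")
```

In the notebook, each row of `similarities` could be passed to this helper in place of the made-up rows.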

#### Medium Task 2: Classifier Performance with Limited Training Data

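Try different training sizes (100, 500, 1000, 2000, 5000) by changing `train_size`, rerun the cells, and record each result in a table like this one:

| Training size | Task-Specific F1 | Embedding F1 | Difference |
|---|---|---|---|
| 100 | | | |
| 500 | | | |
| 1000 | | | |
| 2000 | | | |
| 5000 | | | |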


In [None]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report
from datasets import load_dataset
import numpy as np
data = load_dataset("rotten_tomatoes")
# Try different values: 100, 500, 1000, 2000, 5000
train_size = 1000
test_size = 300
train_subset = data["train"].shuffle(seed=42).select(range(min(train_size, len(data["train"]))))
test_subset = data["test"].shuffle(seed=42).select(range(test_size))
print(f"Experiment: Training Size = {train_size}")

In [None]:
print("\n[1/2] Testing Task-Specific Model...")
task_model = pipeline(
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    return_all_scores=True,
    device=-1
)
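y_pred_task = []
for text in test_subset["text"]:
    # With return_all_scores=True, the pipeline returns one list of
    # per-label scores for each input; [0] selects the list for this text
    output = task_model(text)[0]
    # This model's label order is negative (0), neutral (1), positive (2)
    neg_score = output[0]["score"]
    pos_score = output[2]["score"]
    # The dataset is binary, so ignore "neutral" and compare the two extremes
    y_pred_task.append(1 if pos_score > neg_score else 0)
task_f1 = f1_score(test_subset["label"], y_pred_task, average='weighted')
print(f" Task-Specific Model F1: {task_f1:.4f}")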

Train the embedding-based classifier on your labeled data:

In [None]:
print(f"\n[2/2] Training Embedding Classifier on {train_size} samples...")
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
train_embeddings = embedding_model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = embedding_model.encode(test_subset["text"], show_progress_bar=False)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])
y_pred_embed = clf.predict(test_embeddings)
embed_f1 = f1_score(test_subset["label"], y_pred_embed, average='weighted')
print(f" Embedding Classifier F1: {embed_f1:.4f}")

Compare the two approaches and show example predictions:

In [None]:
print("\nResults Summary")
print(f"Training samples: {train_size}")
print(f"Task-Specific F1: {task_f1:.4f}")
print(f"Embedding F1: {embed_f1:.4f}")
print(f"Difference: {embed_f1 - task_f1:+.4f}")

print("\nExample predictions:")
for i in range(3):
    print(f"\n'{test_subset['text'][i][:50]}...'")
    print(f"Task-Specific: {y_pred_task[i]} | Embedding: {y_pred_embed[i]}")

### Questions

1. At what training size did the embedding classifier match the task-specific model?

2. Were there cases where one model was correct and the other wrong?

3. Is 100 samples enough labeled data?
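For question 2, one way to find the interesting cases is to list the indices where exactly one model is right. The sketch below is a plain-Python illustration: `find_disagreements` is a hypothetical helper name and the toy label lists are made up.

```python
def find_disagreements(y_true, pred_a, pred_b):
    """Return two index lists: where only model A is correct, and where only model B is."""
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")
    only_a, only_b = [], []
    for i, (true, a, b) in enumerate(zip(y_true, pred_a, pred_b)):
        if a == true and b != true:
            only_a.append(i)
        elif b == true and a != true:
            only_b.append(i)
    return only_a, only_b


# Made-up labels and predictions for illustration
toy_true = [1, 0, 1, 0, 1]
toy_task = [1, 0, 0, 0, 0]
toy_embed = [1, 1, 1, 0, 0]

only_task, only_embed = find_disagreements(toy_true, toy_task, toy_embed)
print("Only task-specific model correct:", only_task)
print("Only embedding classifier correct:", only_embed)
```

In the notebook, the same call would take `test_subset["label"]`, `y_pred_task`, and `y_pred_embed`; the returned indices can then be used to print the corresponding review texts.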


#### Medium Task 3: Confidence-Based Classifier with Uncertainty Handling

In production, declining to predict is often better than making a wrong prediction. The key insight: when your model is uncertain, it should say "I don't know" rather than guess. This creates a trade-off between coverage (the share of inputs you make a prediction for) and accuracy (how often those predictions are right).

Try this:
- Run with a threshold of 0.15 first
- Test 0.05, 0.30, and 0.50 to see how the trade-off shifts
- Check the uncertain cases (they typically contain hedging language)
- Experiment with the alternative uncertainty measure
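
Rather than rerunning the notebook once per threshold, the trade-off can be tabulated in one pass. This is a minimal sketch in plain Python: `coverage_and_accuracy` is a hypothetical helper, and the margins, predictions, and labels are made up, not model output. Unlike the metrics cell further down, it reports accuracy as `None` (not 0) when no prediction clears the threshold.

```python
def coverage_and_accuracy(margins, predictions, y_true, threshold):
    """Return (coverage, accuracy) when abstaining below `threshold`.

    accuracy is None when no prediction clears the threshold.
    """
    kept = [
        (pred, true)
        for margin, pred, true in zip(margins, predictions, y_true)
        if margin >= threshold
    ]
    coverage = len(kept) / len(margins) if margins else 0.0
    accuracy = sum(pred == true for pred, true in kept) / len(kept) if kept else None
    return coverage, accuracy


# Made-up margins and labels for illustration (not real model output)
demo_margins = [0.40, 0.02, 0.20, 0.08, 0.35, 0.10]
demo_preds = [1, 1, 0, 1, 0, 0]
demo_true = [1, 0, 0, 0, 0, 1]

for threshold in (0.05, 0.15, 0.30, 0.50):
    coverage, accuracy = coverage_and_accuracy(
        demo_margins, demo_preds, demo_true, threshold
    )
    accuracy_text = "n/a" if accuracy is None else f"{accuracy:.1%}"
    print(f"threshold={threshold:.2f}  coverage={coverage:.1%}  accuracy={accuracy_text}")
```

With real data, pass the margin and the argmax prediction for every review, collected before any thresholding is applied.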

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [None]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Reviews with varying levels of clarity
test_reviews = [
    "Absolutely fantastic! Best movie ever!",           # Clear positive
    "Pretty good, I liked it",                          # Weak positive
    "It was fine, nothing special",                     # Ambiguous
    "Not bad but not great either",                     # Very ambiguous
    "Quite disappointing",                              # Weak negative
    "Terrible! Complete waste of time!",                # Clear negative
    "The movie had some interesting moments",           # Ambiguous positive
    "Outstanding performances all around!",             # Clear positive
]

# True labels (for evaluation)
y_true = [1, 1, 0, 0, 0, 0, 1, 1]  # 1=positive, 0=negative

In [None]:
labels = ["negative", "positive"]

# Experiment with this value - try: 0.05, 0.15, 0.30, 0.50
confidence_threshold = 0.15

In [None]:
def calculate_margin(similarities):
    """
    Margin = difference between top two predictions
    Small margin = uncertain (predictions are close)
    """
    sorted_sims = np.sort(similarities)[::-1]
    margin = sorted_sims[0] - sorted_sims[1]
    return margin

# After first run, uncomment this alternative uncertainty measure:
# def calculate_margin(similarities):
#     """
#     Alternative: Use absolute confidence in top prediction
#     Low confidence = uncertain
#     """
#     max_confidence = np.max(similarities)
#     # Convert to margin-like score (higher = more certain)
#     # If max is 0.6, margin = 0.6 - 0.5 = 0.1 (uncertain)
#     # If max is 0.9, margin = 0.9 - 0.5 = 0.4 (certain)
#     margin = max_confidence - 0.5
#     return margin

In [None]:
label_embeddings = model.encode(labels)
review_embeddings = model.encode(test_reviews)
sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

In [None]:
# Classify with confidence threshold
results = []

print(f"Confidence-based classification (threshold={confidence_threshold})")

for i, review in enumerate(test_reviews):
    similarities = sim_matrix[i]
    predicted_idx = np.argmax(similarities)
    margin = calculate_margin(similarities)
    
    # Predict only if margin is above threshold
    if margin >= confidence_threshold:
        prediction = predicted_idx
    else:
        prediction = None
    
    pred_label = labels[predicted_idx] if prediction is not None else "uncertain"
    print(f"{i+1}. '{review[:40]}...' -> {pred_label} (margin: {margin:.3f})")
    
    results.append({'pred': prediction, 'true': y_true[i], 'margin': margin})

In [None]:
# Calculate metrics
made_predictions = [r for r in results if r['pred'] is not None]
correct = [r for r in made_predictions if r['pred'] == r['true']]

coverage = len(made_predictions) / len(results)
accuracy = len(correct) / len(made_predictions) if made_predictions else 0

print(f"\nPerformance:")
print(f"Coverage: {len(made_predictions)}/{len(results)} = {coverage:.1%}")
print(f"Accuracy: {len(correct)}/{len(made_predictions)} = {accuracy:.1%}")

### Questions

1. What do uncertain reviews have in common? Are they using hedging language like "kind of" or "somewhat"?

2. Compare results at threshold=0.05 vs 0.30. Describe the coverage vs accuracy trade-off. When would you want high coverage vs high accuracy?

3. How could you use confidence-based prediction in production? What should a system do when the model is uncertain?
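
As one possible starting point for question 3, an uncertain prediction can be routed to a person instead of being returned automatically. The sketch below is an assumption-laden illustration: the function name `route_prediction`, the `"auto"` / `"human_review"` action names, and the dictionary layout are all hypothetical, not part of any library.

```python
def route_prediction(label, margin, threshold=0.15):
    """Accept the label when the margin clears the threshold, else defer to a human."""
    if margin >= threshold:
        return {"action": "auto", "label": label}
    # Keep the model's guess as a suggestion for the reviewer
    return {"action": "human_review", "label": None, "suggested": label}


# Made-up margins for illustration
print(route_prediction("positive", margin=0.32))
print(route_prediction("negative", margin=0.04))
```

Other reasonable fallbacks include asking the user a clarifying question or handing the input to a slower, stronger model.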

#### Medium Task 4: Classifier Failure Analysis

Train the classifier and see what kinds of reviews it gets wrong. Then add your own test cases.


In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from datasets import load_dataset
import numpy as np

Load data and train a classifier.

In [None]:
data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_subset = data["train"].shuffle(seed=42).select(range(1000))
test_subset = data["test"].shuffle(seed=42).select(range(200))

print("Training classifier...")
train_embeddings = model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = model.encode(test_subset["text"], show_progress_bar=False)

clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])

predictions = clf.predict(test_embeddings)
probabilities = clf.predict_proba(test_embeddings)

Analyze the errors.

In [None]:
# Find and collect errors
errors = []

for i in range(len(test_subset)):
    if predictions[i] != test_subset["label"][i]:
        errors.append({
            'text': test_subset["text"][i],
            'true_label': test_subset["label"][i],
            'predicted_label': predictions[i],
            'confidence': probabilities[i][predictions[i]]
        })

accuracy = (len(test_subset) - len(errors)) / len(test_subset)
print(f"Accuracy: {accuracy:.1%}")
print(f"Errors: {len(errors)}")

Look at high-confidence errors (the most surprising mistakes).

In [None]:
# Look at high-confidence errors
high_conf_errors = [e for e in errors if e['confidence'] > 0.7]

print("High-confidence errors:")
for error in high_conf_errors[:5]:
    true_sent = "positive" if error['true_label'] == 1 else "negative"
    pred_sent = "positive" if error['predicted_label'] == 1 else "negative"
    print(f"\n'{error['text'][:50]}...'")
    print(f"True: {true_sent} | Predicted: {pred_sent} (conf: {error['confidence']:.3f})")

Test on edge cases like sarcasm and mixed sentiment.

In [None]:
edge_cases = [
    ("Sarcastic", "Oh great, another masterpiece. not!", 0),
    ("Mixed", "The acting was great but the plot was terrible", 0),
    ("Ambiguous", "It was a movie", 0),
]

edge_embeddings = model.encode([text for _, text, _ in edge_cases])
edge_predictions = clf.predict(edge_embeddings)

print("\nEdge Cases:")
correct = 0
for i, (category, text, true_label) in enumerate(edge_cases):
    pred = edge_predictions[i]
    match = "correct" if pred == true_label else "wrong"
    if pred == true_label:
        correct += 1
    print(f"{category}: '{text}' -> {pred} ({match})")

print(f"\nEdge case accuracy: {correct}/{len(edge_cases)}")

### Questions

1. What do high-confidence errors have in common?

2. Do errors tend to be shorter or longer than correct predictions?

3. Which edge cases failed most: sarcasm, mixed sentiment, or ambiguous statements? If you added your own cases (for example, double negatives), how did they fare?
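
Question 2 can be checked numerically by comparing average review length for correct and incorrect predictions. This is a minimal sketch using word counts as the length measure; `compare_lengths` and `mean_word_count` are hypothetical helper names, and the toy reviews and labels are made up.

```python
from statistics import mean


def mean_word_count(texts):
    """Average number of whitespace-separated words; 0.0 for an empty list."""
    return mean(len(text.split()) for text in texts) if texts else 0.0


def compare_lengths(texts, y_true, y_pred):
    """Return (mean words in correct predictions, mean words in errors)."""
    right = [t for t, true, pred in zip(texts, y_true, y_pred) if true == pred]
    wrong = [t for t, true, pred in zip(texts, y_true, y_pred) if true != pred]
    return mean_word_count(right), mean_word_count(wrong)


# Made-up reviews and labels for illustration
toy_texts = [
    "great film",
    "a slow and confusing plot that never pays off",
    "boring",
    "not bad at all",
]
toy_labels = [1, 0, 0, 1]
toy_preds = [1, 1, 0, 0]

right_len, wrong_len = compare_lengths(toy_texts, toy_labels, toy_preds)
print(f"Mean words (correct): {right_len:.1f} | Mean words (errors): {wrong_len:.1f}")
```

In the notebook, the call would be `compare_lengths(test_subset["text"], test_subset["label"], predictions)`. With only 200 test reviews, treat any difference as a hint, not a firm conclusion.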
