# Chapter 4: Text Classification - Medium Tasks (Solutions)

Complete working solutions.

## Setup

Run all cells in this section to set up the environment and load necessary data.


### [Optional] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to run the following code block to install the dependencies for this chapter:

---

 **Note**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

### Data Loading


In [2]:
from datasets import load_dataset
# Load our data
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

### Helper Functions


In [3]:
from sklearn.metrics import classification_report
def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

## Your Turn - Text Classification Experiments

Run each task first to see the baseline results. Follow the instructions to modify and experiment.

## Medium Tasks


### Medium Tasks - Building Real Classifiers

These tasks require more modification and experimentation. You'll build complete classification systems.

Run the code to see how 5-level classification works. Then try adding a 6th category.


#### Medium Task 1: Multi-Class Sentiment Classification
In this task, you'll build a sentiment classifier with 5 categories (from extremely negative to extremely positive) instead of just binary positive/negative.

**What to do:**
1. Run the cells below to see the baseline 5-level classification
2. Observe which reviews are uncertain (low margin between the top two predictions)
3. Try adding a 6th label to `sentiment_labels` for more granularity
4. Compare how predictions change with more categories

Load the embedding model:

In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

Define the 5 sentiment categories and compute their embeddings, then classify each review and show its confidence scores:

In [5]:
# Define 5 sentiment categories
sentiment_labels = [
    "extremely negative review",
    "somewhat negative review",
    "neutral review",
    "somewhat positive review",
    "extremely positive review"
]

# Create embeddings for each category
label_embeddings = model.encode(sentiment_labels)

print("Sentiment categories:")
for i, label in enumerate(sentiment_labels):
    print(f"  {i}: {label}")

Sentiment categories:
  0: extremely negative review
  1: somewhat negative review
  2: neutral review
  3: somewhat positive review
  4: extremely positive review


In [6]:
# Test reviews
test_reviews = [
    "This is the best movie I have ever seen! Absolute masterpiece!",
    "Pretty good film, I enjoyed it",
    "It was okay, nothing special",
    "Not very good, quite boring",
    "Terrible movie, waste of time"
]

review_embeddings = model.encode(test_reviews)
similarities = cosine_similarity(review_embeddings, label_embeddings)

In [8]:
import numpy as np
# Complete this: Classify each review
print("Classification results:")

for i, review in enumerate(test_reviews):
    predicted_idx = np.argmax(similarities[i])
    confidence = similarities[i].max()

    top_two = sorted(similarities[i], reverse=True)[:2]
    margin = top_two[0] - top_two[1]

    print(f"\n'{review}'")
    print(f"  -> {sentiment_labels[predicted_idx]}")
    print(f"  Confidence: {confidence:.3f} | Margin: {margin:.3f}")

Classification results:

'This is the best movie I have ever seen! Absolute masterpiece!'
  -> extremely positive review
  Confidence: 0.256 | Margin: 0.099

'Pretty good film, I enjoyed it'
  -> somewhat positive review
  Confidence: 0.428 | Margin: 0.073

'It was okay, nothing special'
  -> somewhat negative review
  Confidence: 0.439 | Margin: 0.004

'Not very good, quite boring'
  -> somewhat positive review
  Confidence: 0.378 | Margin: 0.000

'Terrible movie, waste of time'
  -> extremely negative review
  Confidence: 0.426 | Margin: 0.009


In [9]:
# Analyze category confusion
print("\nCategory Confusion Analysis")
label_similarity = cosine_similarity(label_embeddings)

# Find most similar category pairs
confusions = []
for i in range(len(sentiment_labels)):
    for j in range(i+1, len(sentiment_labels)):
        sim = label_similarity[i][j]
        confusions.append((i, j, sim))

# Show top 5 confusions
for i, j, sim in sorted(confusions, key=lambda x: x[2], reverse=True)[:5]:
    print(f"{sentiment_labels[i]} <-> {sentiment_labels[j]}: {sim:.3f}")


Category Confusion Analysis
extremely negative review <-> somewhat negative review: 0.952
somewhat negative review <-> somewhat positive review: 0.863
somewhat positive review <-> extremely positive review: 0.818
somewhat negative review <-> neutral review: 0.794
extremely negative review <-> somewhat positive review: 0.766


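The classifier assigns each review to the most similar of the 5 sentiment categories. The **margin** (the difference between the top two similarity scores) indicates how decisive that choice is: a large margin means one label clearly wins, while a margin near zero means two labels are almost tied. In this run every margin is below 0.10, so none of the predictions is clear-cut. The two clearly positive reviews have the largest margins (0.099 and 0.073) and are labeled correctly. The moderate reviews are effectively ties: "It was okay, nothing special" lands on "somewhat negative" with a margin of 0.004, and "Not very good, quite boring" is mislabeled as "somewhat positive" with a margin of 0.000. Extreme wording does not guarantee a large margin either: "Terrible movie, waste of time" is labeled correctly, but with a margin of only 0.009.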

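The category confusion analysis above shows which label embeddings are most similar to each other. The two negative labels are nearly identical (0.952), which helps explain why negative reviews get such small margins. High similarity is not limited to adjacent categories: "somewhat negative" and "somewhat positive" score 0.863, higher than either does with "neutral", most likely because the two labels share the words "somewhat" and "review". All of the top 5 pairs exceed 0.7, so a similarity above 0.7 is the norm for these labels rather than a sign of an unusual pair.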
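
The margins printed for the five test reviews can be turned into a small triage step. The sketch below is a pure-Python illustration and not part of the original notebook: `flag_uncertain` is a hypothetical helper, the margins are copied from the printed output (rounded to 3 decimals), and the 0.05 cut-off is an arbitrary assumption that you would tune for your own data.

```python
# Margins copied from the printed classification results (rounded to 3 decimals).
reviews = [
    "This is the best movie I have ever seen! Absolute masterpiece!",
    "Pretty good film, I enjoyed it",
    "It was okay, nothing special",
    "Not very good, quite boring",
    "Terrible movie, waste of time",
]
margins = [0.099, 0.073, 0.004, 0.000, 0.009]


def flag_uncertain(margins, cutoff=0.05):
    """Return the indices whose top-two margin falls below the cutoff (hypothetical helper)."""
    return [i for i, m in enumerate(margins) if m < cutoff]


uncertain_idx = flag_uncertain(margins)
for i in uncertain_idx:
    print(f"Low margin ({margins[i]:.3f}): {reviews[i]}")
```

With this cut-off, the last three reviews are flagged for a closer look, which includes the one mislabeled review.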

#### Medium Task 2: Classifier Performance with Limited Training Data

Try different training sizes (100, 500, 1000, 2000, 5000) and record both F1 scores for each size.


In [10]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report
from datasets import load_dataset
import numpy as np

data = load_dataset("rotten_tomatoes")

# Experiment: Try different training sizes
train_size = 1000
# Uncomment to try different sizes:
# train_size = 100    # Very small - will it work?
# train_size = 500    # Small dataset
# train_size = 2000   # Medium dataset
# train_size = 5000   # Large dataset

test_size = 300

train_subset = data["train"].shuffle(seed=42).select(range(min(train_size, len(data["train"]))))
test_subset = data["test"].shuffle(seed=42).select(range(test_size))

print(f"Experiment: Training Size = {train_size}")

Experiment: Training Size = 1000


In [11]:
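print("\n[1/2] Testing Task-Specific Model...")
# The pipeline task is inferred from the model; device=-1 runs on CPU.
# Note: return_all_scores is deprecated in recent transformers releases in favor of top_k=None.
task_model = pipeline(
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    return_all_scores=True,
    device=-1
)
y_pred_task = []
for text in test_subset["text"]:
    # Scores come back in label-id order: 0 = negative, 1 = neutral, 2 = positive
    output = task_model(text)[0]
    neg_score = output[0]["score"]
    pos_score = output[2]["score"]
    # Ignore the neutral score and keep whichever polarity scores higher
    y_pred_task.append(1 if pos_score > neg_score else 0)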
task_f1 = f1_score(test_subset["label"], y_pred_task, average='weighted')
print(f" Task-Specific Model F1: {task_f1:.4f}")


[1/2] Testing Task-Specific Model...


 Task-Specific Model F1: 0.7709


Train the embedding-based classifier on your labeled data:

In [12]:
print(f"\n[2/2] Training Embedding Classifier on {train_size} samples...")
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
train_embeddings = embedding_model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = embedding_model.encode(test_subset["text"], show_progress_bar=False)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])
y_pred_embed = clf.predict(test_embeddings)
embed_f1 = f1_score(test_subset["label"], y_pred_embed, average='weighted')
print(f" Embedding Classifier F1: {embed_f1:.4f}")


[2/2] Training Embedding Classifier on 1000 samples...
 Embedding Classifier F1: 0.8699


Compare the two approaches and show example predictions:

In [13]:
print("\nResults Summary")
print(f"Training samples: {train_size}")
print(f"Task-Specific F1: {task_f1:.4f}")
print(f"Embedding F1: {embed_f1:.4f}")
print(f"Difference: {embed_f1 - task_f1:+.4f}")

print("\nExample predictions:")
for i in range(3):
    print(f"\n'{test_subset['text'][i][:50]}...'")
    print(f"Task-Specific: {y_pred_task[i]} | Embedding: {y_pred_embed[i]}")


Results Summary
Training samples: 1000
Task-Specific F1: 0.7709
Embedding F1: 0.8699
Difference: +0.0990

Example predictions:

'unpretentious , charming , quirky , original...'
Task-Specific: 1 | Embedding: 1

'a film really has to be exceptional to justify a t...'
Task-Specific: 0 | Embedding: 0

'working from a surprisingly sensitive script co-wr...'
Task-Specific: 1 | Embedding: 1
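
To compare runs at different training sizes, it can help to collect the scores in one place. The sketch below is one possible way to do that: `format_results_table` is a hypothetical helper that is not part of the notebook, and the only row filled in is the 1000-sample run reported above. Rows for other sizes are deliberately left out because they have to come from your own runs.

```python
def format_results_table(rows):
    """Render (train_size, task_f1, embed_f1) tuples as a Markdown table (hypothetical helper)."""
    lines = [
        "| Training size | Task-Specific F1 | Embedding F1 | Difference |",
        "|---|---|---|---|",
    ]
    for train_size, task_f1, embed_f1 in sorted(rows):
        diff = embed_f1 - task_f1
        lines.append(f"| {train_size} | {task_f1:.4f} | {embed_f1:.4f} | {diff:+.4f} |")
    return "\n".join(lines)


# Only the 1000-sample run is reported above; append one tuple per additional run.
results_rows = [(1000, 0.7709, 0.8699)]
table = format_results_table(results_rows)
print(table)
```

After each run with a new `train_size`, append `(train_size, task_f1, embed_f1)` to `results_rows` and print the table again.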


### Questions

1. At what training size did the embedding classifier match the task-specific model?

2. Were there cases where one model was correct and the other wrong?

3. Is 100 samples enough labeled data?
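
For question 2, one way to find the interesting cases is to list the indices where exactly one of the two models is correct. The sketch below uses made-up toy labels and a hypothetical helper, `find_disagreements`. In the notebook you would pass `test_subset["label"]`, `y_pred_task`, and `y_pred_embed` instead.

```python
def find_disagreements(y_true, pred_a, pred_b):
    """Return indices where exactly one of the two prediction lists matches the true label."""
    return [
        i
        for i, (truth, a, b) in enumerate(zip(y_true, pred_a, pred_b))
        if (a == truth) != (b == truth)
    ]


# Hypothetical toy data, for illustration only.
y_true_demo = [1, 0, 1, 0, 1]
task_demo = [1, 0, 0, 0, 0]
embed_demo = [1, 0, 1, 1, 0]

disagreements = find_disagreements(y_true_demo, task_demo, embed_demo)
for i in disagreements:
    winner = "task-specific" if task_demo[i] == y_true_demo[i] else "embedding"
    print(f"Example {i}: only the {winner} model is correct")
```

Cases where both models are wrong are excluded on purpose, since they say nothing about which model is better.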


#### Medium Task 3: Confidence-Based Classifier with Uncertainty Handling

In production, declining to predict is often better than making a wrong prediction. The key insight: when your model is uncertain, it should say "I don't know" rather than guess. This creates a trade-off between coverage (the share of inputs you make a prediction for) and accuracy (how often those predictions are right).

Try this:
- Run with a threshold of 0.15 first
- Test 0.05, 0.30, and 0.50 to see how the trade-off shifts (in the run shown below every margin is under 0.07, so also try values between 0.01 and 0.06)
- Check the uncertain cases (they typically use hedging language)
- Experiment with an alternative uncertainty measure

In [14]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [15]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Reviews with varying levels of clarity
test_reviews = [
    "Absolutely fantastic! Best movie ever!",           # Clear positive
    "Pretty good, I liked it",                          # Weak positive
    "It was fine, nothing special",                     # Ambiguous
    "Not bad but not great either",                     # Very ambiguous
    "Quite disappointing",                              # Weak negative
    "Terrible! Complete waste of time!",                # Clear negative
    "The movie had some interesting moments",           # Ambiguous positive
    "Outstanding performances all around!",             # Clear positive
]

# True labels (for evaluation)
y_true = [1, 1, 0, 0, 0, 0, 1, 1]  # 1=positive, 0=negative

In [16]:
labels = ["negative", "positive"]

# Experiment: Try different confidence thresholds
confidence_threshold = 0.15

# Uncomment to experiment with different thresholds:
# confidence_threshold = 0.05   # Low - predicts for most reviews
# confidence_threshold = 0.30   # High - more uncertain cases
# confidence_threshold = 0.50   # Very high - very selective

print(f"Using confidence threshold: {confidence_threshold}")

Using confidence threshold: 0.15


In [17]:
def calculate_margin(similarities):
    """
    Calculate margin between top two predictions
    Hint: Sort similarities in descending order, then subtract 2nd from 1st
    """
    sorted_sims = np.sort(similarities)[::-1]
    margin = sorted_sims[0] - sorted_sims[1]
    return margin
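
The margin is only one way to quantify uncertainty. As a possible alternative, the sketch below turns the similarity scores into a probability distribution with a softmax and measures its normalized entropy. Everything here is an assumption rather than the notebook's own method: the helper names `softmax` and `normalized_entropy`, the temperature of 0.05, and the two toy similarity rows are all illustrative.

```python
import math


def softmax(scores, temperature=0.05):
    """Convert raw similarity scores into probabilities; the temperature is an assumed value."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(scores, temperature=0.05):
    """Entropy of the softmax distribution scaled to [0, 1]; needs at least two scores."""
    probs = softmax(scores, temperature)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# Hypothetical similarity rows for the two labels ["negative", "positive"].
clear_case = [0.10, 0.40]
close_call = [0.25, 0.26]

print(f"clear case: {normalized_entropy(clear_case):.3f}")
print(f"close call: {normalized_entropy(close_call):.3f}")
```

A value near 0 means one label dominates and a value near 1 means the labels are almost tied, so you would abstain when the score exceeds a cut-off of your choosing. Unlike the margin, this measure uses all label similarities, which may matter once there are more than two labels.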

In [18]:
label_embeddings = model.encode(labels)
review_embeddings = model.encode(test_reviews)
sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

In [19]:
# Complete this: Classify with confidence threshold
results = []

print(f"Confidence-based classification (threshold={confidence_threshold})")

for i, review in enumerate(test_reviews):
    similarities = sim_matrix[i]
    predicted_idx = np.argmax(similarities)

    margin = calculate_margin(similarities)

    if margin >= confidence_threshold:
        prediction = predicted_idx
    else:
        prediction = None

    pred_label = labels[predicted_idx] if prediction is not None else "uncertain"
    print(f"{i+1}. '{review[:40]}...' -> {pred_label} (margin: {margin:.3f})")

    results.append({'pred': prediction, 'true': y_true[i], 'margin': margin})

Confidence-based classification (threshold=0.15)
1. 'Absolutely fantastic! Best movie ever!...' -> uncertain (margin: 0.047)
2. 'Pretty good, I liked it...' -> uncertain (margin: 0.033)
3. 'It was fine, nothing special...' -> uncertain (margin: 0.055)
4. 'Not bad but not great either...' -> uncertain (margin: 0.003)
5. 'Quite disappointing...' -> uncertain (margin: 0.061)
6. 'Terrible! Complete waste of time!...' -> uncertain (margin: 0.062)
7. 'The movie had some interesting moments...' -> uncertain (margin: 0.054)
8. 'Outstanding performances all around!...' -> uncertain (margin: 0.068)


In [20]:
# Calculate metrics
made_predictions = [r for r in results if r['pred'] is not None]
correct = [r for r in made_predictions if r['pred'] == r['true']]

coverage = len(made_predictions) / len(results)
# Accuracy is undefined when no predictions are made; it is reported as 0 in that case
accuracy = len(correct) / len(made_predictions) if made_predictions else 0

print(f"\nPerformance:")
print(f"Coverage: {len(made_predictions)}/{len(results)} = {coverage:.1%}")
print(f"Accuracy: {len(correct)}/{len(made_predictions)} = {accuracy:.1%}")


Performance:
Coverage: 0/8 = 0.0%
Accuracy: 0/0 = 0.0%
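
With a threshold of 0.15 nothing clears the bar, so it helps to see how coverage responds to the threshold. The sketch below is an illustration under stated assumptions: the margins are copied from the printed output above (rounded to 3 decimals), and `coverage_at` is a hypothetical helper. Accuracy is left out because it also needs the per-review predictions, which the printed output does not show.

```python
# Margins copied from the printed output of the threshold=0.15 run (rounded to 3 decimals).
margins = [0.047, 0.033, 0.055, 0.003, 0.061, 0.062, 0.054, 0.068]


def coverage_at(margins, threshold):
    """Fraction of reviews whose margin meets the threshold (hypothetical helper)."""
    kept = [m for m in margins if m >= threshold]
    return len(kept) / len(margins)


sweep = {t: coverage_at(margins, t) for t in (0.01, 0.03, 0.05, 0.15, 0.30, 0.50)}
for threshold, coverage in sweep.items():
    print(f"threshold={threshold:.2f} -> coverage {coverage:.1%}")
```

On these margins, coverage falls from 5 of 8 reviews at a threshold of 0.05 to none at 0.15, which suggests that useful thresholds for raw cosine-similarity margins sit well below the values you would use for probabilities.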


### Questions

1. What do uncertain reviews have in common? Are they using hedging language like "kind of" or "somewhat"?

2. Compare results at threshold=0.05 vs 0.30. Describe the coverage vs accuracy trade-off. When would you want high coverage vs high accuracy?

3. How could you use confidence-based prediction in production? What should a system do when the model is uncertain?

#### Medium Task 4: Classifier Failure Analysis

Train the classifier and see what kinds of reviews it gets wrong. Then add your own test cases.


In [21]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from datasets import load_dataset
import numpy as np

Load data and train a classifier.

In [26]:
data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_subset = data["train"].shuffle(seed=42).select(range(1000))
test_subset = data["test"].shuffle(seed=42).select(range(200))

print("Training classifier...")
train_embeddings = model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = model.encode(test_subset["text"], show_progress_bar=False)

clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])

predictions = clf.predict(test_embeddings)
probabilities = clf.predict_proba(test_embeddings)

Training classifier...


Analyze the errors.

In [23]:
# Complete this: Find and collect errors
errors = []

# Read the columns once instead of re-fetching them on every iteration
texts = test_subset["text"]
true_labels = test_subset["label"]

for i in range(len(test_subset)):
    # Check if prediction is wrong
    # Hint: Compare predictions[i] with test_subset["label"][i]
    if predictions[i] != true_labels[i]:  # prediction != true_label
        errors.append({
            'text': texts[i],
            'true_label': true_labels[i],
            'predicted_label': predictions[i],
            # Probability the classifier assigned to the (wrong) class it predicted
            'confidence': probabilities[i][predictions[i]]
        })

accuracy = (len(test_subset) - len(errors)) / len(test_subset)
print(f"Accuracy: {accuracy:.1%}")
print(f"Errors: {len(errors)}")

Accuracy: 87.0%
Errors: 26


Look at high-confidence errors (the most surprising mistakes).

In [24]:
# Complete this: Filter for high-confidence errors
# Hint: Use list comprehension [e for e in errors if condition]
high_conf_errors = [e for e in errors if e['confidence'] > 0.7]

print("High-confidence errors:")
if high_conf_errors:
    for error in high_conf_errors[:5]:
        true_sent = "positive" if error['true_label'] == 1 else "negative"
        pred_sent = "positive" if error['predicted_label'] == 1 else "negative"
        print(f"\n'{error['text'][:50]}...'")
        print(f"True: {true_sent} | Predicted: {pred_sent} (conf: {error['confidence']:.3f})")

High-confidence errors:

'an uneasy mix of run-of-the-mill raunchy humor and...'
True: negative | Predicted: positive (conf: 0.701)

'the stunt work is top-notch ; the dialogue and dra...'
True: negative | Predicted: positive (conf: 0.867)

'goldmember is funny enough to justify the embarras...'
True: positive | Predicted: negative (conf: 0.710)

'steven soderbergh doesn't remake andrei tarkovsky'...'
True: positive | Predicted: negative (conf: 0.730)

'" what really happened ? " is a question for philo...'
True: positive | Predicted: negative (conf: 0.717)


Test on edge cases like sarcasm and mixed sentiment.

In [25]:
# Complete this: Test edge cases
edge_cases = [
    ("Sarcastic", "Oh great, another masterpiece. not!", 0),
    ("Mixed", "The acting was great but the plot was terrible", 0),
    ("Ambiguous", "It was a movie", 0),
]

# Experiment: Add your own edge cases!
# Try adding:
# - Double negatives: ("Double neg", "Not bad at all", 1)
# - Very short: ("Short", "Meh", 0)
# - Emojis/slang: ("Slang", "This movie slaps fr fr", 1)
#
# edge_cases.append(("Your category", "Your test text", expected_label))

# Get embeddings and predictions
# Hint: Use list comprehension [text for _, text, _ in edge_cases]
edge_embeddings = model.encode([text for _, text, _ in edge_cases])
edge_predictions = clf.predict(edge_embeddings)

print("\nEdge Cases:")
correct = 0
for i, (category, text, true_label) in enumerate(edge_cases):
    pred = edge_predictions[i]
    match = "correct" if pred == true_label else "wrong"
    if pred == true_label:
        correct += 1
    print(f"{category}: '{text}' -> {pred} ({match})")

print(f"\nEdge case accuracy: {correct}/{len(edge_cases)}")


Edge Cases:
Sarcastic: 'Oh great, another masterpiece. not!' -> 1 (wrong)
Mixed: 'The acting was great but the plot was terrible' -> 0 (correct)
Ambiguous: 'It was a movie' -> 0 (correct)

Edge case accuracy: 2/3


### Questions

1. What do high-confidence errors have in common?

2. Do errors tend to be shorter or longer than correct predictions?

3. Which edge cases failed most - sarcasm, mixed sentiment, or double negatives?
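
For question 2, a quick comparison of average review length is a reasonable starting point. The sketch below is only an illustration: `mean_word_count` is a hypothetical helper and the texts are made-up stand-ins, so the numbers say nothing about the real dataset. In the notebook you would build `error_texts` from `errors` and `correct_texts` from the remaining test reviews.

```python
from statistics import mean


def mean_word_count(texts):
    """Average number of whitespace-separated tokens per text (0.0 for an empty list)."""
    return mean(len(t.split()) for t in texts) if texts else 0.0


# Hypothetical stand-ins, not taken from the dataset.
error_texts = [
    "the action is impressive but the story never comes together in any meaningful way",
    "funny in places yet oddly hard to recommend",
]
correct_texts = [
    "a charming film",
    "dull and far too long",
]

print(f"Mean length of errors:  {mean_word_count(error_texts):.1f} words")
print(f"Mean length of correct: {mean_word_count(correct_texts):.1f} words")
```

Word count is a crude proxy. If a gap shows up on the real data, it is worth reading the long errors to check whether mixed sentiment, rather than length itself, is the cause.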
