# Chapter 4: Text Classification - Medium Tasks (Solutions)

Complete working solutions.

---

## Setup

Run all cells in this section to set up the environment and load necessary data.


### [Optional] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter:

---

**Note**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

### Data Loading


In [2]:
from datasets import load_dataset
# Load our data
data = load_dataset("rotten_tomatoes")
data



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

### Helper Functions


In [3]:
from sklearn.metrics import classification_report
def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

## Your Turn - Text Classification Experiments

Run each task first to see the baseline results. Follow the instructions to modify and experiment.

---

## Medium Tasks


### Medium Tasks - Building Real Classifiers

These tasks require more modification and experimentation. You'll build complete classification systems.

Run the code to see how 5-level classification works. Then try adding a 6th category.


#### Medium Task 1: Multi-Class Sentiment Classification
In this task, you'll build a sentiment classifier with 5 different categories (from extremely negative to extremely positive) instead of just binary positive/negative.

**What to do:**

1. Run the cells below to see baseline 5-level classification
2. Observe which reviews are uncertain (low margin between top predictions)
3. Try a 6-level version by adding a sixth category to `sentiment_labels` for more granularity
4. Compare how predictions change with more categories

Load the embedding model:

In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

Set up the 5 sentiment categories and compute their embeddings:

In [5]:
# Define 5 sentiment categories
sentiment_labels = [
    "extremely negative review",
    "somewhat negative review",
    "neutral review",
    "somewhat positive review",
    "extremely positive review"
]

# Create embeddings for each category
label_embeddings = model.encode(sentiment_labels)

print("Sentiment categories:")
for i, label in enumerate(sentiment_labels):
    print(f"  {i}: {label}")

Sentiment categories:
  0: extremely negative review
  1: somewhat negative review
  2: neutral review
  3: somewhat positive review
  4: extremely positive review


Classify each sample review and show its confidence score and margin:

In [6]:
# Test on some sample reviews
test_reviews = [
    "This is the best movie I have ever seen! Absolute masterpiece!",
    "Pretty good film, I enjoyed it",
    "It was okay, nothing special",
    "Not very good, quite boring",
    "Terrible movie, waste of time"
]

review_embeddings = model.encode(test_reviews)
similarities = cosine_similarity(review_embeddings, label_embeddings)

print("Classification results:")
for i, review in enumerate(test_reviews):
    predicted_category = sentiment_labels[similarities[i].argmax()]
    confidence = similarities[i].max()
    top_two_scores = sorted(similarities[i], reverse=True)[:2]
    margin = top_two_scores[0] - top_two_scores[1]

    print(f"\n'{review}'")
    print(f"  → {predicted_category}")
    print(f"  Confidence: {confidence:.3f} | Margin: {margin:.3f}")

Classification results:

'This is the best movie I have ever seen! Absolute masterpiece!'
  → extremely positive review
  Confidence: 0.256 | Margin: 0.099

'Pretty good film, I enjoyed it'
  → somewhat positive review
  Confidence: 0.428 | Margin: 0.073

'It was okay, nothing special'
  → somewhat negative review
  Confidence: 0.439 | Margin: 0.004

'Not very good, quite boring'
  → somewhat positive review
  Confidence: 0.378 | Margin: 0.000

'Terrible movie, waste of time'
  → extremely negative review
  Confidence: 0.426 | Margin: 0.009
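
For step 3 of the task, here is a minimal sketch of a 6-level variant. The sixth label, `"mixed review"`, is only one possible choice and is an assumption, as is the helper name `classify_with_margin`. The demo uses made-up toy vectors rather than real embeddings so that it runs without downloading a model.

```python
import numpy as np

# Hypothetical 6-level label set: the original 5 labels plus one extra.
six_level_labels = [
    "extremely negative review",
    "somewhat negative review",
    "neutral review",
    "mixed review",  # assumed wording for the new sixth category
    "somewhat positive review",
    "extremely positive review",
]

def classify_with_margin(review_embeddings, label_embeddings, labels):
    """Return (label, top similarity, margin) for each review embedding."""
    reviews = np.asarray(review_embeddings, dtype=float)
    label_vecs = np.asarray(label_embeddings, dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors.
    reviews = reviews / np.linalg.norm(reviews, axis=1, keepdims=True)
    label_vecs = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    sims = reviews @ label_vecs.T
    results = []
    for row in sims:
        order = np.argsort(row)[::-1]  # label indices, best match first
        top, runner_up = row[order[0]], row[order[1]]
        results.append((labels[order[0]], float(top), float(top - runner_up)))
    return results

# Toy demo: one-hot "label embeddings" and a single made-up review vector.
toy_label_vecs = np.eye(len(six_level_labels))
toy_review = [[0.1, 0.0, 0.0, 0.2, 0.3, 0.9]]
result = classify_with_margin(toy_review, toy_label_vecs, six_level_labels)
print(result[0])
```

In the notebook itself, the same helper could be called as `classify_with_margin(model.encode(test_reviews), model.encode(six_level_labels), six_level_labels)` and compared against the 5-level output above.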


In [7]:
print()
print("Category Confusion Analysis")
label_similarity = cosine_similarity(label_embeddings)
print(f"{'Category Pair':<60s} {'Similarity':<12s}")
confusions = []
for i in range(len(sentiment_labels)):
    for j in range(i+1, len(sentiment_labels)):
        sim = label_similarity[i][j]
        confusions.append((i, j, sim))
for i, j, sim in sorted(confusions, key=lambda x: x[2], reverse=True)[:10]:
    pair_name = f"{sentiment_labels[i]} <-> {sentiment_labels[j]}"
    marker = " " if sim > 0.7 else ""
    print(f"{marker}{pair_name:<60s} {sim:.3f}")


CATEGORY CONFUSION ANALYSIS
Category Pair                                                Similarity  
 extremely negative review <-> somewhat negative review       0.952
 somewhat negative review <-> somewhat positive review        0.863
 somewhat positive review <-> extremely positive review       0.818
 somewhat negative review <-> neutral review                  0.794
 extremely negative review <-> somewhat positive review       0.766
 extremely negative review <-> neutral review                 0.748
 neutral review <-> somewhat positive review                  0.748
somewhat negative review <-> extremely positive review       0.674
extremely negative review <-> extremely positive review      0.664
neutral review <-> extremely positive review                 0.616


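As you can see, the classifier assigns each review to one of the 5 sentiment categories. The **margin** (the difference between the top 2 similarity scores) indicates confidence: large margins (>0.15) mean the model is confident, while small margins (<0.05) indicate uncertainty. In the run above, no review reaches a large margin. The strongly worded "best movie I have ever seen" review has the widest margin (0.099), while "It was okay, nothing special" (0.004), "Not very good, quite boring" (0.000), and "Terrible movie, waste of time" (0.009) are near-ties between their top two labels. The "quite boring" review is even assigned to "somewhat positive", so treat a tiny margin as a sign that the predicted label is unreliable.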

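The category confusion analysis shows which categories are most similar to each other.

Notice that the two negative categories are nearly indistinguishable (0.952) and that 7 of the 10 label pairs score above 0.7, which explains why the model often confuses them. Adjacent categories (like "somewhat negative" and "neutral", 0.794) tend to score high, but so do "somewhat negative" and "somewhat positive" (0.863), likely because the two labels share most of their wording. Categories with similarity > 0.7 are particularly prone to confusion.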
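
To check the adjacency claim numerically, the sketch below groups the similarities reported in the output above by how far apart two labels sit on the 5-point scale. The helper names are hypothetical, and the numbers are copied from this run, so they will differ with another model or label wording.

```python
# Pairwise label similarities copied from the confusion analysis output.
# Indices: 0 = extremely negative, 1 = somewhat negative, 2 = neutral,
#          3 = somewhat positive, 4 = extremely positive
reported = {
    (0, 1): 0.952, (1, 3): 0.863, (3, 4): 0.818, (1, 2): 0.794,
    (0, 3): 0.766, (0, 2): 0.748, (2, 3): 0.748, (1, 4): 0.674,
    (0, 4): 0.664, (2, 4): 0.616,
}

def confusable_pairs(similarities, threshold=0.7):
    """Label pairs whose similarity exceeds the threshold, highest first."""
    pairs = [(pair, sim) for pair, sim in similarities.items() if sim > threshold]
    return sorted(pairs, key=lambda item: item[1], reverse=True)

def mean_similarity_by_distance(similarities):
    """Average similarity grouped by the distance between label indices."""
    buckets = {}
    for (i, j), sim in similarities.items():
        buckets.setdefault(abs(i - j), []).append(sim)
    return {dist: sum(vals) / len(vals) for dist, vals in sorted(buckets.items())}

print(f"Pairs above 0.7: {len(confusable_pairs(reported))}")
for distance, avg in mean_similarity_by_distance(reported).items():
    print(f"  {distance} step(s) apart: mean similarity {avg:.3f}")
```

If the average drops as the distance grows, the labels are at least ordered sensibly in embedding space, even when individual pairs break the pattern.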

#### Medium Task 2: Classifier Performance with Limited Training Data

Try different training sizes (100, 500, 1000, 2000) and fill in the results table.


In [8]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report
from datasets import load_dataset
import numpy as np
data = load_dataset("rotten_tomatoes")
# Try different values: 100, 500, 1000, 2000, 5000
train_size = 1000
test_size = 300
train_subset = data["train"].shuffle(seed=42).select(range(min(train_size, len(data["train"]))))
test_subset = data["test"].shuffle(seed=42).select(range(test_size))
print(f"Experiment: Training Size = {train_size}")

EXPERIMENT: Training Size = 1000


In [9]:
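print("\n[1/2] Testing Task-Specific Model...")
# Pre-trained 3-class sentiment model; it is not trained on our data
task_model = pipeline(
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    return_all_scores=True,  # return a score for every class, not just the top one
    device=-1  # -1 = CPU; use device=0 to run on the first GPU
)
y_pred_task = []
for text in test_subset["text"]:
    output = task_model(text)[0]
    # Scores are read by position: 0 = negative, 1 = neutral, 2 = positive
    neg_score = output[0]["score"]
    pos_score = output[2]["score"]
    # Ignore the neutral score and keep the stronger of negative vs. positive
    y_pred_task.append(1 if pos_score > neg_score else 0)
task_f1 = f1_score(test_subset["label"], y_pred_task, average='weighted')
print(f" Task-Specific Model F1: {task_f1:.4f}")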


[1/2] Testing Task-Specific Model...




 Task-Specific Model F1: 0.7709
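
The cell above reads the negative and positive scores by position, which breaks silently if the scores come back in a different order. A label-keyed lookup is a more defensive alternative. The sketch below assumes the labels are named `negative`, `neutral`, and `positive` (check `task_model.model.config.id2label` to confirm); `to_binary_sentiment` and the example scores are hypothetical.

```python
def to_binary_sentiment(scores):
    """Map a list of {'label', 'score'} dicts to 1 (positive) or 0 (negative).

    The neutral score is ignored, mirroring the positional logic in the
    notebook cell. Label names are an assumption about the model config.
    """
    by_label = {item["label"].lower(): item["score"] for item in scores}
    return 1 if by_label["positive"] > by_label["negative"] else 0

# Made-up scores, deliberately not in negative/neutral/positive order.
example_scores = [
    {"label": "neutral", "score": 0.50},
    {"label": "positive", "score": 0.35},
    {"label": "negative", "score": 0.15},
]
print(to_binary_sentiment(example_scores))
```

Because the lookup is by name, the result stays the same regardless of how the pipeline sorts its output.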


Train the embedding-based classifier on your labeled data:

In [10]:
print(f"\n[2/2] Training Embedding Classifier on {train_size} samples...")
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
train_embeddings = embedding_model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = embedding_model.encode(test_subset["text"], show_progress_bar=False)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])
y_pred_embed = clf.predict(test_embeddings)
embed_f1 = f1_score(test_subset["label"], y_pred_embed, average='weighted')
print(f" Embedding Classifier F1: {embed_f1:.4f}")


[2/2] Training Embedding Classifier on 1000 samples...
 Embedding Classifier F1: 0.8699


Compare the two approaches and show example predictions:

In [11]:
print("\nResults Summary")
print(f"Training samples: {train_size}")
print(f"Task-Specific F1: {task_f1:.4f}")
print(f"Embedding F1: {embed_f1:.4f}")
print(f"Difference: {embed_f1 - task_f1:+.4f}")

print("\nExample predictions:")
for i in range(3):
    print(f"\n'{test_subset['text'][i][:50]}...'")
    print(f"Task-Specific: {y_pred_task[i]} | Embedding: {y_pred_embed[i]}")


RESULTS SUMMARY
Training samples used: 1000

Task-Specific (pre-trained):  F1 = 0.7709
Embedding + Classifier:       F1 = 0.8699
Difference:                       +0.0990

 Embedding approach WINS with 1000 samples!

EXAMPLE PREDICTIONS (first 5)

1. 'unpretentious , charming , quirky , original...'
   True: Positive
   Task-Specific: Positive 
   Embedding:     Positive 

2. 'a film really has to be exceptional to justify a three hour ...'
   True: Negative
   Task-Specific: Negative 
   Embedding:     Negative 

3. 'working from a surprisingly sensitive script co-written by g...'
   True: Positive
   Task-Specific: Positive 
   Embedding:     Positive 

4. 'it may not be particularly innovative , but the film's crisp...'
   True: Positive
   Task-Specific: Positive 
   Embedding:     Positive 

5. 'such a premise is ripe for all manner of lunacy , but kaufma...'
   True: Negative
   Task-Specific: Negative 
   Embedding:     Negative 

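TODO: Record your results

| Training size | Task-Specific F1 | Embedding F1 |
|---------------|------------------|--------------|
| 100           |                  |              |
| 500           |                  |              |
| 1000          | 0.7709           | 0.8699       |
| 2000          |                  |              |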
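
Rather than editing `train_size` by hand for every run, a small loop can collect one result per size. The sketch below shows only the bookkeeping: `sweep_training_sizes`, `format_results_table`, and `placeholder_evaluate` are hypothetical names, and the placeholder returns dummy numbers, not real F1 scores. In the notebook you would replace it with a function that wraps the embedding and `LogisticRegression` cells and returns `embed_f1` for a given size.

```python
def sweep_training_sizes(sizes, evaluate):
    """Call evaluate(size) for each training size and return (size, f1) rows."""
    return [(size, evaluate(size)) for size in sizes]

def format_results_table(rows):
    """Render (size, f1) rows as a markdown table."""
    lines = ["| Training size | Embedding F1 |", "|---|---|"]
    for size, f1 in rows:
        lines.append(f"| {size} | {f1:.4f} |")
    return "\n".join(lines)

def placeholder_evaluate(train_size):
    # Stand-in only: these values are NOT real results. Swap in a function
    # that trains on `train_size` samples and returns the weighted F1.
    return min(1.0, train_size / 10_000)

rows = sweep_training_sizes([100, 500, 1000, 2000], placeholder_evaluate)
table = format_results_table(rows)
print(table)
```

Encoding the full training split once and slicing the embeddings per size would avoid recomputing them on every iteration.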

### Questions

1. At what training size did the embedding classifier match the task-specific model?

2. Were there cases where one model was correct and the other wrong?

3. Is 100 samples enough labeled data?
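
For question 2, one way to find the disagreements is to collect the indices where exactly one model is correct. This is a minimal sketch with a hypothetical helper and made-up labels; in the notebook the inputs would be `test_subset["label"]`, `y_pred_task`, and `y_pred_embed`.

```python
def find_disagreements(y_true, preds_a, preds_b):
    """Return two index lists: where only A is correct, and where only B is."""
    only_a, only_b = [], []
    for i, (truth, a, b) in enumerate(zip(y_true, preds_a, preds_b)):
        if a == truth and b != truth:
            only_a.append(i)
        elif b == truth and a != truth:
            only_b.append(i)
    return only_a, only_b

# Made-up labels for illustration only.
y_true_demo = [1, 0, 1, 0]
task_demo = [1, 0, 0, 1]    # stands in for y_pred_task
embed_demo = [1, 1, 1, 1]   # stands in for y_pred_embed

only_task, only_embed = find_disagreements(y_true_demo, task_demo, embed_demo)
print(f"Only task-specific correct: {only_task}")
print(f"Only embedding correct:     {only_embed}")
```

Reading the review text at those indices usually says more about each model's blind spots than the aggregate F1 does.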


#### Medium Task 3: Confidence-Based Classifier with Uncertainty Handling

In production, refusing a prediction beats making a wrong one. Here's the key insight: when your model is uncertain, it should say "I don't know" rather than guessing. This creates a trade-off between coverage (how many predictions you make) and accuracy (how often you're right).

Try this:
- Run with threshold of 0.15 first
- Test 0.05, 0.30, and 0.50 to see how the trade-off shifts
- Check the uncertain cases (typically have hedging language)
- Experiment with the alternative uncertainty measure

In [12]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [13]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Reviews with varying levels of clarity
test_reviews = [
    "Absolutely fantastic! Best movie ever!",           # Clear positive
    "Pretty good, I liked it",                          # Weak positive
    "It was fine, nothing special",                     # Ambiguous
    "Not bad but not great either",                     # Very ambiguous
    "Quite disappointing",                              # Weak negative
    "Terrible! Complete waste of time!",                # Clear negative
    "The movie had some interesting moments",           # Ambiguous positive
    "Outstanding performances all around!",             # Clear positive
]

# True labels (for evaluation)
y_true = [1, 1, 0, 0, 0, 0, 1, 1]  # 1=positive, 0=negative

In [14]:
labels = ["negative", "positive"]

# Experiment with this value. Try: 0.05, 0.15, 0.30, 0.50
confidence_threshold = 0.15

In [15]:
def calculate_margin(similarities):
    """
    Margin = difference between top two predictions
    Small margin = uncertain (predictions are close)
    """
    sorted_sims = np.sort(similarities)[::-1]
    margin = sorted_sims[0] - sorted_sims[1]
    return margin

# After first run, uncomment this alternative uncertainty measure:
# def calculate_margin(similarities):
#     """
#     Alternative: Use absolute confidence in top prediction
#     Low confidence = uncertain
#     """
#     max_confidence = np.max(similarities)
#     # Convert to margin-like score (higher = more certain)
#     # If max is 0.6, margin = 0.6 - 0.5 = 0.1 (uncertain)
#     # If max is 0.9, margin = 0.9 - 0.5 = 0.4 (certain)
#     margin = max_confidence - 0.5
#     return margin

In [16]:
label_embeddings = model.encode(labels)
review_embeddings = model.encode(test_reviews)
sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

In [17]:
# Classify with confidence threshold
results = []
predictions = []

print(f"Confidence-Based Classification (threshold={confidence_threshold})")

for i, review in enumerate(test_reviews):
    similarities = sim_matrix[i]
    predicted_idx = np.argmax(similarities)
    margin = calculate_margin(similarities)
    
    # Predict only if confident enough
    if margin >= confidence_threshold:
        prediction = predicted_idx
        predictions.append(prediction)
    else:
        prediction = None
        predictions.append(None)
    
    pred_label = labels[predicted_idx] if prediction is not None else "uncertain"
    print(f"{i+1}. '{review}' -> {pred_label} (margin: {margin:.3f})")
    
    results.append({'pred': prediction, 'true': y_true[i], 'margin': margin})

CONFIDENCE-BASED CLASSIFICATION (threshold=0.15)

1. 'Absolutely fantastic! Best movie ever!'
   True label: Positive
   Prediction: UNCERTAIN
   Top confidence: 0.173
   Margin: 0.047 ✗ Below threshold
   Status: UNCERTAIN

2. 'Pretty good, I liked it'
   True label: Positive
   Prediction: UNCERTAIN
   Top confidence: 0.217
   Margin: 0.033 ✗ Below threshold
   Status: UNCERTAIN

3. 'It was fine, nothing special'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.254
   Margin: 0.055 ✗ Below threshold
   Status: UNCERTAIN

4. 'Not bad but not great either'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.209
   Margin: 0.003 ✗ Below threshold
   Status: UNCERTAIN

5. 'Quite disappointing'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.276
   Margin: 0.061 ✗ Below threshold
   Status: UNCERTAIN

6. 'Terrible! Complete waste of time!'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.324
   Margin: 0.062

In [18]:
# Calculate metrics
made_predictions = [r for r in results if r['pred'] is not None]
correct = [r for r in made_predictions if r['pred'] == r['true']]

coverage = len(made_predictions) / len(results)
accuracy = len(correct) / len(made_predictions) if made_predictions else 0

print(f"\nPerformance:")
print(f"Coverage: {len(made_predictions)}/{len(results)} = {coverage:.1%}")
print(f"Accuracy: {len(correct)}/{len(made_predictions)} = {accuracy:.1%}")


PERFORMANCE ANALYSIS

Coverage: 0/8 = 0.0%
  → Made predictions for 0 reviews
  → Refused to predict on 8 reviews

Accuracy (on predictions made): 0/0 = 0.0%
  → Of the 0 predictions, 0 were correct

Trade-off Analysis:
  Threshold = 0.15
  → Higher threshold = fewer predictions but higher accuracy
  → Lower threshold = more predictions but lower accuracy
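
To see the trade-off without rerunning the cells for every threshold, the margins can be computed once and then filtered at several cut-offs. The sketch below uses a hypothetical helper and made-up margins and labels; in the notebook you could pass `[r['margin'] for r in results]`, `np.argmax(sim_matrix, axis=1)`, and `y_true`.

```python
def coverage_and_accuracy(margins, predicted, y_true, threshold):
    """Coverage and accuracy when predicting only if margin >= threshold."""
    kept = [(p, t) for m, p, t in zip(margins, predicted, y_true) if m >= threshold]
    coverage = len(kept) / len(margins)
    # Accuracy is undefined when nothing is predicted, so return None
    # instead of 0 to keep "no predictions" apart from "all wrong".
    accuracy = sum(p == t for p, t in kept) / len(kept) if kept else None
    return coverage, accuracy

# Made-up values for illustration only.
demo_margins = [0.30, 0.20, 0.10, 0.04, 0.02]
demo_predicted = [1, 0, 1, 1, 0]
demo_true = [1, 0, 1, 0, 1]

for threshold in (0.05, 0.15, 0.30, 0.50):
    cov, acc = coverage_and_accuracy(demo_margins, demo_predicted, demo_true, threshold)
    acc_text = "n/a" if acc is None else f"{acc:.1%}"
    print(f"threshold={threshold:.2f}  coverage={cov:.1%}  accuracy={acc_text}")
```

Since every margin shown in the run above is below 0.07, thresholds of 0.15 and higher are likely to refuse most or all predictions for this label set, so the lower thresholds are the informative ones here.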


### Questions

1. What do uncertain reviews have in common? Are they using hedging language like "kind of" or "somewhat"?

2. Compare results at threshold=0.05 vs 0.30. Describe the coverage vs accuracy trade-off. When would you want high coverage vs high accuracy?

3. How could you use confidence-based prediction in production? What should a system do when the model is uncertain?

#### Medium Task 4: Classifier Failure Analysis

Train the classifier and see what kinds of reviews it gets wrong. Then add your own test cases.


In [19]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from datasets import load_dataset
import numpy as np

Load data and train a classifier.

In [20]:
data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_subset = data["train"].shuffle(seed=42).select(range(1000))
test_subset = data["test"].shuffle(seed=42).select(range(200))

print("Training classifier...")
train_embeddings = model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = model.encode(test_subset["text"], show_progress_bar=False)

clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])

predictions = clf.predict(test_embeddings)
probabilities = clf.predict_proba(test_embeddings)

Training classifier...


Analyze the errors.

In [21]:
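# Read the columns once instead of re-indexing the dataset on every iteration
test_texts = test_subset["text"]
test_labels = test_subset["label"]

errors = []
for i, (text, true_label) in enumerate(zip(test_texts, test_labels)):
    if predictions[i] != true_label:
        errors.append({
            'index': i,
            'text': text,
            'true_label': true_label,
            'predicted_label': predictions[i],
            # Probability the classifier assigned to its (wrong) prediction
            'confidence': probabilities[i][predictions[i]],
            'length': len(text.split())  # word count
        })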

total_errors = len(errors)
total_samples = len(test_subset)
accuracy = (total_samples - total_errors) / total_samples

print(f"Overall: {total_samples - total_errors}/{total_samples} correct ({accuracy:.1%})")
print(f"Errors: {total_errors}")

Overall: 174/200 correct (87.0%)
Errors: 26


Look at high-confidence errors (the most surprising mistakes).

In [22]:
high_conf_errors = [e for e in errors if e['confidence'] > 0.7]

print("High-Confidence Errors:")
for error in high_conf_errors[:5]:
    true_sent = "positive" if error['true_label'] == 1 else "negative"
    pred_sent = "positive" if error['predicted_label'] == 1 else "negative"
    print(f"\n'{error['text']}'")
    print(f"True: {true_sent} | Predicted: {pred_sent} (conf: {error['confidence']:.3f})")

HIGH-CONFIDENCE ERRORS:

1. 'an uneasy mix of run-of-the-mill raunchy humor and seemingly sincere personal reflection .'
   True: Negative | Predicted: Positive
   Confidence: 0.701

2. 'the stunt work is top-notch ; the dialogue and drama often food-spittingly funny .'
   True: Negative | Predicted: Positive
   Confidence: 0.867

3. 'goldmember is funny enough to justify the embarrassment of bringing a barf bag to the moviehouse .'
   True: Positive | Predicted: Negative
   Confidence: 0.710

4. 'steven soderbergh doesn't remake andrei tarkovsky's solaris so much as distill it .'
   True: Positive | Predicted: Negative
   Confidence: 0.730

5. '" what really happened ? " is a question for philosophers , not filmmakers ; all the filmmakers need to do is engage an audience .'
   True: Positive | Predicted: Negative
   Confidence: 0.717


Test on edge cases like sarcasm and mixed sentiment.

In [23]:
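# (category, review text, true label) where 1 = positive, 0 = negative
edge_cases = [
    ("Sarcastic", "Oh great, another masterpiece. NOT!", 0),
    ("Mixed", "The acting was great but the plot was terrible", 0),
    ("Backhanded", "Not as bad as I expected", 1),
    ("Double negative", "Not unwatchable", 1),
    ("Very short", "Boring", 0),
    ("Ambiguous", "It was a movie", 0),
]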

edge_embeddings = model.encode([text for _, text, _ in edge_cases])
edge_predictions = clf.predict(edge_embeddings)

print("\nEdge Cases:")
correct = 0
for i, (category, text, true_label) in enumerate(edge_cases):
    pred = edge_predictions[i]
    match = "correct" if pred == true_label else "wrong"
    if pred == true_label:
        correct += 1
    print(f"{category}: '{text}' -> {pred} ({match})")

print(f"\nEdge case accuracy: {correct}/{len(edge_cases)}")


EDGE CASES:

Sarcastic: 'Oh great, another masterpiece. NOT!'
  Predicted: 1 | Actual: 0 | WRONG

Mixed: 'The acting was great but the plot was terrible'
  Predicted: 0 | Actual: 0 | CORRECT

Backhanded: 'Not as bad as I expected'
  Predicted: 0 | Actual: 1 | WRONG

Double negative: 'Not unwatchable'
  Predicted: 0 | Actual: 1 | WRONG

Very short: 'Boring'
  Predicted: 0 | Actual: 0 | CORRECT

Ambiguous: 'It was a movie'
  Predicted: 0 | Actual: 0 | CORRECT

Edge case accuracy: 3/6


### Questions

1. What do high-confidence errors have in common?

2. Do errors tend to be shorter or longer than correct predictions?

3. Which edge cases failed most - sarcasm, mixed sentiment, or double negatives?
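
For question 2, a quick comparison of average word counts is a reasonable starting point. The helper below is a hypothetical sketch run on made-up reviews; in the notebook the inputs would be `test_subset["text"]`, `test_subset["label"]`, and `predictions`.

```python
from statistics import mean

def compare_lengths(texts, y_true, y_pred):
    """Mean word count of misclassified vs. correctly classified texts."""
    wrong = [len(t.split()) for t, yt, yp in zip(texts, y_true, y_pred) if yt != yp]
    right = [len(t.split()) for t, yt, yp in zip(texts, y_true, y_pred) if yt == yp]
    return {
        "errors": mean(wrong) if wrong else None,
        "correct": mean(right) if right else None,
    }

# Made-up reviews and labels for illustration only.
demo_texts = [
    "great fun",
    "a long and winding film that never quite lands",
    "boring",
]
demo_true = [1, 0, 0]
demo_pred = [1, 1, 0]

length_stats = compare_lengths(demo_texts, demo_true, demo_pred)
print(length_stats)
```

With only 26 errors in the run above, treat any difference in averages as a hint rather than a conclusion.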
