# Chapter 4: Text Classification - Medium Tasks

This notebook focuses on building practical text classifiers. You'll create custom multi-class sentiment classifiers, evaluate performance with limited training data, implement confidence-based classification with uncertainty handling, and perform systematic failure analysis. These skills are crucial for real-world NLP applications, where labeled data is often limited and perfect accuracy is rarely achievable.


---

## Setup

Run all cells in this section to set up the environment and load necessary data.


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---
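
Several cells below hard-code `device="cuda:0"`, which raises an error on a CPU-only runtime. The following is a minimal sketch of a fallback, assuming PyTorch is the backend; `pick_device` is a hypothetical helper and not part of the original notebook.

```python
def pick_device() -> str:
    """Return "cuda:0" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: PyTorch is the installed backend
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If you adopt this, pass `device=device` to `pipeline(...)` instead of the literal string. Inference on CPU still works but is noticeably slower.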


In [1]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

# **Data**

In [2]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [3]:
data["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

# **Text Classification with Representation Models**

## **Using a Task-specific Model**

In [4]:
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


In [5]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

100%|██████████| 1066/1066 [00:10<00:00, 103.45it/s]
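
The loop above reads the negative and positive scores by position (`output[0]` and `output[2]`) and ignores the neutral class in between. As a hedged alternative, the sketch below looks scores up by label name, assuming the model reports the labels `negative`, `neutral`, and `positive`. `binary_from_scores` is a hypothetical helper, and the scores are made up for illustration.

```python
def binary_from_scores(scores):
    """Map a 3-way sentiment output to 0 (negative) or 1 (positive).

    `scores` is a list of {"label": ..., "score": ...} dicts, the shape the
    loop above indexes into. The neutral score is ignored, as in the notebook.
    """
    by_label = {entry["label"].lower(): entry["score"] for entry in scores}
    # Ties go to 0, matching np.argmax([negative, positive]).
    return int(by_label["positive"] > by_label["negative"])


# Hypothetical scores for one review (not real model output).
example = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.55},
    {"label": "positive", "score": 0.35},
]
print(binary_from_scores(example))  # 1
```

Note that the example is assigned to the positive class even though neutral has the highest score. That is a consequence of forcing a three-class model into a binary task.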


In [6]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [7]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066
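
To make the columns of the report concrete, here is a small pure-Python sketch of how per-class precision, recall, and F1 are derived from true and predicted labels. The toy labels are invented for illustration, and `precision_recall_f1` is a hypothetical helper rather than the scikit-learn implementation.

```python
def precision_recall_f1(y_true, y_pred, target_class):
    """Compute precision, recall and F1 for one class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 0 = negative review, 1 = positive review.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]

for cls, name in enumerate(["Negative Review", "Positive Review"]):
    p, r, f = precision_recall_f1(toy_true, toy_pred, cls)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Precision asks "of everything predicted as this class, how much was right?", while recall asks "of everything that truly is this class, how much was found?".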



## **Classification Tasks that Leverage Embeddings**

### Supervised Classification

In [8]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

In [9]:
train_embeddings.shape

(8530, 768)

In [10]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

In [11]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



**Tip!**  

What would happen if we did not use a classifier at all? Instead, we can average the embeddings per class and use cosine similarity to predict which class best matches each document:

In [12]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### Zero-shot Classification

In [13]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [15]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066



**Tip!**  

What would happen if you were to use different descriptions? Use **"A very negative movie review"** and **"A very positive movie review"** to see what happens!
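
One way to run that comparison systematically is to wrap the zero-shot step in a function that takes the label descriptions as an argument. The sketch below is a pure-Python illustration under stated assumptions: `zero_shot_accuracy` and `toy_encode` are hypothetical names, and the two-dimensional vectors are invented so the example runs without downloading a model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_accuracy(encode, texts, y_true, label_texts):
    """Accuracy of nearest-label classification for one set of descriptions.

    `encode` maps a list of strings to a list of vectors; the index of each
    entry in `label_texts` is the class id it stands for.
    """
    label_vecs = encode(label_texts)
    doc_vecs = encode(texts)
    correct = 0
    for vec, truth in zip(doc_vecs, y_true):
        sims = [cosine(vec, label_vec) for label_vec in label_vecs]
        correct += int(sims.index(max(sims)) == truth)
    return correct / len(y_true)


# Toy stand-in for an embedding model (invented 2-D vectors).
toy_vectors = {
    "good film": [0.9, 0.1],
    "bad film": [0.1, 0.9],
    "A positive review": [1.0, 0.0],
    "A negative review": [0.0, 1.0],
}


def toy_encode(texts):
    return [toy_vectors[t] for t in texts]


acc = zero_shot_accuracy(
    toy_encode,
    ["good film", "bad film"],
    [1, 0],
    ["A negative review", "A positive review"],
)
print(acc)  # 1.0
```

In the notebook you could pass `model.encode` as `encode` and call the function once per pair of descriptions. This re-encodes the test documents on every call, so caching `test_embeddings` would be the faster design for repeated experiments.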

## **Classification with Generative Models**

### Encoder-decoder Models

In [16]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

Device set to use cuda:0


In [17]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [18]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    y_pred.append(0 if text == "negative" else 1)

100%|██████████| 1066/1066 [00:55<00:00, 19.05it/s]
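
The loop above maps every output other than the exact string `"negative"` to class 1, so unexpected generations are silently counted as positive. A slightly more defensive parser might look like the sketch below; `parse_sentiment` is a hypothetical helper, and returning `None` for unrecognized text is one possible design choice.

```python
def parse_sentiment(generated_text):
    """Return 0 for negative, 1 for positive, None when the text is neither."""
    cleaned = generated_text.strip().lower().rstrip(".!")
    if cleaned == "negative":
        return 0
    if cleaned == "positive":
        return 1
    return None  # caller decides how to treat unparseable outputs


print(parse_sentiment(" Negative."))  # 0
print(parse_sentiment("positive"))    # 1
print(parse_sentiment("unclear"))     # None
```

Counting how many outputs come back as `None` is a quick check on whether the prompt is steering the model toward the expected vocabulary.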


In [19]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### ChatGPT for Classification

In [25]:
import openai

# Create client
client = openai.OpenAI(api_key="YOUR_KEY_HERE")

In [24]:
def chatgpt_generation(prompt, document, model="gpt-3.5-turbo-0125"):
    """Generate an output based on a prompt and an input document."""
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": prompt.replace("[DOCUMENT]", document)
        }
    ]
    chat_completion = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=0
    )
    return chat_completion.choices[0].message.content

In [22]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

# Predict the target using GPT
document = "unpretentious , charming , quirky , original"
chatgpt_generation(prompt, document)

The next step would be to run one of OpenAI's models against the entire evaluation dataset. However, only run this when you have sufficient API credits, as it calls the API once for every record in the test set (1066 calls).
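
If credits are a concern, one option is to try the prompt on a small random subset first. The sketch below picks a reproducible sample of row indices using only the standard library; `sample_indices` is a hypothetical helper, and the subset size of 50 is an arbitrary choice.

```python
import random


def sample_indices(n_total, n_sample, seed=42):
    """Pick a reproducible random subset of row indices, sorted ascending."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), min(n_sample, n_total)))


subset = sample_indices(1066, 50)
print(len(subset), subset[:5])
```

Assuming the `datasets` API, the indices could then be applied with `data["test"].select(subset)` so that only 50 API calls are made. Remember to compare predictions against the labels of the same subset.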

In [23]:
# You can skip this if you want to save your (free) credits
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])]

In [None]:
# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.87      0.97      0.92       533
Positive Review       0.96      0.86      0.91       533

       accuracy                           0.91      1066
      macro avg       0.92      0.91      0.91      1066
   weighted avg       0.92      0.91      0.91      1066
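
The extraction step above uses `int(pred)`, which raises a `ValueError` as soon as a reply contains anything other than a bare number. The sketch below shows one hedged way to tolerate extra text; `parse_binary_reply` is a hypothetical helper, and treating mixed answers as `None` is an assumption about how you want to handle ambiguity.

```python
import re


def parse_binary_reply(reply):
    """Extract a 0/1 label from a model reply; return None if ambiguous."""
    digits = re.findall(r"\b[01]\b", reply)
    if len(set(digits)) == 1:
        return int(digits[0])
    return None  # no standalone 0/1 found, or both appear


print(parse_binary_reply("1"))                 # 1
print(parse_binary_reply("The answer is 0."))  # 0
print(parse_binary_reply("0 or 1"))            # None
```

Replies that parse to `None` need an explicit policy (drop them, count them as errors, or retry) before they can be passed to `evaluate_performance`.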



## Your Turn - Text Classification Experiments

Run each task first to see the baseline results. Follow the instructions to modify and experiment.

This section is divided into EASY, MEDIUM, & HARD.

---

## Medium Tasks


### Medium Tasks - Building Real Classifiers

These tasks require more modification and experimentation. You'll build complete classification systems.

**About This Task:**

Custom sentiment categories allow you to tailor classification to specific domains beyond generic positive/negative labels. This is critical for domain-specific applications like product reviews, medical feedback, or financial sentiment.


#### Medium Task 1: Multi-Class Sentiment Classifier with Custom Categories

**Instructions:**

1. Execute the code to see how the 5-level sentiment classifier works
2. Review the category confusion analysis to identify which categories are most easily confused
3. Uncomment the second TODO block to add a 6th sentiment level
4. Uncomment the first TODO block and write 3 reviews aimed at your new category
5. Compare the predictions and margins before and after adding the category

In [30]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Test reviews covering different sentiment intensities
test_reviews = [
    "This is the best movie I've ever seen! Absolute masterpiece!",
    "Pretty good movie, I enjoyed it",
    "It was okay, nothing special",
    "Not great, pretty disappointing",
    "Absolutely terrible, worst movie ever",
    "Amazing performances and stunning visuals!",
    "Mediocre at best",
    "Quite bad, wouldn't recommend",
]

# TODO: After first run, add 3 reviews that should fit your new category:
# test_reviews.extend([
#     "Your review 1 here",
#     "Your review 2 here",
#     "Your review 3 here",
# ])

# 5-level sentiment classification
sentiment_labels = [
    "extremely negative sentiment",
    "somewhat negative sentiment",
    "neutral sentiment",
    "somewhat positive sentiment",
    "extremely positive sentiment"
]

# TODO: After analyzing results, uncomment to add 6th category between somewhat and extremely positive:
# sentiment_labels = [
#     "extremely negative sentiment",
#     "somewhat negative sentiment",
#     "neutral sentiment",
#     "somewhat positive sentiment",
#     "very positive sentiment",  # NEW CATEGORY
#     "extremely positive sentiment"
# ]

# Create embeddings
label_embeddings = model.encode(sentiment_labels)
review_embeddings = model.encode(test_reviews)
sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

# Classify
predictions = np.argmax(sim_matrix, axis=1)

print("="*80)
print(f"MULTI-CLASS CLASSIFICATION ({len(sentiment_labels)} categories)")
print("="*80)

for i, review in enumerate(test_reviews):
    predicted_idx = predictions[i]
    predicted_label = sentiment_labels[predicted_idx]
    confidence = sim_matrix[i][predicted_idx]

    # Get top 2 predictions to see confusion
    top2_indices = np.argsort(sim_matrix[i])[-2:][::-1]
    second_best_idx = top2_indices[1]
    second_best_label = sentiment_labels[second_best_idx]
    second_best_score = sim_matrix[i][second_best_idx]
    margin = confidence - second_best_score

    print(f"\nReview {i+1}: '{review[:60]}...'")
    print(f"  1st: {predicted_label:30s} ({confidence:.3f})")
    print(f"  2nd: {second_best_label:30s} ({second_best_score:.3f})")
    print(f"  Margin: {margin:.3f}", end="")

    if margin < 0.05:
        print(" ⚠️  VERY UNCERTAIN - almost tied!")
    elif margin < 0.15:
        print(" ⚠️  Uncertain")
    else:
        print(" ✓ Confident")

# Analyze category confusion
print("\n" + "="*80)
print("CATEGORY CONFUSION ANALYSIS")
print("="*80)
print("How similar are the category descriptions to each other?")
print("(High similarity = easy to confuse)\n")

label_similarity = cosine_similarity(label_embeddings)

print(f"{'Category Pair':<60s} {'Similarity':<12s}")
print("-"*75)

confusions = []
for i in range(len(sentiment_labels)):
    for j in range(i+1, len(sentiment_labels)):
        sim = label_similarity[i][j]
        confusions.append((i, j, sim))

# Sort by similarity (most confusing first)
for i, j, sim in sorted(confusions, key=lambda x: x[2], reverse=True)[:10]:
    pair_name = f"{sentiment_labels[i]} <-> {sentiment_labels[j]}"
    marker = "⚠️ " if sim > 0.7 else ""
    print(f"{marker}{pair_name:<60s} {sim:.3f}")

MULTI-CLASS CLASSIFICATION (5 categories)

Review 1: 'This is the best movie I've ever seen! Absolute masterpiece!...'
  1st: extremely positive sentiment   (0.221)
  2nd: extremely negative sentiment   (0.119)
  Margin: 0.102 ⚠️  Uncertain

Review 2: 'Pretty good movie, I enjoyed it...'
  1st: somewhat positive sentiment    (0.320)
  2nd: extremely positive sentiment   (0.313)
  Margin: 0.007 ⚠️  VERY UNCERTAIN - almost tied!

Review 3: 'It was okay, nothing special...'
  1st: somewhat negative sentiment    (0.411)
  2nd: somewhat positive sentiment    (0.395)
  Margin: 0.016 ⚠️  VERY UNCERTAIN - almost tied!

Review 4: 'Not great, pretty disappointing...'
  1st: somewhat negative sentiment    (0.451)
  2nd: extremely negative sentiment   (0.443)
  Margin: 0.008 ⚠️  VERY UNCERTAIN - almost tied!

Review 5: 'Absolutely terrible, worst movie ever...'
  1st: extremely negative sentiment   (0.442)
  2nd: somewhat negative sentiment    (0.354)
  Margin: 0.087 ⚠️  Uncertain

Review 6: 'Amaz

**Questions:**

1. Which reviews have low margins (<0.10)? What linguistic features do they share? How does multi-class classification differ from binary?

2. Which adjacent categories have highest similarity in the confusion analysis? How could you rewrite label descriptions to create clearer boundaries?

3. After adding your 6th category: Did reviews switch to it? Did the new category create more uncertainty or resolve confusion?
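
As a worked example for the first question, the margin is simply the gap between the two highest similarity scores. The sketch below recomputes it for Reviews 1 and 2 from the top-two scores printed above and applies the same 0.05 and 0.15 cut-offs the cell uses; `top2_margin` and `uncertainty_band` are hypothetical helper names.

```python
def top2_margin(scores):
    """Difference between the highest and second-highest score."""
    first, second = sorted(scores, reverse=True)[:2]
    return first - second


def uncertainty_band(margin):
    """Apply the same cut-offs as the classification cell."""
    if margin < 0.05:
        return "very uncertain"
    if margin < 0.15:
        return "uncertain"
    return "confident"


# Top-two similarities for Review 1 and Review 2, taken from the output above.
for name, scores in [("Review 1", [0.221, 0.119]), ("Review 2", [0.320, 0.313])]:
    margin = top2_margin(scores)
    print(f"{name}: margin={margin:.3f} -> {uncertainty_band(margin)}")
```

Review 2 ends up almost tied because "somewhat positive" and "extremely positive" are neighbouring categories, which is the pattern the second question asks you to look for.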

**About This Task:**

Real-world scenarios often have limited labeled data. Understanding how classifiers perform with small datasets and techniques to maximize their effectiveness (data augmentation, transfer learning) is essential for practical applications.


#### Medium Task 2: Classifier Performance with Limited Training Data

**Instructions:**

1. Execute code to see task-specific model vs embedding classifier with 1000 training samples
2. Set `train_size` to 100, 500, 2000, and 5000 in turn, rerunning after each change
3. Fill in the results table in the TODO section
4. Analyze at what point the embedding classifier matches the task-specific model

In [31]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report
from datasets import load_dataset
import numpy as np

data = load_dataset("rotten_tomatoes")

# TODO: EXPERIMENT WITH THIS VALUE - Try: 100, 500, 1000, 2000, 5000
train_size = 1000
test_size = 300

train_subset = data["train"].shuffle(seed=42).select(range(min(train_size, len(data["train"]))))
test_subset = data["test"].shuffle(seed=42).select(range(test_size))

print("="*80)
print(f"EXPERIMENT: Training Size = {train_size}")
print("="*80)

# Approach 1: Task-Specific Model (pre-trained for sentiment)
print("\n[1/2] Testing Task-Specific Model...")
print("Note: This model doesn't use our training data - it's already trained!")

task_model = pipeline(
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    return_all_scores=True,
    device=-1
)

y_pred_task = []
for text in test_subset["text"]:
    output = task_model(text)[0]
    neg_score = output[0]["score"]
    pos_score = output[2]["score"]
    y_pred_task.append(1 if pos_score > neg_score else 0)

task_f1 = f1_score(test_subset["label"], y_pred_task, average='weighted')

print(f"✓ Task-Specific Model F1: {task_f1:.4f}")

# Approach 2: Embedding + Classifier (uses our training data)
print(f"\n[2/2] Training Embedding Classifier on {train_size} samples...")

embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_embeddings = embedding_model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = embedding_model.encode(test_subset["text"], show_progress_bar=False)

clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])

y_pred_embed = clf.predict(test_embeddings)
embed_f1 = f1_score(test_subset["label"], y_pred_embed, average='weighted')

print(f"✓ Embedding Classifier F1: {embed_f1:.4f}")

# Comparison
print("\n" + "="*80)
print("RESULTS SUMMARY")
print("="*80)
print(f"Training samples used: {train_size}")
print(f"\nTask-Specific (pre-trained):  F1 = {task_f1:.4f}")
print(f"Embedding + Classifier:       F1 = {embed_f1:.4f}")
print(f"Difference:                       {embed_f1 - task_f1:+.4f}")

if embed_f1 > task_f1:
    print(f"\n→ Embedding approach WINS with {train_size} samples!")
elif embed_f1 > task_f1 - 0.01:
    print(f"\n→ Essentially TIED - both perform similarly")
else:
    print(f"\n→ Task-specific model wins - embedding needs more data")

# Show some predictions
print("\n" + "="*80)
print("EXAMPLE PREDICTIONS (first 5 test samples)")
print("="*80)

for i in range(5):
    true_label = "Positive" if test_subset["label"][i] == 1 else "Negative"
    task_pred = "Positive" if y_pred_task[i] == 1 else "Negative"
    embed_pred = "Positive" if y_pred_embed[i] == 1 else "Negative"

    task_correct = "✓" if y_pred_task[i] == test_subset["label"][i] else "✗"
    embed_correct = "✓" if y_pred_embed[i] == test_subset["label"][i] else "✗"

    print(f"\n{i+1}. '{test_subset['text'][i][:60]}...'")
    print(f"   True: {true_label}")
    print(f"   Task-Specific: {task_pred} {task_correct}")
    print(f"   Embedding:     {embed_pred} {embed_correct}")

# TODO: RECORD YOUR RESULTS HERE
# After running with different train_size values, fill in this table:
print("\n" + "="*80)
print("YOUR EXPERIMENT RESULTS")
print("="*80)
print("Run the code multiple times with different train_size values and record:")
print()
print("| Train Size | Task F1 | Embedding F1 | Winner      |")
print("|------------|---------|--------------|-------------|")
print("| 100        | ?.????  | ?.????       | ?           |")
print("| 500        | ?.????  | ?.????       | ?           |")
print("| 1000       | ?.????  | ?.????       | ?           |")
print("| 2000       | ?.????  | ?.????       | ?           |")
print("| 5000       | ?.????  | ?.????       | ?           |")
print()
print(f"Current run: | {train_size:<10} | {task_f1:.4f}  | {embed_f1:.4f}       | {'Embed' if embed_f1 > task_f1 else 'Task':<11} |")

EXPERIMENT: Training Size = 1000

[1/2] Testing Task-Specific Model...
Note: This model doesn't use our training data - it's already trained!


Device set to use cpu


✓ Task-Specific Model F1: 0.7709

[2/2] Training Embedding Classifier on 1000 samples...
✓ Embedding Classifier F1: 0.8699

RESULTS SUMMARY
Training samples used: 1000

Task-Specific (pre-trained):  F1 = 0.7709
Embedding + Classifier:       F1 = 0.8699
Difference:                       +0.0990

→ Embedding approach WINS with 1000 samples!

EXAMPLE PREDICTIONS (first 5 test samples)

1. 'unpretentious , charming , quirky , original...'
   True: Positive
   Task-Specific: Positive ✓
   Embedding:     Positive ✓

2. 'a film really has to be exceptional to justify a three hour ...'
   True: Negative
   Task-Specific: Negative ✓
   Embedding:     Negative ✓

3. 'working from a surprisingly sensitive script co-written by g...'
   True: Positive
   Task-Specific: Positive ✓
   Embedding:     Positive ✓

4. 'it may not be particularly innovative , but the film's crisp...'
   True: Positive
   Task-Specific: Positive ✓
   Embedding:     Positive ✓

5. 'such a premise is ripe for all manner of l

**Questions:**

1. At what training size did the embedding classifier match or beat the task-specific model? What does this reveal about data requirements?

2. Were there cases where one model was correct and the other wrong? What characteristics did those reviews have?

3. Is `train_size=100` enough labeled data? How does this compare to the data needed to train a model from scratch?
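
To answer the first question without copying numbers by hand, you could collect each run in a dictionary and let a small helper find the crossover point. The sketch below is one way to do that; `first_matching_size` is a hypothetical helper, and only the `train_size=1000` row is filled in because that is the only run shown above.

```python
def first_matching_size(results):
    """Smallest train size whose embedding F1 >= task-specific F1, else None."""
    for size in sorted(results):
        task_f1, embed_f1 = results[size]
        if embed_f1 >= task_f1:
            return size
    return None


# {train_size: (task_f1, embed_f1)}; add a row after each of your own runs.
results = {1000: (0.7709, 0.8699)}

print("| Train Size | Task F1 | Embedding F1 | Winner      |")
print("|------------|---------|--------------|-------------|")
for size in sorted(results):
    task_f1, embed_f1 = results[size]
    winner = "Embed" if embed_f1 > task_f1 else "Task"
    print(f"| {size:<10} | {task_f1:.4f}  | {embed_f1:.4f}       | {winner:<11} |")

print(first_matching_size(results))  # 1000
```

Because the task-specific model never sees the training subset, its F1 should stay constant across runs; only the embedding classifier's score moves with `train_size`.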

**About This Task:**

Production classifiers must handle uncertainty gracefully. Confidence thresholds and uncertainty quantification prevent incorrect predictions and allow human-in-the-loop workflows for ambiguous cases.


#### Medium Task 3: Confidence-Based Classifier with Uncertainty Handling

**Instructions:**

1. Execute code to see classifier handling uncertain predictions with threshold=0.15
2. Analyze which reviews were marked as "uncertain" and why
3. Change `confidence_threshold` to 0.05, then 0.30 to observe trade-offs
4. Uncomment TODO to implement an alternative uncertainty measure
5. Compare which uncertainty measure works better

In [32]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Reviews with varying levels of clarity
test_reviews = [
    "Absolutely fantastic! Best movie ever!",           # Clear positive
    "Pretty good, I liked it",                          # Weak positive
    "It was fine, nothing special",                     # Ambiguous
    "Not bad but not great either",                     # Very ambiguous
    "Quite disappointing",                              # Weak negative
    "Terrible! Complete waste of time!",                # Clear negative
    "The movie had some interesting moments",           # Ambiguous positive
    "Outstanding performances all around!",             # Clear positive
]

# True labels (for evaluation)
y_true = [1, 1, 0, 0, 0, 0, 1, 1]  # 1=positive, 0=negative

labels = ["A negative movie review", "A positive movie review"]

# TODO: EXPERIMENT WITH THIS - Try: 0.05, 0.15, 0.30, 0.50
confidence_threshold = 0.15

label_embeddings = model.encode(labels)
review_embeddings = model.encode(test_reviews)
sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

def calculate_margin(similarities):
    """
    Margin = difference between top two predictions
    Small margin = uncertain (predictions are close)
    """
    sorted_sims = np.sort(similarities)[::-1]
    margin = sorted_sims[0] - sorted_sims[1]
    return margin

# TODO: After first run, uncomment this alternative uncertainty measure:
# def calculate_margin(similarities):
#     """
#     Alternative: Use absolute confidence in top prediction
#     Low confidence = uncertain
#     """
#     max_confidence = np.max(similarities)
#     # Convert to margin-like score (higher = more certain)
#     # If max is 0.6, margin = 0.6 - 0.5 = 0.1 (uncertain)
#     # If max is 0.9, margin = 0.9 - 0.5 = 0.4 (certain)
#     margin = max_confidence - 0.5
#     return margin

# Classify with confidence threshold
results = []
predictions = []
confidences = []

print("="*80)
print(f"CONFIDENCE-BASED CLASSIFICATION (threshold={confidence_threshold})")
print("="*80)

for i, review in enumerate(test_reviews):
    similarities = sim_matrix[i]
    predicted_idx = np.argmax(similarities)
    top_confidence = similarities[predicted_idx]
    margin = calculate_margin(similarities)

    # Decision: predict only if confident enough
    if margin >= confidence_threshold:
        prediction = predicted_idx
        status = "PREDICTED"
        predictions.append(prediction)
    else:
        prediction = None
        status = "UNCERTAIN"
        predictions.append(None)

    true_label = "Positive" if y_true[i] == 1 else "Negative"
    pred_label = labels[predicted_idx] if prediction is not None else "UNCERTAIN"

    print(f"\n{i+1}. '{review}'")
    print(f"   True label: {true_label}")
    print(f"   Prediction: {pred_label}")
    print(f"   Top confidence: {top_confidence:.3f}")
    print(f"   Margin: {margin:.3f} {'✓ Above threshold' if margin >= confidence_threshold else '✗ Below threshold'}")
    print(f"   Status: {status}", end="")

    if prediction is not None:
        correct = prediction == y_true[i]
        print(f" - {'✓ CORRECT' if correct else '✗ INCORRECT'}")
    else:
        print()

    results.append({
        'review': review,
        'true': y_true[i],
        'pred': prediction,
        'margin': margin,
        'status': status
    })

# Calculate metrics
print("\n" + "="*80)
print("PERFORMANCE ANALYSIS")
print("="*80)

made_predictions = [r for r in results if r['pred'] is not None]
uncertain_cases = [r for r in results if r['pred'] is None]
correct_predictions = [r for r in made_predictions if r['pred'] == r['true']]

total = len(results)
n_predicted = len(made_predictions)
n_uncertain = len(uncertain_cases)
n_correct = len(correct_predictions)

coverage = n_predicted / total
accuracy = n_correct / n_predicted if n_predicted > 0 else 0

print(f"\nCoverage: {n_predicted}/{total} = {coverage:.1%}")
print(f"  → Made predictions for {n_predicted} reviews")
print(f"  → Refused to predict on {n_uncertain} reviews")

print(f"\nAccuracy (on predictions made): {n_correct}/{n_predicted} = {accuracy:.1%}")
print(f"  → Of the {n_predicted} predictions, {n_correct} were correct")

print(f"\nTrade-off Analysis:")
print(f"  Threshold = {confidence_threshold}")
print(f"  → Higher threshold = fewer predictions but higher accuracy")
print(f"  → Lower threshold = more predictions but lower accuracy")

# Show which reviews were uncertain
if n_uncertain > 0:
    print(f"\n" + "-"*80)
    print(f"UNCERTAIN CASES (margin < {confidence_threshold}):")
    print("-"*80)
    for r in uncertain_cases:
        print(f"  • '{r['review']}'")
        print(f"    Margin: {r['margin']:.3f} (too close to call)")

# TODO: After experimenting with thresholds, analyze the trade-off
print("\n" + "="*80)
print("EXPERIMENT LOG - Fill this in as you try different thresholds:")
print("="*80)
print("| Threshold | Coverage | Accuracy | Notes                    |")
print("|-----------|----------|----------|--------------------------|")
print("| 0.05      | ??.?%    | ??.?%    | ?                        |")
print("| 0.15      | ??.?%    | ??.?%    | ?                        |")
print("| 0.30      | ??.?%    | ??.?%    | ?                        |")
print("| 0.50      | ??.?%    | ??.?%    | ?                        |")
print()
print(f"Current:    | {confidence_threshold:<9.2f} | {coverage*100:>5.1f}%    | {accuracy*100:>5.1f}%    |")

CONFIDENCE-BASED CLASSIFICATION (threshold=0.15)

1. 'Absolutely fantastic! Best movie ever!'
   True label: Positive
   Prediction: UNCERTAIN
   Top confidence: 0.451
   Margin: 0.092 ✗ Below threshold
   Status: UNCERTAIN

2. 'Pretty good, I liked it'
   True label: Positive
   Prediction: UNCERTAIN
   Top confidence: 0.410
   Margin: 0.018 ✗ Below threshold
   Status: UNCERTAIN

3. 'It was fine, nothing special'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.418
   Margin: 0.051 ✗ Below threshold
   Status: UNCERTAIN

4. 'Not bad but not great either'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.414
   Margin: 0.053 ✗ Below threshold
   Status: UNCERTAIN

5. 'Quite disappointing'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.354
   Margin: 0.080 ✗ Below threshold
   Status: UNCERTAIN

6. 'Terrible! Complete waste of time!'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.397
   Margin: 0.121

**Questions:**

1. What do uncertain reviews have in common? Are they using hedging language like "kind of" or "somewhat"?

2. Compare results at threshold=0.05 vs 0.30. Describe the coverage vs accuracy trade-off. When would you want high coverage vs high accuracy?

3. How could you use confidence-based prediction in production? What should a system do when the model is uncertain?
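
For Question 2, one way to fill in the experiment log is to sweep several thresholds in a single pass instead of re-running the cell by hand. The sketch below is illustrative only: `sweep_thresholds`, `toy_probs`, and `toy_labels` are hypothetical names with made-up values, and it assumes the predicted class index can be compared directly with the true label.

```python
def sweep_thresholds(probs, y_true, thresholds):
    """Return (threshold, coverage, accuracy) rows for margin-based abstention."""
    scored = []
    for row, label in zip(probs, y_true):
        ranked = sorted(row, reverse=True)
        margin = ranked[0] - ranked[1]           # gap between the top two classes
        pred = max(range(len(row)), key=lambda k: row[k])
        scored.append((margin, pred == label))

    rows = []
    for t in thresholds:
        kept = [ok for margin, ok in scored if margin >= t]
        coverage = len(kept) / len(scored)
        # Accuracy is undefined when nothing clears the threshold
        accuracy = sum(kept) / len(kept) if kept else None
        rows.append((t, coverage, accuracy))
    return rows


# Hypothetical probabilities for four reviews, columns = [negative, positive]
toy_probs = [[0.90, 0.10], [0.45, 0.55], [0.20, 0.80], [0.60, 0.40]]
toy_labels = [0, 0, 1, 1]

print("| Threshold | Coverage | Accuracy |")
for t, cov, acc in sweep_thresholds(toy_probs, toy_labels, [0.05, 0.15, 0.30, 0.50]):
    acc_text = "n/a" if acc is None else f"{acc:.1%}"
    print(f"| {t:<9.2f} | {cov:>8.1%} | {acc_text:>8} |")
```

On these toy values, raising the threshold drops the two low-margin reviews, which also happen to be the two wrong predictions. Real data will rarely separate this cleanly, so treat the pattern as something to check rather than assume.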

**About This Task:**

Systematic failure analysis reveals model weaknesses and guides improvements. Understanding when and why classifiers fail is crucial for iterative model development and setting realistic performance expectations.


#### Medium Task 4: Classifier Failure Analysis

**Instructions:**

1. Train the classifier and review the overall error analysis
2. Study error patterns to understand which reviews failed and why
3. Uncomment the TODO block to add your own "hard cases" that you predict will fail
4. Test hypotheses: Do sarcastic reviews fail? Short reviews? Mixed sentiment?
5. Propose fixes based on your analysis
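
The cell below imports `confusion_matrix` but tallies false positives and false negatives by hand. As a cross-check for step 1, the same counts can be read off scikit-learn's confusion matrix. This is a minimal sketch on hypothetical labels (`toy_true` and `toy_pred` are made up), not the notebook's actual results.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (0 = negative, 1 = positive)
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 0, 1, 0, 0]

# Rows are true classes, columns are predicted classes;
# with labels=[0, 1] the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

print(f"False Positives: {fp} (predicted positive, actually negative)")
print(f"False Negatives: {fn} (predicted negative, actually positive)")
print(f"Accuracy: {(tp + tn) / (tp + tn + fp + fn):.1%}")
```

Passing `labels=[0, 1]` explicitly keeps the matrix 2x2 even if a small subset happens to contain only one class.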

In [33]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from datasets import load_dataset
import numpy as np

# Load data
data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Use subset for faster experimentation
train_subset = data["train"].shuffle(seed=42).select(range(1000))
test_subset = data["test"].shuffle(seed=42).select(range(200))

# Train classifier
print("Training classifier on 1000 movie reviews...")
train_embeddings = model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = model.encode(test_subset["text"], show_progress_bar=False)

clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])

# Get predictions
predictions = clf.predict(test_embeddings)
probabilities = clf.predict_proba(test_embeddings)

# Analyze errors
print("\n" + "="*80)
print("ERROR ANALYSIS")
print("="*80)

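# Fetch each column once instead of re-indexing the dataset on every iteration
test_texts = test_subset["text"]
test_labels = test_subset["label"]

errors = []
for i, (text, true_label) in enumerate(zip(test_texts, test_labels)):
    if predictions[i] != true_label:
        # Probability the model assigned to the (wrong) class it predicted
        confidence = probabilities[i][predictions[i]]
        errors.append({
            'index': i,
            'text': text,
            'true_label': true_label,
            'predicted_label': predictions[i],
            'confidence': confidence,
            'length': len(text.split())
        })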

total_errors = len(errors)
total_samples = len(test_subset)
accuracy = (total_samples - total_errors) / total_samples

print(f"\nOverall Performance:")
print(f"  Correct: {total_samples - total_errors}/{total_samples} ({accuracy:.1%})")
print(f"  Errors:  {total_errors}/{total_samples} ({total_errors/total_samples:.1%})")

# Categorize errors
false_positives = [e for e in errors if e['predicted_label'] == 1]
false_negatives = [e for e in errors if e['predicted_label'] == 0]

print(f"\nError Types:")
print(f"  False Positives: {len(false_positives)} (predicted positive, actually negative)")
print(f"  False Negatives: {len(false_negatives)} (predicted negative, actually positive)")

# Show high-confidence errors (most surprising)
high_conf_errors = [e for e in errors if e['confidence'] > 0.7]

print(f"\n" + "-"*80)
print(f"HIGH-CONFIDENCE ERRORS (confidence > 0.7)")
print(f"These are the most surprising mistakes:")
print("-"*80)

for i, error in enumerate(high_conf_errors[:5]):
    true_sent = "Positive" if error['true_label'] == 1 else "Negative"
    pred_sent = "Positive" if error['predicted_label'] == 1 else "Negative"

    print(f"\n{i+1}. '{error['text']}'")
    print(f"   True: {true_sent} | Predicted: {pred_sent} | Confidence: {error['confidence']:.3f}")
    print(f"   Length: {error['length']} words")

# Analyze by text length
print(f"\n" + "-"*80)
print("ERROR ANALYSIS BY TEXT LENGTH")
print("-"*80)

error_lengths = [e['length'] for e in errors]
# Word counts of the reviews the classifier got right
correct_lengths = [len(text.split())
                   for text, label, pred in zip(test_subset["text"], test_subset["label"], predictions)
                   if pred == label]

avg_error_length = np.mean(error_lengths) if error_lengths else 0
avg_correct_length = np.mean(correct_lengths) if correct_lengths else 0

print(f"\nAverage length of ERROR reviews: {avg_error_length:.1f} words")
print(f"Average length of CORRECT reviews: {avg_correct_length:.1f} words")

if avg_error_length < avg_correct_length:
    print(f"→ Observation: Errors tend to be SHORTER")
elif avg_error_length > avg_correct_length:
    print(f"→ Observation: Errors tend to be LONGER")
else:
    print(f"→ Observation: No clear length pattern")

# Test edge cases
print("\n" + "="*80)
print("TESTING EDGE CASES")
print("="*80)

edge_cases = [
    ("Sarcastic", "Oh great, another masterpiece. NOT!", 0),
    ("Mixed", "The acting was great but the plot was terrible", 0),
    ("Backhanded", "Not as bad as I expected", 1),
    ("Double negative", "Not unwatchable", 1),
    ("Very short", "Boring", 0),
    ("Ambiguous", "It was a movie", 0),
]

# TODO: After analyzing above errors, add your own test cases:
# edge_cases.extend([
#     ("Your category", "Your test review here", expected_label_0_or_1),
#     ("Another category", "Another test review", expected_label),
# ])

print("\nTesting challenging cases that often fail:")
print("-"*80)

edge_embeddings = model.encode([text for _, text, _ in edge_cases])
edge_predictions = clf.predict(edge_embeddings)
edge_probs = clf.predict_proba(edge_embeddings)

correct_count = 0
for i, (category, text, true_label) in enumerate(edge_cases):
    pred = edge_predictions[i]
    conf = edge_probs[i][pred]
    correct = pred == true_label
    if correct:
        correct_count += 1

    true_sent = "Positive" if true_label == 1 else "Negative"
    pred_sent = "Positive" if pred == 1 else "Negative"

    print(f"\n{category}: '{text}'")
    print(f"  True: {true_sent} | Predicted: {pred_sent} | Confidence: {conf:.3f}")
    print(f"  Result: {'✓ CORRECT' if correct else '✗ WRONG'}")

edge_accuracy = correct_count / len(edge_cases)
print(f"\n" + "-"*80)
print(f"Edge Case Accuracy: {correct_count}/{len(edge_cases)} ({edge_accuracy:.1%})")
print(f"Regular Test Accuracy: {accuracy:.1%}")
print(f"Difference: {accuracy - edge_accuracy:+.1%}")

# Summary and insights
print("\n" + "="*80)
print("KEY INSIGHTS FROM ERROR ANALYSIS")
print("="*80)

print("\n1. Error Distribution:")
print(f"   - False Positives (predicted too optimistic): {len(false_positives)}")
print(f"   - False Negatives (predicted too pessimistic): {len(false_negatives)}")
if len(false_positives) > len(false_negatives):
    print(f"   → Classifier has POSITIVE BIAS")
elif len(false_negatives) > len(false_positives):
    print(f"   → Classifier has NEGATIVE BIAS")

print("\n2. Challenging Cases:")
# Reuse the edge-case predictions computed above instead of re-encoding each text
failing_categories = [cat for (cat, _, true_label), pred in zip(edge_cases, edge_predictions)
                      if pred != true_label]
if failing_categories:
    print(f"   The classifier struggles with: {', '.join(failing_categories)}")

print("\n3. Confidence Analysis:")
if high_conf_errors:
    print(f"   Found {len(high_conf_errors)} high-confidence errors")
    print(f"   → The model is 'confidently wrong' on some cases")

print("\n" + "="*80)
print("TODO: Based on your error analysis, propose improvements:")
print("="*80)
print("# Write your observations here:")
print("# 1. What patterns did you notice in the errors?")
print("# 2. Which edge cases failed most?")
print("# 3. How would you improve the classifier?")
print("#    - Better training data?")
print("#    - Different features?")
print("#    - Ensemble approach?")
print("#    - Confidence thresholds?")

Training classifier on 1000 movie reviews...

ERROR ANALYSIS

Overall Performance:
  Correct: 174/200 (87.0%)
  Errors:  26/200 (13.0%)

Error Types:
  False Positives: 11 (predicted positive, actually negative)
  False Negatives: 15 (predicted negative, actually positive)

--------------------------------------------------------------------------------
HIGH-CONFIDENCE ERRORS (confidence > 0.7)
These are the most surprising mistakes:
--------------------------------------------------------------------------------

1. 'an uneasy mix of run-of-the-mill raunchy humor and seemingly sincere personal reflection .'
   True: Negative | Predicted: Positive | Confidence: 0.701
   Length: 13 words

2. 'the stunt work is top-notch ; the dialogue and drama often food-spittingly funny .'
   True: Negative | Predicted: Positive | Confidence: 0.867
   Length: 14 words

3. 'goldmember is funny enough to justify the embarrassment of bringing a barf bag to the moviehouse .'
   True: Positive | Predicted:

**Reflection Questions:**

1. What do high-confidence errors have in common? How does model confidence relate to correctness?

2. Do errors tend to be shorter, longer, or similar in length compared to correct predictions? Why might text length affect classification?

3. Which edge cases failed most - sarcasm, mixed sentiment, or double negatives? What aspects of language do embeddings not capture well?
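
For Question 1, one possible way to relate confidence to correctness is to group predictions into confidence bins and compare accuracy across them. The sketch below is an illustration under assumptions: `accuracy_by_confidence`, the bin edges, and the toy values are all hypothetical, and it assumes a binary classifier, so the confidence of the predicted class is at least 0.5.

```python
def accuracy_by_confidence(confidences, is_correct,
                           edges=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Return (low, high, count, accuracy) for each confidence bin."""
    bins = []
    n_bins = len(edges) - 1
    for i in range(n_bins):
        low, high = edges[i], edges[i + 1]
        is_last = i == n_bins - 1
        # Bins are [low, high); the final bin also includes its upper edge
        members = [ok for c, ok in zip(confidences, is_correct)
                   if low <= c < high or (is_last and c == high)]
        accuracy = sum(members) / len(members) if members else None
        bins.append((low, high, len(members), accuracy))
    return bins


# Hypothetical (confidence, was the prediction correct?) pairs
toy_conf = [0.55, 0.58, 0.72, 0.75, 0.91, 0.97, 1.00]
toy_ok = [True, False, True, False, True, True, True]

for low, high, count, acc in accuracy_by_confidence(toy_conf, toy_ok):
    acc_text = "n/a" if acc is None else f"{acc:.1%}"
    print(f"confidence {low:.1f}-{high:.1f}: n={count}, accuracy={acc_text}")
```

With the cell above, the inputs could be built from `probabilities`, `predictions`, and the test labels. If accuracy in a bin falls well below that bin's confidence range, the model is overconfident there, which is one way to interpret the high-confidence errors.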