<h1>Chapter 4 - Text Classification</h1>
<i>Classifying text with both representation and generative models</i>

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961"><img src="https://img.shields.io/badge/Buy%20the%20Book!-grey?logo=amazon"></a>
<a href="https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/"><img src="https://img.shields.io/badge/O'Reilly-white.svg?logo=data:image/svg%2bxml;base64,PHN2ZyB3aWR0aD0iMzQiIGhlaWdodD0iMjciIHZpZXdCb3g9IjAgMCAzNCAyNyIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iMTMiIGN5PSIxNCIgcj0iMTEiIHN0cm9rZT0iI0Q0MDEwMSIgc3Ryb2tlLXdpZHRoPSI0Ii8+CjxjaXJjbGUgY3g9IjMwLjUiIGN5PSIzLjUiIHI9IjMuNSIgZmlsbD0iI0Q0MDEwMSIvPgo8L3N2Zz4K"></a>
<a href="https://github.com/HandsOnLLM/Hands-On-Large-Language-Models"><img src="https://img.shields.io/badge/GitHub%20Repository-black?logo=github"></a>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/HandsOnLLM/Hands-On-Large-Language-Models/blob/main/chapter04/Chapter%204%20-%20Text%20Classification.ipynb)

---

This notebook is for Chapter 4 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

---

<a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">
<img src="https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/images/book_cover.png" width="350"/></a>

### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>


If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **uncomment and run** the following code block to install the dependencies for this chapter:

---

💡 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

# **Data**

In [2]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [3]:
data["train"][0, -1]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'things really get weird , though not particularly scary : the movie is all portent and no content .'],
 'label': [1, 0]}

# **Text Classification with Representation Models**

## **Using a Task-specific Model**

In [4]:
from transformers import pipeline

# Path to our HF model
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Load model into pipeline
pipe = pipeline(
    model=model_path,
    tokenizer=model_path,
    return_all_scores=True,
    device="cuda:0"
)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0
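
The pipeline above hard-codes `device="cuda:0"`, which will typically fail on a machine without a GPU. A minimal sketch of a fallback, assuming PyTorch is the installed backend (the `device` variable is our own addition, not part of the original notebook):

```python
# Pick a device string that works with or without a GPU.
# Assumption: PyTorch is the backend; if it is missing we fall back to CPU.
try:
    import torch
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"

print(f"Using device: {device}")
```

Passing `device=device` to `pipeline(...)` in place of the hard-coded string leaves the rest of the cell unchanged. Expect inference on CPU to be considerably slower.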


In [5]:
import numpy as np
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "text")), total=len(data["test"])):
    # Scores come back in label order: 0 = negative, 1 = neutral, 2 = positive
    negative_score = output[0]["score"]
    positive_score = output[2]["score"]
    assignment = np.argmax([negative_score, positive_score])
    y_pred.append(assignment)

100%|██████████| 1066/1066 [00:10<00:00, 103.45it/s]
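
The model returns three scores per review (negative, neutral, positive), while Rotten Tomatoes has only two labels. The loop above therefore drops the neutral score and compares the remaining two. A small sketch of that step on a single hand-written output (the scores are made up for illustration):

```python
# One pipeline output in the `return_all_scores=True` layout:
# a list of {label, score} dicts in label-index order.
# The scores below are invented for illustration.
output = [
    {"label": "negative", "score": 0.10},
    {"label": "neutral", "score": 0.55},
    {"label": "positive", "score": 0.35},
]

negative_score = output[0]["score"]
positive_score = output[2]["score"]

# Neutral is ignored: the review is assigned to whichever of the two
# remaining classes scores higher (0 = negative, 1 = positive).
assignment = int(positive_score > negative_score)
print(assignment)  # 1, even though "neutral" has the highest score
```

Reviews that the model considers mostly neutral are still forced into one of the two classes, which is one plausible source of errors for this task-specific model.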


In [6]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

In [7]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.76      0.88      0.81       533
Positive Review       0.86      0.72      0.78       533

       accuracy                           0.80      1066
      macro avg       0.81      0.80      0.80      1066
   weighted avg       0.81      0.80      0.80      1066



## **Classification Tasks that Leverage Embeddings**

### Supervised Classification

In [8]:
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Convert text to embeddings
train_embeddings = model.encode(data["train"]["text"], show_progress_bar=True)
test_embeddings = model.encode(data["test"]["text"], show_progress_bar=True)

In [9]:
train_embeddings.shape

(8530, 768)

In [10]:
from sklearn.linear_model import LogisticRegression

# Train a Logistic Regression on our train embeddings
clf = LogisticRegression(random_state=42)
clf.fit(train_embeddings, data["train"]["label"])

In [11]:
# Predict previously unseen instances
y_pred = clf.predict(test_embeddings)
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.86      0.85       533
Positive Review       0.86      0.85      0.85       533

       accuracy                           0.85      1066
      macro avg       0.85      0.85      0.85      1066
   weighted avg       0.85      0.85      0.85      1066



**Tip!**  

What would happen if we did not use a classifier at all? Instead, we can average the embeddings per class and use cosine similarity to predict which class best matches each document:

In [12]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity

# Average the embeddings of all documents in each target label
df = pd.DataFrame(np.hstack([train_embeddings, np.array(data["train"]["label"]).reshape(-1, 1)]))
averaged_target_embeddings = df.groupby(768).mean().values

# Find the best matching embeddings between evaluation documents and target embeddings
sim_matrix = cosine_similarity(test_embeddings, averaged_target_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

# Evaluate the model
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.85      0.84      0.84       533
Positive Review       0.84      0.85      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066
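
The averaging step can also be written with NumPy boolean masks instead of stacking the labels onto the embeddings. The sketch below uses tiny random arrays in place of `train_embeddings` and `test_embeddings`; the helper names `class_centroids` and `nearest_centroid` are illustrative and do not come from the book:

```python
import numpy as np

def class_centroids(embeddings, labels):
    """Return one mean embedding per class, ordered by sorted class id."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    return np.stack([embeddings[labels == c].mean(axis=0) for c in classes])

def nearest_centroid(embeddings, centroids):
    """Assign each row to the centroid with the highest cosine similarity."""
    a = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    b = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)

# Toy stand-ins: class 0 clusters around (1, 0), class 1 around (0, 1).
rng = np.random.default_rng(0)
train_x = np.vstack([
    rng.normal([1.0, 0.0], 0.05, size=(20, 2)),
    rng.normal([0.0, 1.0], 0.05, size=(20, 2)),
])
train_y = np.array([0] * 20 + [1] * 20)
test_x = np.array([[0.9, 0.1], [0.2, 0.8]])

centroids = class_centroids(train_x, train_y)
toy_pred = nearest_centroid(test_x, centroids)
print(toy_pred)
```

With the real data, `class_centroids(train_embeddings, data["train"]["label"])` should yield the same two rows as the `groupby` version, since both take the per-class mean.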



### Zero-shot Classification

In [13]:
# Create embeddings for our labels
label_embeddings = model.encode(["A negative review",  "A positive review"])

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

# Find the best matching label for each document
sim_matrix = cosine_similarity(test_embeddings, label_embeddings)
y_pred = np.argmax(sim_matrix, axis=1)

In [15]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.78      0.77      0.78       533
Positive Review       0.77      0.79      0.78       533

       accuracy                           0.78      1066
      macro avg       0.78      0.78      0.78      1066
   weighted avg       0.78      0.78      0.78      1066
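
Zero-shot assignment here reduces to one formula. With $x$ a document embedding and $\ell_0, \ell_1$ the embeddings of the two label descriptions (notation introduced here for illustration only), the cells above compute:

$$
\cos(x, \ell_k) = \frac{x \cdot \ell_k}{\lVert x \rVert \, \lVert \ell_k \rVert},
\qquad
\hat{y} = \arg\max_{k \in \{0, 1\}} \cos(x, \ell_k)
$$

No labeled training data enters this computation. Only the label text determines $\ell_k$, so rewording a description moves $\ell_k$ and can flip the prediction for documents whose two similarities are close.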



**Tip!**  

What would happen if you were to use different descriptions? Use **"A very negative movie review"** and **"A very positive movie review"** to see what happens!

## **Classification with Generative Models**

### Encoder-decoder Models

In [16]:
# Load our model
pipe = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    device="cuda:0"
)

Device set to use cuda:0


In [17]:
# Prepare our data
prompt = "Is the following sentence positive or negative? "
data = data.map(lambda example: {"t5": prompt + example['text']})
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label', 't5'],
        num_rows: 1066
    })
})

In [18]:
# Run inference
y_pred = []
for output in tqdm(pipe(KeyDataset(data["test"], "t5")), total=len(data["test"])):
    text = output[0]["generated_text"]
    # Any output other than the exact string "negative" is counted as positive
    y_pred.append(0 if text == "negative" else 1)

100%|██████████| 1066/1066 [00:55<00:00, 19.05it/s]


In [19]:
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.83      0.85      0.84       533
Positive Review       0.85      0.83      0.84       533

       accuracy                           0.84      1066
      macro avg       0.84      0.84      0.84      1066
   weighted avg       0.84      0.84      0.84      1066



### ChatGPT for Classification

In [25]:
import openai

# Create client
client = openai.OpenAI(api_key="YOUR_KEY_HERE")

In [24]:
def chatgpt_generation(prompt, document, model="gpt-3.5-turbo-0125"):
    """Generate an output based on a prompt and an input document."""
    messages = [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": prompt.replace("[DOCUMENT]", document)
        }
    ]
    chat_completion = client.chat.completions.create(
        messages=messages,
        model=model,
        temperature=0
    )
    return chat_completion.choices[0].message.content

In [22]:
# Define a prompt template as a base
prompt = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive return 1 and if it is negative return 0. Do not give any other answers.
"""

# Predict the target using GPT
document = "unpretentious , charming , quirky , original"
chatgpt_generation(prompt, document)

AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: YOUR_KEY*HERE. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

The next step is to run one of OpenAI's models against the entire evaluation dataset. Only do this if you have sufficient credits, because it calls the API once for every record in the test split (1,066 records).

In [23]:
# You can skip this if you want to save your (free) credits
predictions = [chatgpt_generation(prompt, doc) for doc in tqdm(data["test"]["text"])]


In [None]:
# Extract predictions
y_pred = [int(pred) for pred in predictions]

# Evaluate performance
evaluate_performance(data["test"]["label"], y_pred)

                 precision    recall  f1-score   support

Negative Review       0.87      0.97      0.92       533
Positive Review       0.96      0.86      0.91       533

       accuracy                           0.91      1066
      macro avg       0.92      0.91      0.91      1066
   weighted avg       0.92      0.91      0.91      1066



## Your Turn - Text Classification Experiments

Run each task first to see the baseline results. Follow the instructions to modify and experiment.

This section is divided into EASY, MEDIUM, & HARD.

### Easy Tasks - Hands-On Exploration

#### Easy Task 1: Zero-Shot Classifier

**Instructions:**

1. Execute the code to see baseline predictions for 3 basic reviews
2. Uncomment the larger `test_reviews` list and run again to test harder cases
3. Uncomment one label option to see how wording affects predictions
4. Compare which label style works best for ambiguous reviews

In [26]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Test reviews - THESE WORK AS-IS
test_reviews = [
    "This movie was absolutely fantastic! A masterpiece!",
    "Terrible waste of time. Very disappointing.",
    "An okay film, nothing special but watchable.",
]

# TODO: After running once, uncomment these to test harder cases:
# test_reviews = [
#     "This movie was absolutely fantastic! A masterpiece!",
#     "Terrible waste of time. Very disappointing.",
#     "An okay film, nothing special but watchable.",
#     "Oh great, another masterpiece... NOT!",  # Sarcastic
#     "Boring.",  # Very short
#     "Great acting but terrible plot.",  # Mixed sentiment
# ]

# Label descriptions - THESE WORK AS-IS
labels = [
    "A negative movie review",
    "A positive movie review"
]

# TODO: After seeing baseline, try uncommenting ONE of these:
# labels = ["negative", "positive"]  # Option 1: Simple
# labels = ["bad movie review", "good movie review"]  # Option 2: Different wording
# labels = ["a scathing negative movie review", "an enthusiastic positive movie review"]  # Option 3: Detailed

# Create embeddings and calculate similarity
label_embeddings = model.encode(labels)
review_embeddings = model.encode(test_reviews)
sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

print("Classification Results:")
print("="*80)

for i, review in enumerate(test_reviews):
    prediction = np.argmax(sim_matrix[i])
    confidence = sim_matrix[i][prediction]
    margin = abs(sim_matrix[i][0] - sim_matrix[i][1])

    print(f"\nReview {i+1}: '{review}'")
    print(f"Predicted: {labels[prediction]}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Scores -> Negative: {sim_matrix[i][0]:.3f}, Positive: {sim_matrix[i][1]:.3f}")
    print(f"Margin (certainty): {margin:.3f}")

Classification Results:

Review 1: 'This movie was absolutely fantastic! A masterpiece!'
Predicted: A positive movie review
Confidence: 0.493
Scores -> Negative: 0.382, Positive: 0.493
Margin (certainty): 0.111

Review 2: 'Terrible waste of time. Very disappointing.'
Predicted: A negative movie review
Confidence: 0.439
Scores -> Negative: 0.439, Positive: 0.299
Margin (certainty): 0.140

Review 3: 'An okay film, nothing special but watchable.'
Predicted: A positive movie review
Confidence: 0.542
Scores -> Negative: 0.503, Positive: 0.542
Margin (certainty): 0.039


**Questions:**

1. How did the classifier handle the sarcastic review ("Oh great, another masterpiece... NOT!")? What semantic features did embeddings miss?

2. Which reviews changed predictions when you modified label descriptions? Why are some reviews more sensitive to label wording than others?

3. Which reviews have low confidence margins (<0.1)? What linguistic features make certain reviews harder to classify?

#### Easy Task 2: Classifier Strategy Analysis

**Instructions:**

1. Execute code to see three pre-built classifiers (conservative, aggressive, balanced)
2. Study each confusion matrix to identify error patterns
3. Modify `classifier_yours` to create a very conservative classifier (precision > 0.9)
4. Uncomment the TODO section to analyze your classifier
5. Experiment with creating different strategy combinations

In [27]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
import numpy as np

# True labels: first 5 are negative (0), last 5 are positive (1)
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Pre-built classifiers to analyze
classifier_conservative = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # Rarely predicts positive
classifier_aggressive = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])     # Often predicts positive
classifier_balanced = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])       # Balanced approach

# TODO: Modify these predictions to make YOUR classifier
# Goal: Try to achieve precision > 0.9 (be very selective about predicting 1)
classifier_yours = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

def analyze_classifier(name, y_true, y_pred):
    """Analyze classifier performance with detailed breakdown"""
    print(f"\n{'='*70}")
    print(f"{name}")
    print('='*70)

    cm = confusion_matrix(y_true, y_pred)

    # Show confusion matrix with labels
    print(f"\nConfusion Matrix:")
    print(f"                    Predicted Neg | Predicted Pos")
    print(f"Actual Neg (0):          {cm[0][0]}       |       {cm[0][1]} <- False Positives (BAD)")
    print(f"Actual Pos (1):          {cm[1][0]}       |       {cm[1][1]} <- True Positives (GOOD)")
    print(f"                          ↑                           ")
    print(f"                   False Negatives (BAD)              ")

    # Calculate metrics
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)

    print(f"\nMetrics:")
    print(f"Precision: {precision:.3f} = TP/(TP+FP) = {cm[1][1]}/({cm[1][1]}+{cm[0][1]})")
    print(f"           → Of {cm[1][1]+cm[0][1]} positive predictions, {cm[1][1]} were correct")
    print(f"\nRecall:    {recall:.3f} = TP/(TP+FN) = {cm[1][1]}/({cm[1][1]}+{cm[1][0]})")
    print(f"           → Of {cm[1][1]+cm[1][0]} actual positives, found {cm[1][1]}")
    print(f"\nF1 Score:  {f1:.3f} = 2*(P*R)/(P+R)")

    # Explain strategy
    if precision > recall + 0.1:
        print(f"\n→ Strategy: CONSERVATIVE (careful about predicting positive)")
        print(f"  ✓ Few false alarms (only {cm[0][1]} false positives)")
        print(f"  ✗ Misses actual positives ({cm[1][0]} false negatives)")
    elif recall > precision + 0.1:
        print(f"\n→ Strategy: AGGRESSIVE (liberal about predicting positive)")
        print(f"  ✓ Finds most positives (only {cm[1][0]} false negatives)")
        print(f"  ✗ Many false alarms ({cm[0][1]} false positives)")
    else:
        print(f"\n→ Strategy: BALANCED")

    return precision, recall, f1

# Analyze pre-built classifiers
results = {}
for name, classifier in [
    ("Conservative Classifier", classifier_conservative),
    ("Aggressive Classifier", classifier_aggressive),
    ("Balanced Classifier", classifier_balanced),
]:
    p, r, f = analyze_classifier(name, y_true, classifier)
    results[name] = (p, r, f)

# TODO: After modifying classifier_yours above, uncomment these lines:
# print("\n" + "="*70)
# print("ANALYZING YOUR CLASSIFIER")
# print("="*70)
# p, r, f = analyze_classifier("YOUR Classifier", y_true, classifier_yours)
# results["YOUR Classifier"] = (p, r, f)

# Summary
print("\n" + "="*70)
print("SUMMARY COMPARISON")
print("="*70)
print(f"{'Classifier':<25} {'Precision':<12} {'Recall':<12} {'F1':<12}")
print("-"*70)
for name, (p, r, f) in results.items():
    print(f"{name:<25} {p:.3f}        {r:.3f}       {f:.3f}")


Conservative Classifier

Confusion Matrix:
                    Predicted Neg | Predicted Pos
Actual Neg (0):          5       |       0 <- False Positives (BAD)
Actual Pos (1):          2       |       3 <- True Positives (GOOD)
                          ↑                           
                   False Negatives (BAD)              

Metrics:
Precision: 1.000 = TP/(TP+FP) = 3/(3+0)
           → Of 3 positive predictions, 3 were correct

Recall:    0.600 = TP/(TP+FN) = 3/(3+2)
           → Of 5 actual positives, found 3

F1 Score:  0.750 = 2*(P*R)/(P+R)

→ Strategy: CONSERVATIVE (careful about predicting positive)
  ✓ Few false alarms (only 0 false positives)
  ✗ Misses actual positives (2 false negatives)

Aggressive Classifier

Confusion Matrix:
                    Predicted Neg | Predicted Pos
Actual Neg (0):          1       |       4 <- False Positives (BAD)
Actual Pos (1):          0       |       5 <- True Positives (GOOD)
                          ↑                         

**Questions:**

1. The conservative classifier has 2 false negatives. What real-world mistake does this represent? Provide a movie review example.

2. What strategy did you use to achieve high precision in `classifier_yours`? Why does predicting positive less frequently increase precision?

3. Which classifier won on F1 score? Why doesn't the aggressive classifier win despite high recall?

#### Easy Task 3: Temperature Effects on Text Generation

**Instructions:**

1. Execute code to see how temperature affects token selection with a confident model
2. Observe how probabilities and samples change across temperatures
3. Uncomment the uncertain probability distribution and run again
4. Compare temperature effects on confident vs uncertain models
5. Uncomment TODO to add a new temperature value and analyze results

In [28]:
import numpy as np

# Starting with a CONFIDENT model (one token much more likely)
original_probs = np.array([0.50, 0.30, 0.12, 0.05, 0.03])
tokens = ["positive", "negative", "neutral", "good", "bad"]

# TODO: After first run, uncomment this UNCERTAIN distribution:
# original_probs = np.array([0.25, 0.24, 0.22, 0.18, 0.11])  # Much more balanced!
# Run again and compare the temperature effects

def apply_temperature(probs, temperature):
    """Apply temperature scaling to change distribution sharpness"""
    if temperature == 0:
        # Deterministic: always pick the highest
        result = np.zeros_like(probs)
        result[np.argmax(probs)] = 1.0
        return result

    # Apply temperature scaling
    logits = np.log(probs + 1e-10)
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits)
    return exp_logits / np.sum(exp_logits)

def visualize_distribution(probs, tokens):
    """Show probability distribution as bar chart"""
    for i, token in enumerate(tokens):
        bar_length = int(probs[i] * 100)
        bar = '█' * bar_length
        print(f"  {token:10s}: {probs[i]:.3f} {bar}")

# Test different temperatures
temperatures = [0, 0.5, 1.0, 2.0]

# TODO: After analyzing the results, uncomment this to add temperature=3.0:
# temperatures = [0, 0.5, 1.0, 2.0, 3.0]

print("="*70)
print(f"Original (temperature=1.0) probabilities:")
print("="*70)
visualize_distribution(original_probs, tokens)

for temp in temperatures:
    print(f"\n{'='*70}")
    print(f"Temperature = {temp}")
    print('='*70)

    # Apply temperature
    new_probs = apply_temperature(original_probs, temp)

    # Visualize
    visualize_distribution(new_probs, tokens)

    # Sample tokens
    print(f"\n  Sampling 10 tokens:")
    if temp == 0:
        samples = [tokens[np.argmax(new_probs)]] * 10
    else:
        samples = np.random.choice(tokens, size=10, p=new_probs)

    print(f"  {samples}")

    # Show diversity metric
    unique_tokens = len(set(samples))
    print(f"  → Diversity: {unique_tokens}/10 unique tokens")

    # Explain what's happening
    if temp == 0:
        print(f"  → Effect: DETERMINISTIC - always outputs '{samples[0]}'")
    elif temp < 1.0:
        print(f"  → Effect: SHARPENED - makes confident tokens more likely")
    elif temp == 1.0:
        print(f"  → Effect: UNCHANGED - original distribution")
    else:
        print(f"  → Effect: FLATTENED - makes all tokens more equally likely")

Original (temperature=1.0) probabilities:
  positive  : 0.500 ██████████████████████████████████████████████████
  negative  : 0.300 ██████████████████████████████
  neutral   : 0.120 ████████████
  good      : 0.050 █████
  bad       : 0.030 ███

Temperature = 0
  positive  : 1.000 ████████████████████████████████████████████████████████████████████████████████████████████████████
  negative  : 0.000 
  neutral   : 0.000 
  good      : 0.000 
  bad       : 0.000 

  Sampling 10 tokens:
  ['positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive']
  → Diversity: 1/10 unique tokens
  → Effect: DETERMINISTIC - always outputs 'positive'

Temperature = 0.5
  positive  : 0.699 █████████████████████████████████████████████████████████████████████
  negative  : 0.252 █████████████████████████
  neutral   : 0.040 ████
  good      : 0.007 
  bad       : 0.003 

  Sampling 10 tokens:
  ['positive' 'positive' 'positive' 'neutral' 'posit
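
The temperature = 0.5 probabilities above can be checked by hand. Dividing log-probabilities by $T$ and re-normalizing is the same as raising each probability to the power $1/T$, so at $T = 0.5$ every probability is squared before normalization. A short sketch of that equivalence (the helper name `temperature_power` is ours):

```python
import numpy as np

def temperature_power(probs, temperature):
    """Temperature scaling written as p ** (1 / T), then re-normalized."""
    powered = np.asarray(probs, dtype=float) ** (1.0 / temperature)
    return powered / powered.sum()

original_probs = np.array([0.50, 0.30, 0.12, 0.05, 0.03])
scaled = temperature_power(original_probs, 0.5)
print(np.round(scaled, 3))  # 0.699, 0.252, 0.040, 0.007, 0.003
```

Squaring shrinks small probabilities far more than large ones (0.03 becomes 0.0009, while 0.50 only drops to 0.25), which is why low temperatures concentrate mass on the top token.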

**Questions:**

1. Why is temperature=0 critical for classification tasks? What would go wrong with temperature=1.0?

2. Compare temperature=0.5 vs 2.0. At what temperature did low-probability tokens like "bad" start appearing in samples?

3. With the uncertain distribution ([0.25, 0.24, 0.22, 0.18, 0.11]), how did temperature effects differ from the confident model?

#### Easy Task 4: Embedding Similarity Analysis

**Instructions:**

1. Execute code to see similarity matrix for movie reviews
2. Identify which reviews cluster together and which are distant
3. Uncomment TODO to add reviews from different domains
4. Analyze whether restaurant/product reviews cluster with movie reviews
5. Add a random sentence to test similarity boundaries

In [29]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Movie reviews - different types
texts = [
    # Positive reviews
    "Amazing movie! Absolutely loved it!",
    "Fantastic film, highly recommend!",
    "Great cinematography and acting",

    # Negative reviews
    "Terrible waste of time",
    "Very disappointing and boring",
    "Poor acting and weak plot",

    # Neutral reviews
    "It was okay, nothing special",
    "Some good parts, some bad parts",

    # Off-topic
    "The weather is nice today",
    "I like eating pizza"
]

# TODO: After first run, uncomment these to test domain transfer:
# texts = [
#     # Positive movie reviews
#     "Amazing movie! Absolutely loved it!",
#     "Fantastic film, highly recommend!",
#     "Great cinematography and acting",
#
#     # Negative movie reviews
#     "Terrible waste of time",
#     "Very disappointing and boring",
#     "Poor acting and weak plot",
#
#     # Neutral movie reviews
#     "It was okay, nothing special",
#     "Some good parts, some bad parts",
#
#     # Positive restaurant reviews (different domain!)
#     "Amazing food! Absolutely loved it!",
#     "Fantastic restaurant, highly recommend!",
#
#     # Off-topic
#     "The weather is nice today",
#     "I like eating pizza",
# ]

labels = [f"Text {i+1}" for i in range(len(texts))]

# Generate embeddings
embeddings = model.encode(texts)
similarity_matrix = cosine_similarity(embeddings)

print(f"Each text converted to {embeddings.shape[1]}-dimensional vector")
print(f"Comparing {embeddings.shape[0]} texts\n")

# Show full similarity matrix
print("Similarity Matrix (0=unrelated, 1=identical):")
print("="*80)
print(f"{'':10s}", end="")
for i in range(len(texts)):
    print(f"T{i+1:2d} ", end="")
print()

for i in range(len(texts)):
    print(f"Text {i+1:2d}:  ", end="")
    for j in range(len(texts)):
        if i == j:
            print("---- ", end="")
        else:
            sim = similarity_matrix[i][j]
            if sim > 0.6:
                print(f"{sim:.2f}*", end="")  # High similarity
            else:
                print(f"{sim:.2f} ", end="")
            print(" ", end="")
    print()

print("\n* = High similarity (>0.6)")

# Detailed comparisons
print("\n" + "="*80)
print("DETAILED COMPARISONS")
print("="*80)

comparisons = [
    (0, 1, "Positive review vs Positive review"),
    (3, 4, "Negative review vs Negative review"),
    (0, 3, "Positive review vs Negative review"),
    (0, 8, "Movie review vs Off-topic text"),
]

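# TODO: After adding restaurant reviews, uncomment these lines.
# In the expanded list, index 8 is a restaurant review and the off-topic
# text moves to index 10, so the last comparison above must be re-pointed:
# comparisons[-1] = (0, 10, "Movie review vs Off-topic text")
# comparisons.append((0, 8, "Positive MOVIE vs Positive RESTAURANT"))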

for i, j, description in comparisons:
    if i < len(texts) and j < len(texts):
        sim = similarity_matrix[i][j]
        print(f"\n{description}:")
        print(f"  Text {i+1}: '{texts[i]}'")
        print(f"  Text {j+1}: '{texts[j]}'")
        print(f"  Similarity: {sim:.3f}")

        if sim > 0.7:
            print(f"  → Very similar! These texts are closely related in meaning")
        elif sim > 0.4:
            print(f"  → Moderately similar. Some shared concepts")
        else:
            print(f"  → Different topics or sentiments")

# Find clusters
print("\n" + "="*80)
print("CLUSTERS (which texts group together?)")
print("="*80)

# Find texts similar to first positive review
positive_idx = 0
similar_to_positive = []
for i in range(len(texts)):
    if i != positive_idx and similarity_matrix[positive_idx][i] > 0.5:
        similar_to_positive.append((i, similarity_matrix[positive_idx][i]))

print(f"\nTexts similar to '{texts[positive_idx]}':")
for idx, sim in sorted(similar_to_positive, key=lambda x: x[1], reverse=True):
    print(f"  Text {idx+1} (sim={sim:.3f}): '{texts[idx]}'")

Each text converted to 768-dimensional vector
Comparing 10 texts

Similarity Matrix (0=unrelated, 1=identical):
          T 1 T 2 T 3 T 4 T 5 T 6 T 7 T 8 T 9 T10 
Text  1:  ---- 0.77* 0.52  0.09  0.24  0.23  0.31  0.16  0.10  0.03  
Text  2:  0.77* ---- 0.47  0.11  0.26  0.23  0.21  0.14  0.09  0.01  
Text  3:  0.52  0.47  ---- 0.10  0.32  0.51  0.36  0.26  0.12  0.06  
Text  4:  0.09  0.11  0.10  ---- 0.55  0.29  0.34  0.21  0.03  0.07  
Text  5:  0.24  0.26  0.32  0.55  ---- 0.50  0.55  0.36  0.09  0.13  
Text  6:  0.23  0.23  0.51  0.29  0.50  ---- 0.37  0.25  -0.03  0.08  
Text  7:  0.31  0.21  0.36  0.34  0.55  0.37  ---- 0.34  0.16  0.11  
Text  8:  0.16  0.14  0.26  0.21  0.36  0.25  0.34  ---- 0.12  0.21  
Text  9:  0.10  0.09  0.12  0.03  0.09  -0.03  0.16  0.12  ---- 0.20  
Text 10:  0.03  0.01  0.06  0.07  0.13  0.08  0.11  0.21  0.20  ---- 

* = High similarity (>0.6)

DETAILED COMPARISONS

Positive review vs Positive review:
  Text 1: 'Amazing movie! Absolutely loved it!'


**Questions:**

1. Compare similarity between Text 1 and Text 2 (both positive) vs Text 1 and Text 4 (positive vs negative). What aspects of semantic meaning do embeddings prioritize?

2. Find similarity scores between two negative reviews (Text 4 and Text 5) and two positive reviews (Text 1 and Text 2). Why would averaging embeddings per class work for classification?

3. After adding restaurant reviews: How similar was "Amazing food!" to "Amazing movie!"? What does this reveal about domain transfer?
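
Question 2 can be explored with a small nearest-centroid sketch. It is not the notebook's own code: the three-dimensional vectors and the helper names (`mean_vector`, `fit_centroids`, `predict`) are hypothetical stand-ins for real 768-dimensional embeddings. The idea is that if same-class texts sit close together, their average is a reasonable prototype for the class.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fit_centroids(vectors, labels):
    """Average the vectors of each class into one centroid per label."""
    grouped = {}
    for vec, label in zip(vectors, labels):
        grouped.setdefault(label, []).append(vec)
    return {label: mean_vector(vecs) for label, vecs in grouped.items()}


def predict(vector, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Hypothetical toy "embeddings", for illustration only
toy_vectors = [
    [0.9, 0.1, 0.0],  # positive
    [0.8, 0.2, 0.1],  # positive
    [0.1, 0.9, 0.0],  # negative
    [0.2, 0.8, 0.1],  # negative
]
toy_labels = ["positive", "positive", "negative", "negative"]

centroids = fit_centroids(toy_vectors, toy_labels)
print(predict([0.7, 0.3, 0.0], centroids))
print(predict([0.2, 0.7, 0.2], centroids))
```

With real embeddings, the same approach would average the rows of `embeddings` that belong to each class and compare new texts against those averages.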

### Medium Tasks - Building Real Classifiers

These tasks require more modification and experimentation. You'll build complete classification systems.

#### Medium Task 1: Multi-Class Sentiment Classifier with Custom Categories

**Instructions:**

1. Execute code to see how the 5-level sentiment classifier works
2. Analyze confusion matrix to identify which categories get confused
3. Uncomment TODO to add a 6th sentiment level
4. Write 3 reviews specifically for your new category
5. Compare performance before and after adding the category

In [30]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Test reviews covering different sentiment intensities
test_reviews = [
    "This is the best movie I've ever seen! Absolute masterpiece!",
    "Pretty good movie, I enjoyed it",
    "It was okay, nothing special",
    "Not great, pretty disappointing",
    "Absolutely terrible, worst movie ever",
    "Amazing performances and stunning visuals!",
    "Mediocre at best",
    "Quite bad, wouldn't recommend",
]

# TODO: After first run, add 3 reviews that should fit your new category:
# test_reviews.extend([
#     "Your review 1 here",
#     "Your review 2 here",
#     "Your review 3 here",
# ])

# 5-level sentiment classification
sentiment_labels = [
    "extremely negative sentiment",
    "somewhat negative sentiment",
    "neutral sentiment",
    "somewhat positive sentiment",
    "extremely positive sentiment"
]

# TODO: After analyzing results, uncomment to add 6th category between somewhat and extremely positive:
# sentiment_labels = [
#     "extremely negative sentiment",
#     "somewhat negative sentiment",
#     "neutral sentiment",
#     "somewhat positive sentiment",
#     "very positive sentiment",  # NEW CATEGORY
#     "extremely positive sentiment"
# ]

# Create embeddings
label_embeddings = model.encode(sentiment_labels)
review_embeddings = model.encode(test_reviews)
sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

# Classify
predictions = np.argmax(sim_matrix, axis=1)

print("="*80)
print(f"MULTI-CLASS CLASSIFICATION ({len(sentiment_labels)} categories)")
print("="*80)

for i, review in enumerate(test_reviews):
    predicted_idx = predictions[i]
    predicted_label = sentiment_labels[predicted_idx]
    confidence = sim_matrix[i][predicted_idx]

    # Get top 2 predictions to see confusion
    top2_indices = np.argsort(sim_matrix[i])[-2:][::-1]
    second_best_idx = top2_indices[1]
    second_best_label = sentiment_labels[second_best_idx]
    second_best_score = sim_matrix[i][second_best_idx]
    margin = confidence - second_best_score

    print(f"\nReview {i+1}: '{review[:60]}...'")
    print(f"  1st: {predicted_label:30s} ({confidence:.3f})")
    print(f"  2nd: {second_best_label:30s} ({second_best_score:.3f})")
    print(f"  Margin: {margin:.3f}", end="")

    if margin < 0.05:
        print(" ⚠️  VERY UNCERTAIN - almost tied!")
    elif margin < 0.15:
        print(" ⚠️  Uncertain")
    else:
        print(" ✓ Confident")

# Analyze category confusion
print("\n" + "="*80)
print("CATEGORY CONFUSION ANALYSIS")
print("="*80)
print("How similar are the category descriptions to each other?")
print("(High similarity = easy to confuse)\n")

label_similarity = cosine_similarity(label_embeddings)

print(f"{'Category Pair':<60s} {'Similarity':<12s}")
print("-"*75)

confusions = []
for i in range(len(sentiment_labels)):
    for j in range(i+1, len(sentiment_labels)):
        sim = label_similarity[i][j]
        confusions.append((i, j, sim))

# Sort by similarity (most confusing first)
for i, j, sim in sorted(confusions, key=lambda x: x[2], reverse=True)[:10]:
    pair_name = f"{sentiment_labels[i]} <-> {sentiment_labels[j]}"
    marker = "⚠️ " if sim > 0.7 else ""
    print(f"{marker}{pair_name:<60s} {sim:.3f}")

MULTI-CLASS CLASSIFICATION (5 categories)

Review 1: 'This is the best movie I've ever seen! Absolute masterpiece!...'
  1st: extremely positive sentiment   (0.221)
  2nd: extremely negative sentiment   (0.119)
  Margin: 0.102 ⚠️  Uncertain

Review 2: 'Pretty good movie, I enjoyed it...'
  1st: somewhat positive sentiment    (0.320)
  2nd: extremely positive sentiment   (0.313)
  Margin: 0.007 ⚠️  VERY UNCERTAIN - almost tied!

Review 3: 'It was okay, nothing special...'
  1st: somewhat negative sentiment    (0.411)
  2nd: somewhat positive sentiment    (0.395)
  Margin: 0.016 ⚠️  VERY UNCERTAIN - almost tied!

Review 4: 'Not great, pretty disappointing...'
  1st: somewhat negative sentiment    (0.451)
  2nd: extremely negative sentiment   (0.443)
  Margin: 0.008 ⚠️  VERY UNCERTAIN - almost tied!

Review 5: 'Absolutely terrible, worst movie ever...'
  1st: extremely negative sentiment   (0.442)
  2nd: somewhat negative sentiment    (0.354)
  Margin: 0.087 ⚠️  Uncertain

Review 6: 'Amaz

**Questions:**

1. Which reviews have low margins (<0.10)? What linguistic features do they share? How does multi-class classification differ from binary?

2. Which adjacent categories have highest similarity in the confusion analysis? How could you rewrite label descriptions to create clearer boundaries?

3. After adding your 6th category: Did reviews switch to it? Did the new category create more uncertainty or resolve confusion?
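
For question 3, one way to see which reviews moved is to record the predicted label names before and after editing `sentiment_labels`. The sketch below assumes you saved both runs as lists of label strings. The helper name `prediction_switches` and the sample predictions are hypothetical, not real model output.

```python
def prediction_switches(reviews, before, after):
    """List (review, old_label, new_label) for every review whose label changed."""
    return [
        (review, old, new)
        for review, old, new in zip(reviews, before, after)
        if old != new
    ]


# Hypothetical predictions, for illustration only
reviews = ["review A", "review B", "review C"]
before = [
    "somewhat positive sentiment",
    "extremely positive sentiment",
    "neutral sentiment",
]
after = [
    "somewhat positive sentiment",
    "very positive sentiment",
    "neutral sentiment",
]

switched = prediction_switches(reviews, before, after)
for review, old, new in switched:
    print(f"{review}: {old} -> {new}")
print(f"{len(switched)}/{len(reviews)} reviews changed label")
```

Comparing label strings rather than indices matters here: inserting a category in the middle of `sentiment_labels` shifts the index of every label after it, so equal indices no longer mean equal labels.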

#### Medium Task 2: Classifier Performance with Limited Training Data

**Instructions:**

1. Run the code to compare the task-specific model with the embedding classifier at 1,000 training samples
2. Change `train_size` to 100, 500, 2000, and 5000, re-running the cell after each change
3. Fill in the results table in the TODO section
4. Identify the training size at which the embedding classifier matches the task-specific model

In [31]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report
from datasets import load_dataset
import numpy as np

data = load_dataset("rotten_tomatoes")

# TODO: EXPERIMENT WITH THIS VALUE - Try: 100, 500, 1000, 2000, 5000
train_size = 1000
test_size = 300

train_subset = data["train"].shuffle(seed=42).select(range(min(train_size, len(data["train"]))))
test_subset = data["test"].shuffle(seed=42).select(range(test_size))

print("="*80)
print(f"EXPERIMENT: Training Size = {train_size}")
print("="*80)

# Approach 1: Task-Specific Model (pre-trained for sentiment)
print("\n[1/2] Testing Task-Specific Model...")
print("Note: This model doesn't use our training data - it's already trained!")

task_model = pipeline(
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    return_all_scores=True,
    device=-1
)

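y_pred_task = []
for text in test_subset["text"]:
    # The model has three classes, returned here in label order:
    # index 0 = negative, 1 = neutral, 2 = positive
    output = task_model(text)[0]
    neg_score = output[0]["score"]
    pos_score = output[2]["score"]
    # Rotten Tomatoes labels are binary, so the neutral score is ignored
    y_pred_task.append(1 if pos_score > neg_score else 0)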

task_f1 = f1_score(test_subset["label"], y_pred_task, average='weighted')

print(f"✓ Task-Specific Model F1: {task_f1:.4f}")

# Approach 2: Embedding + Classifier (uses our training data)
print(f"\n[2/2] Training Embedding Classifier on {train_size} samples...")

embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_embeddings = embedding_model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = embedding_model.encode(test_subset["text"], show_progress_bar=False)

clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])

y_pred_embed = clf.predict(test_embeddings)
embed_f1 = f1_score(test_subset["label"], y_pred_embed, average='weighted')

print(f"✓ Embedding Classifier F1: {embed_f1:.4f}")

# Comparison
print("\n" + "="*80)
print("RESULTS SUMMARY")
print("="*80)
print(f"Training samples used: {train_size}")
print(f"\nTask-Specific (pre-trained):  F1 = {task_f1:.4f}")
print(f"Embedding + Classifier:       F1 = {embed_f1:.4f}")
print(f"Difference:                       {embed_f1 - task_f1:+.4f}")

# Treat differences below 0.01 F1 as a tie in either direction
if abs(embed_f1 - task_f1) < 0.01:
    print(f"\n→ Essentially TIED - both perform similarly")
elif embed_f1 > task_f1:
    print(f"\n→ Embedding approach WINS with {train_size} samples!")
else:
    print(f"\n→ Task-specific model wins - embedding needs more data")

# Show some predictions
print("\n" + "="*80)
print("EXAMPLE PREDICTIONS (first 5 test samples)")
print("="*80)

for i in range(5):
    true_label = "Positive" if test_subset["label"][i] == 1 else "Negative"
    task_pred = "Positive" if y_pred_task[i] == 1 else "Negative"
    embed_pred = "Positive" if y_pred_embed[i] == 1 else "Negative"

    task_correct = "✓" if y_pred_task[i] == test_subset["label"][i] else "✗"
    embed_correct = "✓" if y_pred_embed[i] == test_subset["label"][i] else "✗"

    print(f"\n{i+1}. '{test_subset['text'][i][:60]}...'")
    print(f"   True: {true_label}")
    print(f"   Task-Specific: {task_pred} {task_correct}")
    print(f"   Embedding:     {embed_pred} {embed_correct}")

# TODO: RECORD YOUR RESULTS HERE
# After running with different train_size values, fill in this table:
print("\n" + "="*80)
print("YOUR EXPERIMENT RESULTS")
print("="*80)
print("Run the code multiple times with different train_size values and record:")
print()
print("| Train Size | Task F1 | Embedding F1 | Winner      |")
print("|------------|---------|--------------|-------------|")
print("| 100        | ?.????  | ?.????       | ?           |")
print("| 500        | ?.????  | ?.????       | ?           |")
print("| 1000       | ?.????  | ?.????       | ?           |")
print("| 2000       | ?.????  | ?.????       | ?           |")
print("| 5000       | ?.????  | ?.????       | ?           |")
print()
print(f"Current run: | {train_size:<10} | {task_f1:.4f}  | {embed_f1:.4f}       | {'Embed' if embed_f1 > task_f1 else 'Task':<11} |")

EXPERIMENT: Training Size = 1000

[1/2] Testing Task-Specific Model...
Note: This model doesn't use our training data - it's already trained!


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


✓ Task-Specific Model F1: 0.7709

[2/2] Training Embedding Classifier on 1000 samples...
✓ Embedding Classifier F1: 0.8699

RESULTS SUMMARY
Training samples used: 1000

Task-Specific (pre-trained):  F1 = 0.7709
Embedding + Classifier:       F1 = 0.8699
Difference:                       +0.0990

→ Embedding approach WINS with 1000 samples!

EXAMPLE PREDICTIONS (first 5 test samples)

1. 'unpretentious , charming , quirky , original...'
   True: Positive
   Task-Specific: Positive ✓
   Embedding:     Positive ✓

2. 'a film really has to be exceptional to justify a three hour ...'
   True: Negative
   Task-Specific: Negative ✓
   Embedding:     Negative ✓

3. 'working from a surprisingly sensitive script co-written by g...'
   True: Positive
   Task-Specific: Positive ✓
   Embedding:     Positive ✓

4. 'it may not be particularly innovative , but the film's crisp...'
   True: Positive
   Task-Specific: Positive ✓
   Embedding:     Positive ✓

5. 'such a premise is ripe for all manner of l

**Questions:**

1. At what training size did the embedding classifier match or beat the task-specific model? What does this reveal about data requirements?

2. Were there cases where one model was correct and the other wrong? What characteristics did those reviews have?

3. Is `train_size=100` enough labeled data for the embedding classifier? How does that compare with the amount of data needed to train a model from scratch?
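
Question 2 is easier to answer with a breakdown of where the two models agree and disagree. The following is a minimal sketch, assuming two prediction lists aligned with the true labels. The helper names and the sample labels are hypothetical.

```python
from collections import Counter


def agreement_breakdown(y_true, pred_a, pred_b):
    """Count the four outcomes: both right, only A right, only B right, both wrong."""
    counts = Counter()
    for truth, a, b in zip(y_true, pred_a, pred_b):
        if a == truth and b == truth:
            counts["both_correct"] += 1
        elif a == truth:
            counts["only_a_correct"] += 1
        elif b == truth:
            counts["only_b_correct"] += 1
        else:
            counts["both_wrong"] += 1
    return counts


def disagreement_indices(pred_a, pred_b):
    """Indices where the two models predict different labels."""
    return [i for i, (a, b) in enumerate(zip(pred_a, pred_b)) if a != b]


# Hypothetical labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0]
pred_a = [1, 0, 0, 1, 1]  # e.g. task-specific model
pred_b = [1, 0, 1, 0, 1]  # e.g. embedding classifier

breakdown = agreement_breakdown(y_true, pred_a, pred_b)
print(dict(breakdown))
print("Disagreements at:", disagreement_indices(pred_a, pred_b))
```

In the notebook, `y_pred_task`, `y_pred_embed`, and `test_subset["label"]` could be passed in directly, and the returned indices used to look up the matching entries of `test_subset["text"]`.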

#### Medium Task 3: Confidence-Based Classifier with Uncertainty Handling

**Instructions:**

1. Execute code to see classifier handling uncertain predictions with threshold=0.15
2. Analyze which reviews were marked as "uncertain" and why
3. Change `confidence_threshold` to 0.05, then 0.30 to observe trade-offs
4. Uncomment TODO to implement an alternative uncertainty measure
5. Compare which uncertainty measure works better

In [32]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Reviews with varying levels of clarity
test_reviews = [
    "Absolutely fantastic! Best movie ever!",           # Clear positive
    "Pretty good, I liked it",                          # Weak positive
    "It was fine, nothing special",                     # Ambiguous
    "Not bad but not great either",                     # Very ambiguous
    "Quite disappointing",                              # Weak negative
    "Terrible! Complete waste of time!",                # Clear negative
    "The movie had some interesting moments",           # Ambiguous positive
    "Outstanding performances all around!",             # Clear positive
]

# True labels (for evaluation)
y_true = [1, 1, 0, 0, 0, 0, 1, 1]  # 1=positive, 0=negative

labels = ["A negative movie review", "A positive movie review"]

# TODO: EXPERIMENT WITH THIS - Try: 0.05, 0.15, 0.30, 0.50
confidence_threshold = 0.15

label_embeddings = model.encode(labels)
review_embeddings = model.encode(test_reviews)
sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

def calculate_margin(similarities):
    """
    Margin = difference between top two predictions
    Small margin = uncertain (predictions are close)
    """
    sorted_sims = np.sort(similarities)[::-1]
    margin = sorted_sims[0] - sorted_sims[1]
    return margin

# TODO: After first run, uncomment this alternative uncertainty measure:
# def calculate_margin(similarities):
#     """
#     Alternative: Use absolute confidence in top prediction
#     Low confidence = uncertain
#     """
#     max_confidence = np.max(similarities)
#     # Convert to margin-like score (higher = more certain)
#     # If max is 0.6, margin = 0.6 - 0.5 = 0.1 (uncertain)
#     # If max is 0.9, margin = 0.9 - 0.5 = 0.4 (certain)
#     margin = max_confidence - 0.5
#     return margin

# Classify with confidence threshold
results = []
predictions = []
confidences = []

print("="*80)
print(f"CONFIDENCE-BASED CLASSIFICATION (threshold={confidence_threshold})")
print("="*80)

for i, review in enumerate(test_reviews):
    similarities = sim_matrix[i]
    predicted_idx = np.argmax(similarities)
    top_confidence = similarities[predicted_idx]
    margin = calculate_margin(similarities)

    # Decision: predict only if confident enough
    if margin >= confidence_threshold:
        prediction = predicted_idx
        status = "PREDICTED"
        predictions.append(prediction)
    else:
        prediction = None
        status = "UNCERTAIN"
        predictions.append(None)

    true_label = "Positive" if y_true[i] == 1 else "Negative"
    pred_label = labels[predicted_idx] if prediction is not None else "UNCERTAIN"

    print(f"\n{i+1}. '{review}'")
    print(f"   True label: {true_label}")
    print(f"   Prediction: {pred_label}")
    print(f"   Top confidence: {top_confidence:.3f}")
    print(f"   Margin: {margin:.3f} {'✓ Above threshold' if margin >= confidence_threshold else '✗ Below threshold'}")
    print(f"   Status: {status}", end="")

    if prediction is not None:
        correct = prediction == y_true[i]
        print(f" - {'✓ CORRECT' if correct else '✗ INCORRECT'}")
    else:
        print()

    results.append({
        'review': review,
        'true': y_true[i],
        'pred': prediction,
        'margin': margin,
        'status': status
    })

# Calculate metrics
print("\n" + "="*80)
print("PERFORMANCE ANALYSIS")
print("="*80)

made_predictions = [r for r in results if r['pred'] is not None]
uncertain_cases = [r for r in results if r['pred'] is None]
correct_predictions = [r for r in made_predictions if r['pred'] == r['true']]

total = len(results)
n_predicted = len(made_predictions)
n_uncertain = len(uncertain_cases)
n_correct = len(correct_predictions)

coverage = n_predicted / total
accuracy = n_correct / n_predicted if n_predicted > 0 else 0

print(f"\nCoverage: {n_predicted}/{total} = {coverage:.1%}")
print(f"  → Made predictions for {n_predicted} reviews")
print(f"  → Refused to predict on {n_uncertain} reviews")

print(f"\nAccuracy (on predictions made): {n_correct}/{n_predicted} = {accuracy:.1%}")
print(f"  → Of the {n_predicted} predictions, {n_correct} were correct")

print(f"\nTrade-off Analysis:")
print(f"  Threshold = {confidence_threshold}")
print(f"  → Higher threshold = fewer predictions, usually with higher accuracy")
print(f"  → Lower threshold = more predictions, usually with lower accuracy")

# Show which reviews were uncertain
if n_uncertain > 0:
    print(f"\n" + "-"*80)
    print(f"UNCERTAIN CASES (margin < {confidence_threshold}):")
    print("-"*80)
    for r in uncertain_cases:
        print(f"  • '{r['review']}'")
        print(f"    Margin: {r['margin']:.3f} (too close to call)")

# TODO: After experimenting with thresholds, analyze the trade-off
print("\n" + "="*80)
print("EXPERIMENT LOG - Fill this in as you try different thresholds:")
print("="*80)
print("| Threshold | Coverage | Accuracy | Notes                    |")
print("|-----------|----------|----------|--------------------------|")
print("| 0.05      | ??.?%    | ??.?%    | ?                        |")
print("| 0.15      | ??.?%    | ??.?%    | ?                        |")
print("| 0.30      | ??.?%    | ??.?%    | ?                        |")
print("| 0.50      | ??.?%    | ??.?%    | ?                        |")
print()
print(f"Current:    | {confidence_threshold:<9.2f} | {coverage*100:>5.1f}%    | {accuracy*100:>5.1f}%    |")

CONFIDENCE-BASED CLASSIFICATION (threshold=0.15)

1. 'Absolutely fantastic! Best movie ever!'
   True label: Positive
   Prediction: UNCERTAIN
   Top confidence: 0.451
   Margin: 0.092 ✗ Below threshold
   Status: UNCERTAIN

2. 'Pretty good, I liked it'
   True label: Positive
   Prediction: UNCERTAIN
   Top confidence: 0.410
   Margin: 0.018 ✗ Below threshold
   Status: UNCERTAIN

3. 'It was fine, nothing special'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.418
   Margin: 0.051 ✗ Below threshold
   Status: UNCERTAIN

4. 'Not bad but not great either'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.414
   Margin: 0.053 ✗ Below threshold
   Status: UNCERTAIN

5. 'Quite disappointing'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.354
   Margin: 0.080 ✗ Below threshold
   Status: UNCERTAIN

6. 'Terrible! Complete waste of time!'
   True label: Negative
   Prediction: UNCERTAIN
   Top confidence: 0.397
   Margin: 0.121

**Questions:**

1. What do uncertain reviews have in common? Are they using hedging language like "kind of" or "somewhat"?

2. Compare results at threshold=0.05 vs 0.30. Describe the coverage vs accuracy trade-off. When would you want high coverage vs high accuracy?

3. How could you use confidence-based prediction in production? What should a system do when the model is uncertain?
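
Rather than editing `confidence_threshold` by hand for each run, the trade-off in question 2 can be swept in one pass. This is a sketch under assumptions: the helper `coverage_and_accuracy` and the margins and predictions below are hypothetical. Unlike the cell above, it reports accuracy as `None` instead of 0 when no predictions are made.

```python
def coverage_and_accuracy(margins, predictions, y_true, threshold):
    """Return (coverage, accuracy) when predicting only if margin >= threshold."""
    kept = [
        (pred, truth)
        for margin, pred, truth in zip(margins, predictions, y_true)
        if margin >= threshold
    ]
    coverage = len(kept) / len(margins)
    if not kept:
        return coverage, None  # accuracy is undefined with no predictions
    accuracy = sum(pred == truth for pred, truth in kept) / len(kept)
    return coverage, accuracy


# Hypothetical margins and predictions, for illustration only
margins = [0.02, 0.06, 0.10, 0.20, 0.35]
predictions = [1, 0, 1, 0, 1]
y_true = [0, 0, 1, 0, 1]

for threshold in (0.05, 0.15, 0.30, 0.50):
    coverage, accuracy = coverage_and_accuracy(margins, predictions, y_true, threshold)
    acc_text = "n/a" if accuracy is None else f"{accuracy:.1%}"
    print(f"threshold={threshold:.2f}  coverage={coverage:.1%}  accuracy={acc_text}")
```

In the notebook, the margins could be computed once from `sim_matrix` and the raw predictions taken from its row-wise argmax, so each threshold is evaluated on the same scores.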

#### Medium Task 4: Classifier Failure Analysis

**Instructions:**

1. Train classifier and review overall error analysis
2. Study error patterns to understand which reviews failed and why
3. Uncomment TODO to add your own "hard cases" that you predict will fail
4. Test hypotheses: Do sarcastic reviews fail? Short reviews? Mixed sentiment?
5. Propose fixes based on your analysis

In [33]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from datasets import load_dataset
import numpy as np

# Load data
data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Use subset for faster experimentation
train_subset = data["train"].shuffle(seed=42).select(range(1000))
test_subset = data["test"].shuffle(seed=42).select(range(200))

# Train classifier
print("Training classifier on 1000 movie reviews...")
train_embeddings = model.encode(train_subset["text"], show_progress_bar=False)
test_embeddings = model.encode(test_subset["text"], show_progress_bar=False)

clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, train_subset["label"])

# Get predictions
predictions = clf.predict(test_embeddings)
probabilities = clf.predict_proba(test_embeddings)

# Analyze errors
print("\n" + "="*80)
print("ERROR ANALYSIS")
print("="*80)

errors = []
for i in range(len(test_subset)):
    if predictions[i] != test_subset["label"][i]:
        confidence = probabilities[i][predictions[i]]
        errors.append({
            'index': i,
            'text': test_subset["text"][i],
            'true_label': test_subset["label"][i],
            'predicted_label': predictions[i],
            'confidence': confidence,
            'length': len(test_subset["text"][i].split())
        })

total_errors = len(errors)
total_samples = len(test_subset)
accuracy = (total_samples - total_errors) / total_samples

print(f"\nOverall Performance:")
print(f"  Correct: {total_samples - total_errors}/{total_samples} ({accuracy:.1%})")
print(f"  Errors:  {total_errors}/{total_samples} ({total_errors/total_samples:.1%})")

# Categorize errors
false_positives = [e for e in errors if e['predicted_label'] == 1]
false_negatives = [e for e in errors if e['predicted_label'] == 0]

print(f"\nError Types:")
print(f"  False Positives: {len(false_positives)} (predicted positive, actually negative)")
print(f"  False Negatives: {len(false_negatives)} (predicted negative, actually positive)")

# Show high-confidence errors (most surprising)
high_conf_errors = [e for e in errors if e['confidence'] > 0.7]

print(f"\n" + "-"*80)
print(f"HIGH-CONFIDENCE ERRORS (confidence > 0.7)")
print(f"These are the most surprising mistakes:")
print("-"*80)

for i, error in enumerate(high_conf_errors[:5]):
    true_sent = "Positive" if error['true_label'] == 1 else "Negative"
    pred_sent = "Positive" if error['predicted_label'] == 1 else "Negative"

    print(f"\n{i+1}. '{error['text']}'")
    print(f"   True: {true_sent} | Predicted: {pred_sent} | Confidence: {error['confidence']:.3f}")
    print(f"   Length: {error['length']} words")

# Analyze by text length
print(f"\n" + "-"*80)
print("ERROR ANALYSIS BY TEXT LENGTH")
print("-"*80)

error_lengths = [e['length'] for e in errors]
correct_lengths = [len(test_subset["text"][i].split())
                   for i in range(len(test_subset))
                   if predictions[i] == test_subset["label"][i]]

avg_error_length = np.mean(error_lengths) if error_lengths else 0
avg_correct_length = np.mean(correct_lengths) if correct_lengths else 0

print(f"\nAverage length of ERROR reviews: {avg_error_length:.1f} words")
print(f"Average length of CORRECT reviews: {avg_correct_length:.1f} words")

if avg_error_length < avg_correct_length:
    print(f"→ Observation: Errors tend to be SHORTER")
elif avg_error_length > avg_correct_length:
    print(f"→ Observation: Errors tend to be LONGER")
else:
    print(f"→ Observation: No clear length pattern")

# Test edge cases
print("\n" + "="*80)
print("TESTING EDGE CASES")
print("="*80)

edge_cases = [
    ("Sarcastic", "Oh great, another masterpiece. NOT!", 0),
    ("Mixed", "The acting was great but the plot was terrible", 0),
    ("Backhanded", "Not as bad as I expected", 1),
    ("Double negative", "Not unwatchable", 1),
    ("Very short", "Boring", 0),
    ("Ambiguous", "It was a movie", 0),
]

# TODO: After analyzing above errors, add your own test cases:
# edge_cases.extend([
#     ("Your category", "Your test review here", expected_label_0_or_1),
#     ("Another category", "Another test review", expected_label),
# ])

print("\nTesting challenging cases that often fail:")
print("-"*80)

edge_embeddings = model.encode([text for _, text, _ in edge_cases])
edge_predictions = clf.predict(edge_embeddings)
edge_probs = clf.predict_proba(edge_embeddings)

correct_count = 0
for i, (category, text, true_label) in enumerate(edge_cases):
    pred = edge_predictions[i]
    conf = edge_probs[i][pred]
    correct = pred == true_label
    if correct:
        correct_count += 1

    true_sent = "Positive" if true_label == 1 else "Negative"
    pred_sent = "Positive" if pred == 1 else "Negative"

    print(f"\n{category}: '{text}'")
    print(f"  True: {true_sent} | Predicted: {pred_sent} | Confidence: {conf:.3f}")
    print(f"  Result: {'✓ CORRECT' if correct else '✗ WRONG'}")

edge_accuracy = correct_count / len(edge_cases)
print(f"\n" + "-"*80)
print(f"Edge Case Accuracy: {correct_count}/{len(edge_cases)} ({edge_accuracy:.1%})")
print(f"Regular Test Accuracy: {accuracy:.1%}")
print(f"Difference: {accuracy - edge_accuracy:+.1%}")

# Summary and insights
print("\n" + "="*80)
print("KEY INSIGHTS FROM ERROR ANALYSIS")
print("="*80)

print("\n1. Error Distribution:")
print(f"   - False Positives (predicted too optimistic): {len(false_positives)}")
print(f"   - False Negatives (predicted too pessimistic): {len(false_negatives)}")
if len(false_positives) > len(false_negatives):
    print(f"   → Classifier has POSITIVE BIAS")
elif len(false_negatives) > len(false_positives):
    print(f"   → Classifier has NEGATIVE BIAS")

print("\n2. Challenging Cases:")
# Reuse edge_predictions instead of re-encoding every edge case
failing_categories = [cat for (cat, _, true), pred in zip(edge_cases, edge_predictions)
                      if pred != true]
if failing_categories:
    print(f"   The classifier struggles with: {', '.join(failing_categories)}")

print("\n3. Confidence Analysis:")
if high_conf_errors:
    print(f"   Found {len(high_conf_errors)} high-confidence errors")
    print(f"   → The model is 'confidently wrong' on some cases")

print("\n" + "="*80)
print("TODO: Based on your error analysis, propose improvements:")
print("="*80)
print("# Write your observations here:")
print("# 1. What patterns did you notice in the errors?")
print("# 2. Which edge cases failed most?")
print("# 3. How would you improve the classifier?")
print("#    - Better training data?")
print("#    - Different features?")
print("#    - Ensemble approach?")
print("#    - Confidence thresholds?")

Training classifier on 1000 movie reviews...

ERROR ANALYSIS

Overall Performance:
  Correct: 174/200 (87.0%)
  Errors:  26/200 (13.0%)

Error Types:
  False Positives: 11 (predicted positive, actually negative)
  False Negatives: 15 (predicted negative, actually positive)

--------------------------------------------------------------------------------
HIGH-CONFIDENCE ERRORS (confidence > 0.7)
These are the most surprising mistakes:
--------------------------------------------------------------------------------

1. 'an uneasy mix of run-of-the-mill raunchy humor and seemingly sincere personal reflection .'
   True: Negative | Predicted: Positive | Confidence: 0.701
   Length: 13 words

2. 'the stunt work is top-notch ; the dialogue and drama often food-spittingly funny .'
   True: Negative | Predicted: Positive | Confidence: 0.867
   Length: 14 words

3. 'goldmember is funny enough to justify the embarrassment of bringing a barf bag to the moviehouse .'
   True: Positive | Predicted:

**Reflection Questions:**

1. What do high-confidence errors have in common? How does model confidence relate to correctness?

2. Do errors tend to be shorter, longer, or similar length compared to correct predictions? Why might text length affect classification?

3. Which edge cases failed most - sarcasm, mixed sentiment, or double negatives? What aspects of language do embeddings not capture well?
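
To look at question 1 more systematically, predictions can be grouped into confidence bins and the accuracy checked per bin. In a well-calibrated model, accuracy rises with confidence. The sketch below is one possible approach: the helper `accuracy_by_confidence`, the bin edges, and the sample values are all hypothetical.

```python
def accuracy_by_confidence(confidences, correct, edges=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Group predictions into confidence bins and report accuracy per bin.

    Returns a list of (low, high, count, accuracy) tuples for non-empty bins.
    """
    rows = []
    for low, high in zip(edges[:-1], edges[1:]):
        is_last = high == edges[-1]
        in_bin = [
            ok
            for conf, ok in zip(confidences, correct)
            # The last bin also includes its upper edge (confidence == 1.0)
            if low <= conf < high or (is_last and conf == high)
        ]
        if in_bin:
            rows.append((low, high, len(in_bin), sum(in_bin) / len(in_bin)))
    return rows


# Hypothetical confidences and correctness flags, for illustration only
confidences = [0.55, 0.58, 0.65, 0.72, 0.78, 0.91, 0.95, 0.99]
correct = [True, False, True, True, False, True, True, True]

rows = accuracy_by_confidence(confidences, correct)
for low, high, count, accuracy in rows:
    print(f"[{low:.1f}, {high:.1f}]  n={count}  accuracy={accuracy:.1%}")
```

The bins start at 0.5 because, for a binary classifier, the probability of the predicted class is never below 0.5. In the notebook, the inputs could come from `probabilities.max(axis=1)` and a comparison of `predictions` with the true labels.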

### Hard Tasks - Advanced Classification Challenges

These tasks require significant modifications and deeper understanding. Take your time and experiment.

#### Hard Task 1: Hierarchical Multi-Level Classifier

**Instructions:**

1. Execute code to see 2-level hierarchical classifier (sentiment → specific aspect)
2. Analyze whether hierarchy improves accuracy compared to flat classification
3. Uncomment TODO to add a third level of granularity
4. Create test reviews that specifically target your new level
5. Compare 3-level vs 2-level performance
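
Before running the cell, it may help to see how flat and hierarchical decisions can diverge. The sketch below uses hypothetical similarity scores and shortened stand-in label names (such as `"positive/quality"`) rather than the notebook's label strings. It shows that a hierarchical classifier only considers the leaves under the sentiment chosen at Level 1, so a Level 1 mistake cannot be corrected at Level 2.

```python
def argmax_label(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def flat_classify(leaf_scores):
    """Pick the best leaf label directly, ignoring the hierarchy."""
    return argmax_label(leaf_scores)


def hierarchical_classify(level1_scores, leaf_scores, children):
    """Pick a sentiment first, then the best leaf under that sentiment only."""
    parent = argmax_label(level1_scores)
    allowed = {leaf: leaf_scores[leaf] for leaf in children[parent]}
    return parent, argmax_label(allowed)


# Hypothetical hierarchy and scores for one review, for illustration only
children = {
    "negative": ["negative/entertainment", "negative/quality"],
    "positive": ["positive/quality", "positive/entertainment"],
}
level1_scores = {"negative": 0.41, "positive": 0.35}
leaf_scores = {
    "negative/entertainment": 0.30,
    "negative/quality": 0.38,
    "positive/quality": 0.44,
    "positive/entertainment": 0.25,
}

flat_result = flat_classify(leaf_scores)
hier_result = hierarchical_classify(level1_scores, leaf_scores, children)
print("Flat:        ", flat_result)
print("Hierarchical:", " -> ".join(hier_result))
```

Here the flat classifier picks the highest-scoring leaf overall, while the hierarchical one is restricted to the negative branch. Which behavior is preferable depends on how reliable the Level 1 decision is.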

In [35]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Test reviews targeting different aspects
test_reviews = [
    # Positive - Quality focused
    "Brilliant performances and stunning cinematography",
    "Exceptional directing and beautiful visuals",

    # Positive - Entertainment focused
    "So much fun! Had a great time watching",
    "Really entertaining and enjoyable",

    # Negative - Boring
    "Incredibly dull and slow-paced",
    "Boring, nothing happens for two hours",

    # Negative - Quality issues
    "Poor acting and terrible script",
    "Awful production values and bad directing",
]

# TODO: After understanding the hierarchy, add reviews for your third level:
# test_reviews.extend([
#     "The cinematography was breathtaking, truly artistic",  # Quality -> Visual
#     "Powerful performances, especially the lead actor",     # Quality -> Acting
#     "Had me laughing throughout the entire film",           # Entertainment -> Comedy
#     "Edge of my seat thriller, so suspenseful",            # Entertainment -> Excitement
# ])

# Level 1: Broad sentiment
level1_labels = [
    "negative sentiment review",
    "positive sentiment review"
]

# Level 2: Specific aspects
level2_negative = [
    "review criticizing entertainment value and pacing",
    "review criticizing technical quality and production"
]

level2_positive = [
    "review praising technical quality and artistry",
    "review praising entertainment value and enjoyment"
]

# TODO: Uncomment to add Level 3 - more specific categories
# level3_positive_quality = [
#     "review praising visual elements and cinematography",
#     "review praising acting performances and characters"
# ]
#
# level3_positive_entertainment = [
#     "review praising humor and comedy",
#     "review praising excitement and suspense"
# ]
#
# level3_negative_quality = [
#     "review criticizing visual elements and cinematography",
#     "review criticizing acting performances and characters"
# ]
#
# level3_negative_entertainment = [
#     "review criticizing humor attempts and comedy",
#     "review criticizing pacing and excitement"
# ]

def hierarchical_classify_2level(text):
    """Two-level classification: Sentiment -> Aspect"""
    text_embedding = model.encode([text])

    # Level 1: Determine sentiment
    level1_embeddings = model.encode(level1_labels)
    level1_sim = cosine_similarity(text_embedding, level1_embeddings)[0]
    level1_pred = np.argmax(level1_sim)
    level1_conf = level1_sim[level1_pred]

    # Level 2: Determine specific aspect based on Level 1
    if level1_pred == 0:  # Negative
        level2_labels = level2_negative
        sentiment = "Negative"
    else:  # Positive
        level2_labels = level2_positive
        sentiment = "Positive"

    level2_embeddings = model.encode(level2_labels)
    level2_sim = cosine_similarity(text_embedding, level2_embeddings)[0]
    level2_pred = np.argmax(level2_sim)
    level2_conf = level2_sim[level2_pred]

    return {
        'level1_pred': level1_pred,
        'level1_label': level1_labels[level1_pred],
        'level1_conf': level1_conf,
        'level2_pred': level2_pred,
        'level2_label': level2_labels[level2_pred],
        'level2_conf': level2_conf,
        'sentiment': sentiment,
        'path': f"{sentiment} -> {level2_labels[level2_pred]}"
    }

# TODO: Uncomment to implement 3-level classification
# def hierarchical_classify_3level(text):
#     """Three-level classification: Sentiment -> Aspect -> Specific"""
#     # Start with levels 1 and 2
#     result = hierarchical_classify_2level(text)
#
#     text_embedding = model.encode([text])
#
#     # Level 3: Even more specific based on Level 2
#     if result['sentiment'] == "Positive":
#         if result['level2_pred'] == 0:  # Quality
#             level3_labels = level3_positive_quality
#         else:  # Entertainment
#             level3_labels = level3_positive_entertainment
#     else:  # Negative
#         if result['level2_pred'] == 0:  # Entertainment
#             level3_labels = level3_negative_entertainment
#         else:  # Quality
#             level3_labels = level3_negative_quality
#
#     level3_embeddings = model.encode(level3_labels)
#     level3_sim = cosine_similarity(text_embedding, level3_embeddings)[0]
#     level3_pred = np.argmax(level3_sim)
#     level3_conf = level3_sim[level3_pred]
#
#     result['level3_pred'] = level3_pred
#     result['level3_label'] = level3_labels[level3_pred]
#     result['level3_conf'] = level3_conf
#     result['path'] = f"{result['path']} -> {level3_labels[level3_pred]}"
#
#     return result

print("="*80)
print("HIERARCHICAL CLASSIFICATION (2 LEVELS)")
print("="*80)

for i, review in enumerate(test_reviews):
    result = hierarchical_classify_2level(review)

    print(f"\nReview {i+1}: '{review}'")
    print(f"\n  Level 1 (Sentiment):")
    print(f"    → {result['level1_label']}")
    print(f"    → Confidence: {result['level1_conf']:.3f}")

    print(f"\n  Level 2 (Specific Aspect):")
    print(f"    → {result['level2_label']}")
    print(f"    → Confidence: {result['level2_conf']:.3f}")

    print(f"\n  Final Classification Path:")
    print(f"    → {result['path']}")
    print("-"*80)

# Compare with flat classification
print("\n" + "="*80)
print("COMPARISON: Hierarchical vs Flat Classification")
print("="*80)

# Flat: All 4 categories at once
flat_labels = [
    "review criticizing entertainment value and pacing",      # 0
    "review criticizing technical quality and production",    # 1
    "review praising technical quality and artistry",         # 2
    "review praising entertainment value and enjoyment"       # 3
]

flat_embeddings = model.encode(flat_labels)
review_embeddings = model.encode(test_reviews)
flat_sim = cosine_similarity(review_embeddings, flat_embeddings)

print("\nShowing first 3 reviews:")
for i in range(min(3, len(test_reviews))):
    hier_result = hierarchical_classify_2level(test_reviews[i])
    flat_pred = np.argmax(flat_sim[i])
    flat_conf = flat_sim[i][flat_pred]

    print(f"\nReview: '{test_reviews[i][:50]}...'")
    print(f"  Hierarchical: {hier_result['level2_label']}")
    print(f"    → Confidence: {hier_result['level2_conf']:.3f}")
    print(f"  Flat:         {flat_labels[flat_pred]}")
    print(f"    → Confidence: {flat_conf:.3f}")
    print(f"  Confidence Diff: {hier_result['level2_conf'] - flat_conf:+.3f}")

# TODO: After implementing 3-level, uncomment to test it:
# print("\n" + "="*80)
# print("TESTING 3-LEVEL HIERARCHICAL CLASSIFICATION")
# print("="*80)
#
# for i, review in enumerate(test_reviews):
#     result = hierarchical_classify_3level(review)
#     print(f"\n{i+1}. '{review[:60]}...'")
#     print(f"   Path: {result['path']}")
#     print(f"   Level 3 confidence: {result['level3_conf']:.3f}")

HIERARCHICAL CLASSIFICATION (2 LEVELS)

Review 1: 'Brilliant performances and stunning cinematography'

  Level 1 (Sentiment):
    → positive sentiment review
    → Confidence: 0.224

  Level 2 (Specific Aspect):
    → review praising technical quality and artistry
    → Confidence: 0.386

  Final Classification Path:
    → Positive -> review praising technical quality and artistry
--------------------------------------------------------------------------------

Review 2: 'Exceptional directing and beautiful visuals'

  Level 1 (Sentiment):
    → positive sentiment review
    → Confidence: 0.241

  Level 2 (Specific Aspect):
    → review praising technical quality and artistry
    → Confidence: 0.426

  Final Classification Path:
    → Positive -> review praising technical quality and artistry
--------------------------------------------------------------------------------

Review 3: 'So much fun! Had a great time watching'

  Level 1 (Sentiment):
    → positive sentiment review
    → 

**Questions:**

1. Compare confidence scores for hierarchical vs flat classification. Why might breaking decisions into steps help with confidence?

2. Could the classifier make the right Level 2 decision even if Level 1 was wrong? What does this reveal about error propagation?

3. After implementing 3 levels: Did added granularity help or hurt? Are Level 3 confidence scores lower than Level 2?
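
For Questions 1 and 2, it may help to notice that the hierarchical and flat classifiers in the cell above score a review against the same label embeddings. When both pick the same label, the reported "confidence" (a cosine similarity) is the same number. The hierarchy only changes the outcome when Level 1 rules out the branch that contains the flat winner. The sketch below illustrates both cases with hypothetical 2-D vectors standing in for `model.encode(...)` output; the vectors and function names are assumptions made for illustration.

```python
import numpy as np

def cosine_to_rows(x, rows):
    """Cosine similarity between vector x and every row of a matrix."""
    rows = np.asarray(rows, dtype=float)
    x = np.asarray(x, dtype=float)
    return rows @ x / (np.linalg.norm(rows, axis=1) * np.linalg.norm(x))

# Hypothetical 2-D "embeddings" (not real model output)
level1 = [[-1.0, 0.0], [1.0, 0.0]]          # 0 = negative, 1 = positive
level2 = {
    0: [[-0.8, 0.6], [-0.8, -0.6]],         # leaves under negative
    1: [[0.2, 0.98], [0.8, -0.6]],          # leaves under positive
}
flat = level2[0] + level2[1]                # same four leaves, no hierarchy

def hierarchical(x):
    """Pick a branch first, then the best leaf inside that branch only."""
    branch = int(np.argmax(cosine_to_rows(x, level1)))
    sims = cosine_to_rows(x, level2[branch])
    leaf = int(np.argmax(sims))
    return 2 * branch + leaf, float(sims[leaf])

def flat_classify(x):
    """Pick the best leaf among all four at once."""
    sims = cosine_to_rows(x, flat)
    leaf = int(np.argmax(sims))
    return leaf, float(sims[leaf])

clear_review = [0.9, -0.5]       # clearly on the positive side
borderline_review = [-0.1, 1.0]  # only slightly negative at Level 1

for name, x in [("clear", clear_review), ("borderline", borderline_review)]:
    h_leaf, h_sim = hierarchical(x)
    f_leaf, f_sim = flat_classify(x)
    print(f"{name}: hierarchical leaf={h_leaf} ({h_sim:.3f}), flat leaf={f_leaf} ({f_sim:.3f})")
```

In the borderline case the Level 1 choice locks the classifier into the negative branch, so the best-matching leaf (which sits under positive) is never considered. That is the error-propagation effect Question 2 points at.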

#### Hard Task 2: Active Learning to Minimize Labeling

**Instructions:**

1. Execute code to see active learning selecting uncertain examples vs random selection
2. Study what makes the selected samples uncertain
3. Uncomment TODO to implement a different selection strategy
4. Compare which strategy reaches high F1 score faster
5. Fill in results table and analyze learning curves
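
Before comparing strategies in steps 3 and 4, it is worth checking how different they really are. The sketch below implements three common uncertainty scores (least confidence, margin, and entropy) on a plain probability array. The function name `select_uncertain` and the toy probabilities are assumptions for illustration, not notebook output. For a two-class problem like this one, all three scores are monotone functions of the maximum probability, so they rank samples identically; they only diverge with three or more classes.

```python
import numpy as np

def select_uncertain(probs, n_samples, strategy="least_confidence"):
    """Return indices of the n_samples most uncertain rows of an (n, n_classes) array."""
    probs = np.asarray(probs, dtype=float)
    if strategy == "least_confidence":
        scores = 1.0 - probs.max(axis=1)
    elif strategy == "margin":
        top_two = np.sort(probs, axis=1)[:, -2:]
        scores = -(top_two[:, 1] - top_two[:, 0])  # small margin = high score
    elif strategy == "entropy":
        clipped = np.clip(probs, 1e-12, 1.0)       # avoid log(0)
        scores = -(clipped * np.log(clipped)).sum(axis=1)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.argsort(scores)[::-1][:n_samples]    # highest score first

# Toy binary predictions (hypothetical values)
probs = [[0.90, 0.10], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
for strategy in ("least_confidence", "margin", "entropy"):
    print(strategy, select_uncertain(probs, 2, strategy).tolist())
```

To see a real difference on this binary dataset, a strategy would need to use information beyond the predicted probability, for example the diversity of the selected batch.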

In [36]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from datasets import load_dataset
import numpy as np

data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# Prepare datasets
pool_size = 2000
test_size = 300

train_pool = data["train"].shuffle(seed=42).select(range(pool_size))
test_set = data["test"].shuffle(seed=42).select(range(test_size))

# Generate embeddings upfront (faster)
print("Generating embeddings for 2000 training pool and 300 test samples...")
pool_embeddings = model.encode(train_pool["text"], show_progress_bar=True)
test_embeddings = model.encode(test_set["text"], show_progress_bar=False)
print("✓ Embeddings ready\n")

def uncertainty_sampling(clf, unlabeled_embeddings, n_samples=50):
    """
    Select samples where model is most uncertain.
    Strategy: Pick samples with lowest confidence (closest to 50-50)
    """
    probs = clf.predict_proba(unlabeled_embeddings)
    # Uncertainty = 1 - max(prob) = how close to 50-50 the prediction is
    uncertainties = 1 - np.max(probs, axis=1)

    # Get indices of most uncertain samples
    most_uncertain_indices = np.argsort(uncertainties)[-n_samples:]
    return most_uncertain_indices

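# TODO: Uncomment to implement alternative selection strategy
# (this redefines uncertainty_sampling, so the simulation below picks it up)
# def uncertainty_sampling(clf, unlabeled_embeddings, n_samples=50):
#     """
#     Alternative strategy: Margin sampling
#     Select samples where top two predictions are closest.
#     Note: with only two classes, margin = 2 * max(prob) - 1, so this selects
#     the same samples as 1 - max(prob) (up to ties); the two strategies only
#     differ for 3+ classes.
#     """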
#     probs = clf.predict_proba(unlabeled_embeddings)
#     # Sort probabilities for each sample
#     sorted_probs = np.sort(probs, axis=1)
#     # Margin = difference between top two
#     margins = sorted_probs[:, -1] - sorted_probs[:, -2]
#
#     # Get indices of smallest margins (most uncertain)
#     most_uncertain_indices = np.argsort(margins)[:n_samples]
#     return most_uncertain_indices

def random_sampling(n_available, n_samples=50):
    """Baseline: Random selection"""
    return np.random.choice(n_available, size=min(n_samples, n_available), replace=False)

# Active Learning Simulation
print("="*80)
print("ACTIVE LEARNING SIMULATION")
print("="*80)
print("Strategy: Start with 100 labeled, then iteratively add 50 most uncertain samples")
print("Compare: Active Learning vs Random Sampling\n")

# Configuration
initial_size = 100
samples_per_iteration = 50
n_iterations = 10
np.random.seed(42)  # make the random-sampling baseline reproducible

# Initialize
labeled_indices = list(range(initial_size))
unlabeled_indices = list(range(initial_size, pool_size))

active_scores = []
random_scores = []
iteration_labeled_sizes = []

for iteration in range(n_iterations):
    current_size = len(labeled_indices)
    iteration_labeled_sizes.append(current_size)

    print(f"{'='*80}")
    print(f"Iteration {iteration + 1}/{n_iterations} - Labeled samples: {current_size}")
    print('='*80)

    # Get current labeled data
    labeled_embeddings = pool_embeddings[labeled_indices]
    labeled_labels = [train_pool["label"][i] for i in labeled_indices]

    # Train classifier
    clf = LogisticRegression(random_state=42, max_iter=1000)
    clf.fit(labeled_embeddings, labeled_labels)

    # Evaluate
    test_pred = clf.predict(test_embeddings)
    active_f1 = f1_score(test_set["label"], test_pred, average='weighted')
    active_scores.append(active_f1)

    print(f"Active Learning F1: {active_f1:.4f}")

    # Compare with random sampling (same number of samples)
    random_indices = list(range(initial_size)) + list(
        np.random.choice(range(initial_size, pool_size),
                        size=min(current_size - initial_size, pool_size - initial_size),
                        replace=False)
    )
    random_embeddings = pool_embeddings[random_indices]
    random_labels = [train_pool["label"][i] for i in random_indices]

    clf_random = LogisticRegression(random_state=42, max_iter=1000)
    clf_random.fit(random_embeddings, random_labels)
    random_pred = clf_random.predict(test_embeddings)
    random_f1 = f1_score(test_set["label"], random_pred, average='weighted')
    random_scores.append(random_f1)

    print(f"Random Sampling F1:  {random_f1:.4f}")
    print(f"Improvement:         {active_f1 - random_f1:+.4f}")

    # Select next batch using active learning
    if len(unlabeled_indices) < samples_per_iteration:
        print(f"\n✓ Stopping: Only {len(unlabeled_indices)} samples left")
        break

    unlabeled_embeddings = pool_embeddings[unlabeled_indices]
    uncertain_local_indices = uncertainty_sampling(clf, unlabeled_embeddings, samples_per_iteration)

    # Convert to global indices
    newly_labeled = [unlabeled_indices[i] for i in uncertain_local_indices]

    # Show examples of selected samples
    print(f"\nExamples of selected UNCERTAIN samples:")
    for i, idx in enumerate(newly_labeled[:3]):
        probs = clf.predict_proba(pool_embeddings[idx].reshape(1, -1))[0]
        uncertainty = 1 - np.max(probs)
        print(f"  {i+1}. '{train_pool['text'][idx][:60]}...'")
        print(f"     Uncertainty: {uncertainty:.3f}")
        print(f"     Probs: [neg={probs[0]:.3f}, pos={probs[1]:.3f}]")

    # Update sets (a set makes the membership check fast)
    labeled_indices.extend(newly_labeled)
    newly_labeled_set = set(newly_labeled)
    unlabeled_indices = [idx for idx in unlabeled_indices if idx not in newly_labeled_set]
    print()

# Results Summary
print("="*80)
print("FINAL RESULTS - LEARNING CURVES")
print("="*80)

print(f"\n{'Labeled':<10s} {'Active F1':<12s} {'Random F1':<12s} {'Difference':<12s}")
print("-"*50)
for size, active, random in zip(iteration_labeled_sizes, active_scores, random_scores):
    diff = active - random
    marker = "  ✓" if diff > 0.01 else ""
    print(f"{size:<10d} {active:.4f}       {random:.4f}       {diff:+.4f}{marker}")

avg_improvement = np.mean(np.array(active_scores) - np.array(random_scores))
print(f"\nAverage Improvement: {avg_improvement:+.4f}")

# Find when active learning reaches target F1
target_f1 = 0.85
active_reached = next((size for size, f1 in zip(iteration_labeled_sizes, active_scores) if f1 >= target_f1), None)
random_reached = next((size for size, f1 in zip(iteration_labeled_sizes, random_scores) if f1 >= target_f1), None)

if active_reached or random_reached:
    print(f"\nTo reach F1={target_f1}:")
    if active_reached:
        print(f"  Active Learning: {active_reached} labeled samples")
    else:
        print(f"  Active Learning: Did not reach {target_f1}")

    if random_reached:
        print(f"  Random Sampling: {random_reached} labeled samples")
    else:
        print(f"  Random Sampling: Did not reach {target_f1}")

    if active_reached and random_reached:
        savings = random_reached - active_reached
        print(f"  → Active Learning saved {savings} labeled samples ({savings/random_reached:.1%})")

print("\n" + "="*80)
print("TODO: Try different selection strategies and record results:")
print("="*80)
print("| Strategy              | Samples to F1=0.85 | Avg Improvement | Notes |")
print("|----------------------|-------------------|-----------------|-------|")
print("| Uncertainty (1-max)  | ???               | ???             |       |")
print("| Margin sampling      | ???               | ???             |       |")
print("| YOUR STRATEGY        | ???               | ???             |       |")

Generating embeddings for 2000 training pool and 300 test samples...



✓ Embeddings ready

ACTIVE LEARNING SIMULATION
Strategy: Start with 100 labeled, then iteratively add 50 most uncertain samples
Compare: Active Learning vs Random Sampling

Iteration 1/10 - Labeled samples: 100
Active Learning F1: 0.7535
Random Sampling F1:  0.7535
Improvement:         +0.0000

Examples of selected UNCERTAIN samples:
  1. 'it uses the pain and violence of war as background material ...'
     Uncertainty: 0.495
     Probs: [neg=0.505, pos=0.495]
  2. '. . . the tale of her passionate , tumultuous affair with mu...'
     Uncertainty: 0.495
     Probs: [neg=0.505, pos=0.495]
  3. 'inventive , fun , intoxicatingly sexy , violent , self-indul...'
     Uncertainty: 0.495
     Probs: [neg=0.505, pos=0.495]

Iteration 2/10 - Labeled samples: 150
Active Learning F1: 0.8023
Random Sampling F1:  0.8222
Improvement:         -0.0199

Examples of selected UNCERTAIN samples:
  1. 'a vile , incoherent mess . . . a scummy ripoff of david cron...'
     Uncertainty: 0.492
     Probs: [ne

**Questions:**

1. What makes the uncertain samples uncertain? Are they using hedging language, mixed sentiment, or ambiguous wording?

2. At what point did active learning pull ahead of random sampling? How much can active learning reduce labeling costs?

3. Why are samples with probabilities like [0.52, 0.48] more valuable for training than confident samples?
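
When answering Question 2, keep in mind that the random baseline in the cell above is a single random draw per iteration, so one lucky or unlucky subset can flip the comparison (in the output above, random sampling is ahead by 0.0199 at 150 labeled samples). A steadier baseline averages several draws. The sketch below shows the idea using only the standard library. `random_baseline_curve` and `toy_score` are hypothetical names, and `toy_score` is a placeholder for "train a classifier on this subset and return its test F1".

```python
import random
import statistics

def random_baseline_curve(pool_indices, sizes, score_fn, seeds=(0, 1, 2, 3, 4)):
    """For each labeled-set size, score several random subsets; return {size: (mean, stdev)}."""
    curve = {}
    for size in sizes:
        scores = []
        for seed in seeds:
            rng = random.Random(seed)           # independent, reproducible draw
            subset = rng.sample(pool_indices, size)
            scores.append(score_fn(subset))
        curve[size] = (statistics.mean(scores), statistics.stdev(scores))
    return curve

def toy_score(subset):
    """Hypothetical stand-in for 'train on subset, return test F1': share of even indices."""
    return sum(i % 2 == 0 for i in subset) / len(subset)

curve = random_baseline_curve(list(range(2000)), sizes=[100, 150, 200], score_fn=toy_score)
for size, (mean, spread) in curve.items():
    print(f"{size} labeled: {mean:.3f} +/- {spread:.3f}")
```

If the gap between active learning and the averaged baseline is smaller than the baseline's spread, the difference at that labeled-set size is probably noise.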

#### Hard Task 3: Ensemble Classifier

**Instructions:**

1. Execute code to see three models performing individually and as an ensemble
2. Analyze when models disagree and which model is correct most often
3. Uncomment TODO to add a fourth model to the ensemble
4. Uncomment TODO to implement weighted voting based on validation performance
5. Compare ensemble methods to determine which works best
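
One thing to watch for in step 3: with four models, a 2-2 split becomes possible. The majority vote in the cell below uses `np.bincount(x).argmax()`, and `argmax` returns the first maximum, so every tie is silently resolved as class 0 (negative). The sketch below reproduces that behavior on toy votes and shows one possible alternative, deferring to a designated model on ties. The function names and the vote matrix are assumptions for illustration.

```python
import numpy as np

def majority_vote(votes):
    """votes: (n_models, n_samples) array of 0/1 predictions. Ties go to class 0."""
    return np.apply_along_axis(lambda x: np.bincount(x, minlength=2).argmax(), 0, votes)

def majority_vote_with_tiebreak(votes, tiebreak):
    """Like majority_vote, but exact ties take the prediction from `tiebreak`."""
    votes = np.asarray(votes)
    n_models = votes.shape[0]
    positives = votes.sum(axis=0)
    result = (2 * positives > n_models).astype(int)
    ties = 2 * positives == n_models
    result[ties] = np.asarray(tiebreak)[ties]
    return result

# Four hypothetical models, three samples; the first sample is a 2-2 tie
votes = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
])
print("plain majority:   ", majority_vote(votes).tolist())
print("tie -> first model:", majority_vote_with_tiebreak(votes, votes[0]).tolist())
```

Which model should break ties is a design choice; using the model with the best validation score is one reasonable option.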

In [37]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
from datasets import load_dataset
import numpy as np

data = load_dataset("rotten_tomatoes")

# Configuration
train_size = 1500
test_size = 300

train_subset = data["train"].shuffle(seed=42).select(range(train_size))
test_subset = data["test"].shuffle(seed=42).select(range(test_size))

print("="*80)
print("BUILDING ENSEMBLE OF CLASSIFIERS")
print("="*80)

# Model 1: Task-Specific (Twitter RoBERTa)
print("\n[1/3] Loading Task-Specific Model (Twitter RoBERTa)...")
task_model = pipeline(
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest",
    return_all_scores=True,
    device=-1
)

# Model 2: Embedding + Logistic Regression
print("[2/3] Training Embedding Classifier...")
embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

train_embeddings = embedding_model.encode(train_subset["text"], show_progress_bar=True)
test_embeddings = embedding_model.encode(test_subset["text"], show_progress_bar=False)

clf_embedding = LogisticRegression(random_state=42, max_iter=1000)
clf_embedding.fit(train_embeddings, train_subset["label"])

# Model 3: Zero-Shot
print("[3/3] Setting up Zero-Shot Classifier...")
zero_shot_labels = ["A very negative movie review", "A very positive movie review"]
zero_shot_label_embeddings = embedding_model.encode(zero_shot_labels)

# TODO: Uncomment to add Model 4 - Different embedding model
# print("[4/4] Training with alternative embedding model...")
# embedding_model_alt = SentenceTransformer('all-MiniLM-L6-v2')  # Smaller, faster
# train_embeddings_alt = embedding_model_alt.encode(train_subset["text"], show_progress_bar=True)
# test_embeddings_alt = embedding_model_alt.encode(test_subset["text"], show_progress_bar=False)
# clf_embedding_alt = LogisticRegression(random_state=42, max_iter=1000)
# clf_embedding_alt.fit(train_embeddings_alt, train_subset["label"])

print("\n✓ All models ready")

# Get predictions from all models
print("\n" + "="*80)
print("GENERATING PREDICTIONS")
print("="*80)

print("\nModel 1: Task-Specific...")
pred_task = []
conf_task = []
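for text in test_subset["text"]:
    # Scores come back in label order: 0 = negative, 1 = neutral, 2 = positive.
    # The neutral score is ignored to force a binary decision.
    output = task_model(text)[0]
    neg_score = output[0]["score"]
    pos_score = output[2]["score"]
    pred_task.append(1 if pos_score > neg_score else 0)
    conf_task.append(max(neg_score, pos_score))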

print("Model 2: Embedding Classifier...")
pred_embedding = clf_embedding.predict(test_embeddings)
conf_embedding = np.max(clf_embedding.predict_proba(test_embeddings), axis=1)

print("Model 3: Zero-Shot...")
from sklearn.metrics.pairwise import cosine_similarity
zero_shot_sim = cosine_similarity(test_embeddings, zero_shot_label_embeddings)
pred_zero_shot = np.argmax(zero_shot_sim, axis=1)
conf_zero_shot = np.max(zero_shot_sim, axis=1)

# TODO: Uncomment if you added Model 4
# print("Model 4: Alternative Embedding...")
# pred_alt = clf_embedding_alt.predict(test_embeddings_alt)
# conf_alt = np.max(clf_embedding_alt.predict_proba(test_embeddings_alt), axis=1)

# Evaluate individual models
print("\n" + "="*80)
print("INDIVIDUAL MODEL PERFORMANCE")
print("="*80)

models = [
    ("Task-Specific", pred_task),
    ("Embedding + LR", pred_embedding),
    ("Zero-Shot", pred_zero_shot),
]

# TODO: Uncomment if Model 4 added
# models.append(("Alternative Embedding", pred_alt))

individual_scores = []
for name, predictions in models:
    f1 = f1_score(test_subset["label"], predictions, average='weighted')
    acc = accuracy_score(test_subset["label"], predictions)
    individual_scores.append(f1)
    print(f"\n{name}:")
    print(f"  F1 Score:  {f1:.4f}")
    print(f"  Accuracy:  {acc:.4f}")

# Ensemble Methods
print("\n" + "="*80)
print("ENSEMBLE METHODS")
print("="*80)

# Method 1: Simple Majority Voting
ensemble_votes = np.array([pred_task, pred_embedding, pred_zero_shot])
# TODO: Add Model 4 if available
# ensemble_votes = np.array([pred_task, pred_embedding, pred_zero_shot, pred_alt])

pred_majority = np.apply_along_axis(lambda x: np.bincount(x).argmax(), 0, ensemble_votes)

maj_f1 = f1_score(test_subset["label"], pred_majority, average='weighted')
maj_acc = accuracy_score(test_subset["label"], pred_majority)

print(f"\n1. Simple Majority Voting:")
print(f"   F1 Score:  {maj_f1:.4f}")
print(f"   Accuracy:  {maj_acc:.4f}")

# Method 2: Confidence-Weighted Voting
weights = np.array([conf_task, conf_embedding, conf_zero_shot])
# TODO: Add Model 4 weights if available
# weights = np.array([conf_task, conf_embedding, conf_zero_shot, conf_alt])

weighted_votes = np.zeros((len(test_subset), 2))
for i in range(len(test_subset)):
    for model_idx in range(len(models)):
        vote = ensemble_votes[model_idx, i]
        weight = weights[model_idx, i]
        weighted_votes[i, vote] += weight

pred_weighted = np.argmax(weighted_votes, axis=1)

weight_f1 = f1_score(test_subset["label"], pred_weighted, average='weighted')
weight_acc = accuracy_score(test_subset["label"], pred_weighted)

print(f"\n2. Confidence-Weighted Voting:")
print(f"   F1 Score:  {weight_f1:.4f}")
print(f"   Accuracy:  {weight_acc:.4f}")

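# TODO: Uncomment to implement Method 3: Performance-Weighted Voting
# Method 3: Weight models by their F1 scores
# NOTE: individual_scores were computed on the test set, so using them as
# weights leaks test labels. For an honest comparison, compute the weights
# on a held-out split such as data["validation"].
# print(f"\n3. Performance-Weighted Voting:")
# model_weights = np.array(individual_scores)  # Use F1 scores as weights
# model_weights = model_weights / model_weights.sum()  # Normalize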
#
# perf_weighted_votes = np.zeros((len(test_subset), 2))
# for i in range(len(test_subset)):
#     for model_idx in range(len(models)):
#         vote = ensemble_votes[model_idx, i]
#         weight = model_weights[model_idx]
#         perf_weighted_votes[i, vote] += weight
#
# pred_perf_weighted = np.argmax(perf_weighted_votes, axis=1)
# perf_f1 = f1_score(test_subset["label"], pred_perf_weighted, average='weighted')
# perf_acc = accuracy_score(test_subset["label"], pred_perf_weighted)
# print(f"   F1 Score:  {perf_f1:.4f}")
# print(f"   Accuracy:  {perf_acc:.4f}")

# Comparison Table
print("\n" + "="*80)
print("PERFORMANCE COMPARISON")
print("="*80)

results = [
    ("Task-Specific (Model 1)", individual_scores[0]),
    ("Embedding (Model 2)", individual_scores[1]),
    ("Zero-Shot (Model 3)", individual_scores[2]),
    ("─" * 30, None),
    ("Ensemble: Majority Vote", maj_f1),
    ("Ensemble: Confidence-Weighted", weight_f1),
]

# TODO: Add Model 4 and performance-weighted if implemented
# results.insert(3, ("Alternative Embedding (Model 4)", individual_scores[3]))
# results.append(("Ensemble: Performance-Weighted", perf_f1))

best_individual = max(individual_scores)

print(f"\n{'Method':<35s} {'F1 Score':<12s} {'vs Best Individual':<20s}")
print("-"*70)
for name, score in results:
    if score is None:
        print(name)
    else:
        diff = score - best_individual
        improvement = "✓" if diff > 0.001 else ""
        print(f"{name:<35s} {score:.4f}       {diff:+.4f}  {improvement}")

# Analyze disagreements
print("\n" + "="*80)
print("ANALYZING MODEL DISAGREEMENTS")
print("="*80)

disagreements = []
unanimous_correct = 0
unanimous_wrong = 0

for i in range(len(test_subset)):
    votes = ensemble_votes[:, i]
    unique_votes = len(set(votes))

    if unique_votes > 1:  # Disagreement
        disagreements.append({
            'index': i,
            'text': test_subset["text"][i],
            'true': test_subset["label"][i],
            'votes': votes,
            'ensemble': pred_majority[i],
            'models': [models[j][0] for j in range(len(models))]
        })
    else:  # Unanimous
        if votes[0] == test_subset["label"][i]:
            unanimous_correct += 1
        else:
            unanimous_wrong += 1

print(f"\nVoting Patterns:")
print(f"  Unanimous Correct: {unanimous_correct} ({100*unanimous_correct/len(test_subset):.1f}%)")
print(f"  Unanimous Wrong:   {unanimous_wrong} ({100*unanimous_wrong/len(test_subset):.1f}%)")
print(f"  Disagreements:     {len(disagreements)} ({100*len(disagreements)/len(test_subset):.1f}%)")

print(f"\n" + "-"*80)
print(f"Examples of Disagreements (first 5):")
print("-"*80)

for i, case in enumerate(disagreements[:5]):
    true_label = "Positive" if case['true'] == 1 else "Negative"
    ensemble_label = "Positive" if case['ensemble'] == 1 else "Negative"
    ensemble_correct = "✓" if case['ensemble'] == case['true'] else "✗"

    print(f"\n{i+1}. '{case['text'][:60]}...'")
    print(f"   True: {true_label}")

    for j, model_name in enumerate(case['models']):
        vote_label = "Positive" if case['votes'][j] == 1 else "Negative"
        vote_correct = "✓" if case['votes'][j] == case['true'] else "✗"
        print(f"   {model_name:20s}: {vote_label:8s} {vote_correct}")

    print(f"   Ensemble Decision:   {ensemble_label:8s} {ensemble_correct}")

BUILDING ENSEMBLE OF CLASSIFIERS

[1/3] Loading Task-Specific Model (Twitter RoBERTa)...


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


[2/3] Training Embedding Classifier...



[3/3] Setting up Zero-Shot Classifier...

✓ All models ready

GENERATING PREDICTIONS

Model 1: Task-Specific...
Model 2: Embedding Classifier...
Model 3: Zero-Shot...

INDIVIDUAL MODEL PERFORMANCE

Task-Specific:
  F1 Score:  0.7709
  Accuracy:  0.7733

Embedding + LR:
  F1 Score:  0.8699
  Accuracy:  0.8700

Zero-Shot:
  F1 Score:  0.8255
  Accuracy:  0.8267

ENSEMBLE METHODS

1. Simple Majority Voting:
   F1 Score:  0.8667
   Accuracy:  0.8667

2. Confidence-Weighted Voting:
   F1 Score:  0.8632
   Accuracy:  0.8633

PERFORMANCE COMPARISON

Method                              F1 Score     vs Best Individual  
----------------------------------------------------------------------
Task-Specific (Model 1)             0.7709       -0.0990  
Embedding (Model 2)                 0.8699       +0.0000  
Zero-Shot (Model 3)                 0.8255       -0.0444  
──────────────────────────────
Ensemble: Majority Vote             0.8667       -0.0032  
Ensemble: Confidence-Weighted       0.8632 

**Questions:**

1. Did the ensemble beat the best individual model? What does this reveal about combining complementary strengths?

2. When models disagree, which one is usually correct? Are there patterns showing when ensembles add value?

3. Compare simple majority voting vs confidence-weighted voting. Which performed better? When might confidence be misleading?
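
One reason confidence can mislead in Question 3: the three models report confidence on different scales. The task-specific model and the logistic regression output probabilities, while the zero-shot model reports a raw cosine similarity, which is often well below 1 (the Hard Task 1 output above shows values around 0.2 to 0.4). In confidence-weighted voting this gives the zero-shot model less say regardless of how sure it is. The sketch below shows one possible way to put similarities on a probability-like scale with a softmax. The function name and the `temperature` value are assumptions, and the temperature would need tuning on held-out data.

```python
import numpy as np

def softmax_confidence(similarities, temperature=0.05):
    """Map per-class cosine similarities (n_samples, n_classes) to a max-softmax score."""
    z = np.asarray(similarities, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return p.max(axis=1)

# Hypothetical similarities to the [negative, positive] label embeddings
sims = np.array([[0.30, 0.20], [0.25, 0.25]])
print("raw confidence:     ", sims.max(axis=1))
print("rescaled confidence:", softmax_confidence(sims).round(3))
```

After rescaling, a review that is equally similar to both labels gets a confidence of 0.5, matching what a probability-based model would report for a coin flip.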

#### Hard Task 4: Cross-Domain Transfer Learning

**Instructions:**

1. Train on movie reviews and test on restaurant/product/book reviews
2. Study performance drops to identify which domains transfer well
3. Uncomment TODO to test reviews from a domain you choose
4. Uncomment TODO to try few-shot domain adaptation
5. Compare which domains need more adaptation and why
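
When comparing domains in steps 2 and 5, note that each target domain below has only 10 reviews, so a single prediction shifts accuracy by 10 percentage points. A confidence interval makes that uncertainty visible. The sketch below computes a Wilson score interval for accuracy using only the standard library. It is an illustration added here rather than part of the notebook, and it applies to accuracy, not directly to the weighted F1 the cell reports.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

for correct in (6, 8, 10):
    low, high = wilson_interval(correct, 10)
    print(f"{correct}/10 correct -> accuracy plausibly between {low:.2f} and {high:.2f}")
```

With intervals this wide, a difference of one or two correct predictions between domains is weak evidence that one domain transfers better than another; adding more examples per domain would tighten the comparison.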

In [38]:
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
from datasets import load_dataset
import numpy as np

# Source domain: Movie reviews
movie_data = load_dataset("rotten_tomatoes")
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

movie_train = movie_data["train"].shuffle(seed=42).select(range(2000))
movie_test = movie_data["test"].shuffle(seed=42).select(range(200))

# Target domains with labeled examples
restaurant_reviews = {
    'text': [
        "Amazing food and excellent service!",
        "Best restaurant in town, highly recommend",
        "Delicious meals and great atmosphere",
        "Outstanding cuisine and friendly staff",
        "Terrible food, very disappointing",
        "Awful service and poor quality",
        "Not worth the money, mediocre at best",
        "Disgusting food and rude waiters",
        "The pasta was okay but nothing special",
        "Decent place for a quick meal"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

product_reviews = {
    'text': [
        "This product is amazing! Works perfectly",
        "Excellent quality, very satisfied",
        "Great value for money, highly recommend",
        "Perfect! Exactly what I needed",
        "Terrible product, broke immediately",
        "Waste of money, very poor quality",
        "Doesn't work as advertised, disappointed",
        "Awful, don't buy this",
        "It's okay, does the job",
        "Average product, nothing special"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

book_reviews = {
    'text': [
        "Brilliant book! Couldn't put it down",
        "Masterfully written, highly engaging",
        "One of the best books I've read",
        "Fantastic story and great characters",
        "Boring and poorly written",
        "Terrible book, waste of time",
        "Disappointing, not worth reading",
        "Awful plot and weak characters",
        "Decent read but nothing groundbreaking",
        "It was fine, not great not terrible"
    ],
    'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}

# TODO: Add your own domain - try something different!
# YOUR_DOMAIN_reviews = {
#     'text': [
#         "Positive example 1",
#         "Positive example 2",
#         "Positive example 3",
#         "Positive example 4",
#         "Negative example 1",
#         "Negative example 2",
#         "Negative example 3",
#         "Negative example 4",
#         "Neutral example 1",
#         "Neutral example 2",
#     ],
#     'label': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# }

print("="*80)
print("CROSS-DOMAIN TRANSFER LEARNING EXPERIMENT")
print("="*80)

# Train on source domain (movies)
print("\nTraining classifier on MOVIE REVIEWS (source domain)...")
train_embeddings = model.encode(movie_train["text"], show_progress_bar=True)
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(train_embeddings, movie_train["label"])

# Test on source domain (baseline)
print("\n" + "-"*80)
print("BASELINE: Performance on Source Domain (Movies)")
print("-"*80)

movie_test_embeddings = model.encode(movie_test["text"], show_progress_bar=False)
movie_test_pred = clf.predict(movie_test_embeddings)
source_f1 = f1_score(movie_test["label"], movie_test_pred, average='weighted')

print(f"Source Domain F1: {source_f1:.4f}")
print("This is how well the classifier does on its training domain")

# Zero-shot transfer to target domains
print("\n" + "="*80)
print("ZERO-SHOT TRANSFER TO TARGET DOMAINS")
print("="*80)

target_domains = {
    "Restaurant Reviews": restaurant_reviews,
    "Product Reviews": product_reviews,
    "Book Reviews": book_reviews,
}

# TODO: Add your domain if created
# target_domains["YOUR DOMAIN"] = YOUR_DOMAIN_reviews

transfer_results = {}

for domain_name, domain_data in target_domains.items():
    print(f"\n{domain_name}:")
    print("-" * 60)

    # Test without adaptation
    domain_embeddings = model.encode(domain_data['text'])
    domain_pred = clf.predict(domain_embeddings)

    domain_f1 = f1_score(domain_data['label'], domain_pred, average='weighted')
    domain_acc = accuracy_score(domain_data['label'], domain_pred)

    print(f"F1 Score: {domain_f1:.4f}")
    print(f"Accuracy: {domain_acc:.4f}")
    print(f"Performance Drop: {source_f1 - domain_f1:.4f} ({(source_f1-domain_f1)/source_f1*100:.1f}%)")

    # Show some predictions
    print(f"\nExample predictions:")
    for i in range(3):
        true_label = "Positive" if domain_data['label'][i] == 1 else "Negative"
        pred_label = "Positive" if domain_pred[i] == 1 else "Negative"
        correct = "✓" if domain_pred[i] == domain_data['label'][i] else "✗"

        print(f"  '{domain_data['text'][i][:50]}...'")
        print(f"  True: {true_label} | Pred: {pred_label} {correct}")

    transfer_results[domain_name] = {
        'zero_shot_f1': domain_f1,
        'zero_shot_acc': domain_acc,
        'embeddings': domain_embeddings,
        'predictions': domain_pred
    }

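# Few-shot domain adaptation: retrain with a few labeled examples from each target domain
print("\n" + "="*80)
print("FEW-SHOT DOMAIN ADAPTATION")
print("="*80)
print("Strategy: Add 2 positive and 2 negative examples from each target domain to training set")

adaptation_size = 4
per_class = adaptation_size // 2  # same number of adaptation examples per label

for domain_name, domain_data in target_domains.items():
    print(f"\n{domain_name}:")
    print("-" * 60)

    # Split domain data. Labels are ordered (1s first, then 0s), so slicing the first
    # 4 rows would adapt on positives only and leave no positives to test on.
    labels = np.array(domain_data['label'])
    adapt_idx = np.concatenate([np.flatnonzero(labels == c)[:per_class] for c in (1, 0)])
    test_idx = np.setdiff1d(np.arange(len(labels)), adapt_idx)

    adapt_texts = [domain_data['text'][i] for i in adapt_idx]
    adapt_labels = labels[adapt_idx].tolist()

    test_texts = [domain_data['text'][i] for i in test_idx]
    test_labels = labels[test_idx].tolist()

    # Combine source + adaptation examples
    adapt_embeddings = model.encode(adapt_texts)
    combined_embeddings = np.vstack([train_embeddings, adapt_embeddings])
    combined_labels = list(movie_train["label"]) + adapt_labels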

    # Retrain
    clf_adapted = LogisticRegression(random_state=42, max_iter=1000)
    clf_adapted.fit(combined_embeddings, combined_labels)

    # Test
    test_embeddings = model.encode(test_texts)
    adapted_pred = clf_adapted.predict(test_embeddings)
    adapted_f1 = f1_score(test_labels, adapted_pred, average='weighted')

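    # Caveat: zero_shot_f1 was computed on every example in the domain, while
    # adapted_f1 covers only the held-out ones, so the difference is a rough guide.
    zero_shot_f1 = transfer_results[domain_name]['zero_shot_f1']
    improvement = adapted_f1 - zero_shot_f1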

    print(f"Zero-shot F1:  {zero_shot_f1:.4f}")
    print(f"Adapted F1:    {adapted_f1:.4f}")
    print(f"Improvement:   {improvement:+.4f}")

    if improvement > 0.05:
        print(f"→ Significant improvement! Domain adaptation helped a lot")
    elif improvement > 0:
        print(f"→ Slight improvement from adaptation")
    else:
        print(f"→ No improvement or slight degradation")

    transfer_results[domain_name]['adapted_f1'] = adapted_f1
    transfer_results[domain_name]['improvement'] = improvement

# Analyze domain similarity
print("\n" + "="*80)
print("DOMAIN SIMILARITY ANALYSIS")
print("="*80)

# Calculate domain centroids (average embedding)
source_centroid = np.mean(train_embeddings, axis=0)

print("\nDomain distances from source (movie reviews):")
print("-"*60)

domain_distances = []
for domain_name, domain_data in target_domains.items():
    domain_embeddings = transfer_results[domain_name]['embeddings']
    domain_centroid = np.mean(domain_embeddings, axis=0)

    distance = np.linalg.norm(source_centroid - domain_centroid)
    zero_shot_f1 = transfer_results[domain_name]['zero_shot_f1']
    drop = source_f1 - zero_shot_f1

    domain_distances.append((domain_name, distance, drop))

    print(f"\n{domain_name}:")
    print(f"  Embedding distance: {distance:.4f}")
    print(f"  Performance drop:   {drop:.4f}")
    print(f"  Zero-shot F1:       {zero_shot_f1:.4f}")

    if 'improvement' in transfer_results[domain_name]:
        improvement = transfer_results[domain_name]['improvement']
        print(f"  Adaptation gain:    {improvement:+.4f}")

# Correlation analysis
print("\n" + "-"*80)
print("Correlation: Distance vs Performance")
print("-"*80)

domain_distances.sort(key=lambda x: x[1])
print("\nRanked by distance to source:")
for name, dist, drop in domain_distances:
    print(f"  {name:20s}: distance={dist:.3f}, drop={drop:.3f}")

print("\nExpectation to check against the ranking above:")
print("  → Domains closer to movies in embedding space are expected to transfer better")
print("  → With only three small target domains, treat any pattern as anecdotal")

# Summary table
print("\n" + "="*80)
print("TRANSFER LEARNING SUMMARY")
print("="*80)

print(f"\n{'Domain':<20s} {'Zero-Shot F1':<15s} {'Adapted F1':<15s} {'Improvement':<12s}")
print("-"*65)
for domain_name in target_domains.keys():
    zero_f1 = transfer_results[domain_name]['zero_shot_f1']
    adapted_f1 = transfer_results[domain_name].get('adapted_f1', 0)
    improvement = transfer_results[domain_name].get('improvement', 0)

    marker = "✓" if improvement > 0.05 else ""
    print(f"{domain_name:<20s} {zero_f1:.4f}          {adapted_f1:.4f}          {improvement:+.4f}     {marker}")

print(f"\nSource (Movies):      {source_f1:.4f}          N/A             N/A")

CROSS-DOMAIN TRANSFER LEARNING EXPERIMENT

Training classifier on MOVIE REVIEWS (source domain)...

--------------------------------------------------------------------------------
BASELINE: Performance on Source Domain (Movies)
--------------------------------------------------------------------------------
Source Domain F1: 0.8497
This is how well the classifier does on its training domain

ZERO-SHOT TRANSFER TO TARGET DOMAINS

Restaurant Reviews:
------------------------------------------------------------
F1 Score: 0.9010
Accuracy: 0.9000
Performance Drop: -0.0513 (-6.0%)

Example predictions:
  'Amazing food and excellent service!...'
  True: Positive | Pred: Positive ✓
  'Best restaurant in town, highly recommend...'
  True: Positive | Pred: Positive ✓
  'Delicious meals and great atmosphere...'
  True: Positive | Pred: Positive ✓

Product Reviews:
------------------------------------------------------------
F1 Score: 1.0000
Accuracy: 1.0000
Performance Drop: -0.1503 (-17.7%)

Example predictions:
  'This product is amazing! Works perfectly...'
  True: Positive | Pred: Positiv
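
The cell above ranks domains by centroid distance but does not compute a correlation between distance and performance drop. A minimal sketch of one way to check it, assuming `domain_distances` is the list of `(name, distance, drop)` tuples built in that cell. `distance_drop_correlation` is a hypothetical helper, and the `toy` values are placeholders for illustration, not results from this notebook.

```python
import numpy as np
from scipy.stats import spearmanr


def distance_drop_correlation(domain_distances):
    """Spearman rank correlation between embedding distance and performance drop.

    domain_distances: list of (name, distance, drop) tuples.
    """
    distances = np.array([dist for _, dist, _ in domain_distances], dtype=float)
    drops = np.array([drop for _, _, drop in domain_distances], dtype=float)
    rho, p_value = spearmanr(distances, drops)
    return rho, p_value


# Placeholder values for illustration only (hypothetical, not measured).
toy = [("Domain A", 0.30, 0.01), ("Domain B", 0.45, 0.04), ("Domain C", 0.60, 0.09)]
rho, p_value = distance_drop_correlation(toy)
print(f"Spearman rho = {rho:.2f} (n = {len(toy)})")
```

With only three target domains the rank correlation can take just a handful of values, so it is better read as a sanity check than as evidence. Adding more domains (for example your own in `target_domains`) makes the number more informative.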

**Reflection Questions:**

1. Which domain transferred best from movies? Which transferred worst? How domain-specific is the classifier's knowledge?

2. Do you see patterns in what transfers well vs what fails? Does the classifier understand "delicious" (restaurant) as easily as "entertaining" (movie)?

3. After few-shot adaptation: Which domains benefited most from adding just 4 examples? Why might some domains need more adaptation than others?
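
For question 3, keep in mind that the review dictionaries list their positive examples first, so how the four adaptation examples are picked matters: a plain `[:4]` slice contains only label 1. The sketch below shows one possible class-balanced split. `balanced_few_shot_split` is a hypothetical helper written for this illustration, not part of the notebook or of any library.

```python
def balanced_few_shot_split(labels, per_class=2):
    """Return (adapt_idx, test_idx): `per_class` indices of each label go to adaptation,
    the remaining indices are held out for testing."""
    adapt_idx = []
    for cls in sorted(set(labels)):
        cls_idx = [i for i, y in enumerate(labels) if y == cls]
        adapt_idx.extend(cls_idx[:per_class])
    adapt_idx = sorted(adapt_idx)
    adapt_set = set(adapt_idx)
    test_idx = [i for i in range(len(labels)) if i not in adapt_set]
    return adapt_idx, test_idx


# Same label layout as the review dictionaries in this notebook.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
adapt_idx, test_idx = balanced_few_shot_split(labels, per_class=2)

print("Naive slice labels:   ", labels[:4])
print("Balanced adapt labels:", [labels[i] for i in adapt_idx])
print("Held-out labels:      ", [labels[i] for i in test_idx])
```

A balanced split keeps both classes in the adaptation set and in the held-out set, which makes the adapted F1 easier to interpret. With so few examples per domain, results will still vary a lot depending on which examples are chosen.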