# Chapter 4: Text Classification - Easy Tasks

This notebook covers the basic text classification concepts: zero-shot classification, classifier strategies, temperature effects, and embedding similarity.



## Setup

Run all cells in this section to set up the environment and load necessary data.

Before running these cells, we recommend first working through the main Chapter 4 notebook (`Start_Here.ipynb`) to get familiar with its code and concepts.


### [OPTIONAL] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter:

---

 **NOTE**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [34]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

### Data Loading


We use the same data as in the `Start_Here.ipynb` notebook.

In [None]:
from datasets import load_dataset

# Load our data
data = load_dataset("rotten_tomatoes")
data

### Helper Functions


In [36]:
from sklearn.metrics import classification_report

def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

## Challenges

Complete the following tasks by implementing the TODO sections. Solutions are provided below each challenge (commented out).

Each task includes:
- Clear objective
- Starter code with TODOs
- Hints
- Test assertions
- Solution code (commented)

### Level: Easy

These challenges introduce core concepts. Implement the TODO sections to practice.

**About This Task:**
Zero-shot classification classifies text without training examples.

#### Easy Task 1: Zero-Shot Classifier

### Instructions

1. Execute the code to see baseline predictions for 3 basic reviews
2. Uncomment the larger `test_reviews` list and run again to test harder cases
3. Uncomment one label option to see how wording affects predictions
4. Compare which label style works best for ambiguous reviews

In [37]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [38]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [39]:
# Test reviews - THESE WORK AS-IS
test_reviews = [
    "This movie was absolutely fantastic! A masterpiece!",
    "Terrible waste of time. Very disappointing.",
    "An okay film, nothing special but watchable.",
]

# Your task: Uncomment to test harder cases
# test_reviews = [
#     "This movie was absolutely fantastic! A masterpiece!",
#     "Terrible waste of time. Very disappointing.",
#     "An okay film, nothing special but watchable.",
#     "Oh great, another masterpiece... NOT!",  # Sarcastic
#     "Boring.",  # Very short
#     "Great acting but terrible plot.",  # Mixed sentiment
# ]

In [40]:
# Label descriptions - THESE WORK AS-IS
labels = [
    "A negative movie review",
    "A positive movie review"
]

# Implement: Try different label options
# labels = ["negative", "positive"]  # Option 1: Simple
# labels = ["bad movie review", "good movie review"]  # Option 2: Different wording
# labels = ["a scathing negative movie review", "an enthusiastic positive movie review"]  # Option 3: Detailed

In [None]:
# Complete this: Encode the labels and reviews, then calculate cosine similarity
# Hint: Use model.encode() for embeddings and cosine_similarity() for the matrix
label_embeddings = None  # YOUR CODE HERE
review_embeddings = None  # YOUR CODE HERE
sim_matrix = None  # YOUR CODE HERE

# Test your implementation
assert label_embeddings is not None, "Encode the labels first!"
assert review_embeddings is not None, "Encode the reviews!"
assert sim_matrix is not None, "Calculate the similarity matrix!"

# Solution (uncomment if stuck):
# label_embeddings = model.encode(labels)
# review_embeddings = model.encode(test_reviews)
# sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

The cell above encodes the labels and the test reviews, then computes the cosine similarity between every review and every label. Try rerunning it with the different `labels` options to see how the similarity scores change.
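
For intuition about what `cosine_similarity` returns, the following sketch reproduces the computation with plain NumPy. The 3-dimensional vectors are made-up stand-ins for real embeddings, and the helper name `cosine_sim_matrix` is an illustrative assumption, not part of the notebook.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

# Toy stand-ins for review and label embeddings (made-up numbers)
toy_reviews = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [1.0, 1.0, 0.0]])
toy_labels = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])

toy_sim = cosine_sim_matrix(toy_reviews, toy_labels)
print(toy_sim.shape)          # one row per review, one column per label
print(np.round(toy_sim, 3))
```

The third toy review points equally toward both labels, so its two scores are identical. That is the kind of row where the predicted label is least reliable.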

In [None]:
# Fill in: For each review, find the predicted label and print results
# Hint: Use np.argmax() to find the highest similarity score

print("Classification Results:")

for i, review in enumerate(test_reviews):
    # Write code to: Find which label has the highest similarity
    prediction = None  # YOUR CODE HERE (use np.argmax on sim_matrix[i])
    confidence = None  # YOUR CODE HERE (get the max similarity score)
    margin = None  # YOUR CODE HERE (difference between the two label scores)
    
    print(f"\nReview {i+1}: '{review}'")
    if prediction is not None:
        print(f"Predicted: {labels[prediction]}")
        print(f"Confidence: {confidence:.3f}")
        print(f"Scores -> Negative: {sim_matrix[i][0]:.3f}, Positive: {sim_matrix[i][1]:.3f}")
        print(f"Margin: {margin:.3f}")

# Solution (uncomment if stuck):
# for i, review in enumerate(test_reviews):
#     prediction = np.argmax(sim_matrix[i])
#     confidence = sim_matrix[i][prediction]
#     margin = abs(sim_matrix[i][0] - sim_matrix[i][1])
#     print(f"\nReview {i+1}: '{review}'")
#     print(f"Predicted: {labels[prediction]}")
#     print(f"Confidence: {confidence:.3f}")
#     print(f"Scores -> Negative: {sim_matrix[i][0]:.3f}, Positive: {sim_matrix[i][1]:.3f}")
#     print(f"Margin: {margin:.3f}")

To test a different `test_reviews` or `labels` set, comment out the previous one so that only one set is active at a time.

Tip: to uncomment/comment lines, use `Ctrl + /`



---

### Questions

1. Why did the classifier fail on the sarcastic review ("*Oh great, another masterpiece... NOT*")? What semantic features did embeddings miss?

2. Which reviews changed predictions when you modified label descriptions? Why are some reviews more sensitive to label wording than others?

3. Which reviews have low confidence margins (<0.1)? What linguistic features make certain reviews harder to classify?

---
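
As a starting point for question 3, the sketch below flags low-margin predictions in a small score matrix. The scores are hypothetical numbers chosen for illustration, not output from the model, and the `toy_` names are assumptions; only the 0.1 threshold comes from the question.

```python
import numpy as np

# Hypothetical similarity scores: columns are [negative, positive]
toy_scores = np.array([
    [0.10, 0.45],   # clearly positive
    [0.40, 0.12],   # clearly negative
    [0.31, 0.28],   # ambiguous
])

toy_predictions = toy_scores.argmax(axis=1)
toy_margins = np.abs(toy_scores[:, 0] - toy_scores[:, 1])
low_margin = toy_margins < 0.1   # threshold taken from question 3

for pred, margin, flag in zip(toy_predictions, toy_margins, low_margin):
    note = "  <- low margin" if flag else ""
    print(f"predicted={pred}  margin={margin:.3f}{note}")
```

The same three lines should work on the real `sim_matrix` once it has been computed.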



**About This Task:**
Different classification strategies affect model behavior and performance.

#### Easy Task 2: Classifier Strategy Analysis

### Instructions

1. Execute code to see three pre-built classifiers (conservative, aggressive, balanced)
2. Study each confusion matrix to identify error patterns
3. Modify `classifier_yours` to create a very conservative classifier (precision > 0.9)
4. Uncomment the TODO section to analyze your classifier
5. Experiment with creating different strategy combinations

In [43]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
import numpy as np

In [44]:
# True labels: first 5 are negative (0), last 5 are positive (1)
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

In [45]:
# Pre-built classifiers to analyze
classifier_conservative = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # Rarely predicts positive
classifier_aggressive = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])     # Often predicts positive
classifier_balanced = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])       # Balanced approach

In [46]:
# Your task: Modify these predictions to make YOUR classifier
# Try to achieve precision > 0.9 (be very selective about predicting 1)
classifier_yours = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

In [None]:
def analyze_classifier(name, y_true, y_pred):
    """Analyze classifier performance with detailed breakdown"""
    print(f"\n{name}")
    
    # Implement: Calculate the confusion matrix
    # Hint: Use confusion_matrix(y_true, y_pred)
    cm = None  # YOUR CODE HERE
    
    # Test
    assert cm is not None, "Calculate the confusion matrix!"
    
    # Show confusion matrix with labels
    print(f"\nConfusion Matrix:")
    print(f"                    Predicted Neg | Predicted Pos")
    print(f"Actual Neg (0):          {cm[0][0]}       |       {cm[0][1]}  (False Positives)")
    print(f"Actual Pos (1):          {cm[1][0]}       |       {cm[1][1]}  (True Positives)")
    print(f"                   (False Negatives)")
    
    # Complete this: Calculate precision, recall, and f1
    # Hint: Use precision_score, recall_score, f1_score from sklearn
    precision = None  # YOUR CODE HERE
    recall = None  # YOUR CODE HERE
    f1 = None  # YOUR CODE HERE
    
    if precision is not None and recall is not None and f1 is not None:
        print(f"\nMetrics:")
        print(f"Precision: {precision:.3f} = TP/(TP+FP) = {cm[1][1]}/({cm[1][1]}+{cm[0][1]})")
        print(f"Recall:    {recall:.3f} = TP/(TP+FN) = {cm[1][1]}/({cm[1][1]}+{cm[1][0]})")
        print(f"F1 Score:  {f1:.3f} = 2*(P*R)/(P+R)")
        
        # Explain strategy
        if precision > recall + 0.1:
            print(f"\nStrategy: Conservative (few false alarms, misses some positives)")
        elif recall > precision + 0.1:
            print(f"\nStrategy: Aggressive (finds most positives, many false alarms)")
        else:
            print(f"\nStrategy: Balanced")
    
    return precision, recall, f1

# Solution (uncomment if stuck):
# def analyze_classifier(name, y_true, y_pred):
#     print(f"\n{name}")
#     cm = confusion_matrix(y_true, y_pred)
#     precision = precision_score(y_true, y_pred, zero_division=0)
#     recall = recall_score(y_true, y_pred, zero_division=0)
#     f1 = f1_score(y_true, y_pred, zero_division=0)
#     ... (rest of the function)

In [None]:
# Analyze pre-built classifiers
results = {}
for name, classifier in [
    ("Conservative Classifier", classifier_conservative),
    ("Aggressive Classifier", classifier_aggressive),
    ("Balanced Classifier", classifier_balanced),
]:
    p, r, f = analyze_classifier(name, y_true, classifier)
    results[name] = (p, r, f)

In [49]:
# Fill in: Uncomment the three lines below to analyze your classifier
# print("Analyzing Your Classifier")
# p, r, f = analyze_classifier("YOUR Classifier", y_true, classifier_yours)
# results["YOUR Classifier"] = (p, r, f)

In [None]:
# Summary
print("Summary")
print(f"{'Classifier':<25} {'Precision':<12} {'Recall':<12} {'F1':<12}")
for name, (p, r, f) in results.items():
    print(f"{name:<25} {p:.3f}        {r:.3f}       {f:.3f}")

### Questions

1. The conservative classifier has 2 false negatives. What real-world mistake does this represent? Provide a movie review example.

2. What strategy did you use to achieve high precision in `classifier_yours`? Why does predicting positive less frequently increase precision?

3. Which classifier won on F1 score? Why doesn't the aggressive classifier win despite high recall?
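
To check the formulas printed by `analyze_classifier`, the metrics can also be derived by hand from the TP/FP/FN counts. The sketch below does this for the same values as `classifier_conservative`. The `_demo` variable names are assumptions, used to avoid overwriting notebook variables.

```python
import numpy as np

y_true_demo = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # same values as classifier_conservative

tp = int(np.sum((y_true_demo == 1) & (y_pred_demo == 1)))
fp = int(np.sum((y_true_demo == 0) & (y_pred_demo == 1)))
fn = int(np.sum((y_true_demo == 1) & (y_pred_demo == 0)))

# Guard against division by zero (e.g. a classifier that never predicts positive)
precision_demo = tp / (tp + fp) if (tp + fp) else 0.0
recall_demo = tp / (tp + fn) if (tp + fn) else 0.0
f1_demo = (2 * precision_demo * recall_demo / (precision_demo + recall_demo)
           if (precision_demo + recall_demo) else 0.0)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={precision_demo:.3f} recall={recall_demo:.3f} f1={f1_demo:.3f}")
```

Swapping in another prediction array shows how each extra positive prediction changes the counts.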

**About This Task:**
Temperature controls randomness in language model outputs.

#### Easy Task 3: Temperature Effects on Text Generation

### Instructions

1. Execute code to see how temperature affects token selection with a confident model
2. Observe how probabilities and samples change across temperatures
3. Uncomment the uncertain probability distribution and run again
4. Compare temperature effects on confident vs uncertain models
5. Uncomment TODO to add a new temperature value and analyze results

In [51]:
import numpy as np

In [52]:
original_probs = np.array([0.50, 0.30, 0.12, 0.05, 0.03])
tokens = ["positive", "negative", "neutral", "good", "bad"]

# Try uncertain distribution (much more balanced)
# original_probs = np.array([0.25, 0.24, 0.22, 0.18, 0.11])
# Run again and compare the temperature effects

In [None]:
def apply_temperature(probs, temperature):
    """Apply temperature scaling to change distribution sharpness"""
    if temperature == 0:
        # Write code to: For temperature=0, return deterministic distribution
        # Hint: Create array of zeros, set the max probability index to 1.0
        result = None  # YOUR CODE HERE
        return result if result is not None else probs
    
    # Your task: Apply temperature scaling
    # Hint: logits = log(probs), scale by temp, apply softmax
    # Steps: log -> divide by temp -> exp -> normalize
    logits = None  # YOUR CODE HERE
    scaled_logits = None  # YOUR CODE HERE
    exp_logits = None  # YOUR CODE HERE
    normalized = None  # YOUR CODE HERE (sum should equal 1.0)
    
    return normalized if normalized is not None else probs

# Solution (uncomment if stuck):
# def apply_temperature(probs, temperature):
#     if temperature == 0:
#         result = np.zeros_like(probs)
#         result[np.argmax(probs)] = 1.0
#         return result
#     logits = np.log(probs + 1e-10)
#     scaled_logits = logits / temperature
#     exp_logits = np.exp(scaled_logits)
#     return exp_logits / np.sum(exp_logits)

In [54]:
def visualize_distribution(probs, tokens):
    """Show probability distribution as bar chart"""
    for i, token in enumerate(tokens):
        bar_length = int(probs[i] * 100)
        bar = '█' * bar_length  # one block per percentage point
        print(f"  {token:10s}: {probs[i]:.3f} {bar}")

In [55]:
# Test different temperatures
temperatures = [0, 0.5, 1.0, 2.0]

# Add temperature=3.0
# temperatures = [0, 0.5, 1.0, 2.0, 3.0]

In [None]:
print(f"Original (temperature=1.0) probabilities:")
visualize_distribution(original_probs, tokens)

In [None]:
for temp in temperatures:
    print(f"\n{'='*70}")
    print(f"Temperature = {temp}")
    print('='*70)

    # Apply temperature
    new_probs = apply_temperature(original_probs, temp)

    # Visualize
    visualize_distribution(new_probs, tokens)

    # Sample tokens
    print(f"\n  Sampling 10 tokens:")
    if temp == 0:
        samples = [tokens[np.argmax(new_probs)]] * 10
    else:
        samples = np.random.choice(tokens, size=10, p=new_probs)

    print(f"  {samples}")

    # Show diversity metric
    unique_tokens = len(set(samples))
    print(f"   Diversity: {unique_tokens}/10 unique tokens")

    # Explain what's happening
    if temp == 0:
        print(f"   Effect: Deterministic - always outputs '{samples[0]}'")
    elif temp < 1.0:
        print(f"   Effect: Sharpened - makes confident tokens more likely")
    elif temp == 1.0:
        print(f"   Effect: Unchanged - original distribution")
    else:
        print(f"   Effect: Flattened - makes all tokens more equally likely")

### Questions

1. Why is temperature=0 critical for classification tasks? What would go wrong with temperature=1.0?

2. Compare temperature=0.5 vs 2.0. At what temperature did low-probability tokens like "bad" start appearing in samples?

3. With the uncertain distribution ([0.25, 0.24, 0.22, 0.18, 0.11]), how did temperature effects differ from the confident model?
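
One way to quantify the "sharpened" and "flattened" effects is Shannon entropy. The sketch below relies on the identity that applying softmax to `log(p) / T` is the same as normalizing `p ** (1 / T)`. The helper names `rescale` and `entropy` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rescale(probs, temperature):
    """Temperature-scale a probability vector (temperature must be > 0)."""
    powered = probs ** (1.0 / temperature)
    return powered / powered.sum()

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    nonzero = probs[probs > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))

demo_probs = np.array([0.50, 0.30, 0.12, 0.05, 0.03])
for t in [0.5, 1.0, 2.0]:
    scaled = rescale(demo_probs, t)
    print(f"T={t}: top prob={scaled.max():.3f}, entropy={entropy(scaled):.3f}")
```

Entropy rises with temperature. Running the same loop on the uncertain distribution gives a direct comparison for question 3.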

**About This Task:**
Embedding similarity measures how semantically close two texts are in vector space.

#### Easy Task 4: Embedding Similarity Analysis

### Instructions

1. Execute code to see similarity matrix for movie reviews
2. Identify which reviews cluster together and which are distant
3. Uncomment TODO to add reviews from different domains
4. Analyze whether restaurant/product reviews cluster with movie reviews
5. Add a random sentence to test similarity boundaries

In [58]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [59]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [60]:
# Movie reviews - different types
texts = [
    # Positive reviews
    "Amazing movie! Absolutely loved it!",
    "Fantastic film, highly recommend!",
    "Great cinematography and acting",

    # Negative reviews
    "Terrible waste of time",
    "Very disappointing and boring",
    "Poor acting and weak plot",

    # Neutral reviews
    "It was okay, nothing special",
    "Some good parts, some bad parts",

    # Off-topic
    "The weather is nice today",
    "I like eating pizza"
]

# Implement: Test domain transfer
# texts = [
#     # Positive movie reviews
#     "Amazing movie! Absolutely loved it!",
#     "Fantastic film, highly recommend!",
#     "Great cinematography and acting",
#
#     # Negative movie reviews
#     "Terrible waste of time",
#     "Very disappointing and boring",
#     "Poor acting and weak plot",
#
#     # Neutral movie reviews
#     "It was okay, nothing special",
#     "Some good parts, some bad parts",
#
#     # Positive restaurant review (different domain!)
#     "Amazing food! Absolutely loved it!",
#     "Fantastic restaurant, highly recommend!",
#
#     # Off-topic
#     "The weather is nice today",
#     "I like eating pizza",
# ]

In [61]:
labels = [f"Text {i+1}" for i in range(len(texts))]

In [62]:
# Generate embeddings
embeddings = model.encode(texts)
similarity_matrix = cosine_similarity(embeddings)

In [None]:
print(f"Each text converted to {embeddings.shape[1]}-dimensional vector")
print(f"Comparing {embeddings.shape[0]} texts\n")

In [None]:
# Show full similarity matrix (every column is 7 characters wide so headers and cells line up)
print("Similarity Matrix (0=unrelated, 1=identical):")
print(f"{'':10s}", end="")
for i in range(len(texts)):
    print(f" {'T' + str(i+1):<6s}", end="")
print()

for i in range(len(texts)):
    print(f"Text {i+1:2d}:  ", end="")
    for j in range(len(texts)):
        if i == j:
            print(f" {'----':<6s}", end="")
        else:
            sim = similarity_matrix[i][j]
            marker = "*" if sim > 0.6 else " "  # flag high similarity
            print(f"{sim:5.2f}{marker} ", end="")
    print()

print("\n* = High similarity (>0.6)")

In [None]:
# Detailed comparisons
print("Detailed Comparisons")

comparisons = [
    (0, 1, "Positive review vs Positive review"),
    (3, 4, "Negative review vs Negative review"),
    (0, 3, "Positive review vs Negative review"),
    (0, 8, "Movie review vs Off-topic text"),
]

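# Complete this: Compare movie vs restaurant reviews
# (Requires the extended `texts` list above, where index 8 is the first restaurant review
#  and the off-topic texts move to indices 10 and 11.)
# comparisons.append((0, 8, "Positive MOVIE vs Positive RESTAURANT"))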

for i, j, description in comparisons:
    if i < len(texts) and j < len(texts):
        sim = similarity_matrix[i][j]
        print(f"\n{description}:")
        print(f"  Text {i+1}: '{texts[i]}'")
        print(f"  Text {j+1}: '{texts[j]}'")
        print(f"  Similarity: {sim:.3f}")

        if sim > 0.7:
            print(f"   Very similar! These texts are closely related in meaning")
        elif sim > 0.4:
            print(f"   Moderately similar. Some shared concepts")
        else:
            print(f"   Different topics or sentiments")

In [None]:
# Fill in: Find texts that are similar to the first positive review
# Hint: Loop through all texts, check if similarity > 0.5

print("\nClusters (which texts group together?)")

positive_idx = 0
similar_to_positive = []

# Write code to: Loop through texts and find similar ones
# for i in range(len(texts)):
#     if i != positive_idx and similarity_matrix[positive_idx][i] > ???:
#         similar_to_positive.append(???)

print(f"\nTexts similar to '{texts[positive_idx]}':")
for idx, sim in sorted(similar_to_positive, key=lambda x: x[1], reverse=True):
    print(f"  Text {idx+1} (sim={sim:.3f}): '{texts[idx]}'")

# Solution (uncomment if stuck):
# for i in range(len(texts)):
#     if i != positive_idx and similarity_matrix[positive_idx][i] > 0.5:
#         similar_to_positive.append((i, similarity_matrix[positive_idx][i]))

### Questions

1. Compare similarity between Text 1 and Text 2 (both positive) vs Text 1 and Text 4 (positive vs negative). What aspects of semantic meaning do embeddings prioritize?

2. Find similarity scores between two negative reviews (Text 4 and Text 5) and two positive reviews (Text 1 and Text 2). Why would averaging embeddings per class work for classification?

3. After adding restaurant reviews: How similar was "Amazing food" to "Amazing movie"? What does this reveal about domain transfer?
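
Question 2 hints at classifying by averaged class embeddings. A minimal sketch of that idea follows, using made-up 2-dimensional vectors in place of real embeddings. All `toy_` names and the `normalize` helper are illustrative assumptions.

```python
import numpy as np

def normalize(v):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Made-up 2-D "embeddings" for a few labelled examples per class
toy_negative = np.array([[0.9, 0.1], [0.8, 0.3]])
toy_positive = np.array([[0.1, 0.9], [0.2, 0.7]])

# One averaged vector per class: row 0 = negative, row 1 = positive
centroids = normalize(np.stack([toy_negative.mean(axis=0),
                                toy_positive.mean(axis=0)]))

# Classify new points by cosine similarity to each class centroid
toy_new = normalize(np.array([[0.3, 0.8], [0.7, 0.2]]))
toy_pred = (toy_new @ centroids.T).argmax(axis=1)
print(toy_pred)
```

With real data, `toy_negative` and `toy_positive` would be replaced by `model.encode(...)` outputs for labelled reviews of each class.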