# Chapter 4: Text Classification - Easy Tasks (Solutions)

Complete working solutions.


## Setup

Run all cells in this section to set up the environment and load necessary data.

Before running these cells, we recommend that you first run and get familiar with the code and concepts in the main Chapter 4 notebook (`Start_Here.ipynb`).


### [Optional] - Installing Packages on <img src="https://colab.google/static/images/icons/colab.png" width=100>

If you are viewing this notebook on Google Colab (or any other cloud vendor), you need to **run** the following code block to install the dependencies for this chapter:

---

 **Note**: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to
**Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

---


In [1]:
%%capture
!pip install transformers sentence-transformers openai
!pip install -U datasets

### Data Loading


We use the same data as in the `Start_Here.ipynb` notebook.

In [2]:
from datasets import load_dataset
# Load our data
data = load_dataset("rotten_tomatoes")
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

### Helper Functions


In [3]:
from sklearn.metrics import classification_report
def evaluate_performance(y_true, y_pred):
    """Create and print the classification report"""
    performance = classification_report(
        y_true, y_pred,
        target_names=["Negative Review", "Positive Review"]
    )
    print(performance)

## Challenges

Complete the following tasks by implementing the starter code.

Solutions are in `00_Solutions.ipynb`.

### Level: Easy

These challenges introduce core concepts. Implement the Try sections to practice.

**About This Task:**
Zero-shot classification classifies text without training examples.

#### Easy Task 1: Zero-Shot Classifier

### Instructions

1. Execute the code to see baseline predictions for 3 basic reviews
2. Uncomment the larger `test_reviews` list and run again to test harder cases
3. Uncomment one label option to see how wording affects predictions
4. Compare which label style works best for ambiguous reviews

In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [5]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [6]:
# Test reviews - These work as-is
test_reviews = [
    "This movie was absolutely fantastic! A masterpiece!",
    "Terrible waste of time. Very disappointing.",
    "An okay film, nothing special but watchable.",
]
# Your task: Uncomment to test harder cases
# test_reviews = [
#     "This movie was absolutely fantastic! A masterpiece!",
#     "Terrible waste of time. Very disappointing.",
#     "An okay film, nothing special but watchable.",
#     "Oh great, another masterpiece... not!",  # Sarcastic
#     "Boring.",  # Very short
#     "Great acting but terrible plot.",  # Mixed sentiment
# ]

In [7]:
# Label descriptions - These work as-is
labels = [
    "A negative movie review",
    "A positive movie review"
]
# Implement: Try different label options
# labels = ["negative", "positive"]  # Option 1: Simple
# labels = ["bad movie review", "good movie review"]  # Option 2: Different wording
# labels = ["a scathing negative movie review", "an enthusiastic positive movie review"]  # Option 3: Detailed

In [8]:
# Complete this: Encode the labels and reviews, then calculate cosine similarity
# Hint: Use model.encode() for embeddings and cosine_similarity() for the matrix
label_embeddings = model.encode(labels)
review_embeddings = model.encode(test_reviews)
sim_matrix = cosine_similarity(review_embeddings, label_embeddings)

The cell above encodes the labels and the test reviews with the model, then computes the cosine similarity between every review and every label. Try rerunning it with the different `labels` options to see how the similarity scores change.
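
To make the similarity step concrete, here is a minimal sketch of what a cosine similarity matrix contains, written in plain Python on made-up 3-dimensional vectors. The vectors and the helper names `cosine` and `pairwise_cosine` are illustrative assumptions; real embeddings from `all-mpnet-base-v2` have 768 dimensions, and scikit-learn's `cosine_similarity` may differ in implementation details.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosine(rows, cols):
    """Entry [i][j] compares rows[i] with cols[j], the same layout as sim_matrix."""
    return [[cosine(r, c) for c in cols] for r in rows]

# Hypothetical toy "embeddings" (not produced by any model)
toy_reviews = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
toy_labels = [[0.0, 1.0, 0.0], [3.0, 0.0, 0.0]]  # index 0 = negative, 1 = positive

toy_sim = pairwise_cosine(toy_reviews, toy_labels)
for row in toy_sim:
    print([round(x, 3) for x in row])
```

Vector length does not matter here: `[1, 0, 0]` and `[3, 0, 0]` point in the same direction, so their similarity is 1 even though their magnitudes differ.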

In [9]:
# Fill in: For each review, find the predicted label and print results
# Hint: Use np.argmax() to find the highest similarity score

print("Classification Results:")

for i, review in enumerate(test_reviews):
    # prediction: which label (0 or 1) has highest similarity
    prediction = np.argmax(sim_matrix[i])
    
    # confidence: cosine similarity between the review and the predicted label
    # (can range from -1 to 1; higher = the review is closer to that label)
    confidence = sim_matrix[i][prediction]
    
    # margin: difference between top 2 predictions (how clear-cut the decision is)
    # Large margin (>0.15) = confident, small margin (<0.05) = uncertain/ambiguous
    top_two = sorted(sim_matrix[i], reverse=True)[:2]
    margin = top_two[0] - top_two[1]
    
    print(f"\nReview {i+1}: '{review}'")
    print(f"Predicted: {labels[prediction]}")
    print(f"Confidence: {confidence:.3f}")
    print(f"Margin: {margin:.3f}")

Classification Results:

Review 1: 'This movie was absolutely fantastic! A masterpiece!'
Predicted: A positive movie review
Confidence: 0.493
Margin: 0.111

Review 2: 'Terrible waste of time. Very disappointing.'
Predicted: A negative movie review
Confidence: 0.439
Margin: 0.140

Review 3: 'An okay film, nothing special but watchable.'
Predicted: A positive movie review
Confidence: 0.542
Margin: 0.039


To test a different `test_reviews` list or `labels` option, comment out the current one and uncomment the one you want. Make sure only one definition of each is uncommented at a time.

Tip: to uncomment/comment lines, use `Ctrl + /`
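
The margin computed in the loop above can also be used to flag ambiguous predictions automatically. The sketch below works on a hand-written similarity matrix whose rows are reconstructed from the printed confidence and margin values (the lower similarity is confidence minus margin, so the numbers are rounded). The 0.05 cutoff comes from the code comment, and the helper name `flag_ambiguous` is an assumption.

```python
def flag_ambiguous(sim_rows, threshold=0.05):
    """Return (row index, predicted label index, margin) for rows whose
    gap between the two highest similarities is below the threshold."""
    flagged = []
    for i, row in enumerate(sim_rows):
        # Label indices ordered from most to least similar
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        margin = row[ranked[0]] - row[ranked[1]]
        if margin < threshold:
            flagged.append((i, ranked[0], round(margin, 3)))
    return flagged

# Columns: [negative, positive]; values reconstructed from the output above
sim_rows = [
    [0.382, 0.493],  # Review 1
    [0.439, 0.299],  # Review 2
    [0.503, 0.542],  # Review 3
]

ambiguous = flag_ambiguous(sim_rows)
print(ambiguous)
```

Only the third review ("An okay film, nothing special but watchable.") falls under the cutoff, which matches its printed margin of 0.039.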



### Questions

1. Why did the classifier fail on the sarcastic review ("*Oh great, another masterpiece... not!*")? What semantic features did the embeddings miss?

2. Which reviews changed predictions when you modified label descriptions? Why are some reviews more sensitive to label wording than others?

3. Which reviews have low confidence margins (<0.1)? What linguistic features make certain reviews harder to classify?



**About This Task:**
Different classification strategies affect model behavior and performance.

#### Easy Task 2: Classifier Strategy Analysis

### Instructions

1. Execute code to see three pre-built classifiers (conservative, aggressive, balanced)
2. Study each confusion matrix to identify error patterns
3. Modify `classifier_yours` to create a very conservative classifier (precision > 0.9)
4. Uncomment the Try section to analyze your classifier
5. Experiment with creating different strategy combinations

In [10]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
import numpy as np

In [11]:
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

In [12]:
classifier_conservative = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # Rarely predicts positive
classifier_aggressive = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])     # Often predicts positive
classifier_balanced = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])       # Balanced approach

In [13]:
# Your task: Modify these predictions to make your classifier
# Try to achieve precision > 0.9 (be very selective about predicting 1)
classifier_yours = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

In [14]:
def analyze_classifier(name, y_true, y_pred):
    """Analyze classifier performance with detailed breakdown"""
    print(f"\n{name}")
    # Calculate the confusion matrix
    # Hint: Use confusion_matrix(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred)

    print(f"\nConfusion Matrix:")
    print(f"                    Predicted Neg | Predicted Pos")
    print(f"Actual Neg (0):          {cm[0][0]}       |       {cm[0][1]}  (False Positives)")
    print(f"Actual Pos (1):          {cm[1][0]}       |       {cm[1][1]}  (True Positives)")
    print(f"                   (False Negatives)")

    # Calculate precision, recall, and f1
    # Hint: Use precision_score, recall_score, f1_score from sklearn
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    if precision is not None and recall is not None and f1 is not None:
        print(f"\nMetrics:")
        print(f"Precision: {precision:.3f} = TP/(TP+FP) = {cm[1][1]}/({cm[1][1]}+{cm[0][1]})")
        print(f"Recall:    {recall:.3f} = TP/(TP+FN) = {cm[1][1]}/({cm[1][1]}+{cm[1][0]})")
        print(f"F1 Score:  {f1:.3f} = 2*(P*R)/(P+R)")
        # Explain strategy
        if precision > recall + 0.1:
            print(f"\nStrategy: Conservative (few false alarms, misses some positives)")
        elif recall > precision + 0.1:
            print(f"\nStrategy: Aggressive (finds most positives, many false alarms)")
        else:
            print(f"\nStrategy: Balanced")
    return precision, recall, f1

In [15]:
results = {}
for name, classifier in [
    ("Conservative Classifier", classifier_conservative),
    ("Aggressive Classifier", classifier_aggressive),
    ("Balanced Classifier", classifier_balanced),
]:
    p, r, f = analyze_classifier(name, y_true, classifier)
    results[name] = (p, r, f)


Conservative Classifier

Confusion Matrix:
                    Predicted Neg | Predicted Pos
Actual Neg (0):          5       |       0  (False Positives)
Actual Pos (1):          2       |       3  (True Positives)
                   (False Negatives)

Metrics:
Precision: 1.000 = TP/(TP+FP) = 3/(3+0)
Recall:    0.600 = TP/(TP+FN) = 3/(3+2)
F1 Score:  0.750 = 2*(P*R)/(P+R)

Strategy: Conservative (few false alarms, misses some positives)

Aggressive Classifier

Confusion Matrix:
                    Predicted Neg | Predicted Pos
Actual Neg (0):          1       |       4  (False Positives)
Actual Pos (1):          0       |       5  (True Positives)
                   (False Negatives)

Metrics:
Precision: 0.556 = TP/(TP+FP) = 5/(5+4)
Recall:    1.000 = TP/(TP+FN) = 5/(5+0)
F1 Score:  0.714 = 2*(P*R)/(P+R)

Strategy: Aggressive (finds most positives, many false alarms)

Balanced Classifier

Confusion Matrix:
                    Predicted Neg | Predicted Pos
Actual Neg (0):          4  

In [16]:
# Try: Uncomment the three lines below to analyze your classifier
# print("Analyzing Your Classifier")
# p, r, f = analyze_classifier("Your Classifier", y_true, classifier_yours)
# results["Your Classifier"] = (p, r, f)

In [17]:
print(f"{'Classifier':<25} {'Precision':<12} {'Recall':<12} {'F1':<12}")
for name, (p, r, f) in results.items():
    print(f"{name:<25} {p:.3f}        {r:.3f}       {f:.3f}")

Classifier                Precision    Recall       F1          
Conservative Classifier   1.000        0.600       0.750
Aggressive Classifier     0.556        1.000       0.714
Balanced Classifier       0.800        0.800       0.800
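
The F1 column can be re-derived from confusion-matrix counts alone. The sketch below is a plain-Python check, not the scikit-learn implementation; `f1_from_counts` is a hypothetical helper, and the Balanced counts are read off the `classifier_balanced` array (one false positive, one false negative).

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (TP, FP, FN) for each pre-built classifier
counts = {
    "Conservative": (3, 0, 2),
    "Aggressive": (5, 4, 0),
    "Balanced": (4, 1, 1),
}

f1_scores = {name: round(f1_from_counts(*c), 3) for name, c in counts.items()}
print(f1_scores)
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a perfect score on one metric cannot compensate for a weak score on the other.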


### Questions

1. The conservative classifier has 2 false negatives. What real-world mistake does this represent? Provide a movie review example.

2. What strategy did you use to achieve high precision in `classifier_yours`? Why does predicting positive less frequently increase precision?

3. Which classifier won on F1 score? Why doesn't the aggressive classifier win despite high recall?
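
One way to see why predicting positive less often tends to raise precision is to sweep a decision threshold over classifier scores. The scores below are invented for illustration (the task itself uses hard-coded predictions, not scores), and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(y_true, scores, threshold):
    """Predict 1 when score >= threshold, then compute precision and recall for class 1."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # mirrors zero_division=0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# Hypothetical scores: higher means "more likely positive"
scores = [0.10, 0.20, 0.35, 0.55, 0.30, 0.45, 0.60, 0.70, 0.80, 0.90]

for threshold in (0.25, 0.50, 0.65):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

With these made-up scores, raising the threshold removes false positives before it removes all true positives, so precision rises while recall falls. The 0.50 and 0.65 thresholds happen to reproduce the precision and recall of the balanced and conservative classifiers.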

**About This Task:**
Temperature controls randomness in language model outputs.

#### Easy Task 3: Temperature Effects on Text Generation

### Instructions

1. Execute code to see how temperature affects token selection with a confident model
2. Observe how probabilities and samples change across temperatures
3. Uncomment the uncertain probability distribution and run again
4. Compare temperature effects on confident vs uncertain models
5. Uncomment Try to add a new temperature value and analyze results

In [18]:
import numpy as np

In [19]:
original_probs = np.array([0.50, 0.30, 0.12, 0.05, 0.03])
tokens = ["positive", "negative", "neutral", "good", "bad"]

# Try uncertain distribution
# original_probs = np.array([0.25, 0.24, 0.22, 0.18, 0.11])  # much more balanced
# Run again and compare the temperature effects

In [20]:
def apply_temperature(probs, temperature):
    """Apply temperature scaling to change distribution sharpness"""
    if temperature == 0:
        # For temperature=0, return deterministic distribution
        result = np.zeros_like(probs)
        result[np.argmax(probs)] = 1.0
        return result

    # Apply temperature scaling
    # Steps: log -> divide by temp -> exp -> normalize
    logits = np.log(probs + 1e-10)  # Add small value to avoid log(0)
    scaled_logits = logits / temperature  # Scale by temperature
    exp_logits = np.exp(scaled_logits)  # Exponentiate
    normalized = exp_logits / np.sum(exp_logits)  # Normalize to sum to 1.0
    return normalized
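
Because `exp(log(p) / T)` equals `p ** (1 / T)`, the scaling in `apply_temperature` can also be read as raising each probability to the power `1/T` and renormalizing. The plain-Python sketch below illustrates that reading for `T > 0`; `power_rescale` is a hypothetical helper, and the rounded values for `T = 0.5` can be compared with the Temperature = 0.5 output of this task.

```python
def power_rescale(probs, temperature):
    """Raise each probability to 1/T, then renormalize so the result sums to 1.
    Assumes temperature > 0; the T = 0 case needs the argmax shortcut instead."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [x / total for x in powered]

confident = [0.50, 0.30, 0.12, 0.05, 0.03]

for temp in (0.5, 1.0, 2.0):
    rescaled = power_rescale(confident, temp)
    print(temp, [round(p, 3) for p in rescaled])
```

With `T < 1` the exponent is greater than 1, which shrinks small probabilities faster than large ones (sharpening). With `T > 1` the exponent is below 1, which lifts small probabilities relative to large ones (flattening).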

In [21]:
def visualize_distribution(probs, tokens):
    """Show probability distribution as bar chart"""
    for i, token in enumerate(tokens):
        bar_length = int(probs[i] * 100)
        bar = '█' * bar_length
        print(f"  {token:10s}: {probs[i]:.3f} {bar}")

In [22]:
# Test different temperatures
temperatures = [0, 0.5, 1.0, 2.0]
# Add temperature=3.0
# temperatures = [0, 0.5, 1.0, 2.0, 3.0]

In [23]:
print(f"Original (temperature=1.0) probabilities:")
visualize_distribution(original_probs, tokens)

Original (temperature=1.0) probabilities:
  positive  : 0.500 
  negative  : 0.300 
  neutral   : 0.120 
  good      : 0.050 
  bad       : 0.030 


In [24]:
for temp in temperatures:
    print(f"\n{'='*70}")
    print(f"Temperature = {temp}")
    print('='*70)

    # Apply temperature
    new_probs = apply_temperature(original_probs, temp)
    # Visualize
    visualize_distribution(new_probs, tokens)
    # Sample tokens
    print(f"\n  Sampling 10 tokens:")
    if temp == 0:
        samples = [tokens[np.argmax(new_probs)]] * 10
    else:
        samples = np.random.choice(tokens, size=10, p=new_probs)
    print(f"  {samples}")
    # Show diversity metric
    unique_tokens = len(set(samples))
    print(f"   Diversity: {unique_tokens}/10 unique tokens")
    # Explain what's happening
    if temp == 0:
        print(f"   Effect: Deterministic - always outputs '{samples[0]}'")
    elif temp < 1.0:
        print(f"   Effect: Sharpened - makes confident tokens more likely")
    elif temp == 1.0:
        print(f"   Effect: Unchanged - original distribution")
    else:
        print(f"   Effect: Flattened - makes all tokens more equally likely")


Temperature = 0
  positive  : 1.000 
  negative  : 0.000 
  neutral   : 0.000 
  good      : 0.000 
  bad       : 0.000 

  Sampling 10 tokens:
  ['positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive', 'positive']
   Diversity: 1/10 unique tokens
   Effect: Deterministic - always outputs 'positive'

Temperature = 0.5
  positive  : 0.699 
  negative  : 0.252 
  neutral   : 0.040 
  good      : 0.007 
  bad       : 0.003 

  Sampling 10 tokens:
  ['positive' 'positive' 'negative' 'positive' 'positive' 'positive'
 'negative' 'negative' 'positive' 'positive']
   Diversity: 2/10 unique tokens
   Effect: Sharpened - makes confident tokens more likely

Temperature = 1.0
  positive  : 0.500 
  negative  : 0.300 
  neutral   : 0.120 
  good      : 0.050 
  bad       : 0.030 

  Sampling 10 tokens:
  ['good' 'positive' 'negative' 'negative' 'positive' 'neutral' 'positive'
 'positive' 'positive' 'positive']
   Diversity: 4/10 unique tokens
   

### Questions

1. Why is temperature=0 critical for classification tasks? What would go wrong with temperature=1.0?

2. Compare temperature=0.5 vs 2.0. At what temperature did low-probability tokens like "bad" start appearing in samples?

3. With the uncertain distribution ([0.25, 0.24, 0.22, 0.18, 0.11]), how did temperature effects differ from the confident model?
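
One way to put a number on "confident vs uncertain" is Shannon entropy, which is higher for flatter distributions. This is a rough sketch, assuming entropy in bits is an acceptable summary of spread; `entropy_bits` is a hypothetical helper, and the uniform distribution is included only as a reference point.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = [0.50, 0.30, 0.12, 0.05, 0.03]
uncertain = [0.25, 0.24, 0.22, 0.18, 0.11]
uniform = [0.2] * 5  # upper bound for five tokens: log2(5) bits

for name, probs in [("confident", confident), ("uncertain", uncertain), ("uniform", uniform)]:
    print(f"{name:10s} entropy = {entropy_bits(probs):.3f} bits")
```

The uncertain distribution already sits close to the uniform upper bound, so raising the temperature has little room to flatten it further, whereas the confident distribution can change noticeably.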

**About This Task:**
Embedding similarity measures how semantically close two texts are in vector space.

#### Easy Task 4: Embedding Similarity Analysis

### Instructions

1. Execute code to see similarity matrix for movie reviews
2. Identify which reviews cluster together and which are distant
3. Uncomment Try to add reviews from different domains
4. Analyze whether restaurant/product reviews cluster with movie reviews
5. Add a random sentence to test similarity boundaries

In [25]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [26]:
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

In [27]:
# Movie reviews - different types
texts = [
    # Positive reviews
    "Amazing movie! Absolutely loved it!",
    "Fantastic film, highly recommend!",
    "Great cinematography and acting",

    # Negative reviews
    "Terrible waste of time",
    "Very disappointing and boring",
    "Poor acting and weak plot",

    # Neutral reviews
    "It was okay, nothing special",
    "Some good parts, some bad parts",

    # Off-topic
    "The weather is nice today",
    "I like eating pizza"
]

In [28]:
# Implement: Test domain transfer
# texts = [
#     # Positive movie reviews
#     "Amazing movie! Absolutely loved it!",
#     "Fantastic film, highly recommend!",
#     "Great cinematography and acting",
#
#     # Negative movie reviews
#     "Terrible waste of time",
#     "Very disappointing and boring",
#     "Poor acting and weak plot",
#
#     # Neutral movie reviews
#     "It was okay, nothing special",
#     "Some good parts, some bad parts",
#
#     # Positive restaurant review (different domain!)
#     "Amazing food! Absolutely loved it!",
#     "Fantastic restaurant, highly recommend!",
#
#     # Off-topic
#     "The weather is nice today",
#     "I like eating pizza",
# ]

In [29]:
labels = [f"Text {i+1}" for i in range(len(texts))]

In [30]:
# Generate embeddings
embeddings = model.encode(texts)
similarity_matrix = cosine_similarity(embeddings)

In [31]:
print(f"Each text converted to {embeddings.shape[1]}-dimensional vector")
print(f"Comparing {embeddings.shape[0]} texts\n")

Each text converted to 768-dimensional vector
Comparing 10 texts



In [32]:
# Show full similarity matrix
print("Similarity Matrix (0=unrelated, 1=identical):")
print(f"{'':10s}", end="")
for i in range(len(texts)):
    print(f"T{i+1:2d} ", end="")
print()
for i in range(len(texts)):
    print(f"Text {i+1:2d}:  ", end="")
    for j in range(len(texts)):
        if i == j:
            print("---- ", end="")
        else:
            sim = similarity_matrix[i][j]
            if sim > 0.6:
                print(f"{sim:.2f}*", end="")  # High similarity
            else:
                print(f"{sim:.2f} ", end="")
            print(" ", end="")
    print()
print("\n* = High similarity (>0.6)")

Similarity Matrix (0=unrelated, 1=identical):
          T 1 T 2 T 3 T 4 T 5 T 6 T 7 T 8 T 9 T10 
Text  1:  ---- 0.77* 0.52  0.09  0.24  0.23  0.31  0.16  0.10  0.03  
Text  2:  0.77* ---- 0.47  0.11  0.26  0.23  0.21  0.14  0.09  0.01  
Text  3:  0.52  0.47  ---- 0.10  0.32  0.51  0.36  0.26  0.12  0.06  
Text  4:  0.09  0.11  0.10  ---- 0.55  0.29  0.34  0.21  0.03  0.07  
Text  5:  0.24  0.26  0.32  0.55  ---- 0.50  0.55  0.36  0.09  0.13  
Text  6:  0.23  0.23  0.51  0.29  0.50  ---- 0.37  0.25  -0.03  0.08  
Text  7:  0.31  0.21  0.36  0.34  0.55  0.37  ---- 0.34  0.16  0.11  
Text  8:  0.16  0.14  0.26  0.21  0.36  0.25  0.34  ---- 0.12  0.21  
Text  9:  0.10  0.09  0.12  0.03  0.09  -0.03  0.16  0.12  ---- 0.20  
Text 10:  0.03  0.0
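
To find which texts cluster together without scanning the full matrix by eye, the off-diagonal pairs can be ranked by similarity. The sketch below uses a hand-copied 4x4 excerpt (Texts 1 to 4) of the matrix printed above, with the diagonal set to 1.0; `top_pairs` is a hypothetical helper.

```python
from itertools import combinations

def top_pairs(sim, k=3):
    """Rank all unordered pairs (i, j) with i < j by similarity, highest first."""
    pairs = [(sim[i][j], i, j) for i, j in combinations(range(len(sim)), 2)]
    return sorted(pairs, reverse=True)[:k]

# Excerpt for Texts 1-4, copied from the printed matrix
sim_excerpt = [
    [1.00, 0.77, 0.52, 0.09],
    [0.77, 1.00, 0.47, 0.11],
    [0.52, 0.47, 1.00, 0.10],
    [0.09, 0.11, 0.10, 1.00],
]

for score, i, j in top_pairs(sim_excerpt):
    print(f"Text {i+1} vs Text {j+1}: {score:.2f}")
```

In this excerpt the three positive reviews pair up with each other first, while every pair involving Text 4 (a negative review) scores near 0.1.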

In [33]:
# Detailed comparisons
print("Detailed Comparisons")
comparisons = [
    (0, 1, "Positive review vs Positive review"),
    (3, 4, "Negative review vs Negative review"),
    (0, 3, "Positive review vs Negative review"),
    (0, 8, "Movie review vs Off-topic text"),
]
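# Complete this: Compare movie vs restaurant reviews
# Note: index 8 is "Amazing food! Absolutely loved it!" only when the domain-transfer
# `texts` list from In [28] is active. With the default list, index 8 is the off-topic
# weather sentence, so the description below will not match the texts being compared.
comparisons.append((0, 8, "Positive MOVIE vs Positive RESTAURANT"))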
for i, j, description in comparisons:
    if i < len(texts) and j < len(texts):
        sim = similarity_matrix[i][j]
        print(f"\n{description}:")
        print(f"  Text {i+1}: '{texts[i]}'")
        print(f"  Text {j+1}: '{texts[j]}'")
        print(f"  Similarity: {sim:.3f}")
        if sim > 0.7:
            print(f"   Very similar. These texts are closely related in meaning")
        elif sim > 0.4:
            print(f"   Moderately similar. Some shared concepts")
        else:
            print(f"   Different topics or sentiments")

Detailed Comparisons

Positive review vs Positive review:
  Text 1: 'Amazing movie! Absolutely loved it!'
  Text 2: 'Fantastic film, highly recommend!'
  Similarity: 0.768
   Very similar. These texts are closely related in meaning

Negative review vs Negative review:
  Text 4: 'Terrible waste of time'
  Text 5: 'Very disappointing and boring'
  Similarity: 0.545
   Moderately similar. Some shared concepts

Positive review vs Negative review:
  Text 1: 'Amazing movie! Absolutely loved it!'
  Text 4: 'Terrible waste of time'
  Similarity: 0.090
   Different topics or sentiments

Movie review vs Off-topic text:
  Text 1: 'Amazing movie! Absolutely loved it!'
  Text 9: 'The weather is nice today'
  Similarity: 0.095
   Different topics or sentiments

Positive MOVIE vs Positive RESTAURANT:
  Text 1: 'Amazing movie! Absolutely loved it!'
  Text 9: 'The weather is nice today'
  Similarity: 0.095
   Different topics or sentiments


### Questions

1. Compare similarity between Text 1 and Text 2 (both positive) vs Text 1 and Text 4 (positive vs negative). What aspects of semantic meaning do embeddings prioritize?

2. Find similarity scores between two negative reviews (Text 4 and Text 5) and two positive reviews (Text 1 and Text 2). Why would averaging embeddings per class work for classification?

3. After adding restaurant reviews: How similar was "Amazing food" to "Amazing movie"? What does this reveal about domain transfer?
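
Question 2 hints at a simple classifier: average the embeddings of each class into a centroid, then assign new text to the nearest centroid by cosine similarity. A minimal sketch on made-up 2-dimensional vectors follows. The vectors, the labels, and the helper names are illustrative assumptions; with a real model each vector would come from `model.encode`.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_centroid(vector, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))

# Hypothetical toy embeddings grouped by class
train = {
    "negative": [[1.0, 0.1], [0.9, 0.3]],
    "positive": [[0.1, 1.0], [0.2, 0.8]],
}
centroids = {label: mean_vector(vecs) for label, vecs in train.items()}

print(nearest_centroid([0.3, 0.9], centroids))
print(nearest_centroid([0.8, 0.2], centroids))
```

This only works when same-class embeddings are closer to each other than to the other class, which is what the within-class versus cross-class similarities in this task are probing.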