# CatShop: Learning Machine Learning Through a Cat's Eyes 🐱
*A Beginner's Guide to Three Core ML Approaches*


## Today's Learning Goals

We're going to explore machine learning by building **CatShop** - an e-commerce system that thinks like a cat!

**Why cats?** 
- Makes abstract concepts concrete and fun
- Forces us to think about perspective (a key ML skill!)
- Shows how we transform data for specific tasks

**What you'll learn:**
1. **Supervised Learning**: Teaching computers with examples
2. **Unsupervised Learning**: Finding patterns without labels  
3. **Active Learning**: Being smart about what to label

## The Big Picture

Think of ML like teaching a child:
- **Supervised**: "This is a dog, this is a cat" (lots of examples)
- **Unsupervised**: "Group these animals by similarity" (no labels)
- **Active**: "Which animal are you unsure about?" (strategic learning)

Today we'll use a real language model (Gemma-3) to experience all three!

## Part 1: Setting Up Our Workshop

In [None]:
import sys
print(sys.executable)
print(sys.version)

In [None]:
# RISE Configuration Cell (Run this first)
from traitlets.config.manager import BaseJSONConfigManager
from pathlib import Path

# Configure RISE settings
path = Path.home() / ".jupyter" / "nbconfig"
cm = BaseJSONConfigManager(config_dir=str(path))
cm.update('livereveal', {
    'scroll': True,  # Enable scrolling
    'width': 1024,
    'height': 768,
    'start_slideshow_at': 'beginning',
    'theme': 'white',  # Clean theme for teaching
    'transition': 'none',  # No distracting transitions
    'enable_chalkboard': True,  # For annotations during lecture
    'autolaunch': False
})

print("✅ RISE configured for teaching presentation")

In [None]:
import os
import torch

def get_optimal_device():
    """Get the best available device: CUDA, then a working Apple MPS, then CPU"""
    # Honor the FORCE_CPU override before probing any accelerator
    if os.environ.get('FORCE_CPU') == '1':
        print("Forced to CPU mode for compatibility")
        return torch.device("cpu")
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        # Check that MPS actually works by running a small tensor op
        try:
            test = torch.tensor([1.0]).to("mps")
            _ = test * 2
            return torch.device("mps")
        except Exception:
            print("MPS available but not functional, using CPU")
    return torch.device("cpu")

# STUDENT: uncomment the next line to force CPU mode
# os.environ['FORCE_CPU'] = '1'

# Set device globally
device = get_optimal_device()
print(f"Using device: {device}")

In [None]:
# Install required packages (run once, then restart the kernel so the upgraded versions are imported)
# Gemma-3 support was added in transformers 4.50
%pip install -U "transformers>=4.50" datasets evaluate accelerate peft
%pip install -q torch scikit-learn matplotlib pandas
%pip install -q safetensors

import json
import random
import warnings
from collections import Counter, defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

import accelerate
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import peft
import torch
import torch.nn.functional as F
import transformers
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, confusion_matrix, silhouette_score, davies_bouldin_score
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore')

# Create directory structure
Path('data/config').mkdir(parents=True, exist_ok=True)
Path('data/processed').mkdir(parents=True, exist_ok=True)
Path('models/gemma-cat-lora').mkdir(parents=True, exist_ok=True)

print("🐱 Welcome to CatShop ML Tutorial!")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__} | PEFT version: {peft.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Using device: {device}")

Great! We've imported our tools. Notice we're using PEFT (Parameter-Efficient Fine-Tuning), a family of techniques for adapting a large model by training only a small number of parameters while its original weights stay frozen. We'll use LoRA (Low-Rank Adaptation). Think of it like teaching a skilled chef a new cuisine: we don't retrain everything they know about cooking, we just add the new knowledge on top.
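A back-of-the-envelope sketch of why LoRA trains so few parameters. LoRA keeps a weight matrix W (shape d_out × d_in) frozen and learns a low-rank update B·A, with B of shape d_out × r and A of shape r × d_in, so each adapted matrix adds r·(d_out + d_in) trainable parameters instead of d_out·d_in. The 1024 × 1024 layer below is a hypothetical size chosen for round numbers, not Gemma-3-270m's actual projection shape; the rank r=8 matches the `LoraConfig` used later.

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # B has shape (d_out, rank) and A has shape (rank, d_in)
    return rank * (d_out + d_in)

def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen weight matrix itself (bias ignored)."""
    return d_out * d_in

d = 1024  # hypothetical hidden size, for illustration only
r = 8     # same rank as the LoraConfig used later (r=8)

full = full_params(d, d)
lora = lora_params(d, d, r)
print(f"full matrix: {full:,} params | LoRA update: {lora:,} params | ratio: {lora / full:.2%}")
```

The saving grows with layer size: doubling d quadruples the full matrix but only doubles the LoRA update.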

## Part 2: Loading and Understanding Our Data

In machine learning, data is everything. 

**The Challenge:** We have normal product descriptions. We need to teach the computer how a cat would see them!

In [None]:
# Load e-commerce products
with open('data/items_shuffle_1000.json', 'r') as f:
    items = json.load(f)

print(f"📦 We have {len(items)} products to work with")


### How Cats See the World 🐱

**Key Insight:** Same product, different perspective!
- Human sees: "Laptop computer"
- Cat sees: "Warm napping surface"

In [None]:
# Define the cat's worldview
CAT_CATEGORIES = {
    "NAP_SURFACE": "Things to sleep on",
    "HUNT_PLAY": "Things to chase",
    "TERRITORY": "Things to claim",
    "GROOMING": "Self-care items",
    "CONSUMPTION": "Food and water",
    "DANGER": "SCARY THINGS!",
    "IRRELEVANT": "Boring human stuff"
}

CATEGORY_TO_ID = {cat: i for i, cat in enumerate(CAT_CATEGORIES.keys())}
ID_TO_CATEGORY = {i: cat for cat, i in CATEGORY_TO_ID.items()}

# Use consistently throughout:
# - Use CAT_CATEGORIES.keys() when iterating over category names
# - Use CATEGORY_TO_ID[category] to get ID from category name
# - Use ID_TO_CATEGORY[id] to get category name from ID


Notice how we're transforming the problem. Instead of traditional e-commerce categories, we're creating a new taxonomy based on cat behavior. 

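💡 **This is problem framing (label design):** deciding *what* the model should predict. Like feature engineering, choosing a useful representation, here of the target, matters as much as the model itself!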


## Part 3: Data Transformation - Teaching Machines to Think Like Cats

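Now comes the interesting part: transforming human product descriptions into cat perspectives. We'll use a keyword rule set (prepared in advance with an LLM's help) stored in `data/config/cat_mapping_rules.json`, plus an optional top-level category map. Each product gets the category-map label if its top-level breadcrumb is listed there; otherwise the longest matching keyword wins; otherwise it is labeled `IRRELEVANT`.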


In [None]:
import json, os
from pathlib import Path
from collections import Counter

# 1) Load datasets
with open('data/items_shuffle_1000.json', 'r') as f:
    raw = json.load(f)
items_list = raw if isinstance(raw, list) else list(raw.values())

# 2) Load rules
with open('data/config/cat_mapping_rules.json', 'r') as f:
    keyword_rules = json.load(f)  # keyword -> cat label

# Optional: category mapping (top-level breadcrumb -> cat label)
top_map_path = Path('data/config/webshop_top_to_cat.json')
top_map = {}
if top_map_path.exists():
    with top_map_path.open('r') as f:
        top_map = json.load(f)


def top_level_from_breadcrumb(breadcrumb: str):
    if not isinstance(breadcrumb, str) or not breadcrumb.strip():
        return None
    for sep in ["›", ">", "/"]:
        if sep in breadcrumb:
            parts = [p.strip() for p in breadcrumb.split(sep)]
            for p in parts:
                if p:
                    return p
    return breadcrumb.strip()

def build_text(it):
    fields = [
        it.get("name",""),
        it.get("title",""),
        it.get("category",""),
        it.get("product_category",""),
        it.get("small_description_old",""),
        it.get("full_description",""),
    ]
    return " ".join(f for f in fields if isinstance(f, str)).lower()

def label_item(it):
    # 1) Category-based label (preferred if available)
    tl = top_level_from_breadcrumb(it.get("product_category", ""))
    if tl and tl in top_map:
        return top_map[tl]

    # 2) Keyword-based label (longest match wins)
    text = build_text(it)
    cat, max_len = "IRRELEVANT", 0
    for kw, label in keyword_rules.items():
        k = kw.lower()
        if k and k in text and len(k) > max_len:
            cat, max_len = label, len(k)
    return cat

# 3) Transform all products
cat_products = []
for it in items_list:
    name = it.get("name") or it.get("title") or "Unknown"
    cat = label_item(it)
    cat_products.append({
        "name": name[:200],
        "cat_category": cat,
        "cat_category_id": CATEGORY_TO_ID[cat]
    })

# 4) Analyze + save
dist = Counter([p['cat_category'] for p in cat_products])
print("📊 Category distribution:")
for cat, count in dist.most_common():
    pct = count / len(cat_products) * 100
    print(f"  {cat:12s}: {count:4d} ({pct:5.1f}%)")

Path('data/processed').mkdir(parents=True, exist_ok=True)
with open('data/processed/cat_products.json', 'w') as f:
    json.dump(cat_products, f, indent=2)
print(f"✅ Saved {len(cat_products)} transformed products -> data/processed/cat_products.json")


This rule-based labeling is our starting point. In industry, this is called **weak supervision**: using heuristics to create noisy initial labels that we can refine with ML. A broader rule set labels more products, but every product that no rule matches silently falls back to `IRRELEVANT`, so check the distribution above before trusting the labels.
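The keyword fallback in `label_item()` is easiest to see on a toy example. The sketch below reimplements the same longest-match rule as a standalone function; `keyword_label` and the `toy_rules` entries are hypothetical, for illustration only (the real rules live in `data/config/cat_mapping_rules.json`).

```python
def keyword_label(text: str, rules: dict[str, str], default: str = "IRRELEVANT") -> str:
    """Return the label of the longest rule keyword found in text (case-insensitive)."""
    text = text.lower()
    best_label, best_len = default, 0
    for kw, label in rules.items():
        k = kw.lower()
        # Plain substring match; a longer matching keyword overrides a shorter one
        if k and k in text and len(k) > best_len:
            best_label, best_len = label, len(k)
    return best_label

# Hypothetical rules, for illustration only
toy_rules = {"box": "TERRITORY", "cardboard box": "NAP_SURFACE", "vacuum": "DANGER"}

print(keyword_label("Large Cardboard Box for Moving", toy_rules))  # longest match: "cardboard box"
print(keyword_label("Robot Vacuum Cleaner", toy_rules))
print(keyword_label("USB-C Cable", toy_rules))                     # no match: default label
print(keyword_label("Xbox Controller", toy_rules))                 # "box" fires inside "xbox"
```

The last call shows the main weakness of substring matching: "box" also matches "Xbox" and "toolbox". Matching on word boundaries would avoid that, at the cost of missing forms like plurals unless they are listed as separate keywords.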



### Quick diagnostics to validate robustness

Where did each label come from (category vs keywords)?

In [None]:
from collections import Counter
import json
from pathlib import Path

items = json.loads(Path("data/items_shuffle_1000.json").read_text())
items = items if isinstance(items, list) else list(items.values())
rules = json.loads(Path("data/config/cat_mapping_rules.json").read_text())
# The category map is optional, exactly as in the labeling cell above
top_map_path = Path("data/config/webshop_top_to_cat.json")
top_map = json.loads(top_map_path.read_text()) if top_map_path.exists() else {}

def top_level(bc):
    if not isinstance(bc, str): return None
    for sep in ["›", ">", "/"]:
        if sep in bc:
            for p in [s.strip() for s in bc.split(sep)]:
                if p: return p
    return bc.strip() or None

def build_text(it):
    fields = [it.get("name",""), it.get("title",""), it.get("category",""), it.get("product_category",""), it.get("small_description_old",""), it.get("full_description","")]
    return " ".join([f for f in fields if isinstance(f, str)]).lower()

src_counter = Counter()
label_counter = Counter()
examples = {k: [] for k in ["category","keyword"]}

for it in items:
    tl = top_level(it.get("product_category",""))
    if tl in top_map:
        label = top_map[tl]
        src = "category"
    else:
        text = build_text(it)
        label, max_len = "IRRELEVANT", 0
        match_kw = None
        for kw, lab in rules.items():
            k = kw.lower()
            if k and k in text and len(k) > max_len:
                label, max_len, match_kw = lab, len(k), kw
        src = "keyword"
    src_counter[src] += 1
    label_counter[label] += 1
    if len(examples[src]) < 3:
        examples[src].append((it.get("name") or it.get("title") or "Unknown", tl if src=="category" else match_kw, label))

print("By source:", src_counter)
print("By label:", label_counter)
print("Examples (category):", examples["category"])
print("Examples (keyword):", examples["keyword"])

## Part 3.5: Preparing Rich Training Data

To train our model effectively, we need diverse training examples. Let's load the pre-generated conversational and explanation data.


### Generate and load training data (Gemma-3-270m)

Two datasets for instruction-style fine-tuning are generated by the script `Lecture 1 Overview/data/generate_training_data.py`:
- `conversation_examples.json`: short cat-thought responses given product names.
- `explanation_examples.json`: brief rationales for the assigned cat category.

The cell below assumes the script has already been run: it loads both files and prints a few samples from each.

In [None]:
from pathlib import Path
import json
from collections import Counter

# Adjust this if your working directory is repo root:
# processed_dir = Path("Lecture 1 Overview/data/processed")
processed_dir = Path("data/processed")

with open(processed_dir / "conversation_examples.json") as f:
    conversation_examples = json.load(f)

with open(processed_dir / "explanation_examples.json") as f:
    explanation_examples = json.load(f)

print("✅ Loaded datasets")
print(f"- Conversations: {len(conversation_examples)}")
print(f"- Explanations: {len(explanation_examples)}")

def show_examples(examples, n=3, title="Examples"):
    print(f"\n--- {title} (showing {n}) ---")
    for ex in examples[:n]:
        # Records may be nested ({'conversation': {...}}) or flat ({'prompt': ..., 'completion': ...})
        block = ex.get('conversation') or ex.get('explanation') or ex
        prompt = block.get('prompt', '')
        completion = block.get('completion', '')
        print(f"- Product: {ex.get('product_name','')}")
        print(f"  Category: {ex.get('category','')}")
        print(f"  Prompt: {prompt[:140]}...")
        print(f"  Completion: {completion[:200]}...\n")

show_examples(conversation_examples, n=3, title="Conversation examples")
show_examples(explanation_examples, n=3, title="Explanation examples")

Mixing three kinds of training data (classification, conversation, and explanation) is meant to teach the model several aspects of the task at once, and to reduce the risk that fine-tuning on short category labels erodes its general text-generation ability.
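How much of the training mix is classification? The `CatProductDataset` class in Part 5 uses every training product as a classification example and adds at most `len(products)//3` conversation examples and at most as many explanation examples. The sketch below uses a hypothetical helper `mix_counts` and hypothetical sizes (800 products, 1,000 examples of each auxiliary kind) to work out the proportions. The real class also skips records missing a prompt or completion, so actual counts can be lower.

```python
def mix_counts(n_products: int, n_conv: int, n_expl: int) -> dict[str, int]:
    """Examples per source when each auxiliary set is capped at n_products // 3."""
    cap = n_products // 3
    return {
        "classification": n_products,
        "conversation": min(n_conv, cap),
        "explanation": min(n_expl, cap),
    }

# Hypothetical sizes, for illustration only
counts = mix_counts(800, 1000, 1000)
total = sum(counts.values())
for source, n in counts.items():
    print(f"{source:15s} {n:5d} ({n / total:.1%})")
print(f"{'total':15s} {total:5d}")
```

With these numbers roughly three fifths of the examples are classification, so the label-prediction task still dominates the gradient signal.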



## Part 4: Building Our Gemma-3 Cat Classifier

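Now we'll load `google/gemma-3-270m`, a 270M-parameter language model from Google's Gemma 3 family. It is small enough to fine-tune on modest hardware, and we'll use it as the base model that we teach to think like a cat. (Gemma weights are gated on the Hugging Face Hub: you may need to accept the license on the model page and log in with `huggingface-cli login` before the download works.)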


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

class GemmaCatClassifier:
    """
    A classifier that uses Gemma-3 to categorize products from a cat's perspective.
    Uses PEFT (LoRA) for efficient fine-tuning.
    """
    
    def __init__(self, model_name="google/gemma-3-270m", use_lora=True, checkpoint_path=None):
        print(f"🔧 Initializing {model_name}...")
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Consistent device handling
        self.device = device  # Use global device
        
        # Load model. Gemma models are usually run in bfloat16 (float16 can overflow
        # and produce NaNs), so use bf16 on GPUs that support it and float32 elsewhere.
        use_bf16 = self.device.type == "cuda" and torch.cuda.is_bf16_supported()
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            attn_implementation="eager",  # For Gemma-3, prefer eager attention
            torch_dtype=torch.bfloat16 if use_bf16 else torch.float32,
        )
        
        # Move to the globally selected device (respects FORCE_CPU)
        self.model = self.model.to(self.device)
        
        # Check available modules for LoRA
        if use_lora and checkpoint_path is None:
            print("🔍 Checking available modules for LoRA...")
            available_modules = [n for n, _ in self.model.named_modules() 
                               if any(key in n for key in ['proj', 'gate', 'fc'])]
            print(f"  Found modules: {available_modules[:5]}...")  # Show first 5
            
            # Apply LoRA for efficient fine-tuning
            print("📎 Applying LoRA for efficient fine-tuning...")
            peft_config = LoraConfig(
                task_type=TaskType.CAUSAL_LM,
                inference_mode=False,
                r=8,  # rank
                lora_alpha=32,
                lora_dropout=0.1,
                target_modules=["q_proj", "v_proj", "k_proj", "o_proj"]  # Gemma modules
            )
            self.model = get_peft_model(self.model, peft_config)
            trainable, total = self.model.get_nb_trainable_parameters()
            pct = 100 * trainable / total
            print(f"Trainable params: {trainable:,} || all params: {total:,} || trainable%: {pct:.4f}")
        
        # Load from checkpoint if provided
        if checkpoint_path:
            print(f"📂 Loading checkpoint from {checkpoint_path}")
            from peft import PeftModel
            self.model = PeftModel.from_pretrained(self.model, checkpoint_path)
        
        self.device = next(self.model.parameters()).device
        
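        # Define category tokens for classification
        self.cat_tokens = {
            'NAP_SURFACE': 'nap', 'HUNT_PLAY': 'hunt',
            'TERRITORY': 'territory', 'DANGER': 'danger',
            'CONSUMPTION': 'food', 'GROOMING': 'groom',
            'IRRELEVANT': 'boring'
        }
        
        # Get token IDs. Training targets follow "This is" with a leading space
        # (" nap"), and tokenizers usually encode "nap" and " nap" as different tokens.
        # So tokenize each word in that context and keep the first token after "This is".
        anchor = "Answer: This is"
        anchor_len = len(self.tokenizer.encode(anchor, add_special_tokens=False))
        self.category_token_ids = {}
        for cat, token in self.cat_tokens.items():
            token_ids = self.tokenizer.encode(f"{anchor} {token}", add_special_tokens=False)
            self.category_token_ids[cat] = token_ids[anchor_len]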
    
    def classify(self, product_name, return_probs=False):
        """
        Classify a product using language model probabilities.
        This is the key insight: we use the model's next-token predictions!
        """

        
        device = next(self.model.parameters()).device
        prompt = f"Question: How would a cat categorize '{product_name}'?\nAnswer: This is" # Create prompt
        inputs = self.tokenizer(prompt, return_tensors="pt").to(device) # Tokenize
        self.model.eval()
        
        
        # Get predictions
        with torch.no_grad():
            outputs = self.model(**inputs)
            next_token_logits = outputs.logits[0, -1, :]
            
            # Extract logits for our category tokens
            category_logits = []
            for cat in CAT_CATEGORIES.keys():
                token_id = self.category_token_ids[cat]
                category_logits.append(next_token_logits[token_id])
            
            # Convert to probabilities
            probs = F.softmax(torch.stack(category_logits), dim=0)
        
        # Get prediction
        pred_idx = torch.argmax(probs).item()
        pred_category = list(CAT_CATEGORIES.keys())[pred_idx]
        
        if return_probs:
            prob_dict = {cat: p.item() for cat, p in zip(CAT_CATEGORIES.keys(), probs)}
            return pred_category, prob_dict
        return pred_category
    
    def get_uncertainty(self, product_name):
        """Calculate uncertainty using entropy of probability distribution"""
        _, probs = self.classify(product_name, return_probs=True)
        # Calculate entropy
        entropy = -sum(p * np.log(p + 1e-10) for p in probs.values() if p > 0)
        return entropy
    
    def get_embeddings(self, texts, batch_size=8):
        """Extract embeddings for unsupervised learning analysis"""
        embeddings = []

        # Check if we're on MPS and using PEFT - if so, temporarily use CPU
        is_mps = str(next(self.model.parameters()).device).startswith('mps')
        is_peft = hasattr(self.model, 'peft_config')

        if is_mps and is_peft:
            # Temporarily move to CPU for embedding extraction
            print("📍 Note: Using CPU for embedding extraction (MPS+PEFT compatibility)")
            original_device = next(self.model.parameters()).device
            self.model = self.model.to('cpu')
            compute_device = 'cpu'
        else:
            compute_device = self.device
            original_device = None

        for i in range(0, len(texts), batch_size):
            batch = texts[i:i+batch_size]
            inputs = self.tokenizer(batch, return_tensors="pt", 
                                   padding=True, truncation=True, max_length=256).to(compute_device)

            self.model.eval()
            with torch.no_grad():
                outputs = self.model(**inputs, output_hidden_states=True)
                # Use mean pooling of last hidden state
                hidden = outputs.hidden_states[-1]
                mask = inputs.attention_mask.unsqueeze(-1)
                masked_hidden = hidden * mask
                summed = masked_hidden.sum(dim=1)
                counts = mask.sum(dim=1)
                mean_pooled = summed / counts
                # .float() first: NumPy cannot convert bfloat16 tensors directly
                embeddings.extend(mean_pooled.float().cpu().numpy())

        # Move model back to original device if we switched
        if original_device is not None:
            self.model = self.model.to(original_device)

        return np.array(embeddings)

# Initialize our classifier
classifier = GemmaCatClassifier()


In [None]:
import json
from sklearn.model_selection import train_test_split

with open("data/processed/cat_products.json") as f:
    cat_products = json.load(f)

train_products, val_products = train_test_split(
    cat_products,
    test_size=0.2,
    random_state=42,
    stratify=[p['cat_category_id'] for p in cat_products]
)

print(f"Validation size: {len(val_products)}")

In [None]:
def evaluate_model_capabilities(classifier, test_products_sample):
    """Evaluate both classification and conversation abilities"""
    
    # Classification accuracy on (up to) 50 validation products
    sample = test_products_sample[:50]
    correct = sum(classifier.classify(p['name']) == p['cat_category'] for p in sample)
    accuracy = correct / len(sample)

    # Test on specific example products
    test_products = ["laptop computer", "cardboard box", "cat toy", "vacuum cleaner"]
    test_results = []
    for product in test_products:
        pred, probs = classifier.classify(product, return_probs=True)
        confidence = probs[pred]
        test_results.append({
            'product': product,
            'prediction': pred,
            'confidence': confidence,
            'probs': probs
        })

    # Better prompts for conversation
    test_prompts = [
        "You are a cat expert. Question: Why do cats love boxes? Answer:",
        "You are a cat behavior specialist. Question: My cat keeps knocking things off the table. What's she thinking? Answer:",
        "From a cat's perspective, explain why a laptop is seen as a napping surface:"
    ]

    responses = []
    classifier.model.eval()
    
    for prompt in test_prompts:
        inputs = classifier.tokenizer(
            prompt, 
            return_tensors="pt", 
            truncation=True, 
            max_length=256
        ).to(next(classifier.model.parameters()).device)  # wherever the model currently lives
        
        with torch.no_grad():
            outputs = classifier.model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                repetition_penalty=1.2,
                pad_token_id=classifier.tokenizer.pad_token_id,
                eos_token_id=classifier.tokenizer.eos_token_id
            )
        
        # Decode only newly generated tokens
        gen_ids = outputs[0, inputs["input_ids"].shape[-1]:]
        response = classifier.tokenizer.decode(gen_ids, skip_special_tokens=True)
        
        # Clean up response
        response = response.replace("Human:", "").replace("Assistant:", "").strip()
        if "\n" in response:
            response = response.split("\n")[0]
        
        responses.append(response)
    
    return accuracy, test_results, responses

def display_model_evaluation(accuracy, test_results, responses):
    """Display evaluation results in a clear, comprehensive format"""
    
    # Overall accuracy
    print(f"\n📈 Validation Accuracy: {accuracy:.1%}")
    
    # Test products classification
    print("\n🧪 Test Products Classification:")
    print("-" * 50)
    
    for result in test_results:
        product = result['product']
        pred = result['prediction']
        confidence = result['confidence']
        
        # Create confidence bar
        bar_length = int(confidence * 20)
        confidence_bar = '█' * bar_length + '░' * (20 - bar_length)
        
        print(f"  '{product:20s}' → {pred:12s} [{confidence_bar}] {confidence:.1%}")
    
    # Show confidence distribution for interesting cases
    print("\n📊 Confidence Distribution (for 'laptop computer'):")
    if test_results:
        laptop_result = next((r for r in test_results if r['product'] == 'laptop computer'), None)
        if laptop_result and 'probs' in laptop_result:
            probs = laptop_result['probs']
            # Sort by probability
            sorted_cats = sorted(probs.items(), key=lambda x: x[1], reverse=True)[:3]
            for cat, prob in sorted_cats:
                bar_length = int(prob * 10)
                bar = '▸' * bar_length + '·' * (10 - bar_length)
                print(f"    {cat:12s}: [{bar}] {prob:.1%}")
    
    # Conversation quality
    print("\n💬 Conversation Quality Samples:")
    print("-" * 50)
    
    questions = [
        "Why do cats love boxes?",
        "Why do cats knock things off tables?",
        "Why is a laptop a napping surface?"
    ]
    
    for i, (q, r) in enumerate(zip(questions, responses)):
        print(f"\n  Q{i+1}: {q}")
        # Truncate and clean response
        clean_response = r[:100] if len(r) > 100 else r
        if len(r) > 100:
            clean_response = clean_response.rsplit(' ', 1)[0] + "..."
        print(f"  🐱: {clean_response}")

In [None]:
# Baseline: untrained base model (no LoRA)
baseline_classifier = GemmaCatClassifier(use_lora=False)

# Ensure tokenizer pad and device consistency
if baseline_classifier.tokenizer.pad_token is None:
    baseline_classifier.tokenizer.pad_token = baseline_classifier.tokenizer.eos_token
baseline_classifier.model.to(baseline_classifier.device)

print("\n📊 Untrained baseline (no LoRA)")
print("-"*40)
acc0, testres0, resp0 = evaluate_model_capabilities(baseline_classifier, val_products)
display_model_evaluation(acc0, testres0, resp0)

Notice how we're using the language model for classification? Instead of adding a classification head, we check which category token the model considers most likely to come next after "This is". This needs no new layers, and because the model is still an ordinary language model, it keeps its ability to generate free text!
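Stripped of the model, the scoring step in `classify()` is a softmax restricted to a handful of candidate token logits. The sketch below uses a hypothetical 6-token vocabulary, made-up logit values, and a hypothetical helper `restricted_softmax`; it is pure Python rather than the `torch` calls used in the class.

```python
import math

def restricted_softmax(logits: list[float], candidate_ids: dict[str, int]) -> dict[str, float]:
    """Softmax over only the candidate tokens' logits; all other tokens are ignored."""
    picked = {name: logits[tid] for name, tid in candidate_ids.items()}
    m = max(picked.values())  # subtract the max for numerical stability
    exps = {name: math.exp(v - m) for name, v in picked.items()}
    z = sum(exps.values())
    return {name: e / z for name, e in exps.items()}

# Hypothetical next-token logits over a tiny 6-token vocabulary
logits = [0.1, 2.0, -1.0, 3.5, 0.0, 1.2]
candidates = {"NAP_SURFACE": 3, "HUNT_PLAY": 1, "IRRELEVANT": 5}

probs = restricted_softmax(logits, candidates)
prediction = max(probs, key=probs.get)
print(prediction, {k: round(v, 3) for k, v in probs.items()})
```

One consequence of this design: the probabilities are relative among the category tokens only. A 90% "confidence" can coexist with the model actually preferring some unrelated word that was never a candidate.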


## Part 5: Supervised Learning - Fine-tuning with LoRA

Time for our first paradigm: **Supervised Learning**. We'll fine-tune Gemma-3 using our labeled data. Thanks to LoRA, this will be fast and memory-efficient.
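Supervised fine-tuning of a causal language model usually computes the loss only on the answer tokens: prompt and padding positions get the label -100, which is the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below builds one padded example from hypothetical token IDs (a 4-token prompt and a 1-token answer standing in for " nap"); `build_example` is an illustrative helper, not part of the notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_example(prompt_ids: list[int], answer_ids: list[int], max_length: int, pad_id: int):
    """Concatenate prompt + answer, right-pad, and mask everything except the answer."""
    full = (prompt_ids + answer_ids)[:max_length]
    n_prompt = min(len(prompt_ids), len(full))
    labels = [IGNORE_INDEX] * n_prompt + full[n_prompt:]
    pad_len = max_length - len(full)
    input_ids = full + [pad_id] * pad_len
    attention_mask = [1] * len(full) + [0] * pad_len
    labels += [IGNORE_INDEX] * pad_len  # never train on padding
    return input_ids, attention_mask, labels

# Hypothetical token IDs, for illustration only
ids, mask, labels = build_example([2, 11, 12, 13], [42], max_length=8, pad_id=0)
print(ids)     # [2, 11, 12, 13, 42, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 1, 0, 0, 0]
print(labels)  # [-100, -100, -100, -100, 42, -100, -100, -100]
```

Labels stay aligned with `input_ids`; Hugging Face causal-LM models shift them by one position internally, so label 42 is predicted from the hidden state at the last prompt token.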


In [None]:
from torch.utils.data import Dataset, DataLoader
from transformers import Trainer, TrainingArguments
import torch.nn as nn

class CatProductDataset(Dataset):
    """Dataset for training our cat classifier"""
    
    def __init__(self, products, conversation_examples, explanation_examples, tokenizer, max_length=128):
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.examples = []
        
        # Add classification examples
        for product in products:
            self.examples.append({
                'input': f"Question: How would a cat categorize '{product['name']}'?\nAnswer: This is",
                'output': f" {classifier.cat_tokens[product['cat_category']]}",
                'category_id': product['cat_category_id']
            })
            
        # Add conversation examples
        for conv in conversation_examples[:len(products)//3]:  # Add 1/3 as many
            block = conv.get('conversation', conv)  # support nested or flat
            prompt = block.get('prompt')
            completion = block.get('completion')
            if not prompt or not completion:
                continue
            self.examples.append({
                'input': prompt,
                'output': completion,
                'category_id': CATEGORY_TO_ID.get(conv.get('category'), 6)
            })

        # Add explanation examples
        for expl in explanation_examples[:len(products)//3]:  # Add 1/3 as many
            block = expl.get('explanation', expl)  # support nested or flat
            prompt = block.get('prompt')
            completion = block.get('completion')
            if not prompt or not completion:
                continue
            self.examples.append({
                'input': prompt,
                'output': completion,
                'category_id': CATEGORY_TO_ID.get(expl.get('category'), 6)
            })


        
        print(f"  Created dataset with {len(self.examples)} total examples")
        print(f"    - Classification: {len(products)}")
        print(f"    - Conversations: {min(len(conversation_examples), len(products)//3)}")
        print(f"    - Explanations: {min(len(explanation_examples), len(products)//3)}")
    
    def __len__(self):
        return len(self.examples)
    
    def __getitem__(self, idx):
        example = self.examples[idx]
        full_text = example['input'] + example['output']
        
        # Tokenize prompt and prompt+answer the same way (both get the BOS token),
        # so the prompt length lines up with where the answer starts in full_ids
        prompt_ids = self.tokenizer(example['input'])['input_ids']
        full_ids = self.tokenizer(
            full_text, truncation=True, max_length=self.max_length
        )['input_ids']
        
        # Create labels: mask the prompt (-100) so loss is computed on the answer only
        n_prompt = min(len(prompt_ids), len(full_ids))
        labels = [-100] * n_prompt + full_ids[n_prompt:]
        
        # Right-pad to max_length; padding is excluded from attention and from the loss
        pad_len = self.max_length - len(full_ids)
        input_ids = full_ids + [self.tokenizer.pad_token_id] * pad_len
        attention_mask = [1] * len(full_ids) + [0] * pad_len
        labels = labels + [-100] * pad_len
        
        return {
            'input_ids': torch.tensor(input_ids),
            'attention_mask': torch.tensor(attention_mask),
            'labels': torch.tensor(labels)
        }


In [None]:

def train_supervised_model(classifier, cat_products, conversation_examples, explanation_examples, epochs=2):
    """
    Fine-tune the model using supervised learning.
    This is where the magic happens!
    """
    print("\n🎯 SUPERVISED LEARNING: Fine-tuning Gemma-3")
    print("="*60)
    
    # Split data
    train_products, val_products = train_test_split(
        cat_products, test_size=0.2, random_state=42,
        stratify=[p['cat_category_id'] for p in cat_products]
    )
    
    print(f"📊 Dataset split:")
    print(f"  Training: {len(train_products)} products")
    print(f"  Validation: {len(val_products)} products")
    
    # Create datasets with mixed examples
    train_dataset = CatProductDataset(
        train_products, conversation_examples, explanation_examples, classifier.tokenizer
    )
    val_dataset = CatProductDataset(
        val_products, [], [], classifier.tokenizer  # Val only needs classification
    )
    
    # Training arguments
    training_args = TrainingArguments(
        output_dir="./models/gemma-cat-lora",
        # use_cpu=True,                # uncomment to force training on CPU (e.g. if MPS is unstable on Mac)
        num_train_epochs=epochs,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        gradient_accumulation_steps=2, # effective batch size: 4 x 2 = 8 per device
        warmup_steps=50,               # reduced from 100
        learning_rate=1e-5,            # KEY CHANGE: was 3e-4 (30x lower)
        fp16=False,
        logging_steps=50,
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        report_to="none"               # "none" disables logging integrations; None would enable all installed ones
    )

    # Create trainer
    trainer = Trainer(
        model=classifier.model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        processing_class=classifier.tokenizer,  # replaces the deprecated `tokenizer=` argument
    )
    
    # Train!
    print("\n🚀 Starting training...")
    train_result = trainer.train()
    
    # Save the model
    trainer.save_model("./models/gemma-cat-lora")
    print("✅ Model saved to ./models/gemma-cat-lora")
    
    # Evaluate
    print("\n📊 Evaluating on validation set...")
    eval_result = trainer.evaluate()
    print(f"  Validation loss: {eval_result['eval_loss']:.4f}")
    
    # Test accuracy on (up to) the first 50 validation products
    sample = val_products[:50]
    correct = sum(classifier.classify(p['name']) == p['cat_category'] for p in sample)
    accuracy = correct / len(sample)
    print(f"  Classification accuracy: {accuracy:.2%} (on {len(sample)} products)")
    
    return trainer, train_result, eval_result, val_products



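A quick sanity check on the training configuration: with `per_device_train_batch_size=4` and `gradient_accumulation_steps=2`, each optimizer update sees 4 × 2 = 8 examples per device. The sketch below estimates the number of updates and where the `save_steps=100` checkpoints fall. `training_schedule` is a hypothetical helper, and the 1,308-example training size (800 + 266 + 242, the counts quoted later in this notebook) is an assumption for illustration; the Trainer's exact step count can differ by one because of how it handles a final partial accumulation.

```python
import math

def training_schedule(n_examples, per_device_batch=4, grad_accum=2,
                      n_devices=1, epochs=2, save_steps=100):
    """Approximate optimizer updates and periodic checkpoint steps."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)  # approximation
    total_steps = steps_per_epoch * epochs
    # Periodic checkpoints every save_steps optimizer updates
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return effective_batch, steps_per_epoch, total_steps, checkpoints

# Hypothetical training-set size: 800 + 266 + 242 examples
print(training_schedule(1308))
```

Under these assumptions the periodic checkpoints land at steps 100, 200 and 300, so the last periodic checkpoint is not exactly at the end of epoch 2 (recent `transformers` releases may also save one at the final step). That matters for the Stage 2 and Stage 3 cells, which resume from the latest checkpoint in `output_dir`.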
In [None]:
# Train the model
trainer, train_result, eval_result, val_products = train_supervised_model(
    classifier, cat_products, conversation_examples, explanation_examples, epochs=2
)

#### Why we validate on classification-only (Optional Note)

- __Align eval with the target task__: The end goal is to classify products. By building the validation set in `train_supervised_model()` using `CatProductDataset(val_products, [], [], tokenizer)`, the `eval_loss` (and your post-hoc accuracy) measure what you actually care about.

- __Simplicity and speed__: A leaner `val_dataset` lowers eval time and keeps model selection straightforward (select by `eval_loss` on the classification-style sequences).

- __Clear interpretability__: When you later print top-line accuracy using `classifier.classify()`, it measures the same task as the validation set, so there is no mismatch between how you evaluate during training and how you report afterwards. The trade-off: `eval_loss` says nothing about conversation quality.

In [None]:
# Stage 1: Initial Performance (2 epochs)
print("\n📊 Stage 1: After 2 epochs")
print("-"*40)
acc1, testres1, resp1 = evaluate_model_capabilities(classifier, val_products)
display_model_evaluation(acc1, testres1, resp1)

from copy import deepcopy
classifier_stage1 = deepcopy(classifier)

Compare these numbers with the model's behavior before fine-tuning. The model is learning to categorize products like a cat: this is supervised learning, where we provide labeled examples and the model learns the patterns. The conversation and explanation examples mixed into training are meant to help preserve conversational abilities while it learns classification.


In [None]:
# Stage 2: Extended Training (2 more epochs)
print("\n📊 Stage 2: Training 2 more epochs...")
print("-"*40)

# Raise the total epoch target to 4 (the 2 already done + 2 more)
trainer.args.num_train_epochs = 4  # This is the key line!

# resume_from_checkpoint=True reloads the latest checkpoint in output_dir
# (weights, optimizer, scheduler and step count), then trains on to epoch 4
trainer.train(resume_from_checkpoint=True)
trainer.save_model("./models/gemma-cat-lora-stage2")

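`resume_from_checkpoint=True` tells the Trainer to continue from the most recent `checkpoint-<step>` folder in `output_dir`, restoring optimizer and scheduler state along with the weights. The helper below is a minimal sketch for checking which folder that will be, mirroring the idea behind `transformers.trainer_utils.get_last_checkpoint`; `latest_checkpoint` is our own hypothetical name. It compares step numbers numerically, since a plain string sort would put `checkpoint-1000` before `checkpoint-300`.

```python
import re
from pathlib import Path

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = [p for p in root.iterdir()
                  if p.is_dir() and CHECKPOINT_RE.match(p.name)]
    if not candidates:
        return None
    # Compare numerically: "checkpoint-1000" must beat "checkpoint-300"
    return max(candidates, key=lambda p: int(CHECKPOINT_RE.match(p.name).group(1)))

print(latest_checkpoint("./models/gemma-cat-lora"))
```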

In [None]:
acc2, testres2, resp2 = evaluate_model_capabilities(classifier, val_products)
display_model_evaluation(acc2, testres2, resp2)

classifier_stage2 = deepcopy(classifier)

In [None]:
# Stage 3: Extended Training (2 more epochs)
print("\n📊 Stage 3: Training 2 more epochs...")
print("-"*40)

# Raise the total epoch target to 6 (the 4 done so far + 2 more)
trainer.args.num_train_epochs = 6  # This is the key line!

# Resume from the latest checkpoint in output_dir and train on to epoch 6
trainer.train(resume_from_checkpoint=True)
trainer.save_model("./models/gemma-cat-lora-stage3")

acc3, testres3, resp3 = evaluate_model_capabilities(classifier, val_products)
display_model_evaluation(acc3, testres3, resp3)

## 🤔 Let's Analyze Our Results

### What do you notice about the progression?

#### 💭 Discussion Questions:
1. Why might the model be getting **better** at classification but **worse** at conversation?
2. What happened to the model's original language abilities?
3. Is more training always better?

## Catastrophic Forgetting: A Core ML Challenge

### What's Happening?

We're witnessing **catastrophic forgetting** - when a neural network forgets previously learned information while learning new tasks.

### Why Does This Happen?

1. **Imbalanced Training Data**
   - 800 classification examples
   - Only 266 conversation examples and 242 explanation examples
   - Classification dominates the learning signal!

2. **Single-Task Validation**
   - Our validation set only tests classification
   - Model optimizes for what we measure
   - Conversation quality isn't being tracked

3. **Model Capacity Limits**
   - Gemma-270M is tiny (only 270 million parameters), and LoRA trains only a small fraction of them
   - The model may have to trade classification skill against conversation skill
   - There may not be enough "brain space" to be good at both

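4. **Training Dynamics**
   - Each extra epoch pushes the weights further toward the classification objective, overwriting earlier learning
   - A high learning rate (like the original 3e-4) causes aggressive updates; the current 1e-5 is gentler, but drift can still accumulate over epochs
   - No mechanism to preserve original capabilities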
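To put numbers on point 1: with the 800:266:242 mix quoted in this section, classification makes up roughly 61% of training examples. The sketch below computes each type's share and the repeat factor that would bring the other two types up to about the size of the largest. `mix_report` is a hypothetical helper, and it counts examples, not tokens, so the share of the actual loss could differ if the example types have different lengths.

```python
def mix_report(counts):
    """Share of each data type and the repeat factor for roughly equal counts."""
    total = sum(counts.values())
    largest = max(counts.values())
    return {name: (n / total, round(largest / n)) for name, n in counts.items()}

# Example counts quoted in this section
counts = {"classification": 800, "conversation": 266, "explanation": 242}
for name, (share, factor) in mix_report(counts).items():
    print(f"{name:15s} {share:6.1%} of the mix, oversample x{factor}")
```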

### Strategies to Experiment With:

#### 1. **Data Balance** (Easiest)
```python
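# Instead of 800:266:242, oversample conversation and explanation data ~3x
# (800:798:726, close to a 1:1:1 ratio)
train_dataset = CatProductDataset(
    train_products,
    conversation_examples * 3,  # repeat each conversation example 3 times
    explanation_examples * 3,   # repeat each explanation example 3 times
    classifier.tokenizer,
)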
```

#### 2. **Progressive Learning Rates** (Recommended)
```python
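# Start high, go lower: one learning rate per training stage
stage_learning_rates = {
    1: 3e-4,  # Stage 1: learn the new task
    2: 1e-4,  # Stage 2: refine
    3: 5e-5,  # Stage 3: polish (preserve knowledge)
}
# Caveat: resume_from_checkpoint=True restores the saved optimizer and
# scheduler state, so a new rate is most reliably applied by building a
# fresh TrainingArguments/Trainer for each stage.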
```

#### 3. **Mixed Validation** 
```python
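# Include ALL capabilities in validation.
# val_conversations / val_explanations are held-out splits you would create
# (they are not defined earlier in this notebook)
val_dataset = CatProductDataset(
    val_products, val_conversations, val_explanations, classifier.tokenizer
)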
```
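Mixed validation needs held-out conversation and explanation examples that the model never trains on. A minimal sketch of one way to carve them out: `split_holdout` is a hypothetical helper and the 10% fraction is an arbitrary choice. Do the split before building the training dataset so no example lands in both.

```python
import random

def split_holdout(examples, val_fraction=0.1, seed=42):
    """Shuffle a copy of `examples` and split it into (train, val)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Usage in the notebook (names from the training cells):
# train_conversations, val_conversations = split_holdout(conversation_examples)
# train_explanations, val_explanations = split_holdout(explanation_examples)
train, val = split_holdout(range(266))
print(len(train), len(val))
```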

#### 4. **Architectural Solutions**
- **LoRA Rank**: Try `r=16` or `r=32` (more capacity)
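Why does a higher rank mean more capacity? LoRA freezes a weight matrix W of shape d_out × d_in and learns a low-rank update B·A, where A is r × d_in and B is d_out × r, so each adapted matrix gains r·(d_in + d_out) trainable parameters. The sketch counts these for a made-up layer with four 640 × 640 projections; the shapes are illustrative, not Gemma's actual dimensions, and `lora_params` is a hypothetical helper.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one frozen d_out x d_in weight."""
    return r * d_in + d_out * r  # A: (r, d_in) plus B: (d_out, r)

# Hypothetical layer: four 640 x 640 projections (illustrative shapes only)
shapes = [(640, 640)] * 4
for r in (8, 16, 32):
    total = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes)
    print(f"r={r:>2}: {total:,} trainable parameters per layer")
```

The count grows linearly with r, so doubling the rank doubles the adapter's capacity while the base weights stay frozen.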


### Challenge Questions:
1. Can you achieve 60% accuracy WITHOUT losing conversation ability?
2. What's the minimum model size needed for both tasks?
3. How would you design a curriculum to teach both skills?

## Part 6: Unsupervised Learning - Discovering Natural Structure

Now let's explore **Unsupervised Learning**. We'll see how products naturally cluster without using any labels, and how fine-tuning changes this structure.


In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

def true_unsupervised_analysis(embeddings, name="Dataset", k_min=2, k_max=14):
    """
    Perform truly unsupervised clustering analysis.
    Find optimal number of clusters without using ground truth labels.
    """
    print(f"\n🔍 Analyzing {name} (Truly Unsupervised)")
    print("-" * 40)
    
    # Try different numbers of clusters
    silhouette_scores = []
    davies_bouldin_scores = []
    inertias = []
    k_range = range(k_min, k_max + 1)
    
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        clusters = kmeans.fit_predict(embeddings)
        
        # Silhouette score (higher is better)
        sil_score = silhouette_score(embeddings, clusters)
        silhouette_scores.append(sil_score)
        
        # Davies-Bouldin score (lower is better)
        db_score = davies_bouldin_score(embeddings, clusters)
        davies_bouldin_scores.append(db_score)
        
        # Inertia (for elbow method)
        inertias.append(kmeans.inertia_)
    
    # Find optimal k using silhouette score
    optimal_k = silhouette_scores.index(max(silhouette_scores)) + k_min
    
    print(f"  Optimal clusters (by Silhouette): {optimal_k}")
    print(f"  Best Silhouette score: {max(silhouette_scores):.3f}")
    
    return optimal_k, silhouette_scores, davies_bouldin_scores, inertias

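`true_unsupervised_analysis` picks k by the silhouette score. For each point, a is its mean distance to the rest of its own cluster and b is its mean distance to the nearest other cluster; the point scores (b - a) / max(a, b), and the overall score is the mean over points. The sketch below is a from-scratch NumPy version for intuition (`silhouette_from_scratch` is our own name); on Euclidean data it should agree with `sklearn.metrics.silhouette_score`, which the notebook actually uses.

```python
import numpy as np

def silhouette_from_scratch(X, labels):
    """Mean silhouette score (Euclidean). Needs 2 <= number of clusters <= n - 1."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():  # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = dists[i, same].mean()  # cohesion: mean distance within own cluster
        b = min(dists[i, labels == other].mean()  # separation: nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
print(silhouette_from_scratch(points, [0, 0, 1, 1]))  # two tight, far-apart groups
print(silhouette_from_scratch(points, [0, 1, 0, 1]))  # groups that cut across the data
```

Well-separated groups score about 0.9 here, while a labeling that cuts across the natural groups scores below zero, which is why the k with the highest score is treated as the most natural grouping.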
In [None]:
def compare_embeddings_unsupervised(base_classifier, trained_classifier, products, k_vis=None):
    """
    Compare clustering behavior before and after fine-tuning.
    If k_vis is provided, use that k for the visualization clustering
    instead of silhouette-optimal k.
    """
    print("\n🔮 UNSUPERVISED LEARNING: Natural Clustering Analysis")
    print("="*60)
    
    # Select subset of products
    sample_products = products[:200]
    texts = [p['name'] for p in sample_products]
    
    # Get embeddings from both models
    print("📍 Extracting embeddings from base model...")
    base_embeddings = base_classifier.get_embeddings(texts)
    
    print("📍 Extracting embeddings from fine-tuned model...")
    trained_embeddings = trained_classifier.get_embeddings(texts)
    
    # Analyze both without using labels
    optimal_k_base, sil_base, db_base, inertia_base = true_unsupervised_analysis(
        base_embeddings, "Base Model"
    )
    optimal_k_trained, sil_trained, db_trained, inertia_trained = true_unsupervised_analysis(
        trained_embeddings, "Fine-tuned Model"
    )
    
    # Visualize the metrics
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    k_range = range(2, 2 + len(sil_base))  # matches k_min=2 and the number of scores returned
    
    # Silhouette scores
    axes[0, 0].plot(k_range, sil_base, 'b-o', label='Base Model', linewidth=2)
    axes[0, 0].plot(k_range, sil_trained, 'r-s', label='Fine-tuned', linewidth=2)
    axes[0, 0].set_xlabel('Number of Clusters')
    axes[0, 0].set_ylabel('Silhouette Score')
    axes[0, 0].set_title('Silhouette Analysis (Higher is Better)')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].axvline(x=optimal_k_base, color='b', linestyle='--', alpha=0.5)
    axes[0, 0].axvline(x=optimal_k_trained, color='r', linestyle='--', alpha=0.5)
    
    # Davies-Bouldin scores
    axes[0, 1].plot(k_range, db_base, 'b-o', label='Base Model', linewidth=2)
    axes[0, 1].plot(k_range, db_trained, 'r-s', label='Fine-tuned', linewidth=2)
    axes[0, 1].set_xlabel('Number of Clusters')
    axes[0, 1].set_ylabel('Davies-Bouldin Score')
    axes[0, 1].set_title('Davies-Bouldin Analysis (Lower is Better)')
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Elbow method
    axes[0, 2].plot(k_range, inertia_base, 'b-o', label='Base Model', linewidth=2)
    axes[0, 2].plot(k_range, inertia_trained, 'r-s', label='Fine-tuned', linewidth=2)
    axes[0, 2].set_xlabel('Number of Clusters')
    axes[0, 2].set_ylabel('Inertia')
    axes[0, 2].set_title('Elbow Method')
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)
    
    # Choose k to cluster for visualization
    k_vis_base = k_vis if k_vis is not None else optimal_k_base
    k_vis_trained = k_vis if k_vis is not None else optimal_k_trained
    
    # Cluster with chosen k and visualize
    kmeans_base = KMeans(n_clusters=k_vis_base, random_state=42, n_init=10)
    kmeans_trained = KMeans(n_clusters=k_vis_trained, random_state=42, n_init=10)
    
    clusters_base = kmeans_base.fit_predict(base_embeddings)
    clusters_trained = kmeans_trained.fit_predict(trained_embeddings)
    
    # PCA for visualization
    pca = PCA(n_components=2, random_state=42)
    vis_base = pca.fit_transform(base_embeddings)
    vis_trained = pca.fit_transform(trained_embeddings)
    
    # Plot base model clusters
    scatter1 = axes[1, 0].scatter(vis_base[:, 0], vis_base[:, 1], 
                                  c=clusters_base, cmap='viridis', alpha=0.6)
    axes[1, 0].set_title(f'Base Model\n({k_vis_base} clusters)', fontweight='bold')
    axes[1, 0].set_xlabel('PCA Component 1')
    axes[1, 0].set_ylabel('PCA Component 2')
    plt.colorbar(scatter1, ax=axes[1, 0])
    
    # Plot trained model clusters
    scatter2 = axes[1, 1].scatter(vis_trained[:, 0], vis_trained[:, 1],
                                  c=clusters_trained, cmap='viridis', alpha=0.6)
    axes[1, 1].set_title(f'Fine-tuned Model\n({k_vis_trained} clusters)', fontweight='bold')
    axes[1, 1].set_xlabel('PCA Component 1')
    axes[1, 1].set_ylabel('PCA Component 2')
    plt.colorbar(scatter2, ax=axes[1, 1])
    
    # Compare cluster characteristics
    axes[1, 2].axis('off')
    comparison_text = f"""
    Clustering Comparison:
    
    Base Model:
    - Optimal clusters: {optimal_k_base}
    - Best Silhouette: {max(sil_base):.3f}
    - Natural grouping based on
      general language patterns
    
    Fine-tuned Model:
    - Optimal clusters: {optimal_k_trained}
    - Best Silhouette: {max(sil_trained):.3f}
    - Grouping influenced by
      cat-perspective training
    
    Key Insight:
    Fine-tuning reorganizes the
    embedding space to reflect
    the task-specific structure!
    """
    axes[1, 2].text(0.1, 0.5, comparison_text, fontsize=10, verticalalignment='center')
    
    plt.suptitle('Unsupervised Learning: How Fine-tuning Changes Natural Clustering', 
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Analyze what's in each cluster (without using labels)
    print("\n📊 Cluster Sample Analysis (Fine-tuned Model):")
    for i in range(min(5, k_vis_trained)):  # Show first 5 clusters
        cluster_indices = np.where(clusters_trained == i)[0][:3]  # First 3 items
        print(f"\n  Cluster {i} samples:")
        for idx in cluster_indices:
            print(f"    - {texts[idx][:50]}...")
    
    return optimal_k_base, optimal_k_trained

In [None]:
def analyze_cluster_purity(embeddings, clusters, true_labels, label_names, model_name="Model", k=7):
    """Analyze how pure clusters are with respect to true labels"""
    from collections import Counter
    
    print(f"\n📊 {model_name} Cluster Analysis (k={k})")
    print("-" * 50)
    
    cluster_purities = []
    
    for cluster_id in range(k):
        # Get items in this cluster
        cluster_mask = clusters == cluster_id
        cluster_labels = [true_labels[i] for i, m in enumerate(cluster_mask) if m]
        
        if not cluster_labels:
            continue
            
        # Find most common label and its percentage
        label_counts = Counter(cluster_labels)
        most_common_label, count = label_counts.most_common(1)[0]
        purity = count / len(cluster_labels)
        cluster_purities.append(purity)
        
        # Show cluster composition (top 3 categories)
        print(f"\nCluster {cluster_id} ({len(cluster_labels)} items, {purity:.1%} pure):")
        for label, cnt in label_counts.most_common(3):
            pct = cnt / len(cluster_labels) * 100
            bar = '█' * int(pct / 10) + '░' * (10 - int(pct / 10))
            name = label_names[label] if isinstance(label_names, dict) else str(label)
            print(f"  {name:12s} [{bar}] {pct:.0f}%")
    
    avg_purity = sum(cluster_purities) / len(cluster_purities) if cluster_purities else 0.0
    print(f"\n➤ Average Purity: {avg_purity:.1%}")
    return avg_purity

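`analyze_cluster_purity` averages per-cluster purity, giving every cluster equal weight. The more common definition of purity weights each cluster by its size (total majority-label count divided by the number of items), so a tiny, messy cluster counts for less. The sketch below computes both; `purity_scores` is a hypothetical helper and the toy category names are made up.

```python
from collections import Counter

def purity_scores(clusters, labels):
    """Return (mean per-cluster purity, size-weighted purity)."""
    members = {}
    for c, y in zip(clusters, labels):
        members.setdefault(c, []).append(y)
    per_cluster, majority_total = [], 0
    for items in members.values():
        top_count = Counter(items).most_common(1)[0][1]  # size of the majority label
        per_cluster.append(top_count / len(items))
        majority_total += top_count
    return sum(per_cluster) / len(per_cluster), majority_total / len(labels)

# Toy example: one big mostly-pure cluster, one small mixed cluster
toy_clusters = [0] * 8 + [1] * 2
toy_labels = ["play"] * 6 + ["sleep"] * 2 + ["hunt", "scary"]
print(purity_scores(toy_clusters, toy_labels))
```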
In [None]:
def compare_clustering_perspectives(base_embeddings, trained_embeddings, products, k=4):
    """Show how base model clusters by human logic vs fine-tuned by cat logic"""
    from sklearn.cluster import KMeans
    import numpy as np
    
    print("\n🔍 CLUSTERING PERSPECTIVE COMPARISON")
    print("="*60)
    
    # True labels
    true_cat_labels = [p['cat_category_id'] for p in products]
    
    # Correct ID -> name mapping
    # Prefer ID_TO_CATEGORY if defined; otherwise derive from CATEGORY_TO_ID
    try:
        label_names = ID_TO_CATEGORY
    except NameError:
        # Fallback if ID_TO_CATEGORY not defined
        label_names = {v: k for k, v in CATEGORY_TO_ID.items()}
    
    # Cluster both with k
    kmeans_base = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_trained = KMeans(n_clusters=k, random_state=42, n_init=10)
    
    clusters_base = kmeans_base.fit_predict(base_embeddings)
    clusters_trained = kmeans_trained.fit_predict(trained_embeddings)
    
    # Analyze purity
    purity_base = analyze_cluster_purity(
        base_embeddings, clusters_base, true_cat_labels, 
        label_names, "Base Model", k=k
    )
    
    purity_trained = analyze_cluster_purity(
        trained_embeddings, clusters_trained, true_cat_labels,
        label_names, "Fine-tuned Model", k=k
    )
    
    # Show specific examples to illustrate the difference
    print("\n🎯 EXAMPLE: What's in Cluster 0?")
    print("-" * 50)
    
    # Get sample products from cluster 0 for both models
    base_cluster0_idx = np.where(clusters_base == 0)[0][:5]
    trained_cluster0_idx = np.where(clusters_trained == 0)[0][:5]
    
    print("Base Model Cluster 0:")
    for idx in base_cluster0_idx:
        print(f"  • {products[idx]['name'][:40]:40s} [{products[idx]['cat_category']}]")
    
    print("\nFine-tuned Cluster 0:")
    for idx in trained_cluster0_idx:
        print(f"  • {products[idx]['name'][:40]:40s} [{products[idx]['cat_category']}]")
    
    
    return purity_base, purity_trained

In [None]:
def visualize_cluster_composition(base_embeddings, trained_embeddings, products, k=4):
    """Create a heatmap showing how categories distribute across clusters"""
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    import numpy as np
    
    # Cluster with k
    kmeans_base = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_trained = KMeans(n_clusters=k, random_state=42, n_init=10)
    
    clusters_base = kmeans_base.fit_predict(base_embeddings)
    clusters_trained = kmeans_trained.fit_predict(trained_embeddings)
    
    # ID -> name mapping: prefer ID_TO_CATEGORY, otherwise invert CATEGORY_TO_ID
    try:
        label_names = ID_TO_CATEGORY
    except NameError:
        label_names = {cat_id: name for name, cat_id in CATEGORY_TO_ID.items()}
    
    # The number of cat categories can differ from the number of clusters k
    # (assumes category IDs run from 0 to n_categories - 1)
    n_categories = len(label_names)
    
    # Create composition matrices: k clusters (rows) x n_categories (columns)
    def get_composition_matrix(clusters, true_labels, k, n_categories):
        matrix = np.zeros((k, n_categories), dtype=float)
        for cluster_id, cat_id in zip(clusters, true_labels):
            if 0 <= cat_id < n_categories:
                matrix[cluster_id, cat_id] += 1.0
        # Normalize by row (each cluster sums to 1)
        row_sums = matrix.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0  # avoid divide-by-zero
        return matrix / row_sums
    
    true_labels = [p['cat_category_id'] for p in products]
    base_matrix = get_composition_matrix(clusters_base, true_labels, k, n_categories)
    trained_matrix = get_composition_matrix(clusters_trained, true_labels, k, n_categories)
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    # Base model heatmap
    im1 = axes[0].imshow(base_matrix, cmap='YlOrRd', aspect='auto', vmin=0, vmax=0.7)
    axes[0].set_title('Base Model\n(Products cluster by type)', fontweight='bold')
    axes[0].set_xlabel('Cat Categories')
    axes[0].set_ylabel('Cluster ID')
    axes[0].set_xticks(range(n_categories))
    axes[0].set_xticklabels([label_names[i] for i in range(n_categories)], rotation=45, ha='right')
    axes[0].set_yticks(range(k))
    
    # Fine-tuned model heatmap
    im2 = axes[1].imshow(trained_matrix, cmap='YlOrRd', aspect='auto', vmin=0, vmax=0.7)
    axes[1].set_title('Fine-tuned Model\n(Products cluster by cat behavior)', fontweight='bold')
    axes[1].set_xlabel('Cat Categories')
    axes[1].set_ylabel('Cluster ID')
    axes[1].set_xticks(range(n_categories))
    axes[1].set_xticklabels([label_names[i] for i in range(n_categories)], rotation=45, ha='right')
    axes[1].set_yticks(range(k))
    
    # Add colorbars
    plt.colorbar(im1, ax=axes[0], label='Proportion')
    plt.colorbar(im2, ax=axes[1], label='Proportion')
    
    plt.suptitle('Cluster Composition: How Categories Distribute Across Clusters', 
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
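    print("\n📖 How to read: Darker = more concentrated")
    print("   One dark cell per row = each cluster is dominated by a single category")
    print("   (cluster IDs are arbitrary, so this need not look like a diagonal)")
    print("   Spread-out rows = clusters mix several categories")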

In [None]:
# Create a fresh base model for comparison
print("🔧 Loading base model for comparison...")
base_classifier = GemmaCatClassifier(use_lora=False)

# Get sample of products for analysis
sample_products = cat_products[:200]

# Extract embeddings once
print("\n📍 Extracting embeddings...")
texts = [p['name'] for p in sample_products]
base_embeddings = base_classifier.get_embeddings(texts)

# Use the Stage 1 (2-epoch) classifier; swap in classifier_stage2 to compare
trained_embeddings = classifier_stage1.get_embeddings(texts)

K = 4

# Unsupervised comparison (metrics use optimal k, visualization forced to K)
optimal_k_base, optimal_k_trained = compare_embeddings_unsupervised(
    base_classifier, classifier_stage1, sample_products, k_vis=K
)

# Purity and composition using the same Stage 1 embeddings
purity_base, purity_trained = compare_clustering_perspectives(
    base_embeddings, trained_embeddings, sample_products, k=K
)

visualize_cluster_composition(
    base_embeddings, trained_embeddings, sample_products, k=K
)

## What Did We Learn?

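### The Power of Fine-tuning on Representation Space
**Key Insight**:
   - Supervised fine-tuning doesn't just bolt a classifier onto fixed features
   - It **reshapes the representation space** itself
   - Products that cats treat similarly tend to move closer together in embedding space, which the purity scores and heatmaps let you check
   - This is what makes transfer learning effective: pretrained representations are a strong starting point that a modest amount of labeled data can adapt to a new task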

### Why This Matters
- **For ML Practice**: Shows that fine-tuning changes deep representations
- **For Applications**: Embeddings from fine-tuned models are task-specific
- **For Understanding**: Neural networks learn structured representations, not just decision boundaries
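The claim that products cats treat similarly move closer together can be checked directly with labels: compare the average cosine similarity between products of the same cat category with the average across different categories. The sketch below is a hypothetical helper (`category_separation`, not part of the notebook); if fine-tuning reshaped the space along cat categories, the within-minus-across gap should be larger for the fine-tuned embeddings than for the base ones.

```python
import numpy as np

def category_separation(embeddings, labels):
    """Mean cosine similarity (within same category, across categories)."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(E), dtype=bool)  # drop each item's similarity to itself
    within = sims[same & not_self].mean()
    across = sims[~same].mean()
    return float(within), float(across)

# Toy check: two tight groups pointing in different directions
toy = [[1, 0], [1, 0.1], [0, 1], [0.1, 1]]
print(category_separation(toy, [0, 0, 1, 1]))

# With the notebook's variables:
# ids = [p['cat_category_id'] for p in sample_products]
# print("base:      ", category_separation(base_embeddings, ids))
# print("fine-tuned:", category_separation(trained_embeddings, ids))
```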

## Part 7: Active Learning - Smart Labeling with Gemma

### 7.1 The Big Question

### What if the model could tell us what to label next?

**The Problem:**
- Labeling data is expensive 💰
- Most examples are "easy" and redundant
- We waste time labeling obvious cases

**The Solution: Active Learning**
- Let the model identify what confuses it
- Focus human effort on the hard cases
- Achieve high accuracy with minimal labels



### 7.2 How Active Learning Works

### The Core Concept


**Traditional Approach:**
- Label random data
- Train model
- Hope for the best

**Active Learning Approach:**
- Start with tiny labeled set
- Model identifies confusing examples
- Human labels only those
- Repeat until accurate


**Key Insight:** The model's own uncertainty is a useful, if imperfect, signal of what it doesn't know!
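The active learning loop above can be sketched as pool-based uncertainty sampling. This is a minimal, framework-agnostic sketch, not the code that produced the pre-computed results: `fit`, `predict_proba` and `oracle` (the human labeler) are hypothetical callables you would supply, and each round the model's highest-entropy unlabeled items are sent for labeling.

```python
import numpy as np

def entropy_rows(probs):
    """Predictive entropy per row (nats); higher means more uncertain."""
    return -np.sum(probs * np.log(probs + 1e-10), axis=1)

def active_learning_loop(X_pool, oracle, fit, predict_proba,
                         n_initial=5, batch_size=5, rounds=3, seed=0):
    """Pool-based active learning with entropy (uncertainty) sampling.

    Hypothetical callables you supply:
      fit(X, y) -> model
      predict_proba(model, X) -> array of shape (len(X), n_classes)
      oracle(i) -> true label of pool item i (the human labeler)
    """
    rng = np.random.default_rng(seed)
    # 1. Start with a tiny random labeled set
    labeled = [int(i) for i in rng.choice(len(X_pool), size=n_initial, replace=False)]
    y = {i: oracle(i) for i in labeled}
    model = fit(X_pool[labeled], np.array([y[i] for i in labeled]))
    for _ in range(rounds):
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        if unlabeled.size == 0:
            break
        # 2. The model scores how confused it is on every unlabeled item
        scores = entropy_rows(predict_proba(model, X_pool[unlabeled]))
        picked = unlabeled[np.argsort(-scores)[:batch_size]]
        # 3. The human labels only the most confusing items
        for i in picked:
            y[int(i)] = oracle(int(i))
            labeled.append(int(i))
        # 4. Retrain and repeat
        model = fit(X_pool[labeled], np.array([y[i] for i in labeled]))
    return model, labeled
```

In practice `fit` would be a fine-tuning run and `predict_proba` something like `classifier.classify(..., return_probs=True)` applied to each product; with an expensive model you would retrain less often or label larger batches per round.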


### 7.3 Setting Up the Demo

First, let's check that the pre-computed active learning checkpoints are available.


In [None]:
import json
import numpy as np
import matplotlib.pyplot as plt
import os
from pathlib import Path
import torch
from IPython.display import display, HTML
import ipywidgets as widgets
from ipywidgets import interact

# Check if we have pre-computed results
CHECKPOINT_DIR = Path('./models/active_learning_checkpoints')
ASSETS_DIR = Path('./models/lecture_assets')

if CHECKPOINT_DIR.exists():
    print("✅ Pre-computed checkpoints found!")
    print(f"   Found {len(list(CHECKPOINT_DIR.glob('checkpoint_*')))} checkpoints")
else:
    print("⚠️ No checkpoints found. Please run: python prepare_lecture_demo.py")

### 7.4 Live Uncertainty Demo

### See How the Model Identifies Confusion

Let's start with a live demonstration of how the model calculates uncertainty:


In [None]:
def live_uncertainty_demo(classifier, products=None):
    """
    Interactive demo showing how the model identifies confusing products
    Perfect for explaining the concept!
    """
    print("🔍 LIVE DEMO: How Active Learning Identifies Confusion")
    print("=" * 60)
    
    # Test products that showcase different uncertainty levels
    demo_products = [
        ("laptop computer", "Ambiguous: warm surface or electronics?"),
        ("cardboard box", "Classic cat item, but play or sleep?"),
        ("laser pointer", "Clearly a toy... or is it?"),
        ("vacuum cleaner", "Definitely scary!"),
        ("cat food bowl", "Obviously for eating"),
        ("bluetooth speaker", "Toy-like but not really"),
        ("electric blanket", "Warm but also dangerous?"),
        ("paper bag", "Territory or play?"),
    ]
    
    uncertainties = []
    
    print("\nCalculating uncertainty for each product:\n")
    
    for product, description in demo_products:
        # Get prediction and probabilities
        pred, probs = classifier.classify(product, return_probs=True)
        
        # Calculate entropy (uncertainty)
        entropy = -sum(p * np.log(p + 1e-10) for p in probs.values() if p > 0)
        
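        # Visual uncertainty meter, scaled by the maximum possible entropy ln(k)
        # (unscaled, entropy * 10 overflows the 10-character bar when entropy > 1)
        max_entropy = np.log(len(probs)) if len(probs) > 1 else 1.0
        bar_length = max(0, min(10, int(10 * entropy / max_entropy)))
        uncertainty_bar = '█' * bar_length + '░' * (10 - bar_length)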
        
        # Confidence of top prediction
        confidence = probs[pred]
        
        # Get top 2 categories for confusion analysis
        sorted_probs = sorted(probs.items(), key=lambda x: x[1], reverse=True)[:2]
        
        print(f"📦 {product:20s}")
        print(f"   {description}")
        print(f"   Prediction: {pred:12s} (confidence: {confidence:.1%})")
        print(f"   Uncertainty: [{uncertainty_bar}] {entropy:.3f}")
        
        if entropy > 1.0:  # High uncertainty
            print(f"   ⚠️ CONFUSED between {sorted_probs[0][0]} ({sorted_probs[0][1]:.1%}) "
                  f"and {sorted_probs[1][0]} ({sorted_probs[1][1]:.1%})")
            print(f"   ✅ PERFECT for active learning!\n")
        else:
            print(f"   ✓ Pretty confident, lower priority\n")
        
        uncertainties.append((product, entropy, pred, confidence))
    
    # Sort by uncertainty
    uncertainties.sort(key=lambda x: x[1], reverse=True)
    
    print("🎯 ACTIVE LEARNING WOULD SELECT:")
    print(f"   → '{uncertainties[0][0]}' (uncertainty: {uncertainties[0][1]:.3f})")
    print("\n💡 This is the product that would teach the model the most!")
    
    return uncertainties

# Run the demo with our trained classifier
uncertainty_results = live_uncertainty_demo(classifier)

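The demo scores uncertainty with Shannon entropy, H(p) = -Σ p_i ln p_i. Its maximum is ln(k), reached by a uniform distribution over k categories, so the fixed `entropy > 1.0` threshold means different things for different k. The sketch below works through three made-up distributions over a hypothetical 7 categories (ln 7 ≈ 1.95).

```python
import math

def entropy(probs):
    """Shannon entropy in nats; ignores zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

n_cats = 7  # hypothetical number of cat categories
examples = {
    "confident": [0.94] + [0.01] * 6,
    "torn between two": [0.45, 0.45] + [0.02] * 5,
    "uniform": [1 / n_cats] * n_cats,
}
for name, probs in examples.items():
    h = entropy(probs)
    print(f"{name:17s} H = {h:.3f}  normalized = {h / math.log(n_cats):.2f}")
```

A model torn between two categories already crosses the demo's 1.0 threshold, while the ceiling is ln(7); with a different number of categories the same threshold would sit at a different relative level, which is one reason to normalize by ln(k).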

### 7.5 The Power of Smart Selection

### Watch How Active Learning Outperforms Random Sampling

Now let's load the pre-computed results and compare the two strategies:

In [None]:
from pathlib import Path
import json
import numpy as np

# Load results
CHECKPOINT_DIR = Path("models/active_learning_checkpoints")
RESULTS_PATH = CHECKPOINT_DIR / "results.json"
with open(RESULTS_PATH, "r") as f:
    results = json.load(f)

print("📊 ACTIVE LEARNING VS RANDOM SAMPLING")
print("=" * 60)

active = results["active_learning"]
random_ = results["random_sampling"]

initial = results.get("initial_samples", 7)
step = 5

def x_axis(accs): 
    return [initial + i*step for i in range(len(accs))]

a_acc = active.get("accuracies", [])
r_acc = random_.get("accuracies", [])
xa, xr = x_axis(a_acc), x_axis(r_acc)

# 1) Equal-budget comparison (use last common label count)
common_budget = min(xa[-1] if xa else 0, xr[-1] if xr else 0)
def value_at_budget(xs, ys, budget):
    if not xs: return 0.0
    # nearest index (xs are monotonic)
    idx = min(range(len(xs)), key=lambda i: abs(xs[i]-budget))
    return ys[idx]

a_eq = value_at_budget(xa, a_acc, common_budget)
r_eq = value_at_budget(xr, r_acc, common_budget)
delta_pp = (a_eq - r_eq) * 100

print(f"\n🪙 Equal-budget comparison at {common_budget} labels:")
print(f"   Active Learning: {a_eq:.1%}")
print(f"   Random Sampling: {r_eq:.1%}")
print(f"   Advantage: {delta_pp:.1f} percentage points")

# 2) Label-efficiency: AUC of accuracy vs labels up to common budget
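def auc(xs, ys, limit):
    """Trapezoidal area under accuracy vs. labels, using points up to `limit`."""
    pts = [(x, y) for x, y in zip(xs, ys) if x <= limit]
    if len(pts) < 2: return 0.0
    x_t, y_t = zip(*pts)
    # np.trapz was renamed np.trapezoid in NumPy 2.0
    trapezoid = np.trapezoid if hasattr(np, "trapezoid") else np.trapz
    return float(trapezoid(y_t, x_t))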

auc_a = auc(xa, a_acc, common_budget)
auc_r = auc(xr, r_acc, common_budget)
auc_gain = (auc_a - auc_r) / max(auc_r, 1e-8) * 100

print(f"\n📐 Label-efficiency (AUC up to {common_budget} labels):")
print(f"   Active: {auc_a:.3f} | Random: {auc_r:.3f} | Gain: {auc_gain:.1f}%")

# 3) Early gain (after first retrain milestone, e.g., +5 labels)
milestone = initial + 5
a_m = value_at_budget(xa, a_acc, milestone)
r_m = value_at_budget(xr, r_acc, milestone)
print(f"\n🚀 Early gain at {milestone} labels:")
print(f"   Active: {a_m:.1%} vs Random: {r_m:.1%} | Δ={((a_m-r_m)*100):.1f} pp")

# 4) Confidence/coverage deltas (if available)
a_conf = active.get("avg_confidences", [])
r_conf = random_.get("avg_confidences", [])
if a_conf and r_conf:
    a_conf_eq = value_at_budget(x_axis(a_conf), a_conf, common_budget)
    r_conf_eq = value_at_budget(x_axis(r_conf), r_conf, common_budget)
    print(f"\n🔒 Confidence at {common_budget} labels:")
    print(f"   Active: {a_conf_eq:.1%} vs Random: {r_conf_eq:.1%}")

a_cov = active.get("category_coverage", [])
r_cov = random_.get("category_coverage", [])
if a_cov and r_cov:
    a_cov_eq = value_at_budget(x_axis(a_cov), a_cov, common_budget)
    r_cov_eq = value_at_budget(x_axis(r_cov), r_cov, common_budget)
    print(f"\n🧭 Category coverage at {common_budget} labels:")
    print(f"   Active: {a_cov_eq:.0f} vs Random: {r_cov_eq:.0f} categories")

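The raw AUC values above scale with the width of the label range, which makes them hard to read on their own. Dividing the trapezoidal area by that width turns AUC into the average accuracy across all budgets up to the cutoff, on the same 0-1 scale as accuracy. A minimal sketch, using a hypothetical `normalized_auc` helper and made-up curves (not the notebook's results):

```python
def normalized_auc(xs, ys):
    """Trapezoidal area under accuracy-vs-labels, divided by the label range.

    The result is the average accuracy across the budgets covered.
    """
    if len(xs) < 2:
        return float(ys[0]) if ys else 0.0
    area = sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))
    return area / (xs[-1] - xs[0])

# Made-up curves for illustration only
budgets = [7, 12, 17, 22]
print(normalized_auc(budgets, [0.30, 0.40, 0.48, 0.52]))  # "active"
print(normalized_auc(budgets, [0.30, 0.34, 0.40, 0.45]))  # "random"
```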
### 7.6 Visualizing the Advantage

### See the Dramatic Difference in Learning Curves

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def plot_active_vs_random(results):
    active = results["active_learning"]
    random_ = results["random_sampling"]
    initial = results.get("initial_samples", 7)
    step = 5

    def x_axis(arr): 
        return [initial + i*step for i in range(len(arr))]

    a_acc = active.get("accuracies", [])
    r_acc = random_.get("accuracies", [])
    xa, xr = x_axis(a_acc), x_axis(r_acc)

    # Confidence x-axes
    a_conf = active.get("avg_confidences", [])
    r_conf = random_.get("avg_confidences", [])
    xa_conf, xr_conf = x_axis(a_conf), x_axis(r_conf)

    # Common budget and helpers
    common_budget = min(xa[-1] if xa else 0, xr[-1] if xr else 0)
    def val_at(xs, ys, x0):
        if not xs: return 0.0
        idx = min(range(len(xs)), key=lambda i: abs(xs[i]-x0))
        return ys[idx]

    a_eq = val_at(xa, a_acc, common_budget)
    r_eq = val_at(xr, r_acc, common_budget)

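    def auc(xs, ys, limit):
        if not xs: return 0.0
        pts = [(x, y) for x, y in zip(xs, ys) if x <= limit]
        if len(pts) < 2: return 0.0
        x_t, y_t = zip(*pts)
        # np.trapz was renamed np.trapezoid in NumPy 2.0
        trapezoid = np.trapezoid if hasattr(np, "trapezoid") else np.trapz
        return float(trapezoid(y_t, x_t))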

    auc_a = auc(xa, a_acc, common_budget)
    auc_r = auc(xr, r_acc, common_budget)

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))

    # 1) Accuracy curves + equal-budget marker
    ax1.plot(xa, [y*100 for y in a_acc], 'b-o', label='Active', linewidth=3, markersize=7)
    ax1.plot(xr, [y*100 for y in r_acc], 'r--s', label='Random', linewidth=2, markersize=6, alpha=0.8)
    if common_budget:
        ax1.axvline(common_budget, color='gray', linestyle=':', alpha=0.6)
        # The y-axis is in percent, so point the arrow at the active curve in %
        ax1.annotate(f'Equal budget: {common_budget}', xy=(common_budget, a_eq*100),
                     xytext=(common_budget+2, a_eq*100 + 3),
                     arrowprops=dict(arrowstyle='->', color='gray'), fontsize=10, color='gray')

        # Signed delta label at budget (shows "-" if random sampling is ahead)
        ax1.text(common_budget+1, (a_eq*100 + r_eq*100)/2, f'{(a_eq-r_eq)*100:+.1f} pp', color='blue', fontsize=10)

    ax1.set_xlabel('Number of Labeled Examples', fontsize=13, fontweight='bold')
    ax1.set_ylabel('Accuracy (%)', fontsize=13, fontweight='bold')
    ax1.set_title('Active vs Random at Same Label Budget', fontsize=15, fontweight='bold')
    ax1.legend(fontsize=12)
    ax1.grid(True, alpha=0.3)
    # Auto-scale to your observed range
    all_acc = [*(y*100 for y in a_acc), *(y*100 for y in r_acc)]
    if all_acc:
        lo = max(0, min(all_acc) - 5); hi = min(100, max(all_acc) + 5)
        ax1.set_ylim([lo, hi])

    # 2) Confidence evolution (if present)
    if a_conf or r_conf:
        if a_conf:
            ax2.plot(xa_conf, [c*100 for c in a_conf], 'b-o', label='Active', linewidth=2.5, markersize=6)
        if r_conf:
            ax2.plot(xr_conf, [c*100 for c in r_conf], 'r--s', label='Random', linewidth=2, markersize=5, alpha=0.8)
        ax2.set_xlabel('Number of Labeled Examples', fontsize=12)
        ax2.set_ylabel('Confidence (%)', fontsize=12)
        ax2.set_title('Confidence Over Labels', fontsize=14, fontweight='bold')
        ax2.legend(fontsize=11)
        ax2.grid(True, alpha=0.3)
    else:
        ax2.axis('off')
        ax2.set_title('Confidence Not Logged', fontsize=14)

    # 3) Label-efficiency via AUC up to equal budget
    x = np.arange(2)
    width = 0.6
    bars = ax3.bar(x, [auc_a, auc_r], color=['#2E86AB', '#F18F01'], alpha=0.85, width=width)
    ax3.set_xticks(x)
    ax3.set_xticklabels(['Active', 'Random'])
    ax3.set_ylabel('AUC (accuracy vs labels)', fontsize=12, fontweight='bold')
    ax3.set_title(f'Label Efficiency ≤ {common_budget} labels', fontsize=14, fontweight='bold')
    ax3.grid(True, alpha=0.3, axis='y')

    for bar in bars:
        ax3.text(bar.get_x() + bar.get_width()/2., bar.get_height() * 1.01,
                 f'{bar.get_height():.3f}', ha='center', va='bottom', fontsize=10)

    plt.suptitle('Active Learning: Better Accuracy for the Same Label Budget', fontsize=16, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

plot_active_vs_random(results)


### 7.7 Interactive Uncertainty Explorer

### Play with Different Products to See Uncertainty

In [None]:
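import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interact

# `classifier` is the trained classifier from the earlier sections
def create_interactive_explorer():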
    """Interactive widget to explore uncertainty on any product"""
    
    print("🎮 INTERACTIVE UNCERTAINTY EXPLORER")
    print("=" * 60)
    print("Try different products to see what confuses the model!\n")
    
    @interact(product_name=widgets.Text(
        value='laptop computer',
        placeholder='Enter a product name',
        description='Product:',
        style={'description_width': 'initial'}
    ))
    def explore_uncertainty(product_name):
        if not product_name:
            return
        
        # Get predictions
        pred, probs = classifier.classify(product_name, return_probs=True)
        
        # Calculate entropy
        entropy = -sum(p * np.log(p + 1e-10) for p in probs.values() if p > 0)
        
        # Create visualization
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
        
        # Probability distribution
        categories = list(probs.keys())
        values = list(probs.values())
        colors = ['#2E86AB' if v == max(values) else '#CCCCCC' for v in values]
        
        bars = ax1.barh(categories, values, color=colors)
        ax1.set_xlabel('Probability', fontsize=12)
        ax1.set_title(f'Model Predictions for "{product_name}"', fontsize=13, fontweight='bold')
        ax1.set_xlim(0, 1)
        
        # Add percentage labels
        for bar, val in zip(bars, values):
            ax1.text(val + 0.01, bar.get_y() + bar.get_height()/2, 
                    f'{val:.1%}', va='center', fontsize=10)
        
        # Uncertainty meter
        ax2.clear()
        
        # Create uncertainty gauge
        if entropy > 1.5:
            color = '#FF4444'
            label = "HIGH\nPRIORITY!"
            message = "Perfect for labeling!"
        elif entropy > 1.0:
            color = '#FFA500'
            label = "Medium\nPriority"
            message = "Good candidate"
        else:
            color = '#44AA44'
            label = "Low\nPriority"
            message = "Model is confident"
        
        # Draw gauge
        bar = ax2.bar(['Uncertainty'], [entropy], color=color, width=0.5)
        ax2.set_ylim(0, 2.5)
        ax2.set_ylabel('Entropy', fontsize=12)
        ax2.set_title('Should We Label This?', fontsize=13, fontweight='bold')
        
        # Add text
        ax2.text(0, entropy + 0.1, label, ha='center', fontsize=12, fontweight='bold')
        ax2.text(0, -0.2, message, ha='center', fontsize=11, style='italic')
        
        # Add entropy value
        ax2.text(0, entropy/2, f'{entropy:.3f}', ha='center', va='center', 
                fontsize=14, fontweight='bold', color='white')
        
        plt.suptitle(f'Prediction: {pred} (Confidence: {probs[pred]:.1%})', 
                    fontsize=14, fontweight='bold', y=1.05)
        plt.tight_layout()
        plt.show()
        
        # Print confusion analysis
        sorted_probs = sorted(probs.items(), key=lambda x: x[1], reverse=True)[:2]
        if entropy > 1.0:
            print(f"\n⚠️ Model is confused between:")
            print(f"   • {sorted_probs[0][0]}: {sorted_probs[0][1]:.1%}")
            print(f"   • {sorted_probs[1][0]}: {sorted_probs[1][1]:.1%}")
            print(f"\n✅ This would be a great example to label!")
        else:
            print(f"\n✓ Model is confident: {pred} ({probs[pred]:.1%})")
            print(f"  Lower priority for labeling")

# Create the interactive explorer
create_interactive_explorer()
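The explorer's thresholds (1.0 and 1.5) are in nats, because the cell computes entropy with `np.log`. Entropy over K categories can never exceed ln K, so how "high" 1.5 is depends on how many categories the classifier has. Here is a small sketch, assuming seven categories (the coverage column later in this notebook counts out of 7); the category names and probabilities are placeholders, not real model output.

```python
import math

def entropy_nats(probs):
    """Shannon entropy (natural log, in nats) of a {category: probability} dict."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def normalized_entropy(probs):
    """Entropy divided by its maximum, ln(K), so the result lies in [0, 1]."""
    k = len(probs)
    return entropy_nats(probs) / math.log(k) if k > 1 else 0.0

# Placeholder 7-category distributions (names are hypothetical)
uniform = {f"cat_{i}": 1 / 7 for i in range(7)}
confident = {"cat_0": 0.94, **{f"cat_{i}": 0.01 for i in range(1, 7)}}

print(f"max entropy for 7 categories: {math.log(7):.3f}")
print(f"uniform:   {entropy_nats(uniform):.3f} (normalized {normalized_entropy(uniform):.2f})")
print(f"confident: {entropy_nats(confident):.3f} (normalized {normalized_entropy(confident):.2f})")
```

With seven categories the ceiling is about 1.95 nats, so the "HIGH PRIORITY" threshold of 1.5 sits fairly close to a completely uniform guess. If you change the number of categories, thresholding on normalized entropy keeps the cut-offs comparable.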


### 7.8 Progressive Model Improvement

### Watch the Model Learn with Each Selection


In [None]:
from pathlib import Path
import json
import numpy as np
import matplotlib.pyplot as plt

CHECKPOINT_DIR = Path("models/active_learning_checkpoints")
RESULTS_PATH = CHECKPOINT_DIR / "results.json"

def show_progressive_learning_real(results_path=RESULTS_PATH):
    with open(results_path, "r") as f:
        results = json.load(f)

    active = results["active_learning"]
    random_ = results["random_sampling"]  # Not used here, but available

    initial = results.get("initial_samples", 7)
    step = 5

    def x_axis(arr): 
        return [initial + i*step for i in range(len(arr))]

    a_acc = active.get("accuracies", [])
    a_conf = active.get("avg_confidences", [])
    a_unc  = active.get("uncertainties", [])
    a_cov  = active.get("category_coverage", [])

    xa_acc  = x_axis(a_acc)
    xa_conf = x_axis(a_conf)
    xa_unc  = list(range(len(a_unc)))  # uncertainties are per-round; index is fine
    xa_cov  = x_axis(a_cov)

    print("📈 PROGRESSIVE MODEL IMPROVEMENT (Real Data)")
    print("=" * 60)
    print("\nWatch how the model gets smarter at each evaluation checkpoint:\n")

    # Build a unified list of checkpoints from accuracy (since eval logged every 5 labels)
    checkpoints = xa_acc

    # Helper to get value at a checkpoint from a metric with its own x-axis
    def val_at(xs, ys, x0, default=None):
        if not xs or not ys:
            return default
        if x0 in xs:
            return ys[xs.index(x0)]
        # nearest prior
        prior = [(x, y) for x, y in zip(xs, ys) if x <= x0]
        if prior:
            return prior[-1][1]
        return ys[0] if ys else default

    print("N  | Accuracy  | Confidence | Uncertainty | Coverage | Visual")
    print("---|-----------|------------|-------------|----------|" + "-" * 35)

    for N in checkpoints:
        acc = val_at(xa_acc, a_acc, N, 0.0)                # 0–1
        conf = val_at(xa_conf, a_conf, N, None)            # 0–1 or None
        # For uncertainty, map eval checkpoints to nearest selection index proportionally
        # If you prefer, show N/A when a strict mapping isn’t possible.
        unc = None
        if a_unc:
            # Approximate mapping: use proportional index
            idx = min(len(a_unc)-1, max(0, round((N - initial) / step) - 1))
            unc = a_unc[idx]

        cov = val_at(xa_cov, a_cov, N, None)              # 0–7 or None

        # Visual bar for confidence if available
        if conf is not None:
            conf_bar_length = int(max(0, min(30, round(conf * 30))))
            conf_bar = '█' * conf_bar_length + '░' * (30 - conf_bar_length)
            status = "✅" if conf >= 0.8 else ("🔶" if conf >= 0.6 else "❌")
            conf_str = f"{conf:10.1%}"
        else:
            conf_bar = '·' * 30
            status = "–"
            conf_str = f"{'N/A':>10s}"

        unc_str = f"{unc:11.3f}" if unc is not None else f"{'N/A':>11s}"
        cov_str = f"{int(round(cov)):2d}/7" if cov is not None else "N/A"

        print(f"{N:2d} | {acc:9.1%} | {conf_str} | {unc_str} | {cov_str:>8s} | [{conf_bar}] {status}")

    print("\n✨ Check whether confidence grows, uncertainty trends down, and coverage increases as labels are added.")

    # Plots with real arrays and per-metric x-axes
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    # 1) Accuracy and Confidence (if present)
    ax1.plot(xa_acc, [y*100 for y in a_acc], 'b-o', linewidth=3, markersize=6, label='Accuracy')
    if a_conf:
        ax1.plot(xa_conf, [c*100 for c in a_conf], 'g--s', linewidth=2, markersize=5, alpha=0.9, label='Confidence')
    ax1.set_xlabel('Number of Labeled Examples', fontsize=12)
    ax1.set_ylabel('Metric (%)', fontsize=12)
    ax1.set_title('Accuracy and Confidence over Labels', fontsize=13, fontweight='bold')
    ax1.grid(True, alpha=0.3)
    ax1.legend(loc='lower right')

    # 2) Uncertainty and Coverage (if present)
    # twinx() shares the x-axis, so both series must use label counts. Map selection
    # round i to the label count after that round (initial + (i+1)*step), the same
    # approximation the table code above uses.
    handles = []
    if a_unc:
        xa_unc_labels = [initial + (i + 1) * step for i in range(len(a_unc))]
        handles += ax2.plot(xa_unc_labels, a_unc, 'r-s', linewidth=2.5, markersize=6, label='Avg Uncertainty')
        ax2.set_ylabel('Avg Uncertainty', fontsize=12)
    if a_cov:
        ax2_ = ax2.twinx()
        handles += ax2_.plot(xa_cov, a_cov, 'k-.^', linewidth=2, markersize=5, label='Category Coverage')
        ax2_.set_ylabel('Coverage (categories)', fontsize=12)
        ax2_.set_ylim(0, 7.5)

    if handles:
        # Merge legends from both y-axes into one box
        ax2.legend(handles, [h.get_label() for h in handles], loc='upper right')
        ax2.set_xlabel('Number of Labeled Examples', fontsize=12)
        ax2.set_title('Uncertainty Reduction and Coverage', fontsize=13, fontweight='bold')
        ax2.grid(True, alpha=0.3)
    else:
        ax2.axis('off')
        ax2.set_title('No Uncertainty/Coverage Logged', fontsize=13)

    plt.suptitle('Active Learning: From Confusion to Confidence (Real Data)', fontsize=14, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

show_progressive_learning_real()
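`show_progressive_learning_real` quietly tolerates missing or mismatched metrics (it falls back to N/A or the nearest earlier value), which can hide logging mistakes. Below is a small sanity check you could run before plotting. It assumes the key layout that function reads; `check_results_schema` is a hypothetical helper, and its length check assumes each metric is logged at the same checkpoints as accuracy.

```python
def check_results_schema(results):
    """Return a list of human-readable problems with an active-learning results dict.

    Hypothetical helper. Assumes the layout read by show_progressive_learning_real():
    results["active_learning"] and results["random_sampling"] hold per-checkpoint
    lists of fractions in [0, 1], and results["initial_samples"] is the start count.
    """
    problems = []
    for run in ("active_learning", "random_sampling"):
        if run not in results:
            problems.append(f"missing top-level key '{run}'")
            continue
        metrics = results[run]
        accs = metrics.get("accuracies", [])
        if not accs:
            problems.append(f"{run}: no 'accuracies' logged")
        # Each metric gets its own x-axis starting at initial_samples, so differing
        # lengths usually mean the points don't line up with the accuracy checkpoints
        for key in ("avg_confidences", "category_coverage"):
            vals = metrics.get(key, [])
            if vals and len(vals) != len(accs):
                problems.append(f"{run}: '{key}' has {len(vals)} points but 'accuracies' has {len(accs)}")
        # The plots multiply by 100, so values should be fractions, not percentages
        for name in ("accuracies", "avg_confidences"):
            if any(not 0.0 <= v <= 1.0 for v in metrics.get(name, [])):
                problems.append(f"{run}: '{name}' should be fractions in [0, 1]")
    if "initial_samples" not in results:
        problems.append("no 'initial_samples'; the plots fall back to 7")
    return problems
```

In the notebook you could call it as `check_results_schema(json.load(open(RESULTS_PATH)))` and print any problems before calling the plotting function.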


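Active learning in a nutshell: instead of labeling data at random, we let the model point us to the examples it is least sure about, which tend to be the most informative. How many labels this saves depends on the task, the model, and the selection strategy, so compare the active and random curves from your own run rather than assuming a fixed percentage.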
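The selection step at the heart of this loop can be sketched in a few lines. This is a minimal illustration of entropy-based uncertainty sampling next to a random baseline; the pool, the probabilities, and the category names (`bed`, `toy`, `danger`) are made up, and the notebook's own selection code may differ, for example in batch size or tie-breaking.

```python
import math
import random

def entropy(probs):
    """Entropy (nats) of a {category: probability} dict."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def select_uncertain(pool, k):
    """Uncertainty sampling: return the k pool items with the highest entropy.

    `pool` maps an unlabeled item (e.g. a product name) to the model's
    probability dict for that item.
    """
    ranked = sorted(pool, key=lambda item: entropy(pool[item]), reverse=True)
    return ranked[:k]

def select_random(pool, k, seed=0):
    """Random-sampling baseline: k items drawn uniformly without replacement."""
    return random.Random(seed).sample(sorted(pool), k)

# Hypothetical model outputs for three unlabeled products
pool = {
    "fleece blanket": {"bed": 0.90, "toy": 0.05, "danger": 0.05},
    "bluetooth speaker": {"bed": 0.35, "toy": 0.30, "danger": 0.35},
    "cardboard box": {"bed": 0.60, "toy": 0.35, "danger": 0.05},
}
print("uncertainty picks:", select_uncertain(pool, 2))
print("random picks:     ", select_random(pool, 2))
```

In a full loop you would label the selected items, retrain, re-score the remaining pool, and repeat. Picking the top-k by entropy alone can choose near-duplicate items in one batch, which is why some systems add a diversity term.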


## Comparing ML Paradigms: Supervised, Unsupervised, and Active Learning

## What each paradigm is best at
- **Supervised Learning**
  - Optimizes for maximum predictive accuracy once you have labeled data.
  - Reliable and production-ready when labels are abundant and consistent.

- **Unsupervised Learning**
  - Explores structure in unlabeled data to reveal patterns and groups.
  - Great for discovery, sense-making, and informing downstream tasks.

- **Active Learning**
  - Selects the most informative examples to label next.
  - Aims to reach a target accuracy with fewer labels than random sampling would need.



## Trade-offs and costs
- **Supervised**
  - Strengths: Highest ceiling on accuracy; stable training.
  - Costs: Labeling is expensive and time-consuming; up-front investment required.

- **Unsupervised**
  - Strengths: No labels needed; fast to explore; uncovers hidden structure.
  - Costs: Not directly optimizing a labeled objective; requires interpretation.

- **Active**
  - Strengths: Label-efficient; focuses human effort where it matters; faster early gains.
  - Costs: Iterative loop (model+query+label); requires uncertainty/selection strategy.
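The "uncertainty/selection strategy" above is usually a score computed from the model's predicted probabilities. Entropy (used in the explorer cell earlier) is one choice; two other standard ones are least confidence and margin. A minimal sketch with made-up probabilities and placeholder category names:

```python
def least_confidence(probs):
    """1 - max probability: high when even the top prediction is weak."""
    return 1.0 - max(probs.values())

def margin_uncertainty(probs):
    """1 - (top1 - top2): high when the two best categories are nearly tied."""
    top = sorted(probs.values(), reverse=True)
    second = top[1] if len(top) > 1 else 0.0
    return 1.0 - (top[0] - second)

# Two hypothetical predictions with the same top probability
a = {"x": 0.50, "y": 0.45, "z": 0.05}   # split between two categories
b = {"x": 0.50, "y": 0.25, "z": 0.25}   # one leader, the rest spread out

print("least confidence:", least_confidence(a), least_confidence(b))
print("margin:          ", round(margin_uncertainty(a), 3), round(margin_uncertainty(b), 3))
```

The two scores can rank examples differently: least confidence ties `a` and `b`, while margin ranks `a` higher because the model is torn between two specific categories, the "confused between" case the explorer prints.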

## When to use which
- **Supervised:** Production systems, high-stakes decisions, when labels are available and accuracy is paramount.
- **Unsupervised:** Early-stage exploration, market/customer segmentation, discovering anomalies or themes.
- **Active:** Limited labeling budget, costly experts, need quick improvements with minimal labels.



## Part 8: Testing Our Complete Cat Assistant

Let's have some fun and test our trained model's ability to both classify and chat!


In [None]:
def interactive_cat_demo(classifier):
    """
    Interactive demo showing both classification and conversation
    """
    print("\n" + "="*60)
    print("🐱 INTERACTIVE CAT ASSISTANT DEMO")
    print("="*60)
    
    demo_products = [
        "Apple MacBook Pro laptop",
        "Large cardboard shipping box",
        "Automatic laser pointer toy",
        "Roomba robot vacuum",
        "Ceramic water fountain",
        "Fleece blanket",
        "Bluetooth speaker"
    ]
    
    print("\n🎯 Classification Mode:\n")
    for product in demo_products:
        pred, probs = classifier.classify(product, return_probs=True)
        confidence = probs[pred]
        uncertainty = classifier.get_uncertainty(product)
        
        # Create visual confidence bar
        bar_length = int(confidence * 20)
        bar = '█' * bar_length + '░' * (20 - bar_length)
        
        print(f"📦 {product:30s}")
        print(f"   → {pred:12s} [{bar}] {confidence:.1%}")
        print(f"   Uncertainty: {uncertainty:.3f}")
    
    print("\n💬 Conversation Mode:\n")
    
    # Test conversational ability
    conversations = [
        "What would a cat think about a warm laptop?",
        "My cat keeps sitting on my keyboard. Why?",
        "Is a cardboard box a good cat toy?"
    ]
    
    for question in conversations:
        prompt = f"Human: {question}\nAssistant:"
        
        # Generate response
        inputs = classifier.tokenizer(prompt, return_tensors="pt").to(classifier.device)
        
        with torch.no_grad():
            outputs = classifier.model.generate(
                **inputs,
                max_new_tokens=50,
                temperature=0.7,
                do_sample=True,
                pad_token_id=classifier.tokenizer.pad_token_id
            )
        
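        # Decode only the newly generated tokens; slicing the decoded text by
        # len(prompt) breaks if detokenization doesn't reproduce the prompt exactly
        prompt_len = inputs["input_ids"].shape[1]
        response = classifier.tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()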
        
        print(f"👤 Human: {question}")
        print(f"🐱 Cat Assistant: {response}\n")
    
    print("✨ Notice how the model maintains both capabilities!")
    print("   It can classify products AND have conversations about them.")

# Run the demo
interactive_cat_demo(classifier)

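Our model can do both classification and conversation! This dual capability matters for real-world applications: we taught the model our specific task without giving up its general abilities, though the responses above are worth a quick read to check how well its conversation held up after fine-tuning.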


## Part 9: Running the CatShop Web Application

Now that we've trained our cat classifier, let's see it in action in a real web application! The CatShop website integrates our trained model to provide:

1. **Cat Classifications**: Each product shows how a cat would categorize it
2. **Confidence Scores**: Visual indicators of how certain the cat is
3. **Cat Chat**: Interactive chat with the cat about products

Let's set up and run the web application.

In [None]:
# First, ensure the trained model is in the expected location for the web app
import shutil
from pathlib import Path

# The web app expects the model in a specific location relative to the catshop module
web_model_path = Path('catshop/models/gemma-cat-lora')
notebook_model_path = Path('models/gemma-cat-lora')

# Create the directory structure if it doesn't exist
web_model_path.parent.mkdir(parents=True, exist_ok=True)

# Copy the trained model to where the web app expects it
if notebook_model_path.exists():
    print(f"✅ Copying trained model from {notebook_model_path} to {web_model_path}")
    if web_model_path.exists():
        shutil.rmtree(web_model_path)
    shutil.copytree(notebook_model_path, web_model_path)
    print(f"✅ Model copied successfully!")
else:
    print(f"⚠️ No trained model found at {notebook_model_path}")
    print("   The web app will fall back to rule-based classification")

# Verify the model files are in place
if web_model_path.exists():
    model_files = list(web_model_path.glob('*'))
    print(f"\n📁 Model files in {web_model_path}:")
    for f in model_files[:5]:  # Show first 5 files
        print(f"   - {f.name}")

In [None]:
# Lightweight web demo deps (avoid pyserini/nmslib native build issues)
import subprocess, sys, os

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # quiet HF tokenizers warning

required = [
    "flask",
    "flask-cors",
    "rank-bm25",
    "spacy",
    "thefuzz",     # fuzzy string matching
    "rich"
]

def ensure(pkg, import_name=None, post=None):
    import_name = import_name or pkg.replace("-", "_")
    try:
        __import__(import_name)
    except ImportError:
        print(f"Installing {pkg}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])
    if post:
        post()

def _ensure_spacy_model():
    try:
        import spacy
        spacy.load("en_core_web_sm")
    except Exception:
        print("Downloading spaCy English model...")
        subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])

for p in required:
    ensure(p)

_ensure_spacy_model()

# Try optional pyserini (will skip if it fails)
try:
    __import__("pyserini")
    print("✅ Optional: pyserini available")
except Exception:
    print("ℹ️ Optional: pyserini not installed (skipping). Using BM25/fuzzy matching instead.")

print("✅ All web demo dependencies ready!")
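The `ensure` helper above imports each package to test whether it is installed, which executes the module (slow for packages like spaCy). A lighter check uses `importlib.util.find_spec`, which locates a top-level module without running it. This sketch assumes the same naming rule as the cell above (module name = pip name with `-` replaced by `_`); `missing_modules` is a hypothetical helper.

```python
import importlib
import importlib.util

def missing_modules(pip_names):
    """Return the pip names whose top-level module can't be found.

    find_spec locates a module without executing it. Assumes the module name
    is the pip name with '-' replaced by '_', like the ensure() helper above.
    """
    # Refresh import-system caches so packages pip installed during this
    # session are visible
    importlib.invalidate_caches()
    return [name for name in pip_names
            if importlib.util.find_spec(name.replace("-", "_")) is None]

print(missing_modules(["flask", "flask-cors", "rank-bm25", "spacy", "thefuzz", "rich"]))
```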

In [None]:
# Patch cat_classifier.py so its model path resolves from the notebook's working directory
from pathlib import Path

cat_classifier_path = Path('catshop/cat_classifier.py')

if cat_classifier_path.exists():
    content = cat_classifier_path.read_text()

    # The original resolves the model path via Path(__file__).parent.parent, which may
    # not match the notebook context. The replacement resolves it relative to the
    # current working directory: under catshop/ if that folder exists here, else as-is.
    updated_content = content.replace(
        'self.model_path = Path(__file__).parent.parent / model_path',
        'self.model_path = Path("catshop") / model_path if Path("catshop").exists() else Path(model_path)'
    )

    # Write back only if the replacement changed something
    if updated_content != content:
        cat_classifier_path.write_text(updated_content)
        print("✅ Updated cat_classifier.py paths for notebook compatibility")
    else:
        print("✅ cat_classifier.py already patched (or the original path line wasn't found)")
else:
    print(f"⚠️ {cat_classifier_path} not found; skipping path patch")

In [None]:
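# Remove the unused cleantext and selenium imports from engine.py
# (the lightweight dependency cell above doesn't install these packages)
from pathlib import Path

engine_path = Path('catshop/engine/engine.py')

if engine_path.exists():
    lines = engine_path.read_text().splitlines(keepends=True)

    # Only match lines that *start* with these imports, so comments or strings
    # that merely mention the package names are left alone
    unused_imports = ('import cleantext', 'from selenium')

    cleaned_lines = []
    for line in lines:
        if line.lstrip().startswith(unused_imports):
            print(f"  Removing: {line.strip()}")
        else:
            cleaned_lines.append(line)

    # Rewrite the file only if something was removed
    if len(cleaned_lines) != len(lines):
        engine_path.write_text(''.join(cleaned_lines))
        print("✅ Cleaned up unused imports from engine.py")
    else:
        print("✅ engine.py has no cleantext/selenium imports to remove")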

In [None]:
import threading
import time
import os
from IPython.display import IFrame, display, HTML

import sys
from pathlib import Path

sys.path.insert(0, str(Path("Lecture 1 Overview").resolve()))
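# Sanity check: fail early if the catshop package can't be imported
import importlib
importlib.import_module("catshop")

# Set up Flask app in a thread
def run_flask_app():
    """Run the Flask application in a separate thread"""
    # Use an absolute path: a relative sys.path entry is interpreted against the
    # working directory at import time, and the chdir below changes it
    catshop_dir = Path('catshop').resolve()
    sys.path.insert(0, str(catshop_dir))

    # Change to catshop directory for proper static file serving.
    # Note: os.chdir is process-wide, so the notebook kernel also runs from
    # catshop/ until the server stops and the finally block restores it.
    original_dir = os.getcwd()
    os.chdir(catshop_dir)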
    
    try:
        from app import app
        # Run with debug=False to avoid reloader issues in notebook
        app.run(host='127.0.0.1', port=3000, debug=False, use_reloader=False)
    finally:
        os.chdir(original_dir)

# Check if Flask is already running
import socket
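def is_port_open(port):
    """Return True if something accepts TCP connections on 127.0.0.1:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)  # don't hang if the connection attempt stalls
        return sock.connect_ex(('127.0.0.1', port)) == 0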

if not is_port_open(3000):
    # Start Flask in a background thread
    flask_thread = threading.Thread(target=run_flask_app, daemon=True)
    flask_thread.start()
    
    print("🚀 Starting CatShop web application...")
    # Wait for Flask to start
    for i in range(10):
        if is_port_open(3000):
            print("✅ CatShop is running!")
            break
        time.sleep(1)
    else:
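        print("⚠️ CatShop didn't respond within 10 seconds. It may still be starting (try the link below in a moment) or it may have failed; look for a traceback in the output.")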
else:
    print("✅ CatShop is already running!")

# Display link and embedded iframe
display(HTML("""
<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); 
            padding: 20px; border-radius: 10px; color: white; margin: 20px 0;">
    <h2>🐱 CatShop is Ready!</h2>
    <p>The web application is now running with your trained cat classifier.</p>
    <p><strong>Access the app:</strong> 
       <a href="http://localhost:3000/test_session" target="_blank" 
          style="color: #FFE082; text-decoration: underline;">
          Open CatShop in a new tab
       </a>
    </p>
    <p><strong>Features to try:</strong></p>
    <ul>
        <li>Search for products (e.g., "laptop", "box", "toy")</li>
        <li>See cat classifications with emojis and confidence scores</li>
        <li>Click on products and use the "Ask the Cat" button for cat chat</li>
    </ul>
</div>
"""))

# Embed the app in an iframe (optional - may not work in all notebook environments)
IFrame('http://localhost:3000/test_session', width=1000, height=600)
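The cell above gives Flask a fixed 10 seconds to come up. If the app loads the fine-tuned model at startup, that may not be enough on slower machines. Here is a sketch of a configurable wait you could use in place of the fixed loop, for example `wait_for_port(3000, timeout=60)`; `wait_for_port` is a hypothetical helper, not part of the CatShop code.

```python
import socket
import time

def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Poll until something accepts TCP connections on host:port, or time out.

    Returns True as soon as a connection succeeds, False after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (e.g. ConnectionRefusedError)
            # when nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock changes while waiting.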

That's it! You've built a complete ML system that demonstrates all three paradigms with the actual Gemma model. Well done! 🎉