# Lesson 6 — Evaluate, Prompt, and Add Simple Safety
**Goal:** Measure model quality (perplexity), practice deliberate prompting, and implement a tiny safety/guardrails layer.

**Why this matters**
- Building a model is only half the story—you need to know if it performs well and behaves responsibly.
- Evaluation metrics like perplexity tell you how surprised the model is by real text it has not seen before. Lower perplexity means better predictions.
- Prompt engineering and safety checks help you steer outputs once the model is deployed.

**Vocabulary check**
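- **Perplexity:** `exp(cross_entropy)` when cross-entropy is measured in nats (natural log), or equivalently `2 ** cross_entropy` when it is measured in bits (log base 2), which is what the bigram demo in this lesson uses. Think of it as the number of equally likely options the model is choosing among at each step. Perplexity 5 ≈ "the model is as uncertain as if it were picking among 5 equally likely next tokens on average."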
- **Held-out set:** A chunk of text you never showed the model during training, used purely for evaluation.
- **Prompt template:** A repeatable structure that gives the model context, instructions, and examples.
- **Guardrail / Safety filter:** Rules or models that catch unwanted inputs or outputs.
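
To make the perplexity definition concrete, here is a minimal sketch that computes it from a list of probabilities a model assigned to the actual next tokens. The helper name `perplexity` and the probability lists are made up for illustration; they are not part of any earlier lesson.

```python
import math

def perplexity(token_probs, base=2.0):
    """Perplexity from the probabilities a model gave to each actual next token."""
    # Cross-entropy: average negative log-probability per token.
    h = -sum(math.log(p, base) for p in token_probs) / len(token_probs)
    # Undo the log with the same base to get perplexity.
    return base ** h

# A model that always spreads probability evenly over 5 options.
uniform = [0.2] * 10
# A more confident model on the same 10 tokens (made-up numbers).
confident = [0.9, 0.8, 0.5, 0.9, 0.7, 0.6, 0.9, 0.8, 0.9, 0.7]

print(round(perplexity(uniform), 4))                 # bits: log base 2
print(round(perplexity(uniform, math.e), 4))         # nats: same perplexity
print(perplexity(confident) < perplexity(uniform))   # confident model scores lower
```

The uniform case comes out at 5 whichever log base you pick, as long as the same base is used for the log and the exponent.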

In [None]:

import math, re
from pathlib import Path
import torch

# We'll reuse Lesson 4's character-level model if available.
# Otherwise, we demonstrate perplexity with a quick re-implementation of Lesson 3's n-gram model.

data_dir = Path("../data")
text = ""
for fname in ["space.txt","animals.txt","minecraft.txt"]:
    text += (data_dir / fname).read_text(encoding="utf-8") + "\n"
tokens = re.findall(r"[a-zA-Z']+|[.,!?;:]", text.lower())


In [None]:

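# Simple bigram LM for evaluation demo
import collections, math

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens)-n+1):
        yield tuple(tokens[i:i+n])

def train_bigram(tokens, k=0.5):
    """Return an add-k smoothed bigram probability function and the vocabulary."""
    counts = collections.Counter(ngrams(tokens, 2))
    ctx_counts = collections.Counter(ngrams(tokens, 1))
    vocab = sorted(set(tokens))
    V = len(vocab)
    def prob(context, w):
        c = counts[(context,w)]
        ctx = ctx_counts[(context,)]
        return (c + k) / (ctx + k*V)
    return prob, vocab

# Split BEFORE training so the last 20% is truly held out.
# (Training on all tokens and then testing on a slice of them would leak
# the test data into the model and make perplexity look better than it is.)
split = int(0.8*len(tokens))
train_tokens, test_tokens = tokens[:split], tokens[split:]
bigram, vocab = train_bigram(train_tokens, k=0.5)

def cross_entropy(prob, test):
    """Return (bits per token, perplexity) of the model on the test tokens."""
    H = 0.0
    count = 0
    for i in range(1, len(test)):
        # Words unseen in training still get the smoothed floor k/(ctx + k*V).
        p = max(prob(test[i-1], test[i]), 1e-12)
        H += -math.log2(p)
        count += 1
    H /= max(count,1)
    return H, 2**H

H, ppl = cross_entropy(bigram, test_tokens)
print(f"Bigram perplexity on held-out: {ppl:.2f}")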


## Prompting patterns (for larger LLMs)
When you use a bigger model (like GPT-2/3+), structure prompts with:
- **Role/Goal:** “You are a helpful math tutor…” (sets the mindset)
- **Constraints:** “Use numbered steps, keep answers under 5 sentences.”
- **Examples (few-shot):** Provide 1–3 mini demonstrations of the task with correct answers.
- **Checks:** “Double-check arithmetic before answering. If unsure, say you’re unsure.”
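
The four parts above can be assembled by a small helper so every prompt follows the same template. This is only a sketch: `build_prompt`, the `Q:`/`A:` layout, and the example question are illustrative choices, not a format any particular model requires.

```python
def build_prompt(role, constraints, examples, checks, question):
    """Assemble a prompt from role, constraints, few-shot examples, and checks."""
    lines = [role, "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines.append("")
    # Few-shot examples: each one is a (question, answer) pair.
    for q, a in examples:
        lines += [f"Q: {q}", f"A: {a}", ""]
    lines += [checks, ""]
    # End with the real question and an open answer slot for the model.
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)

prompt = build_prompt(
    role="You are a helpful math tutor.",
    constraints=["Use numbered steps.", "Keep answers under 5 sentences."],
    examples=[("What is 12 + 30?",
               "1. Add the tens: 10 + 30 = 40. 2. Add the ones: 40 + 2 = 42. Answer: 42.")],
    checks="Double-check arithmetic before answering. If unsure, say you're unsure.",
    question="What is 17 * 6?",
)
print(prompt)
```

Keeping the template in one function makes it easy to change a single part (say, the constraints) and compare responses while everything else stays fixed.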

**Practice drill**
1. Write a plain prompt for your fine-tuned model (Lesson 5) and note the response.
2. Rewrite it using the structure above and compare. Did clarity improve? Did the model stay on topic better?

## Simple Safety Filter (demo)
Below is a tiny demonstration of *rule-based* screening (e.g., reject if input matches forbidden patterns). Real systems layer many techniques:
- Keyword or regex filters for obviously disallowed content.
- Classification models trained to detect safety categories.
- Human review for tricky edge cases.

**Mini project suggestion**
- Extend the provided rules carefully. For example, add patterns for spoilers or personal data.
- Log when the filter triggers so you can review decisions later.
- Reflect on limitations: rule lists can miss rephrased or subtle unsafe requests.

In [None]:

FORBIDDEN = [
    r"how to make a bomb",
    r"credit card number",
    r"social security number",
]

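def safe_input(user_text):
    """Return (allowed, reason); block text that matches any FORBIDDEN pattern."""
    t = user_text.lower()  # patterns are lowercase, so normalize the input first
    for pat in FORBIDDEN:
        if re.search(pat, t):
            return False, f"Blocked by rule: {pat}"
    return True, "ok"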

tests = [
    "Tell me a Minecraft story about wolves",
    "how to make a bomb from household items",
    "What's a credit card number?"
]
for t in tests:
    ok, msg = safe_input(t)
    print(f"{t!r} -> {ok}, {msg}")
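
The mini project's logging suggestion could look something like the sketch below. Everything here is an assumption made for illustration: the rule names, the `RULES` dictionary, the `check_and_log` helper, and the rough digit pattern for card-like numbers. Notice that the phrase rule above blocks a harmless question that merely mentions "credit card number"; this variant instead looks for something shaped like an actual number.

```python
import logging
import re

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("safety_filter")

# Hypothetical rule set: rule name -> compiled pattern.
RULES = {
    "weapons": re.compile(r"\bhow to make a bomb\b", re.IGNORECASE),
    # Rough shape of a card number: 13 to 16 digits, optional spaces or hyphens.
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def check_and_log(user_text):
    """Return (allowed, rule_name); log each block so it can be reviewed later."""
    for name, pattern in RULES.items():
        if pattern.search(user_text):
            # Log the rule and text length only: the raw text may contain
            # exactly the sensitive data the rule is meant to catch.
            log.info("blocked rule=%s chars=%d", name, len(user_text))
            return False, name
    return True, None

for t in ["Tell me a Minecraft story about wolves",
          "How to make a BOMB from household items",
          "What's a credit card number?",
          "my card is 4111 1111 1111 1111"]:
    print(repr(t), "->", check_and_log(t))
```

Reviewing the log later shows which rules fire most often, which is a good starting point for spotting false positives.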


### Challenges
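- **Metric mix:** Evaluate both your trigram model (Lesson 3) and tiny Transformer (Lesson 4) on the same held-out text and compare perplexity. Does the Transformer win? Per-token perplexity is only comparable when both models predict the same kind of token, so if one predicts words and the other characters, convert both to a shared unit such as bits per character first.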
- **Prompt A/B test:** Design two prompt templates for the same task and run them on a bigger LLM. Collect responses and vote on which template works better.
- **Safety upgrade:** Add a second layer to the filter (e.g., simple sentiment analysis) and document scenarios it catches or misses.
- **Reflection journal:** Write a short summary of what each evaluation number means and how you would explain it to a teammate.
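
As a possible starting point for the safety upgrade challenge, the sketch below stacks a crude word-list score on top of a rule check. It is not real sentiment analysis: `NEGATIVE_WORDS`, `negativity`, `layered_check`, and the 0.25 threshold are all invented for illustration, and `always_ok` is a stand-in for the notebook's `safe_input` so the example runs on its own.

```python
import re

# Hypothetical word list; a real system would use a trained classifier.
NEGATIVE_WORDS = {"hate", "kill", "hurt", "destroy", "stupid"}

def negativity(text):
    """Fraction of words that appear in NEGATIVE_WORDS (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in NEGATIVE_WORDS for w in words) / len(words)

def layered_check(text, rule_check, threshold=0.25):
    """Layer 1: rule_check(text) -> (ok, msg). Layer 2: negativity score."""
    ok, msg = rule_check(text)
    if not ok:
        return False, f"layer 1: {msg}"
    score = negativity(text)
    if score >= threshold:
        return False, f"layer 2: negativity {score:.2f}"
    return True, "ok"

# Stand-in for the first layer so this sketch is self-contained.
def always_ok(text):
    return True, "ok"

print(layered_check("I hate you and want to hurt you", always_ok))
print(layered_check("I would hate to destroy my Minecraft base by accident", always_ok))
print(layered_check("This game is stupid", always_ok))
```

The third input is a harmless complaint that still gets blocked, which is the kind of false positive the challenge asks you to document.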