
# Lesson 6 — Evaluate, Prompt, and Add Simple Safety
**Goal:** Measure model quality (perplexity), practice prompting, and implement a tiny safety/guardrails layer.

**What you'll learn**
- Perplexity on held-out text
- Prompt patterns (role, constraint, examples)
- Simple rule-based safety screen (demo)


In [None]:

import math, re
from pathlib import Path

# This notebook demonstrates perplexity with a quick re-implementation of the
# Lesson 3 n-gram model. The same procedure applies to Lesson 4's
# character-level model if you have it available (that path would need torch).

data_dir = Path("../data")
text = ""
for fname in ["space.txt","animals.txt","minecraft.txt"]:
    text += (data_dir / fname).read_text(encoding="utf-8") + "\n"
tokens = re.findall(r"[a-zA-Z']+|[.,!?;:]", text.lower())


In [None]:

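# Simple bigram LM for evaluation demo
import collections, math

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens)-n+1):
        yield tuple(tokens[i:i+n])

def train_bigram(train_tokens, vocab, k=0.5):
    """Fit an add-k smoothed bigram model and return prob(context, w)."""
    counts = collections.Counter(ngrams(train_tokens, 2))
    # Count a token only where it acts as a context (it has a successor),
    # so the smoothed probabilities sum to 1 for every context.
    ctx_counts = collections.Counter(train_tokens[:-1])
    V = len(vocab)
    def prob(context, w):
        c = counts[(context, w)]
        ctx = ctx_counts[context]
        return (c + k) / (ctx + k*V)
    return prob

# Hold out the last 20% of tokens and train on the first 80% only,
# so perplexity is measured on text the model has not seen.
split = int(0.8*len(tokens))
train_tokens, test_tokens = tokens[:split], tokens[split:]

# The vocabulary comes from the full corpus so that smoothing reserves
# probability mass for held-out words that never occur in the training split.
vocab = sorted(set(tokens))
bigram = train_bigram(train_tokens, vocab, k=0.5)

def cross_entropy(prob, test):
    """Return (bits per token, perplexity) of prob on the token list test."""
    H = 0.0
    count = 0
    for i in range(1, len(test)):
        # The floor only matters if k=0 leaves an unseen bigram with p=0.
        p = max(prob(test[i-1], test[i]), 1e-12)
        H += -math.log2(p)
        count += 1
    H /= max(count, 1)
    return H, 2**H

H, ppl = cross_entropy(bigram, test_tokens)
print(f"Bigram perplexity on held-out: {ppl:.2f}")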
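
To build intuition for the number printed above, it can help to compute perplexity for two made-up models. The sketch below is only an illustration: `perplexity` is a hypothetical helper (not from an earlier lesson), and the probability lists are invented rather than measured. A model that spreads its probability evenly over 50 words should score a perplexity of about 50, while a model that is usually confident and right should score much lower.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence, given the probability the model gave each token."""
    bits = [-math.log2(p) for p in probs]
    H = sum(bits) / len(bits)   # cross-entropy in bits per token
    return 2 ** H

# Invented example 1: every token gets probability 1/50 (uniform over 50 words).
uniform = perplexity([1 / 50] * 1000)

# Invented example 2: 90% of tokens get p=0.9, the remaining 10% get p=0.01.
confident = perplexity([0.9] * 900 + [0.01] * 100)

print(f"uniform over 50 words: {uniform:.2f}")
print(f"mostly confident:      {confident:.2f}")
```

Read this way, perplexity is roughly "how many equally likely choices the model is hesitating between" at each step, so lower is better.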



## Prompting patterns (for larger LLMs)
When you use a bigger model (like GPT-2/3+), structure prompts with:
- **Role/Goal:** “You are a helpful math tutor…”
- **Constraints:** “Use steps, show equations.”
- **Examples:** Few-shot demonstrations.
- **Checks:** “Double-check arithmetic.”

Try these patterns when prompting your fine-tuned model (Lesson 5) or a hosted model through an API.
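
The four patterns above can be combined into one prompt string. The following is a minimal sketch, not a standard recipe: `build_prompt` is a hypothetical helper written for this lesson, and the `Q:`/`A:` layout is just one common convention that you may need to adapt to your model.

```python
def build_prompt(role, constraints, examples, checks, question):
    """Assemble a prompt from a role, constraints, few-shot examples, and checks.

    Hypothetical helper for illustration; not part of any library.
    """
    lines = [role, ""]
    if constraints:
        lines.append("Constraints:")
        lines += [f"- {c}" for c in constraints]
        lines.append("")
    # Few-shot examples: each one is a (question, answer) pair.
    for q, a in examples:
        lines += [f"Q: {q}", f"A: {a}", ""]
    if checks:
        lines.append("Before answering: " + " ".join(checks))
        lines.append("")
    # End with the real question and an open "A:" for the model to complete.
    lines += [f"Q: {question}", "A:"]
    return "\n".join(lines)

prompt = build_prompt(
    role="You are a helpful math tutor.",
    constraints=["Use steps.", "Show equations."],
    examples=[("What is 2 + 3?", "Step 1: 2 + 3 = 5. Answer: 5")],
    checks=["Double-check arithmetic."],
    question="What is 12 * 7?",
)
print(prompt)
```

Keeping each pattern as a separate argument makes it easy to switch one off (pass an empty list) and compare how the model's answers change.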



## Simple Safety Filter (demo)
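Below is a tiny demonstration of *rule-based* screening: the input is rejected if it matches any pattern on a forbidden list.
This is **not** a complete safety system, just a conceptual intro. Keyword rules are easy to evade by rephrasing, and they also block harmless requests (the third test below is a benign question that still gets rejected).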


In [None]:

FORBIDDEN = [
    r"how to make a bomb",
    r"credit card number",
    r"social security number",
]

def safe_input(user_text):
    t = user_text.lower()
    for pat in FORBIDDEN:
        if re.search(pat, t):
            return False, f"Blocked by rule: {pat}"
    return True, "ok"

tests = [
    "Tell me a Minecraft story about wolves",
    "how to make a bomb from household items",
    "What's a credit card number?"
]
for t in tests:
    ok, msg = safe_input(t)
    print(f"{t!r} -> {ok}, {msg}")
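
Exact-phrase rules like the ones above are brittle: an extra space, a tab, or a hyphen between words is enough to slip past `safe_input`. One possible hardening step, sketched below, is to normalize the text before matching. The names `FORBIDDEN_PATTERNS`, `normalize`, and `safe_input_normalized` are hypothetical additions for this illustration, and the result is still a demo rather than a real safety system.

```python
import re

# Same three rules as the demo, precompiled and anchored on word boundaries.
FORBIDDEN_PATTERNS = [
    re.compile(r"\bhow to make a bomb\b"),
    re.compile(r"\bcredit card number\b"),
    re.compile(r"\bsocial security number\b"),
]

def normalize(user_text):
    """Lowercase, turn punctuation into spaces, and collapse runs of whitespace."""
    t = user_text.lower()
    t = re.sub(r"[^a-z0-9\s]", " ", t)
    return re.sub(r"\s+", " ", t).strip()

def safe_input_normalized(user_text):
    """Return (ok, message) after matching the rules against normalized text."""
    t = normalize(user_text)
    for pat in FORBIDDEN_PATTERNS:
        if pat.search(t):
            return False, f"Blocked by rule: {pat.pattern}"
    return True, "ok"

for t in ["How  to make\ta BOMB", "credit-card number, please", "Tell me about wolves"]:
    print(repr(t), "->", safe_input_normalized(t))
```

Normalization closes a few trivial gaps, but paraphrases and misspellings still get through, and benign questions that mention a forbidden phrase are still blocked.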



### Challenges
- Extend evaluation to trigram or your tiny Transformer (compute NLL over held-out chars).
- Expand safety rules responsibly (or learn about modern techniques like RLHF).
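
A hint for the first challenge: the bigram cell measures cross-entropy in bits (log base 2), whereas PyTorch's cross-entropy loss uses the natural logarithm, so a Transformer's mean NLL comes out in nats. The sketch below shows the conversion with hypothetical helper names and a made-up loss value of 1.5; it assumes you already have the mean NLL per held-out character.

```python
import math

def ppl_from_nats(mean_nll_nats):
    """Perplexity from a mean negative log-likelihood measured in nats."""
    return math.exp(mean_nll_nats)

def ppl_from_bits(mean_nll_bits):
    """Perplexity from a mean negative log-likelihood measured in bits."""
    return 2 ** mean_nll_bits

def nats_to_bits(nats):
    """Convert nats to bits: divide by ln(2)."""
    return nats / math.log(2)

mean_nll = 1.5  # made-up value standing in for a model's held-out loss
ppl_a = ppl_from_nats(mean_nll)
ppl_b = ppl_from_bits(nats_to_bits(mean_nll))
print(f"{ppl_a:.4f} {ppl_b:.4f}")  # both routes give the same perplexity
```

Keep in mind that a per-character perplexity is not directly comparable to the per-word perplexity of the bigram model, because the two models predict different units.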
