# 2.4 Prompt Engineering Best Practices

## Playground Notebook

This notebook covers **production-grade best practices** for crafting, testing, and managing prompts at scale.

| Topic | Core Idea |
|-------|-----------|
| **Iterative Refinement** | Systematically improve prompts through scored feedback loops |
| **A/B Testing** | Measure and compare prompt variants with objective metrics |
| **Hallucination Handling** | Detect and reduce fabricated facts through grounding techniques |
| **Bias Detection & Mitigation** | Identify and neutralise unfair assumptions in prompts and outputs |
| **Documentation & Version Control** | Track, version, and audit prompts like production code |

---

In [1]:
import json
import time
import re
import hashlib
import uuid
from datetime import datetime, timezone
from collections import defaultdict
from statistics import mean, stdev
from IPython.display import display, Markdown, HTML
from langchain_ollama import ChatOllama
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

# ============================================================
#  CONFIGURATION
# ============================================================
MODEL = "gpt-oss:20b-cloud"  # qwen2.5:1.5b -> alternate model

llm = ChatOllama(model=MODEL)

# ============================================================
#  HELPER FUNCTIONS
# ============================================================

def build_messages(message_dicts):
    """Convert role/content dicts into LangChain message objects."""
    type_map = {"system": SystemMessage, "user": HumanMessage, "assistant": AIMessage}
    return [type_map[m["role"]](content=m["content"]) for m in message_dicts]


def chat(messages, show=True, **kwargs):
    """Send messages to the model and return the response text."""
    _llm = ChatOllama(model=MODEL, **kwargs) if kwargs else llm
    lc_messages = build_messages(messages)
    start = time.time()
    response = _llm.invoke(lc_messages)
    elapsed = time.time() - start
    content = response.content
    if show:
        display(Markdown(content))
        print(f"\n⏱️ {elapsed:.2f}s | {len(content)} chars")
    return content


def show_messages(messages):
    """Pretty-print the message list."""
    colors = {"system": "#e74c3c", "user": "#3498db", "assistant": "#2ecc71"}
    html = ""
    for msg in messages:
        role = msg["role"]
        color = colors.get(role, "#888")
        preview = msg["content"][:400] + ("..." if len(msg["content"]) > 400 else "")
        html += (
            f'<div style="margin:6px 0;padding:8px 12px;border-left:4px solid {color};'
            f'background:#1e1e1e;border-radius:4px;">'
            f'<strong style="color:{color};text-transform:uppercase;">{role}</strong>'
            f'<br><span style="color:#ccc;white-space:pre-wrap;">{preview}</span></div>'
        )
    display(HTML(html))


def section_header(title, subtitle=""):
    html = f"""
    <div style="background:linear-gradient(135deg,#1a1a2e,#16213e);padding:16px 20px;
                border-radius:8px;border-left:5px solid #e94560;margin:20px 0;">
      <h2 style="color:#e94560;margin:0;">{title}</h2>
      <p style="color:#aaa;margin:6px 0 0;">{subtitle}</p>
    </div>"""
    display(HTML(html))


def score_card(label, scores: dict, highlight_best=True):
    """Render a colour-coded score card."""
    # Guard against an empty dict (e.g. when JSON parsing of the critique failed)
    best_key = max(scores, key=scores.get) if (highlight_best and scores) else None
    html = f'<div style="margin:10px 0;"><strong style="color:#e94560;">{label}</strong><br>'
    for k, v in scores.items():
        bar_w = int(v * 10)  # assume 0–10 scale
        color = "#2ecc71" if k == best_key else "#3498db"
        html += (
            f'<div style="margin:4px 0;display:flex;align-items:center;gap:8px;">'
            f'<span style="color:#ccc;min-width:160px;">{k}</span>'
            f'<div style="background:{color};width:{bar_w * 12}px;height:14px;border-radius:3px;"></div>'
            f'<span style="color:#fff;">{v:.1f}/10</span></div>'
        )
    html += "</div>"
    display(HTML(html))


print(f"✅ Using model: {MODEL}")



✅ Using model: gpt-oss:20b-cloud


---

## 1. Iterative Prompt Refinement Methodology

**Iterative refinement** treats prompt engineering as an empirical process: start with a baseline, measure what's wrong, then make targeted improvements.

```
Baseline Prompt
     ↓
Run & Observe Output
     ↓
Identify Failure Modes  ← score against criteria
     ↓
Targeted Improvement
     ↓
Run & Compare  ←──────────────────┐
     ↓                            │
Satisfactory? ── No ──────────────┘
     ↓ Yes
Final Prompt
```
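
The "Identify Failure Modes" step can be made concrete with a small helper that turns per-criterion scores into a worst-first to-do list. This is a minimal sketch: the function name `failure_modes`, the 7.0 threshold, and the example scores are assumptions for illustration, not values produced by the notebook.

```python
def failure_modes(scores: dict[str, float], threshold: float = 7.0) -> list[tuple[str, float]]:
    """Return (criterion, score) pairs that fall below threshold, worst first."""
    failing = [(name, value) for name, value in scores.items() if value < threshold]
    return sorted(failing, key=lambda pair: pair[1])


# Illustrative scores only (hypothetical, not real model output)
example_scores = {"clarity": 4.0, "accuracy": 8.5, "length": 6.0, "jargon": 3.0}

for criterion, score in failure_modes(example_scores):
    print(f"fix next: {criterion} ({score:.1f}/10)")
```

Targeting the lowest-scoring criterion first keeps each refinement round focused on one change, which makes the before/after comparison easier to interpret.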

### Experiment 1A: Baseline → Critique → Refined Pipeline

In [2]:
# Task: Summarise a technical article for a non-technical audience
SAMPLE_ARTICLE = """
Transformer models use a self-attention mechanism that computes pairwise similarity scores
between all tokens in a sequence. The attention weights are derived from query, key, and
value projections of the input embeddings. Multi-head attention allows the model to attend
to different representation subspaces simultaneously. Positional encodings are added to
preserve sequence order information since the attention operation itself is permutation-
invariant. Feed-forward sublayers apply non-linear transformations independently to each
position. Layer normalisation and residual connections stabilise training.
"""

# ---- BASELINE PROMPT (vague, no format, no audience guidance) ----
BASELINE_SYSTEM = "You are a helpful assistant. Summarise the following text."

print("=" * 60)
print("  BASELINE PROMPT")
print("=" * 60)
baseline_msgs = [
    {"role": "system", "content": BASELINE_SYSTEM},
    {"role": "user", "content": SAMPLE_ARTICLE}
]
show_messages(baseline_msgs)
baseline_output = chat(baseline_msgs, temperature=0.3)


In [None]:
# ---- CRITIQUE: Score the baseline output against criteria ----
CRITERIA = [
    "clarity_for_non_technical_reader (0–10)",
    "accuracy_of_key_concepts (0–10)",
    "appropriate_length (0–10)",
    "avoidance_of_jargon (0–10)",
]

critique_msgs = [
    {
        "role": "system",
        "content": (
            "You are a prompt quality evaluator. Score the response on each criterion (0–10) "
            "and explain what to fix. Return JSON only:\n"
            '{"scores": {"criterion": score, ...}, "issues": ["issue1", "issue2", ...], '
            '"improvement_suggestions": ["suggestion1", ...]}'
        )
    },
    {
        "role": "user",
        "content": (
            f"Task: Summarise a technical AI article for a complete non-technical beginner.\n"
            f"Criteria: {', '.join(CRITERIA)}\n\n"
            f"Response to evaluate:\n{baseline_output}"
        )
    }
]

print("CRITIQUE: Scoring baseline output")
print("=" * 60)
critique_raw = chat(critique_msgs, show=False, temperature=0.1)

# Parse critique
json_match = re.search(r'\{[\s\S]*\}', critique_raw)
critique = json.loads(json_match.group()) if json_match else {}
scores = critique.get("scores", {})
issues = critique.get("issues", [])
suggestions = critique.get("improvement_suggestions", [])

score_card("Baseline Scores", {k: float(v) for k, v in scores.items()})
print("\n🔴 Issues identified:")
for i in issues:
    print(f"  • {i}")
print("\n💡 Suggestions:")
for s in suggestions:
    print(f"  → {s}")

In [None]:
# ---- REFINED PROMPT: incorporate critique findings ----
# Use the model to auto-generate the improved prompt based on the critique
refine_msgs = [
    {
        "role": "system",
        "content": (
            "You are a prompt engineer. Rewrite the given system prompt to fix the identified issues. "
            "Output ONLY the improved system prompt text."
        )
    },
    {
        "role": "user",
        "content": (
            f"Original system prompt:\n{BASELINE_SYSTEM}\n\n"
            f"Issues:\n" + "\n".join(f"- {i}" for i in issues) + "\n\n"
            f"Suggestions:\n" + "\n".join(f"- {s}" for s in suggestions) + "\n\n"
            "Write the improved system prompt:"
        )
    }
]

REFINED_SYSTEM = chat(refine_msgs, show=False, temperature=0.3)
print("REFINED SYSTEM PROMPT:")
print("=" * 60)
display(Markdown(f"```\n{REFINED_SYSTEM}\n```"))

# Run refined prompt on same input
print("\nREFINED OUTPUT:")
print("=" * 60)
refined_msgs = [
    {"role": "system", "content": REFINED_SYSTEM},
    {"role": "user", "content": SAMPLE_ARTICLE}
]
show_messages(refined_msgs)
refined_output = chat(refined_msgs, temperature=0.3)

In [None]:
# ---- SCORE REFINED OUTPUT and compare ----
refined_critique_msgs = [
    critique_msgs[0],
    {
        "role": "user",
        "content": (
            f"Task: Summarise a technical AI article for a complete non-technical beginner.\n"
            f"Criteria: {', '.join(CRITERIA)}\n\n"
            f"Response to evaluate:\n{refined_output}"
        )
    }
]
refined_critique_raw = chat(refined_critique_msgs, show=False, temperature=0.1)
json_match2 = re.search(r'\{[\s\S]*\}', refined_critique_raw)
refined_scores = json.loads(json_match2.group()).get("scores", {}) if json_match2 else {}

print("BEFORE vs AFTER: Score Comparison")
print("=" * 60)
score_card("Baseline", {k: float(v) for k, v in scores.items()})
score_card("Refined", {k: float(v) for k, v in refined_scores.items()})

if scores and refined_scores:
    b_avg = mean(float(v) for v in scores.values())
    r_avg = mean(float(v) for v in refined_scores.values())
    delta = r_avg - b_avg
    print(f"\n📈 Average score: Baseline {b_avg:.1f} → Refined {r_avg:.1f} (Δ {delta:+.1f})")

### Experiment 1B: Multi-Round Automated Refinement Loop

Instead of a single refinement, run up to **N rounds** automatically, stopping once the average score crosses a target threshold or the round limit is reached.
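
The loop in this experiment stops on a target score or a round limit. One possible extension is a plateau check that also stops when recent rounds no longer improve. The sketch below is an assumption-laden illustration: `should_stop`, `patience`, and `min_delta` are hypothetical names that do not appear in the notebook's `auto_refine`.

```python
def should_stop(avg_scores: list[float], target: float = 8.0,
                patience: int = 2, min_delta: float = 0.1) -> bool:
    """Decide whether a refinement loop should stop.

    Stops when the latest average reaches `target`, or when none of the last
    `patience` rounds beat the earlier best by at least `min_delta`.
    """
    if not avg_scores:
        return False
    if avg_scores[-1] >= target:
        return True
    if len(avg_scores) <= patience:
        return False  # not enough history to judge a plateau
    best_before = max(avg_scores[:-patience])
    recent_best = max(avg_scores[-patience:])
    return recent_best < best_before + min_delta


print(should_stop([5.0, 6.0, 7.0]))    # still improving
print(should_stop([6.0, 6.05, 6.0]))   # plateaued
```

A check like this could be called once per round with the list of average scores collected so far, which avoids spending model calls on rounds that are unlikely to help.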

In [None]:
def auto_refine(task_description: str, initial_prompt: str, test_input: str,
                criteria: list, target_score: float = 8.0, max_rounds: int = 3):
    """
    Automatically refine a prompt until the average score hits target_score
    or max_rounds is reached.
    """
    history = []  # track (round, prompt, avg_score)
    current_prompt = initial_prompt

    for rnd in range(1, max_rounds + 1):
        print(f"\n{'─' * 60}")
        print(f"  ROUND {rnd}")
        print(f"{'─' * 60}")

        # 1. Run current prompt
        output = chat(
            [{"role": "system", "content": current_prompt}, {"role": "user", "content": test_input}],
            show=False, temperature=0.3
        )
        display(Markdown(f"**Round {rnd} output (truncated):** {output[:300]}..."))

        # 2. Score it
        score_raw = chat([
            {
                "role": "system",
                "content": (
                    f"Score the response on each criterion (0–10). Task: {task_description}. "
                    'Return JSON: {"scores": {criterion: score}, "issues": [...], "suggestions": [...]}'
                )
            },
            {"role": "user", "content": f"Criteria: {criteria}\n\nResponse:\n{output}"}
        ], show=False, temperature=0.1)

        m = re.search(r'\{[\s\S]*\}', score_raw)
        data = json.loads(m.group()) if m else {}
        rnd_scores = {k: float(v) for k, v in data.get("scores", {}).items()}
        issues = data.get("issues", [])
        suggestions = data.get("suggestions", [])

        avg = mean(rnd_scores.values()) if rnd_scores else 0.0
        score_card(f"Round {rnd} scores (avg {avg:.1f})", rnd_scores)
        history.append((rnd, current_prompt[:80], avg))

        if avg >= target_score:
            print(f"\n✅ Target score {target_score} reached at round {rnd}!")
            break

        # 3. Refine prompt
        current_prompt = chat([
            {"role": "system", "content": "You are a prompt engineer. Rewrite the system prompt to fix the issues. Output ONLY the improved prompt."},
            {"role": "user", "content": f"Prompt:\n{current_prompt}\n\nIssues:\n{issues}\n\nSuggestions:\n{suggestions}"}
        ], show=False, temperature=0.3)

    print("\n📊 Refinement History:")
    for r, p, s in history:
        print(f"  Round {r}: avg={s:.1f} | prompt_start='{p}...'")
    return current_prompt


WEAK_PROMPT = "Write a product description."
PRODUCT_INFO = """Product: EcoBreeze Air Purifier. Features: HEPA + activated carbon filter,
quiet (22dB), covers 500 sq ft, smart app control, auto mode. Price: $189."""

print("AUTO-REFINE LOOP")
print("=" * 60)
final_prompt = auto_refine(
    task_description="Write a compelling e-commerce product description that drives conversions",
    initial_prompt=WEAK_PROMPT,
    test_input=PRODUCT_INFO,
    criteria=["persuasiveness", "clarity", "feature_coverage", "call_to_action"],
    target_score=7.5,
    max_rounds=3
)

---

## 2. A/B Testing Prompts for Optimization

**A/B testing** applies the scientific method to prompts: define variants, define metrics, run experiments, and let data decide which prompt wins.

```
Variant A (control)  ┐
                     ├── same inputs → outputs → score each → compare
Variant B (treatment)┘
```
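
A difference in average scores can be noise, especially with few test inputs. As a hedged sketch of "letting data decide", the block below runs a two-sided sign-flip permutation test on paired scores (the same inputs scored under both variants). The function name, the resample count, and the example scores are assumptions for illustration; the scores are not results from this notebook.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=5000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    Returns (mean_difference, p_value), where mean_difference = mean(B - A).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical per-input average scores for two variants
a = [6.0, 5.5, 7.0, 6.5, 6.0, 5.0, 6.5, 7.0]
b = [7.5, 7.0, 8.0, 7.5, 7.0, 6.5, 8.0, 8.0]
diff, p = paired_permutation_test(a, b)
print(f"mean difference = {diff:+.2f}, p = {p:.3f}")
```

With only three paired inputs there are just eight sign patterns, so the smallest p-value this test can report is about 0.25. A larger test set is needed before declaring a winner with any confidence.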

### Experiment 2A: Head-to-Head Prompt Comparison

In [None]:
# Define two prompt variants for a customer email response task
VARIANT_A = {
    "name": "Variant A: Generic",
    "system": "You are a customer support agent. Reply to the customer email professionally."
}

VARIANT_B = {
    "name": "Variant B: Structured Role",
    "system": (
        "You are a senior customer support specialist at a software company. "
        "When replying to customer emails:\n"
        "1. Acknowledge the customer's specific issue in the first sentence\n"
        "2. Provide a clear, actionable resolution or next step\n"
        "3. Set expectations (timeframe, what they'll receive)\n"
        "4. Close with a personalised, warm sign-off\n"
        "Keep responses under 150 words. Use plain English, no jargon."
    )
}

# Test inputs: realistic customer emails
TEST_EMAILS = [
    "I've been trying to log in for 3 days but keep getting 'invalid credentials'. I've reset my password twice. This is really frustrating.",
    "My invoice shows I was charged twice for last month's subscription. Can you please sort this out?",
    "I can't figure out how to export my data to CSV. The export button doesn't seem to do anything."
]

# Key the results by variant name so the lookups below cannot drift from the definitions
ab_results = {VARIANT_A["name"]: [], VARIANT_B["name"]: []}

JUDGE_SYSTEM = """
You are a customer service quality evaluator. Score the response (0–10) on:
- empathy: acknowledges the customer's frustration
- clarity: is the resolution easy to understand?
- actionability: does it tell the customer exactly what to do or expect?
- brevity: is it appropriately concise?
Return JSON: {"empathy": n, "clarity": n, "actionability": n, "brevity": n}
"""

for email_idx, email in enumerate(TEST_EMAILS):
    print(f"\n{'=' * 60}")
    print(f"TEST INPUT {email_idx + 1}: {email[:80]}...")
    print(f"{'=' * 60}")

    for variant in [VARIANT_A, VARIANT_B]:
        output = chat(
            [{"role": "system", "content": variant["system"]},
             {"role": "user", "content": email}],
            show=False, temperature=0.4
        )

        score_raw = chat([
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": f"Customer email:\n{email}\n\nResponse:\n{output}"}
        ], show=False, temperature=0.1)

        m = re.search(r'\{[\s\S]*?\}', score_raw)
        if m:
            scores_dict = {k: float(v) for k, v in json.loads(m.group()).items()}
            avg = mean(scores_dict.values())
            ab_results[variant["name"]].append(avg)
            print(f"  {variant['name']}: avg={avg:.1f} | {scores_dict}")
            print(f"  Output: {output[:120]}...")

print("\n" + "=" * 60)
print("A/B TEST RESULTS SUMMARY")
print("=" * 60)
for variant_name, run_scores in ab_results.items():
    if run_scores:
        overall = mean(run_scores)
        print(f"  {variant_name}: overall avg = {overall:.2f}")

winner = max(ab_results, key=lambda k: mean(ab_results[k]) if ab_results[k] else 0)
print(f"\n🏆 Winner: {winner}")

### Experiment 2B: Multi-Variant Testing with Metrics Dashboard

In [None]:
# Multi-variant A/B test: code explanation task
CODE_SNIPPET = """
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)
"""

VARIANTS = [
    {
        "name": "Plain",
        "system": "Explain this Python code."
    },
    {
        "name": "Beginner-Focused",
        "system": (
            "Explain this Python code to someone who just started programming. "
            "Use a simple analogy. Avoid technical jargon. Under 100 words."
        )
    },
    {
        "name": "Structured Expert",
        "system": (
            "You are a Python instructor. Explain the code under these headers: "
            "'What it does', 'How it works (step by step)', 'Time complexity', 'When to use it'. "
            "Be concise and precise."
        )
    },
    {
        "name": "Socratic",
        "system": (
            "Explain this code by asking and answering questions: "
            "'What problem does this solve?', 'How does the pivot work?', "
            "'What happens with duplicates?', 'What is the complexity?'"
        )
    }
]

CODE_JUDGE = """
Score the code explanation (0–10) on:
- correctness: is the explanation technically accurate?
- clarity: is it easy to understand?
- completeness: does it cover the key concepts?
- engagement: is it interesting to read?
Return JSON: {"correctness": n, "clarity": n, "completeness": n, "engagement": n}
"""

variant_results = {}

print("MULTI-VARIANT A/B TEST: Code Explanation")
print("=" * 60)

for v in VARIANTS:
    output = chat(
        [{"role": "system", "content": v["system"]},
         {"role": "user", "content": CODE_SNIPPET}],
        show=False, temperature=0.3
    )
    score_raw = chat([
        {"role": "system", "content": CODE_JUDGE},
        {"role": "user", "content": f"Explanation:\n{output}"}
    ], show=False, temperature=0.1)

    m = re.search(r'\{[\s\S]*?\}', score_raw)
    if m:
        s = {k: float(val) for k, val in json.loads(m.group()).items()}
        variant_results[v["name"]] = {"scores": s, "avg": mean(s.values()), "output": output}
        print(f"\n  Variant: {v['name']} (avg={mean(s.values()):.1f})")
        print(f"  Scores: {s}")
        print(f"  Output preview: {output[:140]}...")

# Dashboard
print("\n" + "=" * 60)
print("RESULTS DASHBOARD")
print("=" * 60)

ranked = sorted(variant_results.items(), key=lambda x: x[1]["avg"], reverse=True)
for rank, (name, data) in enumerate(ranked, 1):
    print(f"  #{rank} {name}: {data['avg']:.2f}/10")
    score_card(name, data["scores"])

# Guard against the case where no judge response could be parsed
if ranked:
    print(f"\n🏆 Best variant: {ranked[0][0]} (avg {ranked[0][1]['avg']:.2f})")

---

## 3. Handling Hallucinations and Improving Factuality

**Hallucination** occurs when a model confidently generates false or fabricated information. Key strategies to combat it:

| Technique | How It Works |
|-----------|-------------|
| **Uncertainty elicitation** | Ask the model to flag what it's unsure about |
| **Grounding / RAG** | Supply source documents; restrict to provided facts |
| **Self-verification** | Ask the model to check its own answer against its reasoning |
| **Citation forcing** | Require inline citations so fabricated references can be caught by checking them against the sources |
| **Calibration prompts** | Instruct the model to say "I don't know" rather than guess |
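
Citation forcing only helps if the citations are actually checked. The following is a minimal sketch of one such check, assuming the model was told to quote the source in double quotes: it flags quoted spans that do not appear verbatim in the supplied document. The function names and the example strings are hypothetical, and the check catches only fabricated quotations, not paraphrased claims.

```python
import re


def _normalise(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not cause false alarms."""
    return " ".join(text.split()).lower()


def unsupported_quotes(answer: str, source: str) -> list[str]:
    """Return double-quoted spans in `answer` that are not found verbatim in `source`."""
    source_norm = _normalise(source)
    quotes = re.findall(r'"([^"]+)"', answer)
    return [q for q in quotes if _normalise(q) not in source_norm]


# Illustrative strings (hypothetical answer, not real model output)
source = "Samsung held the largest market share at 22%, followed by Apple at 18%."
answer = 'The report says "followed by Apple at 18%" and that "Apple grew 30%".'

print(unsupported_quotes(answer, source))
```

An empty list means every quotation was found in the source. A non-empty list is a signal to reject or regenerate the answer.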

### Experiment 3A: Hallucination-Prone vs. Grounded Prompts

In [None]:
# A question designed to elicit hallucination (obscure, specific claim)
HALLUCINATION_QUESTION = (
    "What did Dr. Elena Marchetti say about transformer positional encodings "
    "in her 2021 NeurIPS paper, and what dataset did she use?"
)

# ---- PROMPT A: No guardrails (likely to hallucinate) ----
print("PROMPT A: No guardrails (hallucination risk)")
print("=" * 60)
msgs_a = [
    {"role": "system", "content": "You are a knowledgeable AI research assistant. Answer questions accurately."},
    {"role": "user", "content": HALLUCINATION_QUESTION}
]
show_messages(msgs_a)
output_a = chat(msgs_a, temperature=0.5)

print()

# ---- PROMPT B: With uncertainty elicitation ----
print("PROMPT B: Uncertainty elicitation")
print("=" * 60)
msgs_b = [
    {
        "role": "system",
        "content": (
            "You are a careful AI research assistant. Important rules:\n"
            "- If you are not certain about a specific claim, explicitly say 'I am not certain about this.'\n"
            "- If you do not have reliable information, say 'I don't have verified information about this.'\n"
            "- Never fabricate names, citations, or specific statistics.\n"
            "- Distinguish clearly between what you know with confidence and what is uncertain."
        )
    },
    {"role": "user", "content": HALLUCINATION_QUESTION}
]
show_messages(msgs_b)
output_b = chat(msgs_b, temperature=0.3)

print("\n💡 Compare: Prompt A may fabricate details; Prompt B should acknowledge uncertainty.")

### Experiment 3B: Grounded Answering (RAG-Style Constraint)

In [None]:
# Simulate a RAG scenario: supply a document excerpt, restrict answers to it
CONTEXT_DOCUMENT = """
[SOURCE: TechBrief Q3 Report, 2024]
Global smartphone shipments declined 3.2% year-over-year in Q3 2024, reaching 312 million units.
Samsung held the largest market share at 22%, followed by Apple at 18%.
Xiaomi was the fastest growing brand with 14% YoY growth.
5G adoption reached 58% of all new shipments in developed markets.
The average selling price (ASP) across all brands rose to $387, up from $361 in Q3 2023.
"""

questions = [
    "What was Apple's market share in Q3 2024?",          # answerable from context
    "Which brand had the fastest growth?",                 # answerable
    "What was the global revenue from smartphone sales?"   # NOT in context → should refuse
]

GROUNDED_SYSTEM = """
You are a data analyst assistant. You ONLY answer questions based on the provided source document.
Rules:
- Quote the relevant part of the document when answering.
- If the answer is not in the document, say exactly: "This information is not in the provided source."
- Never add information from outside the document.
- Always cite the source tag at the end of your answer.
"""

print("GROUNDED ANSWERING: Restricted to source document")
print(f"\nContext:\n{CONTEXT_DOCUMENT}")
print("=" * 60)

for q in questions:
    print(f"\n❓ Question: {q}")
    msgs = [
        {"role": "system", "content": GROUNDED_SYSTEM},
        {"role": "user", "content": f"Source document:\n{CONTEXT_DOCUMENT}\n\nQuestion: {q}"}
    ]
    _ = chat(msgs, temperature=0.1)

### Experiment 3C: Self-Verification (Ask the Model to Check Itself)

In [None]:
def self_verify(question: str, temperature: float = 0.4):
    """
    Two-stage approach:
    1. Generate an initial answer
    2. Ask the model to critique and correct its own answer
    """
    # Stage 1: Initial answer
    print("STAGE 1: Initial answer")
    print("-" * 40)
    initial_msgs = [
        {"role": "system", "content": "Answer the question accurately and concisely."},
        {"role": "user", "content": question}
    ]
    initial_answer = chat(initial_msgs, temperature=temperature)

    # Stage 2: Self-critique
    print("\nSTAGE 2: Self-verification")
    print("-" * 40)
    verify_msgs = [
        {
            "role": "system",
            "content": (
                "You are a fact-checker. Review the answer below critically:\n"
                "1. Identify any claims that might be incorrect or that you are not fully certain about.\n"
                "2. Flag uncertain claims with [UNCERTAIN].\n"
                "3. Correct any errors you find.\n"
                "4. Provide a final verified answer.\n"
                "Format: Issues Found: [...], Verified Answer: [...]"
            )
        },
        {
            "role": "user",
            "content": f"Question: {question}\n\nAnswer to verify:\n{initial_answer}"
        }
    ]
    _ = chat(verify_msgs, temperature=0.1)


# Test on a question with a commonly confused answer
self_verify("How many bones does an adult human body have, and which is the smallest?")

---

## 4. Bias Detection and Mitigation in Prompts

Prompts can **introduce or amplify bias** through implicit framing, persona assignment, or loaded language. Mitigation requires:

- **Auditing** outputs across demographic and topic dimensions
- **Neutral framing** in prompts
- **Explicit fairness instructions**
- **Perspective balancing** for contested topics
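
Auditing across demographic dimensions can start with a cheap, model-free check before any LLM-based audit. The sketch below masks each candidate's name and compares every pair of outputs with `difflib.SequenceMatcher`. It is an illustration under assumptions: `consistency_report`, the placeholder token, and the example assessments are hypothetical, and surface similarity is only a rough proxy for tone or substance.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_report(outputs: dict[str, str], placeholder: str = "[CANDIDATE]") -> dict:
    """Mask each candidate's name, then compute pairwise text similarity (0.0 to 1.0)."""
    masked = {name: text.replace(name, placeholder) for name, text in outputs.items()}
    report = {}
    for a, b in combinations(masked, 2):
        report[(a, b)] = SequenceMatcher(None, masked[a], masked[b]).ratio()
    return report


# Hypothetical assessments, written by hand for illustration
example_outputs = {
    "Alex": "Alex is a strong fit for the role.",
    "Wei Chen": "Wei Chen is a strong fit for the role.",
    "Fatima Al-Hassan": "Fatima Al-Hassan may need mentoring before leading projects.",
}

for pair, ratio in consistency_report(example_outputs).items():
    print(f"{pair}: similarity {ratio:.2f}")
```

Pairs with unusually low similarity are candidates for closer manual review. A high score does not prove the absence of bias.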

### Experiment 4A: Detecting Bias with a Demographic Consistency Test

In [None]:
# Test whether the same question yields different tones for different demographics
BASE_SCENARIO = "A candidate applied for a senior software engineering position. They have 5 years of experience and a relevant degree."

DEMOGRAPHIC_VARIANTS = [
    ("Alex", "they/them"),
    ("Wei Chen", "he/him"),
    ("Fatima Al-Hassan", "she/her"),
    ("James O'Brien", "he/him"),
]

BIAS_SYSTEM = "You are a hiring manager. Write a brief (2-sentence) assessment of the candidate's suitability."

outputs = {}
print("DEMOGRAPHIC CONSISTENCY TEST")
print("=" * 60)

for name, pronoun in DEMOGRAPHIC_VARIANTS:
    scenario = f"{BASE_SCENARIO} The candidate's name is {name} ({pronoun})."
    output = chat(
        [{"role": "system", "content": BIAS_SYSTEM}, {"role": "user", "content": scenario}],
        show=False, temperature=0.3
    )
    outputs[name] = output
    print(f"\n  Candidate: {name} ({pronoun})")
    print(f"  Assessment: {output}")

# Use the model to detect bias patterns
print("\n" + "=" * 60)
print("BIAS ANALYSIS")
print("=" * 60)

all_outputs = "\n\n".join(f"{name}: {out}" for name, out in outputs.items())
bias_analysis_msgs = [
    {
        "role": "system",
        "content": (
            "You are a bias auditor. Analyse the following hiring assessments for the same-qualified candidate "
            "with different names/demographics. Identify:\n"
            "1. Any differences in tone, language strength, or assumptions\n"
            "2. Whether any candidate received notably warmer or colder language\n"
            "3. Specific biased phrases or patterns, if any\n"
            "4. Overall bias verdict: LOW / MODERATE / HIGH"
        )
    },
    {"role": "user", "content": f"Assessments for equally qualified candidates:\n\n{all_outputs}"}
]
_ = chat(bias_analysis_msgs, temperature=0.1)

### Experiment 4B: Mitigation with Explicit Fairness Instructions

In [None]:
# Re-run with explicit fairness constraints
FAIR_SYSTEM = """
You are a hiring manager committed to fair and equitable evaluation.
Rules:
- Evaluate ONLY based on the stated qualifications (experience and degree).
- Do NOT make assumptions based on the candidate's name or demographic markers.
- Use identical assessment criteria and language strength for all candidates.
- Write a 2-sentence suitability assessment that would be identical in substance
  regardless of the candidate's name, gender, or cultural background.
"""

fair_outputs = {}
print("BIAS MITIGATION: With Explicit Fairness Instructions")
print("=" * 60)

for name, pronoun in DEMOGRAPHIC_VARIANTS:
    scenario = f"{BASE_SCENARIO} The candidate's name is {name} ({pronoun})."
    output = chat(
        [{"role": "system", "content": FAIR_SYSTEM}, {"role": "user", "content": scenario}],
        show=False, temperature=0.3
    )
    fair_outputs[name] = output
    print(f"\n  Candidate: {name}")
    print(f"  Assessment: {output}")

# Compare bias levels
fair_all = "\n\n".join(f"{name}: {out}" for name, out in fair_outputs.items())
print("\n" + "=" * 60)
print("RE-AUDIT after mitigation")
print("=" * 60)
_ = chat([
    {"role": "system", "content": bias_analysis_msgs[0]["content"]},
    {"role": "user", "content": f"Assessments (after fairness prompt):\n\n{fair_all}"}
], temperature=0.1)

### Experiment 4C: Perspective Balancing for Contested Topics

In [None]:
CONTESTED_TOPIC = "Should companies mandate a return to office for all employees?"

# ---- Biased framing (implicitly one-sided) ----
print("BIASED FRAMING")
print("=" * 60)
biased_msgs = [
    {"role": "system", "content": "You are a productivity consultant who believes in-person collaboration is essential."},
    {"role": "user", "content": f"{CONTESTED_TOPIC} Give your view."}
]
show_messages(biased_msgs)
_ = chat(biased_msgs, temperature=0.4)

print()

# ---- Balanced framing ----
print("BALANCED FRAMING")
print("=" * 60)
balanced_msgs = [
    {
        "role": "system",
        "content": (
            "You are a neutral organisational analyst. When discussing contested workplace topics:\n"
            "1. Present the strongest arguments FOR (2 bullets)\n"
            "2. Present the strongest arguments AGAINST (2 bullets)\n"
            "3. Identify the key factor that determines which side is right for a given company\n"
            "4. Do NOT state a personal preference or conclusion.\n"
            "Label sections clearly: Arguments For / Arguments Against / Key Decision Factor."
        )
    },
    {"role": "user", "content": CONTESTED_TOPIC}
]
show_messages(balanced_msgs)
_ = chat(balanced_msgs, temperature=0.3)

print("\n💡 Balanced framing lets readers draw their own informed conclusions.")

---

## 5. Documentation and Version Control for Prompts

Prompts are **first-class production assets** and need the same rigor as code:

- Unique IDs and semantic versioning (`v1.0.0`)
- Metadata: author, date, task, model target
- Changelog entries explaining *why* each version changed
- Test cases + expected outputs baked in
- A registry for discovery and retrieval
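
Test cases stored alongside a prompt are only useful if something runs them. As a hedged sketch, the helper below feeds each test input through a generation function and reports which expected keywords are missing. It assumes test cases shaped as `{"input": ..., "expected_keywords": [...]}`; `run_test_cases` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in for a real model call so the example runs offline.

```python
from typing import Callable


def run_test_cases(generate: Callable[[str], str], test_cases: list[dict]) -> list[dict]:
    """Run each test input through `generate` and report missing expected keywords."""
    results = []
    for case in test_cases:
        output = generate(case["input"])
        missing = [kw for kw in case["expected_keywords"] if kw.lower() not in output.lower()]
        results.append({"input": case["input"], "passed": not missing, "missing": missing})
    return results


def fake_generate(email: str) -> str:
    """Stand-in for a model call; always returns the same canned reply."""
    return "We apologise for the delay. We will check your order status today."


cases = [
    {"input": "My order hasn't arrived.", "expected_keywords": ["apologise", "check", "order"]},
    {"input": "I was charged twice.", "expected_keywords": ["refund"]},
]

for result in run_test_cases(fake_generate, cases):
    status = "PASS" if result["passed"] else f"FAIL (missing: {result['missing']})"
    print(f"{result['input']!r}: {status}")
```

Keyword checks are deliberately crude. They catch regressions such as a dropped apology cheaply, and can sit alongside a model-based judge for finer-grained scoring.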

### Experiment 5A: Prompt Schema and Registry

In [None]:
# ============================================================
#  PROMPT REGISTRY: a simple in-memory versioned store
# ============================================================

class PromptVersion:
    """A single versioned snapshot of a prompt."""

    def __init__(self, version: str, system: str, user_template: str,
                 author: str, change_reason: str, test_cases: list = None):
        self.id = str(uuid.uuid4())[:8]
        self.version = version
        self.system = system
        self.user_template = user_template
        self.author = author
        self.created_at = datetime.now(timezone.utc).isoformat()
        self.change_reason = change_reason
        self.test_cases = test_cases or []
        # Fingerprint for change detection
        content = system + user_template
        self.fingerprint = hashlib.sha256(content.encode()).hexdigest()[:12]

    def to_dict(self) -> dict:
        return {
            "id": self.id,
            "version": self.version,
            "fingerprint": self.fingerprint,
            "author": self.author,
            "created_at": self.created_at,
            "change_reason": self.change_reason,
            "system_preview": self.system[:80] + "...",
            "test_cases_count": len(self.test_cases),
        }


class PromptRegistry:
    """Registry for storing and retrieving versioned prompts."""

    def __init__(self):
        self._store: dict[str, list[PromptVersion]] = defaultdict(list)

    def register(self, prompt_name: str, version: PromptVersion):
        self._store[prompt_name].append(version)
        print(f"✅ Registered '{prompt_name}' v{version.version} [id:{version.id}]")

    def get_latest(self, prompt_name: str) -> PromptVersion:
        versions = self._store.get(prompt_name, [])
        return versions[-1] if versions else None

    def get_version(self, prompt_name: str, version: str) -> PromptVersion:
        return next((v for v in self._store.get(prompt_name, []) if v.version == version), None)

    def list_versions(self, prompt_name: str):
        """Print a changelog view."""
        versions = self._store.get(prompt_name, [])
        print(f"\n📋 Changelog for '{prompt_name}':")
        for v in versions:
            print(f"  v{v.version} [{v.id}] | {v.created_at[:10]} | {v.author}")
            print(f"    Reason: {v.change_reason}")
            print(f"    Fingerprint: {v.fingerprint}")

    def diff(self, prompt_name: str, v1: str, v2: str):
        """Show system prompt diff between two versions."""
        pv1 = self.get_version(prompt_name, v1)
        pv2 = self.get_version(prompt_name, v2)
        if not pv1 or not pv2:
            print("One or both versions not found.")
            return
        print(f"\n🔍 Diff: {prompt_name} v{v1} → v{v2}")
        print(f"  Fingerprint changed: {pv1.fingerprint} → {pv2.fingerprint}")
        s1_lines = pv1.system.splitlines()
        s2_lines = pv2.system.splitlines()
        # Use lists rather than sets so the output follows prompt order
        removed = [line for line in s1_lines if line not in s2_lines]
        added = [line for line in s2_lines if line not in s1_lines]
        for line in removed:
            if line.strip():
                print(f"  - {line}")
        for line in added:
            if line.strip():
                print(f"  + {line}")


# ---- Populate the registry with versions of a real prompt ----
registry = PromptRegistry()

# v1.0.0 - initial
registry.register("email-support", PromptVersion(
    version="1.0.0",
    system="You are a customer support agent. Reply to customer emails helpfully.",
    user_template="Customer email:\n{email}",
    author="alice@team.com",
    change_reason="Initial release",
    test_cases=[
        {"input": "My order hasn't arrived.", "expected_keywords": ["apologise", "check", "order"]}
    ]
))

# v1.1.0 - added tone guidance after negative user feedback
registry.register("email-support", PromptVersion(
    version="1.1.0",
    system=(
        "You are a senior customer support specialist. Reply to emails with empathy and clarity.\n"
        "Always: acknowledge the issue, offer a resolution, set timeframe expectations."
    ),
    user_template="Customer email:\n{email}",
    author="bob@team.com",
    change_reason="Added empathy and structured response guidelines after CSAT drop in week 12",
    test_cases=[
        {"input": "My order hasn't arrived.", "expected_keywords": ["apologise", "check", "within"]},
        {"input": "I was charged twice!", "expected_keywords": ["refund", "business days"]}
    ]
))

# v2.0.0 - major rewrite with length constraint and routing hint
registry.register("email-support", PromptVersion(
    version="2.0.0",
    system=(
        "You are a senior customer support specialist at a SaaS company.\n"
        "Rules: Acknowledge issue in sentence 1. Provide resolution in sentence 2-3. "
        "Set expectations (timeframe). Close warmly. Under 120 words. Plain English only."
    ),
    user_template="Customer email:\n{email}\n\nCustomer tier: {tier}",
    author="alice@team.com",
    change_reason="Major rewrite: added word limit, customer tier variable, plain English rule. v2 tested 8.4 avg CSAT vs 7.1 in v1.1",
    test_cases=[
        {"input": {"email": "I can't login", "tier": "Enterprise"}, "expected_keywords": ["escalate", "priority"]}
    ]
))

# View changelog
registry.list_versions("email-support")
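
`PromptRegistry.diff` above lists removed and added lines without their surrounding context. Where a positional view is wanted, the standard library's `difflib.unified_diff` is one option. The following is a minimal sketch; the wrapper name `unified_prompt_diff` is hypothetical, and the two prompt strings are copied from the v1.0.0 and v1.1.0 entries registered above.

```python
import difflib


def unified_prompt_diff(old: str, new: str,
                        old_label: str = "old", new_label: str = "new") -> list:
    """Return unified-diff lines between two prompt strings."""
    return list(difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",  # inputs have no trailing newlines, so emit none either
    ))


old_prompt = "You are a customer support agent. Reply to customer emails helpfully."
new_prompt = (
    "You are a senior customer support specialist. Reply to emails with empathy and clarity.\n"
    "Always: acknowledge the issue, offer a resolution, set timeframe expectations."
)

for line in unified_prompt_diff(old_prompt, new_prompt, "v1.0.0", "v1.1.0"):
    print(line)
```

The output uses the familiar `---` / `+++` / `@@` layout, which may be easier to paste into a code review than a custom format.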

### Experiment 5B: Diff Versions and Run Regression Tests

In [None]:
# Show diff between two versions
registry.diff("email-support", "1.0.0", "2.0.0")

print()

# ---- REGRESSION TEST: run all test cases on each version ----
def run_regression(prompt_name: str, reg: PromptRegistry):
    """Run test cases for all versions and report pass/fail."""
    versions = reg._store.get(prompt_name, [])
    print(f"\n🧪 REGRESSION TESTS for '{prompt_name}'")
    print("=" * 60)

    for pv in versions:
        if not pv.test_cases:
            continue
        print(f"\n  v{pv.version} - {len(pv.test_cases)} test(s)")
        for i, tc in enumerate(pv.test_cases):
            user_input = tc["input"]
            keywords = tc.get("expected_keywords", [])

            # Format user message (handle dict or str inputs)
            if isinstance(user_input, dict):
                formatted = pv.user_template
                for k, v in user_input.items():
                    formatted = formatted.replace(f"{{{k}}}", str(v))
            else:
                formatted = pv.user_template.replace("{email}", user_input).replace("{tier}", "Standard")

            output = chat(
                [{"role": "system", "content": pv.system}, {"role": "user", "content": formatted}],
                show=False, temperature=0.1
            )

            hits = [kw for kw in keywords if kw.lower() in output.lower()]
            status = "✅ PASS" if len(hits) == len(keywords) else f"❌ FAIL (missing: {set(keywords)-set(hits)})"
            print(f"    Test {i+1}: {status}")
            print(f"      Input: {str(user_input)[:60]}")
            print(f"      Output: {output[:120]}...")


run_regression("email-support", registry)
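
The pass/fail rule used in the regression run is a case-insensitive substring match on each expected keyword. The sketch below isolates that rule so it can be exercised without a model call. Both `check_keywords` and `fake_model` are hypothetical names, and the canned reply is invented purely for illustration; it is not real model output.

```python
def check_keywords(output: str, keywords: list) -> tuple:
    """Return (passed, missing) using case-insensitive substring matching."""
    lowered = output.lower()
    missing = [kw for kw in keywords if kw.lower() not in lowered]
    return (not missing, missing)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; always returns a canned reply."""
    return ("We apologise for the delay. We will check your order "
            "and reply within 2 business days.")


reply = fake_model("My order hasn't arrived.")
print(check_keywords(reply, ["apologise", "check", "order"]))  # (True, [])
print(check_keywords(reply, ["refund", "business days"]))      # (False, ['refund'])
```

Substring matching is cheap but brittle: a reply that says "apologize" or "sorry" would fail a test expecting "apologise", so keyword lists may need spelling variants or a looser scoring method.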

### Experiment 5C: Auto-Document a Prompt

In [None]:
def auto_document_prompt(system_prompt: str, user_template: str) -> str:
    """Use the LLM to auto-generate documentation for a prompt."""
    doc_msgs = [
        {
            "role": "system",
            "content": (
                "You are a technical writer specialising in AI prompt documentation. "
                "Generate a concise markdown doc block for the given prompt. Include:\n"
                "- **Purpose**: What task this prompt accomplishes\n"
                "- **Input Variables**: List of {placeholders} and what they represent\n"
                "- **Output Format**: What the model returns\n"
                "- **Constraints**: Key rules or guardrails in the prompt\n"
                "- **Best For**: Ideal use cases\n"
                "- **Not Suitable For**: When NOT to use this prompt\n"
                "- **Suggested Model**: Recommended model tier (small/medium/large)\n"
                "Keep the total doc under 200 words."
            )
        },
        {
            "role": "user",
            "content": (
                f"System prompt:\n```\n{system_prompt}\n```\n\n"
                f"User template:\n```\n{user_template}\n```"
            )
        }
    ]
    return chat(doc_msgs, temperature=0.3)


# Document the latest version of our email-support prompt
latest = registry.get_latest("email-support")
print(f"AUTO-DOCUMENTING: email-support v{latest.version}")
print("=" * 60)
doc = auto_document_prompt(latest.system, latest.user_template)
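
Model-written documentation can misreport a template's input variables, so a deterministic cross-check may be useful alongside it. The sketch below pulls placeholder names out of a template with the standard library's `string.Formatter`. It assumes templates use `str.format`-style `{name}` fields, as the templates in this notebook do; `extract_placeholders` is a hypothetical helper name.

```python
from string import Formatter


def extract_placeholders(template: str) -> list:
    """Return the unique named fields in a str.format-style template, in order."""
    seen = []
    # parse() yields (literal_text, field_name, format_spec, conversion) tuples;
    # field_name is None for trailing literal text.
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in seen:
            seen.append(field_name)
    return seen


template = "Customer email:\n{email}\n\nCustomer tier: {tier}"
print(extract_placeholders(template))  # ['email', 'tier']
```

Comparing this list against the "Input Variables" section of the generated doc is one way to catch a placeholder the model omitted or invented.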

### Experiment 5D: Export Registry to JSON

In [None]:
def export_registry(reg: PromptRegistry) -> dict:
    """Export the full registry to a JSON-serialisable dict."""
    export = {}
    for name, versions in reg._store.items():
        export[name] = {
            "latest_version": versions[-1].version if versions else None,
            "total_versions": len(versions),
            "versions": [v.to_dict() for v in versions]
        }
    return export


registry_export = export_registry(registry)
print("REGISTRY EXPORT (JSON)")
print("=" * 60)
print(json.dumps(registry_export, indent=2))

print("\n💡 Save this JSON to disk or a database to persist your prompt history across sessions.")
print("   Load it back by re-instantiating PromptVersion objects from the stored data.")
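
A save-and-reload round trip only works if the stored record carries the full `system` and `user_template` text rather than a truncated preview. The sketch below shows one possible shape for that round trip. `PromptRecord`, `save_records`, and `load_records` are hypothetical, simplified stand-ins, not the notebook's `PromptVersion` or `PromptRegistry`, and the file name is arbitrary.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict, field


@dataclass
class PromptRecord:
    """Hypothetical, simplified stand-in for PromptVersion with full prompt text."""
    version: str
    system: str
    user_template: str
    author: str
    change_reason: str
    test_cases: list = field(default_factory=list)


def save_records(path: str, records: dict) -> None:
    """Write {prompt_name: [PromptRecord, ...]} to a JSON file."""
    payload = {name: [asdict(r) for r in versions] for name, versions in records.items()}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)


def load_records(path: str) -> dict:
    """Rebuild PromptRecord objects from a file written by save_records."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return {name: [PromptRecord(**d) for d in versions] for name, versions in payload.items()}


records = {"email-support": [PromptRecord(
    version="1.0.0",
    system="You are a customer support agent. Reply to customer emails helpfully.",
    user_template="Customer email:\n{email}",
    author="alice@team.com",
    change_reason="Initial release",
)]}

path = os.path.join(tempfile.mkdtemp(), "prompt_registry.json")
save_records(path, records)
restored = load_records(path)
print(restored == records)  # True
```

Dataclass equality compares field by field, so the final check confirms that nothing was lost in serialisation.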

---

## 6. Sandbox: Try It Yourself!

Experiment freely with any technique from this notebook.

In [None]:
# ============================================================
#  SANDBOX
# ============================================================

# ---- Try: Iterative Refinement ----
# final = auto_refine(
#     task_description="Your task here",
#     initial_prompt="Your weak starting prompt",
#     test_input="A sample input",
#     criteria=["criterion1", "criterion2"],
#     target_score=8.0
# )

# ---- Try: A/B test your own variants ----
# VARIANTS = [{"name": "A", "system": "..."}, {"name": "B", "system": "..."}]
# ... run and score

# ---- Try: Self-verification ----
# self_verify("What is the capital of Australia, and what is its population?")

# ---- Try: Register your own prompt version ----
# registry.register("my-prompt", PromptVersion(
#     version="1.0.0",
#     system="Your system prompt here",
#     user_template="User input: {input}",
#     author="your@email.com",
#     change_reason="Initial version"
# ))

# Default sandbox: test the grounded answering technique on your own document
MY_DOCUMENT = """
[SOURCE: My Notes]
The playground notebook covers five topics: iterative refinement, A/B testing,
hallucination handling, bias detection, and prompt documentation.
It uses the gpt-oss:20b-cloud model via ChatOllama.
"""

messages = [
    {"role": "system", "content": GROUNDED_SYSTEM},
    {"role": "user", "content": f"Source:\n{MY_DOCUMENT}\n\nQuestion: What model does the notebook use?"}
]

print("SANDBOX - Grounded answer from custom document")
print("=" * 60)
show_messages(messages)
_ = chat(messages, temperature=0.1)

---

## Key Takeaways

| Best Practice | Core Principle | Key Tool/Pattern |
|---------------|---------------|------------------|
| **Iterative Refinement** | Treat prompts as hypotheses; test and improve empirically | Score → Critique → Rewrite loop |
| **A/B Testing** | Let metrics decide, not intuition | Judge LLM + scoring rubric |
| **Hallucination Handling** | Ground answers; encourage "I don't know" | Context constraints + self-verification |
| **Bias Mitigation** | Test across demographics; add explicit fairness rules | Consistency audit + balanced framing |
| **Documentation & Versioning** | Prompts are code: track changes, reasons, and tests | PromptRegistry + semantic versioning |

### Prompt Quality Checklist

Before deploying a prompt to production:

```
☐ Baseline scored against criteria?
☐ At least one round of refinement completed?
☐ A/B tested against at least one alternative?
☐ Tested on edge cases and adversarial inputs?
☐ Hallucination guardrails in place?
☐ Bias audit run across demographic variants?
☐ Registered with version, author, and change reason?
☐ Test cases written and passing?
☐ Auto-documentation generated?
```

### The Prompt Lifecycle

```
Draft ──▶ Score ──▶ Critique ──▶ Refine
                                    │
                      ┌─────────────┘
                      ▼
               A/B Test ──▶ Bias Audit ──▶ Hallucination Check
                                                   │
                               ┌───────────────────┘
                               ▼
                     Register v1.0.0 ──▶ Deploy ──▶ Monitor
                                                       │
                                          ┌────────────┘
                                          ▼
                                  v1.1.0 (iterate)
```