# Level 2 - Week 10 - 01 Demo Script

**Estimated time:** 60-90 minutes

## Learning Objectives

- Define the demo steps, with timings and specific queries
- Include a controlled failure case
- Call out the evidence that supports each claim


## Overview

A good demo shows:

- problem
- approach
- live run
- failure case
- evidence
- roadmap

## Underlying theory: a demo is an argument (claim → evidence → reasoning)

A strong technical demo is not just “look, it runs”. It is a structured argument:

- claim: what your system can do (and cannot do)
- evidence: artifacts and live runs that support the claim
- reasoning: why the evidence supports the claim (mechanism + metrics)

For RAG systems, the most important claims are usually:

- grounded answers when evidence exists
- safe behavior (clarify/refuse) when evidence is missing
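
The claim → evidence → reasoning structure can be written down before the demo so that unsupported claims are easy to spot. The sketch below is one possible way to do that; the `DemoClaim` class, its field names, and the example strings are illustrative assumptions, not part of the course codebase.

```python
from dataclasses import dataclass, field


@dataclass
class DemoClaim:
    """One claim made during the demo (hypothetical helper for preparation)."""
    claim: str
    evidence: list[str] = field(default_factory=list)  # artifacts or live runs
    reasoning: str = ""                                 # why the evidence supports the claim

    def is_supported(self) -> bool:
        # A claim counts as supported only if it has evidence AND reasoning.
        return bool(self.evidence) and bool(self.reasoning.strip())


# Example entries; the wording is a placeholder to be replaced with your own.
claims = [
    DemoClaim(
        claim="Answers are grounded when evidence exists",
        evidence=["in-KB live run with citations", "metrics file"],
        reasoning="Every cited chunk appears in the /search output for the same query.",
    ),
    DemoClaim(claim="The system clarifies or refuses when evidence is missing"),
]

# List the claims that still need evidence or reasoning before the demo.
unsupported = [c.claim for c in claims if not c.is_supported()]
print(unsupported)
```

Any claim that shows up in `unsupported` either needs an artifact behind it or should be dropped from the script.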

## Why you must include a failure case

Every real RAG system fails on some queries. Showing a controlled failure case demonstrates that:

- you understand failure modes
- you built deterministic guardrails (not “prompt hope”)
- you can debug via `/search` and traces
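
To make "deterministic guardrail" concrete, here is a minimal sketch of a rule that picks the response mode from retrieval scores alone, with no LLM involved. It assumes that retrieval returns similarity scores where higher is better; the function name `decide_mode`, the `"answer"` mode label, and the threshold values are hypothetical and would need tuning against your own evaluation set.

```python
def decide_mode(scores, min_score=0.5, min_hits=1):
    """Pick a response mode from retrieval scores (illustrative rule only).

    scores:    similarity scores of the retrieved chunks (assumed: higher = better)
    min_score: assumed threshold for a chunk to count as usable evidence
    min_hits:  assumed number of usable chunks required to answer
    """
    strong = [s for s in scores if s >= min_score]
    if len(strong) >= min_hits:
        return "answer"   # enough evidence: answer with citations
    if scores:
        return "clarify"  # weak matches only: ask the user to narrow the question
    return "refuse"       # nothing retrieved: refuse rather than guess


print(decide_mode([0.9, 0.2]))  # answer
print(decide_mode([0.3]))       # clarify
print(decide_mode([]))          # refuse
```

Because the rule is a pure function of the scores, the same query always produces the same mode, which is what makes the failure case repeatable in a live demo.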

## Practice Steps

- Write a demo script with specific steps, timings, and queries.
- Include one in-KB example and one out-of-KB example.
- Include explicit evidence artifacts (metrics + labeled failures) and a roadmap.

### Sample code

A minimal template for the demo steps:


```python
# High-level outline; each entry becomes one timed step in the full script.
DEMO_STEPS = [
    'state problem and constraints',
    'show architecture diagram',
    'run in-KB query with citations',
    'run out-of-KB query with clarify/refuse',
    'show metrics and failures',
]

print(DEMO_STEPS)
```


### Student fill-in

Add a timing to each step and replace the `<...>` placeholders with specific queries.


```python
DEMO_SCRIPT = [
    {
        "time_s": 60,
        "step": "Problem + constraints",
        "say": "We built a RAG assistant for <domain>. It must answer with citations, refuse/clarify when evidence is missing, and stay under <latency_target>."
    },
    {
        "time_s": 90,
        "step": "Architecture + data flow",
        "say": "Request goes: /chat → /search → context assembly → LLM → citation validation → response (mode + citations)."
    },
    {
        "time_s": 120,
        "step": "In-KB live run",
        "do": "Call /chat with a known in-KB question; show answer + citations; optionally show /search output.",
        "example_question": "<in_kb_question_here>",
    },
    {
        "time_s": 60,
        "step": "Out-of-KB failure case",
        "do": "Call /chat with an out-of-KB question; show mode=clarify/refuse; explain deterministic rule.",
        "example_question": "<out_of_kb_question_here>",
    },
    {
        "time_s": 60,
        "step": "Evidence pack",
        "do": "Open runs/<run_id>/metrics.json and failures.json; show 1–2 headline metrics and 1–2 labeled failures.",
    },
    {
        "time_s": 30,
        "step": "Roadmap",
        "say": "Next iterations: (1) fix failure type X via change Y, (2) improve metric Z via change W."
    },
]

for s in DEMO_SCRIPT:
    print(s)
```
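
Before rehearsing, it can help to check the script mechanically. The sketch below assumes the same dictionary keys as the script above (`time_s`, `step`, `example_question`); the function name `check_demo_script`, the 600-second budget, and the sample question are made up for illustration.

```python
def check_demo_script(script, max_total_s=600):
    """Return (total_seconds, problems) for a list of demo step dicts.

    max_total_s is an assumed time budget; adjust it to your demo slot.
    """
    problems = []

    # Total planned time across all steps.
    total = sum(step.get("time_s", 0) for step in script)
    if total > max_total_s:
        problems.append(f"total time {total}s exceeds {max_total_s}s")

    # Required step types, matched by keyword in the step names.
    names = " ".join(step.get("step", "").lower() for step in script)
    for required in ("in-kb", "out-of-kb", "evidence"):
        if required not in names:
            problems.append(f"missing step: {required}")

    # Questions still left as <placeholder> text.
    for step in script:
        question = step.get("example_question", "")
        if question.startswith("<") and question.endswith(">"):
            problems.append(f"placeholder question in step: {step.get('step', '?')}")

    return total, problems


# Small sample script with two deliberate gaps (no evidence step, one placeholder).
sample = [
    {"time_s": 120, "step": "In-KB live run",
     "example_question": "<in_kb_question_here>"},
    {"time_s": 60, "step": "Out-of-KB failure case",
     "example_question": "What is the refund policy for product Q?"},
]

total, problems = check_demo_script(sample)
print(total)
for p in problems:
    print("-", p)
```

Running the same check on your own `DEMO_SCRIPT` flags any placeholder questions that have not been filled in yet.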

## Self-check

- Is a failure case included?
- Are evidence artifacts referenced?


## Template: Defense narrative (fill-in)

Use this as a short preparation worksheet.

Goal: make your claims falsifiable (tied to metrics/artifacts), and keep tradeoffs explicit.

### What you will show

- in-KB question + citations
- out-of-KB question + clarify/refuse behavior
- a small evidence pack (metrics + failures)

### What you will claim

- one sentence describing what the system does
- one sentence describing a known limitation

### What you will prove with artifacts

- one metric movement (before → after)
- one failure case you understand (label + mitigation or planned fix)
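
A "before → after" metric movement can be computed directly from two runs. This is a rough sketch that assumes each `metrics.json` is a flat JSON object mapping metric names to numbers, which may not match your actual file layout. The helper names, the metric names, and the numbers are placeholders, not real results.

```python
import json
from pathlib import Path


def load_metrics(run_dir):
    """Read metrics.json from a run directory (assumed flat {name: number} layout)."""
    return json.loads(Path(run_dir, "metrics.json").read_text(encoding="utf-8"))


def metric_deltas(baseline, improved):
    """Return improved - baseline for every metric present in both runs."""
    shared = sorted(set(baseline) & set(improved))
    return {name: round(improved[name] - baseline[name], 4) for name in shared}


# In-memory stand-ins for two runs; values are made up for illustration.
baseline = {"recall_at_5": 0.60, "refusal_accuracy": 0.80}
improved = {"recall_at_5": 0.75, "refusal_accuracy": 0.80, "p95_latency_ms": 900}

deltas = metric_deltas(baseline, improved)
for name, delta in deltas.items():
    print(f"{name}: {baseline[name]} -> {improved[name]} ({delta:+})")
```

Metrics that exist in only one run (here `p95_latency_ms`) are skipped, so the comparison never reports a movement that has no baseline.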

```python
DEFENSE_TEMPLATE = """Architecture summary:
- Components:
- Data flow:

Tradeoffs (name the tradeoff + what you chose):
- precision vs recall:
- latency vs reliability:
- strict validation vs UX:

Evidence:
- baseline run_id:
- improved run_id:
- headline metric(s) that moved:
- 1-2 labeled failures (root cause + note):

Roadmap (next 2 iterations tied to evidence):
- 
- 
"""

print(DEFENSE_TEMPLATE)
```

## Self-check

- Can you demo without editing code live?
- Do you have a clear failure case story?
- Can you point to concrete artifacts (metrics + failures) for your claims?
- Are your tradeoffs explicit (not hand-wavy)?