
# 01\_Prompt\_Management\_and\_Eval\_for\_Agentic\_AI

## 🤔 Why it matters

* 🧩 Prompts are **product logic** for LLMs/agents.
* 🔁 Without versioning & eval, you can’t **reproduce**, **compare**, or **roll back**.

---

## 🧱 Prompt management (treat prompts as code)

* 🧩 **Anatomy**: `system` 🎛️ + `instructions` 📜 + `few-shot` 🧪 + `schema` 🧾 + `tools` 🔧.
* 🏷️ **IDs & versions**: `prompt_id`, `version`, **content\_hash**.
* 📦 **Artifacts**: store **template** + **rendered examples** + **changelog**.
* 🔣 **Variables**: define a **schema** (types, defaults, validators).
* 🧼 **Style**: deterministic format (JSON output spec, stop sequences, delimiters).
* 🛡️ **Safety**: built-in rules (PII, refusal policy), jailbreak resistance snippets.
* 🌐 **Locale**: plan for i18n; avoid culture-locked phrasing.
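
A minimal sketch of the ID/version/hash and variable-schema bullets above, using only the standard library. The `PromptVersion` class, its field names, and the 12-character hash length are illustrative assumptions, not a fixed convention.

```python
import hashlib
import json
from dataclasses import dataclass, field
from string import Template


@dataclass(frozen=True)
class PromptVersion:
    """Hypothetical record for one immutable prompt version."""
    prompt_id: str
    version: str
    template: str  # string.Template syntax: $name placeholders
    # Variable schema: name -> (expected type, default); default None means required.
    variables: dict = field(default_factory=dict)

    @property
    def content_hash(self) -> str:
        # Hash the template plus the variable schema so any change is detectable.
        schema = {name: [typ.__name__, default]
                  for name, (typ, default) in self.variables.items()}
        payload = json.dumps({"template": self.template, "variables": schema},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

    def render(self, **values) -> str:
        # Reject variables the schema does not declare.
        unknown = set(values) - set(self.variables)
        if unknown:
            raise ValueError(f"unknown variables: {sorted(unknown)}")
        resolved = {}
        for name, (typ, default) in self.variables.items():
            value = values.get(name, default)
            if value is None:
                raise ValueError(f"missing required variable: {name}")
            if not isinstance(value, typ):
                raise TypeError(f"{name} must be {typ.__name__}")
            resolved[name] = value
        return Template(self.template).substitute(resolved)


summarize_v1 = PromptVersion(
    prompt_id="summarize",
    version="1.0.0",
    template="Summarize the text in $max_words words or fewer:\n$text",
    variables={"text": (str, None), "max_words": (int, 50)},
)
rendered = summarize_v1.render(text="hello")
print(summarize_v1.prompt_id, summarize_v1.version, summarize_v1.content_hash)
print(rendered)
```

Storing `rendered` next to the template gives the "rendered examples" artifact; two versions with the same `content_hash` are byte-identical in template and schema.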

---

## 🕹️ Agentic AI specifics

* 🧭 **Planning prompts**: decompose → plan → execute → reflect.
* 🔧 **Tool use**: clarify **function schemas**; require **JSON args**.
* 🧠 **Memory**: retrieval prompt for **context selection**; cap context to a **token budget**.
* 🔁 **Self-reflection**: critique prompt → revise answer/tool args when needed.
* 🛑 **Termination**: explicit **done criteria** + max steps/timeouts.
* 🧯 **Error handling**: retry/backoff prompts; guard invalid tool calls.
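
The termination and error-handling bullets can be sketched as a small loop. Everything here is an assumption for illustration: the message format (`{"tool": ..., "args": ...}` or `{"done": true, ...}`), the `TOOLS` registry, and `scripted_model`, which stands in for a real LLM call.

```python
import json

# Hypothetical tool registry: name -> (arg schema, implementation).
TOOLS = {
    "add": ({"a": (int, float), "b": (int, float)}, lambda a, b: a + b),
}


def validate_call(message):
    """Return (tool_name, args) for a well-formed call, else raise ValueError."""
    if not isinstance(message, dict):
        raise ValueError("tool call must be a JSON object")
    name, args = message.get("tool"), message.get("args")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    schema, _ = TOOLS[name]
    if not isinstance(args, dict) or set(args) != set(schema):
        raise ValueError(f"{name} expects exactly these args: {sorted(schema)}")
    for key, typ in schema.items():
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(args[key], bool) or not isinstance(args[key], typ):
            raise ValueError(f"arg {key!r} has the wrong type")
    return name, args


def run_agent(model_step, task, max_steps=5):
    """Loop until the model declares done or the step budget runs out."""
    history = [f"task: {task}"]
    invalid_calls = 0
    for step in range(1, max_steps + 1):
        raw = model_step(history)
        try:
            message = json.loads(raw)  # JSONDecodeError is a ValueError
            if isinstance(message, dict) and message.get("done"):
                return {"status": "done", "answer": message.get("answer"),
                        "steps": step, "invalid_calls": invalid_calls}
            name, args = validate_call(message)
        except ValueError as exc:
            invalid_calls += 1
            history.append(f"error: {exc}")  # feed the error back for a retry
            continue
        result = TOOLS[name][1](**args)
        history.append(f"{name}{args} -> {result}")
    return {"status": "max_steps", "answer": None,
            "steps": max_steps, "invalid_calls": invalid_calls}


def scripted_model(history):
    """Stand-in for an LLM: one bad call, one good call, then done."""
    script = [
        '{"tool": "add", "args": {"a": "2", "b": 3}}',
        '{"tool": "add", "args": {"a": 2, "b": 3}}',
        '{"done": true, "answer": 5}',
    ]
    return script[len(history) - 1]


outcome = run_agent(scripted_model, "add 2 and 3")
print(outcome)
```

The returned `invalid_calls` and `steps` counts feed directly into the invalid-arg rate and steps-per-task metrics.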

---

## 📊 What to measure (metrics)

* 📝 **Task quality**: EM/F1/ROUGE/BLEU or rubric score ✅
* 💬 **Preference**: win-rate (pairwise A/B) 🥇
* 🔧 **Agent/tooling**: tool-selection accuracy, **invalid-arg rate**, success\@k 🧰
* 🧩 **Process**: steps per task, re-plan rate, stuck rate 🔄
* ⏱️ **Latency** (p50/p95) & 💸 **Cost** (tokens in/out)
* 🧯 **Safety**: toxicity/PII/unsafe-tool flags
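
As a rough illustration of the agent, process, and latency metrics above, the sketch below aggregates per-task run records. The record fields (`tool_calls`, `invalid_calls`, `steps`, `latency_ms`) are assumed names, and the percentile uses the nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_runs(runs):
    """Aggregate per-task records into the metrics listed above."""
    tool_calls = sum(r["tool_calls"] for r in runs)
    invalid = sum(r["invalid_calls"] for r in runs)
    latencies = [r["latency_ms"] for r in runs]
    return {
        "success_rate": sum(r["success"] for r in runs) / len(runs),
        "invalid_arg_rate": invalid / tool_calls if tool_calls else 0.0,
        "avg_steps": sum(r["steps"] for r in runs) / len(runs),
        "latency_mean_ms": sum(latencies) / len(latencies),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
    }


runs = [
    {"success": True, "tool_calls": 4, "invalid_calls": 0, "steps": 3, "latency_ms": 100},
    {"success": True, "tool_calls": 3, "invalid_calls": 1, "steps": 4, "latency_ms": 200},
    {"success": False, "tool_calls": 2, "invalid_calls": 1, "steps": 6, "latency_ms": 300},
    {"success": True, "tool_calls": 1, "invalid_calls": 0, "steps": 2, "latency_ms": 1000},
]
summary = summarize_runs(runs)
print(summary)
```

In this toy data the mean latency is 400 ms while p95 is 1000 ms, which is why the tail is tracked separately from the average.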

---

## 🧪 Evaluation modes

* 🧰 **Offline** (fast, repeatable)

  * 🔬 **Gold set** (balanced, tricky, adversarial).
  * 🧑‍⚖️ **LLM-as-Judge** with **strict rubric** + calibration items.
  * 🔁 **Robustness**: paraphrase, noise, order shuffle, seed sweep.
* 🌐 **Online** (real traffic)

  * 🐤 **Canary/A-B** with guardrails; measure preference, SLOs, safety.
  * 🌗 **Shadow**: run the new prompt on live traffic alongside the current one; score its outputs without serving them.
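
One way to sketch the offline robustness check (noise, order shuffle, seed sweep) is shown below. The `model(few_shots, question)` callable, the noise function, and the toy gold set are all hypothetical stand-ins; a paraphrase perturbation would typically need a second model and is left out.

```python
import random


def add_noise(text, rng):
    """Cheap surface noise: random casing plus stray whitespace."""
    noisy = text.upper() if rng.random() < 0.5 else text.lower()
    return f"  {noisy} "


def exact_match(pred, gold):
    return pred.strip().lower() == gold.strip().lower()


def robustness_report(model, gold_set, few_shots, seeds):
    """Score the gold set once per seed with shuffled few-shots and noisy inputs."""
    per_seed = []
    for seed in seeds:
        rng = random.Random(seed)
        shots = list(few_shots)
        rng.shuffle(shots)  # order-shuffle perturbation
        hits = sum(
            exact_match(model(shots, add_noise(question, rng)), gold)
            for question, gold in gold_set
        )
        per_seed.append(hits / len(gold_set))
    return {"min": min(per_seed), "max": max(per_seed),
            "mean": sum(per_seed) / len(per_seed)}


GOLD = [("capital of france?", "Paris"), ("2 + 2?", "4")]
ANSWERS = dict(GOLD)


def robust_model(shots, question):
    """Stand-in model that normalizes its input before answering."""
    return ANSWERS.get(question.strip().lower(), "")


def brittle_model(shots, question):
    """Stand-in model that only works on the exact gold phrasing."""
    return ANSWERS.get(question, "")


few_shots = ["example 1", "example 2", "example 3"]
robust = robustness_report(robust_model, GOLD, few_shots, seeds=range(5))
brittle = robustness_report(brittle_model, GOLD, few_shots, seeds=range(5))
print(robust, brittle)
```

Reporting the minimum across seeds, not just the mean, surfaces prompts that only work for one lucky ordering.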

---

## ⚖️ LLM-as-Judge best practices

* 📋 Use **criterion-by-criterion** scoring (faithfulness, completeness, style).
* 🧪 Include **reference** + **retrieved context** for grounding.
* 🔁 **Blind pairwise** comparison (hide which variant produced each answer, randomize answer order) with a **tie** option; **two judges + tiebreaker** for important gates.
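
The pairwise setup can be sketched as follows. The judge interface (a callable returning `"first"`, `"second"`, or `"tie"`) and `length_judge` are assumptions standing in for real LLM judge calls; counting a tie as half a win is one convention among several.

```python
import random


def blind_pairwise(judge, prompt, answer_a, answer_b, rng):
    """Show the two answers in random order; map the verdict back to 'A', 'B', or 'tie'."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(prompt, first, second)
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # If the order was swapped, the first slot held B.
    return "B" if picked_first == swapped else "A"


def gated_verdict(judge_1, judge_2, tiebreaker, prompt, answer_a, answer_b, rng):
    """Two judges vote; a third judge decides only when they disagree."""
    v1 = blind_pairwise(judge_1, prompt, answer_a, answer_b, rng)
    v2 = blind_pairwise(judge_2, prompt, answer_a, answer_b, rng)
    if v1 == v2:
        return v1
    return blind_pairwise(tiebreaker, prompt, answer_a, answer_b, rng)


def win_rate(verdicts, side="B"):
    """Share of comparisons won by `side`; a tie counts as half a win (assumption)."""
    wins = sum(v == side for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)


def length_judge(prompt, first, second):
    """Toy judge that prefers the longer answer; a real judge would call an LLM."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


def tie_judge(prompt, first, second):
    return "tie"


rng = random.Random(0)
verdict = gated_verdict(length_judge, tie_judge, length_judge,
                        "Explain caching.", "Short.", "A longer, fuller answer.", rng)
print(verdict, win_rate(["B", "B", "A", "tie"]))
```

Because the presentation order is randomized per comparison, a judge that always favors the first slot ends up near 50% instead of silently favoring one variant.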

---

## 📓 MLflow integration (what to log)

* 🏷️ **Params**: `prompt.id`, `prompt.version`, `prompt.hash`, `llm.model`, `temperature`, `rag.k`, `tools.enabled`.
* ⏱️ **Metrics**: quality (EM/F1/pref), tool success, invalid-arg rate, steps, latency p95, tokens, cost, safety flags.
* 📦 **Artifacts**: prompt template, rendered prompts, eval set snapshot, judge rubric, comparison report, traces.
* 🧭 **Tags**: `task=agentic-planner`, `release=candidate-N`, `dataset=vX`.
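
A hedged sketch of how the params, metrics, tags, and one artifact above might be logged. `build_run_payload` and `log_eval_run` are hypothetical helpers; the MLflow calls used (`start_run`, `set_tags`, `log_params`, `log_metrics`, `log_text`) are part of the MLflow tracking API. The values and artifact path are illustrative, and `mlflow` is imported lazily so the payload helper runs without it installed.

```python
def build_run_payload(prompt_id, version, content_hash, model, temperature,
                      metrics, release, dataset):
    """Assemble params, metrics, and tags using the key names from the list above."""
    params = {
        "prompt.id": prompt_id,
        "prompt.version": version,
        "prompt.hash": content_hash,
        "llm.model": model,
        "temperature": temperature,
    }
    tags = {"task": "agentic-planner", "release": release, "dataset": dataset}
    # MLflow metrics must be numeric, so coerce to float up front.
    numeric_metrics = {name: float(value) for name, value in metrics.items()}
    return params, numeric_metrics, tags


def log_eval_run(params, metrics, tags, template_text):
    """Log one offline eval run; requires the `mlflow` package."""
    import mlflow  # lazy import: only needed when actually logging

    with mlflow.start_run():
        mlflow.set_tags(tags)
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        # Store the exact template that was evaluated as a text artifact.
        mlflow.log_text(template_text, "prompt/template.txt")


params, metrics, tags = build_run_payload(
    prompt_id="planner",
    version="1.2.0",
    content_hash="ab12cd34ef56",  # illustrative value
    model="example-model",        # placeholder model name
    temperature=0.0,
    metrics={"pref_win_rate": 0.64, "invalid_arg_rate": 0.01, "latency_p95_ms": 980},
    release="candidate-3",
    dataset="v2",
)
print(params, metrics, tags)
```

Calling `log_eval_run(params, metrics, tags, template_text)` then records the run against whatever tracking URI is configured.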

---

## 🚦 Promotion gates (example)

* ✅ **Pref win-rate ≥ 60%** (vs current)
* 🧰 **Invalid-arg rate ≤ 2%**
* 🧯 **Safety pass** (no criticals)
* ⏱️ **p95 latency ≤ 1.2s** & 💰 **\$ ≤ budget**
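
The example gates can be expressed as data and checked in one function. The metric names and the cost budget of 0.002 USD per request are assumptions for illustration; the other thresholds mirror the list above.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# Thresholds from the example gates; metric names and cost budget are assumed.
GATES = {
    "pref_win_rate": (">=", 0.60),
    "invalid_arg_rate": ("<=", 0.02),
    "critical_safety_flags": ("<=", 0),
    "latency_p95_s": ("<=", 1.2),
    "cost_per_request_usd": ("<=", 0.002),
}


def evaluate_gates(metrics, gates=GATES):
    """Return (passed, failures); a missing metric counts as a failure."""
    failures = []
    for name, (op, threshold) in gates.items():
        if name not in metrics:
            failures.append(f"{name}: missing")
        elif not OPS[op](metrics[name], threshold):
            failures.append(f"{name}: {metrics[name]} is not {op} {threshold}")
    return (not failures), failures


candidate = {
    "pref_win_rate": 0.64,
    "invalid_arg_rate": 0.01,
    "critical_safety_flags": 0,
    "latency_p95_s": 1.5,  # breaches the latency gate
    "cost_per_request_usd": 0.0015,
}
passed, failures = evaluate_gates(candidate)
print(passed, failures)
```

Treating a missing metric as a failure keeps a candidate from being promoted just because part of the eval did not run.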

---

## ✅ Management checklist

* [ ] Versioned prompt with **hash + changelog**
* [ ] Variable **schema & validation**
* [ ] Tool/function **schemas** clear & tested
* [ ] Safety clauses + jailbreak tests
* [ ] Saved **rendered examples** as artifacts

## ✅ Eval checklist

* [ ] Fixed **gold set** (incl. adversarial)
* [ ] **Rubric** + judge prompt frozen
* [ ] Offline scores logged to MLflow
* [ ] Online **canary/A-B** plan + alerts
* [ ] Gates configured → **promote/rollback** via Registry

---

## ⚠️ Common pitfalls

* 🌀 Changing prompts mid-experiment 🤦 → no comparability
* 🧪 Judge leakage (judge sees variant identity or gold labels it is not meant to use) → biased scores
* 🔧 Vague tool instructions → invalid calls & loops
* 📈 Optimizing for average only → p95 SLO breaches

---

## 🚀 Quick wins

* 🧱 Add a **JSON output schema** and validate it.
* 🧪 Use **pairwise preference** over raw EM when answers are open-ended.
* 🏷️ Create MLflow saved view: **“pref≥0.6 & p95≤1200ms & \$≤0.002”**.
* 🔁 Keep a **1-click rollback** (aliases: `champion`/`challenger`).
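
The alias-based rollback idea can be illustrated with a tiny in-memory registry. `AliasRegistry` is a hypothetical stand-in written for this sketch, not MLflow's API; a real registry would persist the alias-to-version mapping.

```python
class AliasRegistry:
    """Hypothetical in-memory map from alias to prompt version."""

    def __init__(self):
        self._aliases = {}
        self._previous = {}  # last value of each alias, kept for rollback

    def set_alias(self, alias, version):
        if alias in self._aliases:
            self._previous[alias] = self._aliases[alias]
        self._aliases[alias] = version

    def resolve(self, alias):
        return self._aliases[alias]

    def promote(self):
        """Point `champion` at whatever `challenger` currently is."""
        self.set_alias("champion", self.resolve("challenger"))

    def rollback(self, alias="champion"):
        """Restore the alias to the version it pointed at before the last change."""
        if alias not in self._previous:
            raise LookupError(f"no earlier version recorded for {alias!r}")
        self._aliases[alias] = self._previous.pop(alias)
        return self._aliases[alias]


registry = AliasRegistry()
registry.set_alias("champion", "1.0.0")
registry.set_alias("challenger", "1.1.0")
registry.promote()
after_promote = registry.resolve("champion")
after_rollback = registry.rollback()
print(after_promote, after_rollback)
```

Callers that always load the prompt by alias pick up a promotion or rollback without a code change.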

---

## 🗣️ One-liner

**“Manage prompts like code and evaluate like models—version, test, gate, and roll back with data.”**
