
# 01\_Reproducible\_GenAI\_Pipelines

## 🤔 Why it matters

* 🔁 **Same input + same config → same output** (or as close as the model allows) across laptops/servers.
* 🧪 **Fair comparisons** between prompts/models.
* 📝 **Audit & rollback** when things regress.

---

## 🧩 The recipe (end-to-end)

1. 🗂️ **Config first** — single YAML for all knobs (llm, rag, eval).
2. 🧾 **Fix seeds** — set Python `random`, NumPy, and torch seeds; hosted LLM APIs can still vary between calls, so treat determinism as best-effort.
3. 📦 **Pin env** — lock deps (`requirements.txt` or `conda.yaml`) or build a Docker image.
4. 🏷️ **Track runs** — MLflow **params/metrics/tags** every time.
5. 📚 **Snapshot data** — eval set & prompt bundle saved as artifacts.
6. 🤖 **Package model** — wrap pipeline as MLflow **pyfunc**.
7. 📚 **Register** — version in **Model Registry** (Staging → Prod).
8. 🧪 **CI gates** — auto-eval → promote only if thresholds pass.
9. 🔍 **Monitor** — latency/cost/quality/safety tracked post-deploy.
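
Steps 1 and 2 can be sketched with the standard library alone. This is a minimal illustration, not a prescribed layout: `PipelineConfig`, its fields and defaults, `config_fingerprint`, and `seed_everything` are all hypothetical names, and a dataclass stands in for the YAML file so the snippet runs without extra packages.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    # Illustrative knobs only; in practice these would be loaded from the YAML file.
    model_id: str = "example-llm"
    temperature: float = 0.0
    max_tokens: int = 256
    retriever_k: int = 5
    seed: int = 42


def config_fingerprint(cfg: PipelineConfig) -> str:
    """Hash the config with sorted keys so identical knobs give an identical ID."""
    canonical = json.dumps(asdict(cfg), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def seed_everything(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


cfg = PipelineConfig()
seed_everything(cfg.seed)
print("config fingerprint:", config_fingerprint(cfg))
```

Logging the fingerprint next to the full config makes it easy to spot two runs that silently used different knobs.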

---

## 📌 What to **pin & version**

* 🤖 **LLM**: `model_id`, `temperature`, `max_tokens`.
* 🧠 **RAG**: `embed_model`, `chunk_size/overlap`, `retriever_k`, `reranker`.
* 📝 **Prompts**: `prompt_id`, `version`, **template hash**.
* 🗃️ **Data**: eval **snapshot + hash**.
* 🧱 **Index**: vector store **fingerprint/version**.
* ⚙️ **Env**: deps/Docker tag + **git commit**.
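
One way to produce the **template hash** and the eval **snapshot hash** is to run SHA-256 over a normalized form of the content. The sketch below assumes prompts are plain strings and eval rows are JSON-serializable dicts; `prompt_record` and `eval_snapshot_hash` are hypothetical helper names.

```python
import hashlib
import json


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def prompt_record(prompt_id: str, version: str, template: str) -> dict:
    """Describe a prompt by ID, version, and a short hash of its template."""
    # Normalize line endings so Windows and Unix checkouts hash the same.
    normalized = template.replace("\r\n", "\n")
    return {
        "prompt_id": prompt_id,
        "version": version,
        "template_hash": sha256_text(normalized)[:16],
    }


def eval_snapshot_hash(rows: list[dict]) -> str:
    """Hash eval rows as canonical JSON lines; row order is significant."""
    canonical = "\n".join(
        json.dumps(row, sort_keys=True, ensure_ascii=False) for row in rows
    )
    return sha256_text(canonical)[:16]


record = prompt_record("qa", "v1", "Answer the question: {question}\n")
snapshot = eval_snapshot_hash([{"question": "2+2?", "answer": "4"}])
print(record, snapshot)
```

If the hash changes but `version` did not, someone edited the prompt without bumping it.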

---

## 📊 Minimum to log (per run)

* 🏷️ **Params**: llm/rag/prompt config.
* ⏱️ **Metrics**: latency p50/p95, tokens, **\$ cost**, quality (exact match (EM), F1, preference win rate).
* 🧯 **Safety**: toxicity/PII flags, block rate.
* 📦 **Artifacts**: prompt template + renders, eval set, traces, report HTML.
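
The latency and cost metrics above can be derived from per-request records. This sketch assumes each record carries `latency_ms` and `tokens` and that pricing is a flat rate per 1k tokens; those field names, `summarize_run`, and the nearest-rank percentile choice are illustrative assumptions.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_run(requests: list[dict], usd_per_1k_tokens: float) -> dict:
    """Collapse per-request records into the per-run metrics worth logging."""
    latencies = [r["latency_ms"] for r in requests]
    tokens = sum(r["tokens"] for r in requests)
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "tokens_total": tokens,
        "cost_usd_per_q": tokens / 1000 * usd_per_1k_tokens / len(requests),
    }


requests = [{"latency_ms": 100 * i, "tokens": 500} for i in range(1, 11)]
print(summarize_run(requests, usd_per_1k_tokens=0.002))
```

A flat dict of numbers like this is the shape `mlflow.log_metrics` accepts, so it can be logged in a single call.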

---

## 🧪 Promotion gates (example)

* ✅ **Quality ≥** target (e.g., EM ≥ 0.6, preference win rate ≥ 60%).
* 🛡️ **Safety pass** (no critical violations).
* ⚡ **Latency ≤** SLO (e.g., p95 ≤ 1.2s).
* 💰 **Cost ≤** budget (e.g., ≤ \$0.002 per Q).

> Only then: **Registry → Staging → Production**.
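
The example gates can be expressed as data plus one small check that CI runs before promotion. The thresholds mirror the examples above; the metric names (`em`, `pref_win_rate`, and so on) and `check_gates` are hypothetical and should match whatever the eval harness actually logs.

```python
# Metric name -> (comparison, threshold). Names are illustrative placeholders.
GATES = {
    "em": (">=", 0.6),
    "pref_win_rate": (">=", 0.60),
    "critical_safety_violations": ("<=", 0),
    "latency_p95_ms": ("<=", 1200),
    "cost_usd_per_q": ("<=", 0.002),
}


def check_gates(metrics: dict, gates: dict = GATES) -> list[str]:
    """Return readable failures; an empty list means the run may be promoted."""
    failures = []
    for name, (op, threshold) in gates.items():
        if name not in metrics:
            # A missing metric counts as a failure rather than a silent pass.
            failures.append(f"{name}: missing metric")
            continue
        value = metrics[name]
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            failures.append(f"{name}: {value} violates {op} {threshold}")
    return failures


candidate = {
    "em": 0.64,
    "pref_win_rate": 0.58,
    "critical_safety_violations": 0,
    "latency_p95_ms": 1100,
    "cost_usd_per_q": 0.0015,
}
for failure in check_gates(candidate):
    print("BLOCKED:", failure)
```

Keeping the thresholds in one dict means a gate change shows up as a reviewable diff.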

---

## 🧠 RAG specifics to remember

* 🔢 **Embeddings** strictly pinned (model + dim).
* 🧩 **Chunking** settings logged (size/overlap/splitter).
* 🧷 **Context use rate** & **hit\@k** as first-class metrics.
* 🗂️ **Index rebuilds** create **new version IDs**.
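
Two of these points are easy to make concrete: computing **hit@k** and deriving an index version ID from the settings that built the index. The sketch assumes retrieval results and gold labels are lists of document IDs; `hit_at_k`, `index_version_id`, and the `idx-` prefix are hypothetical.

```python
import hashlib
import json


def hit_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc ID in the top-k results."""
    if not retrieved:
        return 0.0
    hits = sum(
        1 for docs, gold in zip(retrieved, relevant) if gold & set(docs[:k])
    )
    return hits / len(retrieved)


def index_version_id(
    embed_model: str, dim: int, chunk_size: int, overlap: int, corpus_hash: str
) -> str:
    """Derive an ID that changes whenever any index-building input changes."""
    payload = json.dumps(
        {
            "embed_model": embed_model,
            "dim": dim,
            "chunk_size": chunk_size,
            "overlap": overlap,
            "corpus_hash": corpus_hash,
        },
        sort_keys=True,
    )
    return "idx-" + hashlib.sha256(payload.encode("utf-8")).hexdigest()[:10]


retrieved = [["a", "b", "c"], ["d", "e", "f"]]
relevant = [{"c"}, {"x"}]
print(hit_at_k(retrieved, relevant, k=3))
print(index_version_id("example-embedder", 384, 512, 64, "corpus-v1"))
```

Because the ID is a function of its inputs, rebuilding with a different chunk size or embedding model cannot reuse an old version ID by accident.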

---

## ⚠️ Common pitfalls

* 🌀 Unpinned deps → “works on my machine”.
* 🧪 Changing eval set mid-experiment.
* 🔐 Storing secrets or raw PII in artifacts.
* 🏷️ Inconsistent names (`k` vs `topK`) breaking UI filters.
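
The last two pitfalls can be reduced with a small filter applied to params before they are logged. This is a sketch under assumptions: the alias table, the secret-key pattern, and `clean_params` are illustrative, and a key-name pattern will not catch PII inside values.

```python
import re

# Hypothetical alias table: every spelling maps to one canonical param name.
ALIASES = {"k": "retriever_k", "topk": "retriever_k", "top_k": "retriever_k"}

# Key names that look like credentials. Deliberately narrow so that
# legitimate params such as `max_tokens` are not dropped.
SECRET_KEY_PATTERN = re.compile(
    r"(api[_-]?key|secret|password|auth[_-]?token)", re.IGNORECASE
)


def clean_params(params: dict) -> dict:
    """Drop secret-looking keys and canonicalize param names before logging."""
    cleaned = {}
    for key, value in params.items():
        if SECRET_KEY_PATTERN.search(key):
            continue  # never write credentials into run params or artifacts
        # If two aliases of the same param are present, the later one wins.
        cleaned[ALIASES.get(key.lower(), key)] = value
    return cleaned


print(clean_params({"topK": 5, "EXAMPLE_API_KEY": "not-a-real-key", "max_tokens": 256}))
```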

---

## 🚀 Quick wins

* 🧩 Treat prompts as **code** (versioned + hashed).
* 🔁 Re-run a **fixed eval harness** on every PR.
* 🧭 Save a **`git_commit`** tag on each run.
* 📈 Keep a reusable MLflow run filter (e.g., “**EM≥0.6 & p95≤1200ms & \$≤0.002**”) so runs that pass the gates are one search away.
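
Capturing the `git_commit` tag can be a few lines, assuming the run starts inside a git checkout with `git` on the PATH. `current_git_commit` is a hypothetical helper; it falls back to a placeholder instead of failing the run.

```python
import subprocess


def current_git_commit(default: str = "unknown") -> str:
    """Return the current commit SHA, or `default` outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a repository, or git is not installed.
        return default
    return result.stdout.strip()


run_tags = {"git_commit": current_git_commit()}
print(run_tags)
```

The resulting dict can be passed to `mlflow.set_tags` once a run is active.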

---

## 🗣️ One-liner

**“Reproducible GenAI = pinned configs + tracked runs + versioned artifacts + gated releases.”**
