
# 01 MLflow Models for LLM Inference

## 🤔 What is it?

* 📦 **MLflow Models** = portable bundles that expose a **predict()** API.
* 🧰 Use the **pyfunc flavor** to wrap **LLM or full RAG pipelines** (pre → gen → post).

---

## 🧩 What goes inside the model

* 🧠 **Code**: pre/post-processing, routing, safety checks, retriever calls.
* 📝 **Prompts**: templates + version/hash (as artifacts).
* ⚙️ **Config**: YAML for LLM/RAG knobs (model\_id, top\_k, reranker, temperature).
* 📚 **Signature**: input/output schema for safe serving.
* 📎 **Conda env / requirements**: pinned dependencies for a reproducible runtime.

> 🔑 Treat the model as a **shim** that may call external LLM APIs (OpenAI, etc.) or local inference.
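
One way to get the "version/hash" for a prompt artifact is to hash the template text itself. The sketch below assumes SHA-256 truncated to 12 hex characters; the function name, the template, and the metadata keys are illustrative, not part of MLflow.

```python
import hashlib


def prompt_fingerprint(template: str, length: int = 12) -> str:
    """Return a short, stable hash of a prompt template's exact text."""
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical template; any whitespace change yields a different fingerprint.
TEMPLATE = "Answer using only the context.\nContext: {context}\nQuestion: {question}"

# Metadata you might store next to the template artifact (keys are assumptions).
prompt_meta = {"name": "rag_answer", "fingerprint": prompt_fingerprint(TEMPLATE)}
```

Storing the fingerprint alongside each logged model makes it easy to tell whether two model versions used the same prompt text.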

---

## 🧪 Flavors you’ll use

* 🧩 **pyfunc** (universal) → you implement `predict(self, context, model_input)` on a `PythonModel`; callers of the loaded model see `predict(data)`.
* 🤖 (Optional) **transformers** flavor when you ship local HF models.
* 🧵 For RAG, stick to **pyfunc** so you can orchestrate retrieval + generation together.
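
A minimal sketch of a RAG pipeline behind one pyfunc-style `predict()`. The `retrieve` and `generate` helpers and the toy corpus are hypothetical stand-ins for a real retriever and LLM call, and the import is guarded only so the sketch also runs where MLflow is not installed.

```python
try:
    from mlflow.pyfunc import PythonModel  # real base class when MLflow is installed
except ImportError:  # fallback so the sketch still runs without MLflow
    PythonModel = object


def retrieve(question, top_k=2):
    """Hypothetical retriever: rank a toy corpus by word overlap."""
    corpus = {
        "doc-1": "MLflow models expose a predict API.",
        "doc-2": "The pyfunc flavor wraps arbitrary Python code.",
        "doc-3": "Aliases point at registered model versions.",
    }
    words = set(question.lower().split())
    ranked = sorted(
        corpus.items(),
        key=lambda item: len(words & set(item[1].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def generate(question, passages):
    """Hypothetical generator: stands in for an external or local LLM call."""
    return f"Answer to '{question}' based on {len(passages)} passage(s)."


class RagPipeline(PythonModel):
    """pre -> retrieve -> generate -> post, behind a single predict()."""

    def predict(self, context, model_input, params=None):
        # This sketch accepts a list of dict rows; a served model may
        # receive a DataFrame instead, depending on the signature.
        results = []
        for row in model_input:
            question = row["question"].strip()                   # pre-processing
            passages = retrieve(question, row.get("top_k", 2))   # retrieval
            answer = generate(question, passages)                # generation
            results.append({                                     # post-processing
                "answer": answer,
                "context_used": [doc_id for doc_id, _ in passages],
            })
        return results
```

Calling `RagPipeline().predict(None, [{"question": "..."}])` exercises the whole chain locally before the model is ever logged.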

---

## 🔌 Serving options

* ▶️ **Local**: `mlflow models serve -m <path_or_runs_uri> --port 5000`
* 🧱 **Docker**: `mlflow models build-docker -m ...`
* ☁️ **Remote**: register → serve behind your API gateway/load balancer.

> 📡 The served model exposes a simple **HTTP/JSON** endpoint (`POST /invocations`) that calls your `predict()`.
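
As a rough illustration of calling a served model, the sketch below builds a request for the scoring server's `/invocations` route using the `dataframe_records` payload format. The URL, port, and helper name are assumptions; the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_invocation_request(base_url, rows):
    """Build a POST request carrying one JSON record per input row."""
    body = json.dumps({"dataframe_records": rows}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/invocations",
        data=body,  # a request with a body is sent as POST
        headers={"Content-Type": "application/json"},
    )


# Hypothetical local endpoint started with `mlflow models serve ... --port 5000`.
req = build_invocation_request(
    "http://127.0.0.1:5000", [{"question": "What is a pyfunc model?"}]
)
# To actually send it: urllib.request.urlopen(req, timeout=30)
```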

---

## 🧾 Recommended I/O schema (LLM/RAG)

**Inputs**

* `question: str`
* `chat_history: list[dict]` *(opt)*
* `retrieval: { top_k:int, filters:dict }` *(opt)*
* `meta: { user_id:str, request_id:str }` *(opt)*

**Outputs**

* `answer: str`
* `context_used: list[str]` *(doc IDs or snippets)*
* `scores: { latency_ms:int, tokens_in:int, tokens_out:int, cost_usd:float }`
* `flags: { safety_blocked: bool }`

> 🧯 Keep **sensitive data out** of the outputs/artifacts.
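
The schema above could be mirrored in plain dataclasses so that requests are validated before they reach the pipeline. This is a sketch only: the class names, the `parse_request` helper, and the default `top_k` of 4 are assumptions, not MLflow APIs.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Retrieval:
    top_k: int = 4  # assumed default, not taken from the notes
    filters: dict = field(default_factory=dict)


@dataclass
class RagRequest:
    question: str
    chat_history: list = field(default_factory=list)
    retrieval: Retrieval = field(default_factory=Retrieval)
    meta: dict = field(default_factory=dict)


@dataclass
class RagResponse:
    answer: str
    context_used: list
    scores: dict
    flags: dict


def parse_request(payload: dict) -> RagRequest:
    """Validate a raw dict and fill in the optional fields."""
    question = payload.get("question")
    if not isinstance(question, str) or not question.strip():
        raise ValueError("'question' must be a non-empty string")
    return RagRequest(
        question=question.strip(),
        chat_history=list(payload.get("chat_history", [])),
        # Unknown retrieval keys raise TypeError, which rejects schema drift.
        retrieval=Retrieval(**payload.get("retrieval", {})),
        meta=dict(payload.get("meta", {})),
    )


# Example response serialized to a plain dict for JSON output.
example = asdict(RagResponse("ok", ["doc-1"], {"latency_ms": 12}, {"safety_blocked": False}))
```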

---

## 📈 Observability hooks (at inference)

* ⏱️ Log **latency/tokens/cost** to your telemetry (and optionally to MLflow via batch jobs).
* 📚 Attach **trace IDs** so offline analyzers can join requests ↔ runs.
* 🧯 Emit **safety outcomes** (toxicity/PII) as counters.
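
These hooks can be sketched as a thin wrapper around the prediction call that measures latency and emits one structured log record per request. The logger name, the record fields, and the assumption that `predict_fn` returns a dict with token counts are all illustrative.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("rag.telemetry")  # hypothetical logger name


def predict_with_telemetry(predict_fn, payload, trace_id=None):
    """Call predict_fn(payload) and return (output, telemetry_record)."""
    trace_id = trace_id or uuid.uuid4().hex  # generate an ID if the caller sent none
    start = time.perf_counter()
    output = predict_fn(payload)
    latency_ms = int((time.perf_counter() - start) * 1000)
    record = {
        "trace_id": trace_id,
        "latency_ms": latency_ms,
        "tokens_in": output.get("tokens_in", 0),
        "tokens_out": output.get("tokens_out", 0),
        "safety_blocked": bool(output.get("safety_blocked", False)),
    }
    # One JSON line per request; a log shipper or batch job can forward it.
    logger.info(json.dumps(record))
    return output, record
```

Note that the record carries counts and IDs only, never the prompt or answer text.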

---

## 🔐 Secrets & config

* ❌ Don’t bake API keys into the model.
* ✅ Read creds from **env/secret manager** at runtime.
* 🧭 Keep **model config** (prompt IDs, thresholds) in artifacts or a small YAML.
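
A minimal sketch of reading a credential at runtime instead of packaging it. The variable name `LLM_API_KEY` is a placeholder; use whatever your provider or secret manager injects.

```python
import os


def load_api_key(var_name="LLM_API_KEY"):
    """Fetch the key from the environment, failing fast if it is absent."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"Missing credential: set the {var_name} environment variable"
        )
    return key
```

Calling this when the model loads, rather than on the first request, surfaces a missing secret at deploy time.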

---

## 🔁 Lifecycle with Registry

1. 📦 **Log** the pyfunc model with artifacts & signature.
2. 📚 **Register** → versioned entry.
3. 🧪 **Stage gates** (eval harness, safety checks, cost SLO).
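4. 🚦 Promote: **Staging → Production**, or point **aliases** (`champion`, `canary`) at the approved version; recent MLflow releases deprecate stages in favor of aliases.
5. 🔙 Roll back by switching the version or alias; no code changes needed.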
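
The promote and rollback steps can be sketched with registry aliases. `model_uri` is a hypothetical helper that builds the two `models:/` URI forms; `point_alias` wraps `MlflowClient.set_registered_model_alias` and assumes an MLflow release with alias support plus a reachable registry.

```python
def model_uri(name, alias=None, version=None):
    """Build a registry URI from exactly one of alias or version."""
    if (alias is None) == (version is None):
        raise ValueError("pass exactly one of alias or version")
    if alias is not None:
        return f"models:/{name}@{alias}"
    return f"models:/{name}/{version}"


def point_alias(name, alias, version):
    """Promote or roll back by moving an alias to a given version."""
    # Imported lazily so the URI helper works without MLflow installed.
    from mlflow import MlflowClient

    MlflowClient().set_registered_model_alias(name, alias, str(version))
```

With this layout the serving code always loads `model_uri("rag-bot", alias="champion")` (the model name is illustrative), and a rollback is just `point_alias` with the previous version number.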

---

## ⚠️ Gotchas

* 🧪 Mismatched signatures → failed requests; **lock schema** early.
* 🌐 External LLM rate limits → add **retries/backoff** in `predict()`.
* 🧵 Concurrency: keep the model stateless across requests; persist session state (e.g. chat history) in an external store.
* 🧱 Heavy vector indexes inside the model = bloated images → store **externally** and version by **fingerprint**.
* 🔒 Never log raw prompts containing secrets.
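
For the rate-limit gotcha, a retry loop with exponential backoff and jitter might look like the sketch below. `RateLimitError` is a hypothetical placeholder for whatever exception your LLM client raises, and the attempt count and delays are arbitrary defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The `sleep` parameter is injectable so tests can run without real waiting.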

---

## 🚀 Quick wins

* 🧩 Ship **full RAG** as one pyfunc → simpler deploy/SLOs.
* 🏷️ Add `model_card.md` artifact with usage & limits.
* 📊 Return `context_used` plus a `trace_id` (for example, echoing `meta.request_id`) for debuggability.
* 🧪 Keep a **smoke test input** artifact; run it in CI before promote.
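
A smoke test for CI can be as small as the sketch below: load the stored input, call the model, and check the output shape. The required keys follow the recommended output schema in these notes; the function name and error messages are assumptions.

```python
import json

REQUIRED_OUTPUT_KEYS = {"answer", "context_used", "scores", "flags"}


def run_smoke_test(predict_fn, smoke_input_json):
    """Run one stored input through predict_fn and check the output shape."""
    payload = json.loads(smoke_input_json)
    output = predict_fn(payload)
    missing = REQUIRED_OUTPUT_KEYS - set(output)
    if missing:
        raise AssertionError(f"smoke test output is missing keys: {sorted(missing)}")
    if not isinstance(output["answer"], str) or not output["answer"]:
        raise AssertionError("smoke test returned an empty answer")
    return output
```

In CI, `smoke_input_json` would be read from the artifact logged with the model, and a raised `AssertionError` blocks the promotion step.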

---

## 🗣️ One-liner

**“Package your entire LLM/RAG pipeline as a single MLflow pyfunc model: predictable I/O, portable serving, registry-controlled releases.”**
