Add trace integration: mlflow.skill_context() and harness hooks

jwm4 · claude · jwm4 · commit bd26ad82f74a · 2026-05-27T16:42:43.000-04:00
RFC-0005: Add skill_context() context manager that creates SKILL spans
with registry coordinates, supporting nested skill stacks. Strengthen
motivation item on trace-to-skill linkage. Move trace integration from
Phase 3 to Phase 1 in adoption strategy.

RFC-0006: Add harness trace integration via install-time manifest and
Claude Code PreToolUse/PostToolUse hooks on the Skill tool for automatic
SKILL span creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/rfcs/0005-skill-registry/0005-skill-registry.md b/rfcs/0005-skill-registry/0005-skill-registry.md
@@ -8,7 +8,7 @@ rfc_pr: https://github.com/mlflow/rfcs/pull/10
 
 | Author(s)              | Bill Murdock (Red Hat) |
 | :--------------------- | :-- |
-| **Date Last Modified** | 2026-05-17 |
+| **Date Last Modified** | 2026-05-27 |
 | **AI Assistant(s)**    | Claude Code (Opus 4.6) |
 
 # Summary
@@ -386,10 +386,15 @@ address:
    and hooks together. But there is no agent-neutral way to represent
    these bundles for governance and discovery.
 
-5. **No usage analytics linkage.** MLflow traces can capture skill
-   metadata, but without a governed registry, there is no way to link
-   trace data back to a governed record to understand adoption across
-   an organization.
+5. **No trace-to-skill linkage.** MLflow already traces agent
+   conversations (Claude Code via `mlflow autolog claude`, SDK
+   applications via framework autologgers such as
+   `mlflow.langchain.autolog()` and `mlflow.anthropic.autolog()`). These traces capture LLM calls,
+   tool use, and token consumption, but there is no way to know which
+   governed, versioned skill was active during any part of a trace.
+   Without a registry, organizations cannot answer questions like
+   "which skill versions are most used?" or "show me all traces where
+   the deprecated code-review v1.0 was loaded."
 
 6. **No pull mechanism.** Once a user discovers a capability in the
    registry, there is no standard way to fetch its content from the
@@ -1614,6 +1619,157 @@ separate permissions for scan results, or richer scan metadata),
 structured scan metadata can be added as a first-class entity in a
 follow-up without breaking the tag-based approach.
 
+### Trace integration
+
+MLflow already traces agent conversations across multiple frameworks:
+Claude Code (via `mlflow autolog claude`), SDK applications (via
+framework autologgers such as `mlflow.langchain.autolog()` and
+`mlflow.anthropic.autolog()`), and others. These
+traces capture LLM calls, tool use, and timing as a tree of spans.
+The skill registry closes the observability loop by letting agent
+developers indicate which registered skill is active during each
+part of a trace.
+
+#### `mlflow.skill_context()` context manager
+
+The primary instrumentation API is a context manager that creates a
+span of type `SKILL` and attaches registry coordinates as span
+attributes:
+
+```python
+with mlflow.skill_context(name="code-review", version="1.0.0") as span:
+    # All spans created inside this block (including those from
+    # autologgers) become children of this SKILL span.
+    result = llm.chat([{"role": "user", "content": "Review this code..."}])
+```
+
+The context manager creates a span with the following attributes:
+
+| Attribute | Value | Description |
+|---|---|---|
+| `mlflow.skill.name` | Skill name | Registry name of the active skill |
+| `mlflow.skill.version` | Version string | Registered version |
+| `mlflow.skill.registry` | Workspace name | MLflow workspace (defaults to `"default"`) |
+
+These three attributes form the `{workspace, name, version}`
+coordinates that link the span back to a specific skill version in
+the registry.
+
+#### Skill stacks via nesting
+
+Skills can invoke other skills. Because `skill_context()` creates a
+real span, nesting context managers naturally produces a skill stack
+in the trace tree. Consider an agent that uses a "code-review" skill,
+which internally invokes a "style-check" skill:
+
+```python
+import mlflow
+
+def run_code_review(diff: str):
+    with mlflow.skill_context(name="code-review", version="1.0.0"):
+        # First LLM call: analyze the diff
+        analysis = llm.chat([
+            {"role": "user", "content": f"Review this diff:\n{diff}"}
+        ])
+
+        # Invoke a sub-skill for style checking
+        style_issues = run_style_check(diff)
+
+        # Second LLM call: synthesize final review
+        review = llm.chat([
+            {"role": "user", "content": f"Summarize: {analysis}, {style_issues}"}
+        ])
+        return review
+
+def run_style_check(code: str):
+    with mlflow.skill_context(name="style-check", version="2.0.0"):
+        return llm.chat([
+            {"role": "user", "content": f"Check style:\n{code}"}
+        ])
+```
+
+The resulting trace tree:
+
+```
+Trace: tr-abc123
+|
++-- Span: "code-review" (type: SKILL)
+|   |   mlflow.skill.name = "code-review"
+|   |   mlflow.skill.version = "1.0.0"
+|   |
+|   +-- Span: ChatCompletion (type: LLM)
+|   |       "Review this diff: ..."
+|   |
+|   +-- Span: "style-check" (type: SKILL)
+|   |   |   mlflow.skill.name = "style-check"
+|   |   |   mlflow.skill.version = "2.0.0"
+|   |   |
+|   |   +-- Span: ChatCompletion (type: LLM)
+|   |           "Check style: ..."
+|   |
+|   +-- Span: ChatCompletion (type: LLM)
+|           "Summarize: ..."
+```
+
+For any span in the tree, walking up the ancestor chain and
+collecting SKILL-type spans reconstructs the skill stack. For the
+"Check style" LLM call, the stack is
+`[code-review@1.0.0, style-check@2.0.0]`. For the "Summarize" LLM
+call, the stack is just `[code-review@1.0.0]` because it executes
+after the style-check block exits.
+
+#### What this enables
+
+With skill-annotated traces, organizations can answer questions that
+are impossible without trace-to-registry linkage:
+
+- **Adoption tracking.** "Which skill versions are most used across
+  the organization?" Query for SKILL spans grouped by name and
+  version.
+- **Deprecation impact.** "Show me all traces where the deprecated
+  code-review v1.0 was loaded." Filter traces by
+  `mlflow.skill.name` and `mlflow.skill.version`.
+- **Per-skill cost attribution.** Each SKILL span is an ancestor of
+  every span created while the skill is active, so token usage and
+  latency can be aggregated per skill, with or without sub-skills.
+- **Regression detection.** "Did error rates change after upgrading
+  style-check from v1.0 to v2.0?" Compare trace outcomes across
+  skill versions.
+
+#### Autologger compatibility
+
+Because `skill_context()` creates a standard MLflow span, it works
+with existing autologgers without modification. Any span that an
+autologger (Claude, LangChain, OpenAI, etc.) creates inside a
+`skill_context()` block automatically becomes a child of the
+enclosing SKILL span.
+
+For harness-specific integration (e.g., Claude Code automatically
+wrapping skill loads in `skill_context()` spans), see RFC-0006.
+
+#### Registry validation
+
+`skill_context()` does not validate that the named skill exists in
+the registry at call time. Validating on every invocation would add
+latency and create a hard dependency on registry availability. The
+trace records the `{workspace, name, version}` coordinates
+regardless; the MLflow UI performs a best-effort lookup when
+displaying traces and shows a "not found in registry" indicator if
+the coordinates do not resolve.
+
+#### Relationship to MCP trace linking
+
+The MCP Registry (RFC-0004) provides `link_mcp_server_versions_to_trace()`
+for after-the-fact, trace-level association between traces and MCP
+server versions. Skill trace integration takes a different approach:
+span-level, inline annotation via context managers. The span-based
+approach is a better fit for skills because skills are ambient (active
+during inference rather than handling discrete requests) and can nest
+(a skill invoking a sub-skill). MCP servers have clearer
+request/response boundaries that make after-the-fact linking more
+natural. Both approaches produce trace metadata that the MLflow UI
+can display together.
+
 ## Drawbacks
 
 - **Source pointer validity.** The registry stores source pointers but
@@ -1651,6 +1807,6 @@ The two approaches are complementary.
 
 New feature, not a breaking change. Phased rollout:
 
-- **Phase 1 (this RFC):** Registry entities, store, REST API, SDK, CLI, UI, and `mlflow skills pull`.
-- **Phase 2 (RFC-0006):** Harness-specific `mlflow skills install` for Claude Code, Codex CLI, and Cursor.
-- **Phase 3 (follow-up):** Trace integration and usage analytics, install count tracking, cross-workspace export/import (following cross-registry patterns), and shared base extraction with the MCP registry.
+- **Phase 1 (this RFC):** Registry entities, store, REST API, SDK, CLI, UI, `mlflow skills pull`, and `mlflow.skill_context()` for trace integration.
+- **Phase 2 (RFC-0006):** Harness-specific `mlflow skills install` for Claude Code, Codex CLI, and Cursor. Automatic SKILL span creation through an install-time manifest and harness hooks.
+- **Phase 3 (follow-up):** Usage analytics dashboards, install count tracking, cross-workspace export/import (following cross-registry patterns), and shared base extraction with the MCP registry.
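
The ancestor walk described under "Skill stacks via nesting" in RFC-0005 can be sketched in a few lines. This is an illustration only, under stated assumptions: `SpanRecord` and `skill_stack` are hypothetical names, and the spans are held in a plain dictionary rather than read through MLflow's trace API. Only the `mlflow.skill.name` and `mlflow.skill.version` attribute keys come from the RFC.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanRecord:
    """Hypothetical stand-in for a trace span (not MLflow's Span entity)."""

    span_id: str
    parent_id: Optional[str]
    span_type: str
    attributes: Dict[str, str] = field(default_factory=dict)


def skill_stack(spans: Dict[str, SpanRecord], span_id: str) -> List[str]:
    """Return the skills enclosing a span as name@version, outermost first."""
    stack = []
    current = spans[span_id].parent_id
    # Walk from the span's parent up to the root, keeping SKILL spans only.
    while current is not None:
        span = spans[current]
        if span.span_type == "SKILL":
            name = span.attributes["mlflow.skill.name"]
            version = span.attributes["mlflow.skill.version"]
            stack.append(f"{name}@{version}")
        current = span.parent_id
    stack.reverse()  # collected innermost first; report outermost first
    return stack


def _skill(span_id, parent_id, name, version):
    """Build a SKILL span carrying the RFC's registry coordinates."""
    attrs = {"mlflow.skill.name": name, "mlflow.skill.version": version}
    return SpanRecord(span_id, parent_id, "SKILL", attrs)


# The trace tree from the RFC example, flattened into a dictionary.
spans = {
    s.span_id: s
    for s in [
        _skill("review", None, "code-review", "1.0.0"),
        SpanRecord("llm-analyze", "review", "LLM"),
        _skill("style", "review", "style-check", "2.0.0"),
        SpanRecord("llm-style", "style", "LLM"),
        SpanRecord("llm-summarize", "review", "LLM"),
    ]
}

print(skill_stack(spans, "llm-style"))
print(skill_stack(spans, "llm-summarize"))
```

For the "Check style" call this yields `code-review@1.0.0` followed by `style-check@2.0.0`, and for the "Summarize" call only `code-review@1.0.0`, matching the stacks listed in the RFC text.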
diff --git a/rfcs/0006-skill-harness-integration/0006-skill-harness-integration.md b/rfcs/0006-skill-harness-integration/0006-skill-harness-integration.md
@@ -8,7 +8,7 @@ rfc_pr: https://github.com/mlflow/rfcs/pull/10
 
 | Author(s)              | Bill Murdock (Red Hat) |
 | :--------------------- | :-- |
-| **Date Last Modified** | 2026-05-17 |
+| **Date Last Modified** | 2026-05-27 |
 | **AI Assistant(s)**    | Claude Code (Opus 4.6) |
 
 # Summary
@@ -457,6 +457,139 @@ mlflow.genai.skills.install(
 mlflow.genai.skills.install()
 ```
 
+### Trace integration
+
+RFC-0005 defines `mlflow.skill_context()`, a context manager that
+creates SKILL spans in MLflow traces (see RFC-0005, Trace
+integration). Agent developers using the Python SDK can call
+`skill_context()` directly in their code. This section describes how
+harness-specific installation can automate that instrumentation so
+users get skill-annotated traces without writing any tracing code.
+
+#### Install-time manifest
+
+When `mlflow skills install` places files for a harness, it also
+writes a manifest that maps installed skill names to their registry
+coordinates:
+
+**`mlflow-skills-manifest.json`:**
+```json
+{
+  "manifest_version": "1.0",
+  "skills": {
+    "code-review": {
+      "name": "code-review",
+      "version": "1.0.0",
+      "registry": "default"
+    },
+    "security-auditor": {
+      "name": "security-auditor",
+      "version": "1.0.0",
+      "registry": "default"
+    }
+  }
+}
+```
+
+The manifest is keyed by the skill's local name (the name the harness
+uses to invoke it). Each value provides the `{registry, name, version}`
+coordinates, where `registry` is the MLflow workspace name, that link
+back to the skill registry. Trace hooks use this file to annotate
+spans without requiring a registry lookup at runtime.
+
+#### Claude Code: hook-based instrumentation
+
+Claude Code invokes skills via a built-in `Skill` tool, which fires
+`PreToolUse` and `PostToolUse` hook events. The `mlflow skills install`
+command can configure hooks that create SKILL spans automatically
+when a registered skill is invoked.
+
+**Hook configuration** (added to `.claude/settings.json`):
+
+```json
+{
+  "hooks": {
+    "PreToolUse": [
+      {
+        "matcher": "Skill",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "mlflow skills trace-start --manifest mlflow-skills-manifest.json"
+          }
+        ]
+      }
+    ],
+    "PostToolUse": [
+      {
+        "matcher": "Skill",
+        "hooks": [
+          {
+            "type": "command",
+            "command": "mlflow skills trace-end --manifest mlflow-skills-manifest.json"
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+The `PreToolUse` hook receives the skill name in its input, looks it
+up in the manifest to get registry coordinates, and opens a SKILL
+span via the `mlflow autolog claude` trace pipeline. The `PostToolUse`
+hook closes the span. Because these hooks integrate with the same
+tracing mechanism that `mlflow autolog claude` already uses, SKILL
+spans appear as part of the existing trace tree alongside LLM and
+tool call spans.
+
+The hook commands shown above are illustrative. The exact CLI
+subcommands and their integration with the `mlflow autolog claude`
+trace pipeline are implementation details.
+
+**Hook installation behavior.** `mlflow skills install` writes the
+manifest automatically. It does not modify `settings.json` by
+default. Instead, it prints instructions showing the hook
+configuration to add. Users can opt in with `--install-hooks` to
+have the installer merge hook entries into `settings.json`. This
+follows the same security principle as hook handling for plugin
+members: users must explicitly enable hooks.
+
+#### Agent SDK: direct instrumentation
+
+For developers building agents with the Claude Agent SDK or other
+Python frameworks, the recommended approach is to use
+`mlflow.skill_context()` directly (see RFC-0005). The Agent SDK's
+hook system also supports Python callbacks, so a similar automatic
+approach is possible:
+
+```python
+from claude_agent_sdk import ClaudeAgentOptions, HookMatcher
+
+async def on_skill_start(input_data, tool_use_id, context):
+    skill_name = input_data["tool_input"].get("skill")
+    # Look up registry coordinates from manifest
+    # Open mlflow.skill_context() span
+    return {}
+
+options = ClaudeAgentOptions(
+    hooks={"PreToolUse": [HookMatcher(matcher="Skill", hooks=[on_skill_start])]}
+)
+```
+
+Because Agent SDK hooks run in-process, they can call
+`mlflow.skill_context()` directly, creating SKILL spans in the
+same trace tree as the autologger spans.
+
+#### Other harnesses
+
+Trace integration depends on each harness exposing a hook or event
+mechanism for skill invocation. Harnesses that support pre/post tool
+use hooks (Codex CLI, GitHub Copilot) can follow the same pattern as
+Claude Code. Harnesses without hook support cannot be automatically
+instrumented; users of those harnesses can still use
+`mlflow.skill_context()` manually in SDK-based agent code.
+
 ## Drawbacks
 
 - **Adapter maintenance.** Each harness adapter must be maintained as
@@ -486,6 +619,10 @@ critical for driving adoption.
   format).
 - Cursor adapter (second-highest priority for MLflow's user base).
 - `marketplace.json` generation for Claude Code / Codex CLI.
+- Install-time manifest (`mlflow-skills-manifest.json`) for trace
+  integration.
+- Claude Code trace hooks for automatic SKILL span creation via
+  `PreToolUse`/`PostToolUse` on the `Skill` tool.
 
 **Follow-up:**
 - Additional harness adapters based on demand.
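
The manifest lookup that the trace hooks perform, described under "Install-time manifest" in RFC-0006, could look roughly like the sketch below. `resolve_skill` is a hypothetical helper, and returning `None` for a skill that is absent from the manifest is an assumption; the RFC does not say how unregistered skills are handled. The manifest layout and the `mlflow.skill.*` attribute keys follow the two RFCs.

```python
import json
from typing import Dict, Optional

# Manifest content copied from the RFC-0006 example.
MANIFEST_TEXT = """
{
  "manifest_version": "1.0",
  "skills": {
    "code-review": {
      "name": "code-review",
      "version": "1.0.0",
      "registry": "default"
    },
    "security-auditor": {
      "name": "security-auditor",
      "version": "1.0.0",
      "registry": "default"
    }
  }
}
"""


def resolve_skill(manifest: dict, local_name: str) -> Optional[Dict[str, str]]:
    """Map a harness-local skill name to SKILL span attributes.

    Hypothetical helper: returns None when the skill is not in the
    manifest (assumed behavior, not specified by the RFC).
    """
    entry = manifest.get("skills", {}).get(local_name)
    if entry is None:
        return None
    return {
        "mlflow.skill.name": entry["name"],
        "mlflow.skill.version": entry["version"],
        # RFC-0005 states that the workspace defaults to "default".
        "mlflow.skill.registry": entry.get("registry", "default"),
    }


manifest = json.loads(MANIFEST_TEXT)
print(resolve_skill(manifest, "code-review"))
print(resolve_skill(manifest, "not-installed"))
```

Because the coordinates come from a local file, a hook built this way can annotate spans even when the registry is unreachable, which is the property the manifest is meant to provide.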