Commit bd26ad8

jwm4 and claude committed
Add trace integration: mlflow.skill_context() and harness hooks
RFC-0005: Add skill_context() context manager that creates SKILL spans with registry coordinates, supporting nested skill stacks. Strengthen motivation item on trace-to-skill linkage. Move trace integration from Phase 3 to Phase 1 in adoption strategy.

RFC-0006: Add harness trace integration via install-time manifest and Claude Code PreToolUse/PostToolUse hooks on the Skill tool for automatic SKILL span creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 13d040b commit bd26ad8

2 files changed

Lines changed: 302 additions & 9 deletions

File tree

rfcs/0005-skill-registry/0005-skill-registry.md

Lines changed: 164 additions & 8 deletions
@@ -8,7 +8,7 @@ rfc_pr: https://github.com/mlflow/rfcs/pull/10
88

99
| Author(s) | Bill Murdock (Red Hat) |
1010
| :--------------------- | :-- |
11-
| **Date Last Modified** | 2026-05-17 |
11+
| **Date Last Modified** | 2026-05-27 |
1212
| **AI Assistant(s)** | Claude Code (Opus 4.6) |
1313

1414
# Summary
@@ -386,10 +386,15 @@ address:
386386
and hooks together. But there is no agent-neutral way to represent
387387
these bundles for governance and discovery.
388388

389-
5. **No usage analytics linkage.** MLflow traces can capture skill
390-
metadata, but without a governed registry, there is no way to link
391-
trace data back to a governed record to understand adoption across
392-
an organization.
389+
5. **No trace-to-skill linkage.** MLflow already traces agent
390+
conversations (Claude Code via `mlflow autolog claude`, SDK
391+
applications via framework autologgers such as
392+
`mlflow.langchain.autolog()` and `mlflow.anthropic.autolog()`). These traces capture LLM calls,
393+
tool use, and token consumption, but there is no way to know which
394+
governed, versioned skill was active during any part of a trace.
395+
Without a registry, organizations cannot answer questions like
396+
"which skill versions are most used?" or "show me all traces where
397+
the deprecated code-review v1.0 was loaded."
393398

394399
6. **No pull mechanism.** Once a user discovers a capability in the
395400
registry, there is no standard way to fetch its content from the
@@ -1614,6 +1619,157 @@ separate permissions for scan results, or richer scan metadata),
16141619
structured scan metadata can be added as a first-class entity in a
16151620
follow-up without breaking the tag-based approach.
16161621

1622+
### Trace integration
1623+
1624+
MLflow already traces agent conversations across multiple frameworks:
1625+
Claude Code (via `mlflow autolog claude`), SDK applications (via
1626+
framework autologgers such as `mlflow.langchain.autolog()` and
1627+
`mlflow.anthropic.autolog()`), and others. These
1628+
traces capture LLM calls, tool use, and timing as a tree of spans.
1629+
The skill registry closes the observability loop by letting agent
1630+
developers indicate which registered skill is active during each
1631+
part of a trace.
1632+
1633+
#### `mlflow.skill_context()` context manager
1634+
1635+
The primary instrumentation API is a context manager that creates a
1636+
span of type `SKILL` and attaches registry coordinates as span
1637+
attributes:
1638+
1639+
```python
with mlflow.skill_context(name="code-review", version="1.0.0") as span:
    # All spans created inside this block (including those from
    # autologgers) become children of this SKILL span.
    result = llm.chat([{"role": "user", "content": "Review this code..."}])
```
1645+
1646+
The context manager creates a span with the following attributes:
1647+
1648+
| Attribute | Value | Description |
|---|---|---|
| `mlflow.skill.name` | Skill name | Registry name of the active skill |
| `mlflow.skill.version` | Version string | Registered version |
| `mlflow.skill.registry` | Workspace name | MLflow workspace (defaults to `"default"`) |
1653+
1654+
These three attributes form the `{workspace, name, version}`
1655+
coordinates that link the span back to a specific skill version in
1656+
the registry.
1657+
1658+
#### Skill stacks via nesting
1659+
1660+
Skills can invoke other skills. Because `skill_context()` creates a
1661+
real span, nesting context managers naturally produces a skill stack
1662+
in the trace tree. Consider an agent that uses a "code-review" skill,
1663+
which internally invokes a "style-check" skill:
1664+
1665+
```python
import mlflow

def run_code_review(diff: str):
    with mlflow.skill_context(name="code-review", version="1.0.0"):
        # First LLM call: analyze the diff
        analysis = llm.chat([
            {"role": "user", "content": f"Review this diff:\n{diff}"}
        ])

        # Invoke a sub-skill for style checking
        style_issues = run_style_check(diff)

        # Second LLM call: synthesize final review
        review = llm.chat([
            {"role": "user", "content": f"Summarize: {analysis}, {style_issues}"}
        ])
        return review

def run_style_check(code: str):
    with mlflow.skill_context(name="style-check", version="2.0.0"):
        return llm.chat([
            {"role": "user", "content": f"Check style:\n{code}"}
        ])
```
1690+
1691+
The resulting trace tree:
1692+
1693+
```
Trace: tr-abc123
|
+-- Span: "code-review" (type: SKILL)
|   |   mlflow.skill.name = "code-review"
|   |   mlflow.skill.version = "1.0.0"
|   |
|   +-- Span: ChatCompletion (type: LLM)
|   |       "Review this diff: ..."
|   |
|   +-- Span: "style-check" (type: SKILL)
|   |   |   mlflow.skill.name = "style-check"
|   |   |   mlflow.skill.version = "2.0.0"
|   |   |
|   |   +-- Span: ChatCompletion (type: LLM)
|   |           "Check style: ..."
|   |
|   +-- Span: ChatCompletion (type: LLM)
|           "Summarize: ..."
```
1713+
1714+
For any span in the tree, walking up the ancestor chain and
1715+
collecting SKILL-type spans reconstructs the skill stack. For the
1716+
"Check style" LLM call, the stack is
1717+
`[code-review@1.0.0, style-check@2.0.0]`. For the "Summarize" LLM
1718+
call, the stack is just `[code-review@1.0.0]` because it executes
1719+
after the style-check block exits.
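
The ancestor walk described above can be sketched in a few lines. The `Span` dataclass here is a hypothetical in-memory model with a `parent` link, not MLflow's span entity; a real implementation would resolve parents through whatever span identifiers the trace stores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical in-memory span with a link to its parent."""
    name: str
    span_type: str
    parent: Optional["Span"] = None
    version: Optional[str] = None


def skill_stack(span: Span) -> list[str]:
    """Collect SKILL spans from `span` up to the root, outermost first."""
    stack = []
    node: Optional[Span] = span
    while node is not None:
        if node.span_type == "SKILL":
            stack.append(f"{node.name}@{node.version}")
        node = node.parent
    stack.reverse()  # the walk finds the innermost skill first
    return stack


# Rebuild the relevant part of the example trace tree.
code_review = Span("code-review", "SKILL", version="1.0.0")
style_check = Span("style-check", "SKILL", parent=code_review, version="2.0.0")
check_style_llm = Span("ChatCompletion", "LLM", parent=style_check)
summarize_llm = Span("ChatCompletion", "LLM", parent=code_review)

print(skill_stack(check_style_llm))
print(skill_stack(summarize_llm))
```

Note that the walk starts at the span itself, so calling it on a SKILL span includes that span in its own stack; whether that is the desired convention is a design choice this sketch does not settle.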
1720+
1721+
#### What this enables
1722+
1723+
With skill-annotated traces, organizations can answer questions that
1724+
are impossible without trace-to-registry linkage:
1725+
1726+
- **Adoption tracking.** "Which skill versions are most used across
1727+
the organization?" Query for SKILL spans grouped by name and
1728+
version.
1729+
- **Deprecation impact.** "Show me all traces where the deprecated
1730+
code-review v1.0 was loaded." Filter traces by
1731+
`mlflow.skill.name` and `mlflow.skill.version`.
1732+
- **Per-skill cost attribution.** Each SKILL span contains all child
1733+
spans. Aggregate token usage and latency per skill, including or
1734+
excluding sub-skills.
1735+
- **Regression detection.** "Did error rates change after upgrading
1736+
style-check from v1.0 to v2.0?" Compare trace outcomes across
1737+
skill versions.
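
As a rough illustration of the adoption and cost queries above, the following sketch aggregates over an in-memory span tree. The `Span` model and its `tokens` field are assumptions for illustration only; an actual query would run against stored traces rather than Python objects.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical span with a token count and child spans."""
    name: str
    span_type: str
    tokens: int = 0
    version: str = ""
    children: list["Span"] = field(default_factory=list)


def total_tokens(span: Span, include_sub_skills: bool = True) -> int:
    """Sum tokens under a span, optionally skipping nested SKILL subtrees."""
    total = span.tokens
    for child in span.children:
        if child.span_type == "SKILL" and not include_sub_skills:
            continue
        total += total_tokens(child, include_sub_skills)
    return total


def skill_usage(root: Span, counts: Optional[Counter] = None) -> Counter:
    """Count SKILL spans grouped by (name, version)."""
    counts = Counter() if counts is None else counts
    if root.span_type == "SKILL":
        counts[(root.name, root.version)] += 1
    for child in root.children:
        skill_usage(child, counts)
    return counts


style_check = Span("style-check", "SKILL", version="2.0.0",
                   children=[Span("ChatCompletion", "LLM", tokens=40)])
code_review = Span("code-review", "SKILL", version="1.0.0", children=[
    Span("ChatCompletion", "LLM", tokens=100),
    style_check,
    Span("ChatCompletion", "LLM", tokens=60),
])

inclusive = total_tokens(code_review)                            # with style-check
exclusive = total_tokens(code_review, include_sub_skills=False)  # without it
usage = skill_usage(code_review)
```

The inclusive and exclusive totals correspond to the "including or excluding sub-skills" distinction in the cost attribution bullet.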
1738+
1739+
#### Autologger compatibility
1740+
1741+
Because `skill_context()` creates a standard MLflow span, it works
1742+
with existing autologgers without modification. When an autologger
1743+
(Claude, LangChain, OpenAI, etc.) creates a span inside a
1744+
`skill_context()` block, that span automatically becomes a child of
1745+
the SKILL span. No changes to the autologgers are needed.
1746+
1747+
For harness-specific integration (e.g., Claude Code automatically
1748+
wrapping skill loads in `skill_context()` spans), see RFC-0006.
1749+
1750+
#### Registry validation
1751+
1752+
`skill_context()` does not validate that the named skill exists in
1753+
the registry at call time. Validating on every invocation would add
1754+
latency and create a hard dependency on registry availability. The
1755+
trace records the `{workspace, name, version}` coordinates
1756+
regardless; the MLflow UI performs a best-effort lookup when
1757+
displaying traces and shows a "not found in registry" indicator if
1758+
the coordinates do not resolve.
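
A best-effort display lookup might look like the sketch below. The `describe_skill` helper and the `lookup` callable are hypothetical, and treating a registry error the same as an unresolved coordinate is an assumption of this sketch, not something the RFC specifies.

```python
from typing import Callable, Optional

Coordinates = tuple[str, str, str]  # (workspace, name, version)


def describe_skill(
    coords: Coordinates,
    lookup: Callable[[Coordinates], Optional[dict]],
) -> str:
    """Render a label for a SKILL span without failing if the lookup does."""
    workspace, name, version = coords
    label = f"{name}@{version} ({workspace})"
    try:
        record = lookup(coords)
    except Exception:
        # Assumption: an unreachable registry is shown like a missing record,
        # so trace display never depends on registry availability.
        record = None
    if record is None:
        return f"{label} [not found in registry]"
    return label


# A dict stands in for the registry in this example.
registry = {("default", "code-review", "1.0.0"): {"status": "active"}}

found = describe_skill(("default", "code-review", "1.0.0"), registry.get)
missing = describe_skill(("default", "style-check", "9.9.9"), registry.get)
```

The trace itself is never modified here; only the rendered label changes, which matches the idea that coordinates are recorded regardless of whether they resolve.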
1759+
1760+
#### Relationship to MCP trace linking
1761+
1762+
The MCP Registry (RFC-0004) provides `link_mcp_server_versions_to_trace()`
1763+
for after-the-fact, trace-level association between traces and MCP
1764+
server versions. Skill trace integration takes a different approach:
1765+
span-level, inline annotation via context managers. The span-based
1766+
approach is a better fit for skills because skills are ambient (active
1767+
during inference rather than handling discrete requests) and can nest
1768+
(a skill invoking a sub-skill). MCP servers have clearer
1769+
request/response boundaries that make after-the-fact linking more
1770+
natural. Both approaches produce trace metadata that the MLflow UI
1771+
can display together.
1772+
16171773
## Drawbacks
16181774

16191775
- **Source pointer validity.** The registry stores source pointers but
@@ -1651,6 +1807,6 @@ The two approaches are complementary.
16511807

16521808
New feature, not a breaking change. Phased rollout:
16531809

1654-
- **Phase 1 (this RFC):** Registry entities, store, REST API, SDK, CLI, UI, and `mlflow skills pull`.
1655-
- **Phase 2 (RFC-0006):** Harness-specific `mlflow skills install` for Claude Code, Codex CLI, and Cursor.
1656-
- **Phase 3 (follow-up):** Trace integration and usage analytics, install count tracking, cross-workspace export/import (following cross-registry patterns), and shared base extraction with the MCP registry.
1810+
- **Phase 1 (this RFC):** Registry entities, store, REST API, SDK, CLI, UI, `mlflow skills pull`, and `mlflow.skill_context()` for trace integration.
1811+
- **Phase 2 (RFC-0006):** Harness-specific `mlflow skills install` for Claude Code, Codex CLI, and Cursor. Automatic `skill_context()` wrapping in harness-specific autologgers.
1812+
- **Phase 3 (follow-up):** Usage analytics dashboards, install count tracking, cross-workspace export/import (following cross-registry patterns), and shared base extraction with the MCP registry.

rfcs/0006-skill-harness-integration/0006-skill-harness-integration.md

Lines changed: 138 additions & 1 deletion
@@ -8,7 +8,7 @@ rfc_pr: https://github.com/mlflow/rfcs/pull/10
88

99
| Author(s) | Bill Murdock (Red Hat) |
1010
| :--------------------- | :-- |
11-
| **Date Last Modified** | 2026-05-17 |
11+
| **Date Last Modified** | 2026-05-27 |
1212
| **AI Assistant(s)** | Claude Code (Opus 4.6) |
1313

1414
# Summary
@@ -457,6 +457,139 @@ mlflow.genai.skills.install(
457457
mlflow.genai.skills.install()
458458
```
459459

460+
### Trace integration
461+
462+
RFC-0005 defines `mlflow.skill_context()`, a context manager that
463+
creates SKILL spans in MLflow traces (see RFC-0005, Trace
464+
integration). Agent developers using the Python SDK can call
465+
`skill_context()` directly in their code. This section describes how
466+
harness-specific installation can automate that instrumentation so
467+
users get skill-annotated traces without writing any tracing code.
468+
469+
#### Install-time manifest
470+
471+
When `mlflow skills install` places files for a harness, it also
472+
writes a manifest that maps installed skill names to their registry
473+
coordinates:
474+
475+
**`mlflow-skills-manifest.json`:**
476+
```json
{
  "manifest_version": "1.0",
  "skills": {
    "code-review": {
      "name": "code-review",
      "version": "1.0.0",
      "registry": "default"
    },
    "security-auditor": {
      "name": "security-auditor",
      "version": "1.0.0",
      "registry": "default"
    }
  }
}
```
493+
494+
The manifest is keyed by the skill's local name (the name the harness
495+
uses to invoke it). The value provides the `{registry, name, version}`
496+
coordinates that link back to the skill registry. This file is used
497+
by trace hooks to annotate spans with registry coordinates without
498+
requiring a registry lookup at runtime.
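
Reading the manifest needs nothing beyond a JSON parse. The sketch below maps each local skill name to its `(registry, name, version)` triple; the `load_coordinates` helper and the major-version check are assumptions for illustration, since the RFC does not define how `manifest_version` should be interpreted.

```python
import json

MANIFEST_TEXT = """
{
  "manifest_version": "1.0",
  "skills": {
    "code-review": {"name": "code-review", "version": "1.0.0", "registry": "default"},
    "security-auditor": {"name": "security-auditor", "version": "1.0.0", "registry": "default"}
  }
}
"""


def load_coordinates(manifest_text: str) -> dict[str, tuple[str, str, str]]:
    """Map each local skill name to its (registry, name, version) triple."""
    manifest = json.loads(manifest_text)
    # Assumption: readers reject manifests with an unknown major version.
    major = manifest["manifest_version"].split(".")[0]
    if major != "1":
        raise ValueError(f"unsupported manifest_version: {manifest['manifest_version']}")
    return {
        local_name: (entry["registry"], entry["name"], entry["version"])
        for local_name, entry in manifest["skills"].items()
    }


coordinates = load_coordinates(MANIFEST_TEXT)
print(coordinates["code-review"])
```

Because the key is the local name and the value carries the registry name separately, a harness could in principle install a skill under a different local name without losing the link to the registry record.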
499+
500+
#### Claude Code: hook-based instrumentation
501+
502+
Claude Code invokes skills via a built-in `Skill` tool, which fires
503+
`PreToolUse` and `PostToolUse` hook events. The `mlflow skills install`
504+
command can configure hooks that create SKILL spans automatically
505+
when a registered skill is invoked.
506+
507+
**Hook configuration** (added to `.claude/settings.json`):
508+
509+
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Skill",
        "hooks": [
          {
            "type": "command",
            "command": "mlflow skills trace-start --manifest mlflow-skills-manifest.json"
          }
        ]
      }
    ],
    "PostToolUse": [
      {
        "matcher": "Skill",
        "hooks": [
          {
            "type": "command",
            "command": "mlflow skills trace-end --manifest mlflow-skills-manifest.json"
          }
        ]
      }
    ]
  }
}
```
537+
538+
The `PreToolUse` hook receives the skill name in its input, looks it
539+
up in the manifest to get registry coordinates, and opens a SKILL
540+
span via the `mlflow autolog claude` trace pipeline. The `PostToolUse`
541+
hook closes the span. Because these hooks integrate with the same
542+
tracing mechanism that `mlflow autolog claude` already uses, SKILL
543+
spans appear as part of the existing trace tree alongside LLM and
544+
tool call spans.
545+
546+
The hook commands shown above are illustrative. The exact CLI
547+
subcommands and their integration with the `mlflow autolog claude`
548+
trace pipeline are implementation details.
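
The lookup step inside a `PreToolUse` hook can be sketched as a pure function from the hook's JSON input to span attributes. This assumes the hook input carries `tool_name` and `tool_input` fields and that the skill name sits under a `skill` key in `tool_input` (the key used in this RFC's Agent SDK example); the function name is hypothetical, and how the span is actually opened in the trace pipeline is left out.

```python
import json
from typing import Optional


def span_attributes_for_hook(payload_text: str, manifest: dict) -> Optional[dict]:
    """Return SKILL span attributes for a hook payload, or None if not applicable."""
    payload = json.loads(payload_text)
    if payload.get("tool_name") != "Skill":
        return None
    # Assumption: the skill's local name is under tool_input["skill"].
    skill_name = payload.get("tool_input", {}).get("skill")
    entry = manifest.get("skills", {}).get(skill_name)
    if entry is None:
        # Skill was not installed from the registry, so no span is created.
        return None
    return {
        "mlflow.skill.name": entry["name"],
        "mlflow.skill.version": entry["version"],
        "mlflow.skill.registry": entry["registry"],
    }


manifest = {
    "manifest_version": "1.0",
    "skills": {
        "code-review": {"name": "code-review", "version": "1.0.0", "registry": "default"}
    },
}
payload = json.dumps({"tool_name": "Skill", "tool_input": {"skill": "code-review"}})
attributes = span_attributes_for_hook(payload, manifest)
```

Returning `None` for skills that are absent from the manifest keeps unregistered, locally authored skills from producing spans with made-up coordinates.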
549+
550+
**Hook installation behavior.** `mlflow skills install` writes the
551+
manifest automatically. It does not modify `settings.json` by
552+
default. Instead, it prints instructions showing the hook
553+
configuration to add. Users can opt in with `--install-hooks` to
554+
have the installer merge hook entries into `settings.json`. This
555+
follows the same security principle as hook handling for plugin
556+
members: users must explicitly enable hooks.
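
One way an opt-in `--install-hooks` merge could behave is sketched below: it appends the Skill hook entries to whatever the user already has and skips exact duplicates, so rerunning the installer is harmless. The `merge_hooks` and `skill_hook` helpers are hypothetical, and the command strings are the illustrative ones from this section rather than a final CLI.

```python
import copy


def merge_hooks(settings: dict, additions: dict) -> dict:
    """Return a copy of `settings` with hook entries added, skipping exact duplicates."""
    merged = copy.deepcopy(settings)  # never mutate the caller's settings
    hooks = merged.setdefault("hooks", {})
    for event, entries in additions.items():
        existing = hooks.setdefault(event, [])
        for entry in entries:
            if entry not in existing:
                existing.append(copy.deepcopy(entry))
    return merged


def skill_hook(command: str) -> dict:
    """Build one hook entry that matches the Skill tool."""
    return {"matcher": "Skill", "hooks": [{"type": "command", "command": command}]}


SKILL_HOOKS = {
    "PreToolUse": [skill_hook("mlflow skills trace-start --manifest mlflow-skills-manifest.json")],
    "PostToolUse": [skill_hook("mlflow skills trace-end --manifest mlflow-skills-manifest.json")],
}

# A user who already has an unrelated hook configured.
user_settings = {
    "hooks": {
        "PreToolUse": [
            {"matcher": "Bash", "hooks": [{"type": "command", "command": "echo bash"}]}
        ]
    }
}

once = merge_hooks(user_settings, SKILL_HOOKS)
twice = merge_hooks(once, SKILL_HOOKS)  # second run adds nothing
```

Working on a deep copy means the installer can show the user a preview of the merged settings before writing anything to disk.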
557+
558+
#### Agent SDK: direct instrumentation
559+
560+
For developers building agents with the Claude Agent SDK or other
561+
Python frameworks, the recommended approach is to use
562+
`mlflow.skill_context()` directly (see RFC-0005). The Agent SDK's
563+
hook system also supports Python callbacks, so a similar automatic
564+
approach is possible:
565+
566+
```python
from claude_agent_sdk import ClaudeAgentOptions, HookMatcher

async def on_skill_start(input_data, tool_use_id, context):
    skill_name = input_data["tool_input"].get("skill")
    # Look up registry coordinates from manifest
    # Open mlflow.skill_context() span
    return {}

options = ClaudeAgentOptions(
    hooks={"PreToolUse": [HookMatcher(matcher="Skill", hooks=[on_skill_start])]}
)
```
579+
580+
Because Agent SDK hooks run in-process, they can call
581+
`mlflow.skill_context()` directly, creating SKILL spans in the
582+
same trace tree as the autologger spans.
583+
584+
#### Other harnesses
585+
586+
Trace integration depends on each harness exposing a hook or event
587+
mechanism for skill invocation. Harnesses that support pre/post tool
588+
use hooks (Codex CLI, GitHub Copilot) can follow the same pattern as
589+
Claude Code. Harnesses without hook support cannot be automatically
590+
instrumented; users of those harnesses can still use
591+
`mlflow.skill_context()` manually in SDK-based agent code.
592+
460593
## Drawbacks
461594

462595
- **Adapter maintenance.** Each harness adapter must be maintained as
@@ -486,6 +619,10 @@ critical for driving adoption.
486619
format).
487620
- Cursor adapter (second-highest priority for MLflow's user base).
488621
- `marketplace.json` generation for Claude Code / Codex CLI.
622+
- Install-time manifest (`mlflow-skills-manifest.json`) for trace
623+
integration.
624+
- Claude Code trace hooks for automatic SKILL span creation via
625+
`PreToolUse`/`PostToolUse` on the `Skill` tool.
489626

490627
**Follow-up:**
491628
- Additional harness adapters based on demand.
