@@ -8,7 +8,7 @@ rfc_pr: https://github.com/mlflow/rfcs/pull/10
88
99| Author(s) | Bill Murdock (Red Hat) |
1010| :--------------------- | :-- |
11- | ** Date Last Modified** | 2026-05-17 |
11+ | ** Date Last Modified** | 2026-05-27 |
1212| ** AI Assistant(s)** | Claude Code (Opus 4.6) |
1313
1414# Summary
@@ -386,10 +386,15 @@ address:
386386 and hooks together. But there is no agent-neutral way to represent
387387 these bundles for governance and discovery.
388388
389- 5 . ** No usage analytics linkage.** MLflow traces can capture skill
390- metadata, but without a governed registry, there is no way to link
391- trace data back to a governed record to understand adoption across
392- an organization.
389+ 5 . ** No trace-to-skill linkage.** MLflow already traces agent
390+ conversations (Claude Code via ` mlflow autolog claude ` , SDK
391+ applications via framework autologgers such as
392+ ` mlflow.langchain.autolog() ` and ` mlflow.anthropic.autolog() ` ). These traces capture LLM calls,
393+ tool use, and token consumption, but there is no way to know which
394+ governed, versioned skill was active during any part of a trace.
395+ Without a registry, organizations cannot answer questions like
396+ "which skill versions are most used?" or "show me all traces where
397+ the deprecated code-review v1.0 was loaded."
393398
3943996 . ** No pull mechanism.** Once a user discovers a capability in the
395400 registry, there is no standard way to fetch its content from the
@@ -1614,6 +1619,157 @@ separate permissions for scan results, or richer scan metadata),
16141619structured scan metadata can be added as a first-class entity in a
16151620follow-up without breaking the tag-based approach.
16161621
1622+ ### Trace integration
1623+
1624+ MLflow already traces agent conversations across multiple frameworks:
1625+ Claude Code (via ` mlflow autolog claude ` ), SDK applications (via
1626+ framework autologgers such as ` mlflow.langchain.autolog() ` and
1627+ ` mlflow.anthropic.autolog() ` ), and others. These
1628+ traces capture LLM calls, tool use, and timing as a tree of spans.
1629+ The skill registry closes the observability loop by letting agent
1630+ developers indicate which registered skill is active during each
1631+ part of a trace.
1632+
1633+ #### ` mlflow.skill_context() ` context manager
1634+
1635+ The primary instrumentation API is a context manager that creates a
1636+ span of type ` SKILL ` and attaches registry coordinates as span
1637+ attributes:
1638+
1639+ ``` python
1640+ with mlflow.skill_context(name = " code-review" , version = " 1.0.0" ) as span:
1641+ # All spans created inside this block (including those from
1642+ # autologgers) become children of this SKILL span.
1643+ result = llm.chat([{" role" : " user" , " content" : " Review this code..." }])
1644+ ```
1645+
1646+ The context manager creates a span with the following attributes:
1647+
1648+ | Attribute | Value | Description |
1649+ | ---| ---| ---|
1650+ | ` mlflow.skill.name ` | Skill name | Registry name of the active skill |
1651+ | ` mlflow.skill.version ` | Version string | Registered version |
1652+ | ` mlflow.skill.registry ` | Workspace name | MLflow workspace (defaults to ` "default" ` ) |
1653+
1654+ These three attributes form the ` {workspace, name, version} `
1655+ coordinates that link the span back to a specific skill version in
1656+ the registry.
1657+
1658+ #### Skill stacks via nesting
1659+
1660+ Skills can invoke other skills. Because ` skill_context() ` creates a
1661+ real span, nesting context managers naturally produces a skill stack
1662+ in the trace tree. Consider an agent that uses a "code-review" skill,
1663+ which internally invokes a "style-check" skill:
1664+
1665+ ``` python
1666+ import mlflow
1667+
1668+ def run_code_review (diff : str ):
1669+ with mlflow.skill_context(name = " code-review" , version = " 1.0.0" ):
1670+ # First LLM call: analyze the diff
1671+ analysis = llm.chat([
1672+ {" role" : " user" , " content" : f " Review this diff: \n { diff} " }
1673+ ])
1674+
1675+ # Invoke a sub-skill for style checking
1676+ style_issues = run_style_check(diff)
1677+
1678+ # Second LLM call: synthesize final review
1679+ review = llm.chat([
1680+ {" role" : " user" , " content" : f " Summarize: { analysis} , { style_issues} " }
1681+ ])
1682+ return review
1683+
1684+ def run_style_check (code : str ):
1685+ with mlflow.skill_context(name = " style-check" , version = " 2.0.0" ):
1686+ return llm.chat([
1687+ {" role" : " user" , " content" : f " Check style: \n { code} " }
1688+ ])
1689+ ```
1690+
1691+ The resulting trace tree:
1692+
1693+ ```
1694+ Trace: tr-abc123
1695+ |
1696+ +-- Span: "code-review" (type: SKILL)
1697+ | | mlflow.skill.name = "code-review"
1698+ | | mlflow.skill.version = "1.0.0"
1699+ | |
1700+ | +-- Span: ChatCompletion (type: LLM)
1701+ | | "Review this diff: ..."
1702+ | |
1703+ | +-- Span: "style-check" (type: SKILL)
1704+ | | | mlflow.skill.name = "style-check"
1705+ | | | mlflow.skill.version = "2.0.0"
1706+ | | |
1707+ | | +-- Span: ChatCompletion (type: LLM)
1708+ | | "Check style: ..."
1709+ | |
1710+ | +-- Span: ChatCompletion (type: LLM)
1711+ | "Summarize: ..."
1712+ ```
1713+
1714+ For any span in the tree, walking up the ancestor chain and
1715+ collecting SKILL-type spans reconstructs the skill stack. For the
1716+ "Check style" LLM call, the stack is
1717+ ` [code-review@1.0.0, style-check@2.0.0] ` . For the "Summarize" LLM
1718+ call, the stack is just ` [code-review@1.0.0] ` because it executes
1719+ after the style-check block exits.
1720+
1721+ #### What this enables
1722+
1723+ With skill-annotated traces, organizations can answer questions that
1724+ are impossible without trace-to-registry linkage:
1725+
1726+ - ** Adoption tracking.** "Which skill versions are most used across
1727+ the organization?" Query for SKILL spans grouped by name and
1728+ version.
1729+ - ** Deprecation impact.** "Show me all traces where the deprecated
1730+ code-review v1.0 was loaded." Filter traces by
1731+ ` mlflow.skill.name ` and ` mlflow.skill.version ` .
1732+ - ** Per-skill cost attribution.** Each SKILL span contains all child
1733+ spans. Aggregate token usage and latency per skill, including or
1734+ excluding sub-skills.
1735+ - ** Regression detection.** "Did error rates change after upgrading
1736+ style-check from v1.0 to v2.0?" Compare trace outcomes across
1737+ skill versions.
1738+
1739+ #### Autologger compatibility
1740+
1741+ Because ` skill_context() ` creates a standard MLflow span, it works
1742+ with existing autologgers without modification. When an autologger
1743+ (Claude, LangChain, OpenAI, etc.) creates a span inside a
1744+ ` skill_context() ` block, that span automatically becomes a child of
1745+ the SKILL span. No changes to the autologgers are needed.
1746+
1747+ For harness-specific integration (e.g., Claude Code automatically
1748+ wrapping skill loads in ` skill_context() ` spans), see RFC-0006.
1749+
1750+ #### Registry validation
1751+
1752+ ` skill_context() ` does not validate that the named skill exists in
1753+ the registry at call time. Validating on every invocation would add
1754+ latency and create a hard dependency on registry availability. The
1755+ trace records the ` {workspace, name, version} ` coordinates
1756+ regardless; the MLflow UI performs a best-effort lookup when
1757+ displaying traces and shows a "not found in registry" indicator if
1758+ the coordinates do not resolve.
1759+
1760+ #### Relationship to MCP trace linking
1761+
1762+ The MCP Registry (RFC-0004) provides ` link_mcp_server_versions_to_trace() `
1763+ for after-the-fact, trace-level association between traces and MCP
1764+ server versions. Skill trace integration takes a different approach:
1765+ span-level, inline annotation via context managers. The span-based
1766+ approach is a better fit for skills because skills are ambient (active
1767+ during inference rather than handling discrete requests) and can nest
1768+ (a skill invoking a sub-skill). MCP servers have clearer
1769+ request/response boundaries that make after-the-fact linking more
1770+ natural. Both approaches produce trace metadata that the MLflow UI
1771+ can display together.
1772+
16171773## Drawbacks
16181774
16191775- ** Source pointer validity.** The registry stores source pointers but
@@ -1651,6 +1807,6 @@ The two approaches are complementary.
16511807
16521808New feature, not a breaking change. Phased rollout:
16531809
1654- - ** Phase 1 (this RFC):** Registry entities, store, REST API, SDK, CLI, UI, and ` mlflow skills pull ` .
1655- - ** Phase 2 (RFC-0006):** Harness-specific ` mlflow skills install ` for Claude Code, Codex CLI, and Cursor.
1656- - ** Phase 3 (follow-up):** Trace integration and usage analytics, install count tracking, cross-workspace export/import (following cross-registry patterns), and shared base extraction with the MCP registry.
1810+ - ** Phase 1 (this RFC):** Registry entities, store, REST API, SDK, CLI, UI, ` mlflow skills pull ` , and ` mlflow.skill_context() ` for trace integration .
1811+ - ** Phase 2 (RFC-0006):** Harness-specific ` mlflow skills install ` for Claude Code, Codex CLI, and Cursor. Automatic ` skill_context() ` wrapping in harness-specific autologgers.
1812+ - ** Phase 3 (follow-up):** Usage analytics dashboards , install count tracking, cross-workspace export/import (following cross-registry patterns), and shared base extraction with the MCP registry.
0 commit comments