Pre-submission Checklist
- I have verified this would not be more appropriate as a feature request in a specific repository
- I have searched existing discussions to avoid duplicates
Your Idea
Problem
MCP servers expose tools but have no standard way to describe how those tools should behave, how well an LLM should invoke them, or whether a full task can be completed with them. Without a protocol-level definition, every client invents its own eval format, and server authors cannot write evals once and have them run everywhere.
The goal is to let MCP servers optionally ship a suite of evaluations alongside their tools — analogous to a library shipping unit tests — so that clients can validate tool-use correctness, tool output quality, and end-to-end task completion against any frontier model.
Goals
In scope:
- Capability declaration
- evals/list discovery endpoint (parallel to tools/list, resources/list, prompts/list)
- Eval object schema (level, grading type, input, expected)
- EvalResult shape (local to client)
Capability
Servers that expose evals declare the evals capability:
```json
{ "capabilities": { "evals": { "listChanged": true } } }
```
listChanged mirrors tools.listChanged — servers that update their eval suite can emit notifications/evals/list_changed. No client capability is required to call evals/list.
Protocol Messages
evals/list
Request:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "evals/list",
  "params": { "cursor": "optional-cursor", "level": "invocation" }
}
```
level is an optional filter. Clients that only want to run a subset of evals (e.g., only invocation evals in a fast pre-flight check) can filter at the source.
Response:
```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "evals": [],
    "nextCursor": "optional-next-cursor"
  }
}
```
The evals field is an Eval[] array (see Data Model). Pagination follows the same cursor model as every other MCP list primitive.
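As a non-normative illustration of the client side, here is a minimal sketch of paginating evals/list with the optional level filter. The sendRequest helper and ListEvalsResult shape are assumptions standing in for whatever JSON-RPC transport the client already uses; Eval is the type defined under Data Model below.

```typescript
// Hypothetical sketch: drain evals/list pages, optionally filtered by level.
// `sendRequest` is an assumed transport helper, not part of any SDK.
type EvalLevel = "invocation" | "execution" | "scenario";

interface ListEvalsResult {
  evals: Eval[];        // Eval as defined in the Data Model section
  nextCursor?: string;
}

async function listAllEvals(
  sendRequest: (method: string, params: unknown) => Promise<ListEvalsResult>,
  level?: EvalLevel
): Promise<Eval[]> {
  const all: Eval[] = [];
  let cursor: string | undefined;
  do {
    const page = await sendRequest("evals/list", { cursor, level });
    all.push(...page.evals);
    cursor = page.nextCursor;
  } while (cursor !== undefined);
  return all;
}
```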
Data Model
Eval
```typescript
interface Eval {
  id: string;            // stable identifier
  name: string;          // human-readable label
  description?: string;  // what behavior this eval verifies
  gradingType: "exact-match" | "llm-as-judge";
  input: InvocationInput | ExecutionInput | ScenarioInput;
  expected: ExactMatchExpected | LLMJudgeExpected;
}
```
Input types (discriminated by type, which also encodes the level)
```typescript
// level: invocation
// LLM receives messages + server's tools/list; must call the right tool
interface InvocationInput {
  type: "invocation";
  messages: SamplingMessage[];
}

// level: execution
// Tool is called directly with given arguments; output is checked
interface ExecutionInput {
  type: "execution";
  toolName: string;
  arguments: { [key: string]: unknown };
}

// level: scenario
// Full agent loop runs from messages; goal completion is assessed
interface ScenarioInput {
  type: "scenario";
  messages: SamplingMessage[];
  maxTurns?: number; // if omitted, client determines a reasonable default; spec does not mandate a value
}
```
input.type serves as the level discriminant, following MCP's existing Content pattern (TextContent | ImageContent | ...). A separate top-level level field is redundant and omitted.
Expected types (discriminated by type, matching gradingType)
```typescript
interface ExactMatchExpected {
  type: "exact-match";
  toolName?: string;                      // invocation evals: expected tool to call
  arguments?: { [key: string]: unknown }; // invocation evals: partial match — only specified keys checked
  content?: unknown;                      // execution evals: expected tool result content (JSON deep equality)
}
// Which fields are meaningful is determined by input.type:
// - input.type === "invocation": use toolName + arguments
// - input.type === "execution": use content

interface LLMJudgeExpected {
  type: "llm-as-judge";
  rubric: string; // natural language criteria for the judge
}
```
Constraint: scenario + exact-match is explicitly invalid. Multi-turn agent trajectories are non-deterministic; exact-match grading cannot apply.
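To make the valid combinations concrete, here are hypothetical fixtures, one per level. The ids, tool names, messages, and rubrics are invented for illustration only; the shapes follow the interfaces above.

```typescript
// Hypothetical example evals (one per level); names and tools are made up.
const invocationEval: Eval = {
  id: "weather.invoke.basic",
  name: "Picks the forecast tool for a weather question",
  gradingType: "exact-match",
  input: {
    type: "invocation",
    messages: [{ role: "user", content: { type: "text", text: "Will it rain in Paris tomorrow?" } }],
  },
  expected: { type: "exact-match", toolName: "get_forecast", arguments: { city: "Paris" } },
};

const executionEval: Eval = {
  id: "weather.exec.structure",
  name: "Forecast tool returns structured output",
  gradingType: "llm-as-judge",
  input: { type: "execution", toolName: "get_forecast", arguments: { city: "Paris" } },
  expected: { type: "llm-as-judge", rubric: "The result contains a temperature and a precipitation probability." },
};

const scenarioEval: Eval = {
  id: "weather.scenario.trip",
  name: "Compares forecasts across cities",
  gradingType: "llm-as-judge",
  input: {
    type: "scenario",
    messages: [{ role: "user", content: { type: "text", text: "Pick the drier of Paris or Lyon for Saturday." } }],
    maxTurns: 6,
  },
  expected: { type: "llm-as-judge", rubric: "The assistant fetches forecasts for both cities and recommends the drier one." },
};
// A scenario eval with gradingType "exact-match" would violate the constraint above.
```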
Execution Semantics
| Level | Grading | Client behavior |
| --- | --- | --- |
| invocation | exact-match | Send messages to LLM with server's tools in context. Pass if model calls `toolName` and all specified `arguments` keys match. |
| invocation | llm-as-judge | Same execution. Judge receives model response + rubric, returns pass/fail + reason. |
| execution | exact-match | Call `tools/call` with given arguments. Pass if result `content` deeply equals expected. |
| execution | llm-as-judge | Same call. Judge receives tool result + rubric. |
| scenario | llm-as-judge | Run full agent loop up to `maxTurns`. Judge receives full transcript + rubric. |
Tool list for invocation evals: the client fetches the server's current tools via tools/list before running — tools are not embedded in the eval. This keeps evals decoupled from specific tool schema versions.
Side effects: execution-level evals call real tools against the live server. The spec SHOULD recommend that server authors avoid placing destructive or irreversible operations in execution evals without explicit documentation. Clients SHOULD surface this to users before running.
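As a rough, non-normative sketch of the two exact-match rows in the table above: deepEqual and ObservedToolCall are assumed helpers (the spec does not define them), and how the client extracts the model's tool call or the tool result is left out.

```typescript
// Sketch of exact-match grading. `deepEqual` is an assumed structural-equality helper.
declare function deepEqual(a: unknown, b: unknown): boolean;

interface ObservedToolCall {
  toolName: string;
  arguments: { [key: string]: unknown };
}

// invocation + exact-match: partial match, only keys present in expected.arguments are checked.
function gradeInvocation(observed: ObservedToolCall | undefined, expected: ExactMatchExpected): boolean {
  if (!observed) return false;
  if (expected.toolName !== undefined && observed.toolName !== expected.toolName) return false;
  for (const [key, value] of Object.entries(expected.arguments ?? {})) {
    if (!deepEqual(observed.arguments[key], value)) return false;
  }
  return true;
}

// execution + exact-match: JSON deep equality between the tools/call result content and the fixture.
function gradeExecution(resultContent: unknown, expected: ExactMatchExpected): boolean {
  return deepEqual(resultContent, expected.content);
}
```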
EvalResult (local to client)
```typescript
interface EvalResult {
  evalId: string;
  passed: boolean;
  reason?: string;   // populated by llm-as-judge; optional for exact-match failures
  durationMs: number;
}
```
Results are not sent to the server.
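For illustration only, a client might aggregate these local records into a summary for display; the summary shape below is an assumption, not part of the proposal.

```typescript
// Sketch: summarize a local run of eval results (nothing is reported to the server).
function summarize(results: EvalResult[]): { passed: number; failed: number; totalMs: number } {
  let passed = 0;
  let totalMs = 0;
  for (const r of results) {
    if (r.passed) passed++;
    totalMs += r.durationMs;
  }
  return { passed, failed: results.length - passed, totalMs };
}
```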
Backward Compatibility
No breaking changes. Servers without the evals capability are unaffected. Clients that do not understand the evals capability ignore it. The new methods and types are purely additive.
Security Considerations
execution evals invoke real tools — clients MUST NOT run evals in untrusted contexts without user acknowledgment.
llm-as-judge results are model-dependent. Pass/fail is meaningful only relative to the model used; the spec must not imply universal pass/fail semantics.
Evals may embed sensitive expected outputs (e.g., API responses with real data). Server authors SHOULD avoid embedding sensitive data in eval fixtures.