# MCP Server Eval Action

A reusable GitHub Action to run LLM-as-judge evaluations against any MCP server.

## Usage

```yaml
- uses: mcp-use/eval-action@v1
  with:
    server_config: '{"command": "python", "args": ["-m", "my_mcp_server", "--transport", "stdio"]}'
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}
```

### Remote server

```yaml
- uses: mcp-use/eval-action@v1
  with:
    server_config: '{"url": "https://my-server.example.com/mcp"}'
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}
```

### With environment variables for the server

```yaml
- uses: mcp-use/eval-action@v1
  with:
    server_config: |
      {
        "command": "python",
        "args": ["-m", "my_mcp_server", "--transport", "stdio"],
        "env": {
          "MONGO_PASSWORD": "${{ secrets.MONGO_PASSWORD }}",
          "USE_FAKE_DB": "false"
        }
      }
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}
```

### Post results as PR comment

```yaml
- uses: mcp-use/eval-action@v1
  id: evals
  with:
    server_config: '{"command": "python", "args": ["-m", "my_server", "--transport", "stdio"]}'
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}

- uses: marocchino/sticky-pull-request-comment@v2
  if: always() && github.event_name == 'pull_request'
  with:
    header: mcp-evals
    path: ${{ steps.evals.outputs.report_md }}

- run: cat ${{ steps.evals.outputs.report_md }} >> "$GITHUB_STEP_SUMMARY"
  if: always()
```

## Inputs

| Input | Required | Description |
| --- | --- | --- |
| `server_config` | Yes | MCP server config as JSON (`{"command": ...}` or `{"url": ...}`) |
| `eval_cases` | Yes | Path to `eval_cases.yaml` |
| `openrouter_api_key` | Yes | OpenRouter API key for the agent and judge LLM |
| `filter` | No | Only run cases whose `id` contains this substring |
| `max_steps` | No | Maximum agent steps per case (default: `30`) |
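The optional inputs combine with the required ones in the same step. A sketch (the `filter` substring and step cap are illustrative values):

```yaml
- uses: mcp-use/eval-action@v1
  with:
    server_config: '{"url": "https://my-server.example.com/mcp"}'
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}
    # Only run eval cases whose id contains "search" (illustrative substring)
    filter: "search"
    # Cap the agent at 15 steps per case instead of the default 30
    max_steps: 15
```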

## Outputs

| Output | Description |
| --- | --- |
| `results_json` | Path to the eval results JSON file |
| `report_md` | Path to the Markdown report file |
| `passed` | `true` if all evals passed, `false` otherwise |
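The `passed` output can gate a follow-up step, for example to fail the job while attaching the raw results to the log. A sketch, assuming the action step has `id: evals`:

```yaml
- name: Fail if any eval did not pass
  if: steps.evals.outputs.passed != 'true'
  run: |
    cat "${{ steps.evals.outputs.results_json }}"
    exit 1
```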

## Eval cases YAML format

```yaml
# Model used by the LLM judge to score responses
judge_model: openai/gpt-4o-mini

# Models to evaluate the agent with (OpenRouter format)
models:
  - anthropic/claude-sonnet-4
  - openai/gpt-4o-mini

# System prompts — each case runs once per prompt
system_prompts:
  neutral: "You are a helpful assistant."
  domain: "You are a domain expert. Use the available tools."

# Eval cases
cases:
  - id: my_test_case
    prompt: "Ask the agent something"
    rubric: |
      The response should contain relevant information.
      The response should be well-structured.
    threshold: 0.7
```
