# MCP Server Eval Action

A reusable GitHub Action to run LLM-as-judge evaluations against any MCP server.

## Usage

```yaml
- uses: mcp-use/eval-action@v1
  with:
    server_config: '{"command": "python", "args": ["-m", "my_mcp_server", "--transport", "stdio"]}'
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}
```

### Remote server

```yaml
- uses: mcp-use/eval-action@v1
  with:
    server_config: '{"url": "https://my-server.example.com/mcp"}'
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}
```

### With environment variables for the server

```yaml
- uses: mcp-use/eval-action@v1
  with:
    server_config: |
      {
        "command": "python",
        "args": ["-m", "my_mcp_server", "--transport", "stdio"],
        "env": {
          "MONGO_PASSWORD": "${{ secrets.MONGO_PASSWORD }}",
          "USE_FAKE_DB": "false"
        }
      }
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}
```

### Post results as PR comment

```yaml
- uses: mcp-use/eval-action@v1
  id: evals
  with:
    server_config: '{"command": "python", "args": ["-m", "my_server", "--transport", "stdio"]}'
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}

- uses: marocchino/sticky-pull-request-comment@v2
  if: always() && github.event_name == 'pull_request'
  with:
    header: mcp-evals
    path: ${{ steps.evals.outputs.report_md }}

- run: cat ${{ steps.evals.outputs.report_md }} >> "$GITHUB_STEP_SUMMARY"
  if: always()
```

## Inputs

| Input | Required | Description |
| --- | --- | --- |
| `server_config` | Yes | MCP server config as JSON (`{"command": ...}` or `{"url": ...}`) |
| `eval_cases` | Yes | Path to `eval_cases.yaml` |
| `openrouter_api_key` | Yes | OpenRouter API key for the agent and judge LLM |
| `filter` | No | Only run cases whose `id` contains this substring |
| `max_steps` | No | Maximum agent steps per case (default: `30`) |
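The optional inputs combine with the required ones in the same step. A sketch (the `filter` substring and step cap are illustrative values):

```yaml
- uses: mcp-use/eval-action@v1
  with:
    server_config: '{"url": "https://my-server.example.com/mcp"}'
    eval_cases: "evals/eval_cases.yaml"
    openrouter_api_key: ${{ secrets.OPENROUTER_API_KEY }}
    # Only run eval cases whose id contains "search" (illustrative substring)
    filter: "search"
    # Cap the agent at 15 steps per case instead of the default 30
    max_steps: 15
```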

## Outputs

| Output | Description |
| --- | --- |
| `results_json` | Path to the eval results JSON file |
| `report_md` | Path to the Markdown report file |
| `passed` | `true` if all evals passed, `false` otherwise |
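The `passed` output can gate a follow-up step, for example to fail the job while attaching the raw results to the log. A sketch, assuming the action step has `id: evals`:

```yaml
- name: Fail if any eval did not pass
  if: steps.evals.outputs.passed != 'true'
  run: |
    cat "${{ steps.evals.outputs.results_json }}"
    exit 1
```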

## Eval cases YAML format

```yaml
# Model used by the LLM judge to score responses
judge_model: openai/gpt-4o-mini

# Models to evaluate the agent with (OpenRouter format)
models:
  - anthropic/claude-sonnet-4
  - openai/gpt-4o-mini

# System prompts — each case runs once per prompt
system_prompts:
  neutral: "You are a helpful assistant."
  domain: "You are a domain expert. Use the available tools."

# Eval cases
cases:
  - id: my_test_case
    prompt: "Ask the agent something"
    rubric: |
      The response should contain relevant information.
      The response should be well-structured.
    threshold: 0.7
```
