Dev/steven/parallel eval #43
Conversation
Pull Request Overview
This PR implements a benchmark mode for the evaluation tool to achieve parity with the Python version. The changes introduce parallel model benchmarking, latency testing, advanced metrics calculation, and visualization capabilities.
- Adds benchmark mode with parallel model evaluation and chunking support
- Implements advanced metrics (ROC AUC, precision at recall thresholds) for model comparison
- Introduces latency testing infrastructure with configurable iteration counts
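The chunking and parallel-evaluation behavior described above can be sketched roughly as follows. This is an illustrative sketch only; `chunk` and `evaluateModelsInParallel` are hypothetical names, not the actual exports of `src/evals/guardrail-evals.ts`:

```typescript
// Hypothetical sketch: split work into fixed-size chunks, then evaluate
// each batch of models concurrently, bounded by a parallelism limit.
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

async function evaluateModelsInParallel<R>(
  models: string[],
  evaluate: (model: string) => Promise<R>,
  maxParallel: number,
): Promise<R[]> {
  const results: R[] = [];
  // Process models in batches no larger than maxParallel, preserving order.
  for (const batch of chunk(models, maxParallel)) {
    results.push(...(await Promise.all(batch.map(evaluate))));
  }
  return results;
}
```

Batching with `Promise.all` is the simplest way to cap concurrency; a worker-pool approach would keep all slots busy but adds complexity.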
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| src/evals/guardrail-evals.ts | Main evaluation runner with new benchmark mode, parallel model evaluation, and multi-stage support |
| src/evals/core/benchmark-calculator.ts | Advanced metrics calculator for ROC AUC, precision-recall curves, and FPR-based recall |
| src/evals/core/benchmark-reporter.ts | Benchmark results reporter with structured output and summary tables |
| src/evals/core/latency-tester.ts | Latency testing module for measuring guardrail performance |
| src/evals/core/visualizer.ts | Stub implementation for benchmark visualization (placeholder) |
| src/evals/core/index.ts | Export updates for new benchmark modules |
| src/evals/core/async-engine.ts | Minor refactoring to use totalSamples variable |
| src/cli.ts | CLI argument parsing for benchmark mode, stages, models, and API configuration |
| src/tests/unit/evals/guardrail-evals.test.ts | Unit tests for chunking and parallel model limit logic |
| docs/evals.md | Documentation updates for new benchmark features |
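The ROC AUC metric handled by `benchmark-calculator.ts` can be computed without building the full ROC curve, via the rank-sum (Mann-Whitney U) formulation. The sketch below is an assumption about one reasonable implementation, not the actual code in this PR:

```typescript
// Hedged sketch: ROC AUC via average ranks of positive-class scores.
// labels are 0/1; scores are the model's confidence for the positive class.
function rocAuc(labels: number[], scores: number[]): number {
  const pairs = labels.map((y, i) => ({ y, s: scores[i] }));
  pairs.sort((a, b) => a.s - b.s);

  // Assign 1-based ranks, averaging over tied scores.
  const ranks = new Array<number>(pairs.length).fill(0);
  let i = 0;
  while (i < pairs.length) {
    let j = i;
    while (j + 1 < pairs.length && pairs[j + 1].s === pairs[i].s) j++;
    const avgRank = (i + j + 2) / 2;
    for (let k = i; k <= j; k++) ranks[k] = avgRank;
    i = j + 1;
  }

  const nPos = pairs.filter((p) => p.y === 1).length;
  const nNeg = pairs.length - nPos;
  if (nPos === 0 || nNeg === 0) return NaN; // AUC undefined with one class

  const rankSumPos = pairs.reduce(
    (acc, p, k) => acc + (p.y === 1 ? ranks[k] : 0),
    0,
  );
  return (rankSumPos - (nPos * (nPos + 1)) / 2) / (nPos * nNeg);
}
```

The rank-based form equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as half.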
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Updating eval tool
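The latency-testing infrastructure with configurable iteration counts mentioned in the PR overview could look roughly like the sketch below. `measureLatency` and its return shape are hypothetical, not the actual API of `src/evals/core/latency-tester.ts`:

```typescript
// Hedged sketch: time a guardrail call over a configurable number of
// iterations and report mean and p95 latency in milliseconds.
async function measureLatency(
  run: () => Promise<void>,
  iterations: number,
): Promise<{ meanMs: number; p95Ms: number }> {
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    await run();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  const meanMs = samples.reduce((a, b) => a + b, 0) / samples.length;
  // Nearest-rank p95; clamp the index for small sample counts.
  const p95Ms =
    samples[Math.min(samples.length - 1, Math.floor(0.95 * samples.length))];
  return { meanMs, p95Ms };
}
```

Reporting a percentile alongside the mean matters for guardrails on the request path, where tail latency dominates user-visible delay.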