-
-
Notifications
You must be signed in to change notification settings - Fork 800
feat(eval): add dynamic timeout defaults #6344
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add EVAL_TIMEOUT_MULTIPLIER constant set to 5x REQUEST_TIMEOUT_MS - Update getEvalTimeoutMs() to calculate dynamic default (25 min) - Allow PROMPTFOO_EVAL_TIMEOUT_MS=0 to explicitly disable timeout - Fix evaluator to use undefined check for timeoutMs option The 5x multiplier (25 min default) accommodates: - Multiple retry cycles (default maxRetries=4) - Multi-turn redteam strategies (iterative, GOAT, Crescendo) - Model-graded assertions after provider calls 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Add calculateDefaultMaxEvalTimeMs() and getDefaultMaxEvalTimeMs() functions that compute reasonable max eval time based on: - Total eval steps and concurrency (determines batch count) - Expected time per test (1 min regular, 5 min redteam) - Safety multiplier (3x regular, 2x redteam) - Per-test timeout (caps worst case) Example calculations (concurrency=4, 25 min test timeout): - 50 regular tests: ~40 min max - 50 redteam tests: ~2.2 hr max - 100 redteam tests: ~4.2 hr max This provides a "swiss cheese" safety layer to prevent evals from running indefinitely while still allowing legitimate long-running evals. Set PROMPTFOO_MAX_EVAL_TIME_MS=0 to explicitly disable. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Calculate estimated total eval steps from testSuite structure - Use getDefaultMaxEvalTimeMs for dynamic timeout calculation - Account for tests, scenarios, providers, prompts, and repeat count - Pass isRedteam flag for longer redteam eval timeouts
- Add tests for getEvalTimeoutMs with 5x multiplier behavior - Add tests for calculateDefaultMaxEvalTimeMs calculation logic - Add tests for getDefaultMaxEvalTimeMs wrapper function - Verify constants for redteam vs regular eval timeouts - Test edge cases: zero concurrency, explicit 0 timeout, cliState priority
- Add section explaining layered timeout protection defaults - Document per-test timeout (5× REQUEST_TIMEOUT_MS by default) - Document auto-calculated total evaluation timeout - Explain how to disable or customize timeouts
|
⏩ No test execution environment matched (1ac0a84) View output ↗ View check history
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR adds dynamic timeout defaults for evaluation runs. I reviewed the changes for LLM security vulnerabilities (prompt injection, secrets/PII in prompts, excessive agency) and found no security concerns.
Minimum severity threshold for this scan: Medium
|
👍 All Clear No LLM security vulnerabilities were found in this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const isRedteam = testSuite.redteam != null; | ||
|
|
||
| const maxEvalTimeMs = | ||
| options.maxEvalTimeMs ?? | ||
| getDefaultMaxEvalTimeMs(estimatedTotalSteps, maxConcurrency, testCaseTimeoutMs, isRedteam); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honor options.timeoutMs when deriving global eval timeout
The new auto max-eval timeout is calculated with getDefaultMaxEvalTimeMs(...), but the per-test timeout fed into it is always getEvalTimeoutMs(); any caller-supplied options.timeoutMs (including 0 to disable per-test timeouts or a shorter custom limit) is ignored. When maxEvalTimeMs is left unset and timeoutMs is provided, the global timer is computed from the default 5×REQUEST_TIMEOUT_MS instead of the actual per-test timeout being enforced, so runs that were intentionally unlimited or tightly capped will hit an unexpected global timeout or be under-protected. The calculation should use the effective per-test timeout that _runEvaluation applies (respecting options.timeoutMs).
Useful? React with 👍 / 👎.
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughThis pull request implements a dynamic timeout defaults framework for evaluations. It introduces per-test timeout configuration (defaulting to 5× REQUEST_TIMEOUT_MS, or 25 minutes) and auto-calculates total evaluation timeouts based on eval batch count, concurrency level, and red-team detection status. New constants define expected test durations, safety multipliers, and buffer times. The implementation includes functions to compute default timeouts while respecting explicit environment variable overrides and allowing timeouts to be set to 0 to disable. Changes span core logic, configuration handling, documentation, and test coverage. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20–30 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
CHANGELOG.md (2)
68-72: Add documentation bullet for timeout config.The PR updates troubleshooting/usage docs for new timeout behavior; add an explicit Documentation entry referencing the env vars and PR.
Apply:
### Documentation - - - docs(providers): add comprehensive Anthropic structured outputs documentation covering JSON outputs (`output_format`) and strict tool use (`strict: true`), including usage examples, schema limitations, and feature compatibility (#6226) + - docs(config): document PROMPTFOO_EVAL_TIMEOUT_MS and PROMPTFOO_MAX_EVAL_TIME_MS defaults, overrides, and disabling guidance; add troubleshooting for long‑running evals (#6344) + - docs(providers): add comprehensive Anthropic structured outputs documentation covering JSON outputs (`output_format`) and strict tool use (`strict: true`), including usage examples, schema limitations, and feature compatibility (#6226)As per coding guidelines, documentation updates accompanying user‑facing changes must be reflected in CHANGELOG under Unreleased.
74-78: Add tests bullet for timeout calculations and precedence.The PR adds unit tests; reflect them under Tests with the correct scope and PR number.
Apply:
### Tests - - - test(examples): add structured outputs example demonstrating JSON schema-based responses and strict tool use for Anthropic provider (#6226) + - test(eval): add unit tests for getEvalTimeoutMs (5× behavior), calculateDefaultMaxEvalTimeMs, getDefaultMaxEvalTimeMs, env/config precedence, and 0‑disables semantics (#6344) + - test(examples): add structured outputs example demonstrating JSON schema-based responses and strict tool use for Anthropic provider (#6226)All test changes should be captured under Unreleased/Tests with PR numbers and clear scope.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
CHANGELOG.md(1 hunks)site/docs/usage/troubleshooting.md(1 hunks)src/envars.ts(2 hunks)src/evaluator.ts(3 hunks)test/envars.test.ts(2 hunks)
🧰 Additional context used
📓 Path-based instructions (9)
site/docs/**/*.{md,mdx}
📄 CodeRabbit inference engine (site/docs/CLAUDE.md)
site/docs/**/*.{md,mdx}: Use 'eval' terminology instead of 'evaluation' in documentation
Use 'Promptfoo' (capitalized) at the start of sentences and in headings; use 'promptfoo' (lowercase) in code, commands, and package names
Include required front matter with 'title' and 'description' fields in all documentation files
Only include 'title=' attribute in code blocks if they represent complete, runnable files
Add empty lines around admonition content to comply with Prettier formatting requirements
Use imperative mood in documentation instructions (e.g., 'Install the package' instead of 'Installing the package')
Never use embellishment words when describing the promptfoo system (e.g., avoid 'sophisticated')
Files:
site/docs/usage/troubleshooting.md
site/**
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
If the change is a feature, update the relevant documentation under
site/
Files:
site/docs/usage/troubleshooting.md
test/**/*.test.ts
📄 CodeRabbit inference engine (test/CLAUDE.md)
test/**/*.test.ts: Always use both--coverageand--randomizeflags when running Jest tests:npm test -- --coverage --randomize
Never increase test timeouts - fix the slow test instead
Never use.only()or.skip()in committed code
Always include mock cleanup in test files usingafterEach(() => { jest.resetAllMocks(); })to prevent test pollution
Test entire objects using.toEqual()rather than testing individual fields with separate assertions
Mock minimally - only mock external dependencies (APIs, databases), not code under test
Mirror thesrc/directory structure in test files:test/providers/mirrorssrc/providers/,test/redteam/mirrorssrc/redteam/, etc.
Use@jest/globalsimports in Jest test files
Files:
test/envars.test.ts
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
**/*.{ts,tsx}: Ensure TypeScript compilation passes by runningnpm run buildornpx tscfrom the root to verify TypeScript is valid
Prefer not to introduce new TypeScript types; use existing interfaces whenever possibleUse TypeScript with strict type checking
Files:
test/envars.test.tssrc/envars.tssrc/evaluator.ts
**/*.test.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Avoid disabling or skipping tests unless absolutely necessary and documented
**/*.test.{ts,tsx,js,jsx}: Follow Jest best practices with describe/it blocks
Test both success and error cases for all functionality
Files:
test/envars.test.ts
**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
**/*.{ts,tsx,js,jsx}: Follow consistent import order (Biome will handle import sorting)
Use consistent curly braces for all control statements
Prefer const over let; avoid var
Use object shorthand syntax whenever possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks
**/*.{ts,tsx,js,jsx}: Follow consistent import order (Biome will handle import sorting)
Use consistent curly braces for all control statements
Prefer const over let; avoid var
Use object shorthand syntax whenever possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks
Always sanitize sensitive data before logging to prevent exposing secrets, API keys, passwords, and other credentials in logs
Files:
test/envars.test.tssrc/envars.tssrc/evaluator.ts
**/test/**/*.{ts,tsx,js,jsx}
📄 CodeRabbit inference engine (AGENTS.md)
Follow Jest best practices with describe/it blocks
Files:
test/envars.test.ts
src/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Use Drizzle ORM for database operations
Files:
src/envars.tssrc/evaluator.ts
CHANGELOG.md
📄 CodeRabbit inference engine (.cursor/rules/changelog.mdc)
CHANGELOG.md: All user-facing changes must be documented in CHANGELOG.md. Every PR must update the changelog, with exceptions only for PRs labeled withno-changelogordependencies.
Follow Keep a Changelog format with sections for: Added, Changed, Fixed, Dependencies, Documentation, Tests, and Removed. Each entry must include PR number, conventional commit prefix, scope, and be concise and user-focused.
Each changelog entry must include PR number in format (#XXXX), use conventional commit prefix (feat:, fix:, chore:, docs:, test:, refactor:), include scope in parentheses, and describe changes from a user perspective in one line.
Update CHANGELOG.md for ALL changes including: new features, bug fixes, breaking changes, API changes, provider updates, configuration changes, performance improvements, deprecated features, dependency updates, test changes, CI/CD changes, build configuration changes, code style changes, documentation updates, refactors, and improvements.
Use common scopes for consistency: providers, evaluator, webui/app, cli, redteam, core, assertions, config, and database.
CHANGELOG.md: All user-facing changes require a CHANGELOG.md entry before creating a PR, added under## [Unreleased]in appropriate categories with PR number format(#XXXX)
Use Conventional Commits prefix in CHANGELOG entries matching the change type (feat:,fix:,chore:,docs:,test:,refactor:) with concise, user-focused descriptionsAll user-facing changes must be documented in CHANGELOG.md
Files:
CHANGELOG.md
🧠 Learnings (16)
📚 Learning: 2025-11-24T18:17:17.843Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/jest.mdc:0-0
Timestamp: 2025-11-24T18:17:17.843Z
Learning: Applies to test/**/*.test.ts,test/**/*.spec.ts : Never increase the function timeout - fix the test instead when dealing with timing issues
Applied to files:
site/docs/usage/troubleshooting.mdtest/envars.test.tssrc/envars.tssrc/evaluator.ts
📚 Learning: 2025-11-24T18:16:21.230Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/docusaurus.mdc:0-0
Timestamp: 2025-11-24T18:16:21.230Z
Learning: Applies to site/docs/**/*.{md,mdx},site/blog/**/*.{md,mdx},site/src/pages/**/*.{md,mdx} : Use 'eval' instead of 'evaluation' in all documentation. When referring to command line usage, use `npx promptfoo eval` rather than `npx promptfoo evaluation`. Maintain consistency across all examples, code blocks, and explanations.
Applied to files:
site/docs/usage/troubleshooting.mdsrc/envars.ts
📚 Learning: 2025-11-24T18:15:41.142Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: test/CLAUDE.md:0-0
Timestamp: 2025-11-24T18:15:41.142Z
Learning: Applies to test/**/*.test.ts : Never increase test timeouts - fix the slow test instead
Applied to files:
site/docs/usage/troubleshooting.mdtest/envars.test.tssrc/envars.tssrc/evaluator.ts
📚 Learning: 2025-11-24T18:13:58.059Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: examples/CLAUDE.md:0-0
Timestamp: 2025-11-24T18:13:58.059Z
Learning: Applies to examples/**/promptfooconfig.yaml : Keep description field in promptfooconfig.yaml SHORT (3-10 words)
Applied to files:
site/docs/usage/troubleshooting.md
📚 Learning: 2025-11-24T18:17:17.843Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/jest.mdc:0-0
Timestamp: 2025-11-24T18:17:17.843Z
Learning: Applies to test/**/*.test.ts,test/**/*.spec.ts : Ensure test failures are deterministic and do not depend on external state
Applied to files:
test/envars.test.ts
📚 Learning: 2025-11-24T18:15:23.644Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/redteam/CLAUDE.md:0-0
Timestamp: 2025-11-24T18:15:23.644Z
Learning: Applies to src/redteam/test/redteam/**/*.ts : Add tests for new plugins in test/redteam/ directory following the reference pattern in src/redteam/plugins/pii.ts
Applied to files:
test/envars.test.ts
📚 Learning: 2025-11-24T18:17:17.843Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/jest.mdc:0-0
Timestamp: 2025-11-24T18:17:17.843Z
Learning: Applies to test/**/*.test.ts,test/**/*.spec.ts : Test error handling and edge cases including rate limits and timeouts for providers
Applied to files:
test/envars.test.tssrc/envars.ts
📚 Learning: 2025-11-24T18:17:17.843Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/jest.mdc:0-0
Timestamp: 2025-11-24T18:17:17.843Z
Learning: Applies to test/**/*.test.ts,test/**/*.spec.ts : Ensure all tests are independent and can run in any order
Applied to files:
test/envars.test.ts
📚 Learning: 2025-11-24T18:17:17.843Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/jest.mdc:0-0
Timestamp: 2025-11-24T18:17:17.843Z
Learning: Applies to test/**/*.test.ts,test/**/*.spec.ts : Run the full test suite before committing changes
Applied to files:
test/envars.test.ts
📚 Learning: 2025-11-24T18:17:17.843Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/jest.mdc:0-0
Timestamp: 2025-11-24T18:17:17.843Z
Learning: Applies to test/**/*.test.ts,test/**/*.spec.ts : Run tests with `--randomize` flag to ensure mocks setup and teardown don't affect other tests
Applied to files:
test/envars.test.ts
📚 Learning: 2025-11-24T18:14:47.318Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/app/CLAUDE.md:0-0
Timestamp: 2025-11-24T18:14:47.318Z
Learning: Applies to src/app/**/*.test.{ts,tsx} : Use Vitest for testing (not Jest). Vitest is configured in vite.config.ts with jsdom environment and globals enabled
Applied to files:
test/envars.test.ts
📚 Learning: 2025-11-24T18:15:41.142Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: test/CLAUDE.md:0-0
Timestamp: 2025-11-24T18:15:41.142Z
Learning: Applies to test/**/*.test.ts : Use `jest/globals` imports in Jest test files
Applied to files:
test/envars.test.ts
📚 Learning: 2025-11-24T18:15:03.592Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: src/commands/CLAUDE.md:0-0
Timestamp: 2025-11-24T18:15:03.592Z
Learning: Applies to src/commands/**/*.ts : Import frequently used utilities at the top of command files: `logger`, `telemetry`, `setupEnv`, `printBorder`, `loadDefaultConfig`, `resolveConfigs`
Applied to files:
test/envars.test.tssrc/evaluator.ts
📚 Learning: 2025-11-24T18:15:41.142Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: test/CLAUDE.md:0-0
Timestamp: 2025-11-24T18:15:41.142Z
Learning: Applies to test/providers/**/*.test.ts : Every provider test must include: success case (normal API response), error cases (4xx, 5xx, rate limits), configuration validation, and token usage tracking
Applied to files:
test/envars.test.ts
📚 Learning: 2025-11-24T18:14:13.123Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: site/docs/CLAUDE.md:0-0
Timestamp: 2025-11-24T18:14:13.123Z
Learning: Applies to site/docs/**/*.{md,mdx} : Use 'eval' terminology instead of 'evaluation' in documentation
Applied to files:
src/envars.ts
📚 Learning: 2025-11-24T18:15:52.934Z
Learnt from: CR
Repo: promptfoo/promptfoo PR: 0
File: .cursor/rules/changelog.mdc:0-0
Timestamp: 2025-11-24T18:15:52.934Z
Learning: Applies to CHANGELOG.md : Update CHANGELOG.md for ALL changes including: new features, bug fixes, breaking changes, API changes, provider updates, configuration changes, performance improvements, deprecated features, dependency updates, test changes, CI/CD changes, build configuration changes, code style changes, documentation updates, refactors, and improvements.
Applied to files:
CHANGELOG.md
🧬 Code graph analysis (1)
test/envars.test.ts (1)
src/envars.ts (9)
getEvalTimeoutMs(469-485)EVAL_TIMEOUT_MULTIPLIER(454-454)calculateDefaultMaxEvalTimeMs(543-569)EXPECTED_TIME_PER_REDTEAM_TEST_MS(507-507)EXPECTED_TIME_PER_TEST_MS(501-501)MAX_EVAL_TIME_REDTEAM_SAFETY_MULTIPLIER(519-519)MAX_EVAL_TIME_SAFETY_MULTIPLIER(513-513)MAX_EVAL_TIME_BUFFER_MS(525-525)getDefaultMaxEvalTimeMs(586-605)
🪛 LanguageTool
site/docs/usage/troubleshooting.md
[grammar] ~270-~270: Ensure spelling is correct
Context: ...cycles and multi-turn strategies (e.g., redteam iterative, GOAT, Crescendo) - Set to `0...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~275-~275: Ensure spelling is correct
Context: ...s size, concurrency, and whether it's a redteam eval - Acts as a safety net to prevent ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~275-~275: Ensure spelling is correct
Context: ...urrency, and whether it's a redteam eval - Acts as a safety net to prevent evaluati...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: Redteam (Staging API)
- GitHub Check: Share Test
- GitHub Check: Redteam (Production API)
- GitHub Check: webui tests
- GitHub Check: Build Docs
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Test on Node 24.x and windows-latest
- GitHub Check: Test on Node 20.x and ubuntu-latest
- GitHub Check: Test on Node 22.x and windows-latest
- GitHub Check: Test on Node 22.x and ubuntu-latest
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Test on Node 20.x and macOS-latest
- GitHub Check: Test on Node 20.x and windows-latest
- GitHub Check: Build on Node 20.x
- GitHub Check: Build on Node 24.x
- GitHub Check: Build on Node 22.x
- GitHub Check: security-scan
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (12)
site/docs/usage/troubleshooting.md (1)
263-296: LGTM!The documentation clearly explains the new layered timeout protection:
- Uses 'eval' terminology consistently as per coding guidelines
- Uses 'Promptfoo' capitalized appropriately at line 265
- Provides clear examples for customizing timeouts via environment variables and config file
- Uses imperative mood in instructions
Note: The static analysis flagging "redteam" is a false positive - this is established domain terminology in the codebase.
src/evaluator.ts (3)
16-16: LGTM!The updated imports align with the new timeout-related functions and utilities exported from
envars.ts.
728-749: LGTM!The dynamic timeout calculation logic is well-structured:
- Correctly estimates total eval steps before variable expansion
- Handles scenarios with appropriate fallbacks for undefined arrays
- Respects explicit
options.maxEvalTimeMsoverride while providing sensible calculated defaults- The
isRedteamdetection usingtestSuite.redteam != nullis appropriateThe estimation is an approximation since variable combinations aren't known at this point, but this is acceptable for safety timeout calculations.
1332-1335: LGTM!The explicit
undefinedcheck (options.timeoutMs !== undefined) correctly allowstimeoutMs=0to disable the per-test timeout, matching the documented behavior. The fallback togetEvalTimeoutMs()provides the calculated default when no explicit value is set.test/envars.test.ts (4)
1-18: LGTM!The imports correctly include all new exports from
envars.ts:
- New constants:
EVAL_TIMEOUT_MULTIPLIER,EXPECTED_TIME_PER_*,MAX_EVAL_TIME_*- New functions:
calculateDefaultMaxEvalTimeMs,getDefaultMaxEvalTimeMs,getEvalTimeoutMsThe existing imports are retained for backward compatibility testing.
398-443: LGTM!Comprehensive test coverage for
getEvalTimeoutMs:
- Default calculation behavior (5× REQUEST_TIMEOUT_MS)
- Custom
REQUEST_TIMEOUT_MSoverride- Explicit
PROMPTFOO_EVAL_TIMEOUT_MSvalue including0to disable- Passed default value handling
- Priority ordering (explicit > passed default)
cliState.config.envpriority overprocess.env- Constant verification (
EVAL_TIMEOUT_MULTIPLIER)Tests follow Jest best practices with clear, descriptive test names.
445-521: LGTM!Excellent test coverage for
calculateDefaultMaxEvalTimeMs:
- Detailed step-by-step calculation verification in comments for each scenario
- Tests regular vs redteam with different expected times and safety multipliers
- Verifies all constants have expected values
- Edge cases for concurrency (1 and 0)
- Worst-case capping when
safeMaxTime > worstCaseThe inline comments documenting the calculation steps make these tests highly maintainable.
523-559: LGTM!Comprehensive test coverage for
getDefaultMaxEvalTimeMs:
- Explicit environment variable override
- Explicit
0to disable the timeout- Fallback to calculated default when unset
cliState.config.envpriority overprocess.envisRedteamflag propagation (verified by comparing regular vs redteam results)The test at lines 552-558 effectively validates that the
isRedteamparameter affects the calculation by assertingredteamResult > regularResult.src/envars.ts (4)
449-485: LGTM!The
EVAL_TIMEOUT_MULTIPLIERconstant andgetEvalTimeoutMsfunction are well-implemented:
- Clear documentation explaining the 5× multiplier rationale (retries + multi-turn strategies)
- Correct priority order: explicit env var → passed default → calculated default
- Explicit check for
undefinedallows0to disable timeouts- Default calculation:
REQUEST_TIMEOUT_MS (300,000ms) × 5 = 1,500,000ms (25 minutes)
497-525: LGTM!The timeout-related constants are well-documented and have reasonable values:
EXPECTED_TIME_PER_TEST_MS(1 min): Appropriate buffer for typical API callsEXPECTED_TIME_PER_REDTEAM_TEST_MS(5 min): Accounts for multi-turn strategiesMAX_EVAL_TIME_SAFETY_MULTIPLIER(3×): Generous buffer for regular evalsMAX_EVAL_TIME_REDTEAM_SAFETY_MULTIPLIER(2×): Lower since expected time is already paddedMAX_EVAL_TIME_BUFFER_MS(1 min): Accounts for setup/teardown overhead
527-569: LGTM!The
calculateDefaultMaxEvalTimeMsfunction implements the "swiss cheese" safety layer correctly:
Math.max(1, maxConcurrency)prevents division by zero- Selects appropriate constants based on
isRedteamflag- Caps at worst case (
batchCount × testCaseTimeoutMs) to prevent unreasonably long timeouts- Clear documentation explains the calculation rationale
571-605: LGTM!The
getDefaultMaxEvalTimeMsfunction correctly implements the priority order:
- Explicit
PROMPTFOO_MAX_EVAL_TIME_MS(including0to disable)- Calculated default via
calculateDefaultMaxEvalTimeMsThe documentation clearly explains the behavior and how to disable the timeout.
CHANGELOG.md
Outdated
|
|
||
| ### Added | ||
|
|
||
| - feat(eval): add dynamic timeout defaults with layered protection - test case timeout defaults to 5× REQUEST_TIMEOUT_MS (25 minutes) to allow for retries and multi-turn redteam strategies; total eval timeout auto-calculated based on eval size, concurrency, and redteam detection; set either to 0 to disable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add PR number, tighten wording, and name the disabling env vars.
- Missing PR number violates changelog rules.
- Wording is long; make it concise.
- Explicitly name PROMPTFOO_EVAL_TIMEOUT_MS and PROMPTFOO_MAX_EVAL_TIME_MS for the “0 disables” behavior.
Apply:
- - feat(eval): add dynamic timeout defaults with layered protection - test case timeout defaults to 5× REQUEST_TIMEOUT_MS (25 minutes) to allow for retries and multi-turn redteam strategies; total eval timeout auto-calculated based on eval size, concurrency, and redteam detection; set either to 0 to disable
+ - feat(eval): dynamic timeout defaults with layered protection — per‑test timeout defaults to 5× REQUEST_TIMEOUT_MS (25 min); auto‑calculated total eval timeout based on eval size, concurrency, and redteam detection; disable via PROMPTFOO_EVAL_TIMEOUT_MS=0 or PROMPTFOO_MAX_EVAL_TIME_MS=0 (#6344)As per coding guidelines, all user‑facing changes must include the PR number and be concise under the Unreleased section.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - feat(eval): add dynamic timeout defaults with layered protection - test case timeout defaults to 5× REQUEST_TIMEOUT_MS (25 minutes) to allow for retries and multi-turn redteam strategies; total eval timeout auto-calculated based on eval size, concurrency, and redteam detection; set either to 0 to disable | |
| - feat(eval): dynamic timeout defaults with layered protection — per‑test timeout defaults to 5× REQUEST_TIMEOUT_MS (25 min); auto‑calculated total eval timeout based on eval size, concurrency, and redteam detection; disable via PROMPTFOO_EVAL_TIMEOUT_MS=0 or PROMPTFOO_MAX_EVAL_TIME_MS=0 (#6344) |
- Fix bug where per-test timeout signal replaced global max eval time signal - Combine both signals using AbortSignal.any() so both timeouts work - Update evaluator tests to disable timeouts when not testing timeout behavior - Format CHANGELOG.md
- Log maxEvalTime, perTestTimeout, estimatedSteps, concurrency, isRedteam - Include estimated end time as ISO string for easier debugging - Format durations as human-readable (e.g., '25m', '1h 30m') - Log when timeout is disabled (maxEvalTimeMs=0)
Summary
REQUEST_TIMEOUT_MS(25 minutes by default)Changes
Per-test timeout (
PROMPTFOO_EVAL_TIMEOUT_MS)REQUEST_TIMEOUT_MS(25 minutes with default settings)0to explicitly disableTotal evaluation timeout (
PROMPTFOO_MAX_EVAL_TIME_MS)0to explicitly disableDocumentation
Test plan
getEvalTimeoutMswith 5× multiplier behaviorcalculateDefaultMaxEvalTimeMscalculation logicgetDefaultMaxEvalTimeMswrapper function0disables timeoutscliState.config.envpriority overprocess.env🤖 Generated with Claude Code