
Conversation

@mldangelo
Member

Summary

  • Add dynamic per-test timeout defaulting to 5× REQUEST_TIMEOUT_MS (25 minutes with default settings)
  • Add auto-calculated total evaluation timeout based on eval size, concurrency, and redteam detection
  • Provide "swiss cheese" layered timeout protection to prevent evals from running indefinitely

Changes

Per-test timeout (PROMPTFOO_EVAL_TIMEOUT_MS)

  • Defaults to 5× REQUEST_TIMEOUT_MS (25 minutes with default settings)
  • Allows for multiple retry cycles (with exponential backoff)
  • Accommodates multi-turn redteam strategies (iterative, GOAT, Crescendo, Hydra, etc.)
  • Set to 0 to explicitly disable
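
The resolution order these bullets describe can be sketched as follows. This is a minimal illustration, not the actual promptfoo internals: the function name and the 5-minute REQUEST_TIMEOUT_MS default are assumptions based on the PR text.

```typescript
// Sketch of the per-test timeout resolution described above.
// The 5-minute REQUEST_TIMEOUT_MS default is assumed from the PR description.
const REQUEST_TIMEOUT_MS = 300_000; // 5 minutes
const EVAL_TIMEOUT_MULTIPLIER = 5;

function resolveEvalTimeoutMs(envValue?: string): number {
  // An explicit PROMPTFOO_EVAL_TIMEOUT_MS wins, including "0" to disable.
  if (envValue !== undefined) {
    return Number(envValue);
  }
  // Otherwise default to 5x the request timeout: 25 minutes.
  return REQUEST_TIMEOUT_MS * EVAL_TIMEOUT_MULTIPLIER;
}
```

The explicit `!== undefined` check matters: a plain falsiness check would treat `0` as "unset" and silently re-enable the timeout.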

Total evaluation timeout (PROMPTFOO_MAX_EVAL_TIME_MS)

  • Automatically calculated based on:
    • Number of eval steps (tests × providers × prompts × repeat)
    • Max concurrency (batching calculation)
    • Expected time per test (60s regular, 300s redteam)
    • Safety multiplier (3× regular, 2× redteam)
    • Capped at the worst case (every test hitting its per-test timeout)
  • Set to 0 to explicitly disable
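
As a rough sketch of the auto-calculation described above: constants mirror the PR description, but the real implementation in src/envars.ts may differ in details such as where the one-minute buffer is applied.

```typescript
// Illustrative sketch of the total-eval-timeout calculation described above.
const EXPECTED_TIME_PER_TEST_MS = 60_000; // 60s per regular test
const EXPECTED_TIME_PER_REDTEAM_TEST_MS = 300_000; // 300s per redteam test
const SAFETY_MULTIPLIER = 3; // regular evals
const REDTEAM_SAFETY_MULTIPLIER = 2; // redteam: expected time is already padded
const BUFFER_MS = 60_000; // setup/teardown overhead

function calculateDefaultMaxEvalTimeMs(
  totalSteps: number,
  maxConcurrency: number,
  perTestTimeoutMs: number,
  isRedteam: boolean,
): number {
  // Batching: steps run maxConcurrency at a time (guard against 0).
  const batches = Math.ceil(totalSteps / Math.max(1, maxConcurrency));
  const expectedPerTest = isRedteam
    ? EXPECTED_TIME_PER_REDTEAM_TEST_MS
    : EXPECTED_TIME_PER_TEST_MS;
  const multiplier = isRedteam ? REDTEAM_SAFETY_MULTIPLIER : SAFETY_MULTIPLIER;
  const safeEstimate = batches * expectedPerTest * multiplier;
  // Cap at the worst case: every batch exhausting the per-test timeout.
  const worstCase = batches * perTestTimeoutMs;
  return Math.min(safeEstimate, worstCase) + BUFFER_MS;
}
```

For example, with concurrency 4 and a 25-minute per-test timeout, 50 regular tests give ceil(50/4) = 13 batches × 60 s × 3 + 60 s buffer = 40 minutes, consistent with the example calculations later in the thread.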

Documentation

  • Updated troubleshooting docs to explain default behavior
  • Added guidance on customizing timeouts

Test plan

  • Unit tests for getEvalTimeoutMs with 5× multiplier behavior
  • Unit tests for calculateDefaultMaxEvalTimeMs calculation logic
  • Unit tests for getDefaultMaxEvalTimeMs wrapper function
  • Tests verify explicit 0 disables timeouts
  • Tests verify cliState.config.env priority over process.env
  • Build passes
  • All existing tests pass

🤖 Generated with Claude Code

mldangelo and others added 8 commits November 24, 2025 14:31
- Add EVAL_TIMEOUT_MULTIPLIER constant set to 5x REQUEST_TIMEOUT_MS
- Update getEvalTimeoutMs() to calculate dynamic default (25 min)
- Allow PROMPTFOO_EVAL_TIMEOUT_MS=0 to explicitly disable timeout
- Fix evaluator to use undefined check for timeoutMs option

The 5x multiplier (25 min default) accommodates:
- Multiple retry cycles (default maxRetries=4)
- Multi-turn redteam strategies (iterative, GOAT, Crescendo)
- Model-graded assertions after provider calls

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add calculateDefaultMaxEvalTimeMs() and getDefaultMaxEvalTimeMs() functions
that compute reasonable max eval time based on:

- Total eval steps and concurrency (determines batch count)
- Expected time per test (1 min regular, 5 min redteam)
- Safety multiplier (3x regular, 2x redteam)
- Per-test timeout (caps worst case)

Example calculations (concurrency=4, 25 min test timeout):
- 50 regular tests: ~40 min max
- 50 redteam tests: ~2.2 hr max
- 100 redteam tests: ~4.2 hr max

This provides a "swiss cheese" safety layer to prevent evals from
running indefinitely while still allowing legitimate long-running evals.

Set PROMPTFOO_MAX_EVAL_TIME_MS=0 to explicitly disable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Calculate estimated total eval steps from testSuite structure
- Use getDefaultMaxEvalTimeMs for dynamic timeout calculation
- Account for tests, scenarios, providers, prompts, and repeat count
- Pass isRedteam flag for longer redteam eval timeouts

- Add tests for getEvalTimeoutMs with 5x multiplier behavior
- Add tests for calculateDefaultMaxEvalTimeMs calculation logic
- Add tests for getDefaultMaxEvalTimeMs wrapper function
- Verify constants for redteam vs regular eval timeouts
- Test edge cases: zero concurrency, explicit 0 timeout, cliState priority

- Add section explaining layered timeout protection defaults
- Document per-test timeout (5× REQUEST_TIMEOUT_MS by default)
- Document auto-calculated total evaluation timeout
- Explain how to disable or customize timeouts
@use-tusk
Contributor

use-tusk bot commented Nov 24, 2025

⏩ No test execution environment matched (1ac0a84)


View check history

| Commit | Status | Created (UTC) |
| --- | --- | --- |
| e402e49 | ⏩ No test execution environment matched | Nov 24, 2025 10:45PM |
| 9c94147 | ⏩ No test execution environment matched | Nov 24, 2025 11:16PM |
| 7f2e4f6 | ⏩ No test execution environment matched | Nov 25, 2025 1:07AM |
| f7922dd | ⏩ No test execution environment matched | Nov 25, 2025 1:13AM |
| b69c7b2 | ⏩ No test execution environment matched | Nov 25, 2025 1:25AM |
| 467e269 | ⏩ No test execution environment matched | Nov 25, 2025 5:20PM |
| 1ac0a84 | ⏩ No test execution environment matched | Nov 25, 2025 5:59PM |


Contributor

@promptfoo-scanner promptfoo-scanner bot left a comment


This PR adds dynamic timeout defaults for evaluation runs. I reviewed the changes for LLM security vulnerabilities (prompt injection, secrets/PII in prompts, excessive agency) and found no security concerns.

Minimum severity threshold for this scan: Medium

@promptfoo-scanner
Contributor

👍 All Clear

No LLM security vulnerabilities were found in this PR.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +745 to +749
const isRedteam = testSuite.redteam != null;

const maxEvalTimeMs =
options.maxEvalTimeMs ??
getDefaultMaxEvalTimeMs(estimatedTotalSteps, maxConcurrency, testCaseTimeoutMs, isRedteam);


P1 Badge Honor options.timeoutMs when deriving global eval timeout

The new auto max-eval timeout is calculated with getDefaultMaxEvalTimeMs(...), but the per-test timeout fed into it is always getEvalTimeoutMs(); any caller-supplied options.timeoutMs (including 0 to disable per-test timeouts or a shorter custom limit) is ignored. When maxEvalTimeMs is left unset and timeoutMs is provided, the global timer is computed from the default 5×REQUEST_TIMEOUT_MS instead of the actual per-test timeout being enforced, so runs that were intentionally unlimited or tightly capped will hit an unexpected global timeout or be under-protected. The calculation should use the effective per-test timeout that _runEvaluation applies (respecting options.timeoutMs).

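
The fix this comment suggests — deriving the global timeout from the effective per-test timeout rather than always calling getEvalTimeoutMs() — can be sketched as below. The helper name is illustrative; only the undefined-check semantics come from the comment.

```typescript
// Sketch: resolve the per-test timeout the same way _runEvaluation does,
// honoring options.timeoutMs (including 0 to disable), before feeding it
// into the global max-eval-time calculation.
function effectivePerTestTimeoutMs(
  optionsTimeoutMs: number | undefined,
  defaultTimeoutMs: number,
): number {
  // Explicit undefined check so 0 (disable) is respected, not treated as falsy.
  return optionsTimeoutMs !== undefined ? optionsTimeoutMs : defaultTimeoutMs;
}
```

The global timer calculation would then receive `effectivePerTestTimeoutMs(options.timeoutMs, getEvalTimeoutMs())` instead of the unconditional default.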

@coderabbitai
Contributor

coderabbitai bot commented Nov 24, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This pull request implements a dynamic timeout defaults framework for evaluations. It introduces per-test timeout configuration (defaulting to 5× REQUEST_TIMEOUT_MS, or 25 minutes) and auto-calculates total evaluation timeouts based on eval batch count, concurrency level, and red-team detection status. New constants define expected test durations, safety multipliers, and buffer times. The implementation includes functions to compute default timeouts while respecting explicit environment variable overrides and allowing timeouts to be set to 0 to disable. Changes span core logic, configuration handling, documentation, and test coverage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • src/envars.ts: Review the calculation logic in calculateDefaultMaxEvalTimeMs, particularly the formulas for safety estimates, worst-case scenarios, and how the isRedteam flag affects multipliers and expected times
  • src/evaluator.ts: Verify that getDefaultMaxEvalTimeMs is called with correct parameters (estimatedTotalSteps, maxConcurrency, testCaseTimeoutMs, isRedteam), and that the explicit undefined check for timeoutMs correctly allows 0 as a disable value
  • Timeout constant definitions: Confirm the semantic correctness of EXPECTED_TIME_PER_TEST_MS, EXPECTED_TIME_PER_REDTEAM_TEST_MS, safety multipliers, and buffer values
  • test/envars.test.ts: Verify comprehensive test coverage for edge cases (concurrency = 0/1, isRedteam flag propagation, environment variable prioritization where cliState.config.env overrides process.env)

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed The description is directly related to the changeset, providing a comprehensive summary of per-test timeouts, total evaluation timeouts, documentation updates, and test coverage that align with the actual changes made.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title 'feat(eval): add dynamic timeout defaults' directly and accurately captures the primary change: introducing dynamic default timeout behavior for evaluations, as confirmed by the changelog, documentation, and code changes across multiple files.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
CHANGELOG.md (2)

68-72: Add documentation bullet for timeout config.

The PR updates troubleshooting/usage docs for new timeout behavior; add an explicit Documentation entry referencing the env vars and PR.

Apply:

 ### Documentation
-
- - docs(providers): add comprehensive Anthropic structured outputs documentation covering JSON outputs (`output_format`) and strict tool use (`strict: true`), including usage examples, schema limitations, and feature compatibility (#6226)
+ - docs(config): document PROMPTFOO_EVAL_TIMEOUT_MS and PROMPTFOO_MAX_EVAL_TIME_MS defaults, overrides, and disabling guidance; add troubleshooting for long‑running evals (#6344)
+ - docs(providers): add comprehensive Anthropic structured outputs documentation covering JSON outputs (`output_format`) and strict tool use (`strict: true`), including usage examples, schema limitations, and feature compatibility (#6226)

As per coding guidelines, documentation updates accompanying user‑facing changes must be reflected in CHANGELOG under Unreleased.


74-78: Add tests bullet for timeout calculations and precedence.

The PR adds unit tests; reflect them under Tests with the correct scope and PR number.

Apply:

 ### Tests
-
- - test(examples): add structured outputs example demonstrating JSON schema-based responses and strict tool use for Anthropic provider (#6226)
+ - test(eval): add unit tests for getEvalTimeoutMs (5× behavior), calculateDefaultMaxEvalTimeMs, getDefaultMaxEvalTimeMs, env/config precedence, and 0‑disables semantics (#6344)
+ - test(examples): add structured outputs example demonstrating JSON schema-based responses and strict tool use for Anthropic provider (#6226)

All test changes should be captured under Unreleased/Tests with PR numbers and clear scope.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between efeccab and e402e49.

📒 Files selected for processing (5)
  • CHANGELOG.md (1 hunks)
  • site/docs/usage/troubleshooting.md (1 hunks)
  • src/envars.ts (2 hunks)
  • src/evaluator.ts (3 hunks)
  • test/envars.test.ts (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test/envars.test.ts (1)
src/envars.ts (9)
  • getEvalTimeoutMs (469-485)
  • EVAL_TIMEOUT_MULTIPLIER (454-454)
  • calculateDefaultMaxEvalTimeMs (543-569)
  • EXPECTED_TIME_PER_REDTEAM_TEST_MS (507-507)
  • EXPECTED_TIME_PER_TEST_MS (501-501)
  • MAX_EVAL_TIME_REDTEAM_SAFETY_MULTIPLIER (519-519)
  • MAX_EVAL_TIME_SAFETY_MULTIPLIER (513-513)
  • MAX_EVAL_TIME_BUFFER_MS (525-525)
  • getDefaultMaxEvalTimeMs (586-605)
🪛 LanguageTool
site/docs/usage/troubleshooting.md

[grammar] ~270-~270: Ensure spelling is correct
Context: ...cycles and multi-turn strategies (e.g., redteam iterative, GOAT, Crescendo) - Set to `0...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~275-~275: Ensure spelling is correct
Context: ...s size, concurrency, and whether it's a redteam eval - Acts as a safety net to prevent ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~275-~275: Ensure spelling is correct
Context: ...urrency, and whether it's a redteam eval - Acts as a safety net to prevent evaluati...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (12)
site/docs/usage/troubleshooting.md (1)

263-296: LGTM!

The documentation clearly explains the new layered timeout protection:

  • Uses 'eval' terminology consistently as per coding guidelines
  • Uses 'Promptfoo' capitalized appropriately at line 265
  • Provides clear examples for customizing timeouts via environment variables and config file
  • Uses imperative mood in instructions

Note: The static analysis flagging "redteam" is a false positive - this is established domain terminology in the codebase.

src/evaluator.ts (3)

16-16: LGTM!

The updated imports align with the new timeout-related functions and utilities exported from envars.ts.


728-749: LGTM!

The dynamic timeout calculation logic is well-structured:

  • Correctly estimates total eval steps before variable expansion
  • Handles scenarios with appropriate fallbacks for undefined arrays
  • Respects explicit options.maxEvalTimeMs override while providing sensible calculated defaults
  • The isRedteam detection using testSuite.redteam != null is appropriate

The estimation is an approximation since variable combinations aren't known at this point, but this is acceptable for safety timeout calculations.


1332-1335: LGTM!

The explicit undefined check (options.timeoutMs !== undefined) correctly allows timeoutMs=0 to disable the per-test timeout, matching the documented behavior. The fallback to getEvalTimeoutMs() provides the calculated default when no explicit value is set.

test/envars.test.ts (4)

1-18: LGTM!

The imports correctly include all new exports from envars.ts:

  • New constants: EVAL_TIMEOUT_MULTIPLIER, EXPECTED_TIME_PER_*, MAX_EVAL_TIME_*
  • New functions: calculateDefaultMaxEvalTimeMs, getDefaultMaxEvalTimeMs, getEvalTimeoutMs

The existing imports are retained for backward compatibility testing.


398-443: LGTM!

Comprehensive test coverage for getEvalTimeoutMs:

  • Default calculation behavior (5× REQUEST_TIMEOUT_MS)
  • Custom REQUEST_TIMEOUT_MS override
  • Explicit PROMPTFOO_EVAL_TIMEOUT_MS value including 0 to disable
  • Passed default value handling
  • Priority ordering (explicit > passed default)
  • cliState.config.env priority over process.env
  • Constant verification (EVAL_TIMEOUT_MULTIPLIER)

Tests follow Jest best practices with clear, descriptive test names.


445-521: LGTM!

Excellent test coverage for calculateDefaultMaxEvalTimeMs:

  • Detailed step-by-step calculation verification in comments for each scenario
  • Tests regular vs redteam with different expected times and safety multipliers
  • Verifies all constants have expected values
  • Edge cases for concurrency (1 and 0)
  • Worst-case capping when safeMaxTime > worstCase

The inline comments documenting the calculation steps make these tests highly maintainable.


523-559: LGTM!

Comprehensive test coverage for getDefaultMaxEvalTimeMs:

  • Explicit environment variable override
  • Explicit 0 to disable the timeout
  • Fallback to calculated default when unset
  • cliState.config.env priority over process.env
  • isRedteam flag propagation (verified by comparing regular vs redteam results)

The test at lines 552-558 effectively validates that the isRedteam parameter affects the calculation by asserting redteamResult > regularResult.

src/envars.ts (4)

449-485: LGTM!

The EVAL_TIMEOUT_MULTIPLIER constant and getEvalTimeoutMs function are well-implemented:

  • Clear documentation explaining the 5× multiplier rationale (retries + multi-turn strategies)
  • Correct priority order: explicit env var → passed default → calculated default
  • Explicit check for undefined allows 0 to disable timeouts
  • Default calculation: REQUEST_TIMEOUT_MS (300,000ms) × 5 = 1,500,000ms (25 minutes)

497-525: LGTM!

The timeout-related constants are well-documented and have reasonable values:

  • EXPECTED_TIME_PER_TEST_MS (1 min): Appropriate buffer for typical API calls
  • EXPECTED_TIME_PER_REDTEAM_TEST_MS (5 min): Accounts for multi-turn strategies
  • MAX_EVAL_TIME_SAFETY_MULTIPLIER (3×): Generous buffer for regular evals
  • MAX_EVAL_TIME_REDTEAM_SAFETY_MULTIPLIER (2×): Lower since expected time is already padded
  • MAX_EVAL_TIME_BUFFER_MS (1 min): Accounts for setup/teardown overhead

527-569: LGTM!

The calculateDefaultMaxEvalTimeMs function implements the "swiss cheese" safety layer correctly:

  • Math.max(1, maxConcurrency) prevents division by zero
  • Selects appropriate constants based on isRedteam flag
  • Caps at worst case (batchCount × testCaseTimeoutMs) to prevent unreasonably long timeouts
  • Clear documentation explains the calculation rationale

571-605: LGTM!

The getDefaultMaxEvalTimeMs function correctly implements the priority order:

  1. Explicit PROMPTFOO_MAX_EVAL_TIME_MS (including 0 to disable)
  2. Calculated default via calculateDefaultMaxEvalTimeMs

The documentation clearly explains the behavior and how to disable the timeout.

CHANGELOG.md Outdated

### Added

- feat(eval): add dynamic timeout defaults with layered protection - test case timeout defaults to 5× REQUEST_TIMEOUT_MS (25 minutes) to allow for retries and multi-turn redteam strategies; total eval timeout auto-calculated based on eval size, concurrency, and redteam detection; set either to 0 to disable
Contributor


⚠️ Potential issue | 🟠 Major

Add PR number, tighten wording, and name the disabling env vars.

  • Missing PR number violates changelog rules.
  • Wording is long; make it concise.
  • Explicitly name PROMPTFOO_EVAL_TIMEOUT_MS and PROMPTFOO_MAX_EVAL_TIME_MS for the “0 disables” behavior.

Apply:

- - feat(eval): add dynamic timeout defaults with layered protection - test case timeout defaults to 5× REQUEST_TIMEOUT_MS (25 minutes) to allow for retries and multi-turn redteam strategies; total eval timeout auto-calculated based on eval size, concurrency, and redteam detection; set either to 0 to disable
+ - feat(eval): dynamic timeout defaults with layered protection — per‑test timeout defaults to 5× REQUEST_TIMEOUT_MS (25 min); auto‑calculated total eval timeout based on eval size, concurrency, and redteam detection; disable via PROMPTFOO_EVAL_TIMEOUT_MS=0 or PROMPTFOO_MAX_EVAL_TIME_MS=0 (#6344)

As per coding guidelines, all user‑facing changes must include the PR number and be concise under the Unreleased section.


- Fix bug where per-test timeout signal replaced global max eval time signal
- Combine both signals using AbortSignal.any() so both timeouts work
- Update evaluator tests to disable timeouts when not testing timeout behavior
- Format CHANGELOG.md
- Log maxEvalTime, perTestTimeout, estimatedSteps, concurrency, isRedteam
- Include estimated end time as ISO string for easier debugging
- Format durations as human-readable (e.g., '25m', '1h 30m')
- Log when timeout is disabled (maxEvalTimeMs=0)
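
The AbortSignal.any() fix described in the commits above can be sketched as follows. This is a hedged illustration, not the evaluator's actual code; it requires a runtime where AbortSignal.any is available (Node 20.3+), and the timeout values are illustrative.

```typescript
// Sketch of combining per-test and global eval timeout signals so that
// whichever fires first aborts in-flight work.
const perTestSignal = AbortSignal.timeout(1_500_000); // 25 min per test
const maxEvalTimeSignal = AbortSignal.timeout(2_400_000); // 40 min whole eval

// Before the fix, one signal replaced the other; AbortSignal.any() keeps both.
const combinedSignal = AbortSignal.any([perTestSignal, maxEvalTimeSignal]);

combinedSignal.addEventListener('abort', () => {
  // combinedSignal.reason carries the TimeoutError from whichever timer fired.
  console.log('eval aborted:', combinedSignal.reason);
});
```

Passing `combinedSignal` to the per-test work means a test is cancelled either by its own timer or by the global eval timer, whichever expires first.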
@mldangelo mldangelo changed the title feat(eval): add dynamic timeout defaults with layered protection feat(eval): add dynamic timeout defaults Nov 25, 2025