
Conversation

@mldangelo
Member

Summary

  • Add dynamic per-test timeout defaulting to 5× REQUEST_TIMEOUT_MS (25 minutes with default settings)
  • Add auto-calculated total evaluation timeout based on eval size, concurrency, and redteam detection
  • Provide "swiss cheese" layered timeout protection to prevent evals from running indefinitely

Changes

Per-test timeout (PROMPTFOO_EVAL_TIMEOUT_MS)

  • Defaults to 5× REQUEST_TIMEOUT_MS (25 minutes with default settings)
  • Allows for multiple retry cycles (with exponential backoff)
  • Accommodates multi-turn redteam strategies (iterative, GOAT, Crescendo, Hydra, etc.)
  • Set to 0 to explicitly disable
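
The resolution order these bullets describe can be sketched as follows. This is a minimal illustration, not the actual promptfoo internals: the function name and the 5-minute REQUEST_TIMEOUT_MS default are assumptions based on the PR text.

```typescript
// Sketch of the per-test timeout resolution described above.
// The 5-minute REQUEST_TIMEOUT_MS default is assumed from the PR description.
const REQUEST_TIMEOUT_MS = 300_000; // 5 minutes
const EVAL_TIMEOUT_MULTIPLIER = 5;

function resolveEvalTimeoutMs(envValue?: string): number {
  // An explicit PROMPTFOO_EVAL_TIMEOUT_MS wins, including "0" to disable.
  if (envValue !== undefined) {
    return Number(envValue);
  }
  // Otherwise default to 5x the request timeout: 25 minutes.
  return REQUEST_TIMEOUT_MS * EVAL_TIMEOUT_MULTIPLIER;
}
```

The explicit `!== undefined` check matters: a plain falsiness check would treat `0` as "unset" and silently re-enable the timeout.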

Total evaluation timeout (PROMPTFOO_MAX_EVAL_TIME_MS)

  • Automatically calculated based on:
    • Number of eval steps (tests × providers × prompts × repeat)
    • Max concurrency (batching calculation)
    • Expected time per test (60s regular, 300s redteam)
    • Safety multiplier (3× regular, 2× redteam)
    • Capped at the worst case (every test hitting its per-test timeout)
  • Set to 0 to explicitly disable
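
As a rough sketch of the auto-calculation described above: constants mirror the PR description, but the real implementation in src/envars.ts may differ in details such as where the one-minute buffer is applied.

```typescript
// Illustrative sketch of the total-eval-timeout calculation described above.
const EXPECTED_TIME_PER_TEST_MS = 60_000; // 60s per regular test
const EXPECTED_TIME_PER_REDTEAM_TEST_MS = 300_000; // 300s per redteam test
const SAFETY_MULTIPLIER = 3; // regular evals
const REDTEAM_SAFETY_MULTIPLIER = 2; // redteam: expected time is already padded
const BUFFER_MS = 60_000; // setup/teardown overhead

function calculateDefaultMaxEvalTimeMs(
  totalSteps: number,
  maxConcurrency: number,
  perTestTimeoutMs: number,
  isRedteam: boolean,
): number {
  // Batching: steps run maxConcurrency at a time (guard against 0).
  const batches = Math.ceil(totalSteps / Math.max(1, maxConcurrency));
  const expectedPerTest = isRedteam
    ? EXPECTED_TIME_PER_REDTEAM_TEST_MS
    : EXPECTED_TIME_PER_TEST_MS;
  const multiplier = isRedteam ? REDTEAM_SAFETY_MULTIPLIER : SAFETY_MULTIPLIER;
  const safeEstimate = batches * expectedPerTest * multiplier;
  // Cap at the worst case: every batch exhausting the per-test timeout.
  const worstCase = batches * perTestTimeoutMs;
  return Math.min(safeEstimate, worstCase) + BUFFER_MS;
}
```

For example, with concurrency 4 and a 25-minute per-test timeout, 50 regular tests give ceil(50/4) = 13 batches × 60 s × 3 + 60 s buffer = 40 minutes, consistent with the example calculations later in the thread.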

Documentation

  • Updated troubleshooting docs to explain default behavior
  • Added guidance on customizing timeouts

Test plan

  • Unit tests for getEvalTimeoutMs with 5× multiplier behavior
  • Unit tests for calculateDefaultMaxEvalTimeMs calculation logic
  • Unit tests for getDefaultMaxEvalTimeMs wrapper function
  • Tests verify explicit 0 disables timeouts
  • Tests verify cliState.config.env priority over process.env
  • Build passes
  • All existing tests pass

🤖 Generated with Claude Code

mldangelo and others added 8 commits November 24, 2025 14:31
- Add EVAL_TIMEOUT_MULTIPLIER constant set to 5x REQUEST_TIMEOUT_MS
- Update getEvalTimeoutMs() to calculate dynamic default (25 min)
- Allow PROMPTFOO_EVAL_TIMEOUT_MS=0 to explicitly disable timeout
- Fix evaluator to use undefined check for timeoutMs option

The 5x multiplier (25 min default) accommodates:
- Multiple retry cycles (default maxRetries=4)
- Multi-turn redteam strategies (iterative, GOAT, Crescendo)
- Model-graded assertions after provider calls

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add calculateDefaultMaxEvalTimeMs() and getDefaultMaxEvalTimeMs() functions
that compute reasonable max eval time based on:

- Total eval steps and concurrency (determines batch count)
- Expected time per test (1 min regular, 5 min redteam)
- Safety multiplier (3x regular, 2x redteam)
- Per-test timeout (caps worst case)

Example calculations (concurrency=4, 25 min test timeout):
- 50 regular tests: ~40 min max
- 50 redteam tests: ~2.2 hr max
- 100 redteam tests: ~4.2 hr max

This provides a "swiss cheese" safety layer to prevent evals from
running indefinitely while still allowing legitimate long-running evals.

Set PROMPTFOO_MAX_EVAL_TIME_MS=0 to explicitly disable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Calculate estimated total eval steps from testSuite structure
- Use getDefaultMaxEvalTimeMs for dynamic timeout calculation
- Account for tests, scenarios, providers, prompts, and repeat count
- Pass isRedteam flag for longer redteam eval timeouts

- Add tests for getEvalTimeoutMs with 5x multiplier behavior
- Add tests for calculateDefaultMaxEvalTimeMs calculation logic
- Add tests for getDefaultMaxEvalTimeMs wrapper function
- Verify constants for redteam vs regular eval timeouts
- Test edge cases: zero concurrency, explicit 0 timeout, cliState priority

- Add section explaining layered timeout protection defaults
- Document per-test timeout (5× REQUEST_TIMEOUT_MS by default)
- Document auto-calculated total evaluation timeout
- Explain how to disable or customize timeouts
@use-tusk
Contributor

use-tusk bot commented Nov 24, 2025

⏩ No test execution environment matched (1ac0a84)


View check history

| Commit | Status | Created (UTC) |
| --- | --- | --- |
| e402e49 | ⏩ No test execution environment matched | Nov 24, 2025 10:45PM |
| 9c94147 | ⏩ No test execution environment matched | Nov 24, 2025 11:16PM |
| 7f2e4f6 | ⏩ No test execution environment matched | Nov 25, 2025 1:07AM |
| f7922dd | ⏩ No test execution environment matched | Nov 25, 2025 1:13AM |
| b69c7b2 | ⏩ No test execution environment matched | Nov 25, 2025 1:25AM |
| 467e269 | ⏩ No test execution environment matched | Nov 25, 2025 5:20PM |
| 1ac0a84 | ⏩ No test execution environment matched | Nov 25, 2025 5:59PM |


Contributor

@promptfoo-scanner promptfoo-scanner bot left a comment


This PR adds dynamic timeout defaults for evaluation runs. I reviewed the changes for LLM security vulnerabilities (prompt injection, secrets/PII in prompts, excessive agency) and found no security concerns.

Minimum severity threshold for this scan: Medium

@promptfoo-scanner
Contributor

👍 All Clear

No LLM security vulnerabilities were found in this PR.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +745 to +749
const isRedteam = testSuite.redteam != null;

const maxEvalTimeMs =
options.maxEvalTimeMs ??
getDefaultMaxEvalTimeMs(estimatedTotalSteps, maxConcurrency, testCaseTimeoutMs, isRedteam);


P1 Badge Honor options.timeoutMs when deriving global eval timeout

The new auto max-eval timeout is calculated with getDefaultMaxEvalTimeMs(...), but the per-test timeout fed into it is always getEvalTimeoutMs(); any caller-supplied options.timeoutMs (including 0 to disable per-test timeouts or a shorter custom limit) is ignored. When maxEvalTimeMs is left unset and timeoutMs is provided, the global timer is computed from the default 5×REQUEST_TIMEOUT_MS instead of the actual per-test timeout being enforced, so runs that were intentionally unlimited or tightly capped will hit an unexpected global timeout or be under-protected. The calculation should use the effective per-test timeout that _runEvaluation applies (respecting options.timeoutMs).

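
The fix this comment suggests — deriving the global timeout from the effective per-test timeout rather than always calling getEvalTimeoutMs() — can be sketched as below. The helper name is illustrative; only the undefined-check semantics come from the comment.

```typescript
// Sketch: resolve the per-test timeout the same way _runEvaluation does,
// honoring options.timeoutMs (including 0 to disable), before feeding it
// into the global max-eval-time calculation.
function effectivePerTestTimeoutMs(
  optionsTimeoutMs: number | undefined,
  defaultTimeoutMs: number,
): number {
  // Explicit undefined check so 0 (disable) is respected, not treated as falsy.
  return optionsTimeoutMs !== undefined ? optionsTimeoutMs : defaultTimeoutMs;
}
```

The global timer calculation would then receive `effectivePerTestTimeoutMs(options.timeoutMs, getEvalTimeoutMs())` instead of the unconditional default.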

@coderabbitai
Contributor

coderabbitai bot commented Nov 24, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

This pull request implements a dynamic timeout defaults framework for evaluations. It introduces per-test timeout configuration (defaulting to 5× REQUEST_TIMEOUT_MS, or 25 minutes) and auto-calculates total evaluation timeouts based on eval batch count, concurrency level, and red-team detection status. New constants define expected test durations, safety multipliers, and buffer times. The implementation includes functions to compute default timeouts while respecting explicit environment variable overrides and allowing timeouts to be set to 0 to disable. Changes span core logic, configuration handling, documentation, and test coverage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–30 minutes

  • src/envars.ts: Review the calculation logic in calculateDefaultMaxEvalTimeMs, particularly the formulas for safety estimates, worst-case scenarios, and how the isRedteam flag affects multipliers and expected times
  • src/evaluator.ts: Verify that getDefaultMaxEvalTimeMs is called with correct parameters (estimatedTotalSteps, maxConcurrency, testCaseTimeoutMs, isRedteam), and that the explicit undefined check for timeoutMs correctly allows 0 as a disable value
  • Timeout constant definitions: Confirm the semantic correctness of EXPECTED_TIME_PER_TEST_MS, EXPECTED_TIME_PER_REDTEAM_TEST_MS, safety multipliers, and buffer values
  • test/envars.test.ts: Verify comprehensive test coverage for edge cases (concurrency = 0/1, isRedteam flag propagation, environment variable prioritization where cliState.config.env overrides process.env)

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed The description is directly related to the changeset, providing a comprehensive summary of per-test timeouts, total evaluation timeouts, documentation updates, and test coverage that align with the actual changes made.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title 'feat(eval): add dynamic timeout defaults' directly and accurately captures the primary change: introducing dynamic default timeout behavior for evaluations, as confirmed by the changelog, documentation, and code changes across multiple files.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
CHANGELOG.md (2)

68-72: Add documentation bullet for timeout config.

The PR updates troubleshooting/usage docs for new timeout behavior; add an explicit Documentation entry referencing the env vars and PR.

Apply:

 ### Documentation
-
- - docs(providers): add comprehensive Anthropic structured outputs documentation covering JSON outputs (`output_format`) and strict tool use (`strict: true`), including usage examples, schema limitations, and feature compatibility (#6226)
+ - docs(config): document PROMPTFOO_EVAL_TIMEOUT_MS and PROMPTFOO_MAX_EVAL_TIME_MS defaults, overrides, and disabling guidance; add troubleshooting for long‑running evals (#6344)
+ - docs(providers): add comprehensive Anthropic structured outputs documentation covering JSON outputs (`output_format`) and strict tool use (`strict: true`), including usage examples, schema limitations, and feature compatibility (#6226)

As per coding guidelines, documentation updates accompanying user‑facing changes must be reflected in CHANGELOG under Unreleased.


74-78: Add tests bullet for timeout calculations and precedence.

The PR adds unit tests; reflect them under Tests with the correct scope and PR number.

Apply:

 ### Tests
-
- - test(examples): add structured outputs example demonstrating JSON schema-based responses and strict tool use for Anthropic provider (#6226)
+ - test(eval): add unit tests for getEvalTimeoutMs (5× behavior), calculateDefaultMaxEvalTimeMs, getDefaultMaxEvalTimeMs, env/config precedence, and 0‑disables semantics (#6344)
+ - test(examples): add structured outputs example demonstrating JSON schema-based responses and strict tool use for Anthropic provider (#6226)

All test changes should be captured under Unreleased/Tests with PR numbers and clear scope.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between efeccab and e402e49.

📒 Files selected for processing (5)
  • CHANGELOG.md (1 hunks)
  • site/docs/usage/troubleshooting.md (1 hunks)
  • src/envars.ts (2 hunks)
  • src/evaluator.ts (3 hunks)
  • test/envars.test.ts (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
test/envars.test.ts (1)
src/envars.ts (9)
  • getEvalTimeoutMs (469-485)
  • EVAL_TIMEOUT_MULTIPLIER (454-454)
  • calculateDefaultMaxEvalTimeMs (543-569)
  • EXPECTED_TIME_PER_REDTEAM_TEST_MS (507-507)
  • EXPECTED_TIME_PER_TEST_MS (501-501)
  • MAX_EVAL_TIME_REDTEAM_SAFETY_MULTIPLIER (519-519)
  • MAX_EVAL_TIME_SAFETY_MULTIPLIER (513-513)
  • MAX_EVAL_TIME_BUFFER_MS (525-525)
  • getDefaultMaxEvalTimeMs (586-605)
🪛 LanguageTool
site/docs/usage/troubleshooting.md

[grammar] ~270-~270: Ensure spelling is correct
Context: ...cycles and multi-turn strategies (e.g., redteam iterative, GOAT, Crescendo) - Set to `0...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~275-~275: Ensure spelling is correct
Context: ...s size, concurrency, and whether it's a redteam eval - Acts as a safety net to prevent ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~275-~275: Ensure spelling is correct
Context: ...urrency, and whether it's a redteam eval - Acts as a safety net to prevent evaluati...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

🔇 Additional comments (12)
site/docs/usage/troubleshooting.md (1)

263-296: LGTM!

The documentation clearly explains the new layered timeout protection:

  • Uses 'eval' terminology consistently as per coding guidelines
  • Uses 'Promptfoo' capitalized appropriately at line 265
  • Provides clear examples for customizing timeouts via environment variables and config file
  • Uses imperative mood in instructions

Note: The static analysis flagging "redteam" is a false positive - this is established domain terminology in the codebase.

src/evaluator.ts (3)

16-16: LGTM!

The updated imports align with the new timeout-related functions and utilities exported from envars.ts.


728-749: LGTM!

The dynamic timeout calculation logic is well-structured:

  • Correctly estimates total eval steps before variable expansion
  • Handles scenarios with appropriate fallbacks for undefined arrays
  • Respects explicit options.maxEvalTimeMs override while providing sensible calculated defaults
  • The isRedteam detection using testSuite.redteam != null is appropriate

The estimation is an approximation since variable combinations aren't known at this point, but this is acceptable for safety timeout calculations.


1332-1335: LGTM!

The explicit undefined check (options.timeoutMs !== undefined) correctly allows timeoutMs=0 to disable the per-test timeout, matching the documented behavior. The fallback to getEvalTimeoutMs() provides the calculated default when no explicit value is set.

test/envars.test.ts (4)

1-18: LGTM!

The imports correctly include all new exports from envars.ts:

  • New constants: EVAL_TIMEOUT_MULTIPLIER, EXPECTED_TIME_PER_*, MAX_EVAL_TIME_*
  • New functions: calculateDefaultMaxEvalTimeMs, getDefaultMaxEvalTimeMs, getEvalTimeoutMs

The existing imports are retained for backward compatibility testing.


398-443: LGTM!

Comprehensive test coverage for getEvalTimeoutMs:

  • Default calculation behavior (5× REQUEST_TIMEOUT_MS)
  • Custom REQUEST_TIMEOUT_MS override
  • Explicit PROMPTFOO_EVAL_TIMEOUT_MS value including 0 to disable
  • Passed default value handling
  • Priority ordering (explicit > passed default)
  • cliState.config.env priority over process.env
  • Constant verification (EVAL_TIMEOUT_MULTIPLIER)

Tests follow Jest best practices with clear, descriptive test names.


445-521: LGTM!

Excellent test coverage for calculateDefaultMaxEvalTimeMs:

  • Detailed step-by-step calculation verification in comments for each scenario
  • Tests regular vs redteam with different expected times and safety multipliers
  • Verifies all constants have expected values
  • Edge cases for concurrency (1 and 0)
  • Worst-case capping when safeMaxTime > worstCase

The inline comments documenting the calculation steps make these tests highly maintainable.


523-559: LGTM!

Comprehensive test coverage for getDefaultMaxEvalTimeMs:

  • Explicit environment variable override
  • Explicit 0 to disable the timeout
  • Fallback to calculated default when unset
  • cliState.config.env priority over process.env
  • isRedteam flag propagation (verified by comparing regular vs redteam results)

The test at lines 552-558 effectively validates that the isRedteam parameter affects the calculation by asserting redteamResult > regularResult.

src/envars.ts (4)

449-485: LGTM!

The EVAL_TIMEOUT_MULTIPLIER constant and getEvalTimeoutMs function are well-implemented:

  • Clear documentation explaining the 5× multiplier rationale (retries + multi-turn strategies)
  • Correct priority order: explicit env var → passed default → calculated default
  • Explicit check for undefined allows 0 to disable timeouts
  • Default calculation: REQUEST_TIMEOUT_MS (300,000ms) × 5 = 1,500,000ms (25 minutes)

497-525: LGTM!

The timeout-related constants are well-documented and have reasonable values:

  • EXPECTED_TIME_PER_TEST_MS (1 min): Appropriate buffer for typical API calls
  • EXPECTED_TIME_PER_REDTEAM_TEST_MS (5 min): Accounts for multi-turn strategies
  • MAX_EVAL_TIME_SAFETY_MULTIPLIER (3×): Generous buffer for regular evals
  • MAX_EVAL_TIME_REDTEAM_SAFETY_MULTIPLIER (2×): Lower since expected time is already padded
  • MAX_EVAL_TIME_BUFFER_MS (1 min): Accounts for setup/teardown overhead

527-569: LGTM!

The calculateDefaultMaxEvalTimeMs function implements the "swiss cheese" safety layer correctly:

  • Math.max(1, maxConcurrency) prevents division by zero
  • Selects appropriate constants based on isRedteam flag
  • Caps at worst case (batchCount × testCaseTimeoutMs) to prevent unreasonably long timeouts
  • Clear documentation explains the calculation rationale

571-605: LGTM!

The getDefaultMaxEvalTimeMs function correctly implements the priority order:

  1. Explicit PROMPTFOO_MAX_EVAL_TIME_MS (including 0 to disable)
  2. Calculated default via calculateDefaultMaxEvalTimeMs

The documentation clearly explains the behavior and how to disable the timeout.

CHANGELOG.md Outdated

### Added

- feat(eval): add dynamic timeout defaults with layered protection - test case timeout defaults to 5× REQUEST_TIMEOUT_MS (25 minutes) to allow for retries and multi-turn redteam strategies; total eval timeout auto-calculated based on eval size, concurrency, and redteam detection; set either to 0 to disable
Contributor


⚠️ Potential issue | 🟠 Major

Add PR number, tighten wording, and name the disabling env vars.

  • Missing PR number violates changelog rules.
  • Wording is long; make it concise.
  • Explicitly name PROMPTFOO_EVAL_TIMEOUT_MS and PROMPTFOO_MAX_EVAL_TIME_MS for the “0 disables” behavior.

Apply:

- - feat(eval): add dynamic timeout defaults with layered protection - test case timeout defaults to 5× REQUEST_TIMEOUT_MS (25 minutes) to allow for retries and multi-turn redteam strategies; total eval timeout auto-calculated based on eval size, concurrency, and redteam detection; set either to 0 to disable
+ - feat(eval): dynamic timeout defaults with layered protection — per‑test timeout defaults to 5× REQUEST_TIMEOUT_MS (25 min); auto‑calculated total eval timeout based on eval size, concurrency, and redteam detection; disable via PROMPTFOO_EVAL_TIMEOUT_MS=0 or PROMPTFOO_MAX_EVAL_TIME_MS=0 (#6344)

As per coding guidelines, all user‑facing changes must include the PR number and be concise under the Unreleased section.


- Fix bug where per-test timeout signal replaced global max eval time signal
- Combine both signals using AbortSignal.any() so both timeouts work
- Update evaluator tests to disable timeouts when not testing timeout behavior
- Format CHANGELOG.md
- Log maxEvalTime, perTestTimeout, estimatedSteps, concurrency, isRedteam
- Include estimated end time as ISO string for easier debugging
- Format durations as human-readable (e.g., '25m', '1h 30m')
- Log when timeout is disabled (maxEvalTimeMs=0)
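
The AbortSignal.any() fix described in the commits above can be sketched as follows. This is a hedged illustration, not the evaluator's actual code; it requires a runtime where AbortSignal.any is available (Node 20.3+), and the timeout values are illustrative.

```typescript
// Sketch of combining per-test and global eval timeout signals so that
// whichever fires first aborts in-flight work.
const perTestSignal = AbortSignal.timeout(1_500_000); // 25 min per test
const maxEvalTimeSignal = AbortSignal.timeout(2_400_000); // 40 min whole eval

// Before the fix, one signal replaced the other; AbortSignal.any() keeps both.
const combinedSignal = AbortSignal.any([perTestSignal, maxEvalTimeSignal]);

combinedSignal.addEventListener('abort', () => {
  // combinedSignal.reason carries the TimeoutError from whichever timer fired.
  console.log('eval aborted:', combinedSignal.reason);
});
```

Passing `combinedSignal` to the per-test work means a test is cancelled either by its own timer or by the global eval timer, whichever expires first.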
@mldangelo mldangelo changed the title feat(eval): add dynamic timeout defaults with layered protection feat(eval): add dynamic timeout defaults Nov 25, 2025