fix(providers): improve Python provider reliability and logging #6034
Conversation
- Fix spawn python ENOENT by detecting python3 first, then falling back to python
- Remove noisy 'Loading module' messages from stderr
- Add proper cleanup of Python worker pools after evaluations
- Fix token usage display showing (0 requests) for Python providers
⏩ No test execution environment matched (dbe5f9f)
📝 Walkthrough

This change enhances Python provider reliability through three main improvements: (1) automatic Python executable detection (python3 with fallback to python) via path validation in the worker, (2) request count tracking by adding a numRequests field to tokenUsage metrics in completion results, and (3) resource lifecycle management by invoking providerRegistry.shutdownAll() during evaluator cleanup. Additionally, a debug print statement was removed to reduce logging noise. Tests were updated to reflect the new tokenUsage structure.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/evaluator.ts (1)
1915-1917: Good addition for resource cleanup! The cleanup logic correctly shuts down Python worker pools in the finally block. Consider wrapping the shutdown call in a try-catch to prevent cleanup failures from propagating.
Apply this diff to add defensive error handling:

```diff
   await stopOtlpReceiverIfNeeded();
   // Clean up Python worker pools to prevent resource leaks
-  await providerRegistry.shutdownAll();
+  try {
+    await providerRegistry.shutdownAll();
+  } catch (err) {
+    logger.warn(`Error during provider shutdown: ${err}`);
+  }
 }
```

CHANGELOG.md (1)
57-57: Changelog entry looks good; add Tests bullet and split logging noise into Changed.
- The Fixed line meets format and scope. Nice.
- Please also add a Tests entry for the updated pythonCompletion tests.
- Per guidelines, “reduced logging noise” is a non-bug change and belongs under Changed. Trim it from the Fixed line and add a Changed bullet.
Suggested diffs:

- Split logging noise into Changed and trim Fixed:

```diff
- - fix(providers): improve Python provider reliability with automatic python3/python detection, worker cleanup, request count tracking, and reduced logging noise (#6034)
+ - fix(providers): improve Python provider reliability with automatic python3/python detection, worker cleanup, and request count tracking (#6034)
```

- Insert under "### Changed" in [Unreleased]:

```diff
+ - chore(providers): reduce Python provider stderr noise during module loading (#6034)
```

- Add Tests entry in [Unreleased] → "### Tests":

```diff
+ - test(providers): update Python provider tests to expect tokenUsage.numRequests (#6034)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- CHANGELOG.md (1 hunks)
- src/evaluator.ts (2 hunks)
- src/providers/pythonCompletion.ts (2 hunks)
- src/python/persistent_wrapper.py (0 hunks)
- src/python/worker.ts (2 hunks)
- test/providers/pythonCompletion.test.ts (3 hunks)
💤 Files with no reviewable changes (1)
- src/python/persistent_wrapper.py
🧰 Additional context used
📓 Path-based instructions (11)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Prefer not to introduce new TypeScript types; reuse existing interfaces where possible
**/*.{ts,tsx}: Maintain consistent import order (Biome handles sorting)
Use consistent curly braces for all control statements
Prefer const over let and avoid var
Use object shorthand syntax when possible
Use async/await for asynchronous code
Use consistent error handling with proper type checks
**/*.{ts,tsx}: Use TypeScript with strict type checking enabled
Follow consistent import order (Biome will sort imports)
Use consistent curly braces for all control statements
Prefer const over let; avoid var
Use object property shorthand when possible
Use async/await for asynchronous code instead of raw promises/callbacks
When logging, pass sensitive data via the logger context object so it is auto-sanitized; avoid interpolating secrets into message strings
Manually sanitize sensitive objects with sanitizeObject before storing or emitting outside logging contexts
Files:
- src/python/worker.ts
- src/evaluator.ts
- test/providers/pythonCompletion.test.ts
- src/providers/pythonCompletion.ts
src/**
📄 CodeRabbit inference engine (AGENTS.md)
Place core application/library logic under src/
Files:
- src/python/worker.ts
- src/evaluator.ts
- src/providers/pythonCompletion.ts
CHANGELOG.md
📄 CodeRabbit inference engine (.cursor/rules/changelog.mdc)
CHANGELOG.md: Document all user-facing changes in CHANGELOG.md
Every pull request must add or update an entry in CHANGELOG.md under the [Unreleased] section
Follow Keep a Changelog structure under [Unreleased] with sections: Added, Changed, Fixed, Dependencies, Documentation, Tests, Removed
Each changelog entry must include the PR number formatted as (#1234) or temporary placeholder (#XXXX)
Each changelog entry must use a Conventional Commit prefix: feat:, fix:, chore:, docs:, test:, or refactor:
Each changelog entry must be concise and on a single line
Each changelog entry must be user-focused, describing what changed and why it matters to users
Each changelog entry must include a scope in parentheses, e.g., feat(providers): or fix(evaluator):
Use common scopes for consistency: providers, evaluator, webui or app, cli, redteam, core, assertions, config, database
Place all dependency updates under the Dependencies category
Place all test changes under the Tests category
Use categories consistently: Added for new features, Changed for modifications/refactors/CI, Fixed for bug fixes, Removed for removed features
After a PR number is assigned, replace (#XXXX) placeholders with the actual PR number
Be specific, use active voice, include context, and avoid repeating the PR title in changelog entries
Group related changes with multiple bullets in the same category when needed; use one entry per logical change
CHANGELOG.md: All user-facing changes require a CHANGELOG.md entry before creating a PR
Add entries under [Unreleased] in appropriate category (Added, Changed, Fixed, Dependencies, Documentation, Tests)
Each changelog entry must include PR number (#1234) or placeholder (#XXXX)
Use conventional commit prefixes in changelog entries (feat:, fix:, chore:, docs:, test:, refactor:)
CHANGELOG.md: Document all user-facing changes in CHANGELOG.md
Changelog entries must include the PR number in format (#1234)
Use conventional commit prefixes in changelog entries: feat:,...
Files:
CHANGELOG.md
test/**/*.{test,spec}.ts
📄 CodeRabbit inference engine (.cursor/rules/jest.mdc)
test/**/*.{test,spec}.ts: Mock as few functions as possible to keep tests realistic
Never increase the function timeout - fix the test instead
Organize tests in descriptive `describe` and `it` blocks
Prefer assertions on entire objects rather than individual keys when writing expectations
Clean up after tests to prevent side effects (e.g., use `afterEach(() => { jest.resetAllMocks(); })`)
Run tests with the `--randomize` flag to ensure your mocks setup and teardown don't affect other tests
Use Jest's mocking utilities rather than complex custom mocks
Prefer shallow mocking over deep mocking
Mock external dependencies but not the code being tested
Reset mocks between tests to prevent test pollution
For database tests, use in-memory instances or proper test fixtures
Test both success and error cases for each provider
Mock API responses to avoid external dependencies in tests
Validate that provider options are properly passed to the underlying service
Test error handling and edge cases (rate limits, timeouts, etc.)
Ensure provider caching behaves as expected
Always include both `--coverage` and `--randomize` flags when running tests
Run tests in a single pass (no watch mode for CI)
Ensure all tests are independent and can run in any order
Clean up any test data or mocks after each test
Files:
test/providers/pythonCompletion.test.ts
test/**/*.test.ts
📄 CodeRabbit inference engine (test/CLAUDE.md)
test/**/*.test.ts: Never increase Jest test timeouts; fix slow tests instead (avoid jest.setTimeout or large timeouts in tests)
Do not use .only() or .skip() in committed tests
Add afterEach(() => { jest.resetAllMocks(); }) to ensure mock cleanup
Prefer asserting entire objects (toEqual on whole result) rather than individual fields
Mock minimally: only external dependencies (APIs, databases), not code under test
Use Jest (not Vitest) APIs in this suite; avoid importing vitest
Import from @jest/globals in tests
Files:
test/providers/pythonCompletion.test.ts
test/**
📄 CodeRabbit inference engine (test/CLAUDE.md)
Organize tests to mirror src/ structure (e.g., test/providers → src/providers, test/redteam → src/redteam)
Place tests under test/
Files:
test/providers/pythonCompletion.test.ts
test/providers/**/*.test.ts
📄 CodeRabbit inference engine (test/CLAUDE.md)
Provider tests must cover: success case, error cases (4xx, 5xx, rate limits), configuration validation, and token usage tracking
Files:
test/providers/pythonCompletion.test.ts
**/*.{test,spec}.{js,jsx,ts,tsx}
📄 CodeRabbit inference engine (.cursor/rules/gh-cli-workflow.mdc)
Avoid disabling or skipping tests (e.g., .skip, .only) unless absolutely necessary and documented
Files:
test/providers/pythonCompletion.test.ts
test/**/*.{test,spec}.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
test/**/*.{test,spec}.{ts,tsx}: Follow Jest best practices using describe/it blocks
Test both success and error cases for all functionality
Files:
test/providers/pythonCompletion.test.ts
test/**/*.{ts,tsx}
📄 CodeRabbit inference engine (CLAUDE.md)
Follow Jest best practices with describe/it blocks in tests
Files:
test/providers/pythonCompletion.test.ts
src/providers/**/*.ts
📄 CodeRabbit inference engine (src/providers/CLAUDE.md)
src/providers/**/*.ts: Each provider must implement the ApiProvider interface (see src/types/providers.ts)
Providers must transform promptfoo prompts into the provider-specific API request format
Providers must return a normalized ProviderResponse for evaluation
Providers must handle authentication, rate limits, retries, and streaming
Always sanitize logs to prevent leaking API keys; never stringify configs directly in logs
For OpenAI-compatible providers, extend OpenAiChatCompletionProvider and configure apiBaseUrl and options via super(...)
Follow configuration priority: (1) explicit config options, (2) environment variables (PROVIDER_API_KEY), (3) provider defaults
Files:
src/providers/pythonCompletion.ts
🧠 Learnings (1)
📚 Learning: 2025-07-18T17:26:23.087Z
Learnt from: CR
PR: promptfoo/promptfoo#0
File: .cursor/rules/jest.mdc:0-0
Timestamp: 2025-07-18T17:26:23.087Z
Learning: Applies to test/**/*.{test,spec}.ts : Ensure provider caching behaves as expected
Applied to files:
test/providers/pythonCompletion.test.ts
🧬 Code graph analysis (2)
src/python/worker.ts (1)
src/python/pythonUtils.ts (1)
- validatePythonPath (207-262)
src/evaluator.ts (1)
src/providers/providerRegistry.ts (1)
- providerRegistry (65-65)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: Test on Node 24.x and windows-latest
- GitHub Check: Test on Node 24.x and ubuntu-latest
- GitHub Check: Test on Node 20.x and windows-latest
- GitHub Check: Test on Node 20.x and ubuntu-latest
- GitHub Check: Test on Node 22.x and windows-latest
- GitHub Check: Test on Node 20.x and macOS-latest
- GitHub Check: Test on Node 22.x and macOS-latest
- GitHub Check: Build Docs
- GitHub Check: Share Test
- GitHub Check: Style Check
- GitHub Check: Run Integration Tests
- GitHub Check: webui tests
- GitHub Check: Redteam (Staging API)
- GitHub Check: Generate Assets
- GitHub Check: Redteam (Production API)
- GitHub Check: Build on Node 20.x
- GitHub Check: Build on Node 24.x
- GitHub Check: Build on Node 22.x
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (8)
src/evaluator.ts (1)
23-23: LGTM! The import is clean and correctly references the providerRegistry singleton for Python worker cleanup.
src/python/worker.ts (2)
7-7: LGTM: Import added for Python path validation. The import of validatePythonPath enables automatic detection of Python executables (python3 with fallback to python), addressing ENOENT errors on systems with only python3.
37-45: LGTM: Python path validation properly integrated. The validation logic correctly:

- Attempts python3 first, then falls back to python
- Passes the isExplicit flag to distinguish user-provided paths from defaults
- Uses the resolved path in PythonShell instantiation
- Leverages caching from validatePythonPath to avoid redundant checks

This should eliminate "spawn python ENOENT" errors on python3-only systems.
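In rough terms, the worker-side resolution might look like the following TypeScript sketch. The validatePythonPath name and its (path, isExplicit) signature come from the code graph above; the surrounding helper and config shape are illustrative, not the PR's actual code.

```ts
import { validatePythonPath } from './pythonUtils';

// Illustrative wrapper: an explicitly configured path must validate as-is,
// while the default candidate may fall back from python3 to python inside
// validatePythonPath. Results are cached there, so repeated worker spawns
// do not re-probe the filesystem.
async function resolvePythonExecutable(configuredPath?: string): Promise<string> {
  const isExplicit = configuredPath !== undefined;
  const candidate = configuredPath ?? 'python3';
  return validatePythonPath(candidate, isExplicit);
}
```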
src/providers/pythonCompletion.ts (2)
196-207: LGTM: Token usage transformation for cached results includes numRequests. The cached result path correctly preserves or defaults numRequests to 1 using nullish coalescing. This ensures consistent token usage tracking for cached responses and addresses the "(0 requests)" display issue.
314-326: LGTM: Fresh results enriched with numRequests. The logic correctly:

- Sets cached=false for fresh results
- Adds numRequests=1 if tokenUsage exists but numRequests doesn't
- Includes debug logging to trace the addition

This ensures consistent token usage structure across both cached and fresh result paths.
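As a sketch of the enrichment pattern described here (the withRequestCount helper is hypothetical; the field names mirror the review):

```ts
interface TokenUsage {
  total?: number;
  prompt?: number;
  completion?: number;
  numRequests?: number;
}

// Hypothetical helper showing the nullish-coalescing default: an explicit
// count reported by the Python script wins; otherwise one completion call
// counts as a single request, so the UI never renders "(0 requests)".
function withRequestCount(tokenUsage?: TokenUsage): TokenUsage | undefined {
  if (!tokenUsage) {
    return undefined;
  }
  return { ...tokenUsage, numRequests: tokenUsage.numRequests ?? 1 };
}
```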
test/providers/pythonCompletion.test.ts (3)
262-292: LGTM: Test updated for cached result token usage transformation. The test expectation correctly includes numRequests: 1 in the transformed token usage object for cached results, matching the implementation behavior in src/providers/pythonCompletion.ts.
294-323: LGTM: Test updated for fresh result token usage. The test expectation correctly includes numRequests: 1 alongside the original token usage fields for fresh results, aligning with the implementation that enriches tokenUsage with numRequests.
345-372: LGTM: Test covers zero token usage edge case. The test correctly verifies that numRequests: 1 is added even when token counts are zero, ensuring comprehensive coverage of the token usage transformation logic.
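A minimal Jest sketch of the expectation shape these tests use (whole-object toEqual, per the guidelines above), exercising the hypothetical withRequestCount helper from the previous example:

```ts
import { describe, expect, it } from '@jest/globals';

// Inlined copy of the hypothetical helper so the sketch is self-contained.
const withRequestCount = (u?: { total?: number; numRequests?: number }) =>
  u && { ...u, numRequests: u.numRequests ?? 1 };

describe('token usage enrichment', () => {
  it('adds numRequests even when token counts are zero', () => {
    expect(withRequestCount({ total: 0 })).toEqual({ total: 0, numRequests: 1 });
  });

  it('preserves an explicit request count', () => {
    expect(withRequestCount({ total: 10, numRequests: 3 })).toEqual({ total: 10, numRequests: 3 });
  });
});
```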
Address code review feedback by using Promise.allSettled instead of Promise.all to ensure cleanup errors don't crash successful evaluations. Individual provider shutdown failures are logged but don't prevent other cleanups or throw errors.
src/providers/providerRegistry.ts (outdated)

```ts
// Log any failures but don't throw - cleanup should be defensive
for (const result of results) {
  if (result.status === 'rejected') {
    logger.error(`Error shutting down provider: ${result.reason}`);
```
sorry to be a nit, should this be a Warning?
One nit on the log message, looks good
Cleanup failures should be warnings not errors since they don't affect the evaluation result and we're intentionally being defensive.
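Taken together, the defensive shutdown the feedback converges on might look roughly like the sketch below. The registry internals and logger import path are assumptions; the Promise.allSettled usage and warn-level logging follow the discussion above.

```ts
import logger from '../logger';

interface ShutdownCapable {
  shutdown?: () => Promise<void>;
}

class ProviderRegistry {
  private providers: ShutdownCapable[] = [];

  async shutdownAll(): Promise<void> {
    // allSettled lets every provider attempt cleanup even if one rejects.
    const results = await Promise.allSettled(
      this.providers.map((p) => p.shutdown?.() ?? Promise.resolve()),
    );
    // Log any failures but don't throw - cleanup should be defensive and
    // must not affect the evaluation result (hence warn, not error).
    for (const result of results) {
      if (result.status === 'rejected') {
        logger.warn(`Error shutting down provider: ${result.reason}`);
      }
    }
  }
}
```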
Summary
Fixes several Python provider issues that were causing failures and confusion:
- Systems with only python3 (not python) were failing with spawn ENOENT

Changes
Core Fixes
- Detect the Python executable via validatePythonPath() (python3 first, then python)
- Call providerRegistry.shutdownAll() to clean up Python workers in the finally block
- Add numRequests field to tokenUsage for both cached and fresh results

Tests
- Updated tests to expect the numRequests field

Testing
- Verified against the examples/ directory

Before/After
Before:
After: