Skip to content

fix: replace per-test databases with shared DB + per-test pools#429

Merged
PierreLeGuen merged 11 commits intomainfrom
fix/300-slow-e2e-tests
Feb 16, 2026
Merged

fix: replace per-test databases with shared DB + per-test pools#429
PierreLeGuen merged 11 commits intomainfrom
fix/300-slow-e2e-tests

Conversation

@ilblackdragon
Copy link
Copy Markdown
Member

@ilblackdragon ilblackdragon commented Feb 12, 2026

Summary

  • Replaces heavyweight per-test CREATE DATABASE ... TEMPLATE / DROP DATABASE pattern with a single shared database (platform_api_e2e) and per-test connection pools
  • Eliminates all admin connections for test setup/teardown, reducing peak PG connections from ~300 to ~64
  • Fixes shared-state concurrency bugs where tests using models.data[0] or mutating shared model names would interfere with parallel tests
  • Deletes cleanup_test_databases.rs (no per-test DBs to clean up)
  • Adds nextest config for capping parallel e2e threads at 16

Performance

Metric Before (main) After Change
Sum of test runtimes 116s 25s 4.6x faster
Slowest binary 18.4s 3.4s 5.4x faster
Test binary failures 23/37 1/37 (pre-existing) stable

Test plan

  • cargo fmt --all -- --check
  • cargo clippy -p api --tests (zero warnings)
  • cargo test -p api --tests (all pass except 1 pre-existing failure unrelated to this PR)
  • No individual test over 4s, no timeouts

🤖 Generated with Claude Code


Note

Medium Risk
Moderate risk because it changes test infrastructure and database bootstrapping/concurrency behavior; failures are likely to manifest as flaky or deadlocked CI tests rather than production runtime issues.

Overview
Switches API e2e/integration tests from per-test database cloning/teardown to a single shared test database bootstrapped once with migrations, with each test creating its own 4-connection pool (via deadpool-postgres) and using advisory locks to serialize bootstrap across nextest binaries.

Updates CI/nextest to cap e2e parallelism at 16 and increases PostgreSQL max_connections (plus --shm-size) to prevent connection exhaustion; removes the orphaned test DB cleanup test and adjusts e2e tests to avoid shared-state flakiness (e.g., using explicit model IDs / higher list limits) and drops the now-unneeded DB guard plumbing.

Written by Cursor Bugbot for commit 1d45d41. This will update automatically on new commits. Configure here.

E2E tests intermittently stalled for ~60s in CI due to PostgreSQL
connection exhaustion. With num-cpus parallelism on 32+ core runners,
448+ tests each creating 4-6 admin connections exceeded PG's default
max_connections=100.

- Cache admin DB name probe in a static OnceCell (saves 1-2 connections per test)
- Remove redundant pg_terminate_backend/DROP on UUID-named test DBs (can't pre-exist)
- Use DROP DATABASE ... WITH (FORCE) instead of separate terminate + drop
- Remove unnecessary 50ms sleep after template migration
- Cap e2e tests at 16 parallel threads via nextest test group
- Increase PG max_connections to 300 in CI as safety net
- Remove debug println that dumped full config on every test server init

Per-test admin connections: 4-6 -> 2. Peak connections: unbounded -> ~64.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 12, 2026 17:07
@claude
Copy link
Copy Markdown

claude bot commented Feb 12, 2026

Code Review

I've reviewed the changes for critical issues.

✅ No Critical Issues Found

The PR effectively addresses PostgreSQL connection exhaustion during parallel test execution. The implementation is sound with good safety measures.

Key Validations:

PostgreSQL Compatibility - CI uses postgres:15, and DROP DATABASE ... WITH (FORCE) requires PG 13+, so this is safe.

Thread Safety - ADMIN_DB_NAME caching with tokio::sync::OnceCell ensures only one probe happens across concurrent tests.

Nextest Filter - The pattern test(/^e2e/) correctly matches test functions starting with e2e, properly capping parallelism to 16 threads.

Service Restart - The workflow correctly restarts PostgreSQL after ALTER SYSTEM SET max_connections with a 30s retry loop using pg_isready.

Connection Math - "16 parallel tests × 4 connections = 64" is validated by the optimization reducing admin connections from 4-6 to 2 per test.

Safety Preserved - validate_db_identifier() checks remain in place, preventing SQL injection.

Test-Only Changes - No production impact except removing debug println! in api/src/lib.rs:142.

Summary

Well-designed optimization that:

  • Reduces connection pressure through caching and removing redundant operations
  • Uses modern PostgreSQL features correctly
  • Maintains all safety validations
  • Properly bounds parallelism via nextest test-groups

Approved - Ready to merge pending CI validation.


Code review by Claude Code

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reduces PostgreSQL connection pressure during api e2e/integration tests to prevent intermittent CI stalls caused by exhausting Postgres’ default max_connections.

Changes:

  • Cache the “admin DB” name probe (postgres vs template1) and remove a post-migration sleep to cut per-test admin connections.
  • Simplify test DB lifecycle: create UUID-named DBs without redundant terminate/drop, and drop DBs using DROP DATABASE ... WITH (FORCE).
  • Bound parallelism for DB-heavy tests via nextest test groups and raise Postgres max_connections in CI.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
crates/api/tests/common/db_setup.rs Lowers per-test admin DB connections and simplifies create/drop operations for test databases.
crates/api/src/lib.rs Removes debug println! dumping config during auth init.
.github/workflows/test.yml Increases CI Postgres resources (--shm-size) and raises max_connections to reduce stalls.
.config/nextest.toml Caps parallelism for api e2e tests to limit peak DB connections.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

for i in $(seq 1 30); do
docker exec ${{ job.services.postgres.id }} pg_isready -U postgres && break
sleep 1
done
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The readiness wait loop doesn’t fail the step if PostgreSQL never becomes ready after the restart (the for loop always exits 0). Add an explicit check after the loop (or use a timeout helper) to exit 1 when pg_isready still fails, so the job doesn’t proceed with a broken DB and produce misleading test failures/timeouts.

Suggested change
done
done
if ! docker exec ${{ job.services.postgres.id }} pg_isready -U postgres; then
echo "PostgreSQL did not become ready after restart within 30 seconds" >&2
exit 1
fi

Copilot uses AI. Check for mistakes.
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @ilblackdragon, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses intermittent CI stalls caused by PostgreSQL connection exhaustion during parallel E2E tests. It introduces several optimizations to reduce the number of database connections used by tests and improves the efficiency of test database setup and teardown. The changes aim to stabilize CI by bounding peak connection usage and streamlining database operations, ensuring a more robust testing environment.

Highlights

  • Reduced E2E Test Connection Pressure: Implemented a nextest test group to cap e2e/integration tests at 16 parallel threads, significantly reducing peak PostgreSQL connections and preventing CI stalls.
  • Optimized Admin Database Connection: The get_admin_db_name function now caches the admin database name (postgres or template1) using OnceCell, reducing connection overhead by 1-2 connections per test.
  • Streamlined Test Database Lifecycle: Removed redundant pg_terminate_backend and DROP DATABASE IF EXISTS calls during the creation of UUID-named test databases, as they cannot pre-exist. Also, adopted DROP DATABASE ... WITH (FORCE) for cleanup, simplifying the process and automatically terminating active connections (PostgreSQL 13+).
  • Removed Unnecessary Delay and Debug Output: Eliminated an unnecessary 50ms sleep after template database migration and removed a debug println! statement that previously dumped configuration on every test server initialization.
Changelog
  • .config/nextest.toml
    • Added a test-groups.e2e-db section to define a maximum of 16 threads for this group.
    • Applied the e2e-db test group override to api package tests starting with e2e/.
  • crates/api/src/lib.rs
    • Removed a println! debug statement within the init_auth_services function.
  • crates/api/tests/common/db_setup.rs
    • Introduced a static OnceCell<String> named ADMIN_DB_NAME for caching the admin database name.
    • Refactored get_admin_db_name to use ADMIN_DB_NAME.get_or_try_init for caching the admin database name.
    • Removed a tokio::time::sleep call from create_template_database_internal.
    • Eliminated pg_terminate_backend and DROP DATABASE IF EXISTS calls from create_test_database_from_template.
    • Modified drop_test_database to use DROP DATABASE IF EXISTS {db_name} WITH (FORCE).
    • Modified drop_all_test_databases to use DROP DATABASE IF EXISTS {db_name} WITH (FORCE) for each database.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/test.yml
Activity
  • The pull request summary and initial description were generated by Claude Code and Cursor Bugbot, indicating automated assistance in drafting the PR.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses CI stalls from PostgreSQL connection exhaustion through optimizations like caching the admin database name, capping test parallelism via nextest, and streamlining database operations with DROP DATABASE ... WITH (FORCE). Crucially, it also enhances security by removing a debug statement that logged sensitive application configuration. These improvements lead to cleaner, more efficient code, a more stable test pipeline, and a confirmed absence of other vulnerabilities, with SQL query modifications safely implemented to prevent injection.


let auth_service: Arc<dyn AuthServiceTrait> = if config.auth.mock {
// TODO: fix this, it should not use the database pool
println!("config: {config:?}");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The application previously logged the entire ApiConfig struct. This configuration object contains sensitive information, including database credentials, API keys, and encryption keys. Logging this data, even in a test context, poses a risk as such code could be accidentally activated in production or secrets could be exposed in CI/CD logs. This change correctly removes the sensitive logging statement, mitigating the vulnerability.

- Use binary(/^e2e/) instead of test(/^e2e/) so the 16-thread cap
  actually applies to e2e test binaries (test() matches function names)
- Fail the CI step if PostgreSQL never becomes ready after restart

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ilblackdragon
Copy link
Copy Markdown
Member Author

Addressed both review comments in 227d53b:

  1. Nextest filter (Cursor Bugbot): Changed test(/^e2e/) to binary(/^e2e/). The test() predicate matches function names (which start with test_), not binary names. binary() correctly matches the integration test binaries derived from e2e_*.rs filenames. Great catch, this was the most important part of the PR.

  2. pg_isready guard (Copilot): Added explicit readiness check after the wait loop that exits 1 if PostgreSQL never becomes ready after restart.

ilblackdragon and others added 3 commits February 12, 2026 11:04
Eliminates heavyweight per-test CREATE/DROP DATABASE operations by sharing
a single database across all e2e tests, with each test getting its own
4-connection pool. Tests are already isolated via UUID-scoped orgs,
workspaces, and API keys.

Also fixes shared-state concurrency issues where tests using
models.data[0] or mutating shared model names would interfere with
parallel tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ilblackdragon ilblackdragon changed the title fix: reduce e2e test connection pressure to prevent CI stalls fix: replace per-test databases with shared DB + per-test pools Feb 12, 2026
- Replace check-then-create with idempotent CREATE DATABASE that catches
  SQLSTATE 42P04 (duplicate_database), fixing the TOCTOU race when
  multiple test binaries bootstrap concurrently
- Bump admin model list queries to limit=500 so tests find their models
  in the shared DB which accumulates entries across runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous fix checked for SQLSTATE 42P04 (DUPLICATE_DATABASE), but
in CI the error doesn't carry a SQL state code, causing a panic. Since
multiple test binaries race to create the shared DB (OnceCell is
per-process), just swallow any CREATE DATABASE error and let the
migration pool connection be the real validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

The previous fix swallowed CREATE DATABASE errors but multiple binaries
still raced on refinery's schema_history table creation during migrations.
Now the entire bootstrap (CREATE DATABASE + migrations) is serialized
with a PostgreSQL advisory lock on the admin connection. OnceCell still
gates within a single binary; the advisory lock handles cross-process
coordination.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI runs clippy with -D warnings and caught unusual_byte_groupings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…with_pool return type

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PierreLeGuen PierreLeGuen merged commit f8a3cfc into main Feb 16, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants