fix: replace per-test databases with shared DB + per-test pools#429
fix: replace per-test databases with shared DB + per-test pools#429PierreLeGuen merged 11 commits intomainfrom
Conversation
E2E tests intermittently stalled for ~60s in CI due to PostgreSQL connection exhaustion. With num-cpus parallelism on 32+ core runners, 448+ tests each creating 4-6 admin connections exceeded PG's default max_connections=100. - Cache admin DB name probe in a static OnceCell (saves 1-2 connections per test) - Remove redundant pg_terminate_backend/DROP on UUID-named test DBs (can't pre-exist) - Use DROP DATABASE ... WITH (FORCE) instead of separate terminate + drop - Remove unnecessary 50ms sleep after template migration - Cap e2e tests at 16 parallel threads via nextest test group - Increase PG max_connections to 300 in CI as safety net - Remove debug println that dumped full config on every test server init Per-test admin connections: 4-6 -> 2. Peak connections: unbounded -> ~64. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code ReviewI've reviewed the changes for critical issues. ✅ No Critical Issues FoundThe PR effectively addresses PostgreSQL connection exhaustion during parallel test execution. The implementation is sound with good safety measures. Key Validations:✅ PostgreSQL Compatibility - CI uses postgres:15, and ✅ Thread Safety - ✅ Nextest Filter - The pattern ✅ Service Restart - The workflow correctly restarts PostgreSQL after ✅ Connection Math - "16 parallel tests × 4 connections = 64" is validated by the optimization reducing admin connections from 4-6 to 2 per test. ✅ Safety Preserved - ✅ Test-Only Changes - No production impact except removing debug SummaryWell-designed optimization that:
✅ Approved - Ready to merge pending CI validation. Code review by Claude Code |
There was a problem hiding this comment.
Pull request overview
Reduces PostgreSQL connection pressure during api e2e/integration tests to prevent intermittent CI stalls caused by exhausting Postgres’ default max_connections.
Changes:
- Cache the “admin DB” name probe (
postgresvstemplate1) and remove a post-migration sleep to cut per-test admin connections. - Simplify test DB lifecycle: create UUID-named DBs without redundant terminate/drop, and drop DBs using
DROP DATABASE ... WITH (FORCE). - Bound parallelism for DB-heavy tests via nextest test groups and raise Postgres
max_connectionsin CI.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
crates/api/tests/common/db_setup.rs |
Lowers per-test admin DB connections and simplifies create/drop operations for test databases. |
crates/api/src/lib.rs |
Removes debug println! dumping config during auth init. |
.github/workflows/test.yml |
Increases CI Postgres resources (--shm-size) and raises max_connections to reduce stalls. |
.config/nextest.toml |
Caps parallelism for api e2e tests to limit peak DB connections. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for i in $(seq 1 30); do | ||
| docker exec ${{ job.services.postgres.id }} pg_isready -U postgres && break | ||
| sleep 1 | ||
| done |
There was a problem hiding this comment.
The readiness wait loop doesn’t fail the step if PostgreSQL never becomes ready after the restart (the for loop always exits 0). Add an explicit check after the loop (or use a timeout helper) to exit 1 when pg_isready still fails, so the job doesn’t proceed with a broken DB and produce misleading test failures/timeouts.
| done | |
| done | |
| if ! docker exec ${{ job.services.postgres.id }} pg_isready -U postgres; then | |
| echo "PostgreSQL did not become ready after restart within 30 seconds" >&2 | |
| exit 1 | |
| fi |
Summary of ChangesHello @ilblackdragon, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses intermittent CI stalls caused by PostgreSQL connection exhaustion during parallel E2E tests. It introduces several optimizations to reduce the number of database connections used by tests and improves the efficiency of test database setup and teardown. The changes aim to stabilize CI by bounding peak connection usage and streamlining database operations, ensuring a more robust testing environment. Highlights
Changelog
Ignored Files
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request effectively addresses CI stalls from PostgreSQL connection exhaustion through optimizations like caching the admin database name, capping test parallelism via nextest, and streamlining database operations with DROP DATABASE ... WITH (FORCE). Crucially, it also enhances security by removing a debug statement that logged sensitive application configuration. These improvements lead to cleaner, more efficient code, a more stable test pipeline, and a confirmed absence of other vulnerabilities, with SQL query modifications safely implemented to prevent injection.
|
|
||
| let auth_service: Arc<dyn AuthServiceTrait> = if config.auth.mock { | ||
| // TODO: fix this, it should not use the database pool | ||
| println!("config: {config:?}"); |
There was a problem hiding this comment.
The application previously logged the entire ApiConfig struct. This configuration object contains sensitive information, including database credentials, API keys, and encryption keys. Logging this data, even in a test context, poses a risk as such code could be accidentally activated in production or secrets could be exposed in CI/CD logs. This change correctly removes the sensitive logging statement, mitigating the vulnerability.
- Use binary(/^e2e/) instead of test(/^e2e/) so the 16-thread cap actually applies to e2e test binaries (test() matches function names) - Fail the CI step if PostgreSQL never becomes ready after restart Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Addressed both review comments in 227d53b:
|
Eliminates heavyweight per-test CREATE/DROP DATABASE operations by sharing a single database across all e2e tests, with each test getting its own 4-connection pool. Tests are already isolated via UUID-scoped orgs, workspaces, and API keys. Also fixes shared-state concurrency issues where tests using models.data[0] or mutating shared model names would interfere with parallel tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace check-then-create with idempotent CREATE DATABASE that catches SQLSTATE 42P04 (duplicate_database), fixing the TOCTOU race when multiple test binaries bootstrap concurrently - Bump admin model list queries to limit=500 so tests find their models in the shared DB which accumulates entries across runs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous fix checked for SQLSTATE 42P04 (DUPLICATE_DATABASE), but in CI the error doesn't carry a SQL state code, causing a panic. Since multiple test binaries race to create the shared DB (OnceCell is per-process), just swallow any CREATE DATABASE error and let the migration pool connection be the real validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
The previous fix swallowed CREATE DATABASE errors but multiple binaries still raced on refinery's schema_history table creation during migrations. Now the entire bootstrap (CREATE DATABASE + migrations) is serialized with a PostgreSQL advisory lock on the admin connection. OnceCell still gates within a single binary; the advisory lock handles cross-process coordination. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CI runs clippy with -D warnings and caught unusual_byte_groupings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…with_pool return type Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
CREATE DATABASE ... TEMPLATE/DROP DATABASEpattern with a single shared database (platform_api_e2e) and per-test connection poolsmodels.data[0]or mutating shared model names would interfere with parallel testscleanup_test_databases.rs(no per-test DBs to clean up)Performance
Test plan
cargo fmt --all -- --checkcargo clippy -p api --tests(zero warnings)cargo test -p api --tests(all pass except 1 pre-existing failure unrelated to this PR)🤖 Generated with Claude Code
Note
Medium Risk
Moderate risk because it changes test infrastructure and database bootstrapping/concurrency behavior; failures are likely to manifest as flaky or deadlocked CI tests rather than production runtime issues.
Overview
Switches API e2e/integration tests from per-test database cloning/teardown to a single shared test database bootstrapped once with migrations, with each test creating its own 4-connection pool (via
deadpool-postgres) and using advisory locks to serialize bootstrap across nextest binaries.Updates CI/nextest to cap e2e parallelism at 16 and increases PostgreSQL
max_connections(plus--shm-size) to prevent connection exhaustion; removes the orphaned test DB cleanup test and adjusts e2e tests to avoid shared-state flakiness (e.g., using explicit model IDs / higher list limits) and drops the now-unneeded DB guard plumbing.Written by Cursor Bugbot for commit 1d45d41. This will update automatically on new commits. Configure here.