Managed-Postgres docs + producer-side enqueue histograms #265
Two related additions from the awa-bench-driver staging run:
1. docs/deploying-on-managed-postgres.md: Cloud SQL / AlloyDB specifics
   that aren't in the existing benchmarking/configuration/deployment docs:
   * Per-vCPU sustained completion + burst enqueue from the staging
     report.
   * 'Prefer PG18 on AlloyDB', from direct measurement, not theory: the same
     4 vCPU AlloyDB instance went from ~2k jobs/s sustained on PG17 to
     ~5k jobs/s on PG18 (matching Cloud SQL PG18 exactly), even after
     v021 was in place.
   * IAM gotchas: Cloud SQL IAM SAs need cloudsqlsuperuser GRANT before
     prepare_schema can run; AlloyDB grants it automatically.
   * Auth-proxy as a native sidecar (initContainer + restartPolicy:
     Always), with a k8s example.
   * enqueue_shards recommendation: set explicitly per queue, >=4.
     Cross-links to configuration.md and ADR-025.
   * Cross-links to issue #264 (prepare_schema race) as a known footgun
     with mitigation.
2. Producer-side timing observability: awa.enqueue.batch_size and
   awa.enqueue.duration histograms in AwaMetrics, recorded by the Python
   binding's enqueue_many_copy paths. Lets producers see whether a
   target enqueue rate is missed because batches are tiny or because
   batches are slow; existing metrics could not distinguish the two. Rust
   users calling QueueStorage::enqueue_params_copy directly can call
   AwaMetrics::from_global().record_enqueue_batch(...) themselves; a
   thin awa::enqueue_many_copy facade that wraps this is a natural
   follow-up.
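To make the "tiny batches vs slow batches" distinction concrete, here is a minimal sketch of how a producer might read the two signals together. It is an illustration only: `BatchSample`, `Diagnosis`, `diagnose`, and the `min_batch_rows` threshold are hypothetical names that are not part of awa, and the samples stand in for what awa.enqueue.batch_size and awa.enqueue.duration would record per batch.

```rust
use std::time::Duration;

/// One enqueue call as the two histograms would see it: rows sent, time taken.
/// Hypothetical type for illustration; not part of awa.
#[derive(Debug, Clone, Copy)]
struct BatchSample {
    rows: u64,
    elapsed: Duration,
}

impl BatchSample {
    fn new(rows: u64, elapsed_ms: u64) -> Self {
        BatchSample { rows, elapsed: Duration::from_millis(elapsed_ms) }
    }
}

#[derive(Debug, PartialEq)]
enum Diagnosis {
    OnTarget,
    BatchesTooSmall,
    BatchesTooSlow,
}

/// Hypothetical helper. Returns None when there is nothing to judge.
/// `min_batch_rows` is an assumed threshold below which a batch counts as "tiny".
fn diagnose(
    samples: &[BatchSample],
    target_rows_per_sec: f64,
    min_batch_rows: f64,
) -> Option<Diagnosis> {
    let total_rows: u64 = samples.iter().map(|s| s.rows).sum();
    let total_secs: f64 = samples.iter().map(|s| s.elapsed.as_secs_f64()).sum();
    if samples.is_empty() || total_secs <= 0.0 {
        return None;
    }
    // Rows per second of time spent inside enqueue calls (not wall-clock time).
    let achieved = total_rows as f64 / total_secs;
    let mean_batch = total_rows as f64 / samples.len() as f64;
    Some(if achieved >= target_rows_per_sec {
        Diagnosis::OnTarget
    } else if mean_batch < min_batch_rows {
        Diagnosis::BatchesTooSmall
    } else {
        Diagnosis::BatchesTooSlow
    })
}
```

With a 5,000 rows/s target and a 500-row threshold, a hundred 10-row batches at 5 ms each classify as `BatchesTooSmall`, while ten 1,000-row batches at 1 s each classify as `BatchesTooSlow`. In practice the same comparison would be done on a dashboard over the histogram data; the thresholds here are arbitrary.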
Validation:
- SQLX_OFFLINE=true cargo check -p awa-metrics
- SQLX_OFFLINE=true cargo check -p awa-worker
- cargo check on awa-python (excluded from workspace, ran in its own dir)
- cargo fmt --all --check
Walkthrough
Adds OpenTelemetry histograms to measure batch enqueue size and duration in COPY operations, instruments async and sync enqueue paths to emit these metrics, and introduces comprehensive documentation for deploying awa workers on managed PostgreSQL with sizing, IAM/auth, tuning, and troubleshooting guidance.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0c33d1ec47
        .enqueue_params_copy(&pool, &insert_params)
        .await
        .map_err(map_awa_error)?;
    awa_worker::AwaMetrics::from_global().record_enqueue_batch(
Cache enqueue metrics outside the batch hot path
For high-throughput Python producers that call enqueue_many_copy once per chunk, this constructs a fresh AwaMetrics via from_global() after every successful batch, and AwaMetrics::new registers/builds the entire metric set (not just these two histograms). That adds dozens of instrument builder calls and allocations directly to the COPY enqueue hot path that the docs now recommend for million-job spikes; cache an AwaMetrics on PyClient or otherwise initialize the enqueue instruments once and reuse the handles.
Codex feedback on PR #265: AwaMetrics::from_global() runs the full Self::new(meter) which registers and builds every counter, histogram, and gauge in the metric set — calling it once per enqueue_many_copy batch puts ~30 instrument-builder allocations directly on the COPY producer hot path the new docs explicitly recommend for million-job spikes. Cache AwaMetrics on PyClient and clone it into the async block (the struct is already #[derive(Clone)]; the clone is cheap — internally just bumps Arc refcounts on the underlying OTel instruments). This only touches the two paths added in this PR; the pre-existing DLQ paths in awa-python still call AwaMetrics::from_global() per operation but those aren't producer-hot and are out of scope here. A follow-up could fold them in.
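The caching pattern described above can be sketched with stand-in types. This is an assumption-laden illustration, not the awa code: `Metrics`, `Histogram`, `Client`, and the `builds` counter are hypothetical stand-ins for AwaMetrics, the OTel instruments, and PyClient, and only the `record_enqueue_batch(queue, count, duration)` argument order is taken from the PR text.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::time::Duration;

/// Counts how often the stand-in instrument set is built.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

fn builds() -> usize {
    BUILDS.load(Ordering::SeqCst)
}

/// Stand-in for an OTel histogram handle (hypothetical).
#[derive(Default)]
struct Histogram {
    samples: Mutex<Vec<f64>>,
}

impl Histogram {
    fn record(&self, value: f64) {
        self.samples.lock().unwrap().push(value);
    }
    fn count(&self) -> usize {
        self.samples.lock().unwrap().len()
    }
}

/// Stand-in for AwaMetrics: cloning only bumps Arc refcounts.
#[derive(Clone)]
struct Metrics {
    enqueue_batch_size: Arc<Histogram>,
    enqueue_duration: Arc<Histogram>,
}

impl Metrics {
    /// Models the expensive path: every call builds the instruments again.
    fn from_global() -> Self {
        BUILDS.fetch_add(1, Ordering::SeqCst);
        Metrics {
            enqueue_batch_size: Arc::new(Histogram::default()),
            enqueue_duration: Arc::new(Histogram::default()),
        }
    }

    fn record_enqueue_batch(&self, _queue: &str, count: u64, elapsed: Duration) {
        self.enqueue_batch_size.record(count as f64);
        self.enqueue_duration.record(elapsed.as_secs_f64());
    }
}

/// Stand-in for PyClient: builds the metrics once and reuses the handle.
struct Client {
    metrics: Metrics,
}

impl Client {
    fn new() -> Self {
        Client { metrics: Metrics::from_global() }
    }

    fn enqueue_many_copy(&self, queue: &str, rows: u64) {
        // Cheap clone, as it would be when moving the handle into an async block.
        let metrics = self.metrics.clone();
        // The COPY itself would run here; a fixed duration stands in for timing it.
        metrics.record_enqueue_batch(queue, rows, Duration::from_millis(5));
    }
}
```

Calling `enqueue_many_copy` a thousand times on one `Client` builds the instrument set once, where calling `Metrics::from_global()` inside the method would build it a thousand times.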
Codex flagged this on the new enqueue_many_copy paths but the same anti-pattern was present elsewhere: AwaMetrics::from_global() rebuilds the full instrument set (~30 counters/histograms/gauges) on every call.
Cached sites:
- awa-ui::AppState now carries an AwaMetrics; all 5 DLQ HTTP handlers use state.metrics instead of from_global() per request.
- awa-python::PyClient.metrics is now shared across the 12 DLQ async/sync method pairs (in addition to the enqueue_many_copy paths from the prior commit).
- awa-cli::Commands::Dlq constructs AwaMetrics once before the inner match arm and reuses it across the four DLQ subcommand branches. Lower impact since each CLI invocation hits at most one branch, but consistent with the other call sites.
Intentionally left alone:
- awa-worker::ClientBuilder::build (one-shot at startup).
- AwaMetrics::Default::default (used as a fallback handle; callers shouldn't call it repeatedly).
- All test code.
Promote the Unreleased section to a tagged 0.6.0 entry dated today, prepended with an operator-facing 'Highlights' section that covers the release-readiness umbrella's outstanding doc-pass items: queue-storage default, staged upgrade, rollback limits, enqueue_shards as a per-queue knob, the direct COPY producer path, the managed-Postgres deploy guide, benchmark caveats, and the known MVCC operational limit.
Folds in the three PRs that landed after the umbrella was last synchronised:
- #263 (direct COPY producer path, sub-second wait-duration buckets, v021 shard-aware lane indexes)
- #265 (managed-Postgres deploy doc, awa.enqueue.{batch_size,duration} histograms, AwaMetrics caching cleanup)
- #266 (prepare_schema startup race fix, queue_storage_schema_ready tightening; closes #264)
No code or version-string changes: the Cargo.toml version bump is the release-tag step and intentionally not in this PR.
Adds two entries pulled from the staging-benchmark + #265 / #266 arc that aren't surfaced anywhere else in troubleshooting:
- 'Producer Enqueue Is Slower Than Expected': full diagnostic flow starting from the new awa.enqueue.batch_size / awa.enqueue.duration histograms (added in #265), with three diagnoses:
  1. compat insert path vs direct queue-storage COPY (~100-150 ms/row through a real DB proxy)
  2. enqueue_shards=1 contention on the head row, with the upsert to fix it (a plain UPDATE silently affects zero rows, which is a footgun)
  3. WAL/commit pressure on undersized managed Postgres, pointing at the per-vCPU sizing table
- 'relation awa.job_id_seq does not exist' under Common Error Cases. Post-#266 the prepare_schema race is gone, but a producer that races the very first worker startup, or an environment rebuild that drops the schema, can still surface this. Names the cause (queue-storage substrate is created by prepare_schema, not by awa migrate alone) and three fixes.
Cross-links the existing managed-Postgres doc, configuration.md 'Producer path choice', and ADR-025.
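As a rough worked example of why the first diagnosis matters, the arithmetic below turns the quoted ~100-150 ms/row figure into a throughput ceiling. It assumes one round trip per row, issued strictly one after another on each connection, with perfect scaling across connections; those assumptions and the function names are illustrative and do not come from the awa docs.

```rust
/// Upper bound on rows/s for one producer issuing row inserts back to back.
/// Assumes each row costs one full round trip of `latency_ms_per_row`.
fn serial_rows_per_sec(latency_ms_per_row: f64) -> f64 {
    1000.0 / latency_ms_per_row
}

/// Hours needed to enqueue `rows` at the given per-row latency, assuming
/// `concurrency` independent connections that scale perfectly (an idealisation).
fn hours_to_enqueue(rows: u64, latency_ms_per_row: f64, concurrency: u32) -> f64 {
    let total_secs = rows as f64 * latency_ms_per_row / 1000.0;
    total_secs / concurrency as f64 / 3600.0
}
```

At 100 ms/row a single serial producer tops out at 10 rows/s, so a million-job spike would take about 27.8 hours; even 16 idealised connections at 150 ms/row need roughly 2.6 hours. That gap is the motivation for checking the producer path before tuning anything else.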
Summary
Two related additions falling out of an awa-bench-driver staging run.
1. docs/deploying-on-managed-postgres.md: Cloud SQL / AlloyDB specifics that aren't covered in the existing benchmarking / configuration / deployment docs. Per-vCPU sizing numbers, PG18-on-AlloyDB recommendation (with measurement), IAM cloudsqlsuperuser GRANT for first-connect, auth-proxy native-sidecar pattern, enqueue_shards recommendation (linking to existing config docs), and a "things that broke for us, worth pre-empting" section linking to #264 (prepare_schema concurrent-startup race on fresh DB, especially PG18).
2. Producer-side timing observability. awa.enqueue.batch_size and awa.enqueue.duration histograms on AwaMetrics, recorded by the Python binding's enqueue_many_copy / enqueue_many_copy_sync. Producers using the direct queue-storage COPY path couldn't tell from existing metrics whether a missed enqueue rate was "batches are tiny" or "batches are slow"; these histograms split that signal directly. Rust callers using QueueStorage::enqueue_params_copy directly can call AwaMetrics::from_global().record_enqueue_batch(queue, count, duration) themselves; a thin awa::enqueue_many_copy Rust facade that wraps this is a natural follow-up (out of scope here).
Headline numbers powering the docs page
The headline PG17→PG18 finding on AlloyDB, with the same instance, same v021 schema, same image, and same vCPU count: ~2k jobs/s sustained on PG17 (with AccessShareLock waiters queueing behind ring-rotation ACCESS EXCLUSIVE) became ~5.2k jobs/s on PG18, matching Cloud SQL PG18 at the same vCPU exactly. PG17 had a real ceiling AlloyDB-side that PG18 cleared.
Validation
- SQLX_OFFLINE=true cargo check -p awa-metrics
- SQLX_OFFLINE=true cargo check -p awa-worker
- cargo check on awa-python (excluded from workspace, ran in its own dir)
- cargo fmt --all -- --check
Test plan
- cargo test --workspace
- cargo test inside awa-python/ (or maturin develop + the Python test suite)
- docs/deploying-on-managed-postgres.md on GitHub and click every cross-link
Out of scope (deliberately)
- awa::enqueue_many_copy facade that records the metric automatically; left for a follow-up so this PR doesn't touch the awa-model → awa-metrics dependency graph.
- awa/tests/queue_storage_benchmark_test.rs. Could be added in the same follow-up.
- awa.enqueue.shard attribute on the new histograms; would let dashboards split p99 enqueue duration per shard. Tracked mentally; happy to add if reviewers want.