Managed-Postgres docs + producer-side enqueue histograms by hardbyte · Pull Request #265 · hardbyte/awa

hardbyte · 2026-05-19T01:12:10Z

Summary

Two related additions falling out of an awa-bench-driver staging run.

docs/deploying-on-managed-postgres.md: Cloud SQL / AlloyDB specifics that aren't covered in the existing benchmarking / configuration / deployment docs. Covers per-vCPU sizing numbers, a PG18-on-AlloyDB recommendation (with measurement), the IAM cloudsqlsuperuser GRANT needed for first-connect, the auth-proxy native-sidecar pattern, an enqueue_shards recommendation (linking to existing config docs), and a "things that broke for us, worth pre-empting" section linking to issue #264, "prepare_schema concurrent-startup race on fresh DB (especially PG18)".
Producer-side timing observability. awa.enqueue.batch_size and awa.enqueue.duration histograms on AwaMetrics, recorded by the Python binding's enqueue_many_copy / enqueue_many_copy_sync. Producers using the direct queue-storage COPY path couldn't tell from existing metrics whether a missed enqueue rate was "batches are tiny" or "batches are slow" — these histograms split that signal directly. Rust callers using QueueStorage::enqueue_params_copy directly can call AwaMetrics::from_global().record_enqueue_batch(queue, count, duration) themselves; a thin awa::enqueue_many_copy Rust facade that wraps this is a natural follow-up (out of scope here).
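To illustrate how the two histograms separate "batches are tiny" from "batches are slow", here is a minimal sketch in plain Python. It does not use the awa API: `BatchSample`, `diagnose`, and the `min_batch` / `max_batch_s` thresholds are hypothetical names and values chosen for illustration, and the achieved rate assumes a single producer issuing batches back to back.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class BatchSample:
    """One observation per COPY batch, mirroring the two histograms."""
    batch_size: int     # jobs in the batch (what awa.enqueue.batch_size records)
    duration_s: float   # seconds spent in the batch (what awa.enqueue.duration records)


def diagnose(samples, target_jobs_per_s, min_batch=500, max_batch_s=0.25):
    """Classify why a producer misses its target enqueue rate.

    min_batch and max_batch_s are illustrative thresholds, not awa defaults.
    """
    # Empty batches carry no signal (the PR's recorder also skips them).
    samples = [s for s in samples if s.batch_size > 0]
    total_time = sum(s.duration_s for s in samples)
    if not samples or total_time <= 0:
        return "no data"

    achieved = sum(s.batch_size for s in samples) / total_time
    if achieved >= target_jobs_per_s:
        return "on target"

    typical_size = median(s.batch_size for s in samples)
    typical_time = median(s.duration_s for s in samples)
    if typical_time > max_batch_s:
        return "batches are slow"
    if typical_size < min_batch:
        return "batches are tiny"
    return "inconclusive"


if __name__ == "__main__":
    tiny = [BatchSample(10, 0.005)] * 20
    slow = [BatchSample(5000, 2.0)] * 20
    print(diagnose(tiny, 20_000), "/", diagnose(slow, 20_000))
```

In a real deployment the same comparison would be made on dashboard percentiles of the two histograms rather than on raw samples.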

Headline numbers powering the docs page

| Engine | vCPU | PG | Sustained completed | 1M spike enqueue |
| --- | --- | --- | --- | --- |
| AlloyDB | 4 | 18 | ~5,200 jobs/s | ~52,000 inserts/s |
| Cloud SQL | 1 | 17 | ~500 jobs/s | ~40,000 inserts/s |
| Cloud SQL | 4 | 18 | ~5,000 jobs/s | ~70,000 inserts/s |
| Cloud SQL | 16 | 18 | ~14,000 jobs/s | ~77,000 inserts/s |

The headline PG17→PG18 finding on AlloyDB: with the same instance, v021 schema, image, and vCPU count, ~2k jobs/s sustained on PG17 (with AccessShareLock waiters queueing behind ring-rotation ACCESS EXCLUSIVE) became ~5.2k jobs/s on PG18, in line with the ~5.0k jobs/s measured on Cloud SQL PG18 at the same vCPU count. PG17 had a real ceiling AlloyDB-side that PG18 cleared.
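As a rough way to read the headline numbers, the sketch below turns them into "how long does a 1M-job spike take to ingest, and how long to drain". The rates are copied from the table; the `spike_timings` helper is hypothetical, and the drain estimate naively assumes completion runs at the sustained rate with no new arrivals.

```python
# Approximate rates from the headline table: (engine, vCPU, PG major) -> jobs or inserts per second.
RATES = {
    ("AlloyDB", 4, 18): {"sustained": 5_200, "spike_insert": 52_000},
    ("Cloud SQL", 1, 17): {"sustained": 500, "spike_insert": 40_000},
    ("Cloud SQL", 4, 18): {"sustained": 5_000, "spike_insert": 70_000},
    ("Cloud SQL", 16, 18): {"sustained": 14_000, "spike_insert": 77_000},
}


def spike_timings(jobs, rates):
    """Return (seconds to ingest the spike, seconds to drain it).

    Naive model: ingest runs at the spike insert rate, drain at the
    sustained completion rate, and nothing else arrives meanwhile.
    """
    ingest_s = jobs / rates["spike_insert"]
    drain_s = jobs / rates["sustained"]
    return ingest_s, drain_s


if __name__ == "__main__":
    for (engine, vcpu, pg), rates in RATES.items():
        ingest_s, drain_s = spike_timings(1_000_000, rates)
        print(f"{engine} {vcpu} vCPU PG{pg}: ingest ~{ingest_s:.0f}s, drain ~{drain_s:.0f}s")
```

The gap between the two figures is the point: on every row the spike lands an order of magnitude faster than it can be worked off, so sizing for drain time matters more than sizing for ingest.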

Validation

SQLX_OFFLINE=true cargo check -p awa-metrics
SQLX_OFFLINE=true cargo check -p awa-worker
cargo check on awa-python (excluded from workspace, ran in its own dir)
cargo fmt --all -- --check
Existing tests not run in this branch — Brian to run the full suite locally before merge.

Test plan

cargo test --workspace
cargo test inside awa-python/ (or maturin develop + the Python test suite)
Sanity-render docs/deploying-on-managed-postgres.md on GitHub and click every cross-link

Out of scope (deliberately)

A Rust awa::enqueue_many_copy facade that records the metric automatically — left for a follow-up so this PR doesn't touch the awa-model → awa-metrics dependency graph.
Wiring the new histograms into the in-repo benchmark harness in awa/tests/queue_storage_benchmark_test.rs. Could be added in the same follow-up.
Adding an awa.enqueue.shard attribute to the new histograms, which would let dashboards split p99 enqueue duration per shard. Tracked mentally; happy to add if reviewers want.

Summary by CodeRabbit

New Features
- Added observability metrics for bulk enqueue operations to track batch job counts and enqueue durations.
Documentation
- New comprehensive guide for deploying on managed Postgres (Cloud SQL/AlloyDB) covering sizing, IAM requirements, auth-proxy setup, and operational gotchas.
- Updated deployment documentation with navigation to managed Postgres guidance.

Two related additions from the awa-bench-driver staging run:

1. docs/deploying-on-managed-postgres.md: Cloud SQL / AlloyDB specifics that aren't in the existing benchmarking/configuration/deployment docs:
   * Per-vCPU sustained completion + burst enqueue from the staging report.
   * 'Prefer PG18 on AlloyDB', from direct measurement, not theory: the same 4 vCPU AlloyDB instance went from ~2k jobs/s sustained on PG17 to ~5k jobs/s on PG18 (matching Cloud SQL PG18), even after v021 was in place.
   * IAM gotchas: Cloud SQL IAM SAs need cloudsqlsuperuser GRANT before prepare_schema can run; AlloyDB grants it automatically.
   * Auth-proxy as a native sidecar (initContainer + restartPolicy: Always), with a k8s example.
   * enqueue_shards recommendation: set explicitly per queue, >=4. Cross-links to configuration.md and ADR-025.
   * Cross-links to issue #264 (prepare_schema race) as a known footgun with mitigation.
2. Producer-side timing observability: awa.enqueue.batch_size and awa.enqueue.duration histograms in AwaMetrics, recorded by the Python binding's enqueue_many_copy paths. Lets producers see whether a target enqueue rate misses because batches are tiny or batches are slow; neither was distinguishable from existing metrics. Rust users calling QueueStorage::enqueue_params_copy directly can call AwaMetrics::from_global().record_enqueue_batch(...) themselves; a thin awa::enqueue_many_copy facade that wraps this is a natural follow-up.

Validation:
- SQLX_OFFLINE=true cargo check -p awa-metrics
- SQLX_OFFLINE=true cargo check -p awa-worker
- cargo check on awa-python (excluded from workspace, ran in its own dir)
- cargo fmt --all --check

coderabbitai · 2026-05-19T01:12:23Z

📥 Commits

Reviewing files that changed from the base of the PR and between 0c33d1e and 6fe5836.

📒 Files selected for processing (4)

awa-cli/src/main.rs
awa-python/src/client.rs
awa-ui/src/handlers/dlq.rs
awa-ui/src/state.rs

Walkthrough

Adds OpenTelemetry histograms to measure batch enqueue size and duration in COPY operations, instruments async and sync enqueue paths to emit these metrics, and introduces comprehensive documentation for deploying awa workers on managed PostgreSQL with sizing, IAM/auth, tuning, and troubleshooting guidance.

Changes

Managed Postgres Deployment and Instrumentation

| Layer / File(s) | Summary |
| --- | --- |
| Enqueue batch metrics foundation `awa-metrics/src/lib.rs` | New metric name constants `ENQUEUE_BATCH_SIZE` and `ENQUEUE_DURATION` added to the `names` module. `AwaMetrics` struct extended with `enqueue_batch_size: Histogram<u64>` and `enqueue_duration_seconds: Histogram<f64>` fields. `AwaMetrics::new` registers both histograms with descriptions and bucket boundaries reusing job-wait duration boundaries. Public `record_enqueue_batch` method emits samples to both histograms with `awa.job.queue` attribute, no-op when batch size is zero. |
| Record batch metrics in COPY enqueue paths `awa-python/src/client.rs` | Async `enqueue_many_copy` and sync `enqueue_many_copy_sync` now measure elapsed time around `enqueue_params_copy` call and record enqueue batch metric with queue name, enqueued count, and duration. |
| Managed Postgres deployment operational guide `docs/deploying-on-managed-postgres.md` | New documentation page covering PostgreSQL version selection (Postgres 18 recommended), vCPU sizing tables and interpretation for sustained throughput and backlog drain, IAM role requirements and grant workflow for Cloud SQL with auth-proxy sidecar examples and AlloyDB auto-granted roles, explicit `enqueue_shards` tuning via `awa.queue_meta` upserts addressing head-row contention, producer-path selection recommending native COPY entry points over SQL-function fallback, and troubleshooting for staging failure modes (schema-prep startup hang, `pg_stat_activity` visibility, PG18 upgrade behavior, cleanup ownership errors). |
| Documentation navigation links `README.md`, `docs/deployment.md` | README "Configuring real workloads" list adds reference to managed Postgres deployment doc. Deployment index adds "Next" link to managed Postgres guide covering Cloud SQL and AlloyDB sizing, IAM, and auth-proxy setup. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

hardbyte/awa#206: Introduced QueueStorage::enqueue_params_copy as the COPY ingestion producer that this PR now instruments with batch enqueue metrics.
hardbyte/awa#235: Extended AwaMetrics OpenTelemetry instrumentation framework that this PR builds upon to add enqueue batch histograms.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c33d1ec47

chatgpt-codex-connector · 2026-05-19T01:17:09Z

                .enqueue_params_copy(&pool, &insert_params)
                .await
                .map_err(map_awa_error)?;
+            awa_worker::AwaMetrics::from_global().record_enqueue_batch(


Cache enqueue metrics outside the batch hot path

For high-throughput Python producers that call enqueue_many_copy once per chunk, this constructs a fresh AwaMetrics via from_global() after every successful batch, and AwaMetrics::new registers/builds the entire metric set (not just these two histograms). That adds dozens of instrument builder calls and allocations directly to the COPY enqueue hot path that the docs now recommend for million-job spikes; cache an AwaMetrics on PyClient or otherwise initialize the enqueue instruments once and reuse the handles.

Codex feedback on PR #265: AwaMetrics::from_global() runs the full Self::new(meter) which registers and builds every counter, histogram, and gauge in the metric set — calling it once per enqueue_many_copy batch puts ~30 instrument-builder allocations directly on the COPY producer hot path the new docs explicitly recommend for million-job spikes. Cache AwaMetrics on PyClient and clone it into the async block (the struct is already #[derive(Clone)]; the clone is cheap — internally just bumps Arc refcounts on the underlying OTel instruments). This only touches the two paths added in this PR; the pre-existing DLQ paths in awa-python still call AwaMetrics::from_global() per operation but those aren't producer-hot and are out of scope here. A follow-up could fold them in.
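The cost difference between building the metrics handle per batch and caching it on the client can be sketched without any OpenTelemetry dependency. Everything below is a stand-in: `Metrics`, `UncachedClient`, `CachedClient`, and `builds_for` are hypothetical names, and the 30-instrument constructor only mimics the "~30 instruments" figure quoted in the commit message.

```python
class Metrics:
    """Stand-in for a metrics handle whose constructor is expensive."""

    builds = 0  # class-level counter so the number of constructions is visible

    def __init__(self):
        Metrics.builds += 1
        # Pretend to register ~30 instruments on every construction.
        self.instruments = {f"instrument_{i}": [] for i in range(30)}

    @classmethod
    def from_global(cls):
        # Rebuilds the full instrument set on every call.
        return cls()

    def record_enqueue_batch(self, queue, count, duration_s):
        if count == 0:
            return  # no-op for empty batches
        self.instruments["instrument_0"].append((queue, count, duration_s))


class UncachedClient:
    """Anti-pattern: a fresh handle per batch, on the hot path."""

    def enqueue_many_copy(self, queue, jobs):
        Metrics.from_global().record_enqueue_batch(queue, len(jobs), 0.0)


class CachedClient:
    """Fix: build the handle once and reuse it for every batch."""

    def __init__(self):
        self.metrics = Metrics.from_global()

    def enqueue_many_copy(self, queue, jobs):
        self.metrics.record_enqueue_batch(queue, len(jobs), 0.0)


def builds_for(client_cls, batches=100):
    """Count how many Metrics objects are constructed for `batches` enqueues."""
    Metrics.builds = 0
    client = client_cls()
    for _ in range(batches):
        client.enqueue_many_copy("emails", [{"id": 1}, {"id": 2}])
    return Metrics.builds


if __name__ == "__main__":
    print("uncached builds:", builds_for(UncachedClient))
    print("cached builds:", builds_for(CachedClient))
```

The construction count scales with the number of batches in the uncached variant and stays at one in the cached variant, which is the property the review comment asks for.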

Codex flagged this on the new enqueue_many_copy paths but the same anti-pattern was present elsewhere: AwaMetrics::from_global() rebuilds the full instrument set (~30 counters/histograms/gauges) on every call.

Cached sites:
- awa-ui::AppState now carries an AwaMetrics; all 5 DLQ HTTP handlers use state.metrics instead of from_global() per request.
- awa-python::PyClient.metrics is now shared across the 12 DLQ async/sync method pairs (in addition to the enqueue_many_copy paths from the prior commit).
- awa-cli::Commands::Dlq constructs AwaMetrics once before the inner match arm and reuses it across the four DLQ subcommand branches. Lower impact since each CLI invocation hits at most one branch, but consistent with the other call sites.

Intentionally left alone:
- awa-worker::ClientBuilder::build (one-shot at startup).
- AwaMetrics::Default::default (used as a fallback handle; callers shouldn't call it repeatedly).
- All test code.

Promote the Unreleased section to a tagged 0.6.0 entry dated today, prepended with an operator-facing 'Highlights' section that covers the release-readiness umbrella's outstanding doc-pass items: queue-storage default, staged upgrade, rollback limits, enqueue_shards as a per-queue knob, the direct COPY producer path, the managed-Postgres deploy guide, benchmark caveats, and the known MVCC operational limit.

Folds in the three PRs that landed after the umbrella was last synchronised:
- #263 (direct COPY producer path, sub-second wait-duration buckets, v021 shard-aware lane indexes)
- #265 (managed-Postgres deploy doc, awa.enqueue.{batch_size,duration} histograms, AwaMetrics caching cleanup)
- #266 (prepare_schema startup race fix, queue_storage_schema_ready tightening; closes #264)

No code or version-string changes: the Cargo.toml version bump is the release-tag step and intentionally not in this PR.

Adds two entries pulled from the staging-benchmark + #265 / #266 arc that aren't surfaced anywhere else in troubleshooting:

- 'Producer Enqueue Is Slower Than Expected': a full diagnostic flow starting from the new awa.enqueue.batch_size / awa.enqueue.duration histograms (added in #265), with three diagnoses:
  1. compat insert path vs direct queue-storage COPY (~100-150 ms/row through a real DB proxy)
  2. enqueue_shards=1 contention on the head row, with the upsert to fix it (a plain UPDATE that silently affects zero rows is a footgun)
  3. WAL/commit pressure on undersized managed Postgres, pointing at the per-vCPU sizing table
- 'relation awa.job_id_seq does not exist' under Common Error Cases. Post-#266 the prepare_schema race is gone, but a producer that races the very first worker startup, or an environment rebuild that drops the schema, can still surface this. Names the cause (the queue-storage substrate is created by prepare_schema, not by awa migrate alone) and three fixes.

Cross-links the existing managed-Postgres doc, configuration.md 'Producer path choice', and ADR-025.
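The "plain UPDATE silently affects zero rows" footgun mentioned above is easy to reproduce in miniature. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres, and the `queue_meta` table layout (a `queue` key plus an `enqueue_shards` column) is an assumption for illustration; the real `awa.queue_meta` schema may differ. The `INSERT ... ON CONFLICT ... DO UPDATE` form shown is also valid Postgres syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in for awa.queue_meta; real column names may differ.
conn.execute(
    "CREATE TABLE queue_meta (queue TEXT PRIMARY KEY, enqueue_shards INTEGER NOT NULL)"
)

# Plain UPDATE: no row exists for 'emails' yet, so nothing changes and no error is raised.
cur = conn.execute("UPDATE queue_meta SET enqueue_shards = 4 WHERE queue = 'emails'")
updated_rows = cur.rowcount

# Upsert: creates the row when it is missing, otherwise overwrites the setting.
UPSERT = """
INSERT INTO queue_meta (queue, enqueue_shards) VALUES (?, ?)
ON CONFLICT (queue) DO UPDATE SET enqueue_shards = excluded.enqueue_shards
"""
conn.execute(UPSERT, ("emails", 4))  # first call inserts
conn.execute(UPSERT, ("emails", 8))  # second call updates in place

shards = conn.execute(
    "SELECT enqueue_shards FROM queue_meta WHERE queue = 'emails'"
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM queue_meta").fetchone()[0]

print("rows touched by plain UPDATE:", updated_rows)
print("enqueue_shards after upserts:", shards)
```

Checking the affected-row count after any plain UPDATE against queue metadata is a cheap guard if an upsert is not an option.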

chatgpt-codex-connector Bot reviewed May 19, 2026

hardbyte added 2 commits May 19, 2026 13:25

hardbyte merged commit 4469f3b into main May 19, 2026
13 checks passed

hardbyte deleted the brian/managed-postgres-deploy-docs branch May 19, 2026 01:46

This was referenced May 19, 2026

Cut 0.6.0-beta.1 #267

Merged

0.6 release readiness: align TLA+ models, docs, and queue-storage replacement gates #197

Open

hardbyte merged 3 commits into main from brian/managed-postgres-deploy-docs
