Managed-Postgres docs + producer-side enqueue histograms#265

Merged
hardbyte merged 3 commits into
mainfrom
brian/managed-postgres-deploy-docs
May 19, 2026
@hardbyte hardbyte commented May 19, 2026

Summary

Two related additions falling out of an awa-bench-driver staging run.

  • docs/deploying-on-managed-postgres.md: Cloud SQL / AlloyDB specifics that aren't covered in the existing benchmarking / configuration / deployment docs. It covers per-vCPU sizing numbers, a PG18-on-AlloyDB recommendation (with measurement), the IAM cloudsqlsuperuser GRANT for first-connect, the auth-proxy native-sidecar pattern, an enqueue_shards recommendation (linking to existing config docs), and a "things that broke for us, worth pre-empting" section linking to issue #264, "prepare_schema concurrent-startup race on fresh DB (especially PG18)".

  • Producer-side timing observability. awa.enqueue.batch_size and awa.enqueue.duration histograms on AwaMetrics, recorded by the Python binding's enqueue_many_copy / enqueue_many_copy_sync. Producers using the direct queue-storage COPY path couldn't tell from existing metrics whether a missed enqueue rate was "batches are tiny" or "batches are slow" — these histograms split that signal directly. Rust callers using QueueStorage::enqueue_params_copy directly can call AwaMetrics::from_global().record_enqueue_batch(queue, count, duration) themselves; a thin awa::enqueue_many_copy Rust facade that wraps this is a natural follow-up (out of scope here).
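
The "tiny batches" versus "slow batches" distinction can be illustrated with a minimal sketch in plain Rust, with no OpenTelemetry dependency. `BatchSample`, `Diagnosis`, `diagnose`, and both thresholds are hypothetical names and values chosen for illustration; they are not part of awa. The rate is computed only over time spent inside enqueue calls, so producer idle time is ignored.

```rust
use std::time::Duration;

/// One observation, mirroring what the two histograms record per batch.
/// (Hypothetical type for illustration; not part of awa.)
#[derive(Clone, Copy)]
struct BatchSample {
    jobs: u64,
    elapsed: Duration,
}

#[derive(Debug, PartialEq)]
enum Diagnosis {
    OnTarget,
    BatchesTooSmall,
    BatchesTooSlow,
}

/// Classify a missed enqueue rate using batch size and batch duration.
/// `target_jobs_per_sec` and `min_batch_jobs` are assumed, caller-chosen thresholds.
fn diagnose(samples: &[BatchSample], target_jobs_per_sec: f64, min_batch_jobs: f64) -> Diagnosis {
    let total_jobs: u64 = samples.iter().map(|s| s.jobs).sum();
    let total_secs: f64 = samples.iter().map(|s| s.elapsed.as_secs_f64()).sum();
    let rate = if total_secs > 0.0 { total_jobs as f64 / total_secs } else { 0.0 };
    if rate >= target_jobs_per_sec {
        return Diagnosis::OnTarget;
    }
    // The rate is missed: decide whether size or latency is the likelier cause.
    let mean_batch = total_jobs as f64 / samples.len().max(1) as f64;
    if mean_batch < min_batch_jobs {
        Diagnosis::BatchesTooSmall
    } else {
        Diagnosis::BatchesTooSlow
    }
}

fn main() {
    let tiny = vec![BatchSample { jobs: 10, elapsed: Duration::from_millis(5) }; 4];
    let slow = vec![BatchSample { jobs: 1_000, elapsed: Duration::from_millis(500) }; 4];
    println!("tiny batches: {:?}", diagnose(&tiny, 10_000.0, 500.0));
    println!("slow batches: {:?}", diagnose(&slow, 10_000.0, 500.0));
}
```

Both sample sets achieve the same in-call rate, which is why a single throughput metric cannot tell them apart; only the pair of size and duration observations does.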

Headline numbers powering the docs page

| Engine | vCPU | PG version | Sustained completed | 1M spike enqueue |
| --- | --- | --- | --- | --- |
| AlloyDB | 4 | 18 | ~5,200 jobs/s | ~52,000 inserts/s |
| Cloud SQL | 1 | 17 | ~500 jobs/s | ~40,000 inserts/s |
| Cloud SQL | 4 | 18 | ~5,000 jobs/s | ~70,000 inserts/s |
| Cloud SQL | 16 | 18 | ~14,000 jobs/s | ~77,000 inserts/s |

The headline PG17→PG18 finding on AlloyDB: on the same instance, with the same v021 schema, image, and vCPU count, ~2k jobs/s sustained on PG17 (with AccessShareLock waiters queueing behind the ring-rotation ACCESS EXCLUSIVE lock) became ~5.2k jobs/s on PG18, in line with Cloud SQL PG18 (~5.0k jobs/s) at the same vCPU count. PG17 had a real ceiling on the AlloyDB side that PG18 cleared.
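
As a back-of-envelope reading of these numbers, the sketch below converts the quoted rates into wall-clock time for a 1M-job spike on the 4 vCPU AlloyDB instance. It assumes the rates hold steady for the whole run and that no new jobs arrive while the backlog drains; `seconds_to_process` is a hypothetical helper, not an awa function.

```rust
/// Seconds needed to process `jobs` at a steady `rate_per_sec`.
fn seconds_to_process(jobs: u64, rate_per_sec: f64) -> f64 {
    jobs as f64 / rate_per_sec
}

fn main() {
    let backlog: u64 = 1_000_000;
    // Approximate figures quoted in the PR description (AlloyDB, 4 vCPU).
    let scenarios = [
        ("enqueue at ~52,000 inserts/s", 52_000.0),
        ("drain on PG18 at ~5,200 jobs/s", 5_200.0),
        ("drain on PG17 at ~2,000 jobs/s", 2_000.0),
    ];
    for (label, rate) in scenarios {
        println!("{}: {:.0} s", label, seconds_to_process(backlog, rate));
    }
}
```

Under those assumptions the spike is enqueued in roughly 19 s, then drains in roughly 192 s on PG18 versus 500 s on PG17.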

Validation

  • SQLX_OFFLINE=true cargo check -p awa-metrics
  • SQLX_OFFLINE=true cargo check -p awa-worker
  • cargo check on awa-python (excluded from workspace, ran in its own dir)
  • cargo fmt --all -- --check
  • Existing tests not run in this branch — Brian to run the full suite locally before merge.

Test plan

  • cargo test --workspace
  • cargo test inside awa-python/ (or maturin develop + the Python test suite)
  • Sanity-render docs/deploying-on-managed-postgres.md on GitHub and click every cross-link

Out of scope (deliberately)

  • A Rust awa::enqueue_many_copy facade that records the metric automatically — left for a follow-up so this PR doesn't touch the awa-model → awa-metrics dependency graph.
  • Wiring the new histograms into the in-repo benchmark harness in awa/tests/queue_storage_benchmark_test.rs. Could be added in the same follow-up.
  • Adding an awa.enqueue.shard attribute to the new histograms, which would let dashboards split p99 enqueue duration per shard. Not tracked in an issue yet; happy to add if reviewers want.

Summary by CodeRabbit

  • New Features

    • Added observability metrics for bulk enqueue operations to track batch job counts and enqueue durations.
  • Documentation

    • New comprehensive guide for deploying on managed Postgres (Cloud SQL/AlloyDB) covering sizing, IAM requirements, auth-proxy setup, and operational gotchas.
    • Updated deployment documentation with navigation to managed Postgres guidance.

Two related additions from the awa-bench-driver staging run:

1. docs/deploying-on-managed-postgres.md — Cloud SQL / AlloyDB specifics
   that aren't in the existing benchmarking/configuration/deployment docs:

   * Per-vCPU sustained completion + burst enqueue from the staging
     report.
   * 'Prefer PG18 on AlloyDB' — direct measurement, not theory: the same
     4 vCPU AlloyDB instance went from ~2k jobs/s sustained on PG17 to
     ~5k jobs/s on PG18 (matching Cloud SQL PG18 exactly), even after
     v021 was in place.
   * IAM gotchas: Cloud SQL IAM SAs need cloudsqlsuperuser GRANT before
     prepare_schema can run; AlloyDB grants it automatically.
   * Auth-proxy as a native sidecar (initContainer + restartPolicy:
     Always) — k8s example.
   * enqueue_shards recommendation: set explicitly per queue, >=4.
     Cross-links to configuration.md and ADR-025.
   * Cross-links to issue #264 (prepare_schema race) as a known footgun
     with mitigation.

2. Producer-side timing observability — awa.enqueue.batch_size and
   awa.enqueue.duration histograms in AwaMetrics, recorded by the Python
   binding's enqueue_many_copy paths. Lets producers see whether a
   target enqueue rate misses because batches are tiny or batches are
   slow — neither was distinguishable from existing metrics. Rust users
   calling QueueStorage::enqueue_params_copy directly can call
   AwaMetrics::from_global().record_enqueue_batch(...) themselves; a
   thin awa::enqueue_many_copy facade that wraps this is a natural
   follow-up.

Validation:
- SQLX_OFFLINE=true cargo check -p awa-metrics
- SQLX_OFFLINE=true cargo check -p awa-worker
- cargo check on awa-python (excluded from workspace, ran in its own dir)
- cargo fmt --all --check
coderabbitai Bot commented May 19, 2026

📥 Commits

Reviewing files that changed from the base of the PR and between 0c33d1e and 6fe5836.

📒 Files selected for processing (4)
  • awa-cli/src/main.rs
  • awa-python/src/client.rs
  • awa-ui/src/handlers/dlq.rs
  • awa-ui/src/state.rs
Walkthrough

Adds OpenTelemetry histograms to measure batch enqueue size and duration in COPY operations, instruments async and sync enqueue paths to emit these metrics, and introduces comprehensive documentation for deploying awa workers on managed PostgreSQL with sizing, IAM/auth, tuning, and troubleshooting guidance.

Changes

Managed Postgres Deployment and Instrumentation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Enqueue batch metrics foundation | awa-metrics/src/lib.rs | New metric name constants ENQUEUE_BATCH_SIZE and ENQUEUE_DURATION added to the names module. AwaMetrics struct extended with enqueue_batch_size: Histogram<u64> and enqueue_duration_seconds: Histogram<f64> fields. AwaMetrics::new registers both histograms with descriptions and bucket boundaries reusing job-wait duration boundaries. Public record_enqueue_batch method emits samples to both histograms with awa.job.queue attribute, no-op when batch size is zero. |
| Record batch metrics in COPY enqueue paths | awa-python/src/client.rs | Async enqueue_many_copy and sync enqueue_many_copy_sync now measure elapsed time around enqueue_params_copy call and record enqueue batch metric with queue name, enqueued count, and duration. |
| Managed Postgres deployment operational guide | docs/deploying-on-managed-postgres.md | New documentation page covering PostgreSQL version selection (Postgres 18 recommended), vCPU sizing tables and interpretation for sustained throughput and backlog drain, IAM role requirements and grant workflow for Cloud SQL with auth-proxy sidecar examples and AlloyDB auto-granted roles, explicit enqueue_shards tuning via awa.queue_meta upserts addressing head-row contention, producer-path selection recommending native COPY entry points over SQL-function fallback, and troubleshooting for staging failure modes (schema-prep startup hang, pg_stat_activity visibility, PG18 upgrade behavior, cleanup ownership errors). |
| Documentation navigation links | README.md, docs/deployment.md | README "Configuring real workloads" list adds reference to managed Postgres deployment doc. Deployment index adds "Next" link to managed Postgres guide covering Cloud SQL and AlloyDB sizing, IAM, and auth-proxy setup. |
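
The recording behaviour described for record_enqueue_batch (two samples per batch, tagged by queue, nothing recorded for an empty batch) can be sketched with a std-only stand-in. `ToyEnqueueMetrics` and its fields are hypothetical; the real AwaMetrics uses OpenTelemetry histograms and attaches the queue as the awa.job.queue attribute, and its exact signature may differ from this sketch.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Stand-in for the pair of histograms, keyed by queue name.
/// (Hypothetical type; plain vectors replace OpenTelemetry histograms.)
#[derive(Default)]
struct ToyEnqueueMetrics {
    batch_sizes: HashMap<String, Vec<u64>>,
    durations_secs: HashMap<String, Vec<f64>>,
}

impl ToyEnqueueMetrics {
    /// Record one batch. A zero-sized batch records nothing, matching the
    /// no-op behaviour described in the walkthrough.
    fn record_enqueue_batch(&mut self, queue: &str, count: u64, duration: Duration) {
        if count == 0 {
            return;
        }
        self.batch_sizes.entry(queue.to_string()).or_default().push(count);
        self.durations_secs
            .entry(queue.to_string())
            .or_default()
            .push(duration.as_secs_f64());
    }

    /// Number of batch samples recorded for `queue`.
    fn samples(&self, queue: &str) -> usize {
        self.batch_sizes.get(queue).map_or(0, |v| v.len())
    }
}

fn main() {
    let mut metrics = ToyEnqueueMetrics::default();
    let started = Instant::now();
    // The COPY enqueue call would run here; 500 is a made-up result count.
    let enqueued: u64 = 500;
    metrics.record_enqueue_batch("emails", enqueued, started.elapsed());
    metrics.record_enqueue_batch("emails", 0, Duration::ZERO); // ignored
    println!("samples for emails: {}", metrics.samples("emails"));
}
```

Skipping empty batches keeps zero-valued samples from dragging down the batch-size distribution when a producer polls with nothing to send.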

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hardbyte/awa#206: Introduced QueueStorage::enqueue_params_copy as the COPY ingestion producer that this PR now instruments with batch enqueue metrics.
  • hardbyte/awa#235: Extended AwaMetrics OpenTelemetry instrumentation framework that this PR builds upon to add enqueue batch histograms.

Poem

🐰 Batches measured, metrics gleam,
Postgres managed, a steady stream,
COPY'd swift with shard-tuned grace,
Cloud SQL thrives in awa's space!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the two main changes: managed Postgres documentation and producer-side enqueue histograms for metrics collection. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c33d1ec47

Comment thread awa-python/src/client.rs Outdated
.enqueue_params_copy(&pool, &insert_params)
.await
.map_err(map_awa_error)?;
awa_worker::AwaMetrics::from_global().record_enqueue_batch(
P2 Badge Cache enqueue metrics outside the batch hot path

For high-throughput Python producers that call enqueue_many_copy once per chunk, this constructs a fresh AwaMetrics via from_global() after every successful batch, and AwaMetrics::new registers/builds the entire metric set (not just these two histograms). That adds dozens of instrument builder calls and allocations directly to the COPY enqueue hot path that the docs now recommend for million-job spikes; cache an AwaMetrics on PyClient or otherwise initialize the enqueue instruments once and reuse the handles.

hardbyte added 2 commits May 19, 2026 13:25
Codex feedback on PR #265: AwaMetrics::from_global() runs the full
Self::new(meter) which registers and builds every counter, histogram,
and gauge in the metric set — calling it once per enqueue_many_copy
batch puts ~30 instrument-builder allocations directly on the COPY
producer hot path the new docs explicitly recommend for million-job
spikes.

Cache AwaMetrics on PyClient and clone it into the async block (the
struct is already #[derive(Clone)]; the clone is cheap — internally
just bumps Arc refcounts on the underlying OTel instruments).
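
The build-once, clone-per-call pattern described here can be sketched with std-only stand-ins. `ToyMetrics`, `ToyClient`, and `metrics_handle` are hypothetical names; the figure of 30 instruments echoes the approximate count in the commit message, and the real AwaMetrics internals may differ.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Stand-in for a metrics handle: cloning only bumps an Arc refcount.
#[derive(Clone)]
struct ToyMetrics {
    instruments: Arc<Vec<String>>,
}

impl ToyMetrics {
    /// Simulates the expensive path that registers every instrument.
    /// `builds` counts how often that path runs.
    fn build(builds: &AtomicUsize) -> Self {
        builds.fetch_add(1, Ordering::SeqCst);
        let instruments: Vec<String> = (0..30).map(|i| format!("instrument_{i}")).collect();
        ToyMetrics { instruments: Arc::new(instruments) }
    }
}

/// Stand-in for a client that caches the handle at construction time.
struct ToyClient {
    metrics: ToyMetrics,
}

impl ToyClient {
    fn new(builds: &AtomicUsize) -> Self {
        ToyClient { metrics: ToyMetrics::build(builds) }
    }

    /// Hot path: hand out a cheap clone instead of rebuilding the set.
    fn metrics_handle(&self) -> ToyMetrics {
        self.metrics.clone()
    }
}

fn main() {
    let builds = AtomicUsize::new(0);
    let client = ToyClient::new(&builds);
    for _ in 0..1_000 {
        let handle = client.metrics_handle();
        // Every clone shares the same underlying instrument set.
        assert!(Arc::ptr_eq(&handle.instruments, &client.metrics.instruments));
    }
    println!("instrument sets built: {}", builds.load(Ordering::SeqCst));
}
```

A thousand batches share one instrument set here, whereas rebuilding per batch would register the full set a thousand times.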

This only touches the two paths added in this PR; the pre-existing
DLQ paths in awa-python still call AwaMetrics::from_global() per
operation but those aren't producer-hot and are out of scope here.
A follow-up could fold them in.

Codex flagged this on the new enqueue_many_copy paths but the same
anti-pattern was present elsewhere — AwaMetrics::from_global() rebuilds
the full instrument set (~30 counters/histograms/gauges) on every call.

Cached sites:
- awa-ui::AppState now carries an AwaMetrics; all 5 DLQ HTTP handlers
  use state.metrics instead of from_global() per request.
- awa-python::PyClient.metrics is now shared across the 12 DLQ async/
  sync method pairs (in addition to the enqueue_many_copy paths from
  the prior commit).
- awa-cli::Commands::Dlq constructs AwaMetrics once before the inner
  match arm and reuses it across the four DLQ subcommand branches.
  Lower impact since each CLI invocation hits at most one branch, but
  consistent with the other call sites.

Intentionally left alone:
- awa-worker::ClientBuilder::build (one-shot at startup).
- AwaMetrics::Default::default (used as a fallback handle; callers
  shouldn't call it repeatedly).
- All test code.

@hardbyte hardbyte merged commit 4469f3b into main May 19, 2026
13 checks passed
@hardbyte hardbyte deleted the brian/managed-postgres-deploy-docs branch May 19, 2026 01:46
hardbyte added a commit that referenced this pull request May 19, 2026
Promote the Unreleased section to a tagged 0.6.0 entry dated today,
prepended with an operator-facing 'Highlights' section that covers the
release-readiness umbrella's outstanding doc-pass items: queue-storage
default, staged upgrade, rollback limits, enqueue_shards as a per-queue
knob, the direct COPY producer path, the managed-Postgres deploy guide,
benchmark caveats, and the known MVCC operational limit.

Folds in the three PRs that landed after the umbrella was last
synchronised:

- #263 (direct COPY producer path, sub-second wait-duration buckets,
  v021 shard-aware lane indexes)
- #265 (managed-Postgres deploy doc, awa.enqueue.{batch_size,duration}
  histograms, AwaMetrics caching cleanup)
- #266 (prepare_schema startup race fix, queue_storage_schema_ready
  tightening; closes #264)

No code or version-string changes — Cargo.toml version bump is the
release-tag step and intentionally not in this PR.

hardbyte added a commit that referenced this pull request May 19, 2026
Adds two entries pulled from the staging-benchmark + #265 / #266
arc that aren't surfaced anywhere else in troubleshooting:

- 'Producer Enqueue Is Slower Than Expected' — full diagnostic flow
  starting from the new awa.enqueue.batch_size / awa.enqueue.duration
  histograms (added in #265), with three diagnoses:
    1. compat insert path vs direct queue-storage COPY
       (~100-150 ms/row through a real DB proxy)
    2. enqueue_shards=1 contention on the head row, with the upsert
       to fix it (a plain UPDATE that silently affects zero rows is a
       footgun)
    3. WAL/commit pressure on undersized managed Postgres, pointing
       at the per-vCPU sizing table

- 'relation awa.job_id_seq does not exist' under Common Error Cases.
  Post-#266 the prepare_schema race is gone, but a producer that
  races the very first worker startup, or an environment rebuild
  that drops the schema, can still surface this. Names the cause
  (queue-storage substrate is created by prepare_schema, not by
  awa migrate alone) and three fixes.

Cross-links the existing managed-Postgres doc, configuration.md
'Producer path choice', and ADR-025.