feat(loadgen): add room-member add load test (members-sustained, members-capacity) by hmchangw · Pull Request #203 · hmchangw/chat

hmchangw · 2026-05-19T10:07:43Z

Adds the brainstorming spec for two new tools/loadgen subcommands —
members-sustained (open-loop throughput) and members-capacity
(sequential per-room growth) — covering frontdoor and canonical inject
modes, fixture/preset shape, E1/E2 correlation, and testing plan.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Summary by CodeRabbit

Release Notes

New Features
- Added a new members load-test workload with members-sustained and members-capacity subcommands to benchmark member-addition operations.
- Introduced configurable load profiles with preset sizes, adjustable rates, warmup periods, and target capacity limits.
- Added support for multiple injection and topology configuration options to test different scenarios.
- Implemented latency percentile reporting and bucketed capacity analysis for detailed performance insights.
Documentation
- Added comprehensive design specification for the members load-test workload.
- Updated load-test README with quick-start commands and detailed flag documentation for new member-focused testing workflows.
Tests
- Added extensive test coverage for member-add load generation, publishing, collection, and end-to-end integration scenarios.

Adds the brainstorming spec for two new tools/loadgen subcommands — members-sustained (open-loop throughput) and members-capacity (sequential per-room growth) — covering frontdoor and canonical inject modes, fixture/preset shape, E1/E2 correlation, and testing plan. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Scopes v1 to --shape=users only (per spec amendment); plan covers fixtures, publishers (frontdoor + canonical), collector with (roomID, sortedAccounts) E2 correlation, sustained + capacity generators, summary printers, subcommand wiring, Makefile targets, and integration test. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Implements the deterministic members-workload fixture builder: pure function of (preset, seed, siteID) producing Users, Rooms, Subscriptions, RoomKeys, and a CandidatePools map (accounts not yet seeded per room) for downstream add-member generators. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Implements MemberPublisher interface and canonicalMemberPublisher that publishes AddMembersRequest events directly to JetStream ROOMS stream, bypassing room-service for isolated room-worker benchmarking. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Adds OnBroadcast callback to MemberCollector so per-room goroutines in CapacityMembersGenerator can gate each publish on the previous broadcast arriving, enabling closed-loop sequential room growth to a target size. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Adds --workload=messages|members (default: messages) to the seed and teardown subcommands, dispatching to per-workload helpers. The messages path preserves existing behavior except slog messages now include the workload label ("seed complete (messages)" / "teardown complete (messages)"). The members path uses BuiltinMembersPreset and BuildMembersFixtures and logs a candidatePoolTotal stat. Also extracts roomIDsOf helper shared by both teardown variants. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Wires up the members-sustained CLI subcommand, plumbing together the NATS/JetStream client, metrics HTTP server, E2 broadcast subscription, MemberPublisher (frontdoor and canonical), ConsumerSampler against the ROOMS stream, SustainedMembersGenerator, and end-of-run finalize with summary printing and optional CSV export. Also adds stub for members-capacity (to be implemented in Task 15). https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Replaces the stub runMembersCapacity with a real implementation that drives each room to --target-size using CapacityMembersGenerator, then reports per-bucket E1/E2 latency via CapacitySummary and computeSizeBuckets. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Verifies the full members-sustained pipeline with simulated room-service and room-worker stages via NATS subscriptions, asserting clean exit and non-zero traffic through both simulated stages. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Replaces fragile error-string comparison with the typed sentinel check per CLAUDE.md's "Never compare errors by string" rule. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

- Deduplicate snapshotLatencies: delete Collector.snapshotLatenciesLocked and route both Collector + MemberCollector through the package-level snapshotLatencies helper. - Extract ParseInjectMode helper; collapse three duplicate inject parse blocks (runRun, runMembersSustained, runMembersCapacity) into a single call site each. - Remove dead SiteID field from SustainedMembersConfig + CapacityMembersConfig (never read inside the generators; subjects are baked into the publisher). - NewMemberCollector now takes InjectMode (typed) instead of raw string. - Single time.Now() per publish in both generators (was called twice, causing skew in E1 latency at high rates). - Cache string(Inject) / string(Shape) labels on generator construction rather than reconverting every publish. - Trim spec-doc path from ErrPoolsExhausted and shape-validation error messages; the path will rot before the messages do. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

coderabbitai · 2026-05-19T10:08:04Z

Warning

Rate limit exceeded

@hmchangw has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 55 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 16798987-c519-436b-ad27-e5b0c36313e8

📥 Commits

Reviewing files that changed from the base of the PR and between 074992d and eee926b.

📒 Files selected for processing (3)

tools/loadgen/generator_test.go
tools/loadgen/members_collector.go
tools/loadgen/members_generator.go

📝 Walkthrough

Walkthrough

This PR adds a complete load-test workload for benchmarking room member-addition operations. It introduces two execution modes (sustained rate-limited publishing and capacity-driven per-room growth), support for two request paths (frontdoor NATS request/reply and canonical JetStream), E1/E2 event correlation, and size-bucketed reporting.

Changes

Members workload benchmark

Layer / File(s)	Summary
Design specification `docs/superpowers/specs/2026-05-19-load-test-room-members-design.md`	Complete architecture document defining the workload modes, fixture model, injection paths, E1/E2 correlation strategy, metrics, error handling, test strategy, and file structure.
Subject filters and metrics foundation `pkg/subject/subject.go`, `pkg/subject/subject_test.go`, `tools/loadgen/metrics.go`	New NATS wildcard filter for member events; Prometheus metrics struct extension with member-add counters, latency histograms, and room size gauge.
Members presets and fixture generation `tools/loadgen/members.go`, `tools/loadgen/members_test.go`	Shape enum and validator; preset registry (small/medium/capacity) with `BuildMembersFixtures` for deterministic in-memory fixture generation including baseline subscriptions and per-room candidate pools.
Publisher and collector implementations `tools/loadgen/members_publisher.go`, `tools/loadgen/members_publisher_test.go`, `tools/loadgen/members_collector.go`, `tools/loadgen/members_collector_test.go`	Canonical JetStream and frontdoor NATS publishers; thread-safe collector matching E1 replies and E2 broadcasts by corrID and room/accounts, with latency sampling and error counting.
Generator engines `tools/loadgen/members_generator.go`, `tools/loadgen/members_generator_test.go`	SustainedMembersGenerator (open-loop rate-limited round-robin); CapacityMembersGenerator (sequential per-room targeting with ack-based progression).
Main CLI integration `tools/loadgen/generator.go`, `tools/loadgen/main.go`	Dispatch routing for members subcommands; seed/teardown workload selection; runSeedMembers fixture staging; runMembersSustained and runMembersCapacity with NATS/JetStream wiring and metrics collection.
Reporting, validation, and deployment `tools/loadgen/collector.go`, `tools/loadgen/main_test.go`, `tools/loadgen/members_integration_test.go`, `tools/loadgen/members_report.go`, `tools/loadgen/members_report_test.go`, `tools/loadgen/README.md`, `tools/loadgen/deploy/Makefile`	MembersSummary and CapacitySummary reporting with latency percentiles; CSV export; main_test.go CLI validation (metrics, error codes, presets); integration test with simulated NATS/JetStream pipeline; user documentation and Makefile deployment targets.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hmchangw/chat#84: Introduces member-add and member-event NATS subjects that this PR's loadgen workload uses for E1 reply correlation and E2 broadcast subscription.
hmchangw/chat#117: Initial tools/loadgen implementation of generator, collector, and metrics infrastructure that this PR extends with member-add benchmarking.

Suggested reviewers

mliu33
GITMateuszCharczuk

Poem

🐰 Hop, hop, the members hop in!
Rooms grow wide with ack-based grin,
Sustained paces, buckets measure fate,
Latencies dance from E1 to E2, first to late. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The PR title accurately and specifically describes the main feature being added: load test implementations for room-member add operations with two workload modes (sustained and capacity).

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch claude/load-test-room-members-7o4o4

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md`:
- Around line 283-285: The docs use two different test filenames —
`collector_member_test.go` and `members_collector_test.go` — causing
inconsistency; pick the canonical name (use `members_collector_test.go` to match
the layout) and replace all occurrences of `collector_member_test.go` in this
spec (including the testing section and the other spot around the later mention)
with `members_collector_test.go` so the spec and implementation tasks reference
the same filename.
- Around line 30-34: The doc claims v1 implements only shape=users but later
sections describe multi-shape behavior; update the spec to make the v1 contract
explicit by either (A) adding a clear parse-time rejection for non-users shapes
(referencing the Shape enum and the --shape flag and validation logic) and
calling out that the request builder (which currently only emits user-account
lists) will error on orgs/channels/mixed, or (B) marking the multi-shape
behavior in sections referencing runtime/tests (lines describing multi-shape) as
"post-v1" only; pick one approach and apply it consistently to the Shape enum,
--shape flag/validation, and the request builder description so there are no
unsupported runtime paths implied.
- Around line 12-19: Several fenced code blocks (for example the block that
begins with "client →
chat.user.{account}.request.room.{roomID}.{siteID}.member.add →
room-service.handleAddMembers ..." and the other blocks at ranges 50-55,
238-251, 299-315) are missing language identifiers; edit those fenced code
blocks in docs/superpowers/specs/2026-05-19-load-test-room-members-design.md and
add an appropriate language tag (e.g., `text`, `bash`, or `go`) after the
opening backticks so each block is annotated for markdownlint/MD040 and tooling.

In `@tools/loadgen/main.go`:
- Around line 384-387: The shutdown code currently discards errors from
metricsSrv.Shutdown(shutCtx) and nc.Drain(); instead, capture and handle those
errors explicitly: call metricsSrv.Shutdown(shutCtx) and check its error (log
via the existing logger or return it), and do the same for nc.Drain() (log error
or propagate), and avoid discarding cancelShut() results if it can return an
error; update both occurrences (the Shutdown/Drain calls around
metricsSrv.Shutdown, cancelShut, nc.Drain and the repeated calls at the other
location flagged) to surface failures (use process logger, ctx-aware messages,
and/or return non-nil error) so cleanup failures are not silently ignored.
- Around line 631-637: The code currently computes a single global E1/E2 via
ComputePercentiles(c.E1Samples()) and assigns those same percentiles to every
non-empty bucket in the out slice; instead, compute percentiles per-bucket: for
each bucket index i where out[i].Count > 0, gather or request that bucket's
samples (e.g., use a bucket-aware samples accessor or filter
c.E1Samples()/c.E2Samples() by bucket index) and call ComputePercentiles for
that bucket, then assign those bucket-specific e1/e2 to out[i].E1 and out[i].E2;
do not reuse the global e1/e2 variables. Ensure you only compute percentiles
when out[i].Count > 0 to avoid unnecessary work.
- Around line 376-377: Replace the blind time.Sleep calls (the ones before
collector.DiscardBefore(warmupDeadline) and the similar sleep at the second
location) with an event-driven drain: stop using time.Sleep and instead wait for
a deterministic quiescence signal from the collector or an in-flight counter.
Add or use a method like collector.WaitUntilEmpty(ctx, timeout) or expose an
in-flight counter/channel on collector (e.g., collector.InFlight() + a channel
closed when zero) and block on that (or a context timeout) before calling
collector.DiscardBefore(warmupDeadline); ensure both occurrences (the sleep
before DiscardBefore and the other sleep at the second location) are replaced
with this wait and that a timeout/ctx is used to avoid indefinite blocking.

In `@tools/loadgen/members_integration_test.go`:
- Around line 57-67: The callback registered in rsSub (nc.Subscribe callback) is
silently discarding errors from json.Unmarshal, m.Respond, and js.Publish which
can hide pipeline failures; update the callback to check and handle each error
returned by m.Respond and js.Publish (and the outer nc.Subscribe err) — log or
t.Errorf/fatal the error or increment a failure counter so the test surface
fails when any Respond/Publish fails, and add a short comment only if you
intentionally choose to ignore an error; apply the same treatment to the other
simulation callbacks referenced around lines 80-97 to avoid silently swallowing
errors.
- Around line 121-126: Replace the fixed time.Sleep with a deterministic
wait-based assertion: remove the 500ms Sleep and instead use a polling/assertion
that retries until both counters are > 0 (for example use assert.Eventually or a
small loop with time.After and polling) checking rsCalls.Load() > 0 and
rwCalls.Load() > 0; reference the existing symbols rsCalls.Load() and
rwCalls.Load() and fail the test if the condition does not become true within a
reasonable timeout (e.g. a few seconds) to avoid flakiness.

In `@tools/loadgen/members_publisher_test.go`:
- Around line 20-38: The tests currently start an actual NATS server via
startEmbeddedJetStream which makes them integration tests; either move these
tests into an explicit integration test package/suite (so startEmbeddedJetStream
remains but only used by integration tests) or refactor the publisher
construction to accept an injectable publish client/function (e.g., add a
PublishFunc or JetStreamer interface parameter to the publisher constructor used
by MembersPublisher or NewMembersPublisher) so unit tests can supply a fake/mock
instead of calling startEmbeddedJetStream; update tests in the 40-141 range to
use the fake PublishFunc/JetStreamer for unit tests and keep a separate
integration test that uses startEmbeddedJetStream if end-to-end verification is
needed.

In `@tools/loadgen/members_publisher.go`:
- Around line 79-80: In Close(), don't ignore the result of p.sub.Unsubscribe():
capture the returned error from p.sub.Unsubscribe(), and either return it (or
log it via the publisher's logger) before setting p.sub = nil so teardown
failures are observable; if you intentionally want to discard the error, add a
clear comment stating why. Update the Close method to handle the unsubscribe
error from p.sub.Unsubscribe(), referencing the Close method, the p.sub field
and the Unsubscribe() call.
- Around line 65-68: Guard the p.onReply call inside the nc.Subscribe callback
to avoid panics when no handler is set: in the subscription created around
inboxPrefix+".*" (the callback that computes corr :=
m.Subject[len(inboxPrefix)+1:]) check if p.onReply != nil before calling
p.onReply(corr, m.Data, time.Now()), or alternatively enforce a non-nil handler
in newFrontdoorMemberPublisher by returning an error or setting a no-op handler;
update either the callback or the constructor to ensure replies are handled
safely.

In `@tools/loadgen/README.md`:
- Around line 93-110: The README.md fenced code blocks with the make commands
are missing language identifiers (triggering MD040); update each triple-backtick
fence that wraps the make -C tools/loadgen/deploy commands to use ```bash so all
three blocks (the sustained run block, the capacity-mode block, and the
reset-members block) start with ```bash and retain their existing content,
ensuring the markdown linter no longer flags MD040.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 129bb287-d6c4-4365-8977-af58da83c1e4

📥 Commits

Reviewing files that changed from the base of the PR and between 7b5747b and 074992d.

📒 Files selected for processing (22)

docs/superpowers/plans/2026-05-19-load-test-room-members.md
docs/superpowers/specs/2026-05-19-load-test-room-members-design.md
pkg/subject/subject.go
pkg/subject/subject_test.go
tools/loadgen/README.md
tools/loadgen/collector.go
tools/loadgen/deploy/Makefile
tools/loadgen/generator.go
tools/loadgen/main.go
tools/loadgen/main_test.go
tools/loadgen/members.go
tools/loadgen/members_collector.go
tools/loadgen/members_collector_test.go
tools/loadgen/members_generator.go
tools/loadgen/members_generator_test.go
tools/loadgen/members_integration_test.go
tools/loadgen/members_publisher.go
tools/loadgen/members_publisher_test.go
tools/loadgen/members_report.go
tools/loadgen/members_report_test.go
tools/loadgen/members_test.go
tools/loadgen/metrics.go

coderabbitai · 2026-05-19T10:15:54Z

+```
+client
+  → chat.user.{account}.request.room.{roomID}.{siteID}.member.add
+  → room-service.handleAddMembers  (auth, capacity, dedup, channel expansion)
+  → chat.room.canonical.{siteID}.member.add  (ROOMS stream)
+  → room-worker  (resolve, write subscriptions, emit broadcast + outbox)
+  → chat.room.{roomID}.event  (member_added RoomEvent)
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks (MD040).

The fences in these sections are missing language hints. Please annotate them (for example text, bash, or go) to satisfy markdownlint and keep docs tooling consistent.

Also applies to: 50-55, 238-251, 299-315

🧰 Tools

🪛 markdownlint-cli2 (0.22.1)

[warning] 12-12: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md` around lines 12 - 19, Several fenced code blocks (for example the block that begins with "client → chat.user.{account}.request.room.{roomID}.{siteID}.member.add → room-service.handleAddMembers ..." and the other blocks at ranges 50-55, 238-251, 299-315) are missing language identifiers; edit those fenced code blocks in docs/superpowers/specs/2026-05-19-load-test-room-members-design.md and add an appropriate language tag (e.g., `text`, `bash`, or `go`) after the opening backticks so each block is annotated for markdownlint/MD040 and tooling.

coderabbitai · 2026-05-19T10:15:54Z

+The first cut implements `--shape=users` only. The `Shape` enum, flag,
+and validation are in place from the start so adding `orgs` / `channels` /
+`mixed` is a small follow-up plan, but the v1 plan does not seed an org
+pool or a source-channel pool, and the request builder only emits
+user-account lists.


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clarify the v1 shape contract to avoid unsupported runtime paths.

Line [30] says v1 is shape=users only, but Lines [140-142] and [279-282] describe active multi-shape runtime behavior/tests. Please make the contract explicit: either reject non-users shapes at parse time in v1, or mark those sections as post-v1 only.

Also applies to: 140-142, 279-282

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md` around lines 30 - 34, The doc claims v1 implements only shape=users but later sections describe multi-shape behavior; update the spec to make the v1 contract explicit by either (A) adding a clear parse-time rejection for non-users shapes (referencing the Shape enum and the --shape flag and validation logic) and calling out that the request builder (which currently only emits user-account lists) will error on orgs/channels/mixed, or (B) marking the multi-shape behavior in sections referencing runtime/tests (lines describing multi-shape) as "post-v1" only; pick one approach and apply it consistently to the Shape enum, --shape flag/validation, and the request builder description so there are no unsupported runtime paths implied.

coderabbitai · 2026-05-19T10:15:54Z

+- `collector_member_test.go` — E1/E2 correlation join, size-bucketing
+  edges, percentile output, `member_added` filter rejects other
+  RoomEvent types.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix inconsistent test filename in the testing section.

Line [283] uses collector_member_test.go, while the layout uses members_collector_test.go. Keep one canonical name to prevent churn and mislinked implementation tasks.

Also applies to: 303-305

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md` around lines 283 - 285, The docs use two different test filenames — `collector_member_test.go` and `members_collector_test.go` — causing inconsistency; pick the canonical name (use `members_collector_test.go` to match the layout) and replace all occurrences of `collector_member_test.go` in this spec (including the testing section and the other spot around the later mention) with `members_collector_test.go` so the spec and implementation tasks reference the same filename.

coderabbitai · 2026-05-19T10:15:54Z

+	time.Sleep(2 * time.Second) // drain trailing replies/broadcasts
+	collector.DiscardBefore(warmupDeadline)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace fixed sleeps with event-driven drain completion.

Line 376 and Line 567 use time.Sleep(2 * time.Second) to wait for trailing events, which is nondeterministic under varying load and can skew measured results. Wait on explicit completion criteria (e.g., in-flight counters, collector quiescence, or context-bounded channel signaling) instead.

As per coding guidelines: “Never use time.Sleep() for goroutine synchronization — use proper sync primitives (channels, sync.WaitGroup, sync.Mutex, context cancellation).”

Also applies to: 567-568

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/main.go` around lines 376 - 377, Replace the blind time.Sleep calls (the ones before collector.DiscardBefore(warmupDeadline) and the similar sleep at the second location) with an event-driven drain: stop using time.Sleep and instead wait for a deterministic quiescence signal from the collector or an in-flight counter. Add or use a method like collector.WaitUntilEmpty(ctx, timeout) or expose an in-flight counter/channel on collector (e.g., collector.InFlight() + a channel closed when zero) and block on that (or a context timeout) before calling collector.DiscardBefore(warmupDeadline); ensure both occurrences (the sleep before DiscardBefore and the other sleep at the second location) are replaced with this wait and that a timeout/ctx is used to avoid indefinite blocking.

coderabbitai · 2026-05-19T10:15:54Z

+	_ = metricsSrv.Shutdown(shutCtx)
+	cancelShut()
+	_ = nc.Drain()
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle shutdown/drain errors explicitly instead of discarding them.

Line 384/386 and Line 571/573 ignore errors from server shutdown and NATS drain. This hides cleanup failures and can mask dropped metrics or pending publishes during teardown.

Suggested fix

- _ = metricsSrv.Shutdown(shutCtx) + if err := metricsSrv.Shutdown(shutCtx); err != nil { + slog.Warn("metrics shutdown", "error", err) + } cancelShut() - _ = nc.Drain() + if err := nc.Drain(); err != nil { + slog.Warn("nats drain", "error", err) + }

As per coding guidelines: “Never ignore errors silently — comment if intentionally discarded.”

Also applies to: 571-573

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/main.go` around lines 384 - 387, The shutdown code currently discards errors from metricsSrv.Shutdown(shutCtx) and nc.Drain(); instead, capture and handle those errors explicitly: call metricsSrv.Shutdown(shutCtx) and check its error (log via the existing logger or return it), and do the same for nc.Drain() (log error or propagate), and avoid discarding cancelShut() results if it can return an error; update both occurrences (the Shutdown/Drain calls around metricsSrv.Shutdown, cancelShut, nc.Drain and the repeated calls at the other location flagged) to surface failures (use process logger, ctx-aware messages, and/or return non-nil error) so cleanup failures are not silently ignored.

coderabbitai · 2026-05-19T10:15:55Z

+	// Allow trailing events to settle before checking counters.
+	time.Sleep(500 * time.Millisecond)
+
+	// Both simulated stages must have seen traffic.
+	assert.Greater(t, rsCalls.Load(), int64(0), "room-service was never called")
+	assert.Greater(t, rwCalls.Load(), int64(0), "room-worker never consumed")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace fixed sleep with deterministic synchronization.

The fixed delay can cause flakes under slower CI/load and is explicitly discouraged for synchronization in this repo. Use an eventual/wait-based assertion instead.

Suggested fix

- // Allow trailing events to settle before checking counters. - time.Sleep(500 * time.Millisecond) - - // Both simulated stages must have seen traffic. - assert.Greater(t, rsCalls.Load(), int64(0), "room-service was never called") - assert.Greater(t, rwCalls.Load(), int64(0), "room-worker never consumed") + // Both simulated stages must see traffic. + require.Eventually(t, func() bool { + return rsCalls.Load() > 0 && rwCalls.Load() > 0 + }, 5*time.Second, 50*time.Millisecond, "simulated pipeline saw no traffic")

As per coding guidelines: “Never use time.Sleep() for goroutine synchronization — use proper sync primitives.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Allow trailing events to settle before checking counters.

time.Sleep(500 * time.Millisecond)

// Both simulated stages must have seen traffic.

assert.Greater(t, rsCalls.Load(), int64(0), "room-service was never called")

assert.Greater(t, rwCalls.Load(), int64(0), "room-worker never consumed")

// Both simulated stages must see traffic.

require.Eventually(t, func() bool {

return rsCalls.Load() > 0 && rwCalls.Load() > 0

}, 5*time.Second, 50*time.Millisecond, "simulated pipeline saw no traffic")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/members_integration_test.go` around lines 121 - 126, Replace the fixed time.Sleep with a deterministic wait-based assertion: remove the 500ms Sleep and instead use a polling/assertion that retries until both counters are > 0 (for example use assert.Eventually or a small loop with time.After and polling) checking rsCalls.Load() > 0 and rwCalls.Load() > 0; reference the existing symbols rsCalls.Load() and rwCalls.Load() and fail the test if the condition does not become true within a reasonable timeout (e.g. a few seconds) to avoid flakiness.

coderabbitai · 2026-05-19T10:15:55Z

+func startEmbeddedJetStream(t *testing.T) (*nats.Conn, jetstream.JetStream) {
+	t.Helper()
+	opts := &natsserver.Options{
+		Port:      -1,
+		JetStream: true,
+		StoreDir:  t.TempDir(),
+	}
+	ns, err := natsserver.NewServer(opts)
+	require.NoError(t, err)
+	ns.Start()
+	require.True(t, ns.ReadyForConnections(5*time.Second), "nats server did not become ready")
+	t.Cleanup(ns.Shutdown)
+	nc, err := nats.Connect(ns.ClientURL())
+	require.NoError(t, err)
+	t.Cleanup(nc.Close)
+	js, err := jetstream.New(nc)
+	require.NoError(t, err)
+	return nc, js
+}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

These tests rely on a real NATS server instead of unit-level injection/mocks.

Current coverage is integration-style and violates the unit-test constraint for this repo. Please move these to explicit integration test scope or refactor publisher construction so unit tests can inject publish/subscribe fakes without booting NATS.

As per coding guidelines, “Never connect to real databases, NATS, or external services in unit tests. When a handler publishes to JetStream, inject the publish function as a field so tests can capture data without a real NATS connection.”

Also applies to: 40-141

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/members_publisher_test.go` around lines 20 - 38, The tests currently start an actual NATS server via startEmbeddedJetStream which makes them integration tests; either move these tests into an explicit integration test package/suite (so startEmbeddedJetStream remains but only used by integration tests) or refactor the publisher construction to accept an injectable publish client/function (e.g., add a PublishFunc or JetStreamer interface parameter to the publisher constructor used by MembersPublisher or NewMembersPublisher) so unit tests can supply a fake/mock instead of calling startEmbeddedJetStream; update tests in the 40-141 range to use the fake PublishFunc/JetStreamer for unit tests and keep a separate integration test that uses startEmbeddedJetStream if end-to-end verification is needed.

coderabbitai · 2026-05-19T10:15:55Z

+	sub, err := nc.Subscribe(inboxPrefix+".*", func(m *nats.Msg) {
+		corr := m.Subject[len(inboxPrefix)+1:]
+		p.onReply(corr, m.Data, time.Now())
+	})


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard onReply before invoking it in the subscription callback.

If newFrontdoorMemberPublisher is called with a nil handler, this callback path panics on the first reply. Add a nil guard (or enforce non-nil in constructor) to keep reply handling fail-safe.

Suggested fix

sub, err := nc.Subscribe(inboxPrefix+".*", func(m *nats.Msg) { corr := m.Subject[len(inboxPrefix)+1:] - p.onReply(corr, m.Data, time.Now()) + if p.onReply != nil { + p.onReply(corr, m.Data, time.Now()) + } })

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

sub, err := nc.Subscribe(inboxPrefix+".*", func(m *nats.Msg) {

corr := m.Subject[len(inboxPrefix)+1:]

p.onReply(corr, m.Data, time.Now())

})

sub, err := nc.Subscribe(inboxPrefix+".*", func(m *nats.Msg) {

corr := m.Subject[len(inboxPrefix)+1:]

if p.onReply != nil {

p.onReply(corr, m.Data, time.Now())

}

})

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/members_publisher.go` around lines 65 - 68, Guard the p.onReply call inside the nc.Subscribe callback to avoid panics when no handler is set: in the subscription created around inboxPrefix+".*" (the callback that computes corr := m.Subject[len(inboxPrefix)+1:]) check if p.onReply != nil before calling p.onReply(corr, m.Data, time.Now()), or alternatively enforce a non-nil handler in newFrontdoorMemberPublisher by returning an error or setting a no-op handler; update either the callback or the constructor to ensure replies are handled safely.

coderabbitai · 2026-05-19T10:15:55Z

+		_ = p.sub.Unsubscribe()
+		p.sub = nil


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t silently discard Unsubscribe() errors in Close().

Handle/log the error (or at least document intentional discard) so teardown failures are observable.

As per coding guidelines, “Never ignore errors silently — comment if intentionally discarded.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/loadgen/members_publisher.go` around lines 79 - 80, In Close(), don't ignore the result of p.sub.Unsubscribe(): capture the returned error from p.sub.Unsubscribe(), and either return it (or log it via the publisher's logger) before setting p.sub = nil so teardown failures are observable; if you intentionally want to discard the error, add a clear comment stating why. Update the Close method to handle the unsubscribe error from p.sub.Unsubscribe(), referencing the Close method, the p.sub field and the Unsubscribe() call.

coderabbitai · 2026-05-19T10:15:55Z

+```
+make -C tools/loadgen/deploy up
+make -C tools/loadgen/deploy seed-members PRESET=members-medium
+make -C tools/loadgen/deploy run-sustained PRESET=members-medium RATE=100 DURATION=60s
+```
+
+For capacity-mode growth curves:
+
+```
+make -C tools/loadgen/deploy seed-members PRESET=members-capacity
+make -C tools/loadgen/deploy run-capacity  PRESET=members-capacity TARGET_SIZE=500
+```
+
+Between sustained runs, reset state so candidate pools refill:
+
+```
+make -C tools/loadgen/deploy reset-members PRESET=members-medium
+```


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to the new fenced code blocks.

The newly added fences are missing a language tag, which triggers MD040.

Suggested fix

-``` +```bash make -C tools/loadgen/deploy up make -C tools/loadgen/deploy seed-members PRESET=members-medium make -C tools/loadgen/deploy run-sustained PRESET=members-medium RATE=100 DURATION=60s

@@
- +bash
make -C tools/loadgen/deploy seed-members PRESET=members-capacity
make -C tools/loadgen/deploy run-capacity PRESET=members-capacity TARGET_SIZE=500

@@ -``` +```bash make -C tools/loadgen/deploy reset-members PRESET=members-medium

</details> <details> <summary>🧰 Tools</summary> <details> <summary>🪛 markdownlint-cli2 (0.22.1)</summary> [warning] 93-93: Fenced code blocks should have a language specified (MD040, fenced-code-language) --- [warning] 101-101: Fenced code blocks should have a language specified (MD040, fenced-code-language) --- [warning] 108-108: Fenced code blocks should have a language specified (MD040, fenced-code-language) </details> </details> <details> <summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @tools/loadgen/README.md around lines 93 - 110, The README.md fenced code
blocks with the make commands are missing language identifiers (triggering
MD040); update each triple-backtick fence that wraps the make -C
tools/loadgen/deploy commands to use ```bash so all three blocks (the sustained
run block, the capacity-mode block, and the reset-members block) start with

longer flags MD040.

Round-two simplify pass: - Add presetLabel field to both members generators, set in constructor, used in the metrics WithLabelValues call instead of g.cfg.Preset.Name. Completes the cached-label pattern introduced in the previous pass. - Doc-comment snapshotLatencies to flag that callers must already hold the collector mutex, since the function now serves both Collector and MemberCollector across files. - Add TestParseInjectMode mirroring TestParseShape. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

…scrape + warn on --allow-concurrent Three small follow-ups from the final ship-readiness review: 1. parseInjectMode (internal, flags.go) consolidated into ParseInjectMode (exported, generator.go). PR #203 added the exported version; this removes the internal duplicate and updates the one caller (dispatch.go:457) + test (flags_test.go:27). The generator_test.go comment that flagged this as a known follow-up is updated to note the consolidation is done. 2. bootstrap_error scrape race: the metric is incremented then Run returns the error and the process exits in <1s. With the default Prometheus scrape interval of 15s, the metric was unobservable in practice. Hold the process open for 20s (ctx-honoring; SIGINT still exits promptly, and tests with short ctx timeouts still pass quickly) so the scraper has a chance to read the counter before exit. 20s > 15s + jitter; well below operator patience for a fast-failing run. Surfaced via slog.Warn so the operator knows why the process is lingering. 3. --allow-concurrent runs share SUT resources (ES index growth, search-service capacity, NATS redelivery budget) even though msgID-keyed correlation prevents cross-credit. Emit a structured slog.Warn when the flag is set so operators don't misread numbers that look like degradation in run B but are actually saturation from run A. msgID isolation comment in CHANGES.md is unchanged (still accurate at the correctness level).

claude added 21 commits May 19, 2026 06:51

feat(subject): add RoomMemberEventWildcard helper

5d8766c

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

feat(loadgen): add member-add Prometheus collectors

0f389b5

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

feat(loadgen): add Shape enum + inject/shape validation

884a167

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

feat(loadgen): add MembersPreset + builtin members presets

c0d0929

feat(loadgen): add OwnersByRoom helper

98c418e

feat(loadgen): add frontdoor member publisher with reply correlation

da3da91

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

feat(loadgen): add member-add E1/E2 collector

66af9f4

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

feat(loadgen): add SustainedMembersGenerator

fe50675

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

feat(loadgen): add MembersSummary + CapacitySummary printers

8be678e

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

docs(loadgen): document members workload + Makefile targets

eebb317

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

test(loadgen): use errors.Is for ErrPoolsExhausted assertion

c83b461

Replaces fragile error-string comparison with the typed sentinel check per CLAUDE.md's "Never compare errors by string" rule. https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

hmchangw changed the title ~~docs: design spec for room-member add load test~~ feat(loadgen): add room-member add load test (members-sustained, members-capacity) May 20, 2026

Joey0538 approved these changes May 20, 2026

View reviewed changes

hmchangw merged commit bfaf9b6 into main May 20, 2026
6 checks passed

hmchangw deleted the claude/load-test-room-members-7o4o4 branch May 20, 2026 07:18

coderabbitai Bot mentioned this pull request Jun 3, 2026

feat(loadgen): room-read max-rps workload #266

Merged

5 tasks

coderabbitai Bot mentioned this pull request Jun 3, 2026

Add X-Request-ID header to member publisher messages #268

Merged

		time.Sleep(2 * time.Second) // drain trailing replies/broadcasts
		collector.DiscardBefore(warmupDeadline)

Conversation

hmchangw commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hmchangw commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading