Skip to content

feat(loadgen): add room-member add load test (members-sustained, members-capacity)#203

Merged
hmchangw merged 22 commits into
mainfrom
claude/load-test-room-members-7o4o4
May 20, 2026
Merged

feat(loadgen): add room-member add load test (members-sustained, members-capacity)#203
hmchangw merged 22 commits into
mainfrom
claude/load-test-room-members-7o4o4

Conversation

@hmchangw
Copy link
Copy Markdown
Owner

@hmchangw hmchangw commented May 19, 2026

Adds the brainstorming spec for two new tools/loadgen subcommands —
members-sustained (open-loop throughput) and members-capacity
(sequential per-room growth) — covering frontdoor and canonical inject
modes, fixture/preset shape, E1/E2 correlation, and testing plan.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t

Summary by CodeRabbit

Release Notes

  • New Features

    • Added a new members load-test workload with members-sustained and members-capacity subcommands to benchmark member-addition operations.
    • Introduced configurable load profiles with preset sizes, adjustable rates, warmup periods, and target capacity limits.
    • Added support for multiple injection and topology configuration options to test different scenarios.
    • Implemented latency percentile reporting and bucketed capacity analysis for detailed performance insights.
  • Documentation

    • Added comprehensive design specification for the members load-test workload.
    • Updated load-test README with quick-start commands and detailed flag documentation for new member-focused testing workflows.
  • Tests

    • Added extensive test coverage for member-add load generation, publishing, collection, and end-to-end integration scenarios.

Review Change Stack

claude added 21 commits May 19, 2026 06:51
Adds the brainstorming spec for two new tools/loadgen subcommands —
members-sustained (open-loop throughput) and members-capacity
(sequential per-room growth) — covering frontdoor and canonical inject
modes, fixture/preset shape, E1/E2 correlation, and testing plan.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
Scopes v1 to --shape=users only (per spec amendment); plan covers
fixtures, publishers (frontdoor + canonical), collector with
(roomID, sortedAccounts) E2 correlation, sustained + capacity
generators, summary printers, subcommand wiring, Makefile targets,
and integration test.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
Implements the deterministic members-workload fixture builder: pure function
of (preset, seed, siteID) producing Users, Rooms, Subscriptions, RoomKeys,
and a CandidatePools map (accounts not yet seeded per room) for downstream
add-member generators.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
Implements MemberPublisher interface and canonicalMemberPublisher that
publishes AddMembersRequest events directly to JetStream ROOMS stream,
bypassing room-service for isolated room-worker benchmarking.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
Adds OnBroadcast callback to MemberCollector so per-room goroutines in
CapacityMembersGenerator can gate each publish on the previous broadcast
arriving, enabling closed-loop sequential room growth to a target size.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
Adds --workload=messages|members (default: messages) to the seed and
teardown subcommands, dispatching to per-workload helpers. The messages
path preserves existing behavior except slog messages now include the
workload label ("seed complete (messages)" / "teardown complete (messages)").
The members path uses BuiltinMembersPreset and BuildMembersFixtures and
logs a candidatePoolTotal stat. Also extracts roomIDsOf helper shared by
both teardown variants.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
Wires up the members-sustained CLI subcommand, plumbing together the
NATS/JetStream client, metrics HTTP server, E2 broadcast subscription,
MemberPublisher (frontdoor and canonical), ConsumerSampler against the
ROOMS stream, SustainedMembersGenerator, and end-of-run finalize with
summary printing and optional CSV export. Also adds stub for
members-capacity (to be implemented in Task 15).

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
Replaces the stub runMembersCapacity with a real implementation that
drives each room to --target-size using CapacityMembersGenerator, then
reports per-bucket E1/E2 latency via CapacitySummary and computeSizeBuckets.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
Verifies the full members-sustained pipeline with simulated room-service
and room-worker stages via NATS subscriptions, asserting clean exit and
non-zero traffic through both simulated stages.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
Replaces fragile error-string comparison with the typed sentinel check
per CLAUDE.md's "Never compare errors by string" rule.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
- Deduplicate snapshotLatencies: delete Collector.snapshotLatenciesLocked
  and route both Collector + MemberCollector through the package-level
  snapshotLatencies helper.
- Extract ParseInjectMode helper; collapse three duplicate inject parse
  blocks (runRun, runMembersSustained, runMembersCapacity) into a single
  call site each.
- Remove dead SiteID field from SustainedMembersConfig + CapacityMembersConfig
  (never read inside the generators; subjects are baked into the publisher).
- NewMemberCollector now takes InjectMode (typed) instead of raw string.
- Single time.Now() per publish in both generators (was called twice,
  causing skew in E1 latency at high rates).
- Cache string(Inject) / string(Shape) labels on generator construction
  rather than reconverting every publish.
- Trim spec-doc path from ErrPoolsExhausted and shape-validation error
  messages; the path will rot before the messages do.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Warning

Rate limit exceeded

@hmchangw has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 55 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 16798987-c519-436b-ad27-e5b0c36313e8

📥 Commits

Reviewing files that changed from the base of the PR and between 074992d and eee926b.

📒 Files selected for processing (3)
  • tools/loadgen/generator_test.go
  • tools/loadgen/members_collector.go
  • tools/loadgen/members_generator.go
📝 Walkthrough

Walkthrough

This PR adds a complete load-test workload for benchmarking room member-addition operations. It introduces two execution modes (sustained rate-limited publishing and capacity-driven per-room growth), support for two request paths (frontdoor NATS request/reply and canonical JetStream), E1/E2 event correlation, and size-bucketed reporting.

Changes

Members workload benchmark

Layer / File(s) Summary
Design specification
docs/superpowers/specs/2026-05-19-load-test-room-members-design.md
Complete architecture document defining the workload modes, fixture model, injection paths, E1/E2 correlation strategy, metrics, error handling, test strategy, and file structure.
Subject filters and metrics foundation
pkg/subject/subject.go, pkg/subject/subject_test.go, tools/loadgen/metrics.go
New NATS wildcard filter for member events; Prometheus metrics struct extension with member-add counters, latency histograms, and room size gauge.
Members presets and fixture generation
tools/loadgen/members.go, tools/loadgen/members_test.go
Shape enum and validator; preset registry (small/medium/capacity) with BuildMembersFixtures for deterministic in-memory fixture generation including baseline subscriptions and per-room candidate pools.
Publisher and collector implementations
tools/loadgen/members_publisher.go, tools/loadgen/members_publisher_test.go, tools/loadgen/members_collector.go, tools/loadgen/members_collector_test.go
Canonical JetStream and frontdoor NATS publishers; thread-safe collector matching E1 replies and E2 broadcasts by corrID and room/accounts, with latency sampling and error counting.
Generator engines
tools/loadgen/members_generator.go, tools/loadgen/members_generator_test.go
SustainedMembersGenerator (open-loop rate-limited round-robin); CapacityMembersGenerator (sequential per-room targeting with ack-based progression).
Main CLI integration
tools/loadgen/generator.go, tools/loadgen/main.go
Dispatch routing for members subcommands; seed/teardown workload selection; runSeedMembers fixture staging; runMembersSustained and runMembersCapacity with NATS/JetStream wiring and metrics collection.
Reporting, validation, and deployment
tools/loadgen/collector.go, tools/loadgen/main_test.go, tools/loadgen/members_integration_test.go, tools/loadgen/members_report.go, tools/loadgen/members_report_test.go, tools/loadgen/README.md, tools/loadgen/deploy/Makefile
MembersSummary and CapacitySummary reporting with latency percentiles; CSV export; main_test.go CLI validation (metrics, error codes, presets); integration test with simulated NATS/JetStream pipeline; user documentation and Makefile deployment targets.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hmchangw/chat#84: Introduces member-add and member-event NATS subjects that this PR's loadgen workload uses for E1 reply correlation and E2 broadcast subscription.
  • hmchangw/chat#117: Initial tools/loadgen implementation of generator, collector, and metrics infrastructure that this PR extends with member-add benchmarking.

Suggested reviewers

  • mliu33
  • GITMateuszCharczuk

Poem

🐰 Hop, hop, the members hop in!
Rooms grow wide with ack-based grin,
Sustained paces, buckets measure fate,
Latencies dance from E1 to E2, first to late.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The PR title accurately and specifically describes the main feature being added: load test implementations for room-member add operations with two workload modes (sustained and capacity).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/load-test-room-members-7o4o4

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md`:
- Around line 283-285: The docs use two different test filenames —
`collector_member_test.go` and `members_collector_test.go` — causing
inconsistency; pick the canonical name (use `members_collector_test.go` to match
the layout) and replace all occurrences of `collector_member_test.go` in this
spec (including the testing section and the other spot around the later mention)
with `members_collector_test.go` so the spec and implementation tasks reference
the same filename.
- Around line 30-34: The doc claims v1 implements only shape=users but later
sections describe multi-shape behavior; update the spec to make the v1 contract
explicit by either (A) adding a clear parse-time rejection for non-users shapes
(referencing the Shape enum and the --shape flag and validation logic) and
calling out that the request builder (which currently only emits user-account
lists) will error on orgs/channels/mixed, or (B) marking the multi-shape
behavior in sections referencing runtime/tests (lines describing multi-shape) as
"post-v1" only; pick one approach and apply it consistently to the Shape enum,
--shape flag/validation, and the request builder description so there are no
unsupported runtime paths implied.
- Around line 12-19: Several fenced code blocks (for example the block that
begins with "client →
chat.user.{account}.request.room.{roomID}.{siteID}.member.add →
room-service.handleAddMembers ..." and the other blocks at ranges 50-55,
238-251, 299-315) are missing language identifiers; edit those fenced code
blocks in docs/superpowers/specs/2026-05-19-load-test-room-members-design.md and
add an appropriate language tag (e.g., `text`, `bash`, or `go`) after the
opening backticks so each block is annotated for markdownlint/MD040 and tooling.

In `@tools/loadgen/main.go`:
- Around line 384-387: The shutdown code currently discards errors from
metricsSrv.Shutdown(shutCtx) and nc.Drain(); instead, capture and handle those
errors explicitly: call metricsSrv.Shutdown(shutCtx) and check its error (log
via the existing logger or return it), and do the same for nc.Drain() (log error
or propagate), and avoid discarding cancelShut() results if it can return an
error; update both occurrences (the Shutdown/Drain calls around
metricsSrv.Shutdown, cancelShut, nc.Drain and the repeated calls at the other
location flagged) to surface failures (use process logger, ctx-aware messages,
and/or return non-nil error) so cleanup failures are not silently ignored.
- Around line 631-637: The code currently computes a single global E1/E2 via
ComputePercentiles(c.E1Samples()) and assigns those same percentiles to every
non-empty bucket in the out slice; instead, compute percentiles per-bucket: for
each bucket index i where out[i].Count > 0, gather or request that bucket's
samples (e.g., use a bucket-aware samples accessor or filter
c.E1Samples()/c.E2Samples() by bucket index) and call ComputePercentiles for
that bucket, then assign those bucket-specific e1/e2 to out[i].E1 and out[i].E2;
do not reuse the global e1/e2 variables. Ensure you only compute percentiles
when out[i].Count > 0 to avoid unnecessary work.
- Around line 376-377: Replace the blind time.Sleep calls (the ones before
collector.DiscardBefore(warmupDeadline) and the similar sleep at the second
location) with an event-driven drain: stop using time.Sleep and instead wait for
a deterministic quiescence signal from the collector or an in-flight counter.
Add or use a method like collector.WaitUntilEmpty(ctx, timeout) or expose an
in-flight counter/channel on collector (e.g., collector.InFlight() + a channel
closed when zero) and block on that (or a context timeout) before calling
collector.DiscardBefore(warmupDeadline); ensure both occurrences (the sleep
before DiscardBefore and the other sleep at the second location) are replaced
with this wait and that a timeout/ctx is used to avoid indefinite blocking.

In `@tools/loadgen/members_integration_test.go`:
- Around line 57-67: The callback registered in rsSub (nc.Subscribe callback) is
silently discarding errors from json.Unmarshal, m.Respond, and js.Publish which
can hide pipeline failures; update the callback to check and handle each error
returned by m.Respond and js.Publish (and the outer nc.Subscribe err) — log or
t.Errorf/fatal the error or increment a failure counter so the test surface
fails when any Respond/Publish fails, and add a short comment only if you
intentionally choose to ignore an error; apply the same treatment to the other
simulation callbacks referenced around lines 80-97 to avoid silently swallowing
errors.
- Around line 121-126: Replace the fixed time.Sleep with a deterministic
wait-based assertion: remove the 500ms Sleep and instead use a polling/assertion
that retries until both counters are > 0 (for example use assert.Eventually or a
small loop with time.After and polling) checking rsCalls.Load() > 0 and
rwCalls.Load() > 0; reference the existing symbols rsCalls.Load() and
rwCalls.Load() and fail the test if the condition does not become true within a
reasonable timeout (e.g. a few seconds) to avoid flakiness.

In `@tools/loadgen/members_publisher_test.go`:
- Around line 20-38: The tests currently start an actual NATS server via
startEmbeddedJetStream which makes them integration tests; either move these
tests into an explicit integration test package/suite (so startEmbeddedJetStream
remains but only used by integration tests) or refactor the publisher
construction to accept an injectable publish client/function (e.g., add a
PublishFunc or JetStreamer interface parameter to the publisher constructor used
by MembersPublisher or NewMembersPublisher) so unit tests can supply a fake/mock
instead of calling startEmbeddedJetStream; update tests in the 40-141 range to
use the fake PublishFunc/JetStreamer for unit tests and keep a separate
integration test that uses startEmbeddedJetStream if end-to-end verification is
needed.

In `@tools/loadgen/members_publisher.go`:
- Around line 79-80: In Close(), don't ignore the result of p.sub.Unsubscribe():
capture the returned error from p.sub.Unsubscribe(), and either return it (or
log it via the publisher's logger) before setting p.sub = nil so teardown
failures are observable; if you intentionally want to discard the error, add a
clear comment stating why. Update the Close method to handle the unsubscribe
error from p.sub.Unsubscribe(), referencing the Close method, the p.sub field
and the Unsubscribe() call.
- Around line 65-68: Guard the p.onReply call inside the nc.Subscribe callback
to avoid panics when no handler is set: in the subscription created around
inboxPrefix+".*" (the callback that computes corr :=
m.Subject[len(inboxPrefix)+1:]) check if p.onReply != nil before calling
p.onReply(corr, m.Data, time.Now()), or alternatively enforce a non-nil handler
in newFrontdoorMemberPublisher by returning an error or setting a no-op handler;
update either the callback or the constructor to ensure replies are handled
safely.

In `@tools/loadgen/README.md`:
- Around line 93-110: The README.md fenced code blocks with the make commands
are missing language identifiers (triggering MD040); update each triple-backtick
fence that wraps the make -C tools/loadgen/deploy commands to use ```bash so all
three blocks (the sustained run block, the capacity-mode block, and the
reset-members block) start with ```bash and retain their existing content,
ensuring the markdown linter no longer flags MD040.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 129bb287-d6c4-4365-8977-af58da83c1e4

📥 Commits

Reviewing files that changed from the base of the PR and between 7b5747b and 074992d.

📒 Files selected for processing (22)
  • docs/superpowers/plans/2026-05-19-load-test-room-members.md
  • docs/superpowers/specs/2026-05-19-load-test-room-members-design.md
  • pkg/subject/subject.go
  • pkg/subject/subject_test.go
  • tools/loadgen/README.md
  • tools/loadgen/collector.go
  • tools/loadgen/deploy/Makefile
  • tools/loadgen/generator.go
  • tools/loadgen/main.go
  • tools/loadgen/main_test.go
  • tools/loadgen/members.go
  • tools/loadgen/members_collector.go
  • tools/loadgen/members_collector_test.go
  • tools/loadgen/members_generator.go
  • tools/loadgen/members_generator_test.go
  • tools/loadgen/members_integration_test.go
  • tools/loadgen/members_publisher.go
  • tools/loadgen/members_publisher_test.go
  • tools/loadgen/members_report.go
  • tools/loadgen/members_report_test.go
  • tools/loadgen/members_test.go
  • tools/loadgen/metrics.go

Comment on lines +12 to +19
```
client
→ chat.user.{account}.request.room.{roomID}.{siteID}.member.add
→ room-service.handleAddMembers (auth, capacity, dedup, channel expansion)
→ chat.room.canonical.{siteID}.member.add (ROOMS stream)
→ room-worker (resolve, write subscriptions, emit broadcast + outbox)
→ chat.room.{roomID}.event (member_added RoomEvent)
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks (MD040).

The fences in these sections are missing language hints. Please annotate them (for example text, bash, or go) to satisfy markdownlint and keep docs tooling consistent.

Also applies to: 50-55, 238-251, 299-315

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 12-12: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md` around
lines 12 - 19, Several fenced code blocks (for example the block that begins
with "client → chat.user.{account}.request.room.{roomID}.{siteID}.member.add →
room-service.handleAddMembers ..." and the other blocks at ranges 50-55,
238-251, 299-315) are missing language identifiers; edit those fenced code
blocks in docs/superpowers/specs/2026-05-19-load-test-room-members-design.md and
add an appropriate language tag (e.g., `text`, `bash`, or `go`) after the
opening backticks so each block is annotated for markdownlint/MD040 and tooling.

Comment on lines +30 to +34
The first cut implements `--shape=users` only. The `Shape` enum, flag,
and validation are in place from the start so adding `orgs` / `channels` /
`mixed` is a small follow-up plan, but the v1 plan does not seed an org
pool or a source-channel pool, and the request builder only emits
user-account lists.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clarify the v1 shape contract to avoid unsupported runtime paths.

Line [30] says v1 is shape=users only, but Lines [140-142] and [279-282] describe active multi-shape runtime behavior/tests. Please make the contract explicit: either reject non-users shapes at parse time in v1, or mark those sections as post-v1 only.

Also applies to: 140-142, 279-282

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md` around
lines 30 - 34, The doc claims v1 implements only shape=users but later sections
describe multi-shape behavior; update the spec to make the v1 contract explicit
by either (A) adding a clear parse-time rejection for non-users shapes
(referencing the Shape enum and the --shape flag and validation logic) and
calling out that the request builder (which currently only emits user-account
lists) will error on orgs/channels/mixed, or (B) marking the multi-shape
behavior in sections referencing runtime/tests (lines describing multi-shape) as
"post-v1" only; pick one approach and apply it consistently to the Shape enum,
--shape flag/validation, and the request builder description so there are no
unsupported runtime paths implied.

Comment on lines +283 to +285
- `collector_member_test.go` — E1/E2 correlation join, size-bucketing
edges, percentile output, `member_added` filter rejects other
RoomEvent types.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix inconsistent test filename in the testing section.

Line [283] uses collector_member_test.go, while the layout uses members_collector_test.go. Keep one canonical name to prevent churn and mislinked implementation tasks.

Also applies to: 303-305

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-19-load-test-room-members-design.md` around
lines 283 - 285, The docs use two different test filenames —
`collector_member_test.go` and `members_collector_test.go` — causing
inconsistency; pick the canonical name (use `members_collector_test.go` to match
the layout) and replace all occurrences of `collector_member_test.go` in this
spec (including the testing section and the other spot around the later mention)
with `members_collector_test.go` so the spec and implementation tasks reference
the same filename.

Comment thread tools/loadgen/main.go
Comment on lines +376 to +377
time.Sleep(2 * time.Second) // drain trailing replies/broadcasts
collector.DiscardBefore(warmupDeadline)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace fixed sleeps with event-driven drain completion.

Line 376 and Line 567 use time.Sleep(2 * time.Second) to wait for trailing events, which is nondeterministic under varying load and can skew measured results. Wait on explicit completion criteria (e.g., in-flight counters, collector quiescence, or context-bounded channel signaling) instead.

As per coding guidelines: “Never use time.Sleep() for goroutine synchronization — use proper sync primitives (channels, sync.WaitGroup, sync.Mutex, context cancellation).”

Also applies to: 567-568

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/main.go` around lines 376 - 377, Replace the blind time.Sleep
calls (the ones before collector.DiscardBefore(warmupDeadline) and the similar
sleep at the second location) with an event-driven drain: stop using time.Sleep
and instead wait for a deterministic quiescence signal from the collector or an
in-flight counter. Add or use a method like collector.WaitUntilEmpty(ctx,
timeout) or expose an in-flight counter/channel on collector (e.g.,
collector.InFlight() + a channel closed when zero) and block on that (or a
context timeout) before calling collector.DiscardBefore(warmupDeadline); ensure
both occurrences (the sleep before DiscardBefore and the other sleep at the
second location) are replaced with this wait and that a timeout/ctx is used to
avoid indefinite blocking.

Comment thread tools/loadgen/main.go
Comment on lines +384 to +387
_ = metricsSrv.Shutdown(shutCtx)
cancelShut()
_ = nc.Drain()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle shutdown/drain errors explicitly instead of discarding them.

Line 384/386 and Line 571/573 ignore errors from server shutdown and NATS drain. This hides cleanup failures and can mask dropped metrics or pending publishes during teardown.

Suggested fix
-	_ = metricsSrv.Shutdown(shutCtx)
+	if err := metricsSrv.Shutdown(shutCtx); err != nil {
+		slog.Warn("metrics shutdown", "error", err)
+	}
 	cancelShut()
-	_ = nc.Drain()
+	if err := nc.Drain(); err != nil {
+		slog.Warn("nats drain", "error", err)
+	}

As per coding guidelines: “Never ignore errors silently — comment if intentionally discarded.”

Also applies to: 571-573

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/main.go` around lines 384 - 387, The shutdown code currently
discards errors from metricsSrv.Shutdown(shutCtx) and nc.Drain(); instead,
capture and handle those errors explicitly: call metricsSrv.Shutdown(shutCtx)
and check its error (log via the existing logger or return it), and do the same
for nc.Drain() (log error or propagate), and avoid discarding cancelShut()
results if it can return an error; update both occurrences (the Shutdown/Drain
calls around metricsSrv.Shutdown, cancelShut, nc.Drain and the repeated calls at
the other location flagged) to surface failures (use process logger, ctx-aware
messages, and/or return non-nil error) so cleanup failures are not silently
ignored.

Comment on lines +121 to +126
// Allow trailing events to settle before checking counters.
time.Sleep(500 * time.Millisecond)

// Both simulated stages must have seen traffic.
assert.Greater(t, rsCalls.Load(), int64(0), "room-service was never called")
assert.Greater(t, rwCalls.Load(), int64(0), "room-worker never consumed")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Replace fixed sleep with deterministic synchronization.

The fixed delay can cause flakes under slower CI/load and is explicitly discouraged for synchronization in this repo. Use an eventual/wait-based assertion instead.

Suggested fix
-	// Allow trailing events to settle before checking counters.
-	time.Sleep(500 * time.Millisecond)
-
-	// Both simulated stages must have seen traffic.
-	assert.Greater(t, rsCalls.Load(), int64(0), "room-service was never called")
-	assert.Greater(t, rwCalls.Load(), int64(0), "room-worker never consumed")
+	// Both simulated stages must see traffic.
+	require.Eventually(t, func() bool {
+		return rsCalls.Load() > 0 && rwCalls.Load() > 0
+	}, 5*time.Second, 50*time.Millisecond, "simulated pipeline saw no traffic")

As per coding guidelines: “Never use time.Sleep() for goroutine synchronization — use proper sync primitives.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Allow trailing events to settle before checking counters.
time.Sleep(500 * time.Millisecond)
// Both simulated stages must have seen traffic.
assert.Greater(t, rsCalls.Load(), int64(0), "room-service was never called")
assert.Greater(t, rwCalls.Load(), int64(0), "room-worker never consumed")
// Both simulated stages must see traffic.
require.Eventually(t, func() bool {
return rsCalls.Load() > 0 && rwCalls.Load() > 0
}, 5*time.Second, 50*time.Millisecond, "simulated pipeline saw no traffic")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/members_integration_test.go` around lines 121 - 126, Replace
the fixed time.Sleep with a deterministic wait-based assertion: remove the 500ms
Sleep and instead use a polling/assertion that retries until both counters are >
0 (for example use assert.Eventually or a small loop with time.After and
polling) checking rsCalls.Load() > 0 and rwCalls.Load() > 0; reference the
existing symbols rsCalls.Load() and rwCalls.Load() and fail the test if the
condition does not become true within a reasonable timeout (e.g. a few seconds)
to avoid flakiness.

Comment on lines +20 to +38
func startEmbeddedJetStream(t *testing.T) (*nats.Conn, jetstream.JetStream) {
t.Helper()
opts := &natsserver.Options{
Port: -1,
JetStream: true,
StoreDir: t.TempDir(),
}
ns, err := natsserver.NewServer(opts)
require.NoError(t, err)
ns.Start()
require.True(t, ns.ReadyForConnections(5*time.Second), "nats server did not become ready")
t.Cleanup(ns.Shutdown)
nc, err := nats.Connect(ns.ClientURL())
require.NoError(t, err)
t.Cleanup(nc.Close)
js, err := jetstream.New(nc)
require.NoError(t, err)
return nc, js
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

These tests rely on a real NATS server instead of unit-level injection/mocks.

Current coverage is integration-style and violates the unit-test constraint for this repo. Please move these to explicit integration test scope or refactor publisher construction so unit tests can inject publish/subscribe fakes without booting NATS.

As per coding guidelines, “Never connect to real databases, NATS, or external services in unit tests. When a handler publishes to JetStream, inject the publish function as a field so tests can capture data without a real NATS connection.”

Also applies to: 40-141

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/members_publisher_test.go` around lines 20 - 38, The tests
currently start an actual NATS server via startEmbeddedJetStream which makes
them integration tests; either move these tests into an explicit integration
test package/suite (so startEmbeddedJetStream remains but only used by
integration tests) or refactor the publisher construction to accept an
injectable publish client/function (e.g., add a PublishFunc or JetStreamer
interface parameter to the publisher constructor used by MembersPublisher or
NewMembersPublisher) so unit tests can supply a fake/mock instead of calling
startEmbeddedJetStream; update tests in the 40-141 range to use the fake
PublishFunc/JetStreamer for unit tests and keep a separate integration test that
uses startEmbeddedJetStream if end-to-end verification is needed.

Comment on lines +65 to +68
sub, err := nc.Subscribe(inboxPrefix+".*", func(m *nats.Msg) {
corr := m.Subject[len(inboxPrefix)+1:]
p.onReply(corr, m.Data, time.Now())
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard onReply before invoking it in the subscription callback.

If newFrontdoorMemberPublisher is called with a nil handler, this callback path panics on the first reply. Add a nil guard (or enforce non-nil in constructor) to keep reply handling fail-safe.

Suggested fix
 	sub, err := nc.Subscribe(inboxPrefix+".*", func(m *nats.Msg) {
 		corr := m.Subject[len(inboxPrefix)+1:]
-		p.onReply(corr, m.Data, time.Now())
+		if p.onReply != nil {
+			p.onReply(corr, m.Data, time.Now())
+		}
 	})
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sub, err := nc.Subscribe(inboxPrefix+".*", func(m *nats.Msg) {
corr := m.Subject[len(inboxPrefix)+1:]
p.onReply(corr, m.Data, time.Now())
})
sub, err := nc.Subscribe(inboxPrefix+".*", func(m *nats.Msg) {
corr := m.Subject[len(inboxPrefix)+1:]
if p.onReply != nil {
p.onReply(corr, m.Data, time.Now())
}
})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/members_publisher.go` around lines 65 - 68, Guard the p.onReply
call inside the nc.Subscribe callback to avoid panics when no handler is set: in
the subscription created around inboxPrefix+".*" (the callback that computes
corr := m.Subject[len(inboxPrefix)+1:]) check if p.onReply != nil before calling
p.onReply(corr, m.Data, time.Now()), or alternatively enforce a non-nil handler
in newFrontdoorMemberPublisher by returning an error or setting a no-op handler;
update either the callback or the constructor to ensure replies are handled
safely.

Comment on lines +79 to +80
_ = p.sub.Unsubscribe()
p.sub = nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t silently discard Unsubscribe() errors in Close().

Handle/log the error (or at least document intentional discard) so teardown failures are observable.

As per coding guidelines, “Never ignore errors silently — comment if intentionally discarded.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/loadgen/members_publisher.go` around lines 79 - 80, In Close(), don't
ignore the result of p.sub.Unsubscribe(): capture the returned error from
p.sub.Unsubscribe(), and either return it (or log it via the publisher's logger)
before setting p.sub = nil so teardown failures are observable; if you
intentionally want to discard the error, add a clear comment stating why. Update
the Close method to handle the unsubscribe error from p.sub.Unsubscribe(),
referencing the Close method, the p.sub field and the Unsubscribe() call.

Comment thread tools/loadgen/README.md
Comment on lines +93 to +110
```
make -C tools/loadgen/deploy up
make -C tools/loadgen/deploy seed-members PRESET=members-medium
make -C tools/loadgen/deploy run-sustained PRESET=members-medium RATE=100 DURATION=60s
```

For capacity-mode growth curves:

```
make -C tools/loadgen/deploy seed-members PRESET=members-capacity
make -C tools/loadgen/deploy run-capacity PRESET=members-capacity TARGET_SIZE=500
```

Between sustained runs, reset state so candidate pools refill:

```
make -C tools/loadgen/deploy reset-members PRESET=members-medium
```
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to the new fenced code blocks.

The newly added fences are missing a language tag, which triggers MD040.

Suggested fix
-```
+```bash
 make -C tools/loadgen/deploy up
 make -C tools/loadgen/deploy seed-members PRESET=members-medium
 make -C tools/loadgen/deploy run-sustained PRESET=members-medium RATE=100 DURATION=60s

@@
- +bash
make -C tools/loadgen/deploy seed-members PRESET=members-capacity
make -C tools/loadgen/deploy run-capacity PRESET=members-capacity TARGET_SIZE=500

@@
-```
+```bash
make -C tools/loadgen/deploy reset-members PRESET=members-medium
</details>

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.22.1)</summary>

[warning] 93-93: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

[warning] 101-101: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

[warning] 108-108: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @tools/loadgen/README.md around lines 93 - 110, The README.md fenced code
blocks with the make commands are missing language identifiers (triggering
MD040); update each triple-backtick fence that wraps the make -C
tools/loadgen/deploy commands to use ```bash so all three blocks (the sustained
run block, the capacity-mode block, and the reset-members block) start with

longer flags MD040.

@hmchangw hmchangw changed the title docs: design spec for room-member add load test feat(loadgen): add room-member add load test (members-sustained, members-capacity) May 20, 2026
Round-two simplify pass:
- Add presetLabel field to both members generators, set in constructor,
  used in the metrics WithLabelValues call instead of g.cfg.Preset.Name.
  Completes the cached-label pattern introduced in the previous pass.
- Doc-comment snapshotLatencies to flag that callers must already hold
  the collector mutex, since the function now serves both Collector and
  MemberCollector across files.
- Add TestParseInjectMode mirroring TestParseShape.

https://claude.ai/code/session_01Q8Fyx576dmfM9yfbEMtB6t
@hmchangw hmchangw merged commit bfaf9b6 into main May 20, 2026
6 checks passed
@hmchangw hmchangw deleted the claude/load-test-room-members-7o4o4 branch May 20, 2026 07:18
GITMateuszCharczuk pushed a commit that referenced this pull request May 20, 2026
…scrape + warn on --allow-concurrent

Three small follow-ups from the final ship-readiness review:

1. parseInjectMode (internal, flags.go) consolidated into ParseInjectMode
   (exported, generator.go). PR #203 added the exported version; this
   removes the internal duplicate and updates the one caller
   (dispatch.go:457) + test (flags_test.go:27). The generator_test.go
   comment that flagged this as a known follow-up is updated to note the
   consolidation is done.

2. bootstrap_error scrape race: the metric is incremented then Run
   returns the error and the process exits in <1s. With the default
   Prometheus scrape interval of 15s, the metric was unobservable in
   practice. Hold the process open for 20s (ctx-honoring; SIGINT still
   exits promptly, and tests with short ctx timeouts still pass
   quickly) so the scraper has a chance to read the counter before
   exit. 20s > 15s + jitter; well below operator patience for a
   fast-failing run. Surfaced via slog.Warn so the operator knows why
   the process is lingering.

3. --allow-concurrent runs share SUT resources (ES index growth,
   search-service capacity, NATS redelivery budget) even though
   msgID-keyed correlation prevents cross-credit. Emit a structured
   slog.Warn when the flag is set so operators don't misread numbers
   that look like degradation in run B but are actually saturation
   from run A. msgID isolation comment in CHANGES.md is unchanged
   (still accurate at the correctness level).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants