
perf(tcount): bound thread reply count recount to a constant#370

Merged
hmchangw merged 5 commits into main from claude/message-gateway-bottleneck-is4kqg
Jun 25, 2026

Conversation


@hmchangw hmchangw commented Jun 22, 2026

Owner

Summary

On the message hot path, the thread reply-count badge (tcount) was produced by re-counting the entire thread_messages_by_thread Cassandra partition on every reply and every delete. That is an O(thread-size) scan per write (O(N²) to build a thread), duplicated byte-for-byte in message-worker (add path) and history-service (delete path), and it sits on message-worker's synchronous JetStream-ack path.

This bounds that recount to a constant by stopping the scan once the non-deleted tally reaches a display cap of 99, extracted into a shared pkg/threadcount so the two authoritative writers can never drift.

  • pkg/threadcount: new Count(ctx, *gocql.Session, threadRoomID) (int, error) returning min(non-deleted replies, Cap) with const Cap = 99. It early-breaks at the cap (PageSize(Cap) + n < Cap guard), so it materializes ~Cap surviving rows (plus any soft-deleted rows clustered ahead of them) instead of the whole partition.
  • message-worker / history-service: each countThreadReplies is now a one-line delegation to the shared helper (threadcount.Count on the add path; threadcount.CountAndLatest on the delete path, which also recomputes tlm). The blind-SET of tcount and the countAndSetParentTcount callers are unchanged.
  • Per-write cost: O(thread-size) → bounded by ~99 surviving rows plus any soft-deleted rows clustered ahead. Below 99: exact. At/above: 99 (FE renders >= 99 as "99+").
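The early-break described above can be sketched without Cassandra. This is a minimal illustration, assuming an in-memory slice stands in for the partition; the `reply` type and `boundedCount` function are hypothetical names, not the code in pkg/threadcount.

```go
package main

import "fmt"

// Cap mirrors the display cap named in the PR description.
const Cap = 99

// reply is a hypothetical in-memory stand-in for one row of
// thread_messages_by_thread; only the soft-delete flag matters here.
type reply struct {
	deleted bool
}

// boundedCount returns min(non-deleted replies, limit) together with the
// number of rows it read before stopping.
func boundedCount(rows []reply, limit int) (count, scanned int) {
	for _, r := range rows {
		if count >= limit {
			break // cap reached: stop reading further rows
		}
		scanned++
		if !r.deleted {
			count++
		}
	}
	return count, scanned
}

func main() {
	c, s := boundedCount(make([]reply, 10), Cap)
	fmt.Println(c, s) // 10 10: below the cap the count is exact

	c, s = boundedCount(make([]reply, 5000), Cap)
	fmt.Println(c, s) // 99 99: the scan stops once 99 live rows are seen
}
```

With no soft-deleted rows in the way, the number of rows read equals the cap regardless of how large the thread is.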

Why a cap (and not the documented COUNTER table)

tcount was deliberately moved to COUNT+blind-SET in #245 because it is idempotent under JetStream redelivery and soft-delete-aware — both preserved here (no LIMIT is used, since soft-deleted rows are live rows interspersed in the partition; the scan reads deleted and treats NULL as not-deleted). #245's documented follow-up (a Cassandra counter table + reconciliation job) would re-introduce the non-idempotency it eliminated and add a scheduled job; this design instead stays inside the idempotent, stateless model with no new table, job, or source of truth. The #245 plan's future-work item is updated to point here.
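To see why a hard LIMIT would undercount when soft-deleted rows are interspersed, here is a small sketch under the assumption that the partition is modelled as a slice. `countWithLimit` and `countUntilCap` are hypothetical helpers written for this comparison only.

```go
package main

import "fmt"

// reply is a hypothetical stand-in for one partition row.
type reply struct {
	deleted bool
}

// countWithLimit models a query with LIMIT n: only the first n rows are
// read, whether or not they are soft-deleted.
func countWithLimit(rows []reply, n int) int {
	if len(rows) > n {
		rows = rows[:n]
	}
	live := 0
	for _, r := range rows {
		if !r.deleted {
			live++
		}
	}
	return live
}

// countUntilCap models the scan without LIMIT: it keeps reading until n
// live rows have been seen or the partition ends.
func countUntilCap(rows []reply, n int) int {
	live := 0
	for _, r := range rows {
		if live >= n {
			break
		}
		if !r.deleted {
			live++
		}
	}
	return live
}

func main() {
	// 200 rows, every other one soft-deleted: 100 live replies in total.
	rows := make([]reply, 200)
	for i := range rows {
		rows[i].deleted = i%2 == 0
	}
	fmt.Println(countWithLimit(rows, 99)) // 49: half of the limited window is deleted
	fmt.Println(countUntilCap(rows, 99))  // 99: the thread really has at least 99 live replies
}
```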

No schema/DDL change, no migration, no new dependency.

Changes

  • pkg/threadcount/ — new package + integration tests (under/over cap, deleted excluded, deleted-interspersed-over-cap, empty).
  • message-worker/store_cassandra.go, history-service/internal/cassrepo/write.go — delegate to the helper; capping integration test at each site.
  • Docs: cassandra_message_model.md (tcount column), client-api.md (tcount/newTcount fields), and the superseded note in the #245 plan (feat: real-time thread reply fan-out (broadcast-worker) + reply-count badge pipeline).

Test Plan

  • make test (full unit suite) — green
  • make lint — 0 issues
  • gosec — PASS
  • Integration suites must run in CI: make test-integration SERVICE=pkg/threadcount, …SERVICE=message-worker, …SERVICE=history-service. These were written and compile-verified under the integration build tag but could not be executed in the dev environment (no Docker).
  • govulncheck / semgrep must run in CI — blocked locally by the network policy (403 to their registries).

🤖 Generated with Claude Code

https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn


Generated by Claude Code

Summary by CodeRabbit

  • New Features

    • Thread reply counts are now capped at 99; values at/over the cap render as “99+”.
    • Reply add/delete flows use the same bounded counting behavior, including updating the latest surviving reply timestamp where applicable.
  • Bug Fixes

    • Thread badge totals remain consistent for threads with soft-deleted replies, including when reply volume exceeds the cap.
  • Documentation

    • Updated schema and client API docs to clarify tcount/newTcount are capped at 99 (where 99 means “99 or more”).
  • Tests

    • Added integration coverage for exact counting, capping behavior, deleted-row exclusion, and latest-timestamp semantics.


coderabbitai Bot commented Jun 22, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f4a9e51-c6bb-46dd-b9f3-e37cfa72fe76

📥 Commits

Reviewing files that changed from the base of the PR and between 769176c and 43a3c28.

📒 Files selected for processing (12)
  • docs/cassandra_message_model.md
  • docs/client-api.md
  • docs/superpowers/plans/2026-06-04-tcount-count-based.md
  • docs/superpowers/plans/2026-06-21-bounded-thread-reply-count.md
  • docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md
  • history-service/internal/cassrepo/write.go
  • history-service/internal/cassrepo/write_integration_test.go
  • message-worker/integration_test.go
  • message-worker/store_cassandra.go
  • pkg/threadcount/count.go
  • pkg/threadcount/integration_test.go
  • pkg/threadcount/main_test.go
✅ Files skipped from review due to trivial changes (4)
  • docs/cassandra_message_model.md
  • docs/superpowers/plans/2026-06-04-tcount-count-based.md
  • docs/client-api.md
  • docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • pkg/threadcount/main_test.go
  • history-service/internal/cassrepo/write_integration_test.go
  • history-service/internal/cassrepo/write.go
  • message-worker/store_cassandra.go
  • message-worker/integration_test.go

📝 Walkthrough


Adds a shared bounded thread-reply counting helper, switches both writer paths to use it, and updates the related Cassandra, API, design, and implementation-plan documentation to describe the capped tcount semantics.

Changes

Bounded Thread Reply Count

| Layer / File(s) | Summary |
| --- | --- |
| Design spec and superseded plan docs: docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md, docs/superpowers/plans/2026-06-04-tcount-count-based.md | The bounded-count design, helper contract, write-site roles, test scope, and rollout constraints are documented, and the older COUNTER-table plan is marked superseded. |
| Implementation plan document: docs/superpowers/plans/2026-06-21-bounded-thread-reply-count.md | The step-by-step implementation plan covers the helper package, both writer integrations, documentation updates, and verification steps. |
| pkg/threadcount: Cap constant and bounded helpers: pkg/threadcount/count.go, pkg/threadcount/integration_test.go, pkg/threadcount/main_test.go | Cap = 99 is defined, Count delegates to CountAndLatest, and the bounded Cassandra scan tracks the latest surviving reply timestamp. Integration tests cover capped, deleted, empty, and latest-timestamp cases. |
| message-worker: delegate countThreadReplies to threadcount.Count: message-worker/store_cassandra.go, message-worker/integration_test.go | countThreadReplies now calls threadcount.Count, and the integration test asserts the returned count is capped at threadcount.Cap. |
| history-service: delegate countThreadReplies to threadcount.CountAndLatest: history-service/internal/cassrepo/write.go, history-service/internal/cassrepo/write_integration_test.go | countThreadReplies now calls threadcount.CountAndLatest, and the integration test asserts the capped count and latest surviving timestamp behavior. |
| Client API and Cassandra schema doc updates: docs/cassandra_message_model.md, docs/client-api.md | The tcount and newTcount docs now state the count is capped at 99 and that 99 means “99 or more,” with "99+" as the frontend rendering. |
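The "99 means 99 or more" contract implies a small rendering rule on the consumer side. The sketch below is written in Go only to match the rest of the PR; `formatBadge` is a hypothetical name and the real frontend code is not part of this change.

```go
package main

import (
	"fmt"
	"strconv"
)

// Cap mirrors threadcount.Cap from the PR description.
const Cap = 99

// formatBadge is a hypothetical renderer: values below the cap are shown
// exactly, and any value at or above the cap is shown as "99+".
func formatBadge(tcount int) string {
	if tcount >= Cap {
		return strconv.Itoa(Cap) + "+"
	}
	return strconv.Itoa(tcount)
}

func main() {
	fmt.Println(formatBadge(5))  // 5
	fmt.Println(formatBadge(98)) // 98
	fmt.Println(formatBadge(99)) // 99+
}
```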

Sequence Diagram(s)

sequenceDiagram
  participant MessageWorker
  participant HistoryService
  participant threadcount
  participant Cassandra

  MessageWorker->>threadcount: Count(ctx, session, threadRoomID)
  threadcount->>Cassandra: Query thread_messages_by_thread
  Cassandra-->>threadcount: rows up to Cap
  threadcount-->>MessageWorker: bounded count

  HistoryService->>threadcount: CountAndLatest(ctx, session, threadRoomID)
  threadcount->>Cassandra: Query deleted and created_at
  Cassandra-->>threadcount: rows up to Cap
  threadcount-->>HistoryService: bounded count + latest surviving timestamp

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • hmchangw/chat#22: Introduces the Cassandra message model docs that this PR updates for the capped tcount semantics.
  • hmchangw/chat#93: Adds the original tcount column documentation that this PR refines to the bounded 99/"99+" behavior.
  • hmchangw/chat#354: Touches the same thread-counting and parent-message update paths in message-worker and history-service.

Suggested labels

ready

Suggested reviewers

  • mliu33
  • ngangwar962

Poem

🐇 I hopped through rows and counted neat,
Stopped at ninety-nine — what a feat!
Now 99+ shines bright and clear,
With one shared helper keeping near.
Soft-deleted shadows don’t confuse my feet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: bounding tcount recounts with a fixed cap. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@hmchangw hmchangw force-pushed the claude/message-gateway-bottleneck-is4kqg branch 3 times, most recently from 769176c to 4fde085 Compare June 24, 2026 11:15

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
pkg/threadcount/count.go (1)

1-6: 🚀 Performance & Scalability | 🔵 Trivial

Doc claim of "constant" per-write cost is only true when soft-deletes are sparse.

The early-stop counts non-deleted rows, but with no CQL LIMIT the iterator must page through every soft-deleted row that clusters ahead of the Cap-th survivor. A thread that accumulates many deleted = true rows (or fewer than Cap live replies among a large tombstoned partition) forces a full-partition read on every reply/delete — i.e. cost is O(deleted + Cap), not constant. This is the inherent trade-off of dropping LIMIT to stay soft-delete-correct, but the package doc ("per-write cost stays constant regardless of thread size") and Count's "~Cap rows" wording understate it.

Consider softening the doc to reflect the O(survivors-scanned + interspersed deletes) worst case, and confirm soft-deleted rows in thread_messages_by_thread are bounded/compacted so this hot path can't degrade over a thread's lifetime.
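The worst case described in this comment can be made concrete with a short simulation. It assumes the partition is a slice read in clustering order and that soft-deleted rows sit ahead of the live ones; `partition` and `rowsScanned` are hypothetical helpers, not code from pkg/threadcount.

```go
package main

import "fmt"

// Cap mirrors threadcount.Cap.
const Cap = 99

// reply is a hypothetical stand-in for one partition row.
type reply struct {
	deleted bool
}

// partition builds deletedAhead soft-deleted rows followed by liveAfter
// live rows, in the order the scan would read them.
func partition(deletedAhead, liveAfter int) []reply {
	rows := make([]reply, 0, deletedAhead+liveAfter)
	for i := 0; i < deletedAhead; i++ {
		rows = append(rows, reply{deleted: true})
	}
	for i := 0; i < liveAfter; i++ {
		rows = append(rows, reply{})
	}
	return rows
}

// rowsScanned applies the early-break rule and reports both the capped
// live count and how many rows were read to get there.
func rowsScanned(rows []reply) (live, scanned int) {
	for _, r := range rows {
		if live >= Cap {
			break
		}
		scanned++
		if !r.deleted {
			live++
		}
	}
	return live, scanned
}

func main() {
	live, scanned := rowsScanned(partition(0, 500))
	fmt.Println(live, scanned) // 99 99

	live, scanned = rowsScanned(partition(10000, 500))
	fmt.Println(live, scanned) // 99 10099

	live, scanned = rowsScanned(partition(10000, 20))
	fmt.Println(live, scanned) // 20 10020: fewer than Cap survivors forces a full read
}
```

In this model the read cost is the number of soft-deleted rows ahead of the Cap-th survivor plus Cap, which matches the O(deleted + Cap) bound stated above.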

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/threadcount/count.go` around lines 1 - 6, Update the package comment in
threadcount to avoid claiming the per-write cost is constant regardless of
thread size, since Count can still scan many tombstoned rows before reaching Cap
live replies. Soften the wording around Count’s “~Cap rows” behavior to reflect
the worst case when soft-deletes are sparse or clustered, and mention the
trade-off introduced by omitting CQL LIMIT for soft-delete correctness.
history-service/internal/cassrepo/write_integration_test.go (1)

1679-1695: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider asserting the returned tlm (currently discarded).

This test ignores the second return value (_), so the over-cap tlm derivation in CountAndLatest is exercised but never validated. Asserting that tlm equals the newest inserted reply's created_at even when row count exceeds Cap would directly cover the clustering-order dependency flagged in write.go.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@history-service/internal/cassrepo/write_integration_test.go` around lines
1679 - 1695, The test for countThreadReplies currently drops the second return
value, so it never verifies the latest-message timestamp path in CountAndLatest.
Update TestRepository_countThreadReplies_CapsAtThreadcountCap to assert the
returned tlm from countThreadReplies equals the newest inserted reply’s
created_at, using the existing countThreadReplies and threadcount.Cap setup to
cover the over-cap clustering-order behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/plans/2026-06-21-bounded-thread-reply-count.md`:
- Around line 417-425: Add explicit language identifiers to the fenced examples
in the affected markdown sections so they satisfy markdownlint MD040; update the
code fences around the example snippets in this document to use the appropriate
fence labels (for example, SQL for the schema snippets and Markdown where
applicable), and apply the same fix to the other referenced example blocks in
the same document.
- Line 7: The plan text is inconsistent for history-service: it currently tells
the delete path to use threadcount.Count, but that path must preserve tlm
updates by using threadcount.CountAndLatest instead. Update Task 3 and every
architecture/reference section that mentions history-service delegation so
history-service explicitly calls CountAndLatest while message-worker can still
use Count, and ensure the documented return shape includes the latest timestamp
where needed.

In `@docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md`:
- Around line 125-135: The shared-helper contract is inconsistent with the
delete path: the history-service delete flow still relies on
threadcount.CountAndLatest to recompute tlm, not just threadcount.Count. Update
the spec wording and example API around the delete-path helper to reference
threadcount.CountAndLatest for the delete flow while keeping threadcount.Count
as the shared display-cap counter, so the contract matches the actual writer
behavior.
- Around line 10-12: The fenced CQL snippet in the design doc is missing a
language identifier, so update that markdown block to use the appropriate SQL
fence for the query. Locate the fenced block containing the SELECT statement and
change it to a properly labeled sql code block to satisfy markdown linting.

In `@pkg/threadcount/integration_test.go`:
- Around line 24-30: The flagged `admin.Query(fmt.Sprintf(...))` in
`threadcount` is a false-positive SQL/CQL injection finding because `keyspace`
comes from `testutil.CassandraKeyspace` and cannot be parameterized as an
identifier. Add the scanner’s inline suppression directive with a brief
justification directly above the `fmt.Sprintf` statement in the test setup so
the SAST gate passes, keeping the suppression scoped only to this `CREATE TABLE`
query.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 15182611-7ba4-4617-ba4e-7e61001ab487

📥 Commits

Reviewing files that changed from the base of the PR and between a6c62d9 and 769176c.

📒 Files selected for processing (12)
  • docs/cassandra_message_model.md
  • docs/client-api.md
  • docs/superpowers/plans/2026-06-04-tcount-count-based.md
  • docs/superpowers/plans/2026-06-21-bounded-thread-reply-count.md
  • docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md
  • history-service/internal/cassrepo/write.go
  • history-service/internal/cassrepo/write_integration_test.go
  • message-worker/integration_test.go
  • message-worker/store_cassandra.go
  • pkg/threadcount/count.go
  • pkg/threadcount/integration_test.go
  • pkg/threadcount/main_test.go

@hmchangw hmchangw force-pushed the claude/message-gateway-bottleneck-is4kqg branch from 4fde085 to e62bb10 Compare June 24, 2026 11:24

Owner Author

Addressed the review feedback in e62bb10:

Fixed

  • Spec + plan docs: corrected the delete-path contract to threadcount.CountAndLatest (it also recomputes tlm), not Count. Updated the design spec's helper contract, the plan's Architecture/File-Structure/Task-3, and added the sql fence label.
  • pkg/threadcount/count.go: softened the package doc — per-write cost is bounded to ~Cap surviving rows plus any soft-deleted rows clustered ahead, not "constant regardless of thread size".
  • history-service cap test: now asserts the over-cap tlm equals the newest reply, covering the DESC-clustering path through the real delegation.

Skipped (with reasons)

  • SAST sql-injection suppression on the test CREATE TABLE: the repo's sast gate already passes on this — interpolating the testutil.CassandraKeyspace identifier into DDL is the standard pattern in every integration test in the repo (CQL can't bind identifiers), and adding a nosemgrep directive the rest of the suite doesn't use would be inconsistent. The flagged rule is CodeRabbit's OpenGrep config, not the repo's blocking scanner.
  • MD040 fence-language nits in the planning docs: markdownlint isn't a repo CI gate, and the flagged fences are before/after snippet fragments. (Did add the one real CQL block label in the spec.)
  • Docstring-coverage 45% pre-merge warning: the exported API (Count, CountAndLatest, Cap) is documented; the percentage counts test functions/helpers, which Go idiom leaves un-docstringed.
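On the skipped SAST item: because CQL cannot bind identifiers, the keyspace name has to be interpolated into DDL. The sketch below shows one way such interpolation could be guarded if a project wanted to. It is an assumption for illustration, not what the repo's tests do; `createTableStmt`, the `example_table` schema, and the identifier pattern are all hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// identRe is a deliberately conservative pattern for this sketch: a letter
// followed by letters, digits, or underscores.
var identRe = regexp.MustCompile(`^[A-Za-z][A-Za-z0-9_]*$`)

// createTableStmt builds a DDL statement for a hypothetical table. The
// keyspace is interpolated because identifiers cannot be bound as query
// parameters, so it is validated first.
func createTableStmt(keyspace string) (string, error) {
	if !identRe.MatchString(keyspace) {
		return "", fmt.Errorf("invalid keyspace identifier %q", keyspace)
	}
	return fmt.Sprintf(
		"CREATE TABLE IF NOT EXISTS %s.example_table (id text PRIMARY KEY)",
		keyspace,
	), nil
}

func main() {
	stmt, err := createTableStmt("chat_test")
	fmt.Println(stmt, err)

	_, err = createTableStmt("x; DROP KEYSPACE y")
	fmt.Println(err != nil) // true: the unsafe name is rejected
}
```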

Separately, the earlier test-integration (history-service) failure was a transient Cassandra-container flake in the internal/service package (unable to discover protocol version: EOF at keyspace setup, before any test logic) — unrelated to this change; my cassrepo package passed. The new push re-runs it.


Generated by Claude Code

@hmchangw hmchangw force-pushed the claude/message-gateway-bottleneck-is4kqg branch from e62bb10 to 63b08e5 Compare June 24, 2026 11:44
claude added 5 commits June 24, 2026 13:36
Spec and TDD implementation plan for bounding the per-write tcount
recount: keep #245's idempotent, soft-delete-aware COUNT+blind-SET but
stop the partition scan at a display cap, extracted into a shared
pkg/threadcount used by both authoritative writers (message-worker via
Count, history-service via CountAndLatest which also recomputes tlm).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn

CountAndLatest scans thread_messages_by_thread, early-breaking once the
non-deleted tally reaches Cap=99 (PageSize(Cap) + n<Cap guard), so the
per-write read is bounded to ~Cap surviving rows (plus any soft-deleted
rows clustered ahead) rather than the whole partition. No CQL LIMIT —
soft-deleted rows are live, interspersed rows, so a hard LIMIT could
undercount. It also returns the latest surviving reply's created_at (tlm)
for the delete path; the partition's DESC clustering order surfaces the
latest survivor first.

Count is a thin delegation to CountAndLatest discarding the timestamp —
the add path needs only the count, and since created_at is a clustering
key, selecting it adds no meaningful read cost while one shared scan keeps
the two writers' counts provably identical. Integration tests cover
under/over cap, deleted excluded (incl. over-cap interspersed), empty, and
latest-survivor cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn

Delegate the add-path countThreadReplies to threadcount.Count, replacing
the unbounded partition scan. The blind-SET (now tcount+tlm) and the
countAndSetParentTcount caller are unchanged; tlm on the add path stays
the new reply's CreatedAt. Adds an integration test (using the existing
setupCassandra helper) asserting the count caps at 99.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn

Delegate the delete-path countThreadReplies to threadcount.CountAndLatest,
which bounds the count at the same Cap as the add path (so a reply and a
delete can't write divergent values and flip-flop the badge) while still
recomputing tlm (latest surviving reply) from the one bounded scan. The
capping integration test (using the existing setupCassandra helper) also
asserts the over-cap tlm is the newest reply.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn

Note the 99 cap on the tcount column (messages_by_room, messages_by_id)
and on the tcount/newTcount client-api fields, and record in the #245
plan that the bounded-cap design supersedes its COUNTER-table +
reconciliation-job future-work item (a counter is not idempotent under
JetStream redelivery; the cap stays inside the stateless model).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn
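The idempotency argument against a counter can be illustrated with a toy model of redelivery. Everything here is hypothetical: the `store` type keeps replies in a map keyed by ID (so a repeated insert is a no-op) and tracks both a counter-style value and a recount-then-set value side by side.

```go
package main

import "fmt"

// store is a hypothetical in-memory model of the two approaches.
type store struct {
	replies map[string]bool // reply ID -> present; re-inserting is a no-op
	counter int             // counter-style tcount: incremented per delivery
	tcount  int             // recount + blind-SET tcount
}

func newStore() *store {
	return &store{replies: make(map[string]bool)}
}

// handle processes one delivery of a reply message. A redelivered message
// calls handle again with the same ID.
func (s *store) handle(replyID string) {
	s.replies[replyID] = true
	s.counter++               // not idempotent: every delivery adds one
	s.tcount = len(s.replies) // idempotent: derived from the stored rows
}

func main() {
	s := newStore()
	s.handle("r1")
	s.handle("r1") // redelivery of the same message
	fmt.Println(s.counter, s.tcount) // 2 1
}
```

The recount path converges on the same value no matter how many times a message is delivered, which is the property the bounded scan preserves while limiting how much it reads.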
@hmchangw hmchangw force-pushed the claude/message-gateway-bottleneck-is4kqg branch from 63b08e5 to 43a3c28 Compare June 24, 2026 13:36
@hmchangw hmchangw added the ready label Jun 25, 2026

@Joey0538 Joey0538 left a comment

Collaborator


LGTM! 🚀

@hmchangw hmchangw merged commit a27f2b0 into main Jun 25, 2026
8 checks passed
@hmchangw hmchangw deleted the claude/message-gateway-bottleneck-is4kqg branch June 25, 2026 09:02