perf(tcount): bound thread reply count recount to a constant by hmchangw · Pull Request #370 · hmchangw/chat

hmchangw · 2026-06-22T07:34:12Z

Summary

On the message hot path, the thread reply-count badge (tcount) was produced by re-counting the entire thread_messages_by_thread Cassandra partition on every reply and every delete. That is an O(thread-size) scan per write, or O(N²) in total to build a thread of N replies. The scan was duplicated byte-for-byte in message-worker (add path) and history-service (delete path), and it sat on message-worker's synchronous JetStream-ack path.

This PR bounds that recount by stopping the scan once the non-deleted tally reaches a display cap of 99. The logic is extracted into a shared pkg/threadcount package so the two authoritative writers can never drift.

- pkg/threadcount: new Count(ctx, *gocql.Session, threadRoomID) (int, error) returning min(non-deleted replies, Cap) with const Cap = 99. It breaks out of the scan at the cap (PageSize(Cap) plus an n < Cap loop guard), so it materializes ~Cap surviving rows, plus any soft-deleted rows clustered ahead of them, instead of the whole partition.
- message-worker / history-service: each countThreadReplies is now a one-line delegation to the shared helper (threadcount.Count on the add path; threadcount.CountAndLatest on the delete path, which also recomputes tlm). The blind-SET of tcount and the countAndSetParentTcount callers are unchanged.
- Per-write cost: O(thread-size) → O(99) when soft-deletes are sparse. Below 99 the count is exact; at or above 99 it is 99 (the FE renders >= 99 as "99+").
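
The early-break described in the list above can be sketched without a database. This is a minimal illustration under stated assumptions, not the code in pkg/threadcount: the `reply` type, `boundedCount`, and `live` are hypothetical names, and an in-memory slice stands in for the gocql iterator over the partition.

```go
package main

import "fmt"

// Cap mirrors the display cap named in the PR description.
const Cap = 99

// reply is a hypothetical stand-in for one row of thread_messages_by_thread.
// Deleted is a pointer so a NULL column (nil) can be told apart from false.
type reply struct {
	Deleted *bool
}

// boundedCount returns min(non-deleted replies, limit) and stops reading as
// soon as the tally reaches limit. It also reports how many rows it looked
// at, which is the per-write cost the PR is bounding.
func boundedCount(rows []reply, limit int) (count, scanned int) {
	for _, r := range rows {
		if count >= limit {
			break // cap reached: do not read the rest of the partition
		}
		scanned++
		if r.Deleted != nil && *r.Deleted {
			continue // soft-deleted rows are read but not counted
		}
		count++ // NULL or false both count as a surviving reply
	}
	return count, scanned
}

// live builds n surviving replies (nil Deleted means not deleted).
func live(n int) []reply { return make([]reply, n) }

func main() {
	for _, n := range []int{0, 42, 99, 5000} {
		c, s := boundedCount(live(n), Cap)
		fmt.Printf("thread size %4d -> tcount %2d (rows scanned: %d)\n", n, c, s)
	}
}
```

With no soft-deleted rows, a 5000-reply thread is counted after reading 99 rows, while threads below the cap are still counted exactly.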

Why a cap (and not the documented COUNTER table)

tcount was deliberately moved to COUNT+blind-SET in #245 because that approach is idempotent under JetStream redelivery and soft-delete-aware. Both properties are preserved here: no LIMIT is used, since soft-deleted rows are live rows interspersed in the partition, and the scan reads the deleted column and treats NULL as not-deleted. #245's documented follow-up (a Cassandra counter table + reconciliation job) would re-introduce the non-idempotency it eliminated and add a scheduled job; this design instead stays inside the idempotent, stateless model with no new table, job, or source of truth. The #245 plan's future-work item is updated to point here.
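
The idempotency argument can be illustrated with a small simulation. This is only a sketch under assumptions: `parent`, `applyCounter`, `applyBlindSet`, and `deliver` are hypothetical names, and redelivery is modelled as simply invoking the same handler twice.

```go
package main

import "fmt"

// parent is a hypothetical stand-in for the parent message row that
// carries the tcount badge value.
type parent struct{ tcount int }

// applyCounter models a counter-style write: every delivery adds one.
func applyCounter(p *parent) { p.tcount++ }

// applyBlindSet models COUNT + blind-SET: recount the thread, then
// overwrite tcount with the result regardless of its previous value.
func applyBlindSet(p *parent, liveReplies int) { p.tcount = liveReplies }

// deliver invokes the handler `attempts` times, as an at-least-once
// consumer would when an ack is lost and the message is redelivered.
func deliver(attempts int, handle func()) {
	for i := 0; i < attempts; i++ {
		handle()
	}
}

func main() {
	const liveReplies = 3 // thread state once the new reply is stored

	counter := &parent{tcount: 2}
	deliver(2, func() { applyCounter(counter) })

	blind := &parent{tcount: 2}
	deliver(2, func() { applyBlindSet(blind, liveReplies) })

	fmt.Println("counter after redelivery:  ", counter.tcount) // drifts to 4
	fmt.Println("blind-SET after redelivery:", blind.tcount)   // stays 3
}
```

The counter drifts by one for each duplicate delivery, whereas the recount-and-overwrite write converges on the same value however many times it runs.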

No schema/DDL change, no migration, no new dependency.

Changes

- pkg/threadcount/: new package + integration tests (under/over cap, deleted excluded, deleted-interspersed-over-cap, empty).
- message-worker/store_cassandra.go, history-service/internal/cassrepo/write.go: delegate to the helper; capping integration test at each site.
- Docs: cassandra_message_model.md (tcount column), client-api.md (tcount/newTcount fields), and the superseded note in the plan for #245 (feat: real-time thread reply fan-out (broadcast-worker) + reply-count badge pipeline).

Test Plan

make test (full unit suite) — green
make lint — 0 issues
gosec — PASS
Integration suites must run in CI — make test-integration SERVICE=pkg/threadcount, …SERVICE=message-worker, …SERVICE=history-service. These were written and compile-verified under the integration build tag but could not be executed in the dev environment (no Docker).
govulncheck / semgrep must run in CI — blocked locally by the network policy (403 to their registries).

🤖 Generated with Claude Code

https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn

Generated by Claude Code

Summary by CodeRabbit

New Features
- Thread reply counts are now capped at 99; values at/over the cap render as “99+”.
- Reply add/delete flows use the same bounded counting behavior, including updating the latest surviving reply timestamp where applicable.
Bug Fixes
- Thread badge totals remain consistent for threads with soft-deleted replies, including when reply volume exceeds the cap.
Documentation
- Updated schema and client API docs to clarify tcount/newTcount are capped at 99 (where 99 means “99 or more”).
Tests
- Added integration coverage for exact counting, capping behavior, deleted-row exclusion, and latest-timestamp semantics.
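
The "99 means 99 or more" contract implies a small rendering rule on the client. The sketch below shows that rule in Go for consistency with the rest of the PR; the actual frontend is not shown in this conversation, and `badgeLabel` and `displayCap` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
)

// displayCap has the same value as threadcount.Cap in the PR.
const displayCap = 99

// badgeLabel is a hypothetical client-side helper: a capped tcount is
// shown as "99+", anything below the cap is shown as the exact number.
func badgeLabel(tcount int) string {
	if tcount >= displayCap {
		return strconv.Itoa(displayCap) + "+"
	}
	return strconv.Itoa(tcount)
}

func main() {
	for _, n := range []int{1, 98, 99} {
		fmt.Printf("tcount=%d -> badge %q\n", n, badgeLabel(n))
	}
}
```

Because the server never writes a value above 99, the client cannot distinguish 99 replies from 500, so the label has to switch at exactly the cap.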

coderabbitai · 2026-06-22T07:34:21Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f4a9e51-c6bb-46dd-b9f3-e37cfa72fe76

📥 Commits

Reviewing files that changed from the base of the PR and between 769176c and 43a3c28.

📒 Files selected for processing (12)

docs/cassandra_message_model.md
docs/client-api.md
docs/superpowers/plans/2026-06-04-tcount-count-based.md
docs/superpowers/plans/2026-06-21-bounded-thread-reply-count.md
docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md
history-service/internal/cassrepo/write.go
history-service/internal/cassrepo/write_integration_test.go
message-worker/integration_test.go
message-worker/store_cassandra.go
pkg/threadcount/count.go
pkg/threadcount/integration_test.go
pkg/threadcount/main_test.go

✅ Files skipped from review due to trivial changes (4)

docs/cassandra_message_model.md
docs/superpowers/plans/2026-06-04-tcount-count-based.md
docs/client-api.md
docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md

🚧 Files skipped from review as they are similar to previous changes (5)

pkg/threadcount/main_test.go
history-service/internal/cassrepo/write_integration_test.go
history-service/internal/cassrepo/write.go
message-worker/store_cassandra.go
message-worker/integration_test.go

📝 Walkthrough

Walkthrough

Adds a shared bounded thread-reply counting helper, switches both writer paths to use it, and updates the related Cassandra, API, design, and implementation-plan documentation to describe the capped tcount semantics.

Changes

Bounded Thread Reply Count

| Layer / File(s) | Summary |
| --- | --- |
| Design spec and superseded plan docs `docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md`, `docs/superpowers/plans/2026-06-04-tcount-count-based.md` | The bounded-count design, helper contract, write-site roles, test scope, and rollout constraints are documented, and the older COUNTER-table plan is marked superseded. |
| Implementation plan document `docs/superpowers/plans/2026-06-21-bounded-thread-reply-count.md` | The step-by-step implementation plan covers the helper package, both writer integrations, documentation updates, and verification steps. |
| pkg/threadcount: Cap constant and bounded helpers `pkg/threadcount/count.go`, `pkg/threadcount/integration_test.go`, `pkg/threadcount/main_test.go` | `Cap = 99` is defined, `Count` delegates to `CountAndLatest`, and the bounded Cassandra scan tracks the latest surviving reply timestamp. Integration tests cover capped, deleted, empty, and latest-timestamp cases. |
| message-worker: delegate countThreadReplies to threadcount.Count `message-worker/store_cassandra.go`, `message-worker/integration_test.go` | `countThreadReplies` now calls `threadcount.Count`, and the integration test asserts the returned count is capped at `threadcount.Cap`. |
| history-service: delegate countThreadReplies to threadcount.CountAndLatest `history-service/internal/cassrepo/write.go`, `history-service/internal/cassrepo/write_integration_test.go` | `countThreadReplies` now calls `threadcount.CountAndLatest`, and the integration test asserts the capped count and latest surviving timestamp behavior. |
| Client API and Cassandra schema doc updates `docs/cassandra_message_model.md`, `docs/client-api.md` | The `tcount` and `newTcount` docs now state the count is capped at 99 and that `99` means “99 or more,” with `"99+"` as the frontend rendering. |

Sequence Diagram(s)

sequenceDiagram
  participant MessageWorker
  participant HistoryService
  participant threadcount
  participant Cassandra

  MessageWorker->>threadcount: Count(ctx, session, threadRoomID)
  threadcount->>Cassandra: Query thread_messages_by_thread
  Cassandra-->>threadcount: rows up to Cap
  threadcount-->>MessageWorker: bounded count

  HistoryService->>threadcount: CountAndLatest(ctx, session, threadRoomID)
  threadcount->>Cassandra: Query deleted and created_at
  Cassandra-->>threadcount: rows up to Cap
  threadcount-->>HistoryService: bounded count + latest surviving timestamp

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

hmchangw/chat#22: Introduces the Cassandra message model docs that this PR updates for the capped tcount semantics.
hmchangw/chat#93: Adds the original tcount column documentation that this PR refines to the bounded 99/"99+" behavior.
hmchangw/chat#354: Touches the same thread-counting and parent-message update paths in message-worker and history-service.

Suggested labels

ready

Suggested reviewers

mliu33
ngangwar962

Poem

🐇 I hopped through rows and counted neat,
Stopped at ninety-nine — what a feat!
Now 99+ shines bright and clear,
With one shared helper keeping near.
Soft-deleted shadows don’t confuse my feet.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: bounding tcount recounts with a fixed cap. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (2)

pkg/threadcount/count.go (1)
1-6: 🚀 Performance & Scalability | 🔵 Trivial

Doc claim of "constant" per-write cost is only true when soft-deletes are sparse.

The early-stop counts non-deleted rows, but with no CQL LIMIT the iterator must page through every soft-deleted row that clusters ahead of the Cap-th survivor. A thread that accumulates many deleted = true rows (or fewer than Cap live replies among a large tombstoned partition) forces a full-partition read on every reply/delete — i.e. cost is O(deleted + Cap), not constant. This is the inherent trade-off of dropping LIMIT to stay soft-delete-correct, but the package doc ("per-write cost stays constant regardless of thread size") and Count's "~Cap rows" wording understate it.

Consider softening the doc to reflect the O(survivors-scanned + interspersed deletes) worst case, and confirm soft-deleted rows in thread_messages_by_thread are bounded/compacted so this hot path can't degrade over a thread's lifetime.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/threadcount/count.go` around lines 1 - 6, Update the package comment in
threadcount to avoid claiming the per-write cost is constant regardless of
thread size, since Count can still scan many tombstoned rows before reaching Cap
live replies. Soften the wording around Count’s “~Cap rows” behavior to reflect
the worst case when soft-deletes are sparse or clustered, and mention the
trade-off introduced by omitting CQL LIMIT for soft-delete correctness.
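
The trade-off raised in this comment can be made concrete with a small in-memory comparison. This is a hedged sketch, not the repository's code: `row`, `countWithLimit`, `countEarlyBreak`, and `partition` are hypothetical names, and the slice is assumed to be in the order Cassandra would return rows.

```go
package main

import "fmt"

// row is a hypothetical stand-in for one partition row.
type row struct{ deleted bool }

// countWithLimit models a CQL LIMIT: only the first `limit` rows are
// returned, and soft-deleted rows among them are then discarded.
func countWithLimit(rows []row, limit int) (count, scanned int) {
	for _, r := range rows {
		if scanned == limit {
			break
		}
		scanned++
		if !r.deleted {
			count++
		}
	}
	return count, scanned
}

// countEarlyBreak models the approach in this PR: no LIMIT, stop only
// once `limit` surviving rows have been seen.
func countEarlyBreak(rows []row, limit int) (count, scanned int) {
	for _, r := range rows {
		if count == limit {
			break
		}
		scanned++
		if !r.deleted {
			count++
		}
	}
	return count, scanned
}

// partition builds deletedAhead soft-deleted rows followed by live rows.
func partition(deletedAhead, live int) []row {
	rows := make([]row, 0, deletedAhead+live)
	for i := 0; i < deletedAhead; i++ {
		rows = append(rows, row{deleted: true})
	}
	for i := 0; i < live; i++ {
		rows = append(rows, row{})
	}
	return rows
}

func main() {
	p := partition(1000, 200)
	c, s := countWithLimit(p, 99)
	fmt.Printf("LIMIT 99:    count=%d scanned=%d\n", c, s) // count=0 scanned=99
	c, s = countEarlyBreak(p, 99)
	fmt.Printf("early break: count=%d scanned=%d\n", c, s) // count=99 scanned=1099
}
```

With 1000 soft-deleted rows clustered ahead of 200 survivors, the LIMIT variant reads 99 rows and reports 0, while the early-break variant reports the correct capped value at the cost of reading 1099 rows, which is the O(deleted + Cap) worst case described above.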
history-service/internal/cassrepo/write_integration_test.go (1)
1679-1695: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Consider asserting the returned tlm (currently discarded).

This test ignores the second return value (_), so the over-cap tlm derivation in CountAndLatest is exercised but never validated. Asserting that tlm equals the newest inserted reply's created_at even when row count exceeds Cap would directly cover the clustering-order dependency flagged in write.go.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@history-service/internal/cassrepo/write_integration_test.go` around lines
1679 - 1695, The test for countThreadReplies currently drops the second return
value, so it never verifies the latest-message timestamp path in CountAndLatest.
Update TestRepository_countThreadReplies_CapsAtThreadcountCap to assert the
returned tlm from countThreadReplies equals the newest inserted reply’s
created_at, using the existing countThreadReplies and threadcount.Cap setup to
cover the over-cap clustering-order behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/superpowers/plans/2026-06-21-bounded-thread-reply-count.md`:
- Around line 417-425: Add explicit language identifiers to the fenced examples
in the affected markdown sections so they satisfy markdownlint MD040; update the
code fences around the example snippets in this document to use the appropriate
fence labels (for example, SQL for the schema snippets and Markdown where
applicable), and apply the same fix to the other referenced example blocks in
the same document.
- Line 7: The plan text is inconsistent for history-service: it currently tells
the delete path to use threadcount.Count, but that path must preserve tlm
updates by using threadcount.CountAndLatest instead. Update Task 3 and every
architecture/reference section that mentions history-service delegation so
history-service explicitly calls CountAndLatest while message-worker can still
use Count, and ensure the documented return shape includes the latest timestamp
where needed.

In `@docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md`:
- Around line 125-135: The shared-helper contract is inconsistent with the
delete path: the history-service delete flow still relies on
threadcount.CountAndLatest to recompute tlm, not just threadcount.Count. Update
the spec wording and example API around the delete-path helper to reference
threadcount.CountAndLatest for the delete flow while keeping threadcount.Count
as the shared display-cap counter, so the contract matches the actual writer
behavior.
- Around line 10-12: The fenced CQL snippet in the design doc is missing a
language identifier, so update that markdown block to use the appropriate SQL
fence for the query. Locate the fenced block containing the SELECT statement and
change it to a properly labeled sql code block to satisfy markdown linting.

In `@pkg/threadcount/integration_test.go`:
- Around line 24-30: The flagged `admin.Query(fmt.Sprintf(...))` in
`threadcount` is a false-positive SQL/CQL injection finding because `keyspace`
comes from `testutil.CassandraKeyspace` and cannot be parameterized as an
identifier. Add the scanner’s inline suppression directive with a brief
justification directly above the `fmt.Sprintf` statement in the test setup so
the SAST gate passes, keeping the suppression scoped only to this `CREATE TABLE`
query.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 15182611-7ba4-4617-ba4e-7e61001ab487

📥 Commits

Reviewing files that changed from the base of the PR and between a6c62d9 and 769176c.

📒 Files selected for processing (12)

docs/cassandra_message_model.md
docs/client-api.md
docs/superpowers/plans/2026-06-04-tcount-count-based.md
docs/superpowers/plans/2026-06-21-bounded-thread-reply-count.md
docs/superpowers/specs/2026-06-21-bounded-thread-reply-count-design.md
history-service/internal/cassrepo/write.go
history-service/internal/cassrepo/write_integration_test.go
message-worker/integration_test.go
message-worker/store_cassandra.go
pkg/threadcount/count.go
pkg/threadcount/integration_test.go
pkg/threadcount/main_test.go

hmchangw · 2026-06-24T11:24:53Z

Addressed the review feedback in e62bb10:

Fixed

Spec + plan docs: corrected the delete-path contract to threadcount.CountAndLatest (it also recomputes tlm), not Count. Updated the design spec's helper contract, the plan's Architecture/File-Structure/Task-3, and added the sql fence label.
pkg/threadcount/count.go: softened the package doc — per-write cost is bounded to ~Cap surviving rows plus any soft-deleted rows clustered ahead, not "constant regardless of thread size".
history-service cap test: now asserts the over-cap tlm equals the newest reply, covering the DESC-clustering path through the real delegation.

Skipped (with reasons)

SAST sql-injection suppression on the test CREATE TABLE: the repo's sast gate already passes on this — interpolating the testutil.CassandraKeyspace identifier into DDL is the standard pattern in every integration test in the repo (CQL can't bind identifiers), and adding a nosemgrep directive the rest of the suite doesn't use would be inconsistent. The flagged rule is CodeRabbit's OpenGrep config, not the repo's blocking scanner.
MD040 fence-language nits in the planning docs: markdownlint isn't a repo CI gate, and the flagged fences are before/after snippet fragments. (Did add the one real CQL block label in the spec.)
Docstring-coverage 45% pre-merge warning: the exported API (Count, CountAndLatest, Cap) is documented; the percentage counts test functions/helpers, which Go idiom leaves un-docstringed.
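
On the skipped SAST item above: since CQL bind markers cannot stand in for keyspace or table names, the identifier has to be interpolated into the DDL string. One way to keep that pattern defensible is to validate the identifier before formatting. The sketch below is an assumption-laden illustration, not the repo's test setup: `createTableStmt` and `identRE` are hypothetical names, and the column list is an illustrative schema rather than the real table definition.

```go
package main

import (
	"fmt"
	"regexp"
)

// identRE accepts unquoted CQL identifiers: a letter followed by letters,
// digits, or underscores.
var identRE = regexp.MustCompile(`^[A-Za-z][A-Za-z0-9_]*$`)

// createTableStmt builds DDL for a test keyspace. The keyspace cannot be
// passed as a bound value, so it is validated and then interpolated.
// The columns here are illustrative only.
func createTableStmt(keyspace string) (string, error) {
	if !identRE.MatchString(keyspace) {
		return "", fmt.Errorf("invalid keyspace identifier %q", keyspace)
	}
	return fmt.Sprintf(`CREATE TABLE IF NOT EXISTS %s.thread_messages_by_thread (
	thread_room_id text,
	created_at timestamp,
	deleted boolean,
	PRIMARY KEY ((thread_room_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)`, keyspace), nil
}

func main() {
	stmt, err := createTableStmt("chat_test")
	fmt.Println(stmt, err)

	_, err = createTableStmt("chat; DROP KEYSPACE prod")
	fmt.Println(err)
}
```

Whether this is worth adding is a judgment call: when the keyspace comes from a trusted test helper, as the reply above notes, the check is purely defensive.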

Separately, the earlier test-integration (history-service) failure was a transient Cassandra-container flake in the internal/service package (unable to discover protocol version: EOF at keyspace setup, before any test logic) — unrelated to this change; my cassrepo package passed. The new push re-runs it.


Spec and TDD implementation plan for bounding the per-write tcount recount: keep #245's idempotent, soft-delete-aware COUNT+blind-SET but stop the partition scan at a display cap, extracted into a shared pkg/threadcount used by both authoritative writers (message-worker via Count, history-service via CountAndLatest which also recomputes tlm). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn

CountAndLatest scans thread_messages_by_thread, early-breaking once the non-deleted tally reaches Cap=99 (PageSize(Cap) + n<Cap guard), so the per-write read is bounded to ~Cap surviving rows (plus any soft-deleted rows clustered ahead) rather than the whole partition. No CQL LIMIT — soft-deleted rows are live, interspersed rows, so a hard LIMIT could undercount. It also returns the latest surviving reply's created_at (tlm) for the delete path; the partition's DESC clustering order surfaces the latest survivor first. Count is a thin delegation to CountAndLatest discarding the timestamp — the add path needs only the count, and since created_at is a clustering key, selecting it adds no meaningful read cost while one shared scan keeps the two writers' counts provably identical. Integration tests cover under/over cap, deleted excluded (incl. over-cap interspersed), empty, and latest-survivor cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn
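
The relationship between the two helpers in this commit message can be sketched as follows. This is an in-memory approximation under assumptions, not the actual pkg/threadcount code: the lower-case `countAndLatest`, `count`, and `sampleThread` names are hypothetical, rows are assumed to arrive newest-first (the DESC clustering order described above), and returning the zero time for an empty thread is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"time"
)

// Cap mirrors the display cap from the commit message.
const Cap = 99

// reply is a hypothetical stand-in for one row of the partition.
type reply struct {
	CreatedAt time.Time
	Deleted   bool
}

// countAndLatest walks rows newest-first, stops at Cap survivors, and
// records the first survivor's CreatedAt as the latest surviving reply.
func countAndLatest(rowsDesc []reply) (int, time.Time) {
	var (
		n      int
		latest time.Time // zero value when the thread has no survivors
	)
	for _, r := range rowsDesc {
		if n >= Cap {
			break
		}
		if r.Deleted {
			continue
		}
		if n == 0 {
			latest = r.CreatedAt // DESC order: first survivor is the newest
		}
		n++
	}
	return n, latest
}

// count is the thin delegation used by the add path: same scan, the
// timestamp is discarded.
func count(rowsDesc []reply) int {
	n, _ := countAndLatest(rowsDesc)
	return n
}

// sampleThread builds n live replies, newest first, one minute apart.
func sampleThread(n int) []reply {
	base := time.Date(2026, 6, 22, 12, 0, 0, 0, time.UTC)
	rows := make([]reply, n)
	for i := range rows {
		rows[i].CreatedAt = base.Add(-time.Duration(i) * time.Minute)
	}
	return rows
}

func main() {
	rows := sampleThread(150)
	rows[0].Deleted = true // the newest reply was just soft-deleted

	n, tlm := countAndLatest(rows)
	fmt.Println("tcount:", n, "tlm:", tlm.Format(time.RFC3339))
	fmt.Println("add-path count:", count(rows))
}
```

Because both paths run the same loop, a reply and a delete on the same thread state cannot compute different counts, which is the property the commit message relies on.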

Delegate the add-path countThreadReplies to threadcount.Count, replacing the unbounded partition scan. The blind-SET (now tcount+tlm) and the countAndSetParentTcount caller are unchanged; tlm on the add path stays the new reply's CreatedAt. Adds an integration test (using the existing setupCassandra helper) asserting the count caps at 99. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn

Delegate the delete-path countThreadReplies to threadcount.CountAndLatest, which bounds the count at the same Cap as the add path (so a reply and a delete can't write divergent values and flip-flop the badge) while still recomputing tlm (latest surviving reply) from the one bounded scan. The capping integration test (using the existing setupCassandra helper) also asserts the over-cap tlm is the newest reply. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn

Note the 99 cap on the tcount column (messages_by_room, messages_by_id) and on the tcount/newTcount client-api fields, and record in the #245 plan that the bounded-cap design supersedes its COUNTER-table + reconciliation-job future-work item (a counter is not idempotent under JetStream redelivery; the cap stays inside the stateless model). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01HvVrpPq7875QKs9JqmgCpn

Joey0538

LGTM! 🚀

hmchangw force-pushed the claude/message-gateway-bottleneck-is4kqg branch 3 times, most recently from 769176c to 4fde085 Compare June 24, 2026 11:15

coderabbitai Bot reviewed Jun 24, 2026


hmchangw force-pushed the claude/message-gateway-bottleneck-is4kqg branch from 4fde085 to e62bb10 Compare June 24, 2026 11:24

hmchangw force-pushed the claude/message-gateway-bottleneck-is4kqg branch from e62bb10 to 63b08e5 Compare June 24, 2026 11:44

claude added 5 commits June 24, 2026 13:36

hmchangw force-pushed the claude/message-gateway-bottleneck-is4kqg branch from 63b08e5 to 43a3c28 Compare June 24, 2026 13:36

hmchangw added the ready label Jun 25, 2026

Joey0538 approved these changes Jun 25, 2026


hmchangw merged commit a27f2b0 into main Jun 25, 2026
8 checks passed

hmchangw deleted the claude/message-gateway-bottleneck-is4kqg branch June 25, 2026 09:02
