feat(search-sync-worker): add spotlight + user-room sync collections#78

Closed
Joey0538 wants to merge 1 commit into main from claude/room-sync-spotlight-GzOca

Conversation

Joey0538 (Collaborator) commented Apr 14, 2026

Summary

Adds two Collection implementations to search-sync-worker that consume member_added / member_removed events from the INBOX stream and maintain the spotlight (room typeahead) and user-room (message-search access control) Elasticsearch indexes. Replaces the old Monstache-based CDC sync for these two indexes with the existing OUTBOX/INBOX federation pipeline.

Index naming (overridable via env):

  • spotlight-{site}-v1-chat — one doc per subscription, search via roomName typeahead
  • user-room-{site} — one doc per user, holding a rooms[] array used as a terms filter on message search

What's in this PR

New collections

  • spotlightCollection (search-sync-worker/spotlight.go)

    • Per-subscription docs keyed by Subscription.ID
    • member_added → ActionIndex with Version = evt.Timestamp (external versioning makes out-of-order delivery safe)
    • member_removed → ActionDelete with Version = evt.Timestamp
    • Template pattern spotlight-* with search_as_you_type on roomName via a whitespace/lowercase custom analyzer
  • userRoomCollection (search-sync-worker/user_room.go) — multi-pod safe

    • One doc per user, keyed by user account
    • member_added → ActionUpdate with painless script + upsert
    • member_removed → ActionUpdate with painless script (no upsert)
    • Restricted rooms (Subscription.HistorySharedSince != nil) → skipped; the search service handles those via DB+cache at query time
    • Per-room LWW guard: each user doc carries a flattened roomTimestamps map. Both scripts read the stored timestamp, compare to params.ts, and short-circuit via ctx.op = 'none' on stale events. ES primary-shard atomicity + this guard make user-room-sync safe to run with multiple pods sharing the durable consumer.
    • Timestamp source: OutboxEvent.Timestamp (publish time), NOT Subscription.JoinedAt. JoinedAt is immutable on the subscription row, so add/remove for the same sub would carry the same value and confuse the guard.
    • Template pattern user-room-* maps rooms as text+keyword (existing query behavior preserved) and roomTimestamps as flattened to avoid mapping explosion as roomIds accumulate.
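The add-path guard described above can be sketched as a scripted update with upsert. This is an illustrative Go sketch, not the actual user_room.go source: the Painless script body, field names (rooms, roomTimestamps, updatedAt), and param names are assumptions based on the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical sketch of the per-room LWW guard: the script reads the stored
// per-room timestamp, compares it to params.ts, and short-circuits via
// ctx.op = 'none' on stale events.
const addRoomScript = `
if (ctx._source.roomTimestamps == null) { ctx._source.roomTimestamps = [:]; }
def prev = ctx._source.roomTimestamps.get(params.roomId);
if (prev != null && prev >= params.ts) { ctx.op = 'none'; return; }
ctx._source.roomTimestamps.put(params.roomId, params.ts);
if (!ctx._source.rooms.contains(params.roomId)) { ctx._source.rooms.add(params.roomId); }
ctx._source.updatedAt = params.ts;
`

// buildAddRoomUpdateBody assembles the scripted-update-with-upsert body for a
// member_added event; the upsert covers the first event ever seen for a user.
func buildAddRoomUpdateBody(roomID string, ts int64) map[string]any {
	return map[string]any{
		"script": map[string]any{
			"source": addRoomScript,
			"lang":   "painless",
			"params": map[string]any{"roomId": roomID, "ts": ts},
		},
		"upsert": map[string]any{
			"rooms":          []string{roomID},
			"roomTimestamps": map[string]int64{roomID: ts},
		},
	}
}

func main() {
	b, _ := json.Marshal(buildAddRoomUpdateBody("room-1", 1700000000))
	fmt.Println(string(b))
}
```

The matching remove script would perform the same params.ts comparison but pull the room from rooms while keeping the roomTimestamps entry, and would carry no upsert.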

Collection interface changes

  • BuildAction now returns []searchengine.BulkAction so a single JetStream message can fan out to zero, one, or multiple ES actions. The handler tracks per-message action ranges and acks/naks each source message as a unit.
  • New FilterSubjects(siteID) method so inbox-based collections can subscribe to both local (chat.inbox.{site}.member_*) and federated (chat.inbox.{site}.aggregate.member_*) variants via NATS 2.10+ consumer FilterSubjects.
  • StreamConfig returns jetstream.StreamConfig directly, with the canonical name + subjects sourced from pkg/stream.* so collections never redefine stream names locally.
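Under those changes, the Collection contract might look roughly like this. The sketch below uses simplified stand-in types for jetstream.StreamConfig and searchengine.BulkAction, and demoCollection is hypothetical, not code from this PR.

```go
package main

import "fmt"

// StreamConfig and BulkAction are simplified stand-ins for the real
// jetstream.StreamConfig and searchengine.BulkAction types.
type StreamConfig struct {
	Name     string
	Subjects []string
}

type BulkAction struct {
	Action string
	Index  string
	DocID  string
}

// Collection is the revised interface: StreamConfig returns the canonical
// stream definition, FilterSubjects lists the durable consumer's subject
// filters (NATS 2.10+ FilterSubjects), and BuildAction fans one JetStream
// message out to zero, one, or many bulk actions.
type Collection interface {
	StreamConfig() StreamConfig
	FilterSubjects(siteID string) []string
	BuildAction(data []byte) ([]BulkAction, error)
}

// demoCollection illustrates the fan-out contract: irrelevant messages return
// an empty slice so the handler can ack them without emitting ES actions.
type demoCollection struct{}

func (demoCollection) StreamConfig() StreamConfig {
	return StreamConfig{Name: "INBOX_site-a", Subjects: []string{"chat.inbox.site-a.*"}}
}

func (demoCollection) FilterSubjects(siteID string) []string {
	return []string{
		fmt.Sprintf("chat.inbox.%s.member_added", siteID),
		fmt.Sprintf("chat.inbox.%s.aggregate.member_added", siteID),
	}
}

func (demoCollection) BuildAction(data []byte) ([]BulkAction, error) {
	if len(data) == 0 {
		return nil, nil // zero actions: source message is acked as a no-op
	}
	return []BulkAction{{Action: "index", Index: "spotlight-site-a-v1-chat", DocID: "sub-1"}}, nil
}

func main() {
	var c Collection = demoCollection{}
	actions, _ := c.BuildAction([]byte(`{}`))
	fmt.Println(len(actions), c.FilterSubjects("site-a"))
}
```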

Shared bits

  • inboxMemberCollection base struct centralizes StreamConfig + FilterSubjects for spotlight and user-room (zero per-instance state).
  • parseMemberEvent helper decodes OutboxEvent + MemberAddedPayload and validates preconditions shared by both inbox-member collections.
  • esPropertiesFromStruct[T any] generic consolidates template-mapping reflection — used by both messages and spotlight.

pkg/searchengine

  • New ActionUpdate type. Bulk adapter emits a plain update meta without version / version_type because _update is read-modify-write and ES rejects external versioning on it (true for both doc-merge and scripted updates — not specific to painless).
  • ActionIndex / ActionDelete still use external versioning for spotlight + messages idempotency.

pkg/stream

  • Inbox(siteID) now returns the full canonical ConfigName = INBOX_{siteID} and two non-overlapping subject patterns: chat.inbox.{site}.* (local direct publishes) and chat.inbox.{site}.aggregate.> (federated events sourced from remote OUTBOX streams via SubjectTransform). Centralizes every stream name + subject pattern so any consumer can read off what they're binding to and what the schema is.
  • Non-breaking change for inbox-worker: it reads only .Name from the returned Config, so adding .Subjects doesn't affect its current behavior.
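A minimal sketch of the Inbox(siteID) shape described above, using a stand-in StreamConfig type (the real function returns jetstream.StreamConfig):

```go
package main

import "fmt"

// StreamConfig stands in for jetstream.StreamConfig (name + subjects only).
type StreamConfig struct {
	Name     string
	Subjects []string
}

// Inbox returns the canonical INBOX stream config for a site: local direct
// publishes plus federated events transformed under the aggregate prefix.
// The two patterns do not overlap, so a message matches exactly one.
func Inbox(siteID string) StreamConfig {
	return StreamConfig{
		Name: fmt.Sprintf("INBOX_%s", siteID),
		Subjects: []string{
			fmt.Sprintf("chat.inbox.%s.*", siteID),           // local direct publishes
			fmt.Sprintf("chat.inbox.%s.aggregate.>", siteID), // federated, via SubjectTransform
		},
	}
}

func main() {
	cfg := Inbox("site-a")
	fmt.Println(cfg.Name, cfg.Subjects)
}
```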

pkg/model

  • New MemberAddedPayload{Subscription, Room} — the payload shape carried by OutboxEvent{Type: "member_added"} so inbox-member consumers can index without a DB lookup.
  • OutboxMemberAdded / OutboxMemberRemoved constants replace stringly-typed literals throughout the new code.

pkg/subject

  • New InboxMemberAdded / InboxMemberRemoved builders for local-publish subjects.
  • New InboxMemberAddedAggregate / InboxMemberRemovedAggregate for federated (transformed) subjects.
  • InboxMemberEventSubjects(siteID) returns the four-subject list used by spotlight and user-room consumer filters.
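Assuming the subject patterns named elsewhere in this PR, the builders might look like the following sketch (function bodies are assumptions, not the actual pkg/subject source):

```go
package main

import "fmt"

// Local-publish subject builders.
func InboxMemberAdded(siteID string) string {
	return fmt.Sprintf("chat.inbox.%s.member_added", siteID)
}

func InboxMemberRemoved(siteID string) string {
	return fmt.Sprintf("chat.inbox.%s.member_removed", siteID)
}

// Federated (transformed) subject builders.
func InboxMemberAddedAggregate(siteID string) string {
	return fmt.Sprintf("chat.inbox.%s.aggregate.member_added", siteID)
}

func InboxMemberRemovedAggregate(siteID string) string {
	return fmt.Sprintf("chat.inbox.%s.aggregate.member_removed", siteID)
}

// InboxMemberEventSubjects returns the four-subject consumer filter list used
// by the spotlight and user-room collections.
func InboxMemberEventSubjects(siteID string) []string {
	return []string{
		InboxMemberAdded(siteID),
		InboxMemberRemoved(siteID),
		InboxMemberAddedAggregate(siteID),
		InboxMemberRemovedAggregate(siteID),
	}
}

func main() {
	fmt.Println(InboxMemberEventSubjects("site-a"))
}
```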

Bootstrap config (test-only, clearly grouped)

A nested bootstrapConfig struct groups the fields that are meaningful only in dev / integration tests. Env vars are all prefixed BOOTSTRAP_ so they're easy to spot in deployment manifests:

  • BOOTSTRAP_STREAMS: toggles CreateOrUpdateStream at startup. Leave false in production.
  • BOOTSTRAP_REMOTE_SITE_IDS: cross-site OUTBOX sources to attach to INBOX during bootstrap. Only consulted when BOOTSTRAP_STREAMS=true.

In production, streams are owned by their publisher services (message-gatekeeper for MESSAGES_CANONICAL, inbox-worker for INBOX) and search-sync-worker only manages its own durable consumers. Collections hold no remote-site state — the bootstrap loop in main.go detects the INBOX stream by comparing against stream.Inbox(cfg.SiteID).Name and swaps in inboxBootstrapStreamConfig (which layers on cross-site Sources + SubjectTransforms) before calling CreateOrUpdateStream. Stream creation is deduped by name so spotlight + user-room don't double-create the shared INBOX stream.
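The detect-and-swap loop might be shaped roughly like the sketch below; the type and function names are stand-ins for the real main.go code, and appending a StreamConfig to a slice stands in for the CreateOrUpdateStream call.

```go
package main

import "fmt"

// StreamConfig is a simplified stand-in for jetstream.StreamConfig; Sources
// stands in for the cross-site Sources + SubjectTransforms layering.
type StreamConfig struct {
	Name     string
	Subjects []string
	Sources  []string
}

// bootstrapStreams dedupes stream creation by name (spotlight + user-room
// share INBOX) and swaps in the bootstrap variant for the INBOX stream,
// layering on one source per remote site.
func bootstrapStreams(cfgs []StreamConfig, inboxName string, remoteSites []string) []StreamConfig {
	created := map[string]bool{}
	var out []StreamConfig
	for _, sc := range cfgs {
		if created[sc.Name] {
			continue // already created: don't double-create the shared stream
		}
		created[sc.Name] = true
		if sc.Name == inboxName {
			for _, remote := range remoteSites {
				sc.Sources = append(sc.Sources, fmt.Sprintf("OUTBOX_%s", remote))
			}
		}
		out = append(out, sc) // stands in for CreateOrUpdateStream(sc)
	}
	return out
}

func main() {
	cfgs := []StreamConfig{
		{Name: "INBOX_site-a"}, // spotlight's stream
		{Name: "INBOX_site-a"}, // user-room's stream (same name, deduped)
		{Name: "MESSAGES_CANONICAL"},
	}
	fmt.Println(len(bootstrapStreams(cfgs, "INBOX_site-a", []string{"site-b"})))
}
```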

Consumer durable names

Per-purpose, no more generic search-sync-worker:

  • message-sync (was search-sync-worker)
  • spotlight-sync
  • user-room-sync

Graceful shutdown waits on all three runConsumer goroutines via a doneChs slice.
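That shutdown pattern, sketched with illustrative names (the sleep stands in for the real consume loop):

```go
package main

import (
	"fmt"
	"time"
)

// runConsumer closes its done channel on exit so main can wait for it.
func runConsumer(name string, done chan struct{}) {
	defer close(done)
	time.Sleep(10 * time.Millisecond) // stands in for consuming until cancellation
	fmt.Println(name, "stopped")
}

// waitAll blocks until every done channel is closed and reports how many
// consumers it waited for.
func waitAll(doneChs []chan struct{}) int {
	n := 0
	for _, done := range doneChs {
		<-done
		n++
	}
	return n
}

func main() {
	names := []string{"message-sync", "spotlight-sync", "user-room-sync"}
	var doneChs []chan struct{}
	for _, name := range names {
		done := make(chan struct{})
		doneChs = append(doneChs, done)
		go runConsumer(name, done)
	}
	fmt.Println("drained", waitAll(doneChs), "consumers")
}
```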

Tests

  • Unit tests: spotlight_test.go, user_room_test.go, inbox_stream_test.go, plus model + subject + searchengine round-trip and adapter coverage. ~410 lines of new unit tests.
  • Integration tests (inbox_integration_test.go, ~540 lines, gated by //go:build integration):
    • TestSpotlightSyncIntegration — local + federated member_added, federated member_removed, doc shape verification
    • TestUserRoomSyncIntegration — multi-room joins, federated upsert for new user, remove keeps roomTimestamps entry, restricted-room skip path, createdAt/updatedAt stamping
    • TestUserRoomSync_LWWGuard — sequential subtests proving the per-room timestamp guard handles in-order and out-of-order deliveries (initial add → stale add no-op → stale remove no-op → newer remove evicts → re-add restores → another stale add no-op)

Scope notes

  • inbox-worker is intentionally NOT modified here. The enhanced INBOX behavior (publishing + consuming aggregate.* events, migrating the handler to the new MemberAddedPayload shape, owning stream creation in production) ships in a separate PR. The pkg/stream.Inbox change in this PR is additive — inbox-worker reads only .Name and is unaffected.
  • room-worker is intentionally NOT modified here. The publish-side migration (building MemberAddedPayload, routing by invitee's home site to local INBOX vs OUTBOX) is a separate PR coordinated with inbox-worker.

Test plan

  • make lint — 0 issues ✅
  • make test — all services green ✅
  • go vet -tags=integration ./search-sync-worker/... ./pkg/... — clean ✅
  • make test-integration SERVICE=search-sync-worker (requires Docker for testcontainers-go) — needs CI run
  • Manual smoke test in local dev with BOOTSTRAP_STREAMS=true:
    • Verify spotlight index gets a doc for each subscription on member_added
    • Verify user-room doc has correct rooms array after a sequence of adds + removes
    • Verify restricted room is NOT indexed in user-room
    • Verify spotlight typeahead query returns expected hits

Known sharp edges (out of scope, follow-ups)

  • Spotlight ActionDelete on a non-existent doc returns 404, which the handler currently treats as failure → infinite nak/retry. Only triggerable by a multi-publisher race that doesn't exist in our topology (JetStream preserves per-subject order from a single publisher), but worth a 2-line handler fix in a follow-up to treat 404 on ActionDelete as success.
  • user-room-sync with multiple pods: safe via the LWW guard for member-event volume. Documented in user_room.go doc comment. If volume ever exceeds the single-pod ceiling, the sharding strategy is also documented.

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Room member events now include room name and type information
    • Enhanced member event synchronization with improved lifecycle tracking
  • Improvements

    • Better bulk indexing operations with update support
    • Improved member add/remove event processing and search capabilities


coderabbitai Bot commented Apr 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR implements member event search synchronization and canonical event publishing. It adds RoomName and RoomType to MemberAddEvent, introduces bulk update action support in the search engine, extends the Collection interface with FilterSubjects() and slice-based BuildAction(), implements three new search-sync-worker collections (message, spotlight, user-room) for indexing member lifecycle events, refactors handler buffering to track action ranges per source message, and updates inbox/room-workers to publish canonical member events to dedicated subjects.

Changes

  • Event Model & Search Infrastructure (pkg/model/event.go, pkg/searchengine/searchengine.go, pkg/searchengine/adapter.go): Added RoomName and RoomType fields to MemberAddEvent with JSON/BSON serialization. Introduced ActionUpdate bulk action type; updated BulkAction semantics where Version is ignored for updates and Doc varies by action type. Enhanced adapter to emit Elasticsearch bulk update actions and populate ErrorType from bulk response items.
  • Subject & Acknowledgment Utilities (pkg/subject/subject.go, pkg/natsutil/ack.go, pkg/natsutil/ack_test.go): Added RoomCanonicalMemberAdded(), RoomCanonicalMemberRemoved(), and RoomCanonicalMemberEventSubjects() helper functions. Introduced Acker/Naker interfaces and Ack()/Nak() functions with structured error logging.
  • Search-Sync-Worker Collection Interface & Core (search-sync-worker/collection.go, search-sync-worker/room_member.go): Updated the Collection interface: StreamConfig() returns jetstream.StreamConfig, added FilterSubjects() method, changed BuildAction() to return []searchengine.BulkAction. Implemented roomMemberCollection with stream configuration and a member event parser supporting MemberAddEvent/MemberRemoveEvent.
  • Message Collection & Template Infrastructure (search-sync-worker/messages.go, search-sync-worker/messages_test.go, search-sync-worker/template.go): Updated messageCollection to return jetstream.StreamConfig, added FilterSubjects(), changed BuildAction() to return a slice. Renamed consumer to "message-sync". Added generic esPropertiesFromStruct() function for deriving Elasticsearch mappings from struct tags.
  • Spotlight Collection (search-sync-worker/spotlight.go): Implemented spotlightCollection for room typeahead indexing; parses member events and generates index/delete bulk actions. Added SpotlightSearchIndex schema with a search_as_you_type analyzer for room names and esPropertiesFromStruct-backed template generation.
  • User-Room Collection (search-sync-worker/user_room.go): Implemented userRoomCollection generating Elasticsearch bulk update actions for per-user room access control. Includes Painless scripts implementing last-write-wins semantics for add events and timestamp-guarded removal for delete events.
  • Handler Buffering & Bulk Logic (search-sync-worker/handler.go, search-sync-worker/handler_test.go): Refactored buffering from per-action to message-scoped tracking (pendingMsg recording action ranges). Changed Add() to accumulate actions and handle zero/error cases distinctly. Reworked Flush() to submit a flattened bulk request and ACK/NAK per source message with new isBulkItemSuccess() logic (2xx, 409, and context-aware 404 as success). Updated introspection methods to MessageCount() and ActionCount().
  • Worker Configuration & Bootstrap (search-sync-worker/main.go): Expanded config with SpotlightIndex, UserRoomIndex, FetchBatchSize, BulkBatchSize, BulkFlushInterval, and Bootstrap settings. Implemented a 3-collection workflow with conditional template upserts and per-collection durable consumers. Added validation for batch/flush parameters.
  • Inbox-Worker Member Event Publishing (inbox-worker/handler.go, inbox-worker/handler_test.go, inbox-worker/main.go, inbox-worker/integration_test.go): Updated NewHandler() to accept siteID. Modified handleMemberAdded() and handleMemberRemoved() to publish canonical events to RoomCanonicalMemberAdded()/RoomCanonicalMemberRemoved() subjects. Updated all tests and initialization sites accordingly.
  • Room-Worker Member Event Publishing (room-worker/handler.go, room-worker/handler_test.go): Enhanced invite, add-members, and remove-member flows to publish canonical member_added/member_removed events. Extended MemberAddEvent payloads with RoomName and RoomType fields in all relevant publish paths. Updated test expectations for additional canonical event publishes.

Sequence Diagram(s)

sequenceDiagram
    participant JS as JetStream Stream
    participant Handler as Handler (fetchMsg loop)
    participant Col as Collection (spotlight/user-room)
    participant ES as Elasticsearch Adapter
    participant Bulk as Bulk Request
    
    loop Fetch & Process
        Handler->>JS: Fetch up to bulkBatchSize capacity
        JS-->>Handler: MemberAddEvent/MemberRemoveEvent (raw JSON)
        Handler->>Col: BuildAction(data)
        Col->>Col: parseMemberEvent()
        Col-->>Handler: []searchengine.BulkAction
        Handler->>Handler: Append actions to h.actions<br/>Record (msg, actionStart, actionCount)
    end
    
    alt Flush triggered (size or interval)
        Handler->>Handler: Snapshot & clear pending/actions under lock
        Handler->>ES: Submit single bulk request with all actions
        ES-->>Handler: BulkResponse (per-item results)
        Handler->>Handler: Check each action's status<br/>(2xx, 409, context-aware 404)
        
        alt All actions in message succeeded
            Handler->>JS: Ack source message
        else Any action failed
            Handler->>JS: Nak source message (retry)
        end
    else Bulk error
        Handler->>JS: Nak all pending source messages
    end
sequenceDiagram
    participant Room as Room Worker
    participant Sub as Subscription Update
    participant Out as Outbox Publish
    participant Canon as Canonical Subject Publish
    participant IB as Inbox Worker
    
    Note over Room,IB: Member Add Flow
    Room->>Room: processAddMembers()
    Room->>Sub: Update member subscription
    Sub-->>Room: subscription created
    Room->>Out: Publish MemberAddEvent to outbox<br/>(with RoomName, RoomType)
    Room->>Canon: Publish MemberAddEvent to<br/>RoomCanonicalMemberAdded()
    Out-->>IB: Consume from outbox
    Canon-->>IB: Consume from canonical subject
    IB->>IB: handleMemberAdded()
    IB->>Canon: Re-publish to canonical<br/>RoomCanonicalMemberAdded(siteID)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • mliu33
  • hmchangw
  • yenta

Poem

🐰 Hopping through events with flair,
Spotlight members in the air!
Canonical subjects now aligned,
Bulk updates and acks combined—
Search-sync hops into the light!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 30.19%, below the required 80.00% threshold. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'feat(search-sync-worker): add spotlight + user-room sync collections' is clear, concise, and directly summarizes the main change, adding two new Collection implementations (spotlight and user-room) to the search-sync-worker package.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
search-sync-worker/main.go (1)

226-255: ⚠️ Potential issue | 🟠 Major

BATCH_SIZE no longer bounds the bulk request size.

This loop still fetches and buffers up to batchSize messages before it checks whether to flush, but BuildAction now returns multiple bulk actions per message. A small number of fan-out messages can therefore blow far past the configured limit before the first flush, which means oversized ES bulk requests and avoidable memory spikes.

Please base the flush threshold on buffered actions, and check it inside the message loop rather than only after the whole fetched batch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/main.go` around lines 226 - 255, The loop currently
flushes based on buffered messages, but because BuildAction can produce multiple
actions per message you must instead track and flush based on buffered actions:
add or use a handler method that reports current action count (e.g.,
ActionCount() or make handler.Add return number of actions added from a
message), change the fetch/threshold logic to compute fetchSize from remaining
action capacity (batchSize - handler.ActionCount(), with floor 1), and move the
flush check inside the for msg := range batch.Messages() loop so after each
handler.Add(...) you check handler.ActionCount() (or use handler.BufferFull()
redefined to mean action-capacity-full) and call handler.Flush(ctx) and update
lastFlush when the action limit is reached; ensure you stop processing further
messages from the fetched batch once the action threshold is hit so ES bulk size
cannot exceed batchSize.
🧹 Nitpick comments (1)
search-sync-worker/inbox_integration_test.go (1)

378-382: Don't split a stateful scenario into ordered subtests.

These subtests mutate shared ES/NATS state and rely on the previous step's side effects. If one require aborts mid-sequence, the rest inherit half-mutated state and start failing noisily. Either keep this as one linear test or create fresh fixtures per subtest. As per coding guidelines, "Each test must be fully independent — no shared mutable state between tests; never rely on test execution order."

Also applies to: 430-432

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/inbox_integration_test.go` around lines 378 - 382, The
TestUserRoomSync_LWWGuard test is split into ordered subtests that mutate shared
ES/NATS state and use require, which can leave later subtests in a half-mutated
state; either collapse the subtests into one single linear test body (remove
t.Run subtests) inside TestUserRoomSync_LWWGuard so state is mutated
deterministically, or make each t.Run create fresh fixtures (new ES index, new
NATS subject/connection, freshly seeded data) so they are fully independent;
update any helpers used by the subtests (setup/seed functions referenced by
TestUserRoomSync_LWWGuard and its t.Run children) to return isolated resources
and ensure every subtest defers teardown to clean ES/NATS state before
returning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/searchengine/searchengine.go`:
- Around line 25-35: The comment on BulkAction misleadingly states ES `_update`
doesn't accept `version`/`version_type`; change the comment on the BulkAction
type so it accurately says Elasticsearch does support versioning on update items
but this adapter intentionally ignores the Version field for ActionUpdate (i.e.,
for ActionUpdate, Version is ignored by design because search-sync collections
use collection-level idempotency/guards rather than external versioning). Update
the text to mention that this is a deliberate adapter choice and not an ES
limitation, referencing BulkAction, ActionUpdate and Version so readers
understand why Version is ignored for updates.

In `@search-sync-worker/handler.go`:
- Around line 111-126: The loop over pending batches treats any non-2xx/409
result as failure; change the check inside the loop (iterating i from
p.actionStart to p.actionStart+p.actionCount over results and actions) to also
treat Status==404 as success when the corresponding action is a delete or update
(i.e., check actions[i].Action == ActionDelete || actions[i].Action ==
ActionUpdate), while preserving the existing acceptance for 2xx and 409 for all
actions and keeping 404 as failure for index actions; update the condition that
sets allOK and the error logging accordingly so only true failures trigger
slog.Error with result.Status/result.Error/actions[i].DocID/actions[i].Index.

In `@search-sync-worker/inbox_stream.go`:
- Around line 37-43: The StreamSource construction is using both FilterSubject
and SubjectTransforms which are mutually exclusive; remove the FilterSubject
field from the StreamSource for the OUTBOX_<remote> source and rely on the
SubjectTransforms entry (its Source value, e.g. sourcePattern) to act as the
filter. Update the code that appends the jetstream.StreamSource (the block
creating sources = append(... &jetstream.StreamSource{ Name:
fmt.Sprintf("OUTBOX_%s", remote), FilterSubject: ..., SubjectTransforms:
[]jetstream.SubjectTransformConfig{ {Source: sourcePattern, Destination:
destPattern}, }, })) to omit FilterSubject so only SubjectTransforms.Source is
used as the selector.

In `@search-sync-worker/spotlight.go`:
- Around line 97-100: spotlightTemplateBody currently hard-codes "spotlight-*"
for the index_patterns which will miss custom/versioned spotlight indices;
change it to read the configured spotlight pattern instead (e.g., use the app
config value or helper like
getSpotlightIndexPattern()/cfg.SpotlightIndexPattern) and set "index_patterns":
[]string{configuredPattern} so the template targets the exact
configured/versioned spotlight index rather than the broad hard-coded wildcard.

In `@search-sync-worker/user_room.go`:
- Around line 182-185: userRoomTemplateBody currently hardcodes "user-room-*" so
template doesn't follow caller-supplied index name used by BuildAction; change
userRoomTemplateBody to accept the configured index/prefix (or otherwise access
the same config used by BuildAction, e.g., pass c.indexName or an indexPattern
string) and use that value for the "index_patterns" entry instead of
"user-room-*", ensuring the template will be applied to the actual indices (and
thus include the expected rooms.keyword / roomTimestamps mappings).
- Around line 107-117: The remove path currently emits a plain update (in the
OutboxMemberRemoved case using buildRemoveRoomUpdateBody and returning a
searchengine.BulkAction with ActionUpdate) which will 404 if the user doc is
missing; modify the OutboxMemberRemoved branch (and the similar branch around
the 161-176 range) to either include an upsert/tombstone payload or use
scripted_upsert so the update becomes an upsert (no-op when doc missing), or
alternatively change the bulk adapter to treat document_missing_exception/404
for remove updates as success; ensure the change references the same indexName
and account DocID and preserves the existing update script semantics while
preventing 404s on missing documents.

---

Outside diff comments:
In `@search-sync-worker/main.go`:
- Around line 226-255: The loop currently flushes based on buffered messages,
but because BuildAction can produce multiple actions per message you must
instead track and flush based on buffered actions: add or use a handler method
that reports current action count (e.g., ActionCount() or make handler.Add
return number of actions added from a message), change the fetch/threshold logic
to compute fetchSize from remaining action capacity (batchSize -
handler.ActionCount(), with floor 1), and move the flush check inside the for
msg := range batch.Messages() loop so after each handler.Add(...) you check
handler.ActionCount() (or use handler.BufferFull() redefined to mean
action-capacity-full) and call handler.Flush(ctx) and update lastFlush when the
action limit is reached; ensure you stop processing further messages from the
fetched batch once the action threshold is hit so ES bulk size cannot exceed
batchSize.

---

Nitpick comments:
In `@search-sync-worker/inbox_integration_test.go`:
- Around line 378-382: The TestUserRoomSync_LWWGuard test is split into ordered
subtests that mutate shared ES/NATS state and use require, which can leave later
subtests in a half-mutated state; either collapse the subtests into one single
linear test body (remove t.Run subtests) inside TestUserRoomSync_LWWGuard so
state is mutated deterministically, or make each t.Run create fresh fixtures
(new ES index, new NATS subject/connection, freshly seeded data) so they are
fully independent; update any helpers used by the subtests (setup/seed functions
referenced by TestUserRoomSync_LWWGuard and its t.Run children) to return
isolated resources and ensure every subtest defers teardown to clean ES/NATS
state before returning.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8ff946f6-d333-4ddf-aff1-028f37e0421f

📥 Commits

Reviewing files that changed from the base of the PR and between f9c4bb2 and 456d971.

📒 Files selected for processing (23)
  • pkg/model/event.go
  • pkg/model/model_test.go
  • pkg/searchengine/adapter.go
  • pkg/searchengine/adapter_test.go
  • pkg/searchengine/searchengine.go
  • pkg/stream/stream.go
  • pkg/stream/stream_test.go
  • pkg/subject/subject.go
  • pkg/subject/subject_test.go
  • search-sync-worker/collection.go
  • search-sync-worker/handler.go
  • search-sync-worker/inbox_integration_test.go
  • search-sync-worker/inbox_stream.go
  • search-sync-worker/inbox_stream_test.go
  • search-sync-worker/integration_test.go
  • search-sync-worker/main.go
  • search-sync-worker/messages.go
  • search-sync-worker/messages_test.go
  • search-sync-worker/spotlight.go
  • search-sync-worker/spotlight_test.go
  • search-sync-worker/template.go
  • search-sync-worker/user_room.go
  • search-sync-worker/user_room_test.go

@Joey0538 Joey0538 force-pushed the claude/room-sync-spotlight-GzOca branch from 456d971 to c906357 Compare April 14, 2026 11:06
Joey0538 (Collaborator, Author) commented

Response to CodeRabbit review

Pushed c906357 (force-update of the squashed commit) addressing the actionable findings. Triage and rationale below.

✅ Fixed (5)

1. handler.go — treat 404 on ActionDelete / ActionUpdate as success (🔴 Critical)

Done. Extracted a new isBulkItemSuccess(action, result) helper that maps:

  • 2xx → success (always)
  • 409 → success (always — external-version stale write is desired-state-reached)
  • 404 → success only for ActionDelete (already deleted) and ActionUpdate (the user-room remove path emits a scriptless update on a doc that may not exist yet — desired state already reached)
  • 404 on ActionIndex stays a failure (indexing is supposed to create the doc)

This is the "known sharp edge" I flagged in the PR description, and it's the right call to fix it now. Added 14 unit test cases in TestIsBulkItemSuccess and 3 end-to-end cases in TestHandler_Flush_404OnDeleteAndUpdate.

Bonus: this also resolves finding #5 (user_room.go member_removed 404 on missing doc) without any change to user_room.go itself.
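For reference, a minimal sketch of that status mapping. The helper name matches the description above, but the signature and Action constants are assumptions, not the actual handler.go source.

```go
package main

import "fmt"

type Action string

const (
	ActionIndex  Action = "index"
	ActionDelete Action = "delete"
	ActionUpdate Action = "update"
)

// isBulkItemSuccess maps a bulk item's HTTP status to ack/nak, treating
// "desired state already reached" outcomes as success.
func isBulkItemSuccess(action Action, status int) bool {
	switch {
	case status >= 200 && status < 300:
		return true // plain success
	case status == 409:
		return true // external-version stale write: desired state reached
	case status == 404:
		// Already-deleted doc (delete) or not-yet-created doc on the
		// scriptless remove update: desired state reached. Index is
		// supposed to create the doc, so 404 stays a failure there.
		return action == ActionDelete || action == ActionUpdate
	default:
		return false
	}
}

func main() {
	fmt.Println(isBulkItemSuccess(ActionDelete, 404))
	fmt.Println(isBulkItemSuccess(ActionIndex, 404))
}
```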

3. inbox_stream.go — drop FilterSubject when SubjectTransforms is set (🟠 Major)

Done. Verified against both docs.nats.io/source_and_mirror and ADR-36: FilterSubject and SubjectTransforms are mutually exclusive on a JetStream StreamSource. The transform's Source field acts as the filter. Removed the redundant field from inboxBootstrapStreamConfig, updated the doc comment to call out the constraint, and updated inbox_stream_test.go to assert FilterSubject is empty.

4. spotlight.go — use configured index name as template pattern (🟠 Major)

Done. spotlightTemplateBody(indexName) now sets index_patterns: [c.indexName] so a custom SPOTLIGHT_INDEX value still receives the correct mapping. Test updated to assert against the configured name (spotlight-site-a-v1-chat).

5. user_room.go — same template-pattern fix (🟠 Major)

Done. userRoomTemplateBody(indexName) mirrors the spotlight change. Test updated.

8. (nitpick) TestUserRoomSync_LWWGuard — collapse ordered subtests (🟢 Nit)

Done. Per CLAUDE.md's "each test must be fully independent — no shared mutable state, never rely on test execution order" rule, collapsed the 6 sequential t.Run subtests into one linear test body with // Step N: comment delimiters. The scenario is inherently stateful (each step builds on prior ES state to test LWW monotonicity), so independent subtests would defeat the purpose; the linear body is the right shape and now satisfies the rule.

❌ Rejected (1)

1. pkg/searchengine/searchengine.go — comment about _update and external versioning (🟡 Minor)

The comment as-written is correct. CodeRabbit's web-search summary is inaccurate.

The ES _update API explicitly does not support version_type=external. From the official Update API docs:

The update API doesn't support versioning other than internal.

And from the Bulk API docs for the update action, version and version_type are notably absent from the supported parameters list — only retry_on_conflict, _source, _source_excludes, _source_includes, and require_alias are supported. Sending version_type=external on a bulk update item triggers action_request_validation_exception.

The reason is architectural: _update is a read-modify-write operation (read current doc → apply partial update or run script → write back). External versioning's "my version is N, reject if stale" semantics only fit full document replacement (the index action). For optimistic concurrency on _update, ES uses if_seq_no + if_primary_term instead.

The cited links in the web-search result describe the bulk API in general; they don't actually verify support for version/version_type on the update action specifically. Keeping the comment as-written.

⏸️ Deferred (1)

7. (outside diff) main.goBATCH_SIZE bounds messages, not actions (🟠 Major)

Valid architectural point but not actionable today. Every collection currently in this PR produces exactly one action per message (messageCollection, spotlightCollection, userRoomCollection all return []BulkAction of length 0 or 1). The actions slice can therefore never exceed the message count, which BATCH_SIZE already bounds. The fan-out concern only materializes if a future collection emits N actions per message.

When/if such a collection lands, the fix is straightforward: track an ActionCount() on the handler and check it inside the message loop instead of after the fetch batch. I'd rather make that change in the PR that introduces the fan-out collection than add unused complexity here.

I've left a doc comment on BufferLen in handler.go ("returns the current number of buffered messages (not actions)") flagging the constraint, so a future maintainer adding fan-out won't be surprised.

Summary

| # | Severity | Status | Where |
|---|----------|--------|-------|
| 1 | 🟡 Minor | Rejected (CodeRabbit web-search inaccurate) | pkg/searchengine/searchengine.go |
| 2 | 🔴 Critical | Fixed | search-sync-worker/handler.go (+ tests) |
| 3 | 🟠 Major | Fixed | search-sync-worker/inbox_stream.go (+ tests) |
| 4 | 🟠 Major | Fixed | search-sync-worker/spotlight.go (+ tests) |
| 5 | 🟠 Major | Resolved by #2 | search-sync-worker/user_room.go |
| 6 | 🟠 Major | Fixed | search-sync-worker/user_room.go (+ tests) |
| 7 | 🟠 Major | Deferred (no fan-out collections today) | search-sync-worker/main.go |
| 8 | 🟢 Nit | Fixed | search-sync-worker/inbox_integration_test.go |

make lint → 0 issues, make test → green, go vet -tags=integration ./search-sync-worker/... ./pkg/... → clean. Branch is still one squashed commit (c906357).

🤖 Generated with Claude Code




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
search-sync-worker/main.go (1)

80-91: ⚠️ Potential issue | 🟠 Major

Reject non-positive batch settings at startup.

runConsumer assumes these values are > 0. FETCH_BATCH_SIZE<=0 can collapse into Fetch(0)/busy looping, BULK_BATCH_SIZE<=0 keeps remaining<=0 forever, and FLUSH_INTERVAL<=0 forces constant flush checks. Please validate them immediately after parsing/defaulting config and exit with a clear error.

Suggested startup validation
 	if cfg.SpotlightIndex == "" {
 		cfg.SpotlightIndex = fmt.Sprintf("spotlight-%s-v1-chat", cfg.SiteID)
 	}
 	if cfg.UserRoomIndex == "" {
 		cfg.UserRoomIndex = fmt.Sprintf("user-room-%s", cfg.SiteID)
 	}
+	switch {
+	case cfg.FetchBatchSize <= 0:
+		slog.Error("invalid config", "name", "FETCH_BATCH_SIZE", "value", cfg.FetchBatchSize)
+		os.Exit(1)
+	case cfg.BulkBatchSize <= 0:
+		slog.Error("invalid config", "name", "BULK_BATCH_SIZE", "value", cfg.BulkBatchSize)
+		os.Exit(1)
+	case cfg.FlushInterval <= 0:
+		slog.Error("invalid config", "name", "FLUSH_INTERVAL", "value", cfg.FlushInterval)
+		os.Exit(1)
+	}
 
 	ctx := context.Background()

As per coding guidelines, "Fail fast on missing required config — log error and exit with non-zero code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/main.go` around lines 80 - 91, After parsing/defaulting
the config from env.ParseAs[config], validate that numeric settings used by
runConsumer (FETCH_BATCH_SIZE, BULK_BATCH_SIZE, FLUSH_INTERVAL) are positive; if
any are <= 0 log a clear error via slog.Error (including the setting name and
value) and exit with a non-zero code (os.Exit(1)). Update the startup path
immediately after the current cfg defaults (after setting SpotlightIndex and
UserRoomIndex) to check cfg.FetchBatchSize, cfg.BulkBatchSize, and
cfg.FlushInterval (or the exact field names in config) and fail fast to prevent
Fetch(0), infinite remaining loop, or constant flush checks in runConsumer.
🧹 Nitpick comments (2)
search-sync-worker/template.go (1)

25-33: Skip es-tagged fields that don't expose a concrete JSON name.

This shared helper will currently create a mapping entry under "" or "-" if a future struct forgets a json tag or uses json:"-". Failing closed here is safer than quietly generating a broken template.

Suggested hardening
 		jsonTag := field.Tag.Get("json")
 
 		name, _, _ := strings.Cut(jsonTag, ",")
+		if name == "" || name == "-" {
+			continue
+		}
 
 		esType, analyzer, _ := strings.Cut(esTag, ",")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/template.go` around lines 25 - 33, The code currently
derives JSON field name into variable name and unconditionally adds a mapping
entry; change it to skip fields that don't expose a concrete JSON name by
checking the parsed name from jsonTag and returning without adding a prop when
name == "" or name == "-". Update the block around jsonTag/name (the variables
jsonTag and name) so that after computing name, you do a guard (if name == "" ||
name == "-" { continue } or return depending on context) before computing
esType/analyzer and assigning props[name], ensuring no mapping is created for
anonymous/ignored json fields.
search-sync-worker/inbox_integration_test.go (1)

40-58: historyShared is accepted but its actual value is ignored.

On Line 57, the helper only converts the pointer to a boolean flag. The concrete timestamp passed by callers (e.g., restrictedFrom) is not propagated into Subscription.HistorySharedSince, which makes those test inputs less faithful.

💡 Proposed refactor
 type memberFixture struct {
 	SubID      string
 	Account    string
 	Restricted bool // if true, HistorySharedSince is set — user-room-sync filters, spotlight-sync indexes
+	HistorySharedSince *time.Time
 }

 func buildMemberEventPayload(
 	subID, account, roomID, roomName, siteID string,
 	joinedAt time.Time,
 	historyShared *time.Time,
 ) model.MemberAddedPayload {
 	return buildBulkMemberEventPayload(roomID, roomName, siteID, joinedAt, []memberFixture{{
 		SubID:      subID,
 		Account:    account,
 		Restricted: historyShared != nil,
+		HistorySharedSince: historyShared,
 	}})
 }

 func buildBulkMemberEventPayload(
 	roomID, roomName, siteID string,
 	joinedAt time.Time,
 	members []memberFixture,
 ) model.MemberAddedPayload {
-	historyFrom := joinedAt.Add(-1 * time.Hour)
 	subscriptions := make([]model.Subscription, 0, len(members))
 	for _, m := range members {
 		sub := model.Subscription{
 			ID:         m.SubID,
 			User:       model.SubscriptionUser{ID: "u-" + m.Account, Account: m.Account},
 			RoomID:     roomID,
 			SiteID:     siteID,
 			Role:       model.RoleMember,
 			JoinedAt:   joinedAt,
 			LastSeenAt: joinedAt,
 		}
-		if m.Restricted {
+		if m.HistorySharedSince != nil {
+			sub.HistorySharedSince = m.HistorySharedSince
+		} else if m.Restricted {
+			historyFrom := joinedAt.Add(-1 * time.Hour)
 			sub.HistorySharedSince = &historyFrom
 		}
 		subscriptions = append(subscriptions, sub)
 	}

Also applies to: 72-86

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/inbox_integration_test.go` around lines 40 - 58,
buildMemberEventPayload currently ignores the actual historyShared timestamp and
only sets Restricted based on its nil-ness; update it to pass the concrete
timestamp into buildBulkMemberEventPayload so that
Subscription.HistorySharedSince is populated. Specifically, when calling
buildBulkMemberEventPayload from buildMemberEventPayload, propagate
historyShared (the *time.Time) into the created memberFixture / Subscription
data rather than converting it to a boolean; ensure buildBulkMemberEventPayload
and any code that constructs Subscription.HistorySharedSince (or reads
memberFixture.Restricted) use the timestamp value to set
Subscription.HistorySharedSince when non-nil.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/stream/stream_test.go`:
- Around line 28-33: Replace uses of t.Errorf and t.Fatalf in stream_test.go
with testify assertions: add imports for "github.com/stretchr/testify/assert"
and "github.com/stretchr/testify/require", then convert non-fatal checks like
the Name and Subjects assertions to assert calls (e.g., assert.Equal(t,
tt.wantName, tt.cfg.Name) and assert.Len(t, tt.cfg.Subjects, 1); assert.Equal(t,
tt.wantSubj, tt.cfg.Subjects[0])) and convert checks that should stop the test
on failure to require calls (e.g., require.NoError/require.Equal where
appropriate) using the same tt.* symbols (tt.cfg.Name, tt.wantName,
tt.cfg.Subjects, tt.wantSubj) to locate each assertion to update.

In `@search-sync-worker/handler.go`:
- Around line 167-176: The isBulkItemSuccess function currently treats any 404
for delete/update as success; change it to inspect the bulk item error payload
(e.g., result.Error.Type or result.Error.Reason on searchengine.BulkResult) and
only treat 404 as idempotent success when the error type matches the benign
missing-document case (e.g., "document_missing_exception" or the equivalent used
in tests), otherwise return false so index/template-missing errors like
"index_not_found_exception" are not acked; ensure you handle nil/absent Error
safely and keep the existing 2xx and 409 logic in isBulkItemSuccess.

---

Outside diff comments:
In `@search-sync-worker/main.go`:
- Around line 80-91: After parsing/defaulting the config from
env.ParseAs[config], validate that numeric settings used by runConsumer
(FETCH_BATCH_SIZE, BULK_BATCH_SIZE, FLUSH_INTERVAL) are positive; if any are <=
0 log a clear error via slog.Error (including the setting name and value) and
exit with a non-zero code (os.Exit(1)). Update the startup path immediately
after the current cfg defaults (after setting SpotlightIndex and UserRoomIndex)
to check cfg.FetchBatchSize, cfg.BulkBatchSize, and cfg.FlushInterval (or the
exact field names in config) and fail fast to prevent Fetch(0), infinite
remaining loop, or constant flush checks in runConsumer.

---

Nitpick comments:
In `@search-sync-worker/inbox_integration_test.go`:
- Around line 40-58: buildMemberEventPayload currently ignores the actual
historyShared timestamp and only sets Restricted based on its nil-ness; update
it to pass the concrete timestamp into buildBulkMemberEventPayload so that
Subscription.HistorySharedSince is populated. Specifically, when calling
buildBulkMemberEventPayload from buildMemberEventPayload, propagate
historyShared (the *time.Time) into the created memberFixture / Subscription
data rather than converting it to a boolean; ensure buildBulkMemberEventPayload
and any code that constructs Subscription.HistorySharedSince (or reads
memberFixture.Restricted) use the timestamp value to set
Subscription.HistorySharedSince when non-nil.

In `@search-sync-worker/template.go`:
- Around line 25-33: The code currently derives JSON field name into variable
name and unconditionally adds a mapping entry; change it to skip fields that
don't expose a concrete JSON name by checking the parsed name from jsonTag and
returning without adding a prop when name == "" or name == "-". Update the block
around jsonTag/name (the variables jsonTag and name) so that after computing
name, you do a guard (if name == "" || name == "-" { continue } or return
depending on context) before computing esType/analyzer and assigning
props[name], ensuring no mapping is created for anonymous/ignored json fields.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff8d4979-df8b-43ba-a2d0-bb2d06547287

📥 Commits

Reviewing files that changed from the base of the PR and between 456d971 and 201c715.

📒 Files selected for processing (24)
  • pkg/model/event.go
  • pkg/model/model_test.go
  • pkg/searchengine/adapter.go
  • pkg/searchengine/adapter_test.go
  • pkg/searchengine/searchengine.go
  • pkg/stream/stream.go
  • pkg/stream/stream_test.go
  • pkg/subject/subject.go
  • pkg/subject/subject_test.go
  • search-sync-worker/collection.go
  • search-sync-worker/handler.go
  • search-sync-worker/handler_test.go
  • search-sync-worker/inbox_integration_test.go
  • search-sync-worker/inbox_stream.go
  • search-sync-worker/inbox_stream_test.go
  • search-sync-worker/integration_test.go
  • search-sync-worker/main.go
  • search-sync-worker/messages.go
  • search-sync-worker/messages_test.go
  • search-sync-worker/spotlight.go
  • search-sync-worker/spotlight_test.go
  • search-sync-worker/template.go
  • search-sync-worker/user_room.go
  • search-sync-worker/user_room_test.go
✅ Files skipped from review due to trivial changes (3)
  • pkg/model/model_test.go
  • pkg/searchengine/searchengine.go
  • search-sync-worker/spotlight_test.go
🚧 Files skipped from review as they are similar to previous changes (11)
  • search-sync-worker/messages_test.go
  • pkg/searchengine/adapter.go
  • pkg/stream/stream.go
  • search-sync-worker/integration_test.go
  • search-sync-worker/inbox_stream_test.go
  • pkg/model/event.go
  • pkg/subject/subject_test.go
  • search-sync-worker/inbox_stream.go
  • pkg/searchengine/adapter_test.go
  • search-sync-worker/messages.go
  • search-sync-worker/user_room.go

Comment thread pkg/stream/stream_test.go
Comment thread search-sync-worker/handler.go
Collaborator Author

Response to CodeRabbit's review on 201c715 + naming cleanup

Pushed 4587f35 on top of 201c715. Branch is now three commits:

4587f35 refactor(search-sync-worker): rename FlushInterval, tighten 404 handling, fail fast on bad config
201c715 feat(search-sync-worker): support bulk invite via multi-subscription member events
c906357 feat(search-sync-worker): add spotlight and user-room sync collections

Triage against the latest review:

✅ Addressed

handler.go:176 — don't treat every 404 on delete/update as idempotent (🟠 Major)

Fair catch. Previous fix was too broad — index_not_found_exception at 404 means the backing index/template is missing, and silently acking those would drop messages on a bad deploy with no feedback.

Fix has three layers:

  1. pkg/searchengine.BulkResult gains an ErrorType field (machine-readable classifier) alongside the existing human-readable Error (Reason). Propagated from detail.Error.Type in the adapter.
  2. isBulkItemSuccess now matches on ErrorType at 404:
    • ActionDelete: success only when ErrorType == "" (delete-of-missing-doc sets result:"not_found" with no error block)
    • ActionUpdate: success only when ErrorType == "document_missing_exception" — unfamiliar error types fail closed, including index_not_found_exception
    • ActionIndex: always a failure (unchanged)
  3. Tests: TestIsBulkItemSuccess now has 14 cases covering document-missing vs index-not-found on both delete and update, plus an unknown-error-type fail-closed case. TestHandler_Flush_404OnDeleteAndUpdate adds end-to-end "404 + index_not_found_exception" cases that must be NAK'd. TestAdapter_Bulk gets a new subtest verifying document_missing_exception and index_not_found_exception propagate into BulkResult.ErrorType.
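
The classification logic can be sketched as follows. This mirrors the behavior described above; the type and constant names are paraphrased from this comment, not copied from the actual source.

```go
package main

import "fmt"

// BulkResult carries the per-item status and the machine-readable ES error
// classifier added in this PR (e.g. "document_missing_exception").
type BulkResult struct {
	Status    int
	ErrorType string
}

type Action int

const (
	ActionIndex Action = iota
	ActionUpdate
	ActionDelete
)

func isBulkItemSuccess(action Action, r BulkResult) bool {
	switch {
	case r.Status >= 200 && r.Status < 300:
		return true
	case r.Status == 409: // version conflict: stale event, safe to ack
		return true
	case r.Status == 404:
		switch action {
		case ActionDelete:
			// delete-of-missing-doc reports result:"not_found" with no error block
			return r.ErrorType == ""
		case ActionUpdate:
			// only the benign missing-doc case; index_not_found_exception fails closed
			return r.ErrorType == "document_missing_exception"
		}
		return false // ActionIndex: a 404 is always a failure
	default:
		return false
	}
}

func main() {
	fmt.Println(isBulkItemSuccess(ActionDelete, BulkResult{Status: 404}))                                          // true
	fmt.Println(isBulkItemSuccess(ActionUpdate, BulkResult{Status: 404, ErrorType: "document_missing_exception"})) // true
	fmt.Println(isBulkItemSuccess(ActionUpdate, BulkResult{Status: 404, ErrorType: "index_not_found_exception"}))  // false
	fmt.Println(isBulkItemSuccess(ActionIndex, BulkResult{Status: 404}))                                           // false
}
```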

main.go outside-diff — reject non-positive batch/interval settings at startup (🟠 Major)

Good find. Added fail-fast validation right after the index-name defaults:

if cfg.FetchBatchSize <= 0 {
    slog.Error("invalid config", "name", "FETCH_BATCH_SIZE", ...)
    os.Exit(1)
}
if cfg.BulkBatchSize <= 0 { ... }
if cfg.BulkFlushInterval <= 0 { ... }

Matches the CLAUDE.md "fail fast on bad config" rule. Without this, FetchBatchSize <= 0 would collapse into Fetch(0) busy loops and BulkBatchSize <= 0 would keep the remaining <= 0 fast path forever.

pkg/stream/stream_test.go — convert to testify (🟡 Minor)

Converted t.Errorf / t.Fatalf to assert.Equal / require.Len throughout. Matches the repo-wide testify guideline in CLAUDE.md §4.

template.go:25-33 — guard empty/ignored json names (🟢 Nit)

Added if name == "" || name == "-" { continue } plus a doc comment calling out the fail-closed policy. Without this, a future struct with an es tag but a missing/ignored json tag would silently emit a mapping entry under "", corrupting the template.

inbox_integration_test.go:40-58 — propagate historyShared timestamp faithfully (🟢 Nit)

memberFixture now carries HistorySharedSince *time.Time verbatim. The Restricted bool stays as a convenience shortcut for "pick a synthetic timestamp for me"; if HistorySharedSince is set explicitly it wins. Doc comment spells out the precedence.

🧹 Naming cleanup (user-requested, same commit)

Renamed FlushIntervalBulkFlushInterval and FLUSH_INTERVALBULK_FLUSH_INTERVAL for consistency with BulkBatchSize / BULK_BATCH_SIZE. The two variables are now clearly paired as "ES bulk flush triggers" (size-based + time-based). Without the prefix on the interval, it looked like an unrelated concept.

🔄 Already resolved (prior commits)

CodeRabbit's new review thread on user_room.go (member_removed 404 on missing doc — 🟠 major) is covered by the handler.go 404-as-success fix from c906357. That commit's isBulkItemSuccess handles it generically so we didn't need to add an upsert/tombstone shape to the remove path. The 404 on missing user-room doc still gets acked, and the new ErrorType check makes sure we only ack when it's genuinely "doc missing" and not "index missing."

Verification

  • make lint → 0 issues
  • make test → all services green (5 new test cases in TestIsBulkItemSuccess, 4 new in TestHandler_Flush_404OnDeleteAndUpdate, 1 new in TestAdapter_Bulk)
  • go vet -tags=integration ./search-sync-worker/... ./pkg/... → clean

🤖 Generated with Claude Code



Collaborator Author

Follow-up: natsutil.Ack/Nak helpers + PR status

Pushed 375fc1c on top of 4587f35. Branch is now four commits:

375fc1c refactor(natsutil): add Ack/Nak helpers, use in search-sync-worker
4587f35 refactor(search-sync-worker): rename FlushInterval, tighten 404 handling, fail fast on bad config
201c715 feat(search-sync-worker): support bulk invite via multi-subscription member events
c906357 feat(search-sync-worker): add spotlight and user-room sync collections

What 375fc1c does

Adds shared natsutil.Ack(msg, reason) / natsutil.Nak(msg, reason) helpers and converts search-sync-worker's handler to use them.

Why: the "try to ack/nak a JetStream message and log any failure" pattern appears 18 times across 7 services (message-gatekeeper, broadcast-worker, inbox-worker, search-sync-worker, room-worker, notification-worker, message-worker) with divergent log shapes:

  • Log message: "failed to ack message" / "ack failed" / "ack malformed message" — three different phrases
  • Error key: "error" vs "err" — two different keys
  • Spelling: "nack" vs "nak" — mixed

A log-aggregation query like "every ack failure in the last hour" currently needs to match three message formats and two key names. With the shared helper, every service emits slog.Error("ack failed", "reason", ..., "error", ...) — one shape, queryable by cause via the reason field.

Interface design: Acker / Naker are minimal single-method interfaces, so the same helpers work for both jetstream.Msg (nats.go) and oteljetstream.Msg (otel-wrapped) without a wrapper type. Every consumer in the repo satisfies them.

Scope split (intentional)

  • This commit: helper + tests + search-sync-worker conversion (5 call sites). search-sync-worker/handler.go was the motivating case since this PR already touches it extensively.
  • Follow-up PR: migrate the 6 other services (13 call sites) and normalize the divergent spellings in one mechanical pass. I'll open it after this PR lands.

Keeping the other services out of this PR avoids expanding the blast radius to files unrelated to spotlight/user-room sync — reviewers can focus on the feature, and the migration PR is its own tidy commit.

Unresolved thread status

Only one review thread remains with is_resolved: false: discussion_r3078912690 on search-sync-worker/user_room.go, about the member_removed 404. It's marked is_outdated: true because the fix landed in handler.go rather than on the line the comment is anchored to.

I've posted a reply on that thread pointing at the actual fix commits (c906357 + 4587f35). Feel free to click Resolve conversation to close it manually — CodeRabbit's autoresolver didn't pick it up because the fix wasn't on the exact file.

Verification

  • make lint → 0 issues
  • make test → all services green (4 new tests in pkg/natsutil/ack_test.go)
  • go vet -tags=integration ./search-sync-worker/... ./pkg/... → clean

🤖 Generated with Claude Code



@Joey0538 Joey0538 force-pushed the claude/room-sync-spotlight-GzOca branch from 3026f46 to 61a3cf9 Compare April 16, 2026 02:57

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/subject/subject_test.go (1)

74-90: Use assert/require for the new subject-slice checks.

This subtest is doing manual length and element comparisons even though the repo standardizes on Testify in _test.go files. require.Equal(t, want, got) would make the intent clearer and keep the test style consistent.

As per coding guidelines, "**/*_test.go: Use standard testing package with github.com/stretchr/testify/assert and testify/require for assertions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/subject/subject_test.go` around lines 74 - 90, Replace the manual length
and element-by-element comparisons in the subtest named
"InboxMemberEventSubjects" with a single require.Equal assertion; call
subject.InboxMemberEventSubjects("site-a") into got, build want as before, then
use require.Equal(t, want, got). Add the import for
"github.com/stretchr/testify/require" to the test file and remove the manual len
check and for-loop that compares elements.
search-sync-worker/handler_test.go (1)

304-407: Collapse the 404 permutations into a table-driven test.

These subtests all exercise the same handler flow with different {action, status, errorType, wantAck} inputs. A table would remove a lot of duplication and make it easier to add more ES error classifications without copying another full setup block.

As per coding guidelines, "**/handler_test.go: For handler tests: test each NATS/HTTP handler method with table-driven tests covering all documented scenarios."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/handler_test.go` around lines 304 - 407, Collapse the
repeated subtests in TestHandler_Flush_404OnDeleteAndUpdate into a single
table-driven loop: define a slice of test cases containing name, collection
factory (e.g. newStubDeleteCollection, newStubUpdateCollection,
newStubIndexCollection), the mocked BulkResult (Status, ErrorType, Error) and
expected ack/nack booleans; then for each case call t.Run(case.name, func(t
*testing.T){ create a gomock.Controller, NewMockStore, set the Bulk expectation
(gomock.Any(), gomock.Len(1)) to return the case's BulkResult, build the handler
via NewHandler(store, coll, 500), create stubMsg, h.Add(msg), h.Flush(ctx) and
assert msg.acked/msg.nacked match expected }); keep existing expectations (Bulk
call and return values), and reuse existing helpers (newStubDeleteCollection,
newStubUpdateCollection, newStubIndexCollection, NewHandler, stubMsg) to ensure
behavior is identical while removing duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@search-sync-worker/inbox_stream.go`:
- Around line 37-45: The code is building NATS subject strings inline
(destPattern and sourcePattern) — instead, create and use subject
builder/pattern helpers in pkg/subject (e.g., functions like
InboxAggregatePattern(siteID string) and OutboxToPattern(srcSiteID, destSiteID
string)) and replace the fmt.Sprintf usages in inbox_stream.go: build
destPattern with pkg/subject.InboxAggregatePattern(siteID) and build
sourcePattern with pkg/subject.OutboxToPattern(remote, siteID); then pass those
returned patterns into the jetstream.StreamSource SubjectTransforms (leaving
Name as OUTBOX_{remote} and the SubjectTransformConfig usage unchanged) so all
canonical subject definitions live in pkg/subject.
- Around line 83-95: In parseMemberEvent, validate evt.Type after unmarshalling
the OutboxEvent and fail closed if it is not one of the supported values
("member_added" or "member_removed"); update parseMemberEvent (which returns
*model.OutboxEvent, *model.MemberAddedPayload) to check evt.Type and return a
descriptive error (e.g., "unsupported event type: %s") when the type is
unexpected so mispublished INBOX messages cannot be processed further.

---

Nitpick comments:
In `@pkg/subject/subject_test.go`:
- Around line 74-90: Replace the manual length and element-by-element
comparisons in the subtest named "InboxMemberEventSubjects" with a single
require.Equal assertion; call subject.InboxMemberEventSubjects("site-a") into
got, build want as before, then use require.Equal(t, want, got). Add the import
for "github.com/stretchr/testify/require" to the test file and remove the manual
len check and for-loop that compares elements.

In `@search-sync-worker/handler_test.go`:
- Around line 304-407: Collapse the repeated subtests in
TestHandler_Flush_404OnDeleteAndUpdate into a single table-driven loop: define a
slice of test cases containing name, collection factory (e.g.
newStubDeleteCollection, newStubUpdateCollection, newStubIndexCollection), the
mocked BulkResult (Status, ErrorType, Error) and expected ack/nack booleans;
then for each case call t.Run(case.name, func(t *testing.T){ create a
gomock.Controller, NewMockStore, set the Bulk expectation (gomock.Any(),
gomock.Len(1)) to return the case's BulkResult, build the handler via
NewHandler(store, coll, 500), create stubMsg, h.Add(msg), h.Flush(ctx) and
assert msg.acked/msg.nacked match expected }); keep existing expectations (Bulk
call and return values), and reuse existing helpers (newStubDeleteCollection,
newStubUpdateCollection, newStubIndexCollection, NewHandler, stubMsg) to ensure
behavior is identical while removing duplication.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2e212552-96a7-4b90-8bcb-75f0d783333b

📥 Commits

Reviewing files that changed from the base of the PR and between 375fc1c and 61a3cf9.

📒 Files selected for processing (26)
  • pkg/model/event.go
  • pkg/model/model_test.go
  • pkg/natsutil/ack.go
  • pkg/natsutil/ack_test.go
  • pkg/searchengine/adapter.go
  • pkg/searchengine/adapter_test.go
  • pkg/searchengine/searchengine.go
  • pkg/stream/stream.go
  • pkg/stream/stream_test.go
  • pkg/subject/subject.go
  • pkg/subject/subject_test.go
  • search-sync-worker/collection.go
  • search-sync-worker/handler.go
  • search-sync-worker/handler_test.go
  • search-sync-worker/inbox_integration_test.go
  • search-sync-worker/inbox_stream.go
  • search-sync-worker/inbox_stream_test.go
  • search-sync-worker/integration_test.go
  • search-sync-worker/main.go
  • search-sync-worker/messages.go
  • search-sync-worker/messages_test.go
  • search-sync-worker/spotlight.go
  • search-sync-worker/spotlight_test.go
  • search-sync-worker/template.go
  • search-sync-worker/user_room.go
  • search-sync-worker/user_room_test.go
✅ Files skipped from review due to trivial changes (6)
  • pkg/model/model_test.go
  • pkg/searchengine/adapter_test.go
  • pkg/natsutil/ack_test.go
  • pkg/natsutil/ack.go
  • search-sync-worker/spotlight_test.go
  • search-sync-worker/user_room.go
🚧 Files skipped from review as they are similar to previous changes (8)
  • search-sync-worker/messages_test.go
  • pkg/stream/stream.go
  • search-sync-worker/integration_test.go
  • pkg/searchengine/searchengine.go
  • pkg/stream/stream_test.go
  • search-sync-worker/collection.go
  • search-sync-worker/template.go
  • search-sync-worker/main.go

Comment thread search-sync-worker/inbox_stream.go Outdated
Comment on lines +37 to +45
destPattern := fmt.Sprintf("chat.inbox.%s.aggregate.>", siteID)
sources := make([]*jetstream.StreamSource, 0, len(remoteSiteIDs))
for _, remote := range remoteSiteIDs {
sourcePattern := fmt.Sprintf("outbox.%s.to.%s.>", remote, siteID)
sources = append(sources, &jetstream.StreamSource{
Name: fmt.Sprintf("OUTBOX_%s", remote),
SubjectTransforms: []jetstream.SubjectTransformConfig{
{Source: sourcePattern, Destination: destPattern},
},


🛠️ Refactor suggestion | 🟠 Major

Move these NATS subject patterns into pkg/subject.

This helper reintroduces raw subject formatting for both the sourced OUTBOX pattern and the rewritten INBOX aggregate pattern. Please add builders/pattern helpers in pkg/subject and reuse them here so the canonical subject definitions stay in one place.

As per coding guidelines, "Use dot-delimited hierarchical NATS subjects — use pkg/subject builders, never raw fmt.Sprintf" and "pkg/subject/*.go: Outbox subjects: outbox.{siteID}.to.{destSiteID}.{eventType}; define subject patterns in pkg/subject."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/inbox_stream.go` around lines 37 - 45, The code is
building NATS subject strings inline (destPattern and sourcePattern) — instead,
create and use subject builder/pattern helpers in pkg/subject (e.g., functions
like InboxAggregatePattern(siteID string) and OutboxToPattern(srcSiteID,
destSiteID string)) and replace the fmt.Sprintf usages in inbox_stream.go: build
destPattern with pkg/subject.InboxAggregatePattern(siteID) and build
sourcePattern with pkg/subject.OutboxToPattern(remote, siteID); then pass those
returned patterns into the jetstream.StreamSource SubjectTransforms (leaving
Name as OUTBOX_{remote} and the SubjectTransformConfig usage unchanged) so all
canonical subject definitions live in pkg/subject.

Comment thread search-sync-worker/inbox_stream.go Outdated
Comment on lines +83 to +95
func parseMemberEvent(data []byte) (*model.OutboxEvent, *model.MemberAddedPayload, error) {
var evt model.OutboxEvent
if err := json.Unmarshal(data, &evt); err != nil {
return nil, nil, fmt.Errorf("unmarshal outbox event: %w", err)
}
if evt.Timestamp <= 0 {
return nil, nil, fmt.Errorf("parse member event: missing timestamp")
}
var payload model.MemberAddedPayload
if err := json.Unmarshal(evt.Payload, &payload); err != nil {
return nil, nil, fmt.Errorf("unmarshal member added payload: %w", err)
}
return &evt, &payload, nil

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Reject unsupported OutboxEvent.Type values here.

parseMemberEvent is the shared decode boundary for inbox-member collections, but it currently accepts any evt.Type as long as the payload shape parses. Fail closed unless the type is member_added or member_removed; otherwise a mispublished INBOX message can reach the wrong indexing path.

Proposed fix
 func parseMemberEvent(data []byte) (*model.OutboxEvent, *model.MemberAddedPayload, error) {
 	var evt model.OutboxEvent
 	if err := json.Unmarshal(data, &evt); err != nil {
 		return nil, nil, fmt.Errorf("unmarshal outbox event: %w", err)
 	}
 	if evt.Timestamp <= 0 {
 		return nil, nil, fmt.Errorf("parse member event: missing timestamp")
 	}
+	if evt.Type != model.OutboxMemberAdded && evt.Type != model.OutboxMemberRemoved {
+		return nil, nil, fmt.Errorf("parse member event: unsupported type %q", evt.Type)
+	}
 	var payload model.MemberAddedPayload
 	if err := json.Unmarshal(evt.Payload, &payload); err != nil {
-		return nil, nil, fmt.Errorf("unmarshal member added payload: %w", err)
+		return nil, nil, fmt.Errorf("unmarshal member event payload: %w", err)
 	}
 	return &evt, &payload, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/inbox_stream.go` around lines 83 - 95, In
parseMemberEvent, validate evt.Type after unmarshalling the OutboxEvent and fail
closed if it is not one of the supported values ("member_added" or
"member_removed"); update parseMemberEvent (which returns *model.OutboxEvent,
*model.MemberAddedPayload) to check evt.Type and return a descriptive error
(e.g., "unsupported event type: %s") when the type is unexpected so mispublished
INBOX messages cannot be processed further.

@Joey0538 Joey0538 force-pushed the claude/room-sync-spotlight-GzOca branch from 61a3cf9 to 7cd6729 Compare April 20, 2026 07:50

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
inbox-worker/handler.go (1)

113-145: Re-publish failures are silently absorbed.

Both canonical re-publishes log on failure but let handleMemberAdded/handleMemberRemoved return nil, so JetStream will Ack and the search-index update is lost permanently. This matches the pre-existing best-effort publish style in the service, so not a blocker — but for these specific re-publishes the store-side work is idempotent (BulkCreateSubscriptions swallows duplicate keys; DeleteSubscriptionsByAccounts is a no-op on empty set), so Nak'ing on publish failure would be safe and would give you at-least-once delivery to the canonical stream. Consider either returning the error or at minimum adding a metric so drift is visible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inbox-worker/handler.go` around lines 113 - 145, The re-publish failures in
handleMemberRemoved (and the analogous handleMemberAdded) currently only log
errors and return nil, which causes the message to be Acked and drops the
canonical update; change the behavior so that if h.pub.Publish(ctx,
subject.RoomCanonicalMemberRemoved(h.siteID), evt.Payload) (and the
RoomCanonicalMemberAdded call) returns an error you propagate that error (return
it) instead of swallowing it, leveraging the idempotence of
DeleteSubscriptionsByAccounts and BulkCreateSubscriptions to safely Nak/retry;
alternatively, if you prefer not to change delivery semantics, increment a
visible metric on publish failure so drift is detectable.
inbox-worker/handler_test.go (1)

749-752: Optional: tighten canonical subject assertions.

assert.Contains(..., "chat.room.canonical") will also pass if the handler published the wrong canonical variant (e.g., member_added instead of member_removed). Consider comparing against the exact builder output so a bug in pkg/subject.RoomCanonicalMemberAdded/Removed or a swap in handler.go is caught:

Proposed tighter assertion
-	require.Len(t, records, 1)
-	assert.Contains(t, records[0].subject, "chat.room.canonical")
+	require.Len(t, records, 1)
+	assert.Equal(t, subject.RoomCanonicalMemberRemoved("site-test"), records[0].subject)

Also applies to: 803-806

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inbox-worker/handler_test.go` around lines 749 - 752, The test currently uses
assert.Contains on pub.getRecords()[0].subject which can miss incorrect
canonical variants; update the assertions to compare the subject exactly to the
expected constructed subject (use the subject builder functions like
pkg/subject.RoomCanonicalMemberAdded and pkg/subject.RoomCanonicalMemberRemoved
or the exact builder output used by the handler) instead of using Contains so
the test fails if the wrong variant is published (apply the same change to the
other occurrence around lines 803-806).
room-worker/handler.go (1)

310-312: Inconsistent siteID source for canonical member-event subjects.

The add paths key the canonical subject on room.SiteID (line 127 invite, line 673 add), while both remove paths key it on h.siteID:

// add paths
subject.RoomCanonicalMemberAdded(room.SiteID)

// remove paths (here and processRemoveOrg)
subject.RoomCanonicalMemberRemoved(h.siteID)

In room-worker these are equivalent today because room-worker only handles rooms whose SiteID == h.siteID, but the asymmetry is surprising and will silently break if that invariant ever loosens (e.g., a multi-site worker). Prefer a single convention — room.SiteID is the more semantically correct key since the canonical subject is "where the room lives":

Proposed alignment (remove paths)

In processRemoveIndividual you'd need to load the room first (or thread it through), so the simpler fix is to keep h.siteID everywhere and switch the add paths to match. Either direction is fine — the important thing is consistency.

-	if err := h.publish(ctx, subject.RoomCanonicalMemberAdded(room.SiteID), memberAddData); err != nil {
+	if err := h.publish(ctx, subject.RoomCanonicalMemberAdded(h.siteID), memberAddData); err != nil {
 		slog.Error("room canonical member_added publish failed", "error", err, "roomID", req.RoomID)
 	}

Also applies to: 433-435

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@room-worker/handler.go` around lines 310 - 312, The canonical member-event
subjects use inconsistent site IDs: some places call
subject.RoomCanonicalMemberAdded(room.SiteID) while others call
subject.RoomCanonicalMemberRemoved(h.siteID); normalize to use room.SiteID
everywhere (i.e., replace uses of h.siteID for canonical room-member subjects
with the room's SiteID) so the subject key is always "where the room lives";
update the remove paths (e.g., in processRemoveIndividual/processRemoveOrg and
the publish call using subject.RoomCanonicalMemberRemoved) to obtain/load the
Room and use room.SiteID when constructing the subject, ensuring all
subject.RoomCanonicalMemberAdded/Removed calls are consistent.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a1cc1feb-84d4-434b-bc55-f57d97912080

📥 Commits

Reviewing files that changed from the base of the PR and between 61a3cf9 and 7cd6729.

📒 Files selected for processing (22)
  • inbox-worker/handler.go
  • inbox-worker/handler_test.go
  • inbox-worker/integration_test.go
  • inbox-worker/main.go
  • pkg/model/event.go
  • pkg/natsutil/ack.go
  • pkg/natsutil/ack_test.go
  • pkg/searchengine/adapter.go
  • pkg/searchengine/searchengine.go
  • pkg/subject/subject.go
  • room-worker/handler.go
  • room-worker/handler_test.go
  • search-sync-worker/collection.go
  • search-sync-worker/handler.go
  • search-sync-worker/handler_test.go
  • search-sync-worker/main.go
  • search-sync-worker/messages.go
  • search-sync-worker/messages_test.go
  • search-sync-worker/room_member.go
  • search-sync-worker/spotlight.go
  • search-sync-worker/template.go
  • search-sync-worker/user_room.go
✅ Files skipped from review due to trivial changes (4)
  • inbox-worker/integration_test.go
  • room-worker/handler_test.go
  • pkg/natsutil/ack_test.go
  • search-sync-worker/handler_test.go
🚧 Files skipped from review as they are similar to previous changes (9)
  • pkg/searchengine/adapter.go
  • search-sync-worker/messages_test.go
  • pkg/subject/subject.go
  • search-sync-worker/collection.go
  • pkg/natsutil/ack.go
  • pkg/model/event.go
  • search-sync-worker/messages.go
  • search-sync-worker/main.go
  • search-sync-worker/user_room.go

Comment thread search-sync-worker/handler.go Outdated
Comment on lines +119 to +120
if result.Status == 409 {
return true

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Elasticsearch bulk update retry_on_conflict version_conflict_engine_exception official docs

💡 Result:

In Elasticsearch bulk API, use retry_on_conflict in the update action metadata to handle version_conflict_engine_exception during concurrent updates. Example from official docs: { "update" : { "_index" : "index1", "_id" : "1", "retry_on_conflict" : 3 } } { "doc" : { "field" : "value" } } This retries the update up to 3 times on version conflicts. The version_conflict_engine_exception occurs due to optimistic concurrency control when concurrent modifications change the document version/seq_no. Official bulk docs confirm retry_on_conflict is specific to update actions in bulk requests, unlike single update API where it's a query param. For upsert in bulk: { "update" : { "_id" : "1", "retry_on_conflict" : 3 } } { "doc" : { "field" : "value" }, "upsert" : { "counter" : 1 } } Bulk failures are per-item; one conflict doesn't fail the entire request.


🏁 Script executed:

# First, let's see the full context of lines 119-120 in handler.go
cat -n search-sync-worker/handler.go | sed -n '100,140p'

Repository: hmchangw/chat

Length of output: 1197


🏁 Script executed:

# Check the searchengine package to see how ActionUpdate is built
find . -name "searchengine*" -type f | head -20

Repository: hmchangw/chat

Length of output: 91


🏁 Script executed:

# Search for version-related fields in the searchengine/handler code
rg "version_type|retry_on_conflict|Version" search-sync-worker/ pkg/searchengine/ -A 2 -B 2

Repository: hmchangw/chat

Length of output: 4976


🏁 Script executed:

# Check user_room.go to see what actions it builds
cat -n search-sync-worker/user_room.go | head -100

Repository: hmchangw/chat

Length of output: 3612


🏁 Script executed:

# Search for where ActionUpdate is used and what body is sent
rg "ActionUpdate" search-sync-worker/ -B 3 -A 3

Repository: hmchangw/chat

Length of output: 1303


Don't ACK 409 conflicts on ActionUpdate as success.

The BulkAction.Version field is explicitly ignored for ActionUpdate operations. A 409 response indicates an internal Elasticsearch optimistic concurrency conflict (seq_no mismatch), not a stale external version event. Treating it as success silently drops room-array updates when concurrent modifications occur, breaking user-room sync consistency.

Keep 409-as-success for versioned index and delete actions, but either NAK/retry updates or filter them from the success path:

Proposed fix
 	if result.Status == 409 {
-		return true
+		return action != searchengine.ActionUpdate
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/handler.go` around lines 119 - 120, The current handler
treats any 409 result.Status as success unconditionally, which hides ES
optimistic-concurrency conflicts for ActionUpdate; modify the logic to only ACK
409 for actions that rely on external versioning (e.g., ActionIndex and
ActionDelete when BulkAction.Version is set) and do NOT treat 409 as success for
ActionUpdate. Locate the code checking result.Status (and the surrounding
bulk-action handling that references BulkAction.Version and the action type like
ActionUpdate/ActionIndex/ActionDelete) and change the branch so 409 returns true
only for index/delete versioned actions, while for ActionUpdate it returns false
(or schedules a retry/NAK) so conflicts are surfaced and retries occur.

Comment on lines +57 to +75
if evt.Timestamp <= 0 {
return nil, fmt.Errorf("parse member_added event: missing timestamp")
}
if len(evt.Accounts) == 0 {
return nil, fmt.Errorf("parse member_added event: empty accounts")
}
return &memberEvent{Add: &evt}, nil

case "member_removed", "member_left":
var evt model.MemberRemoveEvent
if err := json.Unmarshal(data, &evt); err != nil {
return nil, fmt.Errorf("unmarshal member_removed event: %w", err)
}
if evt.Timestamp <= 0 {
return nil, fmt.Errorf("parse member_removed event: missing timestamp")
}
if len(evt.Accounts) == 0 {
return nil, fmt.Errorf("parse member_removed event: empty accounts")
}

⚠️ Potential issue | 🟠 Major

Validate RoomID and each account before building index keys.

len(evt.Accounts) > 0 still allows [""], and missing RoomID currently flows into document IDs like account_ or empty room entries in downstream indexes. Since this parser is the NATS payload boundary for spotlight and user-room sync, reject malformed identifiers here without logging account values.

🛡️ Proposed validation
 		if evt.Timestamp <= 0 {
 			return nil, fmt.Errorf("parse member_added event: missing timestamp")
 		}
+		if evt.RoomID == "" {
+			return nil, fmt.Errorf("parse member_added event: missing roomID")
+		}
 		if len(evt.Accounts) == 0 {
 			return nil, fmt.Errorf("parse member_added event: empty accounts")
 		}
+		for i, account := range evt.Accounts {
+			if account == "" {
+				return nil, fmt.Errorf("parse member_added event: empty account at index %d", i)
+			}
+		}
 		return &memberEvent{Add: &evt}, nil
 
 	case "member_removed", "member_left":
@@
 		if evt.Timestamp <= 0 {
 			return nil, fmt.Errorf("parse member_removed event: missing timestamp")
 		}
+		if evt.RoomID == "" {
+			return nil, fmt.Errorf("parse member_removed event: missing roomID")
+		}
 		if len(evt.Accounts) == 0 {
 			return nil, fmt.Errorf("parse member_removed event: empty accounts")
 		}
+		for i, account := range evt.Accounts {
+			if account == "" {
+				return nil, fmt.Errorf("parse member_removed event: empty account at index %d", i)
+			}
+		}
 		return &memberEvent{Remove: &evt}, nil

Based on learnings, validation should be reserved for system boundaries such as deserialization or other untrusted entry points.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/room_member.go` around lines 57 - 75, The parser currently
only checks len(evt.Accounts)>0 but must also validate evt.RoomID and each
account string is non-empty; update the member_added (model.MemberAddEvent) and
member_removed/member_left (model.MemberRemoveEvent) branches where evt is used
and memberEvent is returned to: validate evt.RoomID != "" and iterate
evt.Accounts to ensure no account == ""; if any identifier is missing return a
descriptive parse error (e.g., "parse member_added event: missing room_id" or
"parse member_removed event: empty account id") and do not log or include
account values in the error message; make the same checks for both Add and
Remove flows before returning &memberEvent{...}.

Comment on lines +114 to +118
"tokenizer": map[string]any{
"custom_tokenizer": map[string]any{
"type": "whitespace",
"token_chars": []string{"letter", "digit", "punctuation", "symbol"},
},

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Elasticsearch whitespace tokenizer token_chars configuration official docs

💡 Result:

The Elasticsearch whitespace tokenizer does not have a token_chars configuration parameter. Official documentation consistently states that it only accepts one parameter: max_token_length (default 255), which splits tokens exceeding that length. The token_chars parameter does not exist for the whitespace tokenizer. It appears to be confused with the edge_ngram tokenizer's token_chars (which specifies character classes like letter, digit to include in n-grams) or the char_group tokenizer's tokenize_on_chars (which specifies characters or groups like whitespace, punctuation to split on). Official sources: Whitespace tokenizer docs confirm only max_token_length [1,2,3,6,20,21,22]. Char_group uses tokenize_on_chars [11,12,13,14]. No mention of token_chars for whitespace across versions up to current (8.x+).


Remove token_chars from the whitespace tokenizer config.

token_chars is not a valid parameter for Elasticsearch's whitespace tokenizer. The whitespace tokenizer only supports max_token_length (default 255). The token_chars parameter belongs to tokenizers such as edge_ngram and is not applicable here. This invalid configuration will cause template installation to fail at startup.

🔧 Proposed fix
 					"tokenizer": map[string]any{
 						"custom_tokenizer": map[string]any{
-							"type":        "whitespace",
-							"token_chars": []string{"letter", "digit", "punctuation", "symbol"},
+							"type": "whitespace",
 						},
 					},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@search-sync-worker/spotlight.go` around lines 114 - 118, The Elasticsearch
index template defines a "custom_tokenizer" with type "whitespace" but
incorrectly includes a "token_chars" setting; remove the "token_chars" entry
from the "custom_tokenizer" map in spotlight.go (i.e., inside the tokenizer
config where "custom_tokenizer" is defined) so only supported options (like
"type" and optionally "max_token_length") remain, ensuring the template is valid
for the whitespace tokenizer.

Implement end-to-end room-member event pipeline for search indexing:
room-worker enriches + publishes to ROOMS stream, inbox-worker
re-publishes cross-site events to local ROOMS, and search-sync-worker
consumes from ROOMS to maintain spotlight (room typeahead) and
user-room (access control) Elasticsearch indexes.

Event enrichment (pkg/model, room-worker):
- MemberAddEvent gains RoomName + RoomType fields (already loaded in
  scope at processAddMembers/processInvite time — zero extra queries)
- room-worker publishes enriched events to RoomCanonical subjects
  (chat.room.canonical.{site}.member_added/removed) which land in the
  existing ROOMS_{siteID} stream. Published alongside existing
  chat.room.{roomID}.event.member for backward compat with other
  consumers.
- processAddMembers, processRemoveIndividual, processRemoveOrg, and
  processInvite all publish to ROOMS stream.

Cross-site relay (inbox-worker):
- After handling member_added (BulkCreateSubscriptions) and
  member_removed (DeleteSubscriptionsByAccounts), inbox-worker
  re-publishes the event to the local ROOMS stream so search-sync-worker
  on the remote site picks it up. Handler gains a siteID field.

search-sync-worker:
- Collection interface: BuildAction returns []BulkAction (fan-out),
  StreamConfig returns jetstream.StreamConfig, new FilterSubjects.
- Handler: pendingMsg tracks per-message action ranges, ActionCount()
  drives flush decisions, isBulkItemSuccess with ErrorType-aware 404
  handling, natsutil.Ack/Nak helpers.
- roomMemberCollection base: StreamConfig from stream.Rooms, filter
  subjects from subject.RoomCanonicalMemberEventSubjects.
- parseMemberEvent: tagged-union parser for MemberAddEvent /
  MemberRemoveEvent (supports member_added, member_removed, member_left).
- spotlightCollection: doc key = account_roomID (composite), indexes
  userAccount/roomId/roomName/roomType/siteId/joinedAt. External
  versioning via evt.Timestamp. Restricted rooms
  (HistorySharedSince > 0) skip entire event.
- userRoomCollection: per-user rooms array with LWW timestamp guard
  in painless scripts. roomTimestamps flattened map prevents stale
  out-of-order events from corrupting state. Multi-pod safe via ES
  primary-shard atomicity + the guard. Timestamp source is
  evt.Timestamp (not JoinedAt).
- main.go: multi-collection loop with per-collection stream/consumer
  wiring. FetchBatchSize/BulkBatchSize/BulkFlushInterval config split.
  bootstrapConfig for dev-only stream creation. Fan-out-safe
  runConsumer with mid-batch flush.
- esPropertiesFromStruct[T] generic for template mapping reflection.

pkg/searchengine:
- ActionUpdate type + bulk adapter (no external versioning on _update).
- BulkResult.ErrorType for distinguishing document_missing_exception
  from index_not_found_exception on 404.

pkg/natsutil:
- Ack/Nak helpers with Acker/Naker minimal interfaces.

https://claude.ai/code/session_01XTmSpmv5dT6UXX7NpRdYqN
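The per-room LWW guard this commit message describes can be sketched like so; the painless field and param names (roomTimestamps, params.roomId, params.ts) follow the commit message, but the script body and the Go mirror of the guard are illustrative assumptions, not the actual user_room.go source:

```go
package main

import "fmt"

// memberAddedScript sketches the painless upsert guard: read the stored
// per-room timestamp, short-circuit with ctx.op = 'none' on stale events,
// otherwise add the room and record the newer timestamp.
const memberAddedScript = `
if (ctx._source.roomTimestamps == null) { ctx._source.roomTimestamps = [:]; }
long stored = ctx._source.roomTimestamps.containsKey(params.roomId)
    ? (long) ctx._source.roomTimestamps[params.roomId] : 0L;
if (params.ts <= stored) { ctx.op = 'none'; }
else {
  if (!ctx._source.rooms.contains(params.roomId)) { ctx._source.rooms.add(params.roomId); }
  ctx._source.roomTimestamps[params.roomId] = params.ts;
}`

// applyLWW mirrors the guard in plain Go: an event is applied only when its
// timestamp is strictly newer than the stored per-room timestamp, which is
// what makes out-of-order (and multi-pod concurrent) delivery safe.
func applyLWW(stored map[string]int64, roomID string, ts int64) bool {
	if ts <= stored[roomID] {
		return false // stale out-of-order event, i.e. ctx.op = 'none'
	}
	stored[roomID] = ts
	return true
}

func main() {
	ts := map[string]int64{}
	fmt.Println(applyLWW(ts, "room-1", 100)) // true: first event applies
	fmt.Println(applyLWW(ts, "room-1", 90))  // false: older event dropped
}
```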
Collaborator Author

Closing in favor of #109.

This PR's most recent commits drifted to a ROOMS-stream architecture that duplicates the existing OUTBOX/INBOX federation pipeline (same Sources + SubjectTransforms dance, different stream name). #109 returns to the original INBOX-based design from the pre-force-push state (3026f46) and adapts it to main's newer bulk member event format (Accounts []string + event-level HistorySharedSince).

Keeping this PR's discussion as historical context. See docs/superpowers/plans/2026-04-20-search-sync-inbox-recovery.md in #109 for the full architectural rationale.


Generated by Claude Code
