Skip to content

fix(groups): preserve authoritative names and stable gid aliases#35

Merged
mackid1993 merged 10 commits intolrhodin:masterfrom
lwittwer:pr_group_portal_name_stability
Mar 27, 2026
Merged

fix(groups): preserve authoritative names and stable gid aliases#35
mackid1993 merged 10 commits intolrhodin:masterfrom
lwittwer:pr_group_portal_name_stability

Conversation

@lwittwer
Copy link
Copy Markdown

Title

fix(groups): preserve authoritative names and stable gid aliases

Summary

This fixes two related group-portal problems: custom group names being clobbered on restart, and gid alias / participant matching drifting onto the wrong portal.

Changes

  • treat CloudKit / cached group names as authoritative and avoid participant-fallback renames for existing rooms
  • tighten participant-set matching so diff=1 only passes for self-only differences
  • harden gid alias eviction and stale-cache self-healing
  • fold the participant-list helper refactor into the final behavior change for a smaller PR surface

Notes

  • Scope is the shared group portal logic; no chat.db-only fixes are included here.

Copy link
Copy Markdown
Collaborator

@mackid1993 mackid1993 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risk Analysis

1. participantSetsMatch signature change is a breaking semantic tightening (medium risk)

The old participantSetsMatch(a, b) allowed any single-member difference (diff <= 1). The new participantSetsMatch(a, b, selfHandle) only allows diff == 1 when the differing member is normalizedSelf. This is the correct fix for the reported bug (non-self differences wrongly passing), but it is a behavioral change that flows through 6+ call sites simultaneously:

  • resolveExistingGroupPortalID (2 paths)
  • resolveExistingGroupByGid steps 2, 3, and 4
  • findPortalIDsByParticipants (cloud_chat DB)
  • guidCacheMatchIsStale (new helper, wraps participantSetsMatch)

If any call site passes an empty selfHandle, participantSetsMatch now returns false for ANY diff >= 1 (the normalizedSelf != "" guard). This is safe because the only empty-selfHandle path is guidCacheMatchIsStale which returns false (not stale) when portalIDStr has no comma — and participantSetsMatch is never reached with an empty handle. However, a future caller that passes "" will silently get strict-equality semantics, which could cause hard-to-diagnose portal creation failures. Consider adding a brief doc comment on participantSetsMatch noting this behavior.

2. resolveExistingGroupByGid stale-guid self-heal acquires write lock inside read-lock iteration (medium risk)

In step 1 of resolveExistingGroupByGid, the code iterates imGroupGuids under RLock, then on a stale match: releases RLock, acquires write Lock to delete the entry, releases write Lock, does DB work, then re-acquires RLock and continues. This pattern is already used elsewhere in the file (pre-existing), so it is consistent. However, the map mutation + re-iteration combination means the for loop may visit already-deleted entries or skip entries due to Go's map iteration semantics after mutation. In practice this is benign (worst case: a stale entry is visited twice or skipped, caught on the next message), but worth noting.

3. handleReadReceipt / handleTyping staleness check calls guidCacheMatchIsStale while holding imGroupGuidsMu.RLock (low risk)

The diff shows guidCacheMatchIsStale is called inside the for portalIDStr, guid := range c.imGroupGuids loop. guidCacheMatchIsStale itself does NOT acquire imGroupGuidsMu (it only reads the portal ID string and calls normalizeIdentifierForPortalID + participantSetsMatch), so there is no deadlock. This is safe.

4. makePortalKey gid-alias staleness eviction (low risk)

The new code in makePortalKey canonicalizes participants before the staleness check. The guard len(participants) > 0 && len(canonical) > 0 prevents eviction when no participant info is available (e.g., some APNs messages). The compare-before-delete pattern (if c.gidAliases[gidID] == aliasedID) is correct for preventing races. Good defensive coding.

Complexity Review

1. buildCanonicalParticipantList extraction — good simplification

Extracting the repeated normalize/dedup/sort logic into a shared helper reduces four copy-pasted blocks to one. Clean and correct.

2. resolveGroupName returning (string, bool) — appropriate

The authoritative flag is well-motivated: it prevents refreshGroupPortalNamesFromContacts from clobbering user-set names with contact-derived fallbacks. The change is minimal and targeted.

3. resolveExistingGroupByGid step 3 now prefers exact over fuzzy matches — mild added complexity, justified

The old code returned the first match (arbitrary map iteration order). The new code collects exactMatch/fuzzyMatch and prefers exact. This is a behavioral improvement that avoids non-deterministic portal selection. The two extra variables and break/continue logic are straightforward.

4. guidCacheMatchIsStale helper — clean abstraction

Used in 4 call sites (handleReadReceipt, handleTyping, resolveExistingGroupByGid, makePortalKey/makeDeletePortalKey). Logic is simple and well-documented.

5. Step 2 in resolveExistingGroupByGid now excludes comma-based portals — intentional narrowing

The comment explains that step 3 (groupPortalIndex) handles comma-based portals more reliably. This is a meaningful semantic change but the comment makes the rationale clear. No concern.

Regression Potential

1. False-negative portal matching (medium regression potential)

The tighter participantSetsMatch could cause a portal not to be found when it previously would have been, if the single differing member is NOT self. This is the intended fix (two different groups with one non-self member difference should NOT match), but if there are edge cases where Apple reports self with an unexpected identifier that doesn't match selfHandle after normalization, a valid alias could be rejected. The normalizeIdentifierForPortalID(selfHandle) call in participantSetsMatch handles tel:/mailto: normalization, but watch for:

  • Edge case: user has multiple Apple IDs or handles that aren't in isMyHandlebuildCanonicalParticipantList filters those out, but participantSetsMatch's self check only knows the single selfHandle.

2. handleReadReceipt / handleTyping case-insensitive guid comparison + staleness check

The diff changes guid == *msg.SenderGuid to strings.EqualFold(guid, *msg.SenderGuid). This is a good hardening (guids should be case-insensitive UUIDs), but it widens the match set, which is then narrowed by the new staleness check. Net effect is more correct, but if a message arrives with no participants (len(rawParticipants) == 0), guidCacheMatchIsStale returns false (not stale), so the match proceeds as before. No regression.

3. refreshGroupPortalNamesFromContacts now skips non-authoritative renames for portals with existing names

This is the core name-clobbering fix. Regression scenario: a portal has a CloudKit/cached name that was actually stale (e.g., from a previous incorrect sync), and a contact-based name would have been more correct. Now the stale authoritative name persists. This seems like the right trade-off since real renames come through APNs, but monitor for stale name reports.

4. handleParticipantChange guid cache guard

The new guard prevents caching sender_guid for portals where the gid doesn't match. This could cause a cache miss later in handleReadReceipt/handleTyping for messages that would previously have had a cached guid. In practice, the guard is narrow (only blocks non-comma, non-matching-gid portals), so impact should be minimal.

Suggested Changes

  1. Minor: participantSetsMatch tracks only the LAST diffMember when diff > 1. When both A-not-in-B and B-not-in-A contribute a diff member, diffMember gets overwritten. This is fine because diff > 1 returns false regardless, but the variable name diffMember (singular) is slightly misleading. No code change needed, but a comment like // diffMember only meaningful when diff == 1 would clarify.

  2. Consider: add a unit test for participantSetsMatch with empty selfHandle. The empty-selfHandle-means-strict-equality semantics is non-obvious and could surprise future callers.

  3. Nit: makeDeletePortalKey stale-alias eviction (lines around the handleDeleteChat path) only evicts the alias but doesn't do the DB metadata self-heal that resolveExistingGroupByGid does. This is probably fine for delete operations (the portal is being deleted anyway), but worth documenting the asymmetry.

Overall Assessment

This is a well-targeted fix for two real problems (name clobbering and gid alias drift). The buildCanonicalParticipantList extraction and guidCacheMatchIsStale helper reduce duplication. The tighter participantSetsMatch semantics are correct but represent a meaningful behavioral change that should be monitored after deploy. The lock handling follows existing patterns in the codebase. Good PR.

@mackid1993
Copy link
Copy Markdown
Collaborator

Triage Summary

PR Title Risk Recommendation
#32 vCard group prefix strip Low Approve — clean 4-line fix
#34 Canonicalize DM senders in backfill Low Approve — minor follow-up for tapback sender
#37 HEIC decode when conversion disabled Low Approve — small, well-scoped
#35 Preserve group names & stable gid aliases Medium Approve with monitoring
#36 Target replies/tapbacks to correct part Medium Approve with caveat — verify transition behavior
#33 Stabilize SMS/RCS portal handling Medium-High Request changes — has actionable bugs

This PR (#35): Good refactor that reduces complexity. Monitor post-deploy for: tighter participantSetsMatch semantics (could false-negative if Apple reports self with unexpected identifier), and stale authoritative names persisting if APNs rename events are missed.

Copy link
Copy Markdown
Collaborator

@mackid1993 mackid1993 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second Risk Review

The first review by mackid1993 was thorough. I focused on areas that may have been under-examined.

1. Missing test coverage for participantSetsMatch semantic change (notable gap)

There are no unit tests for participantSetsMatch anywhere in the repo (pkg/connector/*_test.go has no coverage for it). The signature and behavior change from diff <= 1 (any member) to diff == 1 && diffMember == normalizedSelf is the highest-impact semantic change in this PR — it affects 5+ call sites across portal resolution, guid caching, read receipts, and typing indicators. The empty-selfHandle strict-equality fallback and the diffMember overwrite-on-multi-diff behavior (first review, item 1) are non-obvious edge cases that really should have test coverage before merge.

2. resolveExistingGroupByGid step 1: stale-heal does synchronous DB write on the message-handling hot path (performance concern)

The new stale-guid self-heal in step 1 does GetExistingPortalByKey + metadata mutation + stalePortal.Save(ctx) synchronously during makePortalKey resolution, which is on the critical path for every incoming group message routed through resolveExistingGroupByGid. If the DB is slow or the portal table is large, this adds latency to message delivery. The comment says "MUST be synchronous" because portalToConversation would re-poison the cache, which is valid, but this means a single stale entry causes a DB round-trip on every message until it's cleaned up. In practice this only fires once per stale entry (it deletes+clears), so the amortized cost is fine, but worth noting.

3. guidCacheMatchIsStale passes raw c.handle to participantSetsMatch — normalization happens inside, but the portal ID parts are pre-normalized (subtle correctness note)

In guidCacheMatchIsStale, portalParts comes from splitting a comma-based portal ID (already in canonical/normalized form). incomingParts are normalized from raw participants. But c.handle is passed directly to participantSetsMatch, which calls normalizeIdentifierForPortalID(selfHandle) internally. This is correct — the normalization is idempotent. However, if c.handle is ever set to an already-normalized value (e.g., tel:+1234567890), the double normalization still works. No bug here, just confirming.

4. handleParticipantChange guid cache guard could silently drop valid cache entries (low risk)

The new guard:

isOwnGidPortal := portalIDStr == "gid:"+strings.ToLower(*msg.SenderGuid)
if strings.Contains(portalIDStr, ",") || isOwnGidPortal {

This means if finalPortalKey resolved to a different gid: portal (e.g., via alias resolution), the sender_guid is NOT cached for it. This prevents cache poisoning (good), but it also means that after an alias resolves a gid portal to a different gid portal, the guid cache won't be populated, and future handleReadReceipt/handleTyping cache lookups for that guid will miss. They'll fall through to other resolution paths, so correctness is maintained, but with an extra DB lookup each time.

5. resolveExistingGroupByGid step 2 narrowing — correct but asymmetric with resolveExistingGroupPortalID

Step 2 now only checks gid: portals in imGroupParticipants, deferring comma-based portals to step 3 (groupPortalIndex). However, resolveExistingGroupPortalID (the non-gid path) still uses groupPortalIndex with the old diff-tolerance logic augmented by the new participantSetsMatch guard. The two resolution paths now have subtly different matching behavior for the same participant set. This is intentional per the comments, but makes the resolution logic harder to reason about as a whole.

6. No issues found with lock ordering or deadlocks

All guidCacheMatchIsStale calls from within imGroupGuidsMu.RLock sections are safe — the helper acquires no locks. The RLock-release-WriteLock-release-RLock pattern in resolveExistingGroupByGid step 1 follows existing conventions in the codebase.

Summary

The main actionable item is adding unit tests for participantSetsMatch before merge — the semantic change is significant and the function is pure (no dependencies), making it trivial to test. The rest of the findings are low-risk observations. The overall approach is sound.

@mackid1993
Copy link
Copy Markdown
Collaborator

Proposed Fixes for PR #35

Fix 1: Unit tests for participantSetsMatch (must-have)

Create pkg/connector/cloud_backfill_store_test.go with table-driven tests covering the regression boundary between old (diff <= 1) and new (diff == 1 && member is self) behavior.

Fix 2: Async DB write in resolveExistingGroupByGid step 1 (nice-to-have)

The synchronous GetExistingPortalByKey + portal.Save sits on the hot message-delivery path. The in-memory cache is already cleared, so the DB write is only for durability and can be async.

Fix 3: Guid cache guard in handleParticipantChange misses alias-resolved portals (should-fix)

When finalPortalKey.ID is a gid: portal resolved via gidAliases, isOwnGidPortal is false. Widen from isOwnGidPortal to isGidPortal (strings.HasPrefix(portalIDStr, "gid:")) to cover alias-resolved portals.

See commit(s) for implementation.

- Add table-driven unit tests for participantSetsMatch covering the
  regression boundary between old (diff<=1) and new (diff==1 && self)
  behavior.
- Widen guid cache guard in handleParticipantChange from isOwnGidPortal
  to isGidPortal (strings.HasPrefix) to cover alias-resolved gid portals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mackid1993
Copy link
Copy Markdown
Collaborator

Final Review — Opus Risk Analysis

Risk: Low-Medium | Recommendation: MERGE with monitoring ✅

This PR fixes two real user-facing bugs (custom group names being clobbered by contact-derived fallbacks, and gid alias drift to incorrect portals) with well-targeted changes. The buildCanonicalParticipantList extraction is a net complexity reduction (removes 4 copy-pasted blocks).

Highest-risk change: participantSetsMatch semantic tightening

Old: diff <= 1 for any member. New: diff == 1 only if the differing member is normalized self. This is the correct fix — the old behavior caused false-positive matches between genuinely different groups that differed by one non-self member. The new unit tests cover the key boundary conditions (empty selfHandle, self-only diff, non-self diff, duplicates).

Edge case to watch: If Apple reports self with an identifier that doesn't match c.handle after normalization AND isn't in isMyHandle, the new code will false-negative. Low probability but would manifest as duplicate portals.

Other changes are low-risk:

  • resolveExistingGroupByGid stale-guid self-heal is synchronous but amortized (fires once per stale entry)
  • RLock/WriteLock/RLock pattern is pre-existing; worst case is idempotent re-processing
  • refreshGroupPortalNamesFromContacts skip logic correctly preserves authoritative names
  • strings.EqualFold for guid comparison is strictly more correct

Post-merge monitoring:

  1. Watch for duplicate portals (would indicate false-negative matching)
  2. Watch for stale group name reports (would indicate missed APNs rename events)
  3. Watch for "Skipping stale guid cache entry" log warnings to validate self-heal

No blocking issues found. The reviewer feedback (tests, widened guid cache guard) has been addressed.

@mackid1993
Copy link
Copy Markdown
Collaborator

Review findings

makeDeletePortalKey multi-participant fallback produces inconsistent portal IDs (Medium)

pkg/connector/client.go ~line 1937-1947

The "last resort" multi-participant path in makeDeletePortalKey manually constructs a portal ID:

members := make([]string, 0, len(msg.DeleteChatParticipants)+1)
members = append(members, c.handle)
for _, p := range msg.DeleteChatParticipants {
    members = append(members, addIdentifierPrefix(p))
}
sort.Strings(members)

This differs from buildCanonicalParticipantList in three ways:

  1. It does not deduplicate participants (if c.handle already appears in DeleteChatParticipants, it will be included twice)
  2. It uses addIdentifierPrefix instead of normalizeIdentifierForPortalID — these produce different results for the same input
  3. It does not filter self handles from the input list before appending c.handle

This means the portal ID produced here will often not match the portal ID that makePortalKey produces for the same group, causing the delete to target a nonexistent portal and silently fail. The fix would be to use buildCanonicalParticipantList here, consistent with how handleParticipantChange was refactored.

participantSetsMatch previously compared the differing member against a
single selfHandle string. Users with multiple Apple ID handles (phone +
email) would fail the check when CloudKit stored a different handle
variant than c.handle, causing duplicate portal creation.

Replace the selfHandle string parameter with an isSelf func(string) bool
predicate backed by c.isMyHandle, which checks against all known user
handles. Update all 5 call sites and findPortalIDsByParticipants.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mackid1993
Copy link
Copy Markdown
Collaborator

Bug fix pushed (commit 15fd99e)

participantSetsMatch fails for multi-handle users (Medium)

Bug: participantSetsMatch compared the differing member against a single selfHandle string (typically c.handle, the user's primary phone number). Users with multiple Apple ID handles (phone + email) would fail the diff-of-1 check when CloudKit stored the email variant. This caused findPortalIDsByParticipants and the participant-cache lookup in resolveExistingGroupByGid to miss valid matches, potentially creating duplicate portals.

Fix: Changed participantSetsMatch to accept an isSelf func(string) bool predicate instead of a bare selfHandle string. All 5 call sites in client.go and findPortalIDsByParticipants in cloud_backfill_store.go now pass c.isMyHandle, which checks against all known user handles. Updated tests to use the predicate and added a new test case for the multi-handle scenario.

David and others added 7 commits March 26, 2026 22:45
…cheMatchIsStale inputs

participantSetsMatch now tracks whether ALL differing members are self
handles, not just the last one, fixing false negatives when set A has
self_phone and set B has self_email.

guidCacheMatchIsStale now canonicalizes participants internally via
buildCanonicalParticipantList so all callsites (makeDeletePortalKey,
handleReadReceipt, handleTyping) get consistent behavior without
needing to pre-canonicalize.

Also adds compare-before-delete for gidAliases in makeDeletePortalKey
to prevent race conditions, consistent with makePortalKey.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refreshGroupPortalNamesFromContacts was blocking all non-authoritative
renames when portal.Name was non-empty. This meant portals created
before contacts loaded stayed stuck with raw phone number names.
Authoritative names (user-set via iMessage or CloudKit) are already
protected by resolveGroupName returning them with priority, so the
equality check `newName == portal.Name` is sufficient.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…yloads

Typing and read-receipt payloads may only include [sender, target]
without the full group roster. The previous symmetric participantSetsMatch
comparison rejected valid cached comma-portal matches when the partial
payload was shorter than the cached member list.

Switch to a subset check: only flag a cache entry as stale when incoming
participants contain members NOT present in the portal's set. Missing
portal members in the payload are expected and no longer trigger rejection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
guidCacheMatchIsStale returned false for all non-comma portal IDs,
which meant gid->gid aliases cached by resolveExistingGroupByGid could
never be detected as stale. Once such an alias went stale, messages
would route to the wrong room indefinitely.

Look up the target portal's participant list from imGroupParticipants
when the portal ID has no comma, so gid: aliases get the same staleness
validation as comma-based portals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The second return value was never consumed by any caller — simplify
the signature to return only the name string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…thoritative renames

resolveGroupName now returns (string, bool) where the bool indicates
whether the name came from an authoritative source (in-memory cache or
CloudKit display_name). refreshGroupPortalNamesFromContacts uses this
to skip contact-derived fallback names when the portal already has a
name, preventing user-set group names from being clobbered on restart.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mackid1993 mackid1993 merged commit f802693 into lrhodin:master Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants