fix(groups): preserve authoritative names and stable gid aliases#35
Conversation
mackid1993
left a comment
There was a problem hiding this comment.
Risk Analysis
1. participantSetsMatch signature change is a breaking semantic tightening (medium risk)
The old participantSetsMatch(a, b) allowed any single-member difference (diff <= 1). The new participantSetsMatch(a, b, selfHandle) only allows diff == 1 when the differing member is normalizedSelf. This is the correct fix for the reported bug (non-self differences wrongly passing), but it is a behavioral change that flows through 6+ call sites simultaneously:
resolveExistingGroupPortalID(2 paths)resolveExistingGroupByGidsteps 2, 3, and 4findPortalIDsByParticipants(cloud_chat DB)guidCacheMatchIsStale(new helper, wrapsparticipantSetsMatch)
If any call site passes an empty selfHandle, participantSetsMatch now returns false for ANY diff >= 1 (the normalizedSelf != "" guard). This is safe because the only empty-selfHandle path is guidCacheMatchIsStale which returns false (not stale) when portalIDStr has no comma — and participantSetsMatch is never reached with an empty handle. However, a future caller that passes "" will silently get strict-equality semantics, which could cause hard-to-diagnose portal creation failures. Consider adding a brief doc comment on participantSetsMatch noting this behavior.
2. resolveExistingGroupByGid stale-guid self-heal acquires write lock inside read-lock iteration (medium risk)
In step 1 of resolveExistingGroupByGid, the code iterates imGroupGuids under RLock, then on a stale match: releases RLock, acquires write Lock to delete the entry, releases write Lock, does DB work, then re-acquires RLock and continues. This pattern is already used elsewhere in the file (pre-existing), so it is consistent. However, the map mutation + re-iteration combination means the for loop may visit already-deleted entries or skip entries due to Go's map iteration semantics after mutation. In practice this is benign (worst case: a stale entry is visited twice or skipped, caught on the next message), but worth noting.
3. handleReadReceipt / handleTyping staleness check calls guidCacheMatchIsStale while holding imGroupGuidsMu.RLock (low risk)
The diff shows guidCacheMatchIsStale is called inside the for portalIDStr, guid := range c.imGroupGuids loop. guidCacheMatchIsStale itself does NOT acquire imGroupGuidsMu (it only reads the portal ID string and calls normalizeIdentifierForPortalID + participantSetsMatch), so there is no deadlock. This is safe.
4. makePortalKey gid-alias staleness eviction (low risk)
The new code in makePortalKey canonicalizes participants before the staleness check. The guard len(participants) > 0 && len(canonical) > 0 prevents eviction when no participant info is available (e.g., some APNs messages). The compare-before-delete pattern (if c.gidAliases[gidID] == aliasedID) is correct for preventing races. Good defensive coding.
Complexity Review
1. buildCanonicalParticipantList extraction — good simplification
Extracting the repeated normalize/dedup/sort logic into a shared helper reduces four copy-pasted blocks to one. Clean and correct.
2. resolveGroupName returning (string, bool) — appropriate
The authoritative flag is well-motivated: it prevents refreshGroupPortalNamesFromContacts from clobbering user-set names with contact-derived fallbacks. The change is minimal and targeted.
3. resolveExistingGroupByGid step 3 now prefers exact over fuzzy matches — mild added complexity, justified
The old code returned the first match (arbitrary map iteration order). The new code collects exactMatch/fuzzyMatch and prefers exact. This is a behavioral improvement that avoids non-deterministic portal selection. The two extra variables and break/continue logic are straightforward.
4. guidCacheMatchIsStale helper — clean abstraction
Used in 4 call sites (handleReadReceipt, handleTyping, resolveExistingGroupByGid, makePortalKey/makeDeletePortalKey). Logic is simple and well-documented.
5. Step 2 in resolveExistingGroupByGid now excludes comma-based portals — intentional narrowing
The comment explains that step 3 (groupPortalIndex) handles comma-based portals more reliably. This is a meaningful semantic change but the comment makes the rationale clear. No concern.
Regression Potential
1. False-negative portal matching (medium regression potential)
The tighter participantSetsMatch could cause a portal not to be found when it previously would have been, if the single differing member is NOT self. This is the intended fix (two different groups with one non-self member difference should NOT match), but if there are edge cases where Apple reports self with an unexpected identifier that doesn't match selfHandle after normalization, a valid alias could be rejected. The normalizeIdentifierForPortalID(selfHandle) call in participantSetsMatch handles tel:/mailto: normalization, but watch for:
- Edge case: user has multiple Apple IDs or handles that aren't in
isMyHandle—buildCanonicalParticipantListfilters those out, butparticipantSetsMatch's self check only knows the singleselfHandle.
2. handleReadReceipt / handleTyping case-insensitive guid comparison + staleness check
The diff changes guid == *msg.SenderGuid to strings.EqualFold(guid, *msg.SenderGuid). This is a good hardening (guids should be case-insensitive UUIDs), but it widens the match set, which is then narrowed by the new staleness check. Net effect is more correct, but if a message arrives with no participants (len(rawParticipants) == 0), guidCacheMatchIsStale returns false (not stale), so the match proceeds as before. No regression.
3. refreshGroupPortalNamesFromContacts now skips non-authoritative renames for portals with existing names
This is the core name-clobbering fix. Regression scenario: a portal has a CloudKit/cached name that was actually stale (e.g., from a previous incorrect sync), and a contact-based name would have been more correct. Now the stale authoritative name persists. This seems like the right trade-off since real renames come through APNs, but monitor for stale name reports.
4. handleParticipantChange guid cache guard
The new guard prevents caching sender_guid for portals where the gid doesn't match. This could cause a cache miss later in handleReadReceipt/handleTyping for messages that would previously have had a cached guid. In practice, the guard is narrow (only blocks non-comma, non-matching-gid portals), so impact should be minimal.
Suggested Changes
-
Minor:
participantSetsMatchtracks only the LAST diffMember when diff > 1. When both A-not-in-B and B-not-in-A contribute a diff member,diffMembergets overwritten. This is fine becausediff > 1returns false regardless, but the variable namediffMember(singular) is slightly misleading. No code change needed, but a comment like// diffMember only meaningful when diff == 1would clarify. -
Consider: add a unit test for
participantSetsMatchwith empty selfHandle. The empty-selfHandle-means-strict-equality semantics is non-obvious and could surprise future callers. -
Nit:
makeDeletePortalKeystale-alias eviction (lines around thehandleDeleteChatpath) only evicts the alias but doesn't do the DB metadata self-heal thatresolveExistingGroupByGiddoes. This is probably fine for delete operations (the portal is being deleted anyway), but worth documenting the asymmetry.
Overall Assessment
This is a well-targeted fix for two real problems (name clobbering and gid alias drift). The buildCanonicalParticipantList extraction and guidCacheMatchIsStale helper reduce duplication. The tighter participantSetsMatch semantics are correct but represent a meaningful behavioral change that should be monitored after deploy. The lock handling follows existing patterns in the codebase. Good PR.
Triage Summary
This PR (#35): Good refactor that reduces complexity. Monitor post-deploy for: tighter |
mackid1993
left a comment
There was a problem hiding this comment.
Second Risk Review
The first review by mackid1993 was thorough. I focused on areas that may have been under-examined.
1. Missing test coverage for participantSetsMatch semantic change (notable gap)
There are no unit tests for participantSetsMatch anywhere in the repo (pkg/connector/*_test.go has no coverage for it). The signature and behavior change from diff <= 1 (any member) to diff == 1 && diffMember == normalizedSelf is the highest-impact semantic change in this PR — it affects 5+ call sites across portal resolution, guid caching, read receipts, and typing indicators. The empty-selfHandle strict-equality fallback and the diffMember overwrite-on-multi-diff behavior (first review, item 1) are non-obvious edge cases that really should have test coverage before merge.
2. resolveExistingGroupByGid step 1: stale-heal does synchronous DB write on the message-handling hot path (performance concern)
The new stale-guid self-heal in step 1 does GetExistingPortalByKey + metadata mutation + stalePortal.Save(ctx) synchronously during makePortalKey resolution, which is on the critical path for every incoming group message routed through resolveExistingGroupByGid. If the DB is slow or the portal table is large, this adds latency to message delivery. The comment says "MUST be synchronous" because portalToConversation would re-poison the cache, which is valid, but this means a single stale entry causes a DB round-trip on every message until it's cleaned up. In practice this only fires once per stale entry (it deletes+clears), so the amortized cost is fine, but worth noting.
3. guidCacheMatchIsStale passes raw c.handle to participantSetsMatch — normalization happens inside, but the portal ID parts are pre-normalized (subtle correctness note)
In guidCacheMatchIsStale, portalParts comes from splitting a comma-based portal ID (already in canonical/normalized form). incomingParts are normalized from raw participants. But c.handle is passed directly to participantSetsMatch, which calls normalizeIdentifierForPortalID(selfHandle) internally. This is correct — the normalization is idempotent. However, if c.handle is ever set to an already-normalized value (e.g., tel:+1234567890), the double normalization still works. No bug here, just confirming.
4. handleParticipantChange guid cache guard could silently drop valid cache entries (low risk)
The new guard:
isOwnGidPortal := portalIDStr == "gid:"+strings.ToLower(*msg.SenderGuid)
if strings.Contains(portalIDStr, ",") || isOwnGidPortal {This means if finalPortalKey resolved to a different gid: portal (e.g., via alias resolution), the sender_guid is NOT cached for it. This prevents cache poisoning (good), but it also means that after an alias resolves a gid portal to a different gid portal, the guid cache won't be populated, and future handleReadReceipt/handleTyping cache lookups for that guid will miss. They'll fall through to other resolution paths, so correctness is maintained, but with an extra DB lookup each time.
5. resolveExistingGroupByGid step 2 narrowing — correct but asymmetric with resolveExistingGroupPortalID
Step 2 now only checks gid: portals in imGroupParticipants, deferring comma-based portals to step 3 (groupPortalIndex). However, resolveExistingGroupPortalID (the non-gid path) still uses groupPortalIndex with the old diff-tolerance logic augmented by the new participantSetsMatch guard. The two resolution paths now have subtly different matching behavior for the same participant set. This is intentional per the comments, but makes the resolution logic harder to reason about as a whole.
6. No issues found with lock ordering or deadlocks
All guidCacheMatchIsStale calls from within imGroupGuidsMu.RLock sections are safe — the helper acquires no locks. The RLock-release-WriteLock-release-RLock pattern in resolveExistingGroupByGid step 1 follows existing conventions in the codebase.
Summary
The main actionable item is adding unit tests for participantSetsMatch before merge — the semantic change is significant and the function is pure (no dependencies), making it trivial to test. The rest of the findings are low-risk observations. The overall approach is sound.
Proposed Fixes for PR #35Fix 1: Unit tests for
|
- Add table-driven unit tests for participantSetsMatch covering the regression boundary between old (diff<=1) and new (diff==1 && self) behavior. - Widen guid cache guard in handleParticipantChange from isOwnGidPortal to isGidPortal (strings.HasPrefix) to cover alias-resolved gid portals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Final Review — Opus Risk AnalysisRisk: Low-Medium | Recommendation: MERGE with monitoring ✅ This PR fixes two real user-facing bugs (custom group names being clobbered by contact-derived fallbacks, and gid alias drift to incorrect portals) with well-targeted changes. The Highest-risk change:
|
Review findings
|
participantSetsMatch previously compared the differing member against a single selfHandle string. Users with multiple Apple ID handles (phone + email) would fail the check when CloudKit stored a different handle variant than c.handle, causing duplicate portal creation. Replace the selfHandle string parameter with an isSelf func(string) bool predicate backed by c.isMyHandle, which checks against all known user handles. Update all 5 call sites and findPortalIDsByParticipants. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bug fix pushed (commit 15fd99e)
|
…cheMatchIsStale inputs participantSetsMatch now tracks whether ALL differing members are self handles, not just the last one, fixing false negatives when set A has self_phone and set B has self_email. guidCacheMatchIsStale now canonicalizes participants internally via buildCanonicalParticipantList so all callsites (makeDeletePortalKey, handleReadReceipt, handleTyping) get consistent behavior without needing to pre-canonicalize. Also adds compare-before-delete for gidAliases in makeDeletePortalKey to prevent race conditions, consistent with makePortalKey. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refreshGroupPortalNamesFromContacts was blocking all non-authoritative renames when portal.Name was non-empty. This meant portals created before contacts loaded stayed stuck with raw phone number names. Authoritative names (user-set via iMessage or CloudKit) are already protected by resolveGroupName returning them with priority, so the equality check `newName == portal.Name` is sufficient. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…yloads Typing and read-receipt payloads may only include [sender, target] without the full group roster. The previous symmetric participantSetsMatch comparison rejected valid cached comma-portal matches when the partial payload was shorter than the cached member list. Switch to a subset check: only flag a cache entry as stale when incoming participants contain members NOT present in the portal's set. Missing portal members in the payload are expected and no longer trigger rejection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
guidCacheMatchIsStale returned false for all non-comma portal IDs, which meant gid->gid aliases cached by resolveExistingGroupByGid could never be detected as stale. Once such an alias went stale, messages would route to the wrong room indefinitely. Look up the target portal's participant list from imGroupParticipants when the portal ID has no comma, so gid: aliases get the same staleness validation as comma-based portals. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The second return value was never consumed by any caller — simplify the signature to return only the name string. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…thoritative renames resolveGroupName now returns (string, bool) where the bool indicates whether the name came from an authoritative source (in-memory cache or CloudKit display_name). refreshGroupPortalNamesFromContacts uses this to skip contact-derived fallback names when the portal already has a name, preventing user-set group names from being clobbered on restart. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Title
fix(groups): preserve authoritative names and stable gid aliases
Summary
This fixes two related group-portal problems: custom group names being clobbered on restart, and gid alias / participant matching drifting onto the wrong portal.
Changes
Notes