Fix zero-state chat trap + auto-approve own-node sub-pairing#457
Merged
shanselman merged 4 commits intoMay 19, 2026
Merged
Conversation
…nonical sessionKey
Root cause: OpenClawChatDataProvider maintained a C# replica of gateway session
state keyed by 'threadId' that conflated UI selection token with server-assigned
sessionKey. On fresh install (zero sessions), CreateThreadAsync synthesized a
ChatThread with Id='main' (literal). But the gateway's hello-ok publishes BOTH
sessionDefaults.mainKey ('main' alias) AND sessionDefaults.mainSessionKey
('agent:main:main' canonical), and the parser was reading the alias. Optimistic
local entries landed under 'main' while gateway echoes used 'agent:main:main' —
two timelines, message orphaned, UI stuck on the welcome screen with no composer.
Fix (real, not a workaround — three-model agreement, dual-model code review):
Protocol layer (OpenClawGatewayClient):
- TryGetHandshakeMainSessionKey now prefers canonical mainSessionKey over alias.
- _mainSessionKey becomes string? (no 'main' default), with Volatile.Read/Write
for cross-thread publication.
- _hasHandshakeSnapshot tracks handshake completion; reset on disconnect.
- Extracted ResolveEffectiveSessionKey helper — throws InvalidOperationException
if no canonical key resolved. No more silent fallback to a stale literal.
- HandleChatEvent/HandleAgentEvent stop substituting 'main'/'unknown' for empty
sessionKey — pass through so the provider can surface the protocol gap.
Contract (IChatDataProvider):
- Deleted CreateThreadAsync. The gateway has no session.create RPC; pretending
it does was the original sin.
- Added ChatComposeTarget {SessionKey, IsReady} on ChatDataSnapshot — first-class
'where to send next' concept, distinct from 'what sessions exist'.
Provider (OpenClawChatDataProvider):
- Deleted CreateThreadAsync impl, ResolveMainOrFirstThreadLocked, and every
?? 'main' fallback in event handlers (5 sites). Empty SessionKey now drops
the event with a warning.
- BuildSnapshotLocked projects ComposeTarget from bridge state and injects a
synthetic ChatThread with Id=canonicalKey when the compose key has optimistic
entries but no SessionInfo yet — so SessionsUpdated later replaces it in
place with no re-keying or migration.
- ResolveDefaultThreadIdLocked uses SessionInfo.IsMain instead of string-equal
to literal 'main'.
UI (OpenClawChatRoot):
- Deleted StartFirstChat and the safety-net fallback (now structurally
impossible to fire).
- effectiveThread = selectedThread ?? composeOnlyThread; composer visible
whenever ComposeTarget.IsReady.
- Friendlier zero-state suggestions ('Say hi 👋', 'What can you do?',
'Give me a quick tour of OpenClaw').
- firstSendInFlight debounce clears only when the snapshot's compose-target
timeline has the optimistic user entry — not on every unrelated snapshot
(presence, models, channel health).
Tests: deleted 2 obsolete CreateThreadAsync tests; added 7 provider tests
covering fresh-install compose target, optimistic-by-canonical-key routing,
SessionsUpdated reconciliation, empty-sessionKey drop, canonical echo merge,
IsMain default resolution. Added 6-case OpenClawGatewayClientSessionKeyTests
exercising the ResolveEffectiveSessionKey throw directly.
Validation: build OK, Shared.Tests 1782/1810 (28 skipped), Tray.Tests 1096/1096.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nding state
Bug: on a fresh install where the device is already paired, the Windows node
sub-pairing (a separate gateway-side table, /home/openclaw/.openclaw/nodes/
paired.json) can sit empty indefinitely. The node connects, advertises its
caps/commands in the connect frame, gets into the gateway's pending list — but
the gateway returns caps:[], commands:[] in node.list responses until the
node-sub-pairing is approved. Result: tray menu shows the node Connected
with "Capabilities (0) · Commands (0)", agent capabilities silently inert.
Why the existing auto-approve didn't fire: GatewayConnectionManager's
OnNodePairingStatusChanged only triggers on the node-side
WindowsNodeClient.PairingStatusChanged event, which fires Pending only when
the node-side connection itself is pending. With the device already paired,
the node-side client sees PairingStatus.Paired and never raises Pending — the
only signal for the missing node-sub-pairing is the operator-side
NodePairListUpdated broadcast.
Fix (3 parts):
1. GatewayConnectionManager: subscribe NodePairListUpdated on every new
operator client. When a pending entry's NodeId matches our own
_nodeConnector.NodeDeviceId AND we have operator.admin/pairing scope,
call NodePairApproveAsync. Reuses _autoApproveInFlight CAS guard and
_lastAutoApprovedRequestId dedup. After approval, restart the node
connection so caps propagate.
2. InstancesPage: caution-yellow banner at top of the Nodes page when any
node or device pair is pending. Subtext is contextual ("This node is
connected but its capabilities won't activate…" when our own deviceId
is the pending one). Button navigates to ConnectionPage where the
existing per-row approve/reject UI lives.
3. ConnectionPage: Initialize() now eagerly calls RequestNodePairListAsync
+ RequestDevicePairListAsync, and pushes any already-loaded AppState
pending lists into the banner immediately, so the user sees pending
state on first navigation instead of waiting for the gateway's next
broadcast.
Tests: 4 new cases in NodePairAutoApproveTests cover own-node approve,
other-node ignored, missing scope, and dedup-on-rebroadcast. All 228
connection tests pass; Shared 1782; Tray 1096.
Live verified end-to-end: after launching the new build against a WSL
gateway where the node was stuck at caps:[], the operator-side
auto-approve fired and node.list now reports caps:[app, browser, camera,
canvas, device, location, screen, system] and 11 commands.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from dual-model adversarial review of fa421a3: H1 — Spin loop on approval failure: _lastAutoApprovedRequestId was set only inside if (approved). A failed approve (transient or policy reject) meant every subsequent NodePairListUpdated re-broadcast retried the same id forever. Now always recorded in finally, with a re-check against the current generation so a reconnect-during-await doesn't overwrite the new-generation null with a stale id. Same fix applied to the pre-existing node-side path. H2 — �reak after first attempt stranded second own-node pending: on exception/rejection the next own-pending in the same snapshot was never attempted (gateway might not re-broadcast if approve frame round-tripped and was rejected mid-flight). Now continue to scan all own-pending entries, then perform a single post-approval reconnect for the last-approved id (rather than per-entry, which would race with itself). H3 — CAS guard held across Task.Delay + StartNodeConnectionAsync (5–30s on WSL cold-start), starving unrelated approvals: moved the _autoApproveInFlight clear into a inally scoped JUST to the approve RPC. Post-approve reconnect runs outside the guard. Same on both paths. M1 — Stale _lastAutoApprovedRequestId write across generations: now re-checks _generation == gen before writing so a disconnect/reconnect during the approve await doesn't blow away the new generation's clean slate. M4 — Reflection-based test event raiser: added [InternalsVisibleTo("OpenClaw.Connection.Tests")] + internal RaiseNodePairListUpdatedForTests on OpenClawGatewayClient. Test helper FireNodePairListUpdated now delegates to that instead of poking the private event backing field via reflection (which silently breaks the moment events grow explicit add/remove blocks). M3 — Eager refresh swallowed exceptions silently: ConnectionPage and InstancesPage Task.Run blocks now log via Services.Logger.Warn instead of swallowing, matching the comment's stated intent. Validation: Connection 228 / Shared 1782 / Tray 1096, all pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only record a node pairing request as deduped after an actual approve attempt. This keeps the operator-side fallback able to approve the same request once the operator client is ready or gains approval scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This was referenced May 19, 2026
Contributor
|
@ranjeshj thank you - this is good work and I'm merging it. What landed well:
Validation I ran locally on the final head $(@{author=; headRefOid=3e713d97c147ea08d826fe162a005b21446147bf; mergeStateStatus=CLEAN; mergedAt=; state=OPEN; statusCheckRollup=System.Object[]; title=Fix zero-state chat trap + auto-approve own-node sub-pairing; url=https://github.com/openclaw/openclaw-windows-node/pull/457}.headRefOid.Substring(0,8)):
I opened follow-up issues for the non-blocking review items so they don't get lost:
None of those should block this PR. Thanks again for moving quickly and for covering the own-node safety boundaries. |
8 tasks
shanselman
added a commit
that referenced
this pull request
May 20, 2026
Roll back the PR #457 operator-side NodePairListUpdated auto-approve path because it can race the easy-button WSL setup node-pairing phase by restarting the node connection while setup is waiting on it. Keep the zero-state chat session-key fix and the read-only pending-pair UI surfacing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two related setup-flow bugs hit while debugging a fresh install against a WSL gateway. Both fixed here, plus tests.
Bug 1 — Zero-state chat trap (commit fd9e2a9)
On a fresh install with zero gateway sessions, the user landed on a welcome screen with three suggestion buttons that did nothing and a hidden composer.
Root cause (validated by 3-model fresh root-cause analysis): OpenClawChatDataProvider maintained a C# replica of gateway session state keyed by
threadIdthat conflated UI selection token with server-assignedsessionKey.CreateThreadAsyncsynthesized a thread withId = "main"(literal), but the gateway's hello-ok publishes BOTHsessionDefaults.mainKey = "main"(legacy alias) andsessionDefaults.mainSessionKey = "agent:main:main"(canonical), and the parser was reading the alias. Optimistic local entries landed under"main"while gateway echoes used"agent:main:main"→ two timelines, message orphaned, UI stuck.Fix:
IChatDataProvider.CreateThreadAsync(a fiction — the gateway has nosession.createRPC).ChatComposeTarget {SessionKey, IsReady}toChatDataSnapshotas a first-class "where to send" concept.TryGetHandshakeMainSessionKeynow prefers the canonical key.OpenClawGatewayClient._mainSessionKey→string?(no"main"default), reset on disconnect,Volatile.Read/Write-protected.ResolveEffectiveSessionKey— throwsInvalidOperationExceptioninstead of silently substituting a literal.HandleChatEvent/HandleAgentEventno longer substitute"main"/"unknown"for missingsessionKey; provider drops events with empty keys.OpenClawChatRoot: deletedStartFirstChatand safety-net fallback (now structurally impossible);effectiveThread = selectedThread ?? composeOnlyThread; composer visible wheneverComposeTarget.IsReady;firstSendInFlightdebounce only clears when the snapshot's compose-target timeline has the optimistic entry.Followed by an adversarial dual-model code review which surfaced 5 HIGH-consensus issues, all addressed.
Bug 2 — Node pairing pending forever (commit fa421a3)
On the same install, after the device was paired, the gateway showed the Windows node as Connected but with
caps: []andcommands: []. Agent capabilities silently inert.Root cause: the gateway maintains two separate tables —
devices/paired.json(device-level identity) andnodes/paired.json(node-sub-pairing with caps/commands). The device-pair flow had auto-approved the device, but the node-sub-pairing was sitting innodes/pending.jsonindefinitely. The existingOnNodePairingStatusChangedauto-approve only triggers from the node-sideWindowsNodeClient.PairingStatusChangedevent, which firesPendingonly when the node-side connection itself is pending. With the device already paired, the node-side client sawPairedand never raisedPending— the only signal for the missing node-sub-pairing was the operator-sideNodePairListUpdatedbroadcast, which wasn't wired to auto-approve.Fix:
TryOperatorAutoApproveOwnNodePairAsyncinGatewayConnectionManager. Subscribes toNodePairListUpdatedon every new operator client. When a pending entry'sNodeIdmatches our own_nodeConnector.NodeDeviceIdand we haveoperator.admin/operator.pairingscope, callsNodePairApproveAsync. Reuses_autoApproveInFlightCAS guard and_lastAutoApprovedRequestIddedup.InstancesPage: caution-yellow banner at top when any pair is pending, with contextual subtext when our own node is the one waiting. Button navigates toConnectionPagewhere the existing per-row approve/reject UI lives.ConnectionPage.Initialize(): eagerly callsRequestNodePairListAsync+RequestDevicePairListAsync, and pushes already-loadedAppStatepending lists into the banner immediately.Tests: 4 new cases in
NodePairAutoApproveTestscovering own-node approve, other-node ignored, missing scope, and dedup-on-rebroadcast.Validation
./build.ps1✅OpenClaw.Shared.Tests: 1782 passed / 28 skipped ✅OpenClaw.Tray.Tests: 1096 passed ✅OpenClaw.Connection.Tests: 228 passed ✅node.listwent fromcaps: []to all 8 caps populated after launch.Files changed
src/OpenClaw.Chat/ChatModels.cs— deleteCreateThreadAsync, addChatComposeTargetsrc/OpenClaw.Shared/OpenClawGatewayClient.cs— canonical key parse, nullable + Volatile main session key,ResolveEffectiveSessionKeyhelper, drop"main"/"unknown"fallbackssrc/OpenClaw.Shared/IOperatorGatewayClient.cs— exposeMainSessionKey+HasHandshakeSnapshotsrc/OpenClaw.Connection/GatewayConnectionManager.cs— operator-side auto-approve pathsrc/OpenClaw.Tray.WinUI/Chat/{IChatGatewayBridge, OpenClawChatDataProvider, OpenClawChatRoot}.cs— provider refactor, UI compose-target wiringsrc/OpenClaw.Tray.WinUI/Pages/{ConnectionPage, InstancesPage}.{xaml, xaml.cs}— pending-pair surfacingtests/OpenClaw.Shared.Tests/OpenClawGatewayClientSessionKeyTests.cs— newtests/OpenClaw.Connection.Tests/NodePairAutoApproveTests.cs— +4 cases for operator-side pathtests/OpenClaw.Tray.Tests/OpenClawChatDataProviderTests.cs— replaced 2 obsolete tests with 7 new ones for compose-target behavior🦞