Skip to content

Fix zero-state chat trap + auto-approve own-node sub-pairing#457

Merged
shanselman merged 4 commits into
openclaw:masterfrom
ranjeshj:ranjeshj/fix-zero-state-chat-and-node-pairing
May 19, 2026
Merged

Fix zero-state chat trap + auto-approve own-node sub-pairing#457
shanselman merged 4 commits into
openclaw:masterfrom
ranjeshj:ranjeshj/fix-zero-state-chat-and-node-pairing

Conversation

@ranjeshj
Copy link
Copy Markdown
Contributor

Summary

Two related setup-flow bugs hit while debugging a fresh install against a WSL gateway. Both fixed here, plus tests.

Bug 1 — Zero-state chat trap (commit fd9e2a9)

On a fresh install with zero gateway sessions, the user landed on a welcome screen with three suggestion buttons that did nothing and a hidden composer.

Root cause (validated by 3-model fresh root-cause analysis): OpenClawChatDataProvider maintained a C# replica of gateway session state keyed by threadId that conflated UI selection token with server-assigned sessionKey. CreateThreadAsync synthesized a thread with Id = "main" (literal), but the gateway's hello-ok publishes BOTH sessionDefaults.mainKey = "main" (legacy alias) and sessionDefaults.mainSessionKey = "agent:main:main" (canonical), and the parser was reading the alias. Optimistic local entries landed under "main" while gateway echoes used "agent:main:main" → two timelines, message orphaned, UI stuck.

Fix:

  • Deleted IChatDataProvider.CreateThreadAsync (a fiction — the gateway has no session.create RPC).
  • Added ChatComposeTarget {SessionKey, IsReady} to ChatDataSnapshot as a first-class "where to send" concept.
  • TryGetHandshakeMainSessionKey now prefers the canonical key.
  • OpenClawGatewayClient._mainSessionKeystring? (no "main" default), reset on disconnect, Volatile.Read/Write-protected.
  • Extracted ResolveEffectiveSessionKey — throws InvalidOperationException instead of silently substituting a literal.
  • HandleChatEvent / HandleAgentEvent no longer substitute "main"/"unknown" for missing sessionKey; provider drops events with empty keys.
  • OpenClawChatRoot: deleted StartFirstChat and safety-net fallback (now structurally impossible); effectiveThread = selectedThread ?? composeOnlyThread; composer visible whenever ComposeTarget.IsReady; firstSendInFlight debounce only clears when the snapshot's compose-target timeline has the optimistic entry.
  • Friendlier zero-state suggestions ("Say hi 👋", "What can you do?", "Give me a quick tour of OpenClaw").

Followed by an adversarial dual-model code review which surfaced 5 HIGH-consensus issues, all addressed.

Bug 2 — Node pairing pending forever (commit fa421a3)

On the same install, after the device was paired, the gateway showed the Windows node as Connected but with caps: [] and commands: []. Agent capabilities silently inert.

Root cause: the gateway maintains two separate tables — devices/paired.json (device-level identity) and nodes/paired.json (node-sub-pairing with caps/commands). The device-pair flow had auto-approved the device, but the node-sub-pairing was sitting in nodes/pending.json indefinitely. The existing OnNodePairingStatusChanged auto-approve only triggers from the node-side WindowsNodeClient.PairingStatusChanged event, which fires Pending only when the node-side connection itself is pending. With the device already paired, the node-side client saw Paired and never raised Pending — the only signal for the missing node-sub-pairing was the operator-side NodePairListUpdated broadcast, which wasn't wired to auto-approve.

Fix:

  • New TryOperatorAutoApproveOwnNodePairAsync in GatewayConnectionManager. Subscribes to NodePairListUpdated on every new operator client. When a pending entry's NodeId matches our own _nodeConnector.NodeDeviceId and we have operator.admin/operator.pairing scope, calls NodePairApproveAsync. Reuses _autoApproveInFlight CAS guard and _lastAutoApprovedRequestId dedup.
  • InstancesPage: caution-yellow banner at top when any pair is pending, with contextual subtext when our own node is the one waiting. Button navigates to ConnectionPage where the existing per-row approve/reject UI lives.
  • ConnectionPage.Initialize(): eagerly calls RequestNodePairListAsync + RequestDevicePairListAsync, and pushes already-loaded AppState pending lists into the banner immediately.

Tests: 4 new cases in NodePairAutoApproveTests covering own-node approve, other-node ignored, missing scope, and dedup-on-rebroadcast.

Validation

  • ./build.ps1
  • OpenClaw.Shared.Tests: 1782 passed / 28 skipped ✅
  • OpenClaw.Tray.Tests: 1096 passed ✅
  • OpenClaw.Connection.Tests: 228 passed ✅
  • Live end-to-end against a WSL gateway: zero-state chat now sends correctly with the canonical key; gateway's node.list went from caps: [] to all 8 caps populated after launch.

Files changed

  • src/OpenClaw.Chat/ChatModels.cs — delete CreateThreadAsync, add ChatComposeTarget
  • src/OpenClaw.Shared/OpenClawGatewayClient.cs — canonical key parse, nullable + Volatile main session key, ResolveEffectiveSessionKey helper, drop "main"/"unknown" fallbacks
  • src/OpenClaw.Shared/IOperatorGatewayClient.cs — expose MainSessionKey + HasHandshakeSnapshot
  • src/OpenClaw.Connection/GatewayConnectionManager.cs — operator-side auto-approve path
  • src/OpenClaw.Tray.WinUI/Chat/{IChatGatewayBridge, OpenClawChatDataProvider, OpenClawChatRoot}.cs — provider refactor, UI compose-target wiring
  • src/OpenClaw.Tray.WinUI/Pages/{ConnectionPage, InstancesPage}.{xaml, xaml.cs} — pending-pair surfacing
  • tests/OpenClaw.Shared.Tests/OpenClawGatewayClientSessionKeyTests.cs — new
  • tests/OpenClaw.Connection.Tests/NodePairAutoApproveTests.cs — +4 cases for operator-side path
  • tests/OpenClaw.Tray.Tests/OpenClawChatDataProviderTests.cs — replaced 2 obsolete tests with 7 new ones for compose-target behavior

🦞

ranjeshj and others added 4 commits May 18, 2026 18:00
…nonical sessionKey

Root cause: OpenClawChatDataProvider maintained a C# replica of gateway session
state keyed by 'threadId' that conflated UI selection token with server-assigned
sessionKey. On fresh install (zero sessions), CreateThreadAsync synthesized a
ChatThread with Id='main' (literal). But the gateway's hello-ok publishes BOTH
sessionDefaults.mainKey ('main' alias) AND sessionDefaults.mainSessionKey
('agent:main:main' canonical), and the parser was reading the alias. Optimistic
local entries landed under 'main' while gateway echoes used 'agent:main:main' —
two timelines, message orphaned, UI stuck on the welcome screen with no composer.

Fix (real, not a workaround — three-model agreement, dual-model code review):

Protocol layer (OpenClawGatewayClient):
- TryGetHandshakeMainSessionKey now prefers canonical mainSessionKey over alias.
- _mainSessionKey becomes string? (no 'main' default), with Volatile.Read/Write
  for cross-thread publication.
- _hasHandshakeSnapshot tracks handshake completion; reset on disconnect.
- Extracted ResolveEffectiveSessionKey helper — throws InvalidOperationException
  if no canonical key resolved. No more silent fallback to a stale literal.
- HandleChatEvent/HandleAgentEvent stop substituting 'main'/'unknown' for empty
  sessionKey — pass through so the provider can surface the protocol gap.

Contract (IChatDataProvider):
- Deleted CreateThreadAsync. The gateway has no session.create RPC; pretending
  it does was the original sin.
- Added ChatComposeTarget {SessionKey, IsReady} on ChatDataSnapshot — first-class
  'where to send next' concept, distinct from 'what sessions exist'.

Provider (OpenClawChatDataProvider):
- Deleted CreateThreadAsync impl, ResolveMainOrFirstThreadLocked, and every
  ?? 'main' fallback in event handlers (5 sites). Empty SessionKey now drops
  the event with a warning.
- BuildSnapshotLocked projects ComposeTarget from bridge state and injects a
  synthetic ChatThread with Id=canonicalKey when the compose key has optimistic
  entries but no SessionInfo yet — so SessionsUpdated later replaces it in
  place with no re-keying or migration.
- ResolveDefaultThreadIdLocked uses SessionInfo.IsMain instead of string-equal
  to literal 'main'.

UI (OpenClawChatRoot):
- Deleted StartFirstChat and the safety-net fallback (now structurally
  impossible to fire).
- effectiveThread = selectedThread ?? composeOnlyThread; composer visible
  whenever ComposeTarget.IsReady.
- Friendlier zero-state suggestions ('Say hi 👋', 'What can you do?',
  'Give me a quick tour of OpenClaw').
- firstSendInFlight debounce clears only when the snapshot's compose-target
  timeline has the optimistic user entry — not on every unrelated snapshot
  (presence, models, channel health).

Tests: deleted 2 obsolete CreateThreadAsync tests; added 7 provider tests
covering fresh-install compose target, optimistic-by-canonical-key routing,
SessionsUpdated reconciliation, empty-sessionKey drop, canonical echo merge,
IsMain default resolution. Added 6-case OpenClawGatewayClientSessionKeyTests
exercising the ResolveEffectiveSessionKey throw directly.

Validation: build OK, Shared.Tests 1782/1810 (28 skipped), Tray.Tests 1096/1096.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nding state

Bug: on a fresh install where the device is already paired, the Windows node
sub-pairing (a separate gateway-side table, /home/openclaw/.openclaw/nodes/
paired.json) can sit empty indefinitely. The node connects, advertises its
caps/commands in the connect frame, gets into the gateway's pending list — but
the gateway returns caps:[], commands:[] in node.list responses until the
node-sub-pairing is approved. Result: tray menu shows the node Connected
with "Capabilities (0) · Commands (0)", agent capabilities silently inert.

Why the existing auto-approve didn't fire: GatewayConnectionManager's
OnNodePairingStatusChanged only triggers on the node-side
WindowsNodeClient.PairingStatusChanged event, which fires Pending only when
the node-side connection itself is pending. With the device already paired,
the node-side client sees PairingStatus.Paired and never raises Pending — the
only signal for the missing node-sub-pairing is the operator-side
NodePairListUpdated broadcast.

Fix (3 parts):

1. GatewayConnectionManager: subscribe NodePairListUpdated on every new
   operator client. When a pending entry's NodeId matches our own
   _nodeConnector.NodeDeviceId AND we have operator.admin/pairing scope,
   call NodePairApproveAsync. Reuses _autoApproveInFlight CAS guard and
   _lastAutoApprovedRequestId dedup. After approval, restart the node
   connection so caps propagate.

2. InstancesPage: caution-yellow banner at top of the Nodes page when any
   node or device pair is pending. Subtext is contextual ("This node is
   connected but its capabilities won't activate…" when our own deviceId
   is the pending one). Button navigates to ConnectionPage where the
   existing per-row approve/reject UI lives.

3. ConnectionPage: Initialize() now eagerly calls RequestNodePairListAsync
   + RequestDevicePairListAsync, and pushes any already-loaded AppState
   pending lists into the banner immediately, so the user sees pending
   state on first navigation instead of waiting for the gateway's next
   broadcast.

Tests: 4 new cases in NodePairAutoApproveTests cover own-node approve,
other-node ignored, missing scope, and dedup-on-rebroadcast. All 228
connection tests pass; Shared 1782; Tray 1096.

Live verified end-to-end: after launching the new build against a WSL
gateway where the node was stuck at caps:[], the operator-side
auto-approve fired and node.list now reports caps:[app, browser, camera,
canvas, device, location, screen, system] and 11 commands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from dual-model adversarial review of fa421a3:

H1 — Spin loop on approval failure: _lastAutoApprovedRequestId was set
only inside if (approved). A failed approve (transient or policy
reject) meant every subsequent NodePairListUpdated re-broadcast retried
the same id forever. Now always recorded in finally, with a re-check
against the current generation so a reconnect-during-await doesn't
overwrite the new-generation null with a stale id. Same fix applied to
the pre-existing node-side path.

H2 — �reak after first attempt stranded second own-node pending: on
exception/rejection the next own-pending in the same snapshot was never
attempted (gateway might not re-broadcast if approve frame round-tripped
and was rejected mid-flight). Now continue to scan all own-pending
entries, then perform a single post-approval reconnect for the
last-approved id (rather than per-entry, which would race with itself).

H3 — CAS guard held across Task.Delay + StartNodeConnectionAsync (5–30s
on WSL cold-start), starving unrelated approvals: moved the
_autoApproveInFlight clear into a inally scoped JUST to the approve
RPC. Post-approve reconnect runs outside the guard. Same on both paths.

M1 — Stale _lastAutoApprovedRequestId write across generations: now
re-checks _generation == gen before writing so a disconnect/reconnect
during the approve await doesn't blow away the new generation's clean
slate.

M4 — Reflection-based test event raiser: added
[InternalsVisibleTo("OpenClaw.Connection.Tests")] +
internal RaiseNodePairListUpdatedForTests on OpenClawGatewayClient.
Test helper FireNodePairListUpdated now delegates to that instead of
poking the private event backing field via reflection (which silently
breaks the moment events grow explicit add/remove blocks).

M3 — Eager refresh swallowed exceptions silently: ConnectionPage and
InstancesPage Task.Run blocks now log via Services.Logger.Warn instead
of swallowing, matching the comment's stated intent.

Validation: Connection 228 / Shared 1782 / Tray 1096, all pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Only record a node pairing request as deduped after an actual approve attempt. This keeps the operator-side fallback able to approve the same request once the operator client is ready or gains approval scope.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@shanselman
Copy link
Copy Markdown
Contributor

@ranjeshj thank you - this is good work and I'm merging it.

What landed well:

  • The zero-session chat fix is the right shape: the UI no longer invents a main session, and ComposeTarget cleanly separates "ready to send" from "a server-side session row exists."
  • The canonical session-key handling fixes the timeline split that caused optimistic UI to drift from gateway echoes.
  • The operator-side own-node auto-approve path is appropriately scoped: it requires approval scope and only approves pending requests whose NodeId matches this tray's local node id.
  • The last patch also fixed a subtle retry/dedupe bug: a node-side auto-approve path that skipped because the operator client lacked scope no longer marks that request as deduped, so the operator-side fallback can still approve it once scope/client readiness is available.
  • The tests cover the important positive/negative pairing cases and chat provider behavior.

Validation I ran locally on the final head $(@{author=; headRefOid=3e713d97c147ea08d826fe162a005b21446147bf; mergeStateStatus=CLEAN; mergedAt=; state=OPEN; statusCheckRollup=System.Object[]; title=Fix zero-state chat trap + auto-approve own-node sub-pairing; url=https://github.com/openclaw/openclaw-windows-node/pull/457}.headRefOid.Substring(0,8)):

  • ./build.ps1 - passed
  • Shared tests - 1810 total, 1782 succeeded, 28 skipped
  • Tray tests - 1096 total, 0 failed
  • Focused Connection node auto-approve tests - passed
  • GitHub checks are green and merge state is clean

I opened follow-up issues for the non-blocking review items so they don't get lost:

None of those should block this PR. Thanks again for moving quickly and for covering the own-node safety boundaries.

@shanselman shanselman merged commit 7747e60 into openclaw:master May 19, 2026
11 checks passed
shanselman added a commit that referenced this pull request May 20, 2026
Roll back the PR #457 operator-side NodePairListUpdated auto-approve path because it can race the easy-button WSL setup node-pairing phase by restarting the node connection while setup is waiting on it.

Keep the zero-state chat session-key fix and the read-only pending-pair UI surfacing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants