Skip to content

Connection architecture refactoring: manager-owned lifecycle, setup integration, diagnostics#304

Merged
shanselman merged 47 commits into
openclaw:masterfrom
ranjeshj:user/ranjeshj/connection2
May 10, 2026
Merged

Connection architecture refactoring: manager-owned lifecycle, setup integration, diagnostics#304
shanselman merged 47 commits into
openclaw:masterfrom
ranjeshj:user/ranjeshj/connection2

Conversation

@ranjeshj
Copy link
Copy Markdown
Contributor

Summary

Refactors the connection layer toward the northstar architecture where GatewayConnectionManager owns the complete connection lifecycle — operator client, node connector, credential resolution, state machine, SSH tunnel, and diagnostics.

Key Changes

Architecture (Phases 1-5)

  • Delete dead methods: InitializeNodeService, ReinitializeGatewayClient, ReconnectGateway, ReconnectNodeServiceOnly
  • Eliminate redundant _currentStatus\ writes — OnManagerStateChanged is sole writer
  • Move SSH tunnel lifecycle into GatewayConnectionManager via ISshTunnelManager
  • Widen IOperatorGatewayClient with 35+ request methods; change manager return type to interface
  • Eliminate _gatewayClient\ field from App.xaml.cs (38 refs → _connectionManager?.OperatorClient)

Setup Engine Integration

  • New \ConnectionManagerOperatorConnector\ — WSL local-setup operator pairing now flows through the connection manager with full diagnostics visibility
  • Suppress node auto-connect during setup (engine controls node pairing in its own phase)
  • Fix: clear setup-state.json on replace-existing-config to prevent stale completed state

Connection Page UX

  • Diagnostics button (🔍) on status card opens Connection Status window
  • Pending Device Pairing card with scope-gated Approve/Reject buttons (requires operator.admin or operator.pairing)

State Machine Fixes

  • Allow WebSocketConnected/HandshakeSucceeded from Error state (auto-reconnect path)
  • Clear OperatorPairingRequestId on non-pairing transitions
  • Wrap SetNodeEnabled in semaphore for thread safety

Security & Robustness (from 5-model adversarial review)

  • Redact auth tokens in diagnostics logs (auth payloads, signed strings, sigToken previews)
  • Unsubscribe node events before semaphore disposal in Dispose()
  • Sync-wait for node disconnect in DisposeActiveClient (2s timeout)
  • Re-enable Approve/Reject buttons on false return
  • Move auto-approve guard after confirmed success
  • Disconnect manager on setup connector timeout

Tests: +55 new tests (878 tray, 1442 shared — all passing)

  • RetryPolicyTests (35) — ShouldRetry + GetBackoffMs coverage
  • NodeConnectorTests (8) — initial state, dispose, disconnect safety
  • StaleEventGuardTests (5) — generation guard, reconnect isolation
  • PairingFlowTests (7) — node pairing state transitions

Documentation

  • Replace 6 working docs with single \docs/node-connection-architecture.md\ documenting current implementation

ranjeshj and others added 30 commits May 9, 2026 07:54
Create ICredentialResolver interface and CredentialResolver implementation
with canonical resolution: DeviceToken > SharedGatewayToken > BootstrapToken.

- Add IDeviceIdentityReader/DeviceIdentityFileReader for testable identity reads
- Add GatewayRecord/SshTunnelConfig data model
- Add 11 new credential resolver tests covering both operator and node resolution
- Create Services/Connection/ directory structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pure-logic state machine with operator and node sub-FSMs.
GatewayConnectionSnapshot immutable record with DeriveOverall().
ConnectionTrigger, OverallConnectionState, RoleConnectionState enums.
50 exhaustive transition tests covering all valid/invalid paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Factory interface wrapping OpenClawGatewayClient construction.
IGatewayClientLifecycle separates lifecycle (manager) from data (UI).
GatewayClientFactory creates real WebSocket clients.
GatewayClientLifecycleAdapter bridges the existing client to the interface.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Gateway catalog with JSON persistence (gateways.json).
GatewayRecord immutable record with SshTunnelConfig.
IFileSystem/RealFileSystem abstraction for testability.
Thread-safe with lock-protected list, events fire outside lock.
15 tests: CRUD, persistence round-trip, SSH config, URL lookup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixed-capacity ring buffer for timestamped diagnostic events.
IClock/SystemClock time abstraction for testable timestamps.
12 tests: recording, overflow, GetRecent, clear, event firing.
Phase 1 Foundation complete — all new code, no behavioral changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add DeviceTokenReceived and HandshakeSucceeded events to both
OpenClawGatewayClient and WindowsNodeClient. Events are additive —
clients continue to write tokens internally for backward compat.
Add DeviceTokenReceivedEventArgs, IDeviceIdentityStore interface.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Manager with interface, state machine, diagnostics, semaphore,
generation counter, and client creation via factory.
Handles connect/disconnect/reconnect/switch lifecycle.
Generation-guarded event handlers for stale event protection.
11 manager tests covering lifecycle, state transitions, events.
ConnectionErrorCategory, RetryPolicy, SetupCodeResult types.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add OperatorClientChangedEventArgs with OldClient/NewClient.
Manager fires event on client creation and disposal.
Enables App to re-wire data event handlers when client changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Critical switchover: App.xaml.cs no longer creates OpenClawGatewayClient
directly. InitializeGatewayClient() now bridges settings to a GatewayRecord
and delegates to _connectionManager.ConnectAsync().

- OnOperatorClientChanged re-wires all 27 data event handlers
- OnManagerStateChanged maps OverallConnectionState to legacy ConnectionStatus
- Disconnect paths use _connectionManager.DisconnectAsync()
- _gatewayClient field kept for backward compat, updated by OperatorClientChanged

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create INodeConnector interface and NodeConnector implementation.
NodeConnectionMode enum (Gateway/McpOnly/Disabled).
NodeConnector owns WindowsNodeClient lifecycle, delegates capability
setup to NodeService (which has WinUI deps and stays in App.xaml.cs).
App's InitializeNodeService() remains for now — full migration in Phase 4.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add MigrateFromSettings() to GatewayRegistry — idempotent migration
from SettingsData Token/BootstrapToken/SSH to GatewayRecord.
Identity file is copied (not moved) for rollback safety.
Mark SettingsData.Token/BootstrapToken as legacy in doc comments.
9 migration tests: creation, idempotency, SSH, identity copy, persistence.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implement ApplySetupCodeAsync in GatewayConnectionManager:
decode setup code → validate URL → disconnect → create/update
GatewayRecord → set active → save registry → connect.
All setup code entry points can now call this single method.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Step 3.2: DeviceIdentityStore implements IDeviceIdentityStore for
manager-owned token writes via DeviceIdentity.

Step 3.3: ISshTunnelManager interface wrapping SshTunnelService.
SshTunnelManager delegates to existing service.
Phase 3 Unification complete.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Delete UnsubscribeGatewayEvents() (27 unsubscribe lines — now handled
by OnOperatorClientChanged). Replace 3 remaining manual disconnect
paths with _connectionManager.DisconnectAsync(). Shutdown uses
_connectionManager.Dispose() for clean teardown.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Create IOperatorGatewayClient interface in OpenClaw.Shared with all
27 data events, query properties, and connection events.
OpenClawGatewayClient now implements this interface.
UI consumers can be typed to the interface to prevent lifecycle access.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
SettingsChangeClassifier.Classify() determines minimum reconnect action:
NoOp, UiOnly, CapabilityReload, NodeReconnect, OperatorReconnect, FullReconnect.
8 tests covering all classification categories.
All 4 phases of the northstar connection architecture are now implemented.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two-column layout adapted from openclaw-windows-node-2:
- Left: state machine visual (operator/node sub-FSMs), gateway catalog,
  credential resolution display
- Right: color-coded event timeline with copy/clear

Wired to ConnectionDiagnostics and IGatewayConnectionManager.
Accessible via Debug page button and 'connectionstatus' tray command.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adapted from node-2 ConnectionPage, wired to our architecture:
- Status card: disconnect button + connection attempts counter
- Setup code: live decode preview (shows URL + token before apply),
  wired through manager.ApplySetupCodeAsync()
- Recent gateways: list from GatewayRegistry with connect/remove buttons
- Live state: subscribes to IGatewayConnectionManager.StateChanged
- Advanced settings collapsed by default
- Removed connect toggle (replaced with explicit disconnect button)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove mDNS gateway discovery UI (scan button, gateway list, token prompt)
and all related code-behind (AutoScanAsync, PopulateGatewayList,
OnConnectToGateway, OnConnectWithToken, OnCancelTokenPrompt).
Gateway switching is now handled via the Recent Gateways card from
GatewayRegistry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ConnectionStatusWindow now has a SETUP CODE section in the left column:
- Paste setup code with live decode preview
- Connect button applies code via manager.ApplySetupCodeAsync()
- Disconnect button for teardown
- Reconnect (empty code + Connect) for quick retry

Auto-show diagnostics window on launch when no gateway is configured,
enabling end-to-end setup and testing without the Hub UI.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rors

Based on connection-analysis-report.md comparing against gateway server
and Mac/TS reference implementation:

Fix 1 (HIGH): Node client now uses SignConnectPayloadV3 instead of
legacy SignPayload/BuildDebugPayload. Gateway verifies v3 on first try.

Fix 2 (MEDIUM): Remove operator 4-mode signature cascade
(V3Auth→V3Empty→V2Auth→V2Empty). Always use v3 with auth token,
matching reference client. Signature rejection = real error.

Fix 3 (MEDIUM): Add TryGetErrorDetailCode helper for structured
error.details.code parsing (DEVICE_AUTH_SIGNATURE_INVALID, etc.).

Fix 4 (LOW): Mark SignPayload and BuildDebugPayload as [Obsolete]
on DeviceIdentity with guidance to use SignConnectPayloadV3.

Updated 2 tests to match new immediate-failure behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bugs fixed during end-to-end gateway testing:

1. Bootstrap token misclassification — setup codes always store as BootstrapToken
2. Settings bridge overwrites registry — skip if record exists
3. Stale bootstrap auto-connect — skip startup connect with expired tokens
4. Identity file not synced to per-gateway dir — copy on every connect
5. v3 signature rejected by gateway — v3→v2 fallback (flag persists per session)
6. v2 retry on same socket wastes token — set flag, let reconnect handle it
7. Bootstrap can't request operator.admin — use bounded scopes for bootstrap
8. PairingRequired shown as Error — new PairingRequired event + state
9. WS disconnect during pairing → Error — stays in PairingRequired now
10. Terminal auth errors cause reconnect loops — detect structured error codes
11. SendConnectMessageAsync crashes silently — wrapped with error logging

Also added:
- ApplySetupCodeAsync clears stored device tokens for fresh pairing
- Registry-first startup connect (don't rely on settings)
- DiagnosticTeeLogger pipes client logs to Connection Status timeline
- Comprehensive handshake logging (role, scopes, signed payload, auth, raw errors)
- Direction arrows (→ GW / ← GW) and word-wrap in timeline
- 9 setup code flow unit tests
- v3→v2 signature fallback with per-gateway persistence

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comprehensive audit of northstar implementation status:
what's done, what's wired, what's created-but-unused,
what's missing, test failures, and testing instructions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…de connector, per-gateway identity

- Fix 3 broken shared tests (GatewayClientTestHelper.Client property, v3→v2 fallback test)
- Wire SettingsChangeImpact classifier into OnSettingsSaved (no reconnect for UI-only changes)
- Wire DeviceIdentityStore into Manager for token persistence
- Finish SshTunnelManager.StartAsync (was stub)
- Wire NodeConnector into Manager: starts after operator handshake, routes events to state machine
- Share v2 signature flag from operator to node (avoid wasted v3 roundtrip)
- Remove legacy Token/BootstrapToken fields from SettingsData and SettingsManager
- Clear bootstrap token from GatewayRecord after node pairing completes
- Per-gateway identity paths: OpenClawGatewayClient accepts identityPath parameter
- Remove root identity sync — all identity operations use per-gateway directories
- Add detailed node handshake logging ([HANDSHAKE] tags) matching operator style
- Node connector uses DiagnosticTeeLogger for Connection Status window timeline
- Fix credentials display to read per-gateway identity file
- Pass bootstrap tokens via bootstrapToken parameter (not token) to WindowsNodeClient

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ConnectionStatusWindow: add ExtendsContentIntoTitleBar + icon for dark title bar
- Auto-reconnect after pairing: manager polls every 5s to check if device was approved
  When still not approved, HandlePairingRequiredAsync fires again and re-schedules
- Fix per-gateway identity comment (no longer root)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove auto-reconnect poll (reverted)
- Operator: WebSocket disconnect/error suppressed when in PairingRequired state
- Node: same — disconnect/error after pairing required preserves Pairing UI state
- Gateway closes socket after PAIRING_REQUIRED; state machine now keeps showing Pairing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Clear bootstrap token from registry after operator pairing (not node)
- Add operator device token as fallback in ResolveNode for node role-upgrade pairing
- Remove root→per-gateway identity copy from InitializeGatewayClient (per-gateway is source of truth)
- On restart: operator uses stored device token, node uses operator token for role-upgrade

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…-upgrade

- WindowsNodeClient.BuildConnectAuth: fallback to identity.DeviceToken (operator's)
  when no NodeDeviceToken/bootstrap exists — sends auth.deviceToken not auth.token
- WindowsNodeClient.ResolveRequiredCredential: same fallback to avoid throw
- Fixes 'gateway token mismatch' error when node uses operator token for pairing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Revert: clear bootstrap after node gets device token (not operator)
- Revert: remove operator device token fallback (causes 'device token mismatch')
- Device tokens are role-specific — operator token cannot auth node role
- Bootstrap token stays in registry until node completes its own pairing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove ExtendsContentIntoTitleBar (caused overlap with window controls)
- Allow PairingRequired transition from Error state (both operator and node)
  Fixes race: Error event fires before PairingRequired, blocking the transition

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ranjeshj and others added 14 commits May 9, 2026 18:10
- Add _nodePairingPending and _operatorPairingPending volatile flags
- Set flag BEFORE acquiring semaphore in pairing handlers
- Check flag BEFORE acquiring semaphore in status handlers
- Prevents Disconnected event from winning semaphore race and overwriting PairingRequired
- Flags cleared on successful Connected status

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace volatile flags with direct PairingStatus/IsPairingRequired checks
- Both are set synchronously on the WS thread before async handlers run
- Eliminates race: Disconnected handler checks connector state, not a flag
  that hasn't been set yet because both events fire in same synchronous dispatch

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix node state: add Connecting case to OnNodeStatusChanged (was missing, kept node in Idle)
- Fix pairing race: preserve _isPendingApproval in OnDisconnected when _pairingBlocked
- Fix EmitStateChanged: always fire (node sub-state changes need to reach UI)
- Allow PairingRequired transition from Error state (both operator and node)
- Suppress Disconnect/Error events when pairing is pending (check connector state directly)
- Title bar: ExtendsContentIntoTitleBar with 40px top padding + title text
- PairingStatusEventArgs: add RequestId field for future auto-approve support
- Remove failed auto-approve attempt (requires operator.pairing scope)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…button

- Replace text-only operator/node status with state machine pill rows
- Read GatewayConnectionSnapshot directly (not old ConnectionStatus enum)
- Show 'Awaiting Approval' for PairingRequired instead of 'Connecting...'
- Pairing guidance card: approval command with Copy button + Connect/Disconnect
- Fix connection counter: reset during pairing, don't increment
- Read device ID from identity file when snapshot doesn't have it yet

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Diagnostics: add Direct Connect section (URL + Token fields, clears device tokens)
- Diagnostics: pairing state feedback (Connect button changes to 'once approved')
- Node auto-approve: re-enabled with scope guard (operator.admin/operator.pairing only)
- Loop prevention: track lastAutoApprovedRequestId, one attempt per requestId
- Shared gateway tokens now request operator.admin scope (removed local-only restriction)
- Default URL: ws://localhost:18790

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace Advanced Connection Settings expander with Direct Connect card
- Gateway URL + Token fields with Connect button (same flow as diagnostics)
- Clears stored device tokens so shared token is used with admin scope
- Saves SSH tunnel settings alongside
- Default URL: ws://localhost:18790

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Store SshTunnelConfig on GatewayRecord when SSH is enabled
- Start SSH tunnel via App.EnsureSshTunnelStarted() before connecting
- Save SSH settings to SettingsManager for legacy compat
- Show 'Starting SSH tunnel...' status during tunnel setup

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Manager: use ws://localhost:{localPort} when record has SshTunnelConfig (operator + node)
- Diagnostics: add SSH toggle + fields to Direct Connect section
- Diagnostics: OnDirectConnect stores SshTunnelConfig on GatewayRecord, starts tunnel
- Connection page: same flow (already wired)
- SSH uses OS ssh client with key-based auth (no password in app)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase A: WSL Local Setup
- SettingsManagerLocalGatewaySetupSettings syncs Token/BootstrapToken to GatewayRegistry on Save
- CreateLocalOnly accepts optional GatewayRegistry parameter
- App passes _gatewayRegistry to the factory

Phase B: Onboarding Flow
- Onboarding ConnectionPage uses Direct Connect flow: creates GatewayRecord + manager.ConnectAsync
- Polls manager.CurrentSnapshot instead of app.GatewayClient
- LocalSetupProgressPage uses manager.ReconnectAsync instead of ReinitializeGatewayClient
- Expose App.ConnectionManager as public property

Other:
- Shared gateway tokens now request operator.admin scope on all gateways (not just local)
- Update test: FreshStandardRemoteDevice now expects admin scopes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace 8 InitializeGatewayClient/InitializeNodeService call sites with
_connectionManager.ReconnectAsync():
- Tray connect toggle
- Hub connect action
- OnSettingsSaved full reconnect
- ReconnectGateway
- ReconnectNodeServiceOnly
- SSH tunnel restart
- Onboarding completion
- Startup: kept InitializeGatewayClient but removed manual node init

-68 lines from App.xaml.cs (4735 → 4214 cumulative this session)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…age UX

Phase 1: Delete dead methods (InitializeNodeService, ReinitializeGatewayClient,
ReconnectGateway, ReconnectNodeServiceOnly) and inline callers.

Phase 2: Make OnManagerStateChanged the sole writer of _currentStatus,
removing redundant writes from event handlers and disconnect actions.

Phase 3: Move SSH tunnel lifecycle into GatewayConnectionManager via
ISshTunnelManager. Manager starts/stops tunnel in ConnectAsync/DisconnectAsync.
Split ISshTunnelManager interface from SshTunnelManager implementation.

Phase 4: Widen IOperatorGatewayClient with 35+ request methods (sessions,
usage, config, cron, skills, pairing, channels, wizard). Change
IGatewayConnectionManager.OperatorClient return type to interface.

Phase 5: Eliminate _gatewayClient field from App.xaml.cs, replacing 38 refs
with _connectionManager?.OperatorClient. Update HubWindow and OnboardingState
to use IOperatorGatewayClient.

Phase 6: Add 55 new tests — RetryPolicyTests (35), NodeConnectorTests (8),
StaleEventGuardTests (5), PairingFlowTests (7).

Connection page UX: Add diagnostics button (opens Connection Status window),
add Pending Device Pairing card with scope-gated Approve/Reject buttons
(requires operator.admin or operator.pairing scope).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…le completed state

When replaceExistingConfigurationConfirmed=true, the engine loaded a
completed setup-state.json from a previous run and had nothing to do,
leaving the progress page stuck with all stages showing empty circles.

Also set AllowExistingDistro=true when replacing config since the WSL
distro already exists from the previous run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the setup engine's throwaway OpenClawGatewayOperatorConnector with
ConnectionManagerOperatorConnector that delegates to the app's
GatewayConnectionManager. All operator handshake/pairing events now appear
in the Connection Status diagnostics window.

Key changes:
- Add OperatorPairingRequestId to GatewayConnectionSnapshot so the setup
  engine can pass requestId to WSL CLI for explicit device approval
- ConnectionManagerOperatorConnector: implements IGatewayOperatorConnector
  via the manager (registry record setup, ConnectAsync, state change wait)
- Suppress node auto-connect during setup via _suppressNodeDuringSetup flag
  (cleared when engine completes/fails/cancels)
- Fix state machine: allow WebSocketConnected and HandshakeSucceeded from
  Error state to handle WebSocketClientBase's built-in auto-reconnect
- Setup engine factory accepts optional operatorConnectorOverride parameter

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from 5-model adversarial code review:

- SSH tunnel lifecycle: don't stop on disconnect, only on gateway switch
  or dispose. Add browser proxy forward to SshTunnelConfig.
- Suppress flag safety: try/catch on engine construction, stale engine
  guard prevents wrong engine from clearing flag.
- Dispose: unsubscribe node events before semaphore disposal.
- Token redaction: mask auth payloads, sigToken previews, and signed
  payloads in diagnostics logs.
- Connector: move StateChanged subscription after fast-path checks,
  disconnect manager on timeout.
- State machine: wrap SetNodeEnabled in semaphore, clear stale
  OperatorPairingRequestId on transition.
- UI: re-enable approve/reject buttons on false return, move auto-approve
  guard after confirmed success.
- DisposeActiveClient: sync-wait for node disconnect (2s timeout).
- Remove diagnostics window auto-open on startup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ranjeshj ranjeshj force-pushed the user/ranjeshj/connection2 branch from a159d80 to 789ac76 Compare May 10, 2026 18:51
Remove 6 working/aspirational documents (moved to external copilotdocs):
- connection-architecture-northstar.md
- connection-implementation-audit.md
- gateway-node-integration.md
- wsl-owner-open-issues.md
- wsl-owner-validation.md
- WINDOWS_NODE_TESTING.md

Replace with one concise document describing the current architecture
as implemented: components, state machine, credential resolution, setup
flow integration, and what remains in App.xaml.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors the Windows Tray connection layer toward a manager-owned lifecycle centered on GatewayConnectionManager, introducing a gateway registry + credential resolution pipeline, expanded diagnostics/UX surfaces, and broad test coverage to support the new architecture.

Changes:

  • Introduces GatewayConnectionManager + supporting connection state machine/diagnostics abstractions, and updates UI surfaces to consume the new manager/client interfaces.
  • Moves gateway credentials out of SettingsData/settings.json into a new persisted GatewayRegistry model and updates onboarding/setup flows accordingly.
  • Adds substantial unit/integration test coverage for retry policy, registry migration, pairing/state transitions, and stale-event generation guards.
Show a summary per file
File Description
tests/OpenClaw.Tray.Tests/StartupSetupStateTests.cs Removes legacy setup gating tests tied to settings-stored tokens.
tests/OpenClaw.Tray.Tests/SetupWarningPageGuardPolicyTests.cs Updates existing-config guard tests to use gateway URL instead of token.
tests/OpenClaw.Tray.Tests/SettingsRoundTripTests.cs Updates settings serialization tests to reflect removal of token fields from SettingsData.
tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj Links new connection/registry/manager sources into the Tray test project.
tests/OpenClaw.Tray.Tests/OnboardingExistingConfigGuardTests.cs Removes legacy tests that treated settings token/bootstrap as “existing config”.
tests/OpenClaw.Tray.Tests/LocalSetupProgressGuardTests.cs Updates local-setup guards to treat non-default gateway URL as “existing config”.
tests/OpenClaw.Tray.Tests/LocalGatewaySetupTests.cs Adjusts setup engine construction tests to match registry-based credential flow.
tests/OpenClaw.Tray.Tests/Connection/StaleEventGuardTests.cs Adds tests for generation-based stale-event suppression in the manager.
tests/OpenClaw.Tray.Tests/Connection/SetupCodeFlowTests.cs Adds integration tests for setup-code decode → registry write → connect.
tests/OpenClaw.Tray.Tests/Connection/SettingsChangeImpactTests.cs Adds tests for reconnect impact classification based on settings diffs.
tests/OpenClaw.Tray.Tests/Connection/RetryPolicyTests.cs Adds tests for retry/backoff behavior.
tests/OpenClaw.Tray.Tests/Connection/PairingFlowTests.cs Adds tests for node pairing transitions driven by INodeConnector events.
tests/OpenClaw.Tray.Tests/Connection/NodeConnectorTests.cs Adds basic lifecycle/state tests for NodeConnector.
tests/OpenClaw.Tray.Tests/Connection/GatewayRegistryTests.cs Adds persistence + mutation tests for the new gateway registry.
tests/OpenClaw.Tray.Tests/Connection/GatewayRegistryMigrationTests.cs Adds tests for migrating legacy settings/identity into the registry format.
tests/OpenClaw.Tray.Tests/Connection/GatewayConnectionManagerTests.cs Adds baseline tests for connect/disconnect/switch/diagnostics surfaces.
tests/OpenClaw.Tray.Tests/Connection/CredentialResolverTests.cs Adds tests for canonical credential resolution order (device/shared/bootstrap).
tests/OpenClaw.Tray.Tests/Connection/ConnectionStateMachineTests.cs Adds tests for the operator/node state machine transitions.
tests/OpenClaw.Tray.Tests/Connection/ConnectionDiagnosticsTests.cs Adds tests for the diagnostics ring buffer behavior.
tests/OpenClaw.Shared.Tests/OpenClawGatewayClientTests.cs Updates shared client tests for new scopes/pairing/signature fallback behavior.
tests/OpenClaw.Shared.Tests/DeviceIdentityTests.cs Adds pragma for legacy/obsolete identity method coverage.
src/OpenClaw.Tray.WinUI/Windows/SetupWizardWindow.cs Stops reading/writing tokens from settings in legacy setup wizard paths.
src/OpenClaw.Tray.WinUI/Windows/HubWindow.xaml.cs Updates Hub to use IOperatorGatewayClient and wires diagnostics/window actions.
src/OpenClaw.Tray.WinUI/Windows/ConnectionStatusWindow.xaml Adds a new diagnostics/status window UI surface.
src/OpenClaw.Tray.WinUI/Services/StartupSetupState.cs Changes setup gating logic (now largely MCP/node-token based).
src/OpenClaw.Tray.WinUI/Services/SshTunnelManager.cs Adds an ISshTunnelManager wrapper around existing SshTunnelService.
src/OpenClaw.Tray.WinUI/Services/SettingsManager.cs Removes token persistence and adds ToSettingsData() snapshot generation.
src/OpenClaw.Tray.WinUI/Services/LocalGatewaySetup/LocalGatewaySetup.cs Integrates registry syncing + setup-state cleanup + manager connector overrides.
src/OpenClaw.Tray.WinUI/Services/DeepLinkHandler.cs Adds a deep-link action hook for opening the connection status surface.
src/OpenClaw.Tray.WinUI/Services/ConnectionManagerOperatorConnector.cs Routes setup-engine operator connect/reconnect through the manager + registry.
src/OpenClaw.Tray.WinUI/Services/Connection/SetupCodeResult.cs Introduces a result model for applying setup codes.
src/OpenClaw.Tray.WinUI/Services/Connection/SettingsChangeImpact.cs Adds reconnect-impact classification logic for settings changes.
src/OpenClaw.Tray.WinUI/Services/Connection/RoleConnectionState.cs Adds per-role connection state enum.
src/OpenClaw.Tray.WinUI/Services/Connection/RetryPolicy.cs Adds retry/backoff policy implementation.
src/OpenClaw.Tray.WinUI/Services/Connection/OverallConnectionState.cs Adds derived overall connection state enum.
src/OpenClaw.Tray.WinUI/Services/Connection/OperatorClientChangedEventArgs.cs Adds event args for operator client instance swaps.
src/OpenClaw.Tray.WinUI/Services/Connection/NodeConnector.cs Adds a lightweight node connector owning WindowsNodeClient lifecycle.
src/OpenClaw.Tray.WinUI/Services/Connection/NodeConnectionMode.cs Adds node connection mode enum.
src/OpenClaw.Tray.WinUI/Services/Connection/ISshTunnelManager.cs Adds SSH tunnel lifecycle abstraction used by the manager.
src/OpenClaw.Tray.WinUI/Services/Connection/INodeConnector.cs Adds a node-connection abstraction for manager-driven orchestration.
src/OpenClaw.Tray.WinUI/Services/Connection/IGatewayConnectionManager.cs Adds the public manager interface for UI + setup integration.
src/OpenClaw.Tray.WinUI/Services/Connection/IGatewayClientFactory.cs Adds factory + lifecycle abstraction for the underlying gateway client.
src/OpenClaw.Tray.WinUI/Services/Connection/ICredentialResolver.cs Adds credential resolver interface + GatewayCredential model.
src/OpenClaw.Tray.WinUI/Services/Connection/GatewayRegistry.cs Adds persisted registry for known gateways + migration support.
src/OpenClaw.Tray.WinUI/Services/Connection/GatewayRecord.cs Adds persisted per-gateway record model incl. tokens + SSH config.
src/OpenClaw.Tray.WinUI/Services/Connection/GatewayConnectionSnapshot.cs Adds immutable snapshot model for cross-thread connection state.
src/OpenClaw.Tray.WinUI/Services/Connection/GatewayClientFactory.cs Adds concrete factory/adaptor for creating gateway client lifecycles.
src/OpenClaw.Tray.WinUI/Services/Connection/DeviceIdentityStore.cs Adds identity store adapter for persisting received device tokens.
src/OpenClaw.Tray.WinUI/Services/Connection/CredentialResolver.cs Implements canonical credential selection order for operator/node roles.
src/OpenClaw.Tray.WinUI/Services/Connection/ConnectionTrigger.cs Adds trigger enum driving the state machine.
src/OpenClaw.Tray.WinUI/Services/Connection/ConnectionStateMachine.cs Adds core operator/node state transition logic and snapshot rebuilding.
src/OpenClaw.Tray.WinUI/Services/Connection/ConnectionErrorCategory.cs Adds error categorization for retry policy and diagnostics.
src/OpenClaw.Tray.WinUI/Services/Connection/ConnectionDiagnostics.cs Adds ring-buffer diagnostics timeline.
src/OpenClaw.Tray.WinUI/Services/Connection/ConnectionDiagnosticEvent.cs Adds diagnostics event model.
src/OpenClaw.Tray.WinUI/Pages/DebugPage.xaml.cs Adds action wiring to open the new connection status surface.
src/OpenClaw.Tray.WinUI/Pages/DebugPage.xaml Adds a “Connection Status Window” button.
src/OpenClaw.Tray.WinUI/Pages/ConnectionPage.xaml Refactors Connection page UX around manager state + pairing guidance + diagnostics.
src/OpenClaw.Tray.WinUI/Pages/ChatPage.xaml.cs Reads shared token from GatewayRegistry for chat URL generation.
src/OpenClaw.Tray.WinUI/Onboarding/Services/OnboardingState.cs Switches onboarding state to use IOperatorGatewayClient.
src/OpenClaw.Tray.WinUI/Onboarding/Services/OnboardingExistingConfigGuard.cs Stops treating settings token/bootstrap fields as config (tokens now in registry).
src/OpenClaw.Tray.WinUI/Onboarding/Pages/LocalSetupProgressPage.cs Uses connection manager reconnect path during local setup completion.
src/OpenClaw.Tray.WinUI/Onboarding/Pages/ConnectionPage.cs Routes onboarding “Test connection” through registry + manager connect flow.
src/OpenClaw.Tray.WinUI/Onboarding/OnboardingWindow.cs Uses registry token for chat overlay URL generation.
src/OpenClaw.Shared/WindowsNodeClient.cs Adds v2 signature fallback flag + richer handshake events + pairing requestId plumbing.
src/OpenClaw.Shared/SettingsData.cs Removes token/bootstrap fields from shared settings model.
src/OpenClaw.Shared/OpenClawGatewayClient.cs Implements IOperatorGatewayClient, adds handshake/pairing/token events, logs redacted auth details, and v2 fallback handling.
src/OpenClaw.Shared/Models.cs Extends PairingStatusEventArgs to include a pairing request id.
src/OpenClaw.Shared/IOperatorGatewayClient.cs Introduces operator-client facade interface with request methods/events.
src/OpenClaw.Shared/IFileSystem.cs Introduces filesystem abstraction for testability.
src/OpenClaw.Shared/IDeviceIdentityStore.cs Introduces device token persistence abstraction.
src/OpenClaw.Shared/IDeviceIdentityReader.cs Introduces device token read abstraction for credential resolution.
src/OpenClaw.Shared/IClock.cs Introduces clock abstraction for deterministic diagnostics tests.
src/OpenClaw.Shared/DeviceTokenReceivedEventArgs.cs Adds event args for device token receipt.
src/OpenClaw.Shared/DeviceIdentityFileReader.cs Adds default identity reader implementation.
src/OpenClaw.Shared/DeviceIdentity.cs Marks legacy signing helpers obsolete and clarifies usage.
src/OpenClaw.Cli/Program.cs Stops loading token from settings and relies on --token override.
docs/wsl-owner-open-issues.md Removes legacy WSL-owner Q&A doc.
docs/WINDOWS_NODE_TESTING.md Removes STT guidance sections.
docs/gateway-node-integration.md Removes stt.transcribe references from docs and examples.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comments suppressed due to low confidence (1)

src/OpenClaw.Tray.WinUI/Services/StartupSetupState.cs:28

  • RequiresSetup currently returns true for the default case where EnableNodeMode is false and EnableMcpServer is false (which is the default in SettingsManager). That would force onboarding on every startup for normal operator-only usage. Consider basing this on whether an operator credential exists (stored operator device token and/or an active gateway record with a usable token) rather than node token + MCP flag.
  • Files reviewed: 84/84 changed files
  • Comments generated: 11

Comment on lines +106 to +124
public void Save()
{
RegistryData data;
lock (_lock)
{
data = new RegistryData { Gateways = _records.ToList(), ActiveId = _activeId };
}
var json = JsonSerializer.Serialize(data, s_jsonOptions);

var dir = Path.GetDirectoryName(_filePath);
if (dir != null && !_fs.DirectoryExists(dir))
_fs.CreateDirectory(dir);

// Atomic write: temp file then rename
var tempPath = _filePath + ".tmp";
_fs.WriteAllText(tempPath, json);
// On Windows, File.Move with overwrite works as atomic rename
File.Move(tempPath, _filePath, overwrite: true);
}
Comment on lines +176 to +191
if (string.IsNullOrWhiteSpace(gatewayUrl))
return false;

// Idempotent: don't duplicate if already migrated
if (FindByUrl(gatewayUrl) != null)
{
logger?.Info($"[Registry] Migration skipped — record already exists for {gatewayUrl}");
return false;
}

var id = Guid.NewGuid().ToString();
var record = new GatewayRecord
{
Id = id,
Url = gatewayUrl,
SharedGatewayToken = string.IsNullOrWhiteSpace(bootstrapToken) ? token : null,
Comment on lines +214 to +222
/// <summary>
/// Creates a snapshot of current settings as an immutable SettingsData record.
/// Used for settings change classification (no DPAPI protection applied).
/// </summary>
public SettingsData ToSettingsData() => new()
{
GatewayUrl = GatewayUrl,
// Token and BootstrapToken are no longer written — GatewayRegistry is the source of truth
UseSshTunnel = UseSshTunnel,
Comment on lines +1331 to +1341
// Sync credentials to GatewayRegistry (source of truth for connection architecture)
if (_registry != null && !string.IsNullOrWhiteSpace(GatewayUrl))
{
var existing = _registry.FindByUrl(GatewayUrl);
var recordId = existing?.Id ?? System.Guid.NewGuid().ToString();
var record = new OpenClawTray.Services.Connection.GatewayRecord
{
Id = recordId,
Url = GatewayUrl,
SharedGatewayToken = !string.IsNullOrWhiteSpace(Token) ? Token : existing?.SharedGatewayToken,
BootstrapToken = !string.IsNullOrWhiteSpace(BootstrapToken) ? BootstrapToken : existing?.BootstrapToken,
Comment on lines +131 to +138
public async Task<GatewayOperatorConnectionResult> ConnectWithStoredDeviceTokenAsync(
string gatewayUrl, CancellationToken cancellationToken = default)
{
_logger.Info("[SetupConnector] Reconnecting with stored device token via manager");

// Reconnect — the credential resolver will pick up the stored device token
await _manager.ReconnectAsync();

Comment on lines +172 to 174
var token = options.TokenOverride ?? string.Empty;

return (gatewayUrl ?? string.Empty, token ?? string.Empty, loaded);
Comment on lines +114 to 117
<Button Content="🔌 Connection Status Window" Click="OnOpenConnectionStatus" HorizontalAlignment="Stretch"/>
<Button x:Uid="DebugPage_Button_114" Content="📋 Copy Support Context" Click="OnCopySupportContext"
Style="{ThemeResource AccentButtonStyle}" HorizontalAlignment="Stretch"/>
<Button Content="🔄 Relaunch First-Run Setup" Click="OnRelaunchOnboarding" HorizontalAlignment="Stretch"/>
Comment on lines +65 to +78
<!-- Operator flow -->
<StackPanel Grid.Row="0" Margin="0,0,0,4">
<StackPanel Orientation="Horizontal" Spacing="4" Margin="0,0,0,2">
<TextBlock Text="OPERATOR" FontSize="9" FontWeight="Bold" Foreground="{ThemeResource TextFillColorSecondaryBrush}" CharacterSpacing="80" VerticalAlignment="Center" Width="70"/>
<Border x:Name="CpOpOff" CornerRadius="3" Padding="6,2"><TextBlock Text="Off" FontSize="10"/></Border>
<TextBlock Text="→" VerticalAlignment="Center" FontSize="9" Foreground="{ThemeResource TextFillColorSecondaryBrush}"/>
<Border x:Name="CpOpConnecting" CornerRadius="3" Padding="6,2"><TextBlock Text="Connecting" FontSize="10"/></Border>
<TextBlock Text="→" VerticalAlignment="Center" FontSize="9" Foreground="{ThemeResource TextFillColorSecondaryBrush}"/>
<Border x:Name="CpOpConnected" CornerRadius="3" Padding="6,2"><TextBlock Text="Connected" FontSize="10"/></Border>
</StackPanel>
<StackPanel Orientation="Horizontal" Spacing="4" Margin="74,0,0,0">
<Border x:Name="CpOpPairing" CornerRadius="3" Padding="6,2"><TextBlock Text="Pairing" FontSize="10"/></Border>
<Border x:Name="CpOpError" CornerRadius="3" Padding="6,2"><TextBlock Text="Error" FontSize="10"/></Border>
</StackPanel>
Comment on lines 14 to +17
return false;
}

return !string.IsNullOrWhiteSpace(settings.Token) ||
!string.IsNullOrWhiteSpace(settings.BootstrapToken) ||
HasStoredNodeDeviceToken(dataPath);
return HasStoredNodeDeviceToken(dataPath);
Comment on lines +33 to +38
/// <summary>
/// Identity directory name, deterministically derived from Id.
/// GUIDs are path-safe and guarantee uniqueness even if URLs change.
/// </summary>
public string IdentityDirName => Id;
}
shanselman and others added 2 commits May 10, 2026 13:19
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
shanselman added a commit that referenced this pull request May 10, 2026
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@shanselman shanselman merged commit 49bc647 into openclaw:master May 10, 2026
8 checks passed
@ranjeshj ranjeshj deleted the user/ranjeshj/connection2 branch May 12, 2026 00:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants