fix(whatsapp): downgrade recovered watchdog disconnects #77026

Merged
mcaxtr merged 1 commit into openclaw:main from rubencu:codex/whatsapp-watchdog-logging on May 11, 2026

Conversation

Contributor

@rubencu rubencu commented May 4, 2026

Summary

Describe the problem and fix in 2–5 bullets:

If this PR fixes a plugin beta-release blocker, title it fix(<plugin-id>): beta blocker - <summary> and link the matching Beta blocker: <plugin-name> - <summary> issue labeled beta-blocker. Contributors cannot label PRs, so the title is the PR-side signal for maintainers and automation.

  • Problem: WhatsApp's own watchdog closes stale/degraded Web connections with status 499 to recover, but the reconnect path surfaced that watchdog recovery as a runtime error and left recent-reconnect status behind after the next healthy connect.
  • Why it matters: a recovered stale transport could look like a user-visible WhatsApp failure even when the gateway recovered and the account was linked/connected again.
  • What changed: the watchdog close now carries the shared internal WHATSAPP_WATCHDOG_TIMEOUT_ERROR marker, the retry log is warning-style for that watchdog recovery path, and watchdog recovery status is cleared only after the socket becomes healthy again.
  • What did NOT change (scope boundary): non-watchdog disconnects, logged-out/conflict handling, retry limits, reconnect backoff, auth repair, channel configuration, and public status payload shape are unchanged. CHANGELOG.md is intentionally untouched for this contributor PR.
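
The routing described in the "What changed" bullet can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the marker value is assumed to be the `watchdog-timeout` text noted in review, the names `CloseInfo`, `RuntimeLike`, `isWatchdogRecoveryClose`, and `reportRetry` are hypothetical, the wording of the non-watchdog message is assumed, and the SDK's `warn` formatter is left out.

```typescript
// Constant name comes from the PR summary; the value is assumed from the
// review notes ("error text watchdog-timeout").
const WHATSAPP_WATCHDOG_TIMEOUT_ERROR = "watchdog-timeout";

// Hypothetical shape of a close event as seen by the retry path.
interface CloseInfo {
  status: number;
  error?: string; // internal close reason, not the public status payload
}

// Hypothetical stand-in for the runtime logger used by the monitor.
interface RuntimeLike {
  log(message: string): void;
  error(message: string): void;
}

// Exact match on the internal marker; status 499 alone is not enough.
function isWatchdogRecoveryClose(close: CloseInfo): boolean {
  return close.error === WHATSAPP_WATCHDOG_TIMEOUT_ERROR;
}

// Route watchdog recoveries to the warning-style log and everything else
// to the error channel, as before.
function reportRetry(
  runtime: RuntimeLike,
  close: CloseInfo,
  attempt: number,
  maxAttempts: number,
  delayMs: number,
): void {
  const retry = `Retry ${attempt}/${maxAttempts} in ${delayMs}ms.`;
  if (isWatchdogRecoveryClose(close)) {
    runtime.log(
      `WhatsApp Web watchdog is recovering a stale connection (status ${close.status}). ${retry}`,
    );
    return;
  }
  // Assumed wording for the unchanged non-watchdog path.
  runtime.error(`WhatsApp Web connection closed (status ${close.status}). ${retry}`);
}
```

Keying on strict equality with the marker, rather than on the status code, is what keeps an unrelated 499 close on the error path.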

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes N/A
  • Related N/A
  • This PR fixes a bug or regression

Real behavior proof (required for external PRs)

External contributors must show after-fix evidence from a real OpenClaw setup. Unit tests, mocks, lint, typechecks, snapshots, and CI are supplemental only. Screenshots are encouraged even for CLI, console, text, or log changes; terminal screenshots and copied live output count. Be mindful of private information like IP addresses, API keys, phone numbers, non-public endpoints, or other private details when providing evidence.

  • Behavior or issue addressed: watchdog recovery from stale WhatsApp Web transport no longer appears as a user-facing runtime error or persistent unhealthy reconnect status after the account reconnects.
  • Real environment tested: local macOS OpenClaw checkout on PR commit fa403de6755e446f60544e5216152381fad1b4bd, real configured WhatsApp account with phone/account details redacted.
  • Exact steps or command run after this patch:
    • Start branch gateway on a throwaway loopback port and query health plus channels.status over gateway RPC.
    • Run the production WhatsApp monitorWebChannel gateway monitor from this checkout against the same real default WhatsApp auth state with shortened internal watchdog timing only (messageTimeoutMs=3000, watchdogCheckMs=250, transportTimeoutMs=60000) so the watchdog path can be observed without waiting for the production-length window.
  • Evidence after fix (screenshot, recording, terminal capture, console output, redacted runtime log, linked artifact, or copied live output):
[status] {"connected":false,"healthState":"starting","lastDisconnect":null,"reconnectAttempts":0,"running":true}
[status] {"connected":true,"healthState":"healthy","lastDisconnect":null,"reconnectAttempts":0,"running":true}
[status] {"connected":true,"healthState":"stale","lastDisconnect":null,"reconnectAttempts":0,"running":true}
[status] {"connected":false,"healthState":"reconnecting","lastDisconnect":{"status":499,"error":"status=499","loggedOut":false,"hasExpectedField":false},"reconnectAttempts":1,"running":true}
[runtime.log] WhatsApp Web watchdog is recovering a stale connection (status 499). Retry 1/2 in 500ms.
[status] {"connected":true,"healthState":"healthy","lastDisconnect":null,"reconnectAttempts":0,"running":true}
[proof] {"ok":true,"watchdogLog":true,"runtime499Error":false,"reconnectSnapshot":true,"healthyAfterReconnect":true,"timeoutReached":false}

Gateway RPC health/status on the same branch also reported the real account linked/connected/healthy after the patch:

{
  "health": {
    "ok": true,
    "channels": {
      "whatsapp": {
        "running": true,
        "configured": true,
        "healthState": "healthy",
        "reconnectAttempts": 0,
        "lastError": null
      }
    },
    "eventLoop": {
      "degraded": false,
      "reasons": []
    }
  },
  "channelsStatus": {
    "whatsappAccounts": [
      {
        "accountId": "default",
        "enabled": true,
        "configured": true,
        "linked": true,
        "running": true,
        "connected": true,
        "healthState": "healthy",
        "statusState": "linked",
        "reconnectAttempts": 0,
        "lastDisconnect": null,
        "lastError": null
      }
    ]
  }
}
  • Observed result after fix: the watchdog-forced status 499 reconnect logs through runtime.log as watchdog recovery from a stale transport, does not call runtime.error, emits the normal public reconnecting status without an expected field, and clears lastDisconnect/reconnectAttempts after the next healthy connection.
  • What was not tested: production-length natural watchdog timing; the live proof uses shortened internal monitor timing to force the same branch quickly.
  • Before evidence (optional but encouraged): prior live gateway logs showed repeated "WhatsApp watchdog timeout (app-silent) - restarting connection" messages, each followed by a user-facing "WhatsApp Web connection closed (status 499)" reconnect error for the watchdog's own close reason.
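
For orientation, the watchdog check that the shortened timings exercise might look roughly like the sketch below. It is an assumption-laden illustration: `WatchdogTiming`, `WatchdogDecision`, and `checkWatchdog` are hypothetical names, only `messageTimeoutMs` is modeled, and the marker text is assumed to be `watchdog-timeout`. The role of `transportTimeoutMs` is not described in this PR and is left out.

```typescript
// Hypothetical timing config; the field name mirrors the proof steps.
interface WatchdogTiming {
  messageTimeoutMs: number; // max silence before the connection counts as stale
}

// Either leave the socket alone or force a close that the retry path can
// recognize as an intentional recovery.
type WatchdogDecision =
  | { action: "ok" }
  | { action: "force-close"; status: number; error: string };

// Pure check, intended to be called on a fixed interval (watchdogCheckMs in
// the proof steps). Timestamps are milliseconds since an arbitrary epoch.
function checkWatchdog(
  lastMessageAt: number,
  now: number,
  timing: WatchdogTiming,
): WatchdogDecision {
  if (now - lastMessageAt < timing.messageTimeoutMs) {
    return { action: "ok" };
  }
  // Status 499 plus the assumed internal marker text.
  return { action: "force-close", status: 499, error: "watchdog-timeout" };
}
```

With `messageTimeoutMs=3000`, a socket that has been silent for three seconds would be force-closed on the next check, which is how the live proof reaches the recovery path quickly.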

Root Cause (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

  • Root cause: the watchdog's forced close went through the same generic retry path as real disconnects, so the monitor could not distinguish an intentional watchdog recovery of a stale transport from an unexpected WhatsApp Web close.
  • Missing detection / guardrail: existing tests covered reconnecting after watchdog close, but did not assert the user-facing log level or the status snapshot cleanup after the successful reconnect.
  • Contributing context (if known): status issue reporting keys off lastDisconnect and reconnectAttempts, so watchdog recoveries need to be cleared once the socket is healthy.
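
The cleanup rule implied by these bullets can be sketched as a small status controller. The snapshot fields follow the live status output in this PR and `noteClose` with a `watchdogRecovery` flag is mentioned in review, but the class name, `noteConnected`, `getSnapshot`, and the exact bookkeeping are assumptions made for illustration.

```typescript
interface DisconnectInfo {
  status: number;
  error: string;
  loggedOut: boolean;
}

interface StatusSnapshot {
  connected: boolean;
  healthState: "starting" | "healthy" | "stale" | "reconnecting";
  lastDisconnect: DisconnectInfo | null;
  reconnectAttempts: number;
}

// Hypothetical controller: tracks whether the most recent close was a
// watchdog recovery and clears reconnect history only in that case.
class StatusControllerSketch {
  private snapshot: StatusSnapshot = {
    connected: false,
    healthState: "starting",
    lastDisconnect: null,
    reconnectAttempts: 0,
  };
  // Internal flag only; it never appears in the public snapshot.
  private watchdogRecovery = false;

  noteClose(disconnect: DisconnectInfo, opts: { watchdogRecovery: boolean }): void {
    this.watchdogRecovery = opts.watchdogRecovery;
    this.snapshot = {
      connected: false,
      healthState: "reconnecting",
      lastDisconnect: disconnect,
      reconnectAttempts: this.snapshot.reconnectAttempts + 1,
    };
  }

  noteConnected(): void {
    const clear = this.watchdogRecovery;
    this.watchdogRecovery = false;
    this.snapshot = {
      connected: true,
      healthState: "healthy",
      // Ordinary disconnects keep their history so status issues still fire.
      lastDisconnect: clear ? null : this.snapshot.lastDisconnect,
      reconnectAttempts: clear ? 0 : this.snapshot.reconnectAttempts,
    };
  }

  getSnapshot(): StatusSnapshot {
    return { ...this.snapshot };
  }
}
```

Because the flag is reset on every close, a real disconnect that follows a watchdog close before the socket recovers would keep its history in this sketch.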

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should catch this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file:
    • extensions/whatsapp/src/auto-reply.web-auto-reply.connection-and-logging.e2e.test.ts
    • extensions/whatsapp/src/auto-reply/monitor-state.test.ts
    • extensions/whatsapp/src/status-issues.test.ts
  • Scenario the test should lock in: a watchdog-forced status 499 reconnect logs as a watchdog recovery warning, never calls runtime.error, emits only the normal reconnecting public status shape, and clears lastDisconnect/reconnectAttempts after the next healthy connection; ordinary recent disconnects still report status issues.
  • Why this is the smallest reliable guardrail: the e2e harness exercises the real Web auto-reply monitor/controller path with shortened watchdog timings, while the unit status-controller test locks the narrow cleanup branch.
  • Existing test that already covers this (if any): existing reconnect e2e coverage proved a retry happened, but not the user-facing log/status behavior.
  • If no new test is added, why not: N/A.

User-visible / Behavior Changes

Watchdog recovery from stale WhatsApp Web transport now shows as a warning-style reconnect log instead of a runtime error. Real disconnects, logged-out states, session conflicts, and exhausted retry attempts still surface as errors.

Diagram (if applicable)

Before:
watchdog timeout -> forced close status 499 -> generic reconnect error -> stale recent-reconnect status

After:
watchdog timeout -> forced close status 499 + watchdog marker -> warning -> healthy reconnect clears watchdog recovery status

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local repo checkout, Node via repo scripts and bundled Node 24 runtime for the one-off live monitor proof
  • Model/provider: N/A
  • Integration/channel (if any): WhatsApp plugin, real configured account details redacted
  • Relevant config (redacted): WhatsApp enabled and linked; gateway run on loopback throwaway port with auth none

Steps

  1. Start the rebased branch gateway on a throwaway loopback port.
  2. Query health and channels.status through gateway RPC.
  3. Run the production WhatsApp monitor against the real linked auth state with shortened internal watchdog timing to force the watchdog reconnect path.
  4. Run the focused WhatsApp watchdog/status tests and changed gate.

Expected

  • The branch gateway starts and WhatsApp reports linked/connected/healthy with no retained disconnect/error.
  • A watchdog-forced reconnect logs a watchdog recovery warning without calling runtime.error.
  • Watchdog recovery status clears after the next healthy connection.

Actual

  • Gateway health: ok=true, WhatsApp running=true, configured=true, healthState=healthy, reconnectAttempts=0, lastError=null.
  • Gateway channels.status: WhatsApp account linked=true, running=true, connected=true, healthState=healthy, statusState=linked, reconnectAttempts=0, lastDisconnect=null, lastError=null.
  • Live watchdog proof: ok=true, watchdogLog=true, runtime499Error=false, reconnectSnapshot=true, healthyAfterReconnect=true.
  • Focused tests and pnpm check:changed passed.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Validation run:

pnpm test extensions/whatsapp/src/auto-reply/monitor-state.test.ts extensions/whatsapp/src/auto-reply.web-auto-reply.connection-and-logging.e2e.test.ts extensions/whatsapp/src/status-issues.test.ts

Test Files  1 passed (1) in vitest.e2e.config.ts
Tests       21 passed (21)
Test Files  2 passed (2) in vitest.extension-whatsapp.config.ts
Tests       13 passed (13)

pnpm check:changed
lanes=extensions, extensionTests
Found 0 warnings and 0 errors.
Import cycle check: 0 runtime value cycle(s).

git diff --check origin/main...HEAD
passed

codex review --base origin/main
No actionable correctness issues were found in the diff.

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: rebased onto latest origin/main at the time of the rebase; no CHANGELOG.md diff remains; live foreground gateway starts from this branch; WhatsApp is linked/connected/healthy through gateway RPC; live watchdog reconnect path is forced against a real linked WhatsApp auth state; watchdog recovery branch has e2e assertions for log level, status cleanup, and no leaked expected field in the public status payload; changed gate passes.
  • Edge cases checked: non-watchdog retry path still calls runtime.error; terminal logged-out/conflict branches are untouched; ordinary recent disconnect status issue coverage still passes via status-issues.test.ts.
  • What you did not verify: production-length natural watchdog timing.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: watchdog recovery reconnects could hide a real disconnect if detection were too broad.
    • Mitigation: the downgrade only applies when the close reason is the exact watchdog marker produced by the watchdog force-close path; all other retry, terminal, logged-out, and conflict paths keep their previous behavior.

@openclaw-barnacle openclaw-barnacle Bot added the channel: whatsapp-web and size: S labels May 4, 2026
Contributor

clawsweeper Bot commented May 4, 2026

Codex review: needs maintainer review before merge.

Summary
The PR downgrades watchdog-triggered WhatsApp Web reconnects from runtime errors to warning logs, clears watchdog recovery reconnect status after a healthy reconnect, and adds focused regression tests plus a changelog line.

Reproducibility: yes. Source inspection gives a high-confidence reproduction path: current main emits a watchdog status 499 close with watchdog-timeout, then routes retryable closes through runtime.error and leaves reconnect fields that status issue reporting can surface after reconnect.

Real behavior proof
Sufficient (live_output): The PR body includes copied live output from a real linked WhatsApp setup showing the after-fix watchdog recovery path and healthy gateway status.

Next step before merge
No repair job is indicated; the patch has no blocking review finding, but the current head still needs maintainer review and resolution of the latest failing CI check.

Security
Cleared: The diff only changes WhatsApp plugin logging/status logic, focused tests, and changelog text, with no dependency, workflow, permission, secret, or command execution surface changes.

Review details

Best possible solution:

Land this plugin-side fix after maintainer review and green required checks, keeping watchdog-specific recovery handling in the WhatsApp plugin and preserving generic disconnect behavior for real failures.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection gives a high-confidence reproduction path: current main emits a watchdog status 499 close with watchdog-timeout, then routes retryable closes through runtime.error and leaves reconnect fields that status issue reporting can surface after reconnect.

Is this the best way to solve the issue?

Yes. The PR keys off the exact internal watchdog marker, downgrades only that recovery log, and clears only watchdog recovery history after a healthy reconnect while preserving ordinary retry, logged-out, conflict, and exhausted-retry behavior.

Acceptance criteria:

  • pnpm test extensions/whatsapp/src/auto-reply/monitor-state.test.ts extensions/whatsapp/src/auto-reply.web-auto-reply.connection-and-logging.e2e.test.ts extensions/whatsapp/src/status-issues.test.ts
  • pnpm check:changed
  • Review latest checks-node-core logs for head 3381ac9 before merge.

What I checked:

  • Current main generic retry path: Retryable WhatsApp Web closes on current main call statusController.noteClose and then runtime.error, so a watchdog-forced status 499 reaches user-visible error output. (extensions/whatsapp/src/auto-reply/monitor.ts:569, d54bab4b887f)
  • Current main watchdog marker: The watchdog force-close path currently emits status 499 with error text watchdog-timeout, giving the PR an existing narrow marker for intentional recovery closes. (extensions/whatsapp/src/connection-controller.ts:641, d54bab4b887f)
  • Current main retained reconnect status: Status issue reporting treats a linked, connected account with reconnectAttempts > 0 and a recent lastDisconnect as recently reconnected, which explains the stale status symptom after recovery. (extensions/whatsapp/src/status-issues.ts:156, d54bab4b887f)
  • PR diff scope: The live PR diff for head 3381ac93b7e2b46c7a4766ddcbbd6f85a0f4e18a exports the watchdog marker, passes watchdogRecovery into the status controller, routes only that marker through runtime.log(warn(...)), preserves non-watchdog runtime.error, and adds regression coverage. (extensions/whatsapp/src/auto-reply/monitor.ts:568, 3381ac93b7e2)
  • Runtime warning helper contract: openclaw/plugin-sdk/runtime-env publicly exports warn, and warn is a terminal theme formatter, so using runtime.log(warn(message)) matches existing runtime output patterns. (src/plugin-sdk/runtime-env.ts:17, d54bab4b887f)
  • Real behavior proof: The PR body includes copied live output from a real linked WhatsApp setup showing watchdogLog=true, runtime499Error=false, a reconnecting snapshot, and healthy linked status after reconnect. (3381ac93b7e2)
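
The "retained reconnect status" check described above can be illustrated with a small predicate. This is a sketch under assumptions: `AccountStatus`, `isRecentlyReconnected`, the `at` timestamp field, and the five-minute window are hypothetical and not taken from status-issues.ts.

```typescript
// Hypothetical account view; `at` (ms timestamp of the disconnect) is an
// assumed field used here to express "recent".
interface AccountStatus {
  linked: boolean;
  connected: boolean;
  reconnectAttempts: number;
  lastDisconnect: { status: number; at: number } | null;
}

// Assumed window; the real threshold is not stated in this PR.
const RECENT_WINDOW_MS = 5 * 60 * 1000;

// A linked, connected account with reconnect attempts and a recent
// disconnect is reported as recently reconnected.
function isRecentlyReconnected(
  account: AccountStatus,
  now: number,
  windowMs: number = RECENT_WINDOW_MS,
): boolean {
  if (!account.linked || !account.connected) return false;
  if (account.reconnectAttempts <= 0 || account.lastDisconnect === null) return false;
  return now - account.lastDisconnect.at <= windowMs;
}
```

Under this reading, clearing `lastDisconnect` and `reconnectAttempts` after a watchdog recovery makes the predicate return false, while an ordinary recent disconnect still trips it.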

Likely related people:

  • mcaxtr: Recent current-main WhatsApp monitor/status commits include removal of exposeErrorText handling and auth/status recovery work; live PR metadata also shows mcaxtr assigned and as the committer of the current PR head. (role: recent area contributor and reviewer; confidence: high; commits: 4cba08df01ea, aa76cf43f011, 3381ac93b7e2; files: extensions/whatsapp/src/auto-reply/monitor.ts, extensions/whatsapp/src/status-issues.ts, extensions/whatsapp/src/auto-reply/monitor-state.ts)
  • steipete: Current blame on the watchdog, retry, and status issue lines points to recent WhatsApp lifecycle cleanup, and nearby history includes monitor-state centralization and socket teardown fixes. (role: recent area contributor; confidence: high; commits: c8d52e36d5dd, 66743b84fac2, 69225003820a; files: extensions/whatsapp/src/auto-reply/monitor.ts, extensions/whatsapp/src/auto-reply/monitor-state.ts, extensions/whatsapp/src/connection-controller.ts)
  • vincentkoc: Recent merged WhatsApp commits addressed quiet-socket reconnect behavior and group inbound recovery after reconnect churn, both adjacent to this watchdog/reconnect path. (role: recent reconnect-path contributor; confidence: medium; commits: e672b61417af, 21a92ea0f636; files: extensions/whatsapp/src/auto-reply/monitor.ts, extensions/whatsapp/src/connection-controller.ts)
  • rubencu: Beyond opening this PR, rubencu appears in prior merged current-main WhatsApp monitor/error-text history for the affected area. (role: recent WhatsApp contributor; confidence: medium; commits: 652f34103a4d, 3381ac93b7e2; files: extensions/whatsapp/src/auto-reply/monitor.ts)

Remaining risk / open question:

  • Latest GitHub checks for the current head include a checks-node-core failure with only a generic annotation available from the check API, so CI needs normal maintainer follow-up before merge.
  • The live proof forces the same watchdog path with shortened internal timing rather than waiting for production-length natural watchdog timing.

Codex review notes: model gpt-5.5, reasoning high; reviewed against d54bab4b887f.

@rubencu rubencu force-pushed the codex/whatsapp-watchdog-logging branch from 03830f4 to 648dd28 May 10, 2026 04:03
@openclaw-barnacle openclaw-barnacle Bot added the proof: supplied label (External PR includes structured after-fix real behavior proof.) May 10, 2026
@rubencu rubencu force-pushed the codex/whatsapp-watchdog-logging branch from 648dd28 to 19aab62 May 10, 2026 04:44
@clawsweeper clawsweeper Bot added the proof: sufficient label (ClawSweeper judged the real behavior proof convincing.) May 10, 2026
@rubencu rubencu force-pushed the codex/whatsapp-watchdog-logging branch from 19aab62 to fa403de May 10, 2026 05:31
@rubencu rubencu changed the title from "fix(whatsapp): quiet watchdog reconnects" to "fix(whatsapp): downgrade recovered watchdog disconnects" May 10, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient label (ClawSweeper judged the real behavior proof convincing.) May 10, 2026
@rubencu rubencu force-pushed the codex/whatsapp-watchdog-logging branch from fa403de to e5bf4b2 May 10, 2026 12:00
@clawsweeper clawsweeper Bot added the proof: sufficient label (ClawSweeper judged the real behavior proof convincing.) May 10, 2026
@mcaxtr mcaxtr self-assigned this May 11, 2026
@mcaxtr mcaxtr requested review from a team as code owners May 11, 2026 01:25
@openclaw-barnacle openclaw-barnacle Bot added the docs, channel: discord, channel: googlechat, channel: imessage, channel: line, channel: matrix, channel: mattermost, channel: msteams, channel: nextcloud-talk, channel: nostr, channel: signal, channel: slack, channel: telegram, channel: tlon, and channel: voice-call labels May 11, 2026
Member

mcaxtr commented May 11, 2026

Merged via squash.

Thanks @rubencu!


Labels

channel: whatsapp-web, proof: supplied, size: S
