
fix(telegram): increase polling stall threshold from 90s to 300s #57737

Merged
steipete merged 1 commit into openclaw:main from Vitalcheffe:fix/telegram-polling-stall-threshold
Apr 21, 2026

Conversation

@Vitalcheffe
Contributor

@Vitalcheffe Vitalcheffe commented Mar 30, 2026

Summary

The Telegram polling stall detector fires at 90 seconds of API inactivity, causing false gateway restarts during legitimate LLM message processing (fixes #57660).

Root cause

POLL_STALL_THRESHOLD_MS in extensions/telegram/src/polling-session.ts was hardcoded to 90 seconds. When the bot processes a message that requires a long LLM response (2-5 minutes), no Telegram API calls are made during that time. The watchdog interprets this silence as a polling stall and forces a gateway restart, which interrupts message generation and causes 3-7 minutes of delivery failures.
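
The false positive can be reproduced with plain arithmetic. The sketch below is a simplified model of a single-threshold watchdog, not the code in polling-session.ts; `isStalled` and its parameters are hypothetical names chosen for illustration.

```typescript
// Simplified model of a single-threshold stall check (hypothetical names).
const POLL_STALL_THRESHOLD_MS = 90_000; // the hardcoded value described above

function isStalled(nowMs: number, lastApiActivityMs: number, thresholdMs: number): boolean {
  // The check only sees elapsed time since the last Telegram API call.
  // It cannot tell a wedged socket from a handler waiting on an LLM.
  return nowMs - lastApiActivityMs > thresholdMs;
}

// A handler that waits 3 minutes on an LLM makes no Telegram API calls meanwhile.
const llmWaitMs = 3 * 60_000;
const falsePositive = isStalled(llmWaitMs, 0, POLL_STALL_THRESHOLD_MS);
const withLongerThreshold = isStalled(llmWaitMs, 0, 300_000);
console.log({ falsePositive, withLongerThreshold });
```

With a 90s threshold the 3-minute wait is flagged as a stall; with 300s it is not, although a 5-minute wait would still sit right at the boundary.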

Fix

  • Increased the default stall threshold from 90s to 300s (5 minutes) to accommodate real-world LLM response times
  • Added an optional stallThresholdMs parameter to TelegramPollingSession so the threshold can be tuned without code changes in the future
  • Used the new threshold instance variable in the watchdog check instead of the hardcoded constant
  • Updated monitor.test.ts to match the new threshold values
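
A minimal sketch of how an optional threshold can flow from constructor options into the watchdog check. The class and interface names here are hypothetical and the real TelegramPollingSession takes many more options; the sketch also uses a TypeScript `private` field for brevity.

```typescript
// Hypothetical shape for illustration only.
const DEFAULT_POLL_STALL_THRESHOLD_MS = 300_000;

interface PollingSessionOptions {
  stallThresholdMs?: number; // optional override; falls back to the default
}

class PollingSessionSketch {
  private readonly stallThresholdMs: number;

  constructor(opts: PollingSessionOptions = {}) {
    this.stallThresholdMs = opts.stallThresholdMs ?? DEFAULT_POLL_STALL_THRESHOLD_MS;
  }

  // The watchdog reads the instance value rather than a module constant.
  isStalled(nowMs: number, lastActivityMs: number): boolean {
    return nowMs - lastActivityMs > this.stallThresholdMs;
  }
}

const byDefault = new PollingSessionSketch();
const tuned = new PollingSessionSketch({ stallThresholdMs: 120_000 });
console.log(byDefault.isStalled(200_000, 0), tuned.isStalled(200_000, 0));
```

Because the parameter is optional, existing callers keep working and only callers that pass a value see different behavior.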

Testing

  • extension-fast (telegram) ✅ — passes in CI
  • check / check-additional / build-smoke / build-artifacts ✅ — all pass
  • Updated all stall-detection tests in polling-session.test.ts and monitor.test.ts to use the new 300s threshold
  • All existing test scenarios remain logically valid with the new threshold

Note: build-dist and security jobs require maintainer secrets — this is expected for external contributions. The remaining checks-node-test-* and checks-windows-node-test-* failures are pre-existing on main (CI shard infrastructure issues, same failures on latest main branch run).

Changes

  • extensions/telegram/src/polling-session.ts: threshold constant, options type, class field, watchdog logic (13 insertions, 4 deletions)
  • extensions/telegram/src/polling-session.test.ts: updated mock timestamps (10 lines changed)
  • extensions/telegram/src/monitor.test.ts: updated timer advances (3 lines changed)

🤖 AI-assisted (OpenClaw agent).

@openclaw-barnacle openclaw-barnacle Bot added the channel: telegram and size: XS labels Mar 30, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Mar 30, 2026

Greptile Summary

This PR increases the Telegram polling stall watchdog threshold from 90 seconds to 5 minutes (300 seconds) to prevent false-positive gateway restarts during legitimate long-running LLM message generation. It also adds an optional stallThresholdMs parameter to TelegramPollingSession to make the threshold tunable without code changes.

The implementation is clean and correct:

  • POLL_STALL_THRESHOLD_MS is renamed to DEFAULT_POLL_STALL_THRESHOLD_MS and updated to 300_000
  • A new stallThresholdMs option flows through the constructor into a private class field #stallThresholdMs
  • The watchdog logic reads from this.#stallThresholdMs instead of the former constant (including the stall-diag suppression window at stallThreshold / 2)
  • All existing stall-detection tests are updated with the new mock timestamp (310_001) to remain consistent with the 300s threshold

The test suite remains thorough: stall fires when both getUpdates and API activity are stale past the threshold, suppresses when a recent completed API call or in-flight API call is within the threshold, and correctly handles multiple concurrent in-flight calls with different start times.
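
The scenarios listed above can be modeled as one decision function. This is a sketch under assumptions, not the repository's implementation: the `LivenessSnapshot` shape and `shouldRestart` are hypothetical, and it assumes that any in-flight call started within the threshold counts as liveness.

```typescript
// Hypothetical snapshot of what a watchdog might track.
interface LivenessSnapshot {
  lastGetUpdatesAtMs: number;    // last getUpdates activity
  lastApiCompletedAtMs: number;  // last completed non-polling API call
  inFlightApiStartsMs: number[]; // start times of API calls still in flight
}

function shouldRestart(nowMs: number, s: LivenessSnapshot, thresholdMs: number): boolean {
  const pollStale = nowMs - s.lastGetUpdatesAtMs > thresholdMs;
  const completedFresh = nowMs - s.lastApiCompletedAtMs <= thresholdMs;
  // Assumption: one recent in-flight call is enough to suppress the restart.
  const inFlightFresh = s.inFlightApiStartsMs.some((start) => nowMs - start <= thresholdMs);
  return pollStale && !completedFresh && !inFlightFresh;
}

const now = 310_001; // same mock timestamp the updated tests use
const threshold = 300_000;
const allStale = shouldRestart(now, { lastGetUpdatesAtMs: 0, lastApiCompletedAtMs: 0, inFlightApiStartsMs: [] }, threshold);
const recentCompleted = shouldRestart(now, { lastGetUpdatesAtMs: 0, lastApiCompletedAtMs: 200_000, inFlightApiStartsMs: [] }, threshold);
const mixedInFlight = shouldRestart(now, { lastGetUpdatesAtMs: 0, lastApiCompletedAtMs: 0, inFlightApiStartsMs: [0, 250_000] }, threshold);
console.log({ allStale, recentCompleted, mixedInFlight });
```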

Confidence Score: 5/5

Safe to merge — targeted threshold increase with a clean configurability hook and fully passing test coverage.

All changes are straightforward, correctly implemented, and well-tested. The threshold increase directly addresses the described false-positive restart problem. No regressions are introduced, the API is additive (optional parameter with a sensible default), and every stall-detection test scenario remains logically valid after the timestamp updates.

No files require special attention.

Important Files Changed

| Filename | Overview |
|---|---|
| extensions/telegram/src/polling-session.ts | Threshold constant renamed and increased to 300s, stallThresholdMs option and #stallThresholdMs field added, watchdog uses instance variable throughout. All correct. |
| extensions/telegram/src/polling-session.test.ts | Mock timestamps updated from 120_001 to 310_001 to match the new 300s threshold; all test scenarios remain logically valid. |

Reviews (1): Last reviewed commit: "fix(telegram): increase polling stall th..."

@Vitalcheffe
Contributor Author

CI Status

All checks related to this PR pass:

  • extension-fast (telegram) — pass
  • check — pass
  • check-additional — pass
  • build-smoke — pass
  • build-artifacts — pass
  • Greptile Review — 5/5 confidence

The remaining failures (checks-node-test-*, checks-windows-node-test-*, checks-fast-extensions) are pre-existing on main — the same jobs fail on the latest main branch CI run (23752022218). These are CI infrastructure issues (shard splitting errors) unrelated to this change.

This PR touches only extensions/telegram/src/polling-session.ts and its tests. No other code paths are affected.

@Vitalcheffe
Contributor Author

@steipete Ready for review.

All telegram-related CI checks pass. The other failures are pre-existing on main (shard infra) or require internal secrets (build-dist, security).

What this fixes: Telegram bot gateway restarts mid-message when LLM takes >90s to respond. Users lose their responses and have to wait 3-7 minutes for recovery. Reported by multiple users in #57660.

The fix is 3 lines of logic — just raising the hardcoded 90s stall threshold to 300s and making it configurable for the future. Tests updated accordingly.

Happy to adjust the threshold value if you prefer a different default.

@Vitalcheffe
Contributor Author

Update on CI failures:

All failing jobs are pre-existing on main (confirmed on run 23752022218):

| Job | Failure | Also on main? |
|---|---|---|
| extension-fast (telegram) | ✅ PASS | |
| checks-fast-extensions | memory-core qmd-manager test | |
| checks-node-test-1/2/3/4 | Shard infra error | |
| checks-windows-node-test-2/3/4/5 | Shard infra error | |

This PR touches only extensions/telegram/ — none of the failing tests are related to the change.

@steipete
Contributor

Maintainer triage from the current Telegram stall reports: I would not merge this as-is.

Raising the single watchdog threshold from 90s to 300s helps #57660-style false positives during long model runs, but it also delays recovery for the active getUpdates wedge we are seeing in #69147 and #64288. Those logs show active getUpdates stuck for 285s, 527s, 900s, 1000s. With this PR, the 285s case would not restart yet, and all real socket wedges get a longer outage window.
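
To make the trade-off concrete, the sketch below applies both thresholds to the four stuck durations quoted above. `restartsTriggered` is a hypothetical helper, and the arithmetic ignores how often the watchdog actually runs its check.

```typescript
// Stuck getUpdates durations quoted in the triage comment, in milliseconds.
const observedWedgesMs = [285_000, 527_000, 900_000, 1_000_000];

// Returns the wedges that a given threshold would already have acted on.
function restartsTriggered(wedgesMs: number[], thresholdMs: number): number[] {
  return wedgesMs.filter((stuckMs) => stuckMs > thresholdMs);
}

const at90s = restartsTriggered(observedWedgesMs, 90_000);
const at300s = restartsTriggered(observedWedgesMs, 300_000);

// Every real wedge waits this much longer before recovery starts.
const addedOutageMs = 300_000 - 90_000;
console.log({ at90s, at300s, addedOutageMs });
```

At 90s all four wedges trigger a restart; at 300s the 285s case does not, and every real wedge is detected 210s later than before.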

Better shape: split the thresholds instead of one global bump:

  • keep active in-flight getUpdates stall detection short enough to recover wedged sockets quickly;
  • only lengthen the idle/no-completed-poll false-positive path, or use the bookkeeping fix from #64333 (fix: avoid false telegram polling stall restarts) so completed getUpdates errors count as liveness.

Recommendation: do not merge until it distinguishes false quiet windows from actual active socket hangs.
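
One possible reading of the split-threshold suggestion, sketched under assumptions: the state shape, the two constant values, and `stallReason` are all hypothetical and are not taken from the repository.

```typescript
// Hypothetical polling state: either a getUpdates call is active or it is not.
interface PollState {
  getUpdatesInFlightSinceMs: number | null; // null when no getUpdates call is active
  lastCompletedPollAtMs: number;
}

const ACTIVE_POLL_STALL_MS = 90_000; // assumed: short, to recover wedged sockets quickly
const IDLE_STALL_MS = 300_000;       // assumed: long, to tolerate quiet handler windows

function stallReason(nowMs: number, s: PollState): "active-hang" | "idle-stall" | null {
  if (s.getUpdatesInFlightSinceMs !== null) {
    // An active request that has not returned is judged by the short threshold.
    return nowMs - s.getUpdatesInFlightSinceMs > ACTIVE_POLL_STALL_MS ? "active-hang" : null;
  }
  // With no request in flight, only a long silence counts as a stall.
  return nowMs - s.lastCompletedPollAtMs > IDLE_STALL_MS ? "idle-stall" : null;
}

const wedged = stallReason(285_000, { getUpdatesInFlightSinceMs: 0, lastCompletedPollAtMs: 0 });
const quietHandler = stallReason(180_000, { getUpdatesInFlightSinceMs: null, lastCompletedPollAtMs: 0 });
console.log({ wedged, quietHandler });
```

Under these assumptions the 285s wedge is caught while a 3-minute quiet window is left alone, which is the distinction the recommendation asks for.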

@steipete
Contributor

Triage note after the recent Telegram polling work landed on main: I am not closing this one yet.

Current main deliberately kept POLL_STALL_THRESHOLD_MS at 90s and instead made liveness tracking more precise around completed getUpdates calls and recent non-polling Telegram API activity. That should reduce false stall decisions without stretching recovery to 5 minutes, but this PR's exact long-running-handler scenario needs a current-main retest before we can call it resolved.

@steipete steipete force-pushed the fix/telegram-polling-stall-threshold branch from 9619432 to 3f95fc8 April 21, 2026 00:02
@openclaw-barnacle openclaw-barnacle Bot added the docs and size: S labels and removed the size: XS label Apr 21, 2026
@steipete steipete merged commit 8c05043 into openclaw:main Apr 21, 2026
10 of 11 checks passed
@steipete
Contributor

Landed via squash merge onto main.

  • Gate: pnpm test extensions/telegram/src/polling-session.test.ts extensions/telegram/src/monitor.test.ts extensions/telegram/src/config-schema.test.ts extensions/telegram/src/polling-liveness.test.ts; pnpm format:check -- CHANGELOG.md docs/channels/telegram.md extensions/telegram/src/config-schema.test.ts extensions/telegram/src/config-ui-hints.ts extensions/telegram/src/monitor.test.ts extensions/telegram/src/monitor.ts extensions/telegram/src/polling-session.test.ts extensions/telegram/src/polling-session.ts src/config/types.telegram.ts src/config/zod-schema.providers-core.ts; pnpm check:changed; pnpm check; pnpm test
  • Source commit: 3f95fc8
  • Land commit: 8c05043

Adapted this to a middle-ground 120s default plus bounded channels.telegram.pollingStallThresholdMs / per-account override rather than the original 300s default.
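
A sketch of how a bounded setting with a per-account override might resolve to an effective threshold. The 120s default and the 30s to 600s bounds come from this PR; the config shape, the `accounts` key, the account-over-channel precedence, and `resolveStallThresholdMs` are assumptions for illustration.

```typescript
const DEFAULT_STALL_THRESHOLD_MS = 120_000; // landed default
const MIN_STALL_THRESHOLD_MS = 30_000;      // lower bound in the Zod schema
const MAX_STALL_THRESHOLD_MS = 600_000;     // upper bound in the Zod schema

// Hypothetical config shapes; the real types live in the repository.
interface TelegramAccountConfig {
  pollingStallThresholdMs?: number;
}
interface TelegramChannelConfig {
  pollingStallThresholdMs?: number;
  accounts?: Record<string, TelegramAccountConfig>;
}

function resolveStallThresholdMs(channel: TelegramChannelConfig, accountId?: string): number {
  const account = accountId !== undefined ? channel.accounts?.[accountId] : undefined;
  // Assumed precedence: account override, then channel value, then default.
  const value = account?.pollingStallThresholdMs ?? channel.pollingStallThresholdMs ?? DEFAULT_STALL_THRESHOLD_MS;
  const isInteger = Math.floor(value) === value;
  if (!isInteger || value < MIN_STALL_THRESHOLD_MS || value > MAX_STALL_THRESHOLD_MS) {
    throw new RangeError("pollingStallThresholdMs must be an integer between 30000 and 600000");
  }
  return value;
}

const config: TelegramChannelConfig = {
  pollingStallThresholdMs: 180_000,
  accounts: { support: { pollingStallThresholdMs: 300_000 } },
};
console.log(resolveStallThresholdMs({}), resolveStallThresholdMs(config), resolveStallThresholdMs(config, "support"));
```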

Thanks @Vitalcheffe!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f95fc8c3e


streaming: ChannelPreviewStreamingConfigSchema.optional(),
mediaMaxMb: z.number().positive().optional(),
timeoutSeconds: z.number().int().positive().optional(),
pollingStallThresholdMs: z.number().int().min(30_000).max(600_000).optional(),
P2 Badge Regenerate bundled schema metadata for pollingStallThresholdMs

Adding pollingStallThresholdMs to the Telegram Zod schema here without regenerating bundled config metadata leaves the static channel schema out of sync (the Telegram entry in src/config/bundled-channel-config-metadata.generated.ts still has timeoutSeconds but no pollingStallThresholdMs). That drift means consumers that depend on generated channel metadata for schema/hint surfaces can miss the new option, so config discovery and validation messaging become inconsistent.
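
The drift described here amounts to a set difference between the schema's keys and the generated metadata's keys. The check below is a hypothetical illustration; the key lists are examples rather than the real file contents, and `missingFromMetadata` is not a function in the repository.

```typescript
// Returns schema keys that the generated metadata does not know about.
function missingFromMetadata(schemaKeys: string[], metadataKeys: string[]): string[] {
  return schemaKeys.filter((key) => metadataKeys.indexOf(key) === -1);
}

// Example key lists (assumed for illustration).
const schemaKeys = ["streaming", "mediaMaxMb", "timeoutSeconds", "pollingStallThresholdMs"];
const generatedMetadataKeys = ["streaming", "mediaMaxMb", "timeoutSeconds"];

const drift = missingFromMetadata(schemaKeys, generatedMetadataKeys);
console.log(drift);
```

A check of this kind would report an empty list once the generated metadata has been regenerated.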



Labels

channel: telegram, docs, size: S


Development

Successfully merging this pull request may close these issues.

[Bug]: Telegram polling stall detector fires too aggressively (110s), causes message delivery failures

2 participants