feat(logs): give up on missing connections#1483

Merged
tcely merged 1 commit into main from tcely-hat-handler-connection-abort on May 28, 2026
Collaborator

@tcely tcely commented May 26, 2026

Implement an isolated, unit-based reconnection tracking engine inside the background logging handler thread. It handles both immediate startup failures (nothing was ever listening) and long-term service disappearances by spending a budget of retry units instead of tracking raw wall-clock timers.

Key architectural controls:

  • Encapsulates dynamic telemetry properties (budget, first-failure snapshot, timeout values, and bonus/cost configurations) directly inside a private _ReconnectionState dataclass instantiated inside the thread.
  • Implements a 2:1 recovery ratio (failures cost 5 units; successful handshakes reward 10 units plus 1 unit per item sent) capped strictly at 1,000 units.
  • Implements a narrow-band jitter range combined with a 1.0+ additive guard. The guard prevents rapid-fire sub-second retry loops on early drops, while the jitter injects fractional millisecond variance so that concurrent clients do not retry in lockstep (the thundering-herd problem).
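
The controls above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `ReconnectionBudget`, the starting budget of 30, the helper `next_delay`, and the 5% jitter width are all assumptions made for the example. Only the cost (5), the handshake bonus (10), the per-item bonus (1), and the cap (1,000) come from the description.

```python
import random
from dataclasses import dataclass


@dataclass
class ReconnectionBudget:
    """Hypothetical stand-in for the PR's private _ReconnectionState."""

    budget: int = 30          # assumed starting value; not stated in the PR
    failure_cost: int = 5     # from the description
    success_bonus: int = 10   # from the description
    per_item_bonus: int = 1   # from the description
    cap: int = 1_000          # from the description

    def record_failure(self) -> None:
        # Each failed connection attempt spends units, never going below zero.
        self.budget = max(0, self.budget - self.failure_cost)

    def record_success(self, items_sent: int = 0) -> None:
        # A successful handshake refunds units, plus one per item delivered,
        # but the total never exceeds the cap.
        reward = self.success_bonus + self.per_item_bonus * items_sent
        self.budget = min(self.cap, self.budget + reward)

    @property
    def retries_remaining(self) -> int:
        return self.budget // self.failure_cost

    @property
    def exhausted(self) -> bool:
        return self.budget < self.failure_cost


def next_delay(base: float, rng: random.Random, jitter: float = 0.05) -> float:
    """Assumed reading of the guard: add 1.0 to a narrowly jittered base delay."""
    return 1.0 + base * rng.uniform(1.0 - jitter, 1.0 + jitter)
```

Under these assumptions, a handler that starts at 30 units and never connects gives up after six failures, while one successful handshake pays for two later failures, which is the 2:1 ratio named above. Because of the additive 1.0, `next_delay` never returns less than one second, even when `base` is zero.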

@tcely tcely added this to the v1.0 milestone May 26, 2026
@tcely tcely requested a review from meeb May 26, 2026 14:21
@tcely tcely self-assigned this May 26, 2026
Collaborator Author

tcely commented May 26, 2026

Scenario 1 (Nothing is listening):

May 26 13:26:57 common.logging.syslog.hat: Persistent failures to connect to: 127.0.0.2:10514/TCP         Is the endpoint service running??       Retries remaining: 5      Next attempt at: 2026-05-26T13:27:03+00:00
May 26 13:27:03 common.logging.syslog.hat: Persistent failures to connect to: 127.0.0.2:10514/TCP         Is the endpoint service running??       Retries remaining: 4      Next attempt at: 2026-05-26T13:27:12+00:00
May 26 13:27:12 common.logging.syslog.hat: Persistent failures to connect to: 127.0.0.2:10514/TCP         Is the endpoint service running??       Retries remaining: 3      Next attempt at: 2026-05-26T13:27:25+00:00
May 26 13:27:25 common.logging.syslog.hat: Persistent failures to connect to: 127.0.0.2:10514/TCP         Is the endpoint service running??       Retries remaining: 2      Next attempt at: 2026-05-26T13:27:43+00:00
May 26 13:27:43 common.logging.syslog.hat: Persistent failures to connect to: 127.0.0.2:10514/TCP         Is the endpoint service running??       Retries remaining: 1      Next attempt at: 2026-05-26T13:28:08+00:00
May 26 13:28:08 common.logging.syslog.hat: Thread shutdown complete. Purged 0 local retry items and 1 queued items to prevent application blocks.
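
The waits between attempts in Scenario 1 grow with each failure. A small sketch to measure them, using only the timestamps printed in the log above; the helper name `gaps_in_seconds` is made up for this example, and the first log timestamp is assumed to be in UTC like the "Next attempt at" values:

```python
from datetime import datetime

# First logged failure, followed by each "Next attempt at" value from Scenario 1.
ATTEMPTS = [
    "2026-05-26T13:26:57+00:00",
    "2026-05-26T13:27:03+00:00",
    "2026-05-26T13:27:12+00:00",
    "2026-05-26T13:27:25+00:00",
    "2026-05-26T13:27:43+00:00",
    "2026-05-26T13:28:08+00:00",
]


def gaps_in_seconds(stamps: list[str]) -> list[int]:
    """Return the whole-second gap between each consecutive pair of timestamps."""
    times = [datetime.fromisoformat(s) for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


print(gaps_in_seconds(ATTEMPTS))  # prints [6, 9, 13, 18, 25]
```

The log timestamps have one-second resolution, so any sub-second jitter is not visible in these gaps.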

Scenario 2 (Service disappeared for a bit):

May 26 13:53:07 common.logging.syslog.hat: Persistent failures to connect to: 127.0.0.1:6514/TCP  Is the endpoint service running??       Retries remaining: 5    Next attempt at: 2026-05-26T13:53:13+00:00
May 26 13:53:13 common.logging.syslog.hat: Persistent failures to connect to: 127.0.0.1:6514/TCP  Is the endpoint service running??       Retries remaining: 4    Next attempt at: 2026-05-26T13:53:23+00:00
May 26 13:53:23 common.logging.syslog.hat: Persistent failures to connect to: 127.0.0.1:6514/TCP  Is the endpoint service running??       Retries remaining: 3    Next attempt at: 2026-05-26T13:53:36+00:00
May 26 13:53:25 common.logging.syslog.hat: Flushing logging queue in close
May 26 13:53:25 common.logging.syslog.hat: Flushing logging queue in close
May 26 13:53:36 common.logging.syslog.hat: Persistent failures to connect to: 127.0.0.1:6514/TCP  Is the endpoint service running??       Retries remaining: 2    Next attempt at: 2026-05-26T13:53:54+00:00
May 26 13:53:54 common.logging.syslog.hat: Flushing logging queue in close        

@tcely tcely force-pushed the tcely-common-logging-restructure branch from f5a0b88 to 4dcee6f Compare May 27, 2026 09:39
@tcely tcely force-pushed the tcely-hat-handler-connection-abort branch from fd86547 to 0c6503d Compare May 27, 2026 11:06
@github-project-automation github-project-automation Bot moved this to Ready in Status May 28, 2026
Base automatically changed from tcely-common-logging-restructure to main May 28, 2026 06:08
@tcely tcely marked this pull request as ready for review May 28, 2026 06:09
@tcely tcely changed the title feat(logging): give up on missing connections feat(logs): give up on missing connections May 28, 2026
@tcely tcely merged commit c1ed71d into main May 28, 2026
7 checks passed
@tcely tcely deleted the tcely-hat-handler-connection-abort branch May 28, 2026 06:10
@github-project-automation github-project-automation Bot moved this from Ready to Done in Status May 28, 2026