feat(daemon): implement UpgradeMonitor goroutine for NetworkUpgradeExecute CRs (#519) #609

Merged
leninmehedy merged 11 commits into 00499-feat-solo-provisioner-daemon-core from 00519-implement-upgrade-monitor-goroutine on May 28, 2026

@leninmehedy leninmehedy commented May 22, 2026

Summary

  • Daemon now automatically detects when a network upgrade is ready and kicks off the execute workflow — no manual intervention needed
  • Survives transient network blips, proxy drops, and credential rotation without restarting the daemon process
  • Only one upgrade runs at a time — concurrent or duplicate triggers are safely rejected
  • A daemon crash mid-upgrade won't leave the system in an unknown state — panic recovery at every layer keeps the process alive and observable
  • Slow or unresponsive Kubernetes API calls can't hang the daemon indefinitely — client-side dial timeout bounds worst-case blocking
  • Configured via daemon.yaml (written by provisioner daemon install); the service file has no flags and stays identical on every node. Adds node_id (required, used as nodeId in JSONL events) and upgrade_dir (optional, defaults to /opt/hgcapp/services-hedera/HapiApp2.0/data/upgrade/current)
  • Optional CLI flags (--node-id, --kubeconfig, --orbit, --upgrade-dir) override the corresponding daemon.yaml fields when set — useful for operator debugging and CI testing without a config file on disk; production deployments set no flags
  • Missing or invalid config fails fast at startup with a clear, actionable error rather than a cryptic runtime failure
  • handleExecute is a stub — full workflow and JSONL audit trail land in subsequent stories; design is captured in docs/claude/plans/eventlog-jsonl-upgrade-event-logger.md

Test plan

  • task vm:test:unit — 6 new unit tests covering trigger, dedup, busy-rejection, phase filtering, and auth-error detection
  • task test:coverage TEST_PATHS=./internal/daemon/... TEST_REGEX="."
  • Manual UAT: see docs/claude/reviews/00519-implement-upgrade-monitor-goroutine.md for full 9-step walkthrough

🤖 Generated with Claude Code

@leninmehedy leninmehedy requested a review from a team as a code owner May 22, 2026 07:10
@leninmehedy leninmehedy requested a review from crypto-pablo May 22, 2026 07:10

swirlds-automation commented May 22, 2026

Snyk checks have passed. No issues have been found so far.

| Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- |
| Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| Licenses | 0 | 0 | 0 | 0 | 0 issues |

@leninmehedy leninmehedy requested review from brunodam and Copilot and removed request for crypto-pablo May 22, 2026 07:10

Copilot AI left a comment

Pull request overview

Implements the previously stubbed UpgradeMonitor for the solo-provisioner-daemon, adding a Kubernetes watch loop for NetworkUpgradeExecute CRs and wiring it into daemon startup with a fail-fast daemon.yaml configuration load.

Changes:

  • Implement UpgradeMonitor watch/reconnect loop with exponential backoff and auth-error kubeconfig refresh.
  • Add daemon.yaml parsing/validation at daemon startup and update daemon construction to be error-returning.
  • Add unit tests (fake dynamic client) and test-only exports/helpers.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pkg/models/weaver_paths.go | Adds DaemonConfigPath so daemon startup can locate daemon.yaml via standard paths. |
| internal/daemon/export_test.go | Adds a test-only constructor to inject daemon sub-systems. |
| internal/daemon/errors.go | Introduces typed daemon config error (ErrConfig). |
| internal/daemon/daemon.go | Changes New to load daemon.yaml, build UpgradeMonitor, and return (*Daemon, error). |
| internal/daemon/config.go | Implements DaemonConfig + LoadDaemonConfig with required-field validation. |
| internal/daemon/consensus/errors.go | Adds consensus-scoped typed errors for K8s client/watch failures. |
| internal/daemon/consensus/upgrade_monitor.go | Full UpgradeMonitor implementation (watch loop, backoff, auth rebuild, dedup, panic recovery). |
| internal/daemon/consensus/export_test.go | Exposes isAuthError to white-box tests (test-only). |
| internal/daemon/consensus/upgrade_monitor_test.go | Adds unit tests using dynamic/fake for basic watch-loop scenarios and auth-error detection. |
| cmd/daemon/main.go | Updates daemon bootstrap to handle daemon.New returning an error. |
| docs/claude/reviews/00519-implement-upgrade-monitor-goroutine.md | Adds UAT/review guide for the new monitor + config behavior. |
| docs/claude/plans/00519-implement-upgrade-monitor-goroutine.md | Adds design/plan document for the UpgradeMonitor story. |

Comment thread internal/daemon/consensus/upgrade_monitor.go Outdated
Comment thread internal/daemon/consensus/upgrade_monitor.go Outdated
Comment thread internal/daemon/consensus/upgrade_monitor_test.go
Comment thread internal/daemon/consensus/upgrade_monitor_test.go Outdated
@leninmehedy leninmehedy marked this pull request as draft May 22, 2026 07:32
@leninmehedy leninmehedy force-pushed the 00519-implement-upgrade-monitor-goroutine branch from d9d32f5 to ca92284 Compare May 22, 2026 07:42
@leninmehedy leninmehedy requested a review from Copilot May 22, 2026 07:42

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Comment thread internal/daemon/consensus/upgrade_monitor.go Outdated
Comment thread internal/daemon/consensus/upgrade_monitor.go Outdated
Comment thread cmd/daemon/main.go Outdated
Comment thread internal/daemon/config.go
@leninmehedy leninmehedy force-pushed the 00519-implement-upgrade-monitor-goroutine branch 4 times, most recently from 21475fe to dd41ccb Compare May 22, 2026 11:22
@leninmehedy leninmehedy marked this pull request as ready for review May 22, 2026 23:47
@leninmehedy leninmehedy marked this pull request as draft May 22, 2026 23:54
@leninmehedy leninmehedy force-pushed the 00519-implement-upgrade-monitor-goroutine branch 8 times, most recently from 83a3895 to 981a868 Compare May 23, 2026 01:17
@leninmehedy leninmehedy marked this pull request as ready for review May 23, 2026 01:25
@leninmehedy leninmehedy force-pushed the 00499-feat-solo-provisioner-daemon-core branch from 083d40c to 1967568 Compare May 23, 2026 04:15
@leninmehedy leninmehedy force-pushed the 00519-implement-upgrade-monitor-goroutine branch from 1a0b2bc to 7952f26 Compare May 23, 2026 12:23
@leninmehedy leninmehedy self-assigned this May 23, 2026

@brunodam brunodam left a comment

LGTM

@brunodam

There's just a minor DCO issue that needs to be fixed.

@leninmehedy leninmehedy force-pushed the 00519-implement-upgrade-monitor-goroutine branch from 7952f26 to b8aa616 Compare May 27, 2026 14:56
leninmehedy and others added 10 commits May 28, 2026 00:57
…ecute CRs (#519)

Watches NetworkUpgradeExecute CRs for ReadyForProvisionerDaemon phase
transitions and triggers the execute-phase workflow (stub — full logic in
subsequent stories). Self-healing: all watch errors retry with exponential
backoff (2 s → 5 min); auth errors additionally rebuild the dynamic client
from kubeconfig on disk so the daemon recovers after RBAC is applied without
a manual restart.

- consensus/upgrade_monitor.go: UpgradeMonitor with Run/runWatch/handleEvent/
  handleExecute (stub); buildDynamicClient; isAuthError; operationId dedup
- daemon/config.go: LoadDaemonConfig reads daemon.yaml (kubeconfig + orbit);
  fails fast if missing or fields empty
- daemon/daemon.go: New() reads daemon.yaml and constructs UpgradeMonitor;
  Run() adds it to errgroup
- daemon/export_test.go: NewWithComponents test helper (test-only, not in
  production API)
- pkg/models/weaver_paths.go: DaemonConfigPath = $home/config/daemon.yaml
- upgrade_monitor_test.go: 4 unit tests via fake dynamic client

Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
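
The phase filtering described in the commit message above could be sketched as a small predicate over the CR's unstructured content. The field paths `status.phase` and `spec.operationId` are assumptions made for illustration (the PR text does not state where these fields live), and `nestedString` and `shouldTrigger` are hypothetical helpers.

```go
package main

import "fmt"

// readyPhase is the phase named in the commit message.
const readyPhase = "ReadyForProvisionerDaemon"

// nestedString walks obj along path and returns the string found at
// the end, or false if any step is missing or has the wrong type.
func nestedString(obj map[string]any, path ...string) (string, bool) {
	var cur any = obj
	for _, key := range path {
		m, ok := cur.(map[string]any)
		if !ok {
			return "", false
		}
		next, found := m[key]
		if !found {
			return "", false
		}
		cur = next
	}
	s, ok := cur.(string)
	return s, ok
}

// shouldTrigger reports whether a CR is in the ready phase and, if so,
// returns its operation ID. ASSUMPTION: the phase lives at
// status.phase and the ID at spec.operationId.
func shouldTrigger(obj map[string]any) (string, bool) {
	phase, ok := nestedString(obj, "status", "phase")
	if !ok || phase != readyPhase {
		return "", false
	}
	opID, ok := nestedString(obj, "spec", "operationId")
	if !ok || opID == "" {
		return "", false
	}
	return opID, true
}

func main() {
	cr := map[string]any{
		"spec":   map[string]any{"operationId": "op-42"},
		"status": map[string]any{"phase": readyPhase},
	}
	fmt.Println(shouldTrigger(cr)) // op-42 true
}
```

Events in any other phase, or without an operation ID, fall through without triggering the workflow.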
…pgrade/migrate workflows

Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…olicy for daemon and UC

Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…tured journald logs

Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…ovisioner daemon check consumer

Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…leExecute stub

Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…guide to daemon.yaml approach

- Add NodeID (required) and UpgradeDir (optional, default /opt/solo/weaver/upgrade) fields
  to DaemonConfig and LoadDaemonConfig validation
- Add DaemonConfig.upgradeDir() helper with default fallback
- Update daemon.yaml example in plan + review docs with all four fields
- Replace CLI flags table in implementation-guide.md with daemon.yaml field table;
  drop --poll-interval (UpgradeMonitor is watch-based, not poll-based); clean up
  ExecStart example to show no flags

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…ns table

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
…fields

Add --node-id, --kubeconfig, --orbit, --upgrade-dir flags to
solo-provisioner-daemon. Each flag is optional and overrides the
corresponding daemon.yaml field when set. The production service file
remains flag-free; flags are for operator debugging and CI integration
testing without requiring a daemon.yaml file on disk.

Implementation:
- Extract DaemonConfig.Validate() from LoadDaemonConfig so validation
  can be called after overrides are applied
- Add NewFromConfig(paths, cfg) constructor for callers that hold a
  pre-resolved config; New() wraps it for the file-only production path
- cmd/daemon/main.go: load daemon.yaml, apply flag overrides, re-validate,
  call NewFromConfig

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
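
The "override only when set" behaviour described in this commit can be illustrated with the standard library `flag` package. This is a sketch, not the daemon's actual wiring: the real binary may use a different flag library, and `applyFlagOverrides` is a hypothetical helper. Only the four flag names come from the commit message.

```go
package main

import (
	"flag"
	"fmt"
)

// DaemonConfig holds values loaded from daemon.yaml (layout assumed).
type DaemonConfig struct {
	NodeID, Kubeconfig, Orbit, UpgradeDir string
}

// applyFlagOverrides parses args and copies only the flags that were
// explicitly passed onto cfg; fields without a matching flag keep the
// value that came from the file.
func applyFlagOverrides(cfg DaemonConfig, args []string) (DaemonConfig, error) {
	fs := flag.NewFlagSet("solo-provisioner-daemon", flag.ContinueOnError)
	nodeID := fs.String("node-id", "", "override node_id")
	kubeconfig := fs.String("kubeconfig", "", "override kubeconfig")
	orbit := fs.String("orbit", "", "override orbit")
	upgradeDir := fs.String("upgrade-dir", "", "override upgrade_dir")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	// Visit calls the function only for flags that were set on the
	// command line, which distinguishes "unset" from "set to empty".
	fs.Visit(func(f *flag.Flag) {
		switch f.Name {
		case "node-id":
			cfg.NodeID = *nodeID
		case "kubeconfig":
			cfg.Kubeconfig = *kubeconfig
		case "orbit":
			cfg.Orbit = *orbit
		case "upgrade-dir":
			cfg.UpgradeDir = *upgradeDir
		}
	})
	return cfg, nil
}

func main() {
	fileCfg := DaemonConfig{NodeID: "node1", Kubeconfig: "/etc/kube/config", Orbit: "orbit-a"}
	cfg, err := applyFlagOverrides(fileCfg, []string{"--node-id", "node9"})
	fmt.Println(cfg.NodeID, cfg.Orbit, err) // node9 orbit-a <nil>
}
```

Validation would run on the returned config, after overrides, so a flag can supply a field that the file leaves empty.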
@leninmehedy leninmehedy force-pushed the 00519-implement-upgrade-monitor-goroutine branch from b8aa616 to 326e201 Compare May 27, 2026 14:58
Signed-off-by: Lenin Mehedy <lenin.mehedy@hashgraph.com>
@leninmehedy leninmehedy force-pushed the 00519-implement-upgrade-monitor-goroutine branch from 4349009 to 4c801f2 Compare May 28, 2026 00:46
@leninmehedy leninmehedy merged commit db9fe60 into 00499-feat-solo-provisioner-daemon-core May 28, 2026
16 checks passed
@leninmehedy leninmehedy deleted the 00519-implement-upgrade-monitor-goroutine branch May 28, 2026 00:51

4 participants