Fix distributed runner startup and scheduling safety#704
Conversation
Collection of fixes discovered while dogfooding distributed runners on Garden: - Runner join: use coordinator address from -c flag instead of server's bind address; rewrite loopback etcd endpoints with coordinator host (#1) - Runner start: find containerd in release directory, matching server behavior (#2) - Runner join: accept join code via --code flag or stdin pipe (#5) - Etcd TLS: include discovered IPs in server cert SANs; validate/reuse existing certs across restarts instead of regenerating every boot (#6) - Runner logging: add INFO-level logs for RPC connection, entity registration, session creation, network TLS, and status transitions (#7) - Runner start: initialize netdb subnet from flannel lease for sandbox controller (#8) - Runner start: lazy-init NetServ when not provided (#9) - Node entity: add ident to Join-created entities so setupEntity finds and updates the same entity; set RunnerId in setupEntity (#10) - New `miren runner status` command for local health inspection (#11) - Containerd readiness: dial socket instead of checking file existence to prevent connection-refused race (#12) - Scheduler: require both status=READY and non-empty ApiAddress before scheduling to a node (#13)
Distributed runners were reconciling disk mounts, volumes, and leases for all nodes globally, causing tight loops when they tried to attach disks belonging to the coordinator. This flooded etcd and locked up the server. Add node ID filtering to: - DiskLeaseController.reconcileLease: skip leases assigned to other nodes - DiskController.handleProvisioning: skip volumes owned by other nodes - DiskMountWatchController: skip mount events for other nodes - DiskVolumeWatchController: skip volume events for other nodes This is the foundation for eventually floating disks between nodes — each node only acts on its own resources.
Distributed runners couldn't pull images because cluster.local (the coordinator's OCI registry hostname) wasn't in their resolver. Map it to the coordinator's IP at runner startup so image pulls go to the right place over the flannel network.
Runner was advertising `:8444` (no IP) as its ApiAddress, so the coordinator couldn't proxy traffic to runner sandboxes. Discover the machine's outbound IP by doing a UDP dial to the coordinator address and use that as the listen/advertise address.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughAdds a new Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (3)
cli/commands/commands.go (1)
627-633: Consider adding--format jsonsupport for scripting use cases.The command registration follows the established pattern and is correctly feature-gated. As per coding guidelines, commands should support
--format jsonusing theFormatOptionspattern for machine-readable output. While this is a diagnostic command, JSON output would be useful for automation and monitoring scripts.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/commands/commands.go` around lines 627 - 633, The runner status command should support machine-readable JSON output: update the command registration that dispatches "runner status" to include the common FormatOptions (e.g., add WithFormatOptions(FormatOptions{}) / WithOptions(FormatOptions) per project pattern) and modify the RunnerStatus handler to accept a FormatOptions (or options struct) parameter and emit JSON when the format option is "json" (serialize the status/configuration result to JSON and write to stdout) while preserving the existing human-readable output for the default case; reference the existing Dispatch("runner status")/Infer(..., RunnerStatus) registration and the RunnerStatus function to implement this change.cli/commands/runner_status.go (2)
14-17: Consider adding--format jsonsupport.As per coding guidelines, CLI commands should support
--format jsonusing theFormatOptionspattern withsnake_caseJSON struct tags. This would enable scripted health checks and monitoring integrations.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/commands/runner_status.go` around lines 14 - 17, The RunnerStatus command lacks standardized --format support; update the RunnerStatus function signature to accept the existing FormatOptions struct (or add one if missing) and wire it into output generation so it supports --format json; when emitting JSON ensure the output struct uses snake_case JSON tags (e.g., for any status struct returned by RunnerStatus) and use the FormatOptions to switch between human-readable and JSON rendering so scripted consumers can pass --format json to receive snake_case JSON.
81-87: Sandbox count may include non-running sandboxes.Counting directory entries in the containerd task directory includes all tasks regardless of state (running, stopped, paused). This could overcount "running" sandboxes. Consider this a known limitation for this diagnostic command.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/commands/runner_status.go` around lines 81 - 87, The current logic in runner_status.go counts all directory entries in sandboxDir (constructed as filepath.Join(opts.DataPath, "containerd", "io.containerd.runtime.v2.task", "miren")) and prints that as running, which can include non-running tasks; replace the simple os.ReadDir count with iterating the entries returned by os.ReadDir(sandboxDir) and for each entry determine whether the task is actually running (e.g., inspect a per-task state/status file in the task directory or query containerd task state via the containerd client) and only increment the running counter for those whose state is "running"; then use ctx.Printf to print the running count and preserve the existing error handling when os.ReadDir fails.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/commands/runner_start.go`:
- Around line 269-279: The discoverOutboundIP function assumes an IPv4 local
address and will panic if addr.IP.To4() returns nil for IPv6; update
discoverOutboundIP to handle both IPv4 and IPv6 by checking addr.IP.To4() for
nil and returning a clear error or using addr.IP.To16() when appropriate (or
explicitly document and enforce IPv4-only by using "udp4" and returning a
descriptive error when To4() is nil). Locate the function discoverOutboundIP and
add a nil check on addr.IP.To4() (and fall back to To16() or return an error) so
the function no longer panics on IPv6 local addresses.
In `@cli/commands/runner_status.go`:
- Around line 38-45: The current containerd readiness check only inspects the
socket file via socketPath (built with filepath.Join) and can report false
positives; replace the os.Stat-based check in the containerd block with an
actual Unix socket dial (e.g., net.Dial or net.DialTimeout to socketPath) to
verify the daemon is responsive, treat a successful connection as "running" and
failures as "not running", and ensure the connection is closed after the check;
update the code around socketPath and the Containerd output (the ctx.Printf
calls) accordingly.
- Around line 27-36: The code prints the coordinator address and then prints a
second "Coordinator:" line for reachability, causing duplicate labels; update
the logic around cfg.CoordinatorAddress and the net.DialTimeout check so you
emit a single consolidated line that includes both the address and status (e.g.,
"Coordinator: <address> (reachable)" or "Coordinator: <address> (unreachable:
<err>)"). Locate the initial ctx.Printf that prints cfg.CoordinatorAddress and
the reachability block that uses net.DialTimeout and ctx.Printf; remove the
standalone address print and instead print one combined ctx.Printf after the
DialTimeout result, including the error text when unreachable.
In `@components/coordinate/coordinate.go`:
- Around line 212-214: The code currently ignores errors from os.ReadFile for
clientCertFile and clientKeyFile which can produce an EtcdTLSConfig with empty
CertPEM/KeyPEM; update the block that reads clientPEM and clientKey to check and
handle errors from os.ReadFile for both clientCertFile and clientKeyFile (e.g.,
return the read error or trigger the certificate regeneration path instead of
proceeding), and ensure any returned EtcdTLSConfig (or the caller) does not
receive empty CertPEM/KeyPEM silently; reference the clientCertFile,
clientKeyFile, clientPEM, clientKey, and EtcdTLSConfig symbols when making the
change.
---
Nitpick comments:
In `@cli/commands/commands.go`:
- Around line 627-633: The runner status command should support machine-readable
JSON output: update the command registration that dispatches "runner status" to
include the common FormatOptions (e.g., add WithFormatOptions(FormatOptions{}) /
WithOptions(FormatOptions) per project pattern) and modify the RunnerStatus
handler to accept a FormatOptions (or options struct) parameter and emit JSON
when the format option is "json" (serialize the status/configuration result to
JSON and write to stdout) while preserving the existing human-readable output
for the default case; reference the existing Dispatch("runner
status")/Infer(..., RunnerStatus) registration and the RunnerStatus function to
implement this change.
In `@cli/commands/runner_status.go`:
- Around line 14-17: The RunnerStatus command lacks standardized --format
support; update the RunnerStatus function signature to accept the existing
FormatOptions struct (or add one if missing) and wire it into output generation
so it supports --format json; when emitting JSON ensure the output struct uses
snake_case JSON tags (e.g., for any status struct returned by RunnerStatus) and
use the FormatOptions to switch between human-readable and JSON rendering so
scripted consumers can pass --format json to receive snake_case JSON.
- Around line 81-87: The current logic in runner_status.go counts all directory
entries in sandboxDir (constructed as filepath.Join(opts.DataPath, "containerd",
"io.containerd.runtime.v2.task", "miren")) and prints that as running, which can
include non-running tasks; replace the simple os.ReadDir count with iterating
the entries returned by os.ReadDir(sandboxDir) and for each entry determine
whether the task is actually running (e.g., inspect a per-task state/status file
in the task directory or query containerd task state via the containerd client)
and only increment the running counter for those whose state is "running"; then
use ctx.Printf to print the running count and preserve the existing error
handling when os.ReadDir fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 725bc24d-779c-47f6-92dc-f9aa531ab5ac
📒 Files selected for processing (16)
cli/commands/commands.gocli/commands/runner_join.gocli/commands/runner_start.gocli/commands/runner_status.gocli/commands/server.gocomponents/containerd/containerd.gocomponents/coordinate/coordinate.gocomponents/runner/runner.gocontrollers/disk/disk_controller.gocontrollers/disk/disk_lease_controller.gocontrollers/disk/disk_mount_watch_controller.gocontrollers/disk/disk_volume_watch_controller.gocontrollers/disk/sandbox_integration_test.gocontrollers/scheduler/scheduler.gocontrollers/scheduler/scheduler_test.goservers/runner/registration.go
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cli/commands/runner_start.go (1)
167-171: Handle host mapping failures explicitly for diagnosability.
SplitHostPortandhostMapper.SetHosterrors are currently ignored; when mapping fails,cluster.localresolution can silently degrade and is harder to debug.Suggested refactor
- coordinatorHost, _, _ := net.SplitHostPort(cfg.CoordinatorAddress) - if addr, err := netip.ParseAddr(coordinatorHost); err == nil { - hostMapper.SetHost("cluster.local", addr) - ctx.Log.Info("mapped cluster.local to coordinator", "addr", addr) - } + coordinatorHost, _, err := net.SplitHostPort(cfg.CoordinatorAddress) + if err != nil { + ctx.Log.Warn("could not parse coordinator host for cluster.local mapping", + "coordinator_address", cfg.CoordinatorAddress, "error", err) + } else if addr, err := netip.ParseAddr(coordinatorHost); err == nil { + if err := hostMapper.SetHost("cluster.local", addr); err != nil { + ctx.Log.Warn("failed to map cluster.local to coordinator", "addr", addr, "error", err) + } else { + ctx.Log.Info("mapped cluster.local to coordinator", "addr", addr) + } + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cli/commands/runner_start.go` around lines 167 - 171, The current mapping of "cluster.local" ignores errors from net.SplitHostPort and hostMapper.SetHost which hides failures; change the block that reads cfg.CoordinatorAddress (the net.SplitHostPort call) to check and handle its returned error, validate coordinatorHost via netip.ParseAddr as before, and call hostMapper.SetHost while checking its error result; on any failure log a clear ctx.Log.Error with the error and relevant context (cfg.CoordinatorAddress/coordinatorHost) and avoid silently proceeding so diagnostics show when mapping fails instead of only logging success via ctx.Log.Info.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cli/commands/runner_start.go`:
- Around line 167-171: The current mapping of "cluster.local" ignores errors
from net.SplitHostPort and hostMapper.SetHost which hides failures; change the
block that reads cfg.CoordinatorAddress (the net.SplitHostPort call) to check
and handle its returned error, validate coordinatorHost via netip.ParseAddr as
before, and call hostMapper.SetHost while checking its error result; on any
failure log a clear ctx.Log.Error with the error and relevant context
(cfg.CoordinatorAddress/coordinatorHost) and avoid silently proceeding so
diagnostics show when mapping fails instead of only logging success via
ctx.Log.Info.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4799823f-aea9-4981-9398-69b35808ee64
📒 Files selected for processing (3)
cli/commands/runner_start.gocli/commands/runner_status.gocomponents/coordinate/coordinate.go
✅ Files skipped from review due to trivial changes (1)
- cli/commands/runner_status.go
🚧 Files skipped from review as they are similar to previous changes (1)
- components/coordinate/coordinate.go
eee8231 to
1abe22b
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/coordinate/coordinate.go`:
- Around line 201-234: The reuse path currently only checks SANs/expiry and
client file readability in loadX509Cert/validateAPICertificate and then returns
an EtcdTLSSetupResult, which can leave a stale or mismatched CA or bad key
files; update the reuse branch in the function containing the
existing/validateAPICertificate logic to (1) load the CA via
ca.GetCACertificate() and verify both serverCertFile and clientCertFile chain to
that CA, (2) load and parse serverKeyFile and clientKeyFile and ensure each
private key matches its corresponding certificate public key, (3) ensure CAFile
exists and its contents match the CA returned by ca.GetCACertificate()
(refresh/write it if not), and only then return the EtcdTLSSetupResult with
CertsDir/CAFile populated; if any verification step fails, fallthrough to the
existing regenerate label so certificates are regenerated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cc4d176d-863e-4c32-952f-1763b81de900
📒 Files selected for processing (3)
cli/commands/runner_start.gocli/commands/runner_status.gocomponents/coordinate/coordinate.go
🚧 Files skipped from review as they are similar to previous changes (2)
- cli/commands/runner_start.go
- cli/commands/runner_status.go
1abe22b to
c5e0663
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
components/runner/integration_test.go (1)
46-50: Assert the registered API address when usinglocalhost:0.This now depends on the runner replacing port
0with the actual bound port during registration. The test only checksREADY, so a regression that still advertiseslocalhost:0or:0would slip through while still satisfying the scheduler's non-emptyApiAddressgate. Please add an assertion that the node entity exposes a concrete API address before continuing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@components/runner/integration_test.go` around lines 46 - 50, The test currently creates RunnerConfig (Id "test-runner", ListenAddress "localhost:0") but only waits for the scheduler READY signal; add an assertion after registration that fetches the registered node entity (the node record created by the runner) and verifies its ApiAddress is a concrete address—i.e., non-empty and not "localhost:0" or ":0" and does not end with ":0"—so replace the single READY-only check with an extra check against the registered node's ApiAddress field to ensure the runner replaced port 0 with the actual bound port.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cli/commands/runner_start.go`:
- Around line 143-150: The code currently falls back to setting listenAddr =
":8444" on discoverOutboundIP(cfg.CoordinatorAddress) failure which produces a
non-empty ApiAddress and lets the scheduler accept an unusable node; instead,
when discoverOutboundIP returns an error in the block that computes listenAddr,
do NOT set listenAddr to ":"+port—log the error via ctx.Log.Warn/Error with the
error attached and return a startup error (or call os.Exit(1)) from the function
that invokes this code (the surrounding runner start routine) so the process
fails fast; update any callers or error handling around
listenAddr/discoverOutboundIP to propagate this failure rather than advertising
a bind-all address.
---
Nitpick comments:
In `@components/runner/integration_test.go`:
- Around line 46-50: The test currently creates RunnerConfig (Id "test-runner",
ListenAddress "localhost:0") but only waits for the scheduler READY signal; add
an assertion after registration that fetches the registered node entity (the
node record created by the runner) and verifies its ApiAddress is a concrete
address—i.e., non-empty and not "localhost:0" or ":0" and does not end with
":0"—so replace the single READY-only check with an extra check against the
registered node's ApiAddress field to ensure the runner replaced port 0 with the
actual bound port.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 488e8b3f-fe85-411d-893c-22627cda9cc8
📒 Files selected for processing (4)
cli/commands/runner_start.gocli/commands/runner_status.gocomponents/coordinate/coordinate.gocomponents/runner/integration_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cli/commands/runner_status.go
- discoverOutboundIP: guard against nil To4() instead of panicking - runner status: merge coordinator address and reachability into one line - etcd cert reuse: fail to regenerate if client cert/key files are unreadable
c5e0663 to
d0b881e
Compare
This is the big pile of fixes that got distributed runners from "joins but can't do anything" to "serving production traffic on Garden." Everything here was discovered during a day of dogfooding on the Garden cluster with two runner nodes.
The journey went roughly like this: runners couldn't connect to etcd (cert SANs were localhost-only), then couldn't find containerd (release dir not on PATH), then crashed on startup (missing subnet and network service init), then created duplicate node entities (no ident on join), then took down the entire cluster twice (once by getting scheduled work they couldn't handle, once by reconciling disk mounts for other nodes), then couldn't pull images (no registry resolution), and finally advertised
:8444instead of a real IP. Each fix unlocked the next failure.What's in here
The first commit is the bulk — issues 1 through 13 from the Linear ticket, squashed into one since they were developed iteratively. The remaining commits are the fixes that came after the first deploy:
FindReleasePath()pattern so runners find containerd in/var/lib/miren/release/--codeflag and stdin pipe so joins work in non-interactive shellsmiren runner statuscommand — new command for local health inspectioncluster.localto coordinator IP so runners can pull from the registryFollow-up issues split out to separate tickets: MIR-917 (runner remove), MIR-918 (version update on restart), MIR-919 (human-readable names), MIR-920 (API cert IP SANs).
Closes MIR-903