deploy: replace fixed delays with event-driven waits for SSH auth and docker #2887

Merged
berendt merged 2 commits into main from rl_fix_ssh_race on May 8, 2026

@ideaship commented May 6, 2026

Bootstrap used two fixed 60-second delays as timing guards: one after detecting port 22 open, and one after the manager reboot. Fixed sleeps are neither reliable nor efficient — they can still be beaten by a slow VM and always add latency to a fast one.

This replaces both delays with active polls:

  • SSH auth (wait for manager ssh auth): polls `ssh -o BatchMode=yes … true` with the Terraform key and image user until it succeeds. Also adds retries to `ssh-keyscan` to handle early sshd startup. The race this addresses was observed twice in testing against an OpenStack deployment (once at the keyscan step, once at the first Ansible connection), confirming that the fixed delay was not sufficient protection.
  • Docker (wait for docker after manager reboot): polls `docker info` over SSH as the dragon user after the manager reboot.
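As a rough illustration of the difference between a fixed delay and an active poll, the sketch below shows a bounded polling helper in Python. It is not the playbook code from this PR: `wait_until`, `command_succeeds`, and the default timeout and interval values are hypothetical names and numbers chosen for the example.

```python
import subprocess
import time
from typing import Callable, Sequence


def wait_until(
    probe: Callable[[], bool],
    timeout: float = 300.0,   # assumed upper bound, not taken from the PR
    interval: float = 5.0,    # assumed pause between attempts
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or `timeout` seconds have elapsed.

    Unlike a fixed sleep, this returns as soon as the condition holds
    and reports failure if the condition never becomes true in time.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def command_succeeds(argv: Sequence[str], timeout: float = 15.0) -> bool:
    """Return True if the command exits with status 0 within `timeout` seconds."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a hung command counts as "not ready yet".
        return False
    return result.returncode == 0
```

A probe such as `lambda: command_succeeds(["docker", "info"])` would let the caller continue on the first successful attempt, while a slow machine simply consumes more of the bounded window instead of failing the next step.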

The second guard in particular may be incomplete — the original delay could have been covering more than docker readiness, but the old code gives no indication of what else it was waiting for. A probabilistic race cannot be proven correct by any finite number of successful runs, but these guards target concrete, observable conditions rather than guessing at timing. Local testing confirms both conditions resolve promptly with no failures observed.

@ideaship requested a review from berendt May 6, 2026 15:09
@ideaship moved this from Ready to In review in Human Board May 6, 2026
@ideaship added 2 commits May 7, 2026 06:56
Bootstrap previously used a fixed 60-second delay after detecting that
port 22 was open on the manager. That only proved sshd had started and
could still race with cloud-init writing the image user's authorized_keys.
The next ssh-keyscan or Ansible connection could then fail transiently.

Retry ssh-keyscan so host key collection tolerates early sshd startup, and
replace the fixed delay with an active public-key authentication probe that
uses the same Terraform key and image user as the following manager
playbook. This lets bootstrap continue as soon as authenticated SSH is
actually ready while still giving cloud-init a bounded window to finish.
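The two retries described above might be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the playbook tasks from this commit: the helper names (`retry`, `keyscan`, `ssh_auth_ok`), the `ConnectTimeout` and `-T` values, and the retry counts are all hypothetical, and the full `ssh` command line used by the PR is not shown in the description.

```python
import subprocess
import time
from typing import Callable, Optional


def retry(
    action: Callable[[], Optional[str]],
    retries: int,
    delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Run `action` up to `retries` times and return its first non-None result."""
    for attempt in range(retries):
        result = action()
        if result is not None:
            return result
        if attempt < retries - 1:
            sleep(delay)  # no pause after the final attempt
    return None


def keyscan(host: str, run=subprocess.run) -> Optional[str]:
    """Return the host key lines printed by ssh-keyscan, or None if there are none.

    Empty output is treated as "not ready" so the caller retries even
    when the exit status alone does not signal a failure.
    """
    try:
        proc = run(["ssh-keyscan", "-T", "5", host], capture_output=True, text=True)
    except OSError:
        return None
    keys = proc.stdout.strip()
    return keys if proc.returncode == 0 and keys else None


def ssh_auth_ok(host: str, user: str, key_path: str, run=subprocess.run) -> Optional[str]:
    """Return "ok" if public-key login works, else None.

    BatchMode=yes makes ssh fail instead of prompting for a password;
    the remote command `true` does nothing and exits 0 on success.
    """
    argv = [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", "ConnectTimeout=10",  # assumed value
        "-i", key_path,
        f"{user}@{host}",
        "true",
    ]
    try:
        proc = run(argv, capture_output=True, text=True)
    except OSError:
        return None
    return "ok" if proc.returncode == 0 else None
```

Probing with the same key and user as the following playbook means a success exercises the exact path the next step depends on, including cloud-init having written `authorized_keys`.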

AI-assisted: Codex/GPT
Signed-off-by: Roger Luethi <luethi@osism.tech>

The deploy playbook still used a fixed 60-second delay after rebooting
the manager. This wait has not been observed to fail in practice, but the
first bootstrap wait already showed that relying on a fixed timeout is
brittle when service startup timing varies.

Replace the delay with an active readiness check over the same SSH path
used by the following deploy commands. The probe logs in as dragon with
the Terraform key and runs docker info, so bootstrap continues only once
SSH authentication works and Docker is available to the operator user.
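A readiness check of this shape could look roughly like the Python sketch below. It assumes a hypothetical host address and key path, an assumed `ConnectTimeout`, and illustrative timeout values; only the `dragon` user and the `docker info` command come from the commit message, and the function names are invented for the example.

```python
import subprocess
import time
from typing import List


def docker_probe_argv(host: str, key_path: str, user: str = "dragon") -> List[str]:
    """Build an ssh command that runs `docker info` on the manager."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never prompt; fail if key auth is not ready
        "-o", "ConnectTimeout=10",  # assumed value
        "-i", key_path,
        f"{user}@{host}",
        "docker", "info",
    ]


def wait_for_docker(
    host: str,
    key_path: str,
    timeout: float = 600.0,   # assumed upper bound
    interval: float = 10.0,   # assumed pause between attempts
    run=subprocess.run,
    sleep=time.sleep,
    clock=time.monotonic,
) -> int:
    """Poll until `docker info` succeeds over SSH; return the number of attempts.

    Raises TimeoutError if the probe has not succeeded once the deadline passes.
    """
    argv = docker_probe_argv(host, key_path)
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            if run(argv, capture_output=True).returncode == 0:
                return attempts
        except OSError:
            pass  # ssh binary unavailable or not executable; treat as not ready
        if clock() >= deadline:
            raise TimeoutError(f"docker not ready on {host} after {attempts} attempts")
        sleep(interval)
```

Because a single command covers both conditions, a non-zero exit can mean either that SSH login failed or that the Docker daemon is not yet reachable for that user; the loop treats both as "keep waiting".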

AI-assisted: Codex/GPT
Signed-off-by: Roger Luethi <luethi@osism.tech>
@ideaship force-pushed the rl_fix_ssh_race branch from 6425ac4 to dea5107 May 7, 2026 04:58
@berendt merged commit 2d16f8d into main May 8, 2026
2 checks passed
@github-project-automation (bot) moved this from In review to Done in Human Board May 8, 2026
@berendt deleted the rl_fix_ssh_race branch May 8, 2026 06:25