deploy: replace fixed delays with event-driven waits for SSH auth and docker by ideaship · Pull Request #2887 · osism/testbed

ideaship · 2026-05-06T15:04:24Z

Bootstrap used two fixed 60-second delays as timing guards: one after detecting port 22 open, and one after the manager reboot. Fixed sleeps are neither reliable nor efficient — they can still be beaten by a slow VM and always add latency to a fast one.

This replaces both delays with active polls:

- SSH auth (wait for manager ssh auth): polls `ssh -o BatchMode=yes … true` with the Terraform key and image user until it succeeds. Also adds a retry to `ssh-keyscan` to handle early sshd startup. The race this addresses was observed twice in testing against an OpenStack deployment (once at the keyscan step, once at the first Ansible connection), confirming that the fixed delay was not sufficient protection.
- Docker (wait for docker after manager reboot): polls `docker info` over SSH as the dragon user after the manager reboot.
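The polling pattern behind both guards can be sketched as follows. This is a minimal illustration, not the code from this PR: the name `wait_until` and its `timeout`, `interval`, `sleep`, and `clock` parameters are hypothetical, and the probe is passed in as a callable so the loop can be exercised without a real host.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout: float,
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or the timeout expires.

    Returns True as soon as the probe succeeds, so a fast VM is not
    delayed, and False once the deadline has passed, so a slow VM
    still gets a bounded window.
    """
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Unlike a fixed sleep, the loop returns on the first successful probe, and the deadline keeps the total wait bounded.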

The second guard in particular may be incomplete — the original delay could have been covering more than docker readiness, but the old code gives no indication of what else it was waiting for. A probabilistic race cannot be proven correct by any finite number of successful runs, but these guards target concrete, observable conditions rather than guessing at timing. Local testing confirms both conditions resolve promptly with no failures observed.

Bootstrap previously used a fixed 60-second delay after detecting that port 22 was open on the manager. An open port only proved that sshd had started; a connection could still race with cloud-init writing the image user's authorized_keys. The next ssh-keyscan or Ansible connection could then fail transiently.

Retry ssh-keyscan so host key collection tolerates early sshd startup, and replace the fixed delay with an active public-key authentication probe that uses the same Terraform key and image user as the following manager playbook. This lets bootstrap continue as soon as authenticated SSH is actually ready while still giving cloud-init a bounded window to finish.

AI-assisted: Codex/GPT
Signed-off-by: Roger Luethi <luethi@osism.tech>
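The two steps described in this commit might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: the function names, the `ConnectTimeout=5` option, the retry counts, and the delays are all hypothetical, and the PR's exact ssh options are elided in the description. `BatchMode=yes` makes ssh fail instead of prompting, so the probe succeeds only when public-key authentication works.

```python
import subprocess
import time
from typing import List


def auth_probe_command(host: str, user: str, key_path: str) -> List[str]:
    """Build an ssh command that runs `true` on the remote host."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never prompt; fail if key auth fails
        "-o", "ConnectTimeout=5",   # assumed value, not from the PR
        "-i", key_path,
        f"{user}@{host}",
        "true",
    ]


def ssh_auth_ready(host, user, key_path, run=subprocess.run) -> bool:
    """Return True once an authenticated SSH session can run a command."""
    result = run(
        auth_probe_command(host, user, key_path),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def keyscan_with_retry(host, attempts=10, delay=5.0,
                       run=subprocess.run, sleep=time.sleep) -> str:
    """Collect host keys, retrying while sshd is still starting up."""
    for attempt in range(attempts):
        result = run(["ssh-keyscan", host],
                     capture_output=True, text=True, check=False)
        # Require actual key lines, not just a zero exit status.
        if result.returncode == 0 and result.stdout.strip():
            return result.stdout
        if attempt < attempts - 1:
            sleep(delay)
    raise RuntimeError(f"no host keys from {host} after {attempts} attempts")
```

The `run` and `sleep` parameters exist only so the logic can be tested without a network; in real use the defaults would apply.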

The deploy playbook still used a fixed 60-second delay after rebooting the manager. This wait has not been observed to fail in practice, but the first bootstrap wait already showed that relying on a fixed timeout is brittle when service startup timing varies.

Replace the delay with an active readiness check over the same SSH path used by the following deploy commands. The probe logs in as dragon with the Terraform key and runs docker info, so bootstrap continues only once SSH authentication works and Docker is available to the operator user.

AI-assisted: Codex/GPT
Signed-off-by: Roger Luethi <luethi@osism.tech>
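A readiness check of this kind could be sketched as below. The function names, retry count, delay, and the `ConnectTimeout=5` option are assumptions made for illustration; only the dragon user, the Terraform key, and the `docker info` command come from the commit message. Because ssh returns the remote command's exit status (or 255 on a connection or authentication error), a zero exit status implies both that the login worked and that `docker info` succeeded.

```python
import subprocess
import time
from typing import List


def docker_probe_command(host: str, key_path: str, user: str = "dragon") -> List[str]:
    """Build an ssh command that runs `docker info` as the operator user."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # fail instead of prompting
        "-o", "ConnectTimeout=5",   # assumed value, not from the PR
        "-i", key_path,
        f"{user}@{host}",
        "docker", "info",
    ]


def wait_for_docker(host, key_path, retries=60, delay=5.0,
                    run=subprocess.run, sleep=time.sleep) -> bool:
    """Poll until `docker info` succeeds over SSH or retries run out."""
    for attempt in range(retries):
        result = run(
            docker_probe_command(host, key_path),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        if result.returncode == 0:
            return True
        if attempt < retries - 1:
            sleep(delay)
    return False
```

Running the probe over the same SSH path as the later deploy commands means a passing check exercises the connection those commands will actually use.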

osism-agent added this to Human Board May 6, 2026

github-project-automation Bot moved this to Ready in Human Board May 6, 2026

ideaship requested a review from berendt May 6, 2026 15:09

ideaship moved this from Ready to In review in Human Board May 6, 2026

ideaship added 2 commits May 7, 2026 06:56

ideaship force-pushed the rl_fix_ssh_race branch from 6425ac4 to dea5107 Compare May 7, 2026 04:58

berendt merged commit 2d16f8d into main May 8, 2026
2 checks passed

github-project-automation Bot moved this from In review to Done in Human Board May 8, 2026

berendt deleted the rl_fix_ssh_race branch May 8, 2026 06:25

ideaship mentioned this pull request May 29, 2026

deploy: capture manager console log on ssh fail #2897

Open
