deploy: replace fixed delays with event-driven waits for SSH auth and docker #2887
Merged
Bootstrap previously used a fixed 60 second delay after detecting that port 22 was open on the manager. That only proved sshd had started and could still race with cloud-init writing the image user's authorized_keys. The next ssh-keyscan or Ansible connection could then fail transiently.

Retry ssh-keyscan so host key collection tolerates early sshd startup, and replace the fixed delay with an active public-key authentication probe that uses the same Terraform key and image user as the following manager playbook. This lets bootstrap continue as soon as authenticated SSH is actually ready while still giving cloud-init a bounded window to finish.

AI-assisted: Codex/GPT
Signed-off-by: Roger Luethi <luethi@osism.tech>
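For illustration only, the authentication probe described in this commit could be sketched in Python roughly as follows. This is not the code from the PR (which lives in the deploy playbooks); the function names, the `ConnectTimeout` value, and the attempt counts are assumptions, and the sketch assumes the host key has already been collected by the retried `ssh-keyscan` step.

```python
import subprocess
import time
from typing import Callable, List


def auth_probe_argv(host: str, user: str, key_path: str) -> List[str]:
    """Build an ssh command that succeeds only if public-key login works."""
    return [
        "ssh",
        "-i", key_path,
        "-o", "BatchMode=yes",     # fail instead of prompting for a password
        "-o", "ConnectTimeout=5",  # assumption: keep each attempt short
        f"{user}@{host}",
        "true",                    # no-op remote command
    ]


def retry(probe: Callable[[], bool], attempts: int, delay: float,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


def ssh_auth_ready(host: str, user: str, key_path: str) -> bool:
    """Run one probe; exit status 0 means authenticated SSH is usable."""
    result = subprocess.run(
        auth_probe_argv(host, user, key_path),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

A caller would wrap the probe in the bounded loop, for example `retry(lambda: ssh_auth_ready(host, user, key_path), attempts=30, delay=5)`, where all three arguments and both numbers are placeholders. The loop returns as soon as one login succeeds, which is the property the fixed delay lacked.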
The deploy playbook still used a fixed 60 second delay after rebooting the manager. This wait has not been observed to fail in practice, but the first bootstrap wait already showed that relying on a fixed timeout is brittle when service startup timing varies.

Replace the delay with an active readiness check over the same SSH path used by the following deploy commands. The probe logs in as dragon with the Terraform key and runs docker info, so bootstrap continues only once SSH authentication works and Docker is available to the operator user.

AI-assisted: Codex/GPT
Signed-off-by: Roger Luethi <luethi@osism.tech>
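As a rough sketch of the readiness check this commit describes, the following Python polls `docker info` over SSH against a deadline rather than a fixed sleep. It is an illustration under assumptions, not the playbook task from the PR: `docker_probe_argv`, `wait_until`, `docker_ready`, and the timeout and interval values are all hypothetical.

```python
import subprocess
import time
from typing import Callable, List


def docker_probe_argv(host: str, key_path: str, user: str = "dragon") -> List[str]:
    """Build an ssh command that runs `docker info` as the operator user."""
    return [
        "ssh",
        "-i", key_path,
        "-o", "BatchMode=yes",  # fail instead of prompting
        f"{user}@{host}",
        "docker", "info",
    ]


def wait_until(probe: Callable[[], bool], timeout: float, interval: float,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe until it succeeds or the next sleep would pass the deadline."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)


def docker_ready(host: str, key_path: str) -> bool:
    """One probe: exit status 0 means SSH login worked and Docker answered."""
    result = subprocess.run(
        docker_probe_argv(host, key_path),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

A single probe covers both conditions named in the commit message: a failed login and an unavailable Docker daemon both yield a non-zero exit status, so `wait_until(lambda: docker_ready(host, key_path), timeout=300, interval=5)` (placeholder values) proceeds only when both are satisfied.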
Bootstrap used two fixed 60-second delays as timing guards: one after detecting port 22 open, and one after the manager reboot. Fixed sleeps are neither reliable nor efficient — they can still be beaten by a slow VM and always add latency to a fast one.
This replaces both delays with active polls:
- `wait for manager ssh auth`: polls `ssh -o BatchMode=yes … true` with the Terraform key and image user until it succeeds. Also adds retries to `ssh-keyscan` to handle early sshd startup. The race this addresses was observed twice in testing against an OpenStack deployment (once at the keyscan step, once at the first Ansible connection), confirming the fixed delay was not sufficient protection.
- `wait for docker after manager reboot`: polls `docker info` over SSH as the `dragon` user after the manager reboot.

The second guard in particular may be incomplete: the original delay could have been covering more than docker readiness, but the old code gives no indication of what else it was waiting for. A fix for a probabilistic race cannot be proven correct by any finite number of successful runs, but these guards target concrete, observable conditions rather than guessing at timing. Local testing confirms both conditions resolve promptly with no failures observed.