This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Runner resilience part 1/N: startup resilience to server being down or going down #3087

Merged
merged 6 commits into from Mar 10, 2022

mitchellh
Contributor

This is the first body of work towards making runners more resilient to servers being down (and ultimately vice versa). This part focuses on the server going down or already being down during the runner startup process.

This PR makes the runner able to tolerate the following conditions:

  1. Server is down before the runner ever starts - the runner blocks and attempts to reconnect
  2. Server goes down while the runner is waiting on adoption - the runner reconnects and restarts the adoption (which is idempotent)
  3. Server goes down after adoption but before the initial config stream - the runner reconnects and re-establishes the initial config stream
  4. Server goes down after start (adoption and config established) - the runner reconnects to the config stream in the background
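
For readers who want a concrete picture of the "block and attempt to reconnect" behavior in the scenarios above, here is a minimal sketch in Go. It is not the code from this PR: `dialFunc`, `connectWithRetry`, and `simulateStartup` are hypothetical names invented for illustration, and the fixed backoff is an assumption rather than the runner's actual retry policy.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialFunc is a hypothetical stand-in for whatever establishes the
// runner's connection to the server. It is not Waypoint's actual API.
type dialFunc func(ctx context.Context) error

// connectWithRetry blocks until dial succeeds or ctx is done, waiting
// backoff between attempts. It returns the number of attempts made.
func connectWithRetry(ctx context.Context, dial dialFunc, backoff time.Duration) (int, error) {
	attempts := 0
	for {
		attempts++
		if err := dial(ctx); err == nil {
			return attempts, nil
		}

		// The server is unreachable: wait before retrying, but give up
		// promptly if the caller cancels or the deadline passes.
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-time.After(backoff):
		}
	}
}

// simulateStartup models a server that is down for the first `failures`
// dial attempts and reachable afterwards.
func simulateStartup(failures int) (int, error) {
	errDown := errors.New("server unavailable")
	calls := 0
	dial := func(ctx context.Context) error {
		calls++
		if calls <= failures {
			return errDown
		}
		return nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return connectWithRetry(ctx, dial, time.Millisecond)
}

func main() {
	attempts, err := simulateStartup(3)
	fmt.Println(attempts, err)
}
```

With a server that rejects the first three dials, the sketch connects on the fourth attempt instead of exiting. The same loop shape would apply to restarting an idempotent step such as adoption, since repeating it after a reconnect is safe.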

Prior to this PR, each of the four scenarios above would cause the runner to crash or exit.

All four scenarios are unit tested.

This PR does not improve runner resilience with regard to job execution. That is left for a future PR.

@mitchellh mitchellh requested review from evanphx and a team March 10, 2022 16:30
@github-actions github-actions bot added the core label Mar 10, 2022
Member

@briancain briancain left a comment

Love the tests! 👨🏻‍🍳 💋


3 participants