Allow bootstrap re-apply for Fedora CoreOS GCP #687

Merged
dghubble merged 1 commit into master from workaround-gcp-zincati-issue on Mar 29, 2020


dghubble (Member) commented Mar 29, 2020

  • Problem: Fedora CoreOS images are manually uploaded to GCP. When a cluster is created with a stale image, Zincati immediately checks for the latest stable image, fetches it, and reboots. In practice, this reboot can land in the middle of the initial cluster bootstrap phase.
  • Recommended: Upload the latest Fedora CoreOS image regularly
  • Mitigation: Allow a failed bootstrap.service run (which never creates the done file checked by ConditionPathExists) to be re-run by running terraform apply again. Add a known issue to CHANGES
  • Update docs to show the current Fedora CoreOS stable version to reduce the likelihood that users hit this issue
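
The mitigation relies on the done file being written only after a successful bootstrap, so a run interrupted by a reboot leaves nothing behind to block a retry. A minimal Python sketch of that gating logic follows. It is an illustration only, not the actual unit or Terraform code: `run_bootstrap`, `demo`, and the `bootstrap.done` file name are hypothetical.

```python
import tempfile
from pathlib import Path


def run_bootstrap(marker: Path, bootstrap) -> str:
    """Run `bootstrap` unless the marker file already exists.

    Models a unit guarded by a negated ConditionPathExists= check: the
    marker is written only after success, so a failed run can be retried.
    """
    if marker.exists():
        return "skipped"  # an earlier run completed; nothing to do
    try:
        bootstrap()
    except Exception:
        return "failed"  # marker left untouched, so a re-run is allowed
    marker.touch()  # record success only after bootstrap finished
    return "done"


def demo() -> list:
    """Simulate a reboot interrupting the first attempt, then two re-applies."""
    attempts = {"count": 0}

    def flaky_bootstrap():
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise RuntimeError("host rebooted mid-bootstrap")

    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "bootstrap.done"  # hypothetical marker name
        return [run_bootstrap(marker, flaky_bootstrap) for _ in range(3)]
```

In this sketch the first attempt fails without creating the marker, the second succeeds and creates it, and the third is skipped, which is the behavior a repeated terraform apply is meant to have.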

Longer-term ideas:

  • Ideal: Fedora CoreOS publishes a stable channel on GCP, so instances always boot with the latest image in the channel. The problem disappears because GCP then works the same way AWS does
  • Timer: Consider a timer-based approach that has Zincati delay any system reboot for the first ~30 min of a machine's life, possibly configured only on the controller node (coreos/zincati#251, which proposes activating Zincati via zincati.timer rather than by default)
  • External coordination: For Container Linux, locksmith filled a similar role and was disabled to allow CLUO to coordinate reboots. Because CLUO runs atop Kubernetes, a reboot could not occur before cluster bootstrap
  • Rely on coreos/zincati#115 (delay reboot if there are ongoing interactive sessions) to hold off the reboot, since bootstrap involves an SSH session
  • Use path-based activation of Zincati on controllers and create that path at the end of the bootstrap process
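
Two of the ideas above, the timer and path-based activation, come down to a simple predicate on when a reboot may proceed. The sketch below states both in Python purely to make the conditions concrete. It does not reflect how Zincati or systemd implement anything: the function names, the `GRACE_SECONDS` constant, and the marker path argument are all hypothetical.

```python
from pathlib import Path

# Grace period taken from the "~30 min" figure in the timer idea.
GRACE_SECONDS = 30 * 60


def reboot_allowed_by_timer(uptime_seconds: float,
                            grace: float = GRACE_SECONDS) -> bool:
    """Timer idea: hold reboots until the machine has been up for `grace` seconds."""
    return uptime_seconds >= grace


def reboot_allowed_by_path(marker: Path) -> bool:
    """Path idea: allow reboots only once bootstrap has created `marker`."""
    return marker.exists()
```

The timer variant is a heuristic, since a slow bootstrap could outlast the grace period. The path variant ties the decision directly to bootstrap completion, but it requires the bootstrap process to create the path as its final step.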

Related: coreos/fedora-coreos-tracker#239

dghubble force-pushed the workaround-gcp-zincati-issue branch from 5ab00de to 70bdc9e on March 29, 2020 01:13
dghubble merged commit 70bdc9e into master on Mar 29, 2020
dghubble deleted the workaround-gcp-zincati-issue branch on March 29, 2020 01:29