Allow bootstrap re-apply for Fedora CoreOS GCP #687

dghubble · 2020-03-29T01:08:09Z

* Problem: Fedora CoreOS images are manually uploaded to GCP. When a cluster is created with a stale image, Zincati immediately checks for the latest stable image, fetches it, and reboots. In practice, this reboot can land right in the middle of the initial cluster bootstrap.
* Recommended: Upload the latest Fedora CoreOS image regularly
* Mitigation: Allow a failed `bootstrap.service` run (which won't touch the done file checked by `ConditionPathExists`) to be re-run by running `terraform apply` again. Add a known issue to CHANGES
* Update docs to show the current Fedora CoreOS stable version to reduce the likelihood that users see this issue
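
The mitigation relies on the done marker being written only after a successful bootstrap, so a run interrupted by a reboot can simply be retried. A minimal Python sketch of that guard logic follows. It is a model of the idea, not the actual `bootstrap.service` unit; `run_bootstrap`, `demo`, and the `bootstrap.done` file name are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_bootstrap(done_marker: Path, bootstrap: Callable[[], None]) -> str:
    """Run bootstrap unless the done marker exists; mark done only on success."""
    if done_marker.exists():
        # Mirrors a negated path-exists condition: nothing to do once done.
        return "skipped"
    try:
        bootstrap()
    except Exception:
        # Failed run: the marker stays absent, so a later re-apply retries.
        return "failed"
    done_marker.touch()
    return "done"


def demo() -> list[str]:
    """Simulate an interrupted first attempt followed by two re-applies."""
    attempts = {"count": 0}

    def flaky_bootstrap() -> None:
        # Hypothetical failure: the node rebooted during the first attempt.
        attempts["count"] += 1
        if attempts["count"] == 1:
            raise RuntimeError("node rebooted mid-bootstrap")

    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "bootstrap.done"
        return [run_bootstrap(marker, flaky_bootstrap) for _ in range(3)]
```

In this model the first attempt fails without creating the marker, the second succeeds and creates it, and any further attempt is skipped, which is the behavior that makes a repeated `terraform apply` safe.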

Longer term ideas:

* Ideal: Fedora CoreOS publishes a stable channel. Instances will always boot with the latest image in a channel. The problem disappears since it works the same way AWS does
* Timer: Consider some timer-based approach to have Zincati delay any system reboots for the first ~30 min of a machine's life. Possibly just configured on the controller node. See coreos/zincati#251 ("systemd: Activate via zincati.timer, not by default")
* External coordination: For Container Linux, locksmith filled a similar role and was disabled to allow CLUO to coordinate reboots. Because CLUO runs atop Kubernetes, a reboot could not occur before cluster bootstrap
* Rely on coreos/zincati#115 ("agent: delay reboot if ongoing interactive sessions") to delay the reboot since bootstrap involves an SSH session
* Use path-based activation of Zincati on controllers and set that path at the end of the bootstrap process
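
The timer and path-based ideas both reduce to a gate on when a reboot may proceed. The sketch below models that decision in Python under stated assumptions: it does not reflect how Zincati actually behaves or is configured, `reboot_allowed` and `BOOT_GRACE` are hypothetical names, and the 30-minute value is taken from the "~30 min" estimate above.

```python
from datetime import timedelta

# Assumed grace period, taken from the "~30 min" estimate in the list above.
BOOT_GRACE = timedelta(minutes=30)


def reboot_allowed(uptime: timedelta, is_controller: bool, bootstrap_done: bool) -> bool:
    """Decide whether an update reboot may proceed on this machine."""
    # Timer idea: never reboot during the first minutes of a machine's life.
    if uptime < BOOT_GRACE:
        return False
    # Path-based idea: controllers wait until bootstrap has marked completion.
    if is_controller and not bootstrap_done:
        return False
    return True
```

For example, `reboot_allowed(timedelta(minutes=5), True, False)` is `False` because the machine is still inside the grace period, while a worker that has been up for an hour is allowed to reboot regardless of bootstrap state.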

Related: coreos/fedora-coreos-tracker#239

dghubble force-pushed the workaround-gcp-zincati-issue branch from 5ab00de to 70bdc9e on March 29, 2020 01:13

dghubble mentioned this pull request Mar 29, 2020

Uploading to cloud platforms: GCP coreos/fedora-coreos-tracker#147

Closed

dghubble merged commit 70bdc9e into master Mar 29, 2020

dghubble deleted the workaround-gcp-zincati-issue branch March 29, 2020 01:29
