Add a repo option to auto-prune other deployments (e.g. rollback) when starting upgrade #2670

Closed
cgwalters opened this issue Jul 8, 2022 · 23 comments · Fixed by #2847

@cgwalters
Member

In RHCOS we're running up against space constraints: https://bugzilla.redhat.com/show_bug.cgi?id=2104619

I think we should support something like:

[sysroot]
deployments-max=2

This would tell ostree to auto-prune the rollback deployment (and others) when starting an upgrade.

@travier
Member

travier commented Jul 8, 2022

This would also help with an often-requested feature: some folks want to keep more than 2 deployments by default.

@jmarrero jmarrero self-assigned this Jul 8, 2022
@dbnicholson
Member

Isn't that already the logic for sysroot cleanup, or are you basically describing #2510? Or do you mean you don't want the temporary ballooning to 3 deployments prior to rebooting into the new deployment? It would be nice to have a config option for the number of deployments to keep, though.

@jmarrero
Member

jmarrero commented Jul 8, 2022

I think the intent is to allow the user to set the number of deployments to keep. So if it's set to 1, you would still temporarily balloon to 2 prior to rebooting.

@cgwalters
Member Author

Or do you mean you don't want the temporary ballooning to 3 deployments prior to rebooting into the new deployment?

Yep, this.

@travier
Member

travier commented Jul 8, 2022

If I read this correctly, the number could never be less than 2.

@jmarrero
Member

jmarrero commented Jul 8, 2022

Ahh, I guess that makes sense. But that means the minimum number is 2, right?

@dbnicholson
Member

Bikeshedding, but I think I'd expect the maximum deployments to be the non-ballooned value. The ballooned number of deployments is a temporary implementation detail. As a user it would bug me that I said deployments-max=2 but there was only 1 deployment under normal circumstances.

Also, when you're in the ballooned situation, one of the deployments is shown as staged. I think it would be reasonable to interpret deployments-max as the maximum number of non-staged deployments on disk.

In other words, I think the deployments-max minimum can be 1 and that means you'll normally have 1 deployment on disk and temporarily 2 during an upgrade.

@travier
Member

travier commented Jul 8, 2022

I think the idea is that 2 means we keep 2 deployments on disk, and when performing an upgrade, just before deploying the new one, we remove the old one so that disk usage never exceeds 2 deployments' worth.

@dbnicholson
Member

Right, I thought of that after I walked away.

They're actually 2 orthogonal concepts to me. Specifying the maximum number of deployments allows you to say you want no rollback deployment or more than 1 rollback deployment. Saying that you want to delete a rollback deployment before upgrading so that the number of deployments is strictly capped is slightly different.

For example, does deployments-max=2 mean that you should pre-delete the rollback deployment, or that you want to allow ballooning to 2 deployments and only have 1 under normal circumstances?

So, I really think there are 2 knobs you want (sketched below):

  • deployments-max - The number of active + rollback deployments to keep under normal circumstances.
  • pre-delete-rollback (can't think of a better name at the moment) - Whether to delete a rollback deployment prior to pulling a new deployment.
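
For illustration, here's a hedged sketch of how those two knobs might look in the sysroot config, using the names proposed above (neither key exists in ostree today, and pre-delete-rollback is just a placeholder name):

[sysroot]
# Number of active + rollback deployments to keep under normal circumstances.
deployments-max=2
# Delete the rollback before pulling/deploying an update, capping peak disk usage.
pre-delete-rollback=true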

I'd say for the RHCOS bug you want the second knob. I.e., don't worry about changing the number of deployments right now, but allow systems to opt in to aggressively pruning the rollback deployment before upgrading to keep disk space constrained.

@cgwalters
Member Author

Yeah fair. I think they're strongly related, but yes, viewing them orthogonally makes sense too.

Perhaps we start with just prune-rollback-on-upgrade=true.

But... there are corner cases here. Specifically: what happens if there are more than 2 deployments (do we only remove 1?)? This is really the grey area between deployments-max and prune-rollback-on-upgrade=true.

(Also, another corner case is "what happens if the rollback is pinned?", but I think we should probably silently have the pin win.)

@dbnicholson
Member

Good points. I think if there are pinned deployments, they should just be ignored for the purposes of pruning. If there are more than 2 non-pinned deployments, I think they should all be removed. Basically, the same thing ostree_sysroot_cleanup would do after finalizing a new deployment. But I don't have all the logic in front of me, so consider that pretty handwavey.
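
As a rough illustration of that policy, a minimal C sketch over a simplified deployment list (the types and function name are hypothetical, not the actual libostree API):

#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for an entry in the deployment list. */
typedef struct {
  bool booted;  /* the deployment we are currently running from */
  bool pinned;  /* pinned deployments are never pruned */
} deployment_t;

/* Mark every deployment for pruning except the booted one and any pinned
 * ones (the new deployment has not been written yet at this point).
 * Returns the number of deployments marked. */
static size_t
select_deployments_to_prune (const deployment_t *deployments, size_t n,
                             bool *prune_out /* length n */)
{
  size_t pruned = 0;
  for (size_t i = 0; i < n; i++)
    {
      prune_out[i] = !deployments[i].booted && !deployments[i].pinned;
      if (prune_out[i])
        pruned++;
    }
  return pruned;
}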

@jlebon
Member

jlebon commented Jul 8, 2022

One risk with this approach worth highlighting is that a regression in the upgrade path where something fails after the rollback cleanup could leave the host with no means of going back to a deployment with working upgrade code. (E.g. failing to merge /etc, or copy binaries into /boot, or update the bootloader.)

@jmarrero
Member

jmarrero commented Jul 8, 2022

How so? If the current deployment works, the one you are executing the upgrade from is kept "safe" from deletion.

@travier
Member

travier commented Jul 11, 2022

If you update from A to a broken code base B that is not capable of completing updates past the cleanup stage, then once B removes A to make room for C and the update fails, you are stuck on B with no rollback option.

@jlebon
Member

jlebon commented Jul 11, 2022

Right. This is why we have upgrade tests. Container Linux was especially susceptible to this with its A/B partition update scheme. If an update bug happened before the secondary partition was nuked, you could just roll back. But if it happened after, you'd be stuck (see e.g. coreos/bugs#2457 (comment)).

libostree is better in this regard because it only cleans up the rollback after most fallible operations are done. This change would re-introduce some of that fallibility. The tradeoff might be worth it, though, in scenarios where reprovisioning is easier, such as clusters.

@cgwalters
Member Author

OK, I'd like to propose that this option be:

[sysroot]
auto-cleanup=space-limited

Basically we only do the cleanup if doing so would allow us to install a kernel/initramfs when we otherwise couldn't.

@dustymabe
Contributor

Basically we only do the cleanup if doing so would allow us to install a kernel/initramfs when we otherwise couldn't.

This sounds ideal... Almost like it should be the default, though? We only take this (slightly more risky) code path if we couldn't succeed otherwise (not enough space).

@dustymabe
Contributor

Another thing we could do to mitigate risk is to move the old kernel/initrd to a different filesystem (tmpfs or any kind of tmp) rather than deleting it. Upon failure we could attempt to restore the original files.

I think it would be nice to make progress on this sooner rather than later, as it appears the compression mitigation might not be enough for ppc64le in FCOS.

@cgwalters cgwalters self-assigned this Oct 3, 2022
@cgwalters
Member Author

I briefly looked at this; it's quite messy due to the internal design of trying to do a "transactional swap" of the deployments - we end up needing something like a "pre-pass". Or maybe the higher-level code can pass down a separate "list of deployments to keep if you can". Needs a bit of thought/design.

@jmarrero jmarrero removed their assignment Oct 5, 2022
@ericcurtin
Collaborator

Just curious, @cgwalters @jmarrero: what are the options today for the use case where you just want to store 2 or 3 rollbacks, etc.?

@jlebon
Member

jlebon commented Feb 15, 2023

Is the complexity in handling this arising from trying to handle ENOSPC at the last minute and unwinding? Another approach that should be more tractable is doing space calculations up front and engaging the auto-prune behaviour if (new kernel + new initrd) - (old kernel + old initrd) > space remaining. That's along the lines of a "pre-pass" as suggested by @cgwalters above.
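
A minimal C sketch of that up-front check, mirroring the rough heuristic stated above (the sizes would come from the kernel/initrd of the old and new deployments and the free space on the bootfs; the function name is hypothetical, not a libostree API):

#include <stdbool.h>
#include <stdint.h>

/* True if the growth in boot artifacts exceeds the space remaining on the
 * bootfs, i.e. the upgrade can't proceed without pruning first. */
static bool
should_engage_auto_prune (uint64_t new_kernel, uint64_t new_initrd,
                          uint64_t old_kernel, uint64_t old_initrd,
                          uint64_t space_remaining)
{
  uint64_t new_total = new_kernel + new_initrd;
  uint64_t old_total = old_kernel + old_initrd;

  /* No growth: the update fits in the space the old artifacts already use. */
  if (new_total <= old_total)
    return false;

  return (new_total - old_total) > space_remaining;
}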

@cgwalters
Member Author

Yeah, agree that's easier.

@dustymabe
Contributor

dustymabe commented Feb 23, 2023

I just hit a similar problem here in rawhide on an aarch64 machine. I was running sudo rpm-ostree override replace --reboot https://bodhi.fedoraproject.org/updates/FEDORA-2023-22011eaa7c to test a kernel (this one happened to be a debug kernel, so it's larger than usual).

After reboot I see the update didn't apply and:

[core@cosa-devsh ~]$ systemctl --failed
  UNIT                         LOAD   ACTIVE SUB    DESCRIPTION         
● ostree-boot-complete.service loaded failed failed OSTree Complete Boot

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
[core@cosa-devsh ~]$ 
[core@cosa-devsh ~]$ journalctl -b0 -u ostree-boot-complete.service 
Feb 23 19:32:31 localhost systemd[1]: Starting ostree-boot-complete.service - OSTree Complete Boot...
Feb 23 19:32:31 localhost ostree[838]: error: ostree-finalize-staged.service failed on previous boot: Installing kernel: Copying sun50i-a64-amarula-relic.dtb: regfile copy: No space left on device
Feb 23 19:32:31 localhost systemd[1]: ostree-boot-complete.service: Main process exited, code=exited, status=1/FAILURE
Feb 23 19:32:31 localhost systemd[1]: ostree-boot-complete.service: Failed with result 'exit-code'.
Feb 23 19:32:31 localhost systemd[1]: Failed to start ostree-boot-complete.service - OSTree Complete Boot.
[core@cosa-devsh ~]$ df -kh /boot/
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda3       350M  346M     0 100% /boot


@jlebon jlebon assigned jlebon and unassigned cgwalters Mar 27, 2023
@jlebon jlebon added the jira label Mar 27, 2023
jlebon added a commit to jlebon/ostree that referenced this issue May 1, 2023
During the early design of FCOS and RHCOS, we chose a value of 384M
for the boot partition. This turned out to be too small: some arches
other than x86_64 have larger initrds, kernel binaries, or additional
artifacts (like device tree blobs). We'll likely bump the boot partition
size in the future, but we don't want to abandon all the nodes deployed
with the current size.[[1]]

Because stale entries in `/boot` are cleaned up after new entries are
written, there is a window in the update process during which the bootfs
temporarily must host all the `(kernel, initrd)` pairs for the union of
current and new deployments.

This patch determines if the bootfs is capable of holding all the
pairs. If it can't but it could hold all the pairs from just the new
deployments, the outgoing deployments (e.g. rollbacks) are deleted
*before* new deployments are written. This is done by updating the
bootloader in two steps to maintain atomicity.

Since this is a lot of new logic in an important section of the
code, this feature is gated for now behind an environment variable
(`OSTREE_ENABLE_AUTO_EARLY_PRUNE`). Once we gain more experience with
it, we can consider turning it on by default.

This strategy increases the fallibility of the update system since one
would no longer be able to rollback to the previous deployment if a bug
is present in the bootloader update logic after auto-pruning (see [[2]]
and following). This is however mitigated by the fact that the heuristic
is opportunistic: the rollback is pruned *only if* it's the only way for
the system to update.

[1]: coreos/fedora-coreos-tracker#1247
[2]: ostreedev#2670 (comment)

Closes: ostreedev#2670
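
To summarize the gating logic described in the commit message, a hedged C sketch of the three-way decision (the names and the exact space accounting are simplified and illustrative; the real logic lives in libostree's deploy code and measures the actual (kernel, initrd) pairs):

#include <stdbool.h>
#include <stdint.h>

typedef enum
{
  BOOTFS_PLAN_NORMAL,      /* everything fits; keep the rollback until after finalization */
  BOOTFS_PLAN_EARLY_PRUNE, /* prune outgoing deployments before writing new boot entries */
  BOOTFS_PLAN_ENOSPC       /* not enough space even after pruning; fail the update */
} bootfs_plan_t;

static bootfs_plan_t
plan_bootfs_update (uint64_t union_pairs_bytes,    /* (kernel, initrd) pairs of current + new deployments */
                    uint64_t new_only_pairs_bytes, /* (kernel, initrd) pairs of the new deployments only */
                    uint64_t bootfs_budget_bytes,  /* space on the bootfs usable for boot artifacts */
                    bool auto_early_prune_enabled) /* e.g. OSTREE_ENABLE_AUTO_EARLY_PRUNE is set */
{
  if (union_pairs_bytes <= bootfs_budget_bytes)
    return BOOTFS_PLAN_NORMAL;
  if (auto_early_prune_enabled && new_only_pairs_bytes <= bootfs_budget_bytes)
    return BOOTFS_PLAN_EARLY_PRUNE;
  return BOOTFS_PLAN_ENOSPC;
}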