Add the concept of "checkpoint" commits #1228

Open
cosimoc opened this issue Sep 29, 2017 · 14 comments

Comments

@cosimoc
Contributor

cosimoc commented Sep 29, 2017

Sometimes a new OS version contains an incompatible change that requires a new version of the system updater in order to be completed.

One case we found here at Endless is when we move some files out of the OSTree (e.g. we start using Flatpak to distribute an app that was previously part of the core OS): we have a new version of the updater that knows how to perform the migration, but users coming from previous versions won't have the migration code available and would be left in a new tree without the app at all.

The implementation idea I have in mind for this is to add the concept of "mandatory checkpoint" commits in OSTree: when going from commit X to commit Y, if a commit M between them is marked as mandatory, OSTree/eos-updater would make sure to first deploy M before going to Y. In our case, M would contain the new version of the updater, capable of downloading new flatpaks. A nice byproduct of this approach is that we can also easily use M as a landmark to make deltas to/from.
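A minimal sketch of the "deploy M before Y" rule, assuming the updater can see the ordered commit history of a ref and a per-commit checkpoint flag. The `Commit` type and `deployment_plan` function are hypothetical names for illustration, not part of libostree or eos-updater.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    checksum: str
    is_checkpoint: bool = False  # hypothetical "mandatory checkpoint" marker


def deployment_plan(history, current, target):
    """Return the checksums to deploy, in order, to get from current to target.

    history is ordered oldest to newest. Every checkpoint strictly between
    current and target is deployed before the target itself.
    """
    checksums = [c.checksum for c in history]
    start = checksums.index(current)
    end = checksums.index(target)
    if end <= start:
        return []  # already at or past the target; nothing to deploy
    plan = [c.checksum for c in history[start + 1:end] if c.is_checkpoint]
    plan.append(target)
    return plan


history = [
    Commit("X"),
    Commit("a"),
    Commit("M", is_checkpoint=True),
    Commit("b"),
    Commit("Y"),
]
print(deployment_plan(history, "X", "Y"))
```

In practice each entry in the plan would probably need its own reboot before the next one is pulled, which this sketch does not model.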

Another interesting question is how to bootstrap this process, since a checkpoint-aware updater must exist to perform the first migration. Right now we are planning to solve that by adding a new ref (e.g. if stable was the current OS ref, this would be stable-v2) and making the last commit of stable the one that introduces the checkpoint-aware updater. A different idea would be to have the server return a different commit checksum for the same ref depending on client capabilities/version, but I know that at least in the past there was a strong desire to keep the HTTP server "dumb".

CC @ramcq @pwithnall

@ramcq
Contributor

ramcq commented Sep 29, 2017

> A different idea would be to have the server return a different commit checksum for the same ref depending on client capabilities/version, but I know that at least in the past there was a strong desire to keep the HTTP server "dumb".

In any case we don't use ostree's HTTP server on our infrastructure, so any kind of ref swizzling would require co-ordination with some other HTTP daemon, any CDNs/proxies/etc. I can't see that being very popular. :)

@pwithnall
Member

@cgwalters, I guess this is something you have already thought about for Project Atomic?


One key question is how to arrange the metadata for marking commits as checkpoints so that clients can efficiently find the sequence of checkpoints they need to download to get to the latest commit. How about an ostree.last-checkpoint key stored in the metadata for each commit, which contains the checksum of the most recent checkpoint commit? For each checkpoint commit itself, it would contain the checksum of the previous checkpoint. The ref in refs/heads/blah and the summary file would continue to point to the most recent commit* for that ref, as normal.

So to update, the client would pull the .commit file for the checksum pointed to by the ref in the summary file, and check its ostree.last-checkpoint value against a locally-stored last-checkpoint value specific to this remote/ref combination. If they’re the same, the client pulls the latest commit as normal. If they differ, the client pulls the last checkpoint .commit file, and looks at its ostree.last-checkpoint value. Repeat until they match the locally stored value. Once pulling is complete, update the locally stored value.

This would be complicated slightly by the (probable) need to reboot for each checkpoint we pull, but the concept should still work.

*Assuming we’re past the bootstrapping process which Cosimo mentioned.
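The pull procedure described above can be sketched as a walk back along the checkpoint chain. This is only an illustration: `ostree.last-checkpoint` is the key proposed in this comment (it does not exist in libostree), and the `commits` dictionary stands in for fetching `.commit` files from the remote.

```python
def checkpoints_to_pull(commits, head, local_last_checkpoint):
    """Return the commits to pull and deploy, oldest first, ending with head.

    commits maps checksum -> commit metadata dict. Each dict may carry the
    proposed 'ostree.last-checkpoint' key (None or absent if there is no
    earlier checkpoint). local_last_checkpoint is the value the client
    stored for this remote/ref after its previous update.
    """
    chain = []
    seen = set()
    cursor = commits[head].get("ostree.last-checkpoint")
    # Follow the chain until it meets the checkpoint the client already has.
    while cursor is not None and cursor != local_last_checkpoint:
        if cursor in seen:
            raise ValueError("cycle in checkpoint metadata")
        seen.add(cursor)
        chain.append(cursor)
        cursor = commits[cursor].get("ostree.last-checkpoint")
    chain.reverse()  # deploy the oldest missing checkpoint first
    chain.append(head)
    return chain


commits = {
    "c1": {"ostree.last-checkpoint": None},   # first checkpoint
    "c2": {"ostree.last-checkpoint": "c1"},   # second checkpoint
    "head": {"ostree.last-checkpoint": "c2"},
}
print(checkpoints_to_pull(commits, "head", "c1"))
```

After the last entry has been deployed, the client would store the head's `ostree.last-checkpoint` value as its new local value.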

@pwithnall
Member

Another approach would be to use something like the ostree.endoflife-rebase functionality to rewrite the ref for each checkpoint. I think we’d still need to modify that a bit to ensure that each endoflife-rebase commit gets deployed before the next one is pulled and deployed. The current code path for endoflife-rebase will eagerly update the ref and pull the new commit and deploy that, skipping over the checkpoint.

(Although I’ve just noticed: the code in ostree-sysroot-upgrader.c only checks for one endoflife-rebase commit, and doesn’t recursively follow them, so will break if a client needs to follow through several rebases to get to the latest version of an OS. I’ve filed issue #1229.)
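As a rough sketch of what following several rebases could look like, the function below resolves a chain of `ostree.endoflife-rebase` values and returns every ref on the way, so a checkpoint-aware updater could deploy each one in order instead of jumping to the end. The `ref_heads` mapping and the `max_hops` limit are assumptions for illustration; this is not the code in ostree-sysroot-upgrader.c.

```python
def resolve_rebase_chain(ref_heads, start_ref, max_hops=16):
    """Return the refs visited from start_ref to the final ref, in order.

    ref_heads maps ref name -> metadata dict of that ref's head commit.
    A head commit that carries 'ostree.endoflife-rebase' names the ref
    that clients should move to next.
    """
    visited = [start_ref]
    ref = start_ref
    while True:
        target = ref_heads[ref].get("ostree.endoflife-rebase")
        if target is None:
            return visited  # this ref is still live
        if target in visited:
            raise ValueError(f"rebase loop detected at {target!r}")
        if len(visited) > max_hops:
            raise ValueError("too many rebases to follow")
        visited.append(target)
        ref = target


heads = {
    "os/1": {"ostree.endoflife-rebase": "os/2"},
    "os/2": {"ostree.endoflife-rebase": "os/3"},
    "os/3": {},
}
print(resolve_rebase_chain(heads, "os/1"))
```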

@jlebon
Member

jlebon commented Sep 29, 2017

The idea of checkpoints is an interesting one, and I like that it provides natural static-delta targets. That said, specific to this use case, couldn't the updater just pull the latest commit as usual and perform the migration at boot time?

@pwithnall
Member

> That said, specific to this use case, couldn't the updater just pull the latest commit as usual and perform the migration at boot time?

We want to download the new flatpaks as part of the same pull process as for the OS, and deploy them at the same time as the new OS OSTree. i.e. The whole thing needs to be atomic. We do not want any period of time where the user has the new OS (without the bundled, say, LibreOffice) but doesn’t have the new flatpaks (which would make for a very angry LibreOffice user).

@dustymabe
Contributor

> We want to download the new flatpaks as part of the same pull process as for the OS, and deploy them at the same time as the new OS OSTree. i.e. The whole thing needs to be atomic.

/me curious - how are you achieving the 'bundling' ?

@cgwalters
Member

> ostree's HTTP server

?? I dearly hope no one is actually using the trivial-httpd for any kind of production use!

There's basically two parts to this - the mechanics of booting into a checkpoint, and actually implementing the transition.

For the first part, I assume what you were thinking here is having the HTTP server do dynamic dispatch based on e.g. the User-Agent or a custom injected header that has the source version. That seems like it'd work.

OTOH, the way I'd have considered approaching this would be using distinct refs - the client system updates to exampleos/42/x86_64/os, then a need for an epoch appears, introduce exampleos/v1/47/x86_64/os (where v1 signifies a major distinction) and land code in the last commit to exampleos/42/x86_64/os to perform the migration (including changing the base ref).

So rather than having the libostree pull code follow the endoflife-rebase key, it's implemented at boot time.
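A minimal sketch of the boot-time variant, assuming the last commit on the old ref ships a small migration table. The ref names are the examples from this comment; the `MIGRATIONS` table, the `state` dictionary and both functions are hypothetical and only illustrate the ordering (migrate first, then change the tracked ref).

```python
def migrate_apps_to_flatpak(state):
    """Placeholder for the real migration work, e.g. installing flatpaks."""
    state["migrations_done"].append("apps-to-flatpak")


# Hypothetical table: old ref -> (new ref, migration to run before rebasing).
MIGRATIONS = {
    "exampleos/42/x86_64/os": ("exampleos/v1/47/x86_64/os", migrate_apps_to_flatpak),
}


def run_boot_migration(state, migrations=MIGRATIONS):
    """Run the migration for the tracked ref, if any, then switch refs."""
    entry = migrations.get(state["ref"])
    if entry is None:
        return state  # nothing to migrate for this ref
    new_ref, migrate = entry
    migrate(state)          # raises on failure, leaving the old ref in place
    state["ref"] = new_ref  # rebase only after the migration succeeded
    return state


state = {"ref": "exampleos/42/x86_64/os", "migrations_done": []}
print(run_boot_migration(state))
```

Running it a second time is a no-op, because the new ref has no entry in the table.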

Making it truly atomic gets a little tricky, as you'd likely need to e.g. make some changes in /etc after it's written but before writing the bootloader entry. Supporting that would need some extensions to libostree. It'd get even more interesting with flatpak apps in /var/lib/flatpak, but you don't have that case today.

@cgwalters
Member

(I'm elaborating on/agreeing with what @jlebon said here - doing this at boot time seems to me to be the most powerful/flexible way to do it)

@ramcq
Contributor

ramcq commented Sep 29, 2017

Our implementation is not truly atomic because we rely on a boot task to go from a deployed flatpak to an exported one (i.e. one the user can see). So we're closing the majority of the failure cases but not eliminating them, because we'd prefer not to have the app appear twice under the old OS version if there is a failure in the OS update.

However that's by-the-by because what we need is the updater (which does the downloading and deploying of the OS, and the flatpaks) to ensure that both downloads and deploys are successfully completed before the active OS is switched over, hence the checkpoint requirement.
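The requirement above amounts to "stage everything, then switch". A rough sketch, assuming each download-and-deploy step is a callable that raises on failure; `apply_update` and `switch_active` are hypothetical names, not eos-updater API.

```python
def apply_update(steps, switch_active):
    """Run every staging step; switch the active OS only if all succeed.

    steps is a list of (name, callable) pairs. Each callable downloads and
    deploys one piece (the OS tree, the flatpaks) without touching the
    active deployment, and raises an exception on failure.
    Returns (switched, names_of_completed_steps).
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception:
            # Leave the active OS untouched; a later run can retry.
            return False, completed
        completed.append(name)
    switch_active()
    return True, completed


def stage_os():
    pass  # stands in for pulling and deploying the new OSTree commit


def stage_flatpaks():
    raise RuntimeError("flatpak download failed")  # simulated failure


switched = []
print(apply_update([("os", stage_os), ("flatpaks", stage_flatpaks)],
                   lambda: switched.append(True)))
```

With the simulated failure the switch callback is never invoked, so the user keeps booting the old OS with the app still bundled.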

@pwithnall
Member

pwithnall commented Oct 10, 2017

@cgwalters: OOI, has Project Atomic hit this problem (needing some kind of checkpoint in an update stream) before? I’d be surprised if not — if you can essentially support upgrading directly from version 0 of Project Atomic through to now.

I guess the main difference I can think of between the two potential approaches here (tagging checkpoint commits using metadata; or tagging them as the last commit on a ref before changing the ref name) is how they interact with LAN sharing of updates, and caching of checkpoints to serve to other peers on the LAN.

If checkpoints are tagged in the commit metadata, libostree needs to be aware of the caching policy so that it doesn’t prune the old checkpoint commits, since they might be needed to serve to peers at some point in the future.

If checkpoints are tagged by renaming the ref, those checkpoint commits are only pruned if the old refs are deleted, which means that the caching policy moves up to a higher layer — it’s now controlled by the system updater (eos-updater in our case), which is in charge of renaming the refs when a checkpoint is reached, and deleting the old checkpoint refs when it thinks no more LAN peers are going to want them.

That leads me to lean towards implementing checkpoints by renaming refs, which needs less support from libostree, and mostly just needs to be implemented in our updater. However, I would still be very interested in knowing what Project Atomic does (or might eventually do), so that we don’t end up diverging unnecessarily.

The downside of implementing checkpoints by renaming refs, though, is that going back in time (pulling and deploying an old version of the OS) is a bit harder, since you have to undo the ref rename. This could probably be mitigated by putting some metadata in the first commit of the new ref which points back to the old ref; just like the last commit of the old ref will point to the new ref.

@cgwalters
Member

cgwalters commented Oct 11, 2017

Not really no (well at least if we're talking about the CentOS/RHEL Atomic Host). For fun I tried booting CentOS-Atomic-Host-7.1.2-GenericCloud.qcow2 with:

# rpm-ostree status
  TIMESTAMP (UTC)         VERSION   ID             OSNAME                 REFSPEC
* 2015-06-17 21:14:52     7.1.2     5aad058fd2     centos-atomic-host     centos-atomic-host:centos-atomic-host/7/x86_64/standard

Then started a basic nginx container, which stays running while I rpm-ostree upgrade (and man, this predates rpm-ostree being a daemon...and woah is the status display ugly). Almost every package changes, lots of archive fetches (no static deltas), but it works. Reboot to:

# rpm-ostree status
State: idle
Deployments:
● centos-atomic-host:centos-atomic-host/7/x86_64/standard
                Version: 7.1708 (2017-09-15 15:32:30)
                 Commit: 33b4f0442242a06096ffeffadcd9655905a41fbd11f36cd6f33ee0d974fdb2a8
           GPGSignature: 1 signature
                         Signature made Fri 15 Sep 2017 05:17:39 PM UTC using RSA key ID F17E745691BA8335
                         Good signature from "CentOS Atomic SIG <security@centos.org>"

  centos-atomic-host:centos-atomic-host/7/x86_64/standard
                Version: 7.1.2 (2015-06-17 21:14:52)
                 Commit: 5aad058fd206c624abf1c531997ae40126f03226f1da9a59c23a67b1cfd5140c
           GPGSignature: 1 signature
                         Signature made Wed 17 Jun 2015 09:18:29 PM UTC using RSA key ID F17E745691BA8335
                         Good signature from "CentOS Atomic SIG <security@centos.org>"

Container is still there and still works.

Remember our audience is servers with active sysadmins, so there are potential transitions (like the one to overlayfs), but they know how to handle these types of things.

@cgwalters
Member

> That leads me to lean towards implementing checkpoints by renaming refs, which needs less support from libostree, and mostly just needs to be implemented in our updater.

I think this is the approach to take. Having it be in the commit stream makes it feel a lot more "invisible/magical".

@cgwalters
Member

> However, I would still be very interested in knowing what Project Atomic does (or might eventually do), so that we don’t end up diverging unnecessarily.

So for Red Hat Enterprise Linux Atomic Host (and similarly CentOS AH) we simply haven't done a transition of this form. However, for Fedora, we do require admins to explicitly rebase - this matches the general current Fedora model.

However, we are talking about more of a "single stream" experience there; see this issue for example.

@jlebon
Member

jlebon commented Apr 28, 2021

Just an update on this for completeness: for Fedora CoreOS and RHCOS, we implemented this using metadata describing a graph of permissible upgrade paths fed to higher-level software which in turn drives rpm-ostree. Relevant projects: https://github.com/coreos/fedora-coreos-cincinnati/, https://github.com/openshift/cincinnati/, https://github.com/coreos/zincati/, https://github.com/openshift/machine-config-operator. This does lead to interesting problems though.
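To illustrate the general idea of an upgrade graph (not the actual Cincinnati protocol or Zincati's selection logic), here is a small sketch that finds a shortest permissible path with a breadth-first search. The version strings and the `edges` mapping are made up for the example.

```python
from collections import deque


def upgrade_path(edges, current, target):
    """Return a shortest list of versions from current to target, or None.

    edges maps a version to the versions it is allowed to upgrade to
    directly. A missing edge means the hop is not permitted, which is how
    a mandatory intermediate release (a checkpoint) can be expressed.
    """
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no permissible route


# 30.1 cannot jump straight to 31.x; it has to pass through 30.2 first.
edges = {
    "30.1": ["30.2"],
    "30.2": ["31.1", "31.2"],
    "31.1": ["31.2"],
    "31.2": [],
}
print(upgrade_path(edges, "30.1", "31.2"))
```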
