Add ostree=aboot for signed Android Boot Images #2844

Closed
ericcurtin wants to merge 1 commit from the boot_latest_symlink branch


ericcurtin
Collaborator

@ericcurtin ericcurtin commented Apr 5, 2023

Some kernel images are delivered as a single signed blob containing the
kernel, cmdline, initramfs and dtb. The cmdline is only known once this
blob has been added to the commit server side, which creates a recursion
issue. To avoid this, when the ostree=aboot karg is set, do the BLS
parsing in the initramfs instead, so we can take advantage of existing
BLS logic.
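
As a rough illustration of the trigger described above, the initramfs side has to spot ostree=aboot on the kernel command line before it decides to parse BLS itself. The sketch below is not the code from this PR: cmdline_has_karg is a hypothetical helper name, and it assumes the command line has already been read into a string (for example from /proc/cmdline).

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of ostree): return true if the
 * whitespace-separated kernel command line contains exactly the
 * argument given in karg, e.g. "ostree=aboot". */
static bool
cmdline_has_karg (const char *cmdline, const char *karg)
{
  size_t karg_len = strlen (karg);
  const char *p = cmdline;

  if (karg_len == 0)
    return false;

  while (*p)
    {
      /* Skip separators between arguments. */
      while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;

      /* Length of the current argument (0 at end of string). */
      size_t len = strcspn (p, " \t\n");
      if (len == karg_len && strncmp (p, karg, len) == 0)
        return true;

      p += len;
    }

  return false;
}
```

Matching whole arguments rather than using strstr avoids treating something like ostree=abootx as a match.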

@openshift-ci

openshift-ci bot commented Apr 5, 2023

Hi @ericcurtin. Thanks for your PR.

I'm waiting for a ostreedev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ericcurtin
Collaborator Author

Related to conversations in #2753 ... Boots fine

[root@localhost ~]# cat /proc/cmdline
root=LABEL=root ro loglevel=4 efi=runtime libahci.ignore_sss=1 console=ttyAMA0 rw
[root@localhost ~]# rpm-ostree status
State: idle
Deployments:
● auto-sig:cs9/aarch64/abootqemu-minimal
                  Version: 9 (2023-04-05T15:27:58Z)
                   Commit: 210167d098a681908be52d0ea79b87c6b96570584c7c7ee643bf722c2160d653

Although I want to do some more testing with some greenboot changes.

@ericcurtin ericcurtin force-pushed the boot_latest_symlink branch 2 times, most recently from bdc955c to eea13f2 Compare April 5, 2023 19:24
Member

@cgwalters cgwalters left a comment

Thanks for starting this!

Review comments (all outdated and resolved):
src/boot/ostree-remount.service
src/boot/ostree-prepare-root.service
src/switchroot/ostree-mount-util.h (5 comments)
@ericcurtin ericcurtin force-pushed the boot_latest_symlink branch 3 times, most recently from d1b620e to 1d845c7 Compare April 6, 2023 11:11
@ericcurtin
Collaborator Author

ericcurtin commented Apr 6, 2023

Below is more info on what a system booted like this looks like. Just to point out, aboot bootloaders do not parse BLS entries at all. But the data in them is still useful: we can parse BLS in the initramfs to decide exactly where we need to boot from, once we've figured out which BLS entry file is the right one to use.

The BLS cmdline does not match /proc/cmdline when booted this way, but I think that's OK. A person debugging this should know that aboot does not parse BLS; it's a pretty simple bootloader that boots the boot image in slot A or slot B. The data there can still help keep the aboot technique consistent with a BLS bootloader like grub.

$ rpm-ostree status; echo; cat /proc/cmdline; echo; cat /boot/loader.1/entries/ostree-1-centos.conf
Deployments:
● auto-sig:cs9/aarch64/abootqemu-minimal
                  Version: 9 (2023-04-06T09:23:50Z)
                   Commit: 3bc0523161590c853b6aaefa43f81e1072226ab1ca49c0e188ec5768139ffd17

root=LABEL=root ro loglevel=4 efi=runtime libahci.ignore_sss=1 console=ttyAMA0 rw

title CentOS AutoSD 9 (ostree:0)
version 1
options root=LABEL=root ro loglevel=4 efi=runtime libahci.ignore_sss=1 console=ttyAMA0 ostree=/ostree/boot.1/centos/7cc01e1ce3c47a67d5a60a0c3e515dafc5bbaed2596a42c830f80234e782fea1/0 rw
linux /ostree/centos-7cc01e1ce3c47a67d5a60a0c3e515dafc5bbaed2596a42c830f80234e782fea1/vmlinuz-5.14.0-999.124.pre.ES2.CCU.el9iv.aarch64
initrd /ostree/centos-7cc01e1ce3c47a67d5a60a0c3e515dafc5bbaed2596a42c830f80234e782fea1/initramfs-5.14.0-999.124.pre.ES2.CCU.el9iv.aarch64.img
abootcfg /ostree/deploy/centos/deploy/3bc0523161590c853b6aaefa43f81e1072226ab1ca49c0e188ec5768139ffd17.0/usr/lib/ostree-boot/aboot.cfg
aboot /ostree/centos-7cc01e1ce3c47a67d5a60a0c3e515dafc5bbaed2596a42c830f80234e782fea1/aboot-5.14.0-999.124.pre.ES2.CCU.el9iv.aarch64.img
fdtdir /ostree/centos-7cc01e1ce3c47a67d5a60a0c3e515dafc5bbaed2596a42c830f80234e782fea1/dtb

@bauen1

bauen1 commented Apr 11, 2023

So the current logic simply picks the latest, as defined by modification timestamp, file from /sysroot/ostree/boot.$x/$stateroot/$bootcsum/$checkoutnum as checkout to use ?

I think that will lead to unexpected results at best, and unexpected vulnerabilities at worst, and I guess that's why this is a Draft currently.

Basing the decision on the modified timestamp could have the very slim chance of actually booting an older, but later modified, deployment, I guess that situation would be rare.

It would also mean, that you can boot a (signed) kernel with a different checkout than it was deployed with, which could open up new vulnerabilities, e.g. new deployment relies on kernel feature / initramfs changes that aren't present.

But the $bootcsum should be computable by the loader (or maybe even the kernel/initramfs itself), so while you can't embed the hash inside the images (which is the data being hashed), perhaps it could be passed by the UKI stub or loader.
Then you wouldn't need to "guess" it, and could ensure that a deployment would only ever be booted by an image it was built with.
(Side note: that could even work if the bootcsum ends up being the hash of the UKI, but in that case it could be preferable to use the PE/COFF hash that would be signed.)

Rollbacks / multiple deployments seem to be impossible with this logic, since even with an older image the latest checkout is used.
I think the information about which deployment to use inherently comes from an unverifiable source, and the best you can do is probably to limit the image+deployment combinations to ones where the booted image and the deployment's image match (bootcsum).
And I believe that is true even if you implement automatic health checks, since at some point you need to trust something, and realistically you want to give users the option of choosing to boot a rollback image+deployment (e.g. by means of UEFI boot selection / sd-boot).

In any case you're putting your trust in nobody tampering with the actual filesystem, which is quite a far shot by today's standards ...

Comment on lines 123 to 155
static inline int
append_latest_link (char *dir_str, int *len, int *cap)
{
  char *append_from = dir_str + *len;
  int appended_len = 0;
  /* close_dir is the cleanup helper defined elsewhere in this header. */
  DIR __attribute__ ((cleanup (close_dir))) *dir = opendir (dir_str);
  if (!dir)
    {
      fprintf (stderr, "'%s' %d\n", strerror (errno), __LINE__);
      return 1;
    }

  time_t latest = 0;
  struct dirent *ent;
  while ((ent = readdir (dir)))
    {
      /* Only symlinks are candidates. */
      if (ent->d_type != DT_LNK)
        continue;

      struct stat lsb;
      char ent_str[PATH_MAX + sizeof (ent->d_name)];
      /* Use only the first *len bytes of dir_str: a previous iteration may
       * already have appended a link name after them. */
      snprintf (ent_str, sizeof (ent_str), "%.*s/%s", *len, dir_str, ent->d_name);
      if (lstat (ent_str, &lsb) == -1)
        {
          fprintf (stderr, "'%s' %d\n", strerror (errno), __LINE__);
          return 1;
        }

      /* Keep the most recently modified link seen so far. */
      if (latest <= lsb.st_mtime)
        {
          appended_len = snprintf (append_from, *cap, "/%s", ent->d_name);
          if (appended_len < 0 || appended_len >= *cap)
            return 1; /* the name did not fit in the remaining buffer */
          latest = lsb.st_mtime;
        }
    }

  *len += appended_len;
  *cap = PATH_MAX - *len;

  return 0;
}

So, as I take it, this method searches $prefix/ostree/ and appends the latest symlink.

Maybe it would be good to only look at $prefix/ostree/boot* (perhaps even limited to all possible combinations) ?

Otherwise creating a link in there (even if you shouldn't do so) seems to have potential to break any attempt to boot.

Sorting by modification timestamps seems a bit fragile to me, can't that be done by simply looking at the number in the link ?
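
To make the suggestion above concrete, here is a minimal sketch of choosing a link by the number in its name instead of by mtime. It is only an illustration under assumptions: the helper names are hypothetical, it assumes the candidate link names are plain decimal numbers (like $checkoutnum), and it assumes the highest number should win, which the thread does not settle.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parse a name consisting only of decimal digits.
 * Rejects "", "-1", "+1", " 1", "3x" and similar. */
static bool
parse_link_index (const char *name, long *out)
{
  if (name[0] < '0' || name[0] > '9')
    return false;

  char *end = NULL;
  errno = 0;
  long v = strtol (name, &end, 10);
  if (errno != 0 || *end != '\0')
    return false;

  *out = v;
  return true;
}

/* Hypothetical helper: return the entry with the highest numeric name,
 * or NULL if no entry is numeric. Stray non-numeric links are ignored
 * instead of breaking the boot. */
static const char *
pick_highest_numbered (const char *const *names, size_t n)
{
  const char *best = NULL;
  long best_idx = -1;

  for (size_t i = 0; i < n; i++)
    {
      long idx;
      if (parse_link_index (names[i], &idx) && idx > best_idx)
        {
          best_idx = idx;
          best = names[i];
        }
    }

  return best;
}
```

Unlike the mtime comparison, the result here does not depend on when a link was last touched, and an unexpected link name in the directory is simply skipped.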

Collaborator Author

> So the current logic simply picks the latest, as defined by modification timestamp, file from /sysroot/ostree/boot.$x/$stateroot/$bootcsum/$checkoutnum as checkout to use ?

Yes.

> I think that will lead to unexpected results at best, and unexpected vulnerabilities at worst, and I guess that's why this is a Draft currently.

Yes again, I'm just trying things at the moment.

> Basing the decision on the modified timestamp could have the very slim chance of actually booting an older, but later modified, deployment, I guess that situation would be rare.

Yes, I am worried about this too; expect the implementation to change.

> It would also mean, that you can boot a (signed) kernel with a different checkout than it was deployed with, which could open up new vulnerabilities, e.g. new deployment relies on kernel feature / initramfs changes that aren't present.

> But the $bootcsum should be computable by the loader (or maybe even the kernel/initramfs itself), so while you can't embed the hash inside the images (which is the data being hashed), perhaps it could be passed by the UKI-stub or loader. Then you wouldn't need to "guess" it, and can ensure that a deployment would only ever be booted by an image it was built with. (Side-Note: That could even work if the bootcsum ends up being the hash of the UKI, but in that case it could be preferable to use the PECOFF hash that would be signed).

I am using a very limited bootloader that boots either an A or a B boot partition containing an Android Boot Image. There is very little data available in the GPT partition table that I can use to identify whether I should boot the latest or the latest-1 version. If that makes sense.

UKIs probably have many more options.

> Rollbacks / multiple deployments, seem to be impossible with this logic, since even with an older image the latest checkout is used. I think the information of which deployment to use comes inherently from an unverifiable source and the best you can do is probably to limit the image+deployment combinations to ones where the image booted and the image of the deployment match (bootcsum). And I believe that is true even if you implement automatic health checks, since at some point you need to trust something, and realistically you want to give users the option of choosing to boot rollback image+deployment (e.g. by means of UEFI boot selection / sd-boot).

I only want Android style A/B rollbacks, nothing else.

> In any case you're putting your trust in nobody tampering with the actual filesystem, which is quite a far shot by today's standards ...

Will need to be hardened.

I think you can take this PR with a pinch of salt at the minute, it will be drastically changed. But thanks for the early feedback.

> I am using a very limited bootloader that boots either an A or B boot partition, that contains an Android Boot Image. But there is very limited data available in the GPT partition table that I can be able to use in order to identify if I should use the latest or the latest-1 version. If that makes sense.

Ah, that makes things a bit easier 🤔
I think it might end up being a very similar trade-off to the one you have to make with UKI, since the information on what to boot (A or B) comes from an untrusted / unverified source.

> > In any case you're putting your trust in nobody tampering with the actual filesystem, which is quite a far shot by today's standards ...
>
> Will need to be hardened.

It was more of a general comment; a lot of things currently depend on the filesystem not having been tampered with.

> I think you can take this PR with a pinch of salt at the minute, it will be drastically changed. But thanks for the early feedback.

It is very nice to see some attempts at implementing it, even if it wouldn't fit my use case in the end 🚀
As always, an actual (incomplete) implementation is much easier to argue about than some abstract idea.

@ericcurtin
Collaborator Author

ericcurtin commented Apr 26, 2023

@cgwalters @jlebon @bauen1 @alexlarsson Just getting back to this. How would you feel about writing a very simple BLS parser to achieve this, so we can behave like a BLS-based bootloader, but in the initramfs rather than in the bootloader? It would just do an rpmvercmp and then pull the "^options " line to get the ostree= arg from there. I probably wouldn't reuse src/libostree/ostree-bootconfig-parser.c, because I think it would drag too many dependencies into the initramfs. It could be triggered if the cmdline ostree= arg was something like "ostree=aboot", "ostree=initramfs_bls_parser" or, as previously suggested, "ostree=latest" or "ostree=no" (as you can tell, I'm not very opinionated on the name). This bootloader is looking like it's basically going to have a single bit set in the GPT partition table to decide whether to fall back or not. I'm going to try and test the hell out of this for corner cases eventually, of course.
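
For a sense of scale, the "pull the options line and take the ostree= arg" step could look roughly like the sketch below. It is not the parser from this PR: bls_find_ostree_arg is a hypothetical name, the entry is assumed to be already loaded into memory, and the key and its arguments are assumed to be separated by single spaces as in the entry shown earlier in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the value of the first "ostree=" argument on
 * an "options " line of a BLS entry into out. Returns false if there is
 * no such argument or if it does not fit in out. */
static bool
bls_find_ostree_arg (const char *bls_text, char *out, size_t out_size)
{
  static const char key[] = "options ";
  static const char arg[] = "ostree=";
  const size_t key_len = sizeof (key) - 1;
  const size_t arg_len = sizeof (arg) - 1;
  const char *line = bls_text;

  while (line && *line)
    {
      const char *eol = strchr (line, '\n');
      size_t line_len = eol ? (size_t) (eol - line) : strlen (line);

      if (line_len >= key_len && strncmp (line, key, key_len) == 0)
        {
          const char *p = line + key_len;
          const char *end = line + line_len;

          /* Walk the space-separated arguments on this line. */
          while (p < end)
            {
              while (p < end && *p == ' ')
                p++;
              const char *tok = p;
              while (p < end && *p != ' ')
                p++;

              size_t tok_len = (size_t) (p - tok);
              if (tok_len > arg_len && strncmp (tok, arg, arg_len) == 0)
                {
                  size_t val_len = tok_len - arg_len;
                  if (val_len >= out_size)
                    return false;
                  memcpy (out, tok + arg_len, val_len);
                  out[val_len] = '\0';
                  return true;
                }
            }
        }

      line = eol ? eol + 1 : NULL;
    }

  return false;
}
```

It only needs libc, which fits the goal of keeping initramfs dependencies small; version comparison between entries is deliberately left out of this sketch.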

@cgwalters
Member

The code seems tractable...it's the maintenance and testing that seems like the hard part here. Having two ways to do it permanently seems like a problem.

@ericcurtin
Collaborator Author

ericcurtin commented Apr 27, 2023

I'll push what I have tomorrow and you can take a look, it's working for the upgrade path, not for the fallback path yet, I'd love to actually re-use stuff and not rewrite things to minimise the initramfs. But if there's an industry that's pushing hard for small initramfs's and fast boot times it's automotive 😅

@ericcurtin ericcurtin force-pushed the boot_latest_symlink branch 2 times, most recently from b38f5f9 to ffadf85 Compare April 27, 2023 20:52
@ericcurtin ericcurtin changed the title Remove hard condition on ostree= karg Add ostree=aboot for signed Android Boot Images Apr 27, 2023
@ericcurtin
Collaborator Author

OK, pushed tonight instead; I had twenty minutes to spare.

@ericcurtin
Collaborator Author

> @cgwalters So, in aboot the deploy is not really the writing of the partition. There are two boot partitions, boot_a and boot_b. When you are booted to boot_a, you write the new deployment to boot_b, and the switch is not the writing of the partition. Instead it is at the end, when you flip a bit in the partition table making boot_b the default boot partition. This would be done after writing the boot loader files.

Yup, flipping that bit is the switch. Although now we have two switches, because we use BLS as well.

> In case of a rollback we would boot the old boot_a partition, and the idea is to have the initrd notice which partition we booted from, and if it is A (instead of the expected B), then it would pick the BLS file that points to the older userspace.

Yeah, it's going to be similar to this.

> BTW. I wonder if we could just generate a "build id" (essentially a random number) for each new ostree commit build, and encode that in the kernel command line (but no deploy id). Then we could use that (instead of the version) to find the matching BLS file and find the deploy id from it.

This sounds like a good idea. It does mean we would have to write to the Android Boot Partitions every time, even for userspace upgrades, which is actually OK, not a big deal. Let's keep this in mind in case we need it. It could simplify things quite a bit.

@ericcurtin
Collaborator Author

> So, overall I like this approach. However, I'm not sure I agree with the use of the BLS version field. You may use ostree to switch to an older version, and if that fails you want to fall back to the one that was booted when you deployed the older version. Version ordering doesn't help you here. What does help is the ostree deployment index, which will be "1" for the newly deployed thing and "2" for the previous one.
> This is encoded in the filename, like ostree-1-fedora-coreos.conf for index 1 and ostree-2-fedora-coreos.conf for index 2. Should we not just always boot the file that starts with "ostree-1-" and fall back to "ostree-2-"?
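
The filename-based selection proposed above could be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes entry files are named ostree-<index>-<name>.conf as in the examples, with index 1 as the default and index 2 as the fallback.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: extract the deployment index from a BLS entry
 * filename of the form "ostree-<index>-<name>.conf".
 * Returns -1 if the filename does not have that shape. */
static long
bls_entry_index (const char *filename)
{
  static const char prefix[] = "ostree-";
  const size_t prefix_len = sizeof (prefix) - 1;

  if (strncmp (filename, prefix, prefix_len) != 0)
    return -1;

  const char *num = filename + prefix_len;
  if (*num < '0' || *num > '9')
    return -1;

  char *end = NULL;
  errno = 0;
  long idx = strtol (num, &end, 10);
  if (errno != 0 || *end != '-')
    return -1;

  return idx;
}

/* Hypothetical helper: pick the "ostree-1-" entry normally and the
 * "ostree-2-" entry when falling back. Returns NULL if it is missing. */
static const char *
pick_bls_entry (const char *const *files, size_t n, bool fallback)
{
  long wanted = fallback ? 2 : 1;

  for (size_t i = 0; i < n; i++)
    if (bls_entry_index (files[i]) == wanted)
      return files[i];

  return NULL;
}
```

Whether the fallback decision comes from the booted slot or from a flag in the partition table is left open here, as it is in the discussion.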

I'll test a few more rollbacks before continuing to avoid wasted effort; you are probably right. My first version of this yesterday compared filenames, before I changed it to the "version " field. Literally copying and pasting my logic from yesterday:

> Which is more correct when deciding which BLS file is the latest: do a vercmp of the filename, or do a vercmp of the actual "version " field in the file text? Or are they guaranteed to be the same? (If they are, I'll just check the filename; it's quicker and shorter code, and I don't have to open the file in C.) See my grep below of two BLS files on an ostree system.
>
> grep "version" /boot/loader/entries/ostree-*
> /boot/loader/entries/ostree-1-centos.conf:version 1
> /boot/loader/entries/ostree-2-centos.conf:version 2
>
> Never mind, actually I'll open the file to be consistent with some other ostree code.

Btw @cgwalters, did you see this question? Which is the correct way to do a vercmp check in ostree: via the filename, via the "version " field, or are they the same?

@ericcurtin ericcurtin force-pushed the boot_latest_symlink branch 2 times, most recently from 607ac1b to 5ec1ae9 Compare May 2, 2023 10:19
@@ -124,9 +124,13 @@ resolve_deploy_path (const char * root_mountpoint)
 {
   char destpath[PATH_MAX];
   struct stat stbuf;
-  char *ostree_target, *deploy_path;
+  char __attribute__ ((cleanup (free_char))) *ostree_target;
Collaborator Author

Pretty sure this was leaking here and in src/switchroot/ostree-system-generator.c all along, but both are short lived processes so you wouldn't really notice.

@ericcurtin
Collaborator Author

Added a uname check

@cgwalters cgwalters added enhancement difficulty/hard hard complexity/difficutly issue reward/medium Fixing this will be notably useful labels May 2, 2023
@cgwalters
Member

/ok-to-test

@jlebon
Member

jlebon commented May 2, 2023

> In case of a rollback we would boot the old boot_a partition, and the idea is to have the initrd notice which partition we booted from, and if it is A (instead of the expected B), then it would pick the BLS file that points to the older userspace.

I'm not sure I follow this. If the initrd knows from which partition it booted from, can't it pick the latest BLS in its partition? Or maybe that's what you're saying but in different words.

It feels confusing to have both multiple boot partitions and multiple BLS entries per partition. IIUC, I think ideally we'd have ostree just emit a single BLS entry instead in that case, but that's clearly a more invasive patch. It seems to me though like we should just pretend the older BLS entries don't exist (this rules out any potential mismatching issues).

And then, rather than reimplementing a BLS parser, we could have some bootloader option that writes out the deployment root path in some file (or maybe as a symlink) under the /boot/loader symlink.

@ericcurtin
Collaborator Author

ericcurtin commented May 2, 2023

> > In case of a rollback we would boot the old boot_a partition, and the idea is to have the initrd notice which partition we booted from, and if it is A (instead of the expected B), then it would pick the BLS file that points to the older userspace.
>
> I'm not sure I follow this. If the initrd knows from which partition it booted from, can't it pick the latest BLS in its partition? Or maybe that's what you're saying but in different words.

This is a possible technique. We do not have A/B system partitions (because that doesn't make as much sense when the system is managed by ostree, which can manage multiple versions). But both Android Boot Images (they are like UKIs) can be booted from either boot partition A or B.

> It feels confusing to have both multiple boot partitions and multiple BLS entries per partition. IIUC, I think ideally we'd have ostree just emit a single BLS entry instead in that case, but that's clearly a more invasive patch. It seems to me though like we should just pretend the older BLS entries don't exist (this rules out any potential mismatching issues).

We plan on having one set of BLS entries for both partitions.

> And then, rather than reimplementing a BLS parser, we could have some bootloader option that writes out the deployment root path in some file (or maybe as a symlink) under the /boot/loader symlink.

Could do, but BLS is easy to parse; it's a nice simple format, tbh. Even if we started using a separate file, the BLS file format would make sense, as it's easy to parse in C.

And the thing is, there is useful data there whether we create a separate file or not.

@cgwalters
Member

cgwalters commented May 2, 2023

> Btw @cgwalters did you see this question. Which is the correct way to do a vercmp check in ostree, via filename or "version " field or are they the same?

Well, the right way to answer this question is to look at the consumers. For e.g. zipl, it's
https://github.com/ibm-s390-linux/s390-tools/blob/e6be1c1b793915e001ceaf3f6f93a00c396618dc/zipl/src/scan.c#L757

which is parsing the version field, as far as I can tell. AIUI the original spec versions called for sorting by filenames too? I may be wrong.

Reading the code for grub2 in https://src.fedoraproject.org/rpms/grub2/blob/rawhide/f/0021-blscfg-add-blscfg-module-to-parse-Boot-Loader-Specif.patch it looks like it's sorting by filename...

@jlebon
Member

jlebon commented May 3, 2023

> > I'm not sure I follow this. If the initrd knows from which partition it booted from, can't it pick the latest BLS in its partition? Or maybe that's what you're saying but in different words.
>
> This is a possible technique, we do not have A/B system partitions (because it doesn't make as much sense when it's managed by ostree that can manage multiple versions). But both Android Boot Images (they are like UKIs) can be booted from either boot partition A or B.
>
> > It feels confusing to have both multiple boot partitions and multiple BLS entries per partition. IIUC, I think ideally we'd have ostree just emit a single BLS entry instead in that case, but that's clearly a more invasive patch. It seems to me though like we should just pretend the older BLS entries don't exist (this rules out any potential mismatching issues).
>
> We plan on having one set of BLS entries for both partitions.

Ack thanks, this helps a lot. So IIUC, the A/B boot partitions are purely for the UKI-like blobs, but otherwise are separate from /boot where the BLS entries are written to as usual?

So then my next question is re. this bit:

> Well if we get interrupted at any point, we won't tell the system to switch slots, so we would still be booting old kernel and userspace.

There are two critical events: the symlink update in the bootfs for pointing at the new set of BLS entries, and the slot switch. If we get interrupted after the symlink update but before the slot switch, isn't the system state inconsistent? I.e. won't the active slot (containing the booted kernel) try to boot the now latest deployment (containing the new userspace)?

Sorry if this is obvious to everyone else. I'm just trying to understand how this all fits since we're bound to get more UKI-related things in the future.

@alexlarsson
Member

alexlarsson commented May 3, 2023

> > > I'm not sure I follow this. If the initrd knows from which partition it booted from, can't it pick the latest BLS in its partition? Or maybe that's what you're saying but in different words.
> >
> > This is a possible technique, we do not have A/B system partitions (because it doesn't make as much sense when it's managed by ostree that can manage multiple versions). But both Android Boot Images (they are like UKIs) can be booted from either boot partition A or B.
> >
> > > It feels confusing to have both multiple boot partitions and multiple BLS entries per partition. IIUC, I think ideally we'd have ostree just emit a single BLS entry instead in that case, but that's clearly a more invasive patch. It seems to me though like we should just pretend the older BLS entries don't exist (this rules out any potential mismatching issues).
> >
> > We plan on having one set of BLS entries for both partitions.
>
> Ack thanks, this helps a lot. So IIUC, the A/B boot partitions are purely for the UKI-like blobs, but otherwise are separate from /boot where the BLS entries are written to as usual?

Yeah. The boot_a and boot_b partitions are "raw" data partitions with just the UKI blob. They have no filesystem on them at all. Typically in an aboot system, boot_a would be paired with a read-only dm-verity:ed userspace fs image in the system_a partition, and boot_b would be paired with the system_b partition. However, since our system partition is writable, with the ostree repo on it, we will only have one system partition.

> So then my next question is re. this bit:
>
> > Well if we get interrupted at any point, we won't tell the system to switch slots, so we would still be booting old kernel and userspace.
>
> There's two critical events: the symlink update in the bootfs for pointing at the new set of BLS entries, and the slot switch. If we get interrupted after the symlink update but before the slot switch, isn't the system state inconsistent? I.e. won't the active slot (containing the booted kernel) try to boot the now latest deployment (containing the new userspace)?
>
> Sorry if this is obvious to everyone else. I'm just trying to understand how this all fits since we're bound to get more UKI-related things in the future.

Yeah, this is true. This is why I proposed adding a "randomly generated" build identifier to the initrd, which would be paired with the BLS file. Then whichever initrd gets booted can find its own correct BLS file in either of the boot directories.

@ericcurtin
Collaborator Author

ericcurtin commented May 3, 2023

Going along with the idea of a "random generated" build identifier, we could actually just make this a checksum of the diff before we commit and store it as a non-random thing (almost like generating a commit, but not actually committing it), that way the id is reproducible in builds and it's more verifiable it's correct. We could use something like sha1sum, we want a checksumming algorithm that's really fast. This way we might be able to afford to get away with no fallback flag written by the firmware into the GPT partition table because you would know from this build identifier to keep rolling back commits until the build id matches.

I'm setting up an environment at the moment to test all different kinds of fake upgrades etc. The above works with an "rpm-ostree install" kind of upgrade. But I want to try a normal userspace upgrade, a kernel upgrade etc. So whatever technique we use I can test it covers all bases.

@openshift-merge-robot
Collaborator

PR needs rebase.


@jlebon
Member

jlebon commented May 4, 2023

I think I have an idea on how to resolve the race condition here more systematically. The approach is essentially to never change the position of the booted deployment during an update. Let's say you're booted in the default deployment and you're ready to upgrade. Then you would:

  • deploy the update in rollback position (if using the CLI, this can be achieved via ostree admin deploy --not-as-default)
  • do the slot switch

If we're interrupted in between the two steps, we'll just boot right back in the old deployment. As soon as we do the slot switch, we're set to boot into the new deployment.

In the next update, you'd do the opposite (deploy to the pending deployment position).

Essentially, this makes the deployment index not really matter. 0 is always matched to slot A and 1 is always matched to slot B.

Obviously if you do something like rpm-ostree rollback, it messes up the order, but it wouldn't be hard to add some sysroot option in libostree that enforces that the booted deployment never changes index too.
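
The two-step update described above can be modelled with a toy state machine. This is only a sketch of the idea, not libostree code: all type and function names are hypothetical, the fixed pairing (index 0 with slot A, index 1 with slot B) is taken from the comment above, and "version" is just an integer stand-in for a deployment.

```c
/* Toy model of the proposal: the booted deployment never changes index,
 * updates go to the index paired with the inactive slot, and only the
 * slot switch changes what boots next. All names are hypothetical. */
enum slot { SLOT_A = 0, SLOT_B = 1 };

struct boot_state
{
  enum slot active; /* slot the bootloader will boot next */
  int deployed[2];  /* version held at deployment index 0 and 1 */
};

static enum slot
other_slot (enum slot s)
{
  return s == SLOT_A ? SLOT_B : SLOT_A;
}

/* Fixed pairing: index 0 <-> slot A, index 1 <-> slot B. */
static int
deployment_index_for_slot (enum slot s)
{
  return s == SLOT_A ? 0 : 1;
}

/* Step 1: deploy the update at the index paired with the inactive slot. */
static void
stage_update (struct boot_state *st, int version)
{
  st->deployed[deployment_index_for_slot (other_slot (st->active))] = version;
}

/* Step 2: the slot switch. */
static void
switch_slot (struct boot_state *st)
{
  st->active = other_slot (st->active);
}

/* What would boot if the machine restarted right now. */
static int
version_booted_next (const struct boot_state *st)
{
  return st->deployed[deployment_index_for_slot (st->active)];
}
```

In this model, an interruption between stage_update and switch_slot leaves version_booted_next unchanged, which is the property the proposal is after.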

@ericcurtin
Collaborator Author

I am heading on a short break for a few days, but I will be back next week, just to give an update. I am working with the upstream maintainer of this tool:

https://gitlab.com/sdm845-mainline/qbootctl/-/merge_requests

to do slot switching on a real Qualcomm based solution, up to now I'd been testing in single slot mode.

Expect there to be another re-spin of this, thanks for the feedback so far.

@cgwalters
Member

This one could probably benefit from a realtime conversation. I'm looking at time slots for this now, tentatively thinking May 31 @ 9:30am EST at https://meet.gnome.org/col-hab-hej-e5i

@ericcurtin
Collaborator Author

I am out next week, 29 May-2 June. I spoke to @alexlarsson and we have something in mind that should require no changes to ostree-prepare-root; thanks to the design of composefs, it really simplified things. I should be able to complete this in an aboot-deploy abstraction requiring minimal to no further changes to ostree.

I'm gonna close this PR, but would be more than happy to have the conversation to keep everyone sync'd up.

Labels
difficulty/hard hard complexity/difficutly issue enhancement needs-rebase ok-to-test reward/medium Fixing this will be notably useful

6 participants