Add support for full overlayfs for / #3113

Closed
cgwalters opened this issue Dec 7, 2023 · 25 comments · Fixed by #3114

Comments

@cgwalters
Member

It'd greatly improve compatibility with things like RPMs that install in /opt if we supported a full "original docker" style model where / is a transient overlayfs. We'd still keep our semantics for /etc and /var by default, but e.g. we'd stop recommending /opt ➡️ /var/opt, so /opt would be on the overlayfs.

Note this all aligns with composefs, where we'd actually be making / a read-only overlayfs by default; it'd be really nice of course to implement this by just making the composefs overlayfs writable, but I am not sure we can hard require composefs for this right now.

Something like

root.transient = "true"

in /usr/lib/ostree/prepare-root.conf.
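As a rough sketch, the setting could be written out in keyfile form as shown below. The `[root]` group with a `transient` key is an assumption about how `root.transient` maps onto the file format, and the `stage` scratch directory is purely illustrative: a real image build would write to `/usr/lib/ostree` inside the image.

```shell
#!/bin/sh
set -eu

# Scratch directory standing in for the image root (hypothetical; a real
# build would write to /usr/lib/ostree inside the container image).
stage=$(mktemp -d)
conf="$stage/usr/lib/ostree/prepare-root.conf"
mkdir -p "$(dirname "$conf")"

# Assumed keyfile spelling of `root.transient = true`.
cat > "$conf" <<'EOF'
[root]
transient = true
EOF

cat "$conf"
```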

Downsides

The major downside is that people could be surprised if files they write to e.g. /opt don't persist across upgrades. But again, that's how it has worked since Docker started.

@alexlarsson
Member

I worry about this in general, because it breaks a basic container-like workload: you can't (safely) rebase to a new version while keeping the existing overlay upper dir. (Any change to a file copies it into the upper layer, where it will always shadow a new version of that file from the image.)

This works in containers, because you start new containers whenever you rebase, which means throwing away the upper layer. Do we intend to do the same here?
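The "upper layer wins" concern can be illustrated without a real mount. The sketch below is a simplified model using plain directories, not actual overlayfs (no whiteouts, no copy-up machinery); the directory names and the `overlay_read` helper are made up for illustration.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
mkdir -p "$work/lower-v1" "$work/lower-v2" "$work/upper"

# Image v1 and v2 ship different versions of the same file.
echo 'tool v1' > "$work/lower-v1/tool"
echo 'tool v2' > "$work/lower-v2/tool"

# A write on the running system copies the file up into the upper layer.
echo 'tool v1 + local edit' > "$work/upper/tool"

# Simplified model of an overlay lookup: the upper layer wins when it
# has the path, otherwise the lower layer is used.
overlay_read() {
    lower=$1
    path=$2
    if [ -e "$work/upper/$path" ]; then
        cat "$work/upper/$path"
    else
        cat "$lower/$path"
    fi
}

# After a "rebase" to v2 with the old upper kept, v2's file stays hidden.
after_rebase=$(overlay_read "$work/lower-v2" tool)
echo "$after_rebase"
```

Throwing away `upper` on every rebase, as containers do, is what makes the new image's copy visible again.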

@alexlarsson
Member

I see you mention the upgrade issue. However, if we do that, is this really a useful feature that users want?

@mrguitar

mrguitar commented Dec 7, 2023

I believe this will perfectly solve the issue we have with some of the NVIDIA RPMs. I think applications like CrowdStrike will struggle, as their client performs a type of "activation" and maintains state with the controller all under the /opt/crowdstrike/ directory. But if I understand this correctly, the software should still work; they'd just have to reestablish that activation with each OS update. I think we can make this work, but this is an area where more feedback will hopefully help.

@cgwalters
Member Author

@alexlarsson On upgrade indeed the old overlayfs data will be gone. But that's how it has worked since Docker first came out.

@alexlarsson
Member

alexlarsson commented Dec 7, 2023

I think applications like CrowdStrike will struggle, as their client performs a type of "activation" and maintains state with the controller all under the /opt/crowdstrike/ directory. But if I understand this correctly, the software should still work; they'd just have to reestablish that activation with each OS update. I think we can make this work, but this is an area where more feedback will hopefully help.

I feel like this would be better solved by adding to the image a bind mount unit from /opt/crowdstrike into /var instead, so that the activation can be persisted? Or am I missing something here?

(or a symlink)
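A bind mount unit of that kind might look roughly like the sketch below. The backing path `/var/opt/crowdstrike`, the `stage` scratch directory and the unit contents are assumptions for illustration; the only fixed rule relied on is that systemd names a mount unit after its mount point.

```shell
#!/bin/sh
set -eu

# Scratch directory standing in for the image root (hypothetical).
stage=$(mktemp -d)
unit_dir="$stage/usr/lib/systemd/system"
mkdir -p "$unit_dir"

# systemd requires a mount unit to be named after its mount point:
# /opt/crowdstrike -> opt-crowdstrike.mount
unit="$unit_dir/opt-crowdstrike.mount"

cat > "$unit" <<'EOF'
[Unit]
Description=Persist /opt/crowdstrike state in /var (illustrative)

[Mount]
What=/var/opt/crowdstrike
Where=/opt/crowdstrike
Type=none
Options=bind

[Install]
WantedBy=local-fs.target
EOF

cat "$unit"
```

Bind-mounting over the whole directory would also hide whatever the image ships there, so a real setup would probably target only the state subdirectory or seed the `/var` copy first.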

@mrguitar

mrguitar commented Dec 7, 2023

I agree that would be much better. I think the challenge is mainly around users knowing details like this about their applications, and also knowing rpm-ostree/ostree. Docker makes them care more about where things are persisted; traditional RPM systems do not.

Earlier today I think @cgwalters or @stefwalter mentioned using the VOLUME key in the Containerfile to help. Would it make sense to create bind mounts like this, or something similar, for volumes that users declare?

@alexlarsson
Member

It would be cool if we could somehow mark a volume as needing to persist, and have bootc automatically set up the /var mapping for it.
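To make the idea concrete, here is a small sketch of how declared volumes could be turned into `/var` mappings. Nothing here is an existing bootc feature: the `/var/lib/volumes` prefix is a made-up convention, the sample Containerfile is illustrative, and only the plain `VOLUME /path` form is handled (not the JSON array form).

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# Sample Containerfile (illustrative content only).
cat > "$work/Containerfile" <<'EOF'
FROM quay.io/centos-bootc/centos-bootc-dev:stream9
VOLUME /opt/crowdstrike
VOLUME /opt/puppetlabs/puppet/cache
EOF

# For every declared volume, print the persistent location a tool could
# map it to. The /var/lib/volumes prefix is a made-up convention.
mappings=$(sed -n 's|^VOLUME[[:space:]]\{1,\}\(/.*\)$|\1 -> /var/lib/volumes\1|p' "$work/Containerfile")
echo "$mappings"
```

Each printed pair could then be realized as a bind mount or symlink at boot.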

cgwalters added a commit to cgwalters/ostree that referenced this issue Dec 7, 2023
Closes: ostreedev#3113

It'd greatly improve compatibility with things like RPMs that install
in `/opt` if we supported a full "original docker" style model where
`/` is a transient overlayfs.  We'd still keep our semantics for `/etc`
and `/var` by default, but e.g. we'd stop recommending
`/opt` ➡️ `/var/opt`, in this model,
so `/opt` would be on the overlayfs.

Note this all aligns with composefs, where we'd actually be making
`/` a *read-only* overlayfs by default; it'd be really nice of course
to *implement* this by just making the composefs overlayfs writable,
but I am not sure we can hard require composefs for this right now.

So this change adds support for `root.transient = true`
in `/usr/lib/ostree/prepare-root.conf`.

The major downside is that people could be surprised if files they
write to e.g. `/opt` don't persist across upgrades.  But again, that's
how it has worked since Docker started.

Note as part of the implementation of this, we need to add a whole
new "backing" directory distinct from the deployment directories.

(Tangentially related to this, it's tempting to switch to always
 using a *read-only* overlay mount by default.)
@stefwalter
Contributor

It would be cool if we could somehow mark a volume as needing to persist, and have bootc automatically set up the /var mapping for it.

This would be amazing and match the behavior of people starting to use docker with volumes.

I would go as far as to say that this is something that they have to specify before a persistent volume (even /var) is created. Shouldn't we match the basic container behavior as much as we can?

@cgwalters
Member Author

I would go as far as to say that this is something that they have to specify before a persistent volume (even /var) is created. Shouldn't we match the basic container behavior as much as we can?

This is a good debate to have. I do think some use cases would be happy with a transient /var...but note that it would mean:

  • All systemd journal logs go away
  • All container images go away
  • Everything in user home directories goes away (including e.g. provisioned SSH keys)

That last one would break Anaconda kickstarts setting up SSH keys.

The other "volume" ostree sets up by default is /etc, which is merged on upgrade (this is a relatively unique feature of ostree). Making that transient would have equally large fallout, breaking basically every kickstart verb that exists in Anaconda - forcing all config to live in the container image (or be fetched dynamically every boot from some external source of truth, e.g. kubelet talking to API server). Which is actually what some use cases want, especially "sealed" systems (that are signed/locked down). But not having a persistent /etc would break setting up NetworkManager configs with static IP addresses via kickstart for example.

I like the conceptual purity of making things work exactly the same as when docker first came out, but the fallout of doing so is really large and will force the majority of users into configuring persistent volumes by default. A really pernicious problem will be that things will appear to work until you do an OS update...and that problem is why ostree has the strict model it does by default.

@cgwalters
Member Author

Tangentially related...an interesting debate to have is whether the root should reset on reboots that aren't OS updates. The existing PR (unlike systemd.volatile=overlay) keeps things persistent, which models OS reboots more like stopping/starting an existing container. (A tangential concern I have with systemd's default tmpfs is that I think for potentially nontrivial data sizes we really do want to spill to the backing drive by default, and not require distinct swap; but we can still implement rm -rf of the state on boot if we do want it to be transient.)

@cgwalters
Member Author

cgwalters commented Dec 8, 2023

For those interested in trying this out I've hacked up quay.io/cgwalters/centos-bootc-dev:stream9 (source) and notice the difference:

Previously:

$ podman build -t quay.io/cgwalters/puppet-test .
STEP 1/2: FROM quay.io/centos-bootc/centos-bootc-dev:stream9
STEP 2/2: RUN rpm -Uvh https://yum.puppetlabs.com/puppet/el/9/x86_64/puppet-agent-7.27.0-1.el9.x86_64.rpm
warning: /var/tmp/rpm-tmp.jfV1Ri: Header V4 RSA/SHA512 Signature, key ID 9e61ef26: NOKEY
Retrieving https://yum.puppetlabs.com/puppet/el/9/x86_64/puppet-agent-7.27.0-1.el9.x86_64.rpm
Verifying...                          ########################################
Preparing...                          ########################################
Updating / installing...
puppet-agent-7.27.0-1.el9             error: failed to open dir opt of /opt/: File exists
error: unpacking of archive failed on file /opt/puppetlabs: cpio: open failed - No such file or directory
error: puppet-agent-7.27.0-1.el9.x86_64: install failed
########################################
Error: building at STEP "RUN rpm -Uvh https://yum.puppetlabs.com/puppet/el/9/x86_64/puppet-agent-7.27.0-1.el9.x86_64.rpm": while running runtime: exit status 1
$

With updated test image, configured for this:

$ podman build -t quay.io/cgwalters/ostest .
STEP 1/2: FROM quay.io/cgwalters/centos-bootc-dev:stream9
STEP 2/2: RUN rpm -Uvh https://yum.puppetlabs.com/puppet/el/9/x86_64/puppet-agent-7.27.0-1.el9.x86_64.rpm
warning: /var/tmp/rpm-tmp.YJTk9M: Header V4 RSA/SHA512 Signature, key ID 9e61ef26: NOKEY
Retrieving https://yum.puppetlabs.com/puppet/el/9/x86_64/puppet-agent-7.27.0-1.el9.x86_64.rpm
Verifying...                          ########################################
Preparing...                          ########################################
Updating / installing...
puppet-agent-7.27.0-1.el9             ########################################
COMMIT quay.io/cgwalters/ostest
--> daedd2278950
$

And then:

[root@localhost ~]# bootc status
note: The format of this API is not yet stable
apiVersion: org.containers.bootc/v1alpha1
kind: BootcHost
metadata:
  name: host
spec:
  image:
    image: quay.io/cgwalters/ostest
    transport: registry
    signature: insecure
status:
  staged: null
  booted:
    image:
      image:
        image: quay.io/cgwalters/ostest
        transport: registry
        signature: insecure
      version: stream9.20231208.0
      timestamp: null
      imageDigest: sha256:8099318ffb4a510dd0d1ac9b9e99e2f90352e1d7330bc5b2aa93041105688475
    incompatible: false
    pinned: false
    ostree:
      checksum: dc84da42f86743121e0bdb11dedbfa27d0ae86e1da0becede90259a0abacc095
      deploySerial: 0
...
[root@localhost ~]# systemctl start puppet
[root@localhost ~]# systemctl status puppet
● puppet.service - Puppet agent
     Loaded: loaded (/usr/lib/systemd/system/puppet.service; disabled; preset: disabled)
     Active: active (running) since Fri 2023-12-08 22:46:28 UTC; 1s ago
       Docs: man:puppet-agent(8)
   Main PID: 682 (puppet)
      Tasks: 2 (limit: 50583)
     Memory: 74.1M
        CPU: 449ms
     CGroup: /system.slice/puppet.service
             └─682 /opt/puppetlabs/puppet/bin/ruby /opt/puppetlabs/puppet/bin/puppet agent --no-daemonize

Dec 08 22:46:28 localhost.localdomain systemd[1]: Started Puppet agent.
Dec 08 22:46:29 localhost.localdomain puppet-agent[682]: Starting Puppet client version 7.27.0

@jlebon
Member

jlebon commented Dec 11, 2023

@alexlarsson On upgrade indeed the old overlayfs data will be gone. But that's how it has worked since Docker first came out.

This is true, but it's still confusing UX given that all the other packages that keep state in /var will not behave this way. So we're not really shielding people from /opt being special if they're required to take extra measures to deal with lost state.

I worry about this in general, because it breaks a basic container-like workload: you can't (safely) rebase to a new version while keeping the existing overlay upper dir. (Any change to a file copies it into the upper layer, where it will always shadow a new version of that file from the image.)

Hmm, keeping the upperdir while we rebase the lowerdir sounds very close to what we want IMO. If we constrain it to /opt (it's not clear whether this makes sense globally on the whole rootfs like rootfs.transient), the data in the upperdir should just be state files, so I wonder how much of an issue the "copied up version wins" part would really be in practice. Even so, it wouldn't be too hard to handle that manually if needed (e.g. like a "transient copyup" model, where we always remove files from the upperdir when newer versions of those files come in from an upgrade, which basically matches the RPM model for regular non-config files).

It seems like this should work well enough to mark coreos/rpm-ostree#233 as closed.


Testing this idea, it seems to work fine at least for Puppet (doing the install part client-side, but it could be made to work just as well in a container layering flow of course):

core@cosa-devsh:~$ sudo rpm-ostree install https://yum.puppetlabs.com/puppet/el/9/x86_64/puppet-agent-7.27.0-1.el9.x86_64.rpm
Downloading https://yum.puppetlabs.com/puppet/el/9/x86_64/puppet-agent-7.27.0-1.el9.x86_64.rpm...done
...
Added:
  puppet-agent-7.27.0-1.el9.x86_64
Changes queued for next boot. Run "systemctl reboot" to start a reboot

core@cosa-devsh:~$ sudo reboot
...
core@cosa-devsh:~$ sudo -i
root@cosa-devsh:~# ls -l /usr/lib/opt
total 0
drwxr-xr-x. 1 root root 20 Jan  1  1970 puppetlabs

Puppet won't start because it can't create under /opt/puppetlabs/puppet/cache/:

root@cosa-devsh:~# systemctl start puppet
root@cosa-devsh:~# systemctl status puppet
× puppet.service - Puppet agent
...
     Active: failed (Result: exit-code) since Mon 2023-12-11 01:11:25 UTC; 1s ago
...
Dec 11 01:11:25 cosa-devsh puppet-agent[1608]: Read-only file system @ dir_s_mkdir - /opt/puppetlabs/puppet/cache/facts.d
Dec 11 01:11:25 cosa-devsh puppet-agent[1608]: (/File[/opt/puppetlabs/puppet/cache/facts.d]/ensure) change from 'absent' to 'directory' failed: Could not set 'directory' on ensure: Read-only file system @ dir_s_mkdir - /opt/puppetlabs/puppet/cache/facts.d
...

(I don't know Puppet at all; Ben's example might be a better test for this given that losing state affects functionality.)

Slap on an overlay:

root@cosa-devsh:~# mkdir -p /var/tmp/puppet/{workdir,upper}
root@cosa-devsh:~# mount -t overlay overlay -olowerdir=/usr/lib/opt,upperdir=/var/tmp/puppet/upper,workdir=/var/tmp/puppet/workdir /usr/lib/opt
root@cosa-devsh:~# systemctl start puppet
root@cosa-devsh:~# systemctl status puppet
● puppet.service - Puppet agent
...
     Active: active (running) since Mon 2023-12-11 01:11:36 UTC; 1s ago
...
Dec 11 01:11:36 cosa-devsh systemd[1]: Started puppet.service - Puppet agent.
Dec 11 01:11:37 cosa-devsh puppet-agent[1615]: Starting Puppet client version 7.27.0

Let's mock an upgrade (in this case a downgrade actually since 7.27.0 is the latest):

root@cosa-devsh:~$ rpm-ostree uninstall puppet-agent --install https://yum.puppetlabs.com/puppet/el/9/x86_64/puppet-agent-7.26.0-1.el9.x86_64.rpm
Downloading https://yum.puppetlabs.com/puppet/el/9/x86_64/puppet-agent-7.26.0-1.el9.x86_64.rpm...done
...
Downgraded:
  puppet-agent 7.27.0-1.el9 -> 7.26.0-1.el9
Changes queued for next boot. Run "systemctl reboot" to start a reboot
root@cosa-devsh:~$ reboot

Reslap on the upperdir on reboot:

root@cosa-devsh:~# mount -t overlay overlay -olowerdir=/usr/lib/opt,upperdir=/var/tmp/puppet/upper,workdir=/var/tmp/puppet/workdir /usr/lib/opt
root@cosa-devsh:~# systemctl start puppet
root@cosa-devsh:~# systemctl status puppet
● puppet.service - Puppet agent
     Loaded: loaded (/usr/lib/systemd/system/puppet.service; disabled; preset: disabled)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: active (running) since Sat 2023-12-11 01:14:57 UTC; 1s ago
       Docs: man:puppet-agent(8)
   Main PID: 1395 (puppet)
      Tasks: 2 (limit: 2230)
     Memory: 64.6M
        CPU: 622ms
     CGroup: /system.slice/puppet.service
             └─1395 /opt/puppetlabs/puppet/bin/ruby /opt/puppetlabs/puppet/bin/puppet agent --no-daemonize

Dec 11 01:14:57 cosa-devsh systemd[1]: Started puppet.service - Puppet agent.
Dec 11 01:14:57 cosa-devsh (puppet)[1395]: puppet.service: Referenced but unset environment variable evaluates to an empty string: PUPPET_EXTRA_OPTS
Dec 11 01:14:58 cosa-devsh puppet-agent[1395]: Starting Puppet client version 7.26.0

The nice thing about this is that we're still fitting the OSTree model; code in /usr and data in /var. But yet, it looks and feels just like on traditional systems. Keeping state in /var also means that e.g. attaching a larger disk to /var will still work as expected (e.g. if the state grows large).

Another nice thing is that /opt would now be a symlink to /usr/lib/opt, which matches what rpm-ostree does today when importing these RPMs. It wouldn't break anything writing to /opt currently since it'd now be writable. (We'd want to make /var/opt also link to /usr/lib/opt for compatibility with things that hardcoded that path too.)

I think /usr/local could be structured the same way as /opt: /usr/local would now be a directory in both the OSTree commit itself (now allowed for RPMs) and in the deployment, but with an overlay on top. I don't think intermixed state is an issue for /usr/local, but doing this would allow installing RPMs that put content in there without breaking code that currently assumes it's writable.

Since this doesn't break existing flows (e.g. via Ignition or the MCO), this approach could also more easily be rolled out to FCOS and OCP.

Thoughts? I'm sure I'm probably missing something.
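The "transient copyup" variant mentioned above could be sketched as a pruning pass that runs when a new image is deployed. This is only a model on plain directories, not how ostree or overlayfs implement anything; the `lower-old`, `lower-new` and `upper` names are hypothetical, and the pass is not transactional.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
old="$work/lower-old"
new="$work/lower-new"
upper="$work/upper"
mkdir -p "$old/app" "$new/app" "$upper/app"

# The upgrade ships a new version of app/tool.
echo 'binary v1' > "$old/app/tool"
echo 'binary v2' > "$new/app/tool"

# The upper layer holds a copied-up tool plus a pure state file.
echo 'binary v1 (patched at runtime)' > "$upper/app/tool"
echo 'activation token' > "$upper/app/state"

# Drop an upper-layer file when the upgrade brings a new or changed
# version of the same path; files that exist only in upper are kept.
(cd "$upper" && find . -type f) | while IFS= read -r rel; do
    if [ -f "$new/$rel" ] && ! cmp -s "$old/$rel" "$new/$rel" 2>/dev/null; then
        rm -f "$upper/$rel"
        echo "dropped $rel"
    fi
done
```

State files that the image never shipped survive, while runtime modifications to shipped files are discarded, which roughly matches how RPM treats regular non-config files.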

@cgwalters
Member Author

So we're not really shielding people from /opt being special if they're required to take extra measures to deal with lost state.

This is definitely right; in the general case people will need to be aware of it. But the scope of necessary changes is much smaller than forcing them to entirely move where things are placed into /usr or /var.

Hmm, keeping the upperdir while we rebase the lowerdir sounds very close to what we want IMO.

Overlayfs used to not support offline changes to the lower layer (made while the overlay is not mounted) at all; it looks like nowadays it does, with the appropriate mount options. I don't have much experience with this. Hmm, turning off "metadata only copy up" would be a big performance hit on systems with reflinks.

I think the strong argument against this though is basically: It's not how Docker worked when it came out. The semantic for / is easy to explain with the current code.

(e.g. like a "transient copyup" model, where we always remove files from the upperdir when newer versions of those files come in from an upgrade, which basically matches the RPM model for regular non-config files).

This however is not easy to explain or implement (especially in a transactional way)! You're arguing to treat all of / as ostree treats /etc effectively right?

Another nice thing is that /opt would now be a symlink to /usr/lib/opt, which matches what rpm-ostree does today when importing these RPMs.

Except not in a container, where we want this (which is fixable); but more importantly, it's not what dnf or rpm do.

It feels like bigger picture the debate here is basically:

  • Make a system more like ostree
  • Make a system more like docker

We already have the first; the goal is the second.

@jlebon
Member

jlebon commented Dec 11, 2023

Hmm, turning off "metadata only copy up" would be a big performance hit on systems with reflinks.

But note we only expect state files in the upper layer, so a copyup shouldn't be that common.

I think the strong argument against this though is basically: It's not how Docker worked when it came out. The semantic for / is easy to explain with the current code.

OTOH just to highlight, the main argument for it is that it fits quite well in the OSTree model as it exists today. :) We could fix /opt and /usr/local for any OSTree user (without breaking existing flows), not just OSTree + bootc.

(e.g. like a "transient copyup" model, where we always remove files from the upperdir when newer versions of those files come in from an upgrade, which basically matches the RPM model for regular non-config files).

This however is not easy to explain or implement (especially in a transactional way)! You're arguing to treat all of / as ostree treats /etc effectively right?

No, only /opt and /usr/local. I mentioned in the comment: "it's not clear whether this makes sense globally on the whole rootfs like rootfs.transient". Re. the "transient copyup", I'll stress that it's unclear whether we really need this at all. Sure, a user could start mucking around in parts of /opt they shouldn't, but presumably they know what they're doing if they're doing that. (Yes, it's not as strong a guarantee as we can provide for /usr today; I think for that, we'd need an overlay mount option that blocks copyup entirely.)

Another nice thing is that /opt would now be a symlink to /usr/lib/opt, which matches what rpm-ostree does today when importing these RPMs.

Except not in a container, where we want this (which is fixable); but more importantly, it's not what dnf or rpm do.

Right, for the container flow, it would still be /opt and moved to /usr/lib/opt on import.

It feels like bigger picture the debate here is basically:

  • Make a system more like ostree
  • Make a system more like docker

Agreed. I'm likely missing context here, but ISTM the base problem we're trying to fix is /opt packages, right? We did that by adding a concept of "transient rootfs to feel like Docker", but that's more an alternative way of thinking about the host than a direct fix for /opt, right? I definitely see room for that approach, but in that mode, I think it'd be conceptually much clearer to truly make it transient as @stefwalter suggested. By way of example, the list in coreos/rpm-ostree#4719 (/opt, /media, /mnt, and /usr/local) seems semi-arbitrary and points more towards actually being a way to make /opt and /usr/local RPMs work (otherwise, for consistency we'd also enable transient overlays for /usr too).

If we fix the /opt case independently, then we have more room to have a transient model that truly matches containers, and yes requiring users to mount "PVCs" most of the time (though not retaining any state at all is definitely a use case).

@jlebon
Member

jlebon commented Dec 11, 2023

All that said, I won't harp on this too long. I see a huge opportunity here to possibly fix a longstanding issue (and I think the code paths touched are very similar to those touched for rootfs.transient), and happy to put code where my mouth is if there's general consensus. :) But I'm also OK with shipping an intermediate solution for now to get unblocked. Maybe let's rename rootfs.transient to e.g. exp.rootfs.transient and leave the non-experimental version for when we actually do a fully transient mode?

@cgwalters
Member Author

By way of example, the list in coreos/rpm-ostree#4719 (/opt, /media, /mnt, and /usr/local) seems semi-arbitrary

Note that most of that code is only necessary (as the comment says) because the Fedora filesystem package is totally wacky and actually makes all of its content in a Lua %pretrans which we don't run.

If we didn't have that bug then the overrides in place for root.transient would just be the /home and /root symlinks.

and points more towards actually being a way to make /opt and /usr/local RPMs work (otherwise, for consistency we'd also enable transient overlays for /usr too).

Yes, you are right that it is inconsistent for /usr/local to still be read-only. However, IME much less code does dynamic writes to /usr/local. Code using that is generally still designed to log to /var/log (or use the journal). Whereas /opt has a much bigger legacy of trying to implement a "self contained app dir" model.

If we do run into issues with code that expects writability to /usr/local then it's an obvious extension to add usr.transient = true as well.

@jlebon
Member

jlebon commented Dec 11, 2023

and points more towards actually being a way to make /opt and /usr/local RPMs work (otherwise, for consistency we'd also enable transient overlays for /usr too).

Yes, you are right that it is inconsistent for /usr/local to still be read-only.

I mean /usr as a whole, not just /usr/local. From a user's point of view, why should rootfs.transient have different semantics for /opt and /usr?

@cgwalters
Member Author

I mean /usr as a whole, not just /usr/local. From a user's point of view, why should rootfs.transient have different semantics for /opt and /usr?

This is a reasonable question. But flipping things around a bit...we will have different mounts set up by default. For example, we're not compromising on /tmp being a tmpfs (and be cleaned up by systemd-tmpfiles). Even more important, we're retaining our default persistent /var which I think most use cases want.

The /usr-as-readonly-OS has a very long history (kickstarted by systemd at least 10+ years ago). We've had reasonable success with it since then. We haven't had significant problems with it AFAIK - only with things like /opt and also use cases like wanting random new toplevel mountpoints (e.g. /afs) etc. that a transient root also addresses.

I guess the way I am thinking of things, in this mode we keep the "ostree core model of /etc, /usr, /var" semantics - but things outside of it now have "transient overlay" semantics instead of just being disallowed.

(One tangential thing here is systemd is very focused on "/usr-is-OS" and "/etc is empty", which is even stricter than ostree semantics, but they are compatible)

@cgwalters
Member Author

A different tangential but notable thing: because overlayfs doesn't implement atime for the lower layer in this mode, we effectively start ignoring atime changes for all the OS state in /usr too, which is honestly just great! (But we get that with composefs too.)

@cgwalters
Member Author

Maybe let's rename rootfs.transient to e.g. exp.rootfs.transient and leave the non-experimental version for when we actually do a fully transient mode?

Can you clarify here: Is "fully transient mode" here just having /usr also be writable/transient?

@jlebon
Member

jlebon commented Dec 11, 2023

Maybe let's rename rootfs.transient to e.g. exp.rootfs.transient and leave the non-experimental version for when we actually do a fully transient mode?

Can you clarify here: Is "fully transient mode" here just having /usr also be writable/transient?

Yeah, exactly. But of course it still wouldn't be consistent, because programs in /usr keep their state in /var, which would persist, whereas for programs in /opt that keep their state in /opt, it will not. But it'd be closer to the "Docker style" it's looking to emulate.

Another approach is to keep it as is, but not frame it as a "Docker style" knob because the bits that retain state far outweigh the bits that don't.

I would still like to investigate the rebased upperdir approach though. I don't think it conceptually conflicts with root.transient. Would you agree?

@cgwalters
Member Author

cgwalters commented Dec 11, 2023

Another approach is to keep it as is, but not frame it as a "Docker style" knob because the bits that retain state far outweigh the bits that don't.

While the term "docker" is definitely in this issue, it's not in the current docs (which are pretty small, and clearly need elaboration and probably graphics).

I would still like to investigate the rebased upperdir approach though. I don't think it conceptually conflicts with root.transient. Would you agree?

I'm not opposed in theory to supporting something like this as it'd just be a tweak on top of what overlayfs supports, but I'd like to dig deeper into exactly what it would fix, what the config option would look like, and really ideally have concrete examples of existing software that would be fixed.

Are you thinking of cases like the one Ben mentioned with

I think applications like CrowdStrike will struggle, as their client performs a type of "activation" and maintains state with the controller all under the /opt/crowdstrike/ directory.

?

If so, then yes, I think your proposal ("rootfs.merge") would fix cases like that automatically.

But a notable tricky thing here is the case of "what happens if the app modifies a file that's in the base image". With RPM (and dpkg I think), unless something is explicitly marked as a config file, it will get replaced on update, even if it's modified (I'm not sure RPM even checks if it's changed). So for app binaries if they happen to be written to at runtime, we'll still reliably get updated binaries.

I'm worried about corner cases like Python apps that end up recompiling the bytecode at runtime (just because it's writable and timestamps...) and hence touch the .pyc/.pyo file. If we implement this with overlayfs and your proposal, then cases like this will break. And the evil thing is this breakage will not be very obvious. (Of course we could add a diff command, but still...I think most admins would only turn to that when they were close to figuring out the problem anyways)
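The kind of "diff command" mentioned here could, as a rough sketch, list upper-layer files that shadow something shipped in the image. This works on plain directories rather than a real overlay mount, and the `lower`/`upper` layout and file names are invented for illustration.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
lower="$work/lower"
upper="$work/upper"
mkdir -p "$lower/app" "$upper/app"

# The image ships bytecode; the app rewrote it at runtime and also
# created a state file of its own.
echo 'bytecode from image' > "$lower/app/mod.pyc"
echo 'bytecode recompiled at runtime' > "$upper/app/mod.pyc"
echo 'runtime state' > "$upper/app/state.db"

# Report upper-layer files that shadow a file shipped in the lower layer.
shadowed=$(cd "$upper" && find . -type f | sort | while IFS= read -r rel; do
    if [ -f "$lower/$rel" ]; then
        echo "$rel"
    fi
done)
echo "$shadowed"
```

Anything this reports is a candidate for silently stale content after an update, while files that exist only in the upper layer are ordinary state.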

@jlebon
Member

jlebon commented Dec 11, 2023

While the term "docker" is definitely in this issue, it's not in the current docs (which are pretty small, and clearly need elaboration and probably graphics).

👍

I'm not opposed in theory to supporting something like this as it'd just be a tweak on top of what overlayfs supports, but I'd like to dig deeper into exactly what it would fix, what the config option would look like, and really ideally have concrete examples of existing software that would be fixed.

Heh, I would posit the reverse: since most apps have state, I would expect some loss of functionality from not retaining it. Obviously the degree of functionality loss varies. E.g. even the Puppet example we used, which probably works OK in root.transient mode, would lose data about its last run (and basically anything under $statedir and $publicdir on that page).

But the other thing I'll stress here is that the state overlay approach is compatible with existing systems which means we'd fix the /opt and /usr/local cases for those too without having to force them through the backwards incompatible transition to root.transient.

But a notable tricky thing here is the case of "what happens if the app modifies a file that's in the base image". With RPM (and dpkg I think), unless something is explicitly marked as a config file, it will get replaced on update, even if it's modified (I'm not sure RPM even checks if it's changed). So for app binaries if they happen to be written to at runtime, we'll still reliably get updated binaries.

Yeah exactly. This is what I'm talking about in #3113 (comment) (sentence starting with "Even so"). I think we'd need to address it eventually, but we'd get quite far I think without worrying about it to start.

I'm worried about corner cases like Python apps that end up recompiling the bytecode at runtime (just because it's writable and timestamps...) and hence touch the .pyc/.pyo file. If we implement this with overlayfs and your proposal, then cases like this will break. And the evil thing is this breakage will not be very obvious. (Of course we could add a diff command, but still...I think most admins would only turn to that when they were close to figuring out the problem anyways)

Yes good point. Something to investigate.
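One way to picture the "diff command" idea mentioned above is a small walk over the overlay's upper directory that reports every file hiding a same-named file in the lower directory. This is only a sketch under assumptions: the function name `shadowed_files` and the directory arguments are hypothetical, and nothing here is taken from the actual OSTree implementation.

```python
import os


def shadowed_files(lower: str, upper: str) -> list[str]:
    """List files in `upper` that hide a same-path entry in `lower`.

    Paths are returned relative to the overlay root. Overlayfs whiteouts
    (which mark deletions) are non-directories too, so they are reported
    as well whenever the lower entry still exists.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(upper):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), upper)
            # lexists() also catches dangling symlinks in the lower dir.
            if os.path.lexists(os.path.join(lower, rel)):
                hits.append(rel)
    return sorted(hits)
```

A rewritten `.pyc` file would show up in this list, since the copy in the upper directory shadows the one shipped in the base image.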

@cgwalters
Member Author

Heh, I would posit the reverse: since most apps have state, I would expect some loss of functionality from not retaining it.

Right, that's why we've had the strict model since forever.

But the other thing I'll stress here is that this approach is compatible with existing systems which means we'd fix the /opt and /usr/local cases for those too without having to force them through the backwards incompatible transition to root.transient.

I think your proposal here needs an explicit name (rootfs.merge or something) because the pronoun "this" gets ambiguous quickly otherwise. I think here "this" = rootfs.merge, not root.transient, right?

I'm a bit confused though because on existing systems that have e.g. /opt -> /var/opt...if we were trying to transition them in place to rootfs.merge we'd need to move all the files back into an /opt directory to do a merge, right? Or are you thinking something like /opt would become an overlayfs that would include content from /var/opt in its lower?

jlebon added a commit to jlebon/ostree that referenced this issue Dec 14, 2023
In the OSTree model, executables go in `/usr`, state in `/var` and
configuration in `/etc`. Software that lives in `/opt` however messes
this up because it often mixes code *and* state, making it harder to
manage.

More generally, it's sometimes useful to have the OSTree commit contain
code under a certain path, but still allow that path to be writable by
software and the sysadmin at runtime (`/usr/local` is another instance).

Add the concept of state overlays. A state overlay is an overlayfs
mount whose upper directory, which contains unmanaged state, is carried
forward on top of a lower directory, containing OSTree-managed files.

In the example of `/usr/local`, OSTree commits can ship content there,
all while allowing users to e.g. add scripts in `/usr/local/bin` when
booted into that commit.

Some reconciliation logic is executed whenever the base is updated so
that newer files in the base are never shadowed by a copied up version
in the upper directory. This matches RPM semantics when upgrading
packages whose files may have been modified.

For ease of integration, this is exposed as a systemd template unit which
any downstream distro/user can enable. The instance name is the mountpath
in escaped systemd path notation (e.g.
`ostree-state-overlay@usr-local.service`).

See discussions in ostreedev#3113 for
more details.
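To illustrate how a mount path maps to an instance name such as `ostree-state-overlay@usr-local.service`, here is a rough Python sketch of systemd's path escaping as I understand it (the behaviour of `systemd-escape --path`). The helper names are hypothetical, and on a real system `systemd-escape` itself is the authoritative tool.

```python
def systemd_escape_path(path: str) -> str:
    """Approximate `systemd-escape --path` for an absolute path."""
    # Drop leading, trailing, and duplicate slashes.
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "-"  # the root directory escapes to a single dash
    out = []
    for i, byte in enumerate("/".join(parts).encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            out.append(ch)
        else:
            # Everything else, including a literal "-", becomes \xNN.
            out.append(f"\\x{byte:02x}")
    return "".join(out)


def state_overlay_unit(mountpath: str) -> str:
    """Build the template instance name for a given mount path."""
    return f"ostree-state-overlay@{systemd_escape_path(mountpath)}.service"
```

For example, `state_overlay_unit("/usr/local")` yields the unit name quoted in the commit message, while a path containing a dash gets that dash escaped as `\x2d`.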
jlebon added a commit to jlebon/rpm-ostree that referenced this issue Dec 14, 2023
This solves the `/opt` problem by using the new state overlay concept in
OSTree: an overlay filesystem is mounted on top of `/usr/lib/opt` and
the upper dir is automatically "rebased" whenever new content comes in.
Concretely, this means that app state is carried forward, all while
allowing the (OSTree-managed) package contents to be updated.

We also solve the `/usr/local` problem the same way. The app state issue
isn't really present there, but `/usr/local` has traditionally been
system state. We want to keep supporting dropping files there all while
also supporting shipping OSTree-owned content.

See also: ostreedev/ostree#3113
Fixes: coreos#233
@jlebon
Member

jlebon commented Dec 14, 2023

I think your proposal here needs an explicit name (rootfs.merge or something) because the pronoun "this" gets ambiguous quickly otherwise. I think here "this" = rootfs.merge, not root.transient, right?

I'm using the term "state overlays" in code. Updated the comment for clarification.

I'm a bit confused though because on existing systems that have e.g. /opt -> /var/opt...if we were trying to transition them in place to rootfs.merge we'd need to move all the files back into an /opt directory to do a merge, right? Or are you thinking something like /opt would become an overlayfs that would include content from /var/opt in its lower?

Migrating existing nodes is certainly possible. At its core, I think we'd basically move the /var/opt content to the upper dir (e.g. /var/ostree/state-overlay/usr-lib-opt/upper) before assembling the overlay. This will automatically make all of its content upper-dir state. (We could even make the directory itself become the upper dir; unless the user mounted a different filesystem under /var/opt, that should just be a single rename() call.)
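The single-rename idea could look roughly like the sketch below. The function name `adopt_as_upper` is hypothetical, and the error handling reflects my assumption that a separately mounted /var/opt would make rename() fail with `EXDEV` or `EBUSY`, in which case the caller would need a slower copy-based fallback.

```python
import errno
import os


def adopt_as_upper(old_state: str, upper: str) -> bool:
    """Turn `old_state` into the overlay upper dir with one rename().

    Returns False when the directory cannot be renamed in place (for
    example because it lives on, or is, a separate mount); any other
    error is re-raised.
    """
    os.makedirs(os.path.dirname(upper), exist_ok=True)
    try:
        os.rename(old_state, upper)
    except OSError as err:
        if err.errno in (errno.EXDEV, errno.EBUSY):
            return False
        raise
    return True
```

Called as `adopt_as_upper("/var/opt", "/var/ostree/state-overlay/usr-lib-opt/upper")`, everything previously under /var/opt would become upper-dir state without copying any data.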

I think the tricky thing is doing it in a resilient way that will work seamlessly across a rollback. For that, I think we'll probably have to ship a systemd service ahead of time that moves /var/opt to /var/ostree/... and then makes /var/opt a symlink to either that directory (before the transition) or /usr/lib/opt (after the transition).
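For the symlink switch described here, a crash-safe way to repoint an existing symlink is to create the new link under a temporary name and rename it over the old one. This is a generic sketch, not the migration service itself; `retarget_symlink` and the temporary naming scheme are made up for illustration, and it assumes the path is already a symlink rather than a real directory.

```python
import os


def retarget_symlink(link: str, target: str) -> None:
    """Atomically point the existing symlink `link` at `target`."""
    tmp = f"{link}.tmp-{os.getpid()}"
    os.symlink(target, tmp)
    try:
        # rename(2) swaps the link itself, so readers see either the old
        # or the new target, never a missing path.
        os.replace(tmp, link)
    except OSError:
        os.unlink(tmp)
        raise
```

Because the swap is a single rename, a rollback could run the same helper with the previous target.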

Anyway, for now, I think let's explore how well this option works before worrying about migration. I opened #3120 and coreos/rpm-ostree#4728. The other big piece missing of course is changing the container path to move /opt to /usr/lib/opt at import time (filed ostreedev/ostree-rs-ext#573 for that).

jlebon added a commit to jlebon/rpm-ostree that referenced this issue Jan 9, 2024
jlebon added a commit to jlebon/ostree that referenced this issue Jan 9, 2024
jlebon added a commit to jlebon/ostree that referenced this issue Jan 10, 2024
jlebon added a commit to jlebon/rpm-ostree that referenced this issue Jan 11, 2024
jlebon added a commit to jlebon/rpm-ostree that referenced this issue Jan 24, 2024
jlebon added a commit to jlebon/rpm-ostree that referenced this issue Jan 26, 2024
cgwalters pushed a commit to coreos/rpm-ostree that referenced this issue Jan 30, 2024
This solves the `/opt` problem by using the new state overlay concept in
OSTree: an overlay filesystem is mounted on top of `/usr/lib/opt` and
the upper dir is automatically "rebased" whenever new content comes in.
Concretely, this means that app state is carried forward, all while
allowing the (OSTree-managed) package contents to be updated.

We also solve the `/usr/local` problem the same way. The app state issue
isn't really present there, but `/usr/local` has traditionally been
system state. We want to keep supporting dropping files there all while
also supporting shipping OSTree-owned content.

See also: ostreedev/ostree#3113
Fixes: #233