New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
deploy: Add a 5s max timeout on global filesystem sync()
#2519
Conversation
6277c0c
to
05c1ba0
Compare
|
This seems sane to me, though OTOH if the goal is to eventually drop it then it seems like a good opportunity to try it out and see what breaks. E.g. we can add an env var or knob that disables the global sync before we make it the default and then enable it in FCOS rawhide and next. |
https://bugzilla.redhat.com/show_bug.cgi?id=2003532 Basically there's a systemd bug where it's losing the `_netdev` aspect of Ceph filesystem mounts. This means the network is taken down before Ceph is unmounted. In turn, our invocation of `sync()` blocks on Ceph, which won't succeed. And this in turn manifests as a failure to transition to the new deployment. I initially did this patch to just rip out the global `sync()`. I am pretty sure we don't need it anymore. We've been doing individual `syncfs()` on `/sysroot` and `/boot` for a while now, and those are the only filesystems we should be touching. But *proving* that is a whole other thing of course. To be conservative, let's instead just add a timeout of 5s on our invocation of `sync()`. It doesn't return any information on success/error anyways. To allow testing without the `sync()` invocation, we also support a new `OSTREE_SYSROOT_OPT_SKIP_SYNC=1` environment variable. For staged deployments, this needs to be injected via e.g. systemd unit overrides into `ostree-finalize-staged.service`. Implementing this is a bit hairy - we need to spawn a thread. I debated blocking in arecursive mainloop, but I think `g_cond_wait_until()` is also fine here.
05c1ba0
to
cb73129
Compare
Good idea! Done. |
|
I did double-check the threading/synchronization part and it looks correct to me. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Re. the env var, a good follow-up would be to make it a repo config knob indeed so we can more cleanly turn it on in some FCOS streams.
Isn't that actually better via a systemd drop-in? |
Was more thinking of an |
https://bugzilla.redhat.com/show_bug.cgi?id=2003532
Basically there's a systemd bug where it's losing the
_netdevaspect of Ceph filesystem mounts. This means the network is taken
down before Ceph is unmounted. In turn, our invocation of
sync()blocks on Ceph, which won't succeed.
And this in turn manifests as a failure to transition to the new
deployment.
I initially did this patch to just rip out the global
sync(). Iam pretty sure we don't need it anymore. We've been doing individual
syncfs()on/sysrootand/bootfor a while now, and thoseare the only filesystems we should be touching. But proving that
is a whole other thing of course.
To be conservative, let's instead just add a timeout of 5s on
our invocation of
sync(). It doesn't return any information onsuccess/error anyways.
Implementing this is a bit hairy - we need to spawn a thread. I
debated blocking in arecursive mainloop, but I think
g_cond_wait_until()is also fine here.