Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deploy: Add a 5s max timeout on global filesystem sync() #2519

Merged
merged 1 commit into from Jan 18, 2022

Conversation

cgwalters
Copy link
Member

https://bugzilla.redhat.com/show_bug.cgi?id=2003532

Basically there's a systemd bug where it's losing the _netdev
aspect of Ceph filesystem mounts. This means the network is taken
down before Ceph is unmounted. In turn, our invocation of sync()
blocks on Ceph, which won't succeed.

And this in turn manifests as a failure to transition to the new
deployment.

I initially did this patch to just rip out the global sync(). I
am pretty sure we don't need it anymore. We've been doing individual
syncfs() on /sysroot and /boot for a while now, and those
are the only filesystems we should be touching. But proving that
is a whole other thing of course.

To be conservative, let's instead just add a timeout of 5s on
our invocation of sync(). It doesn't return any information on
success/error anyways.

Implementing this is a bit hairy - we need to spawn a thread. I
debated blocking in arecursive mainloop, but I think g_cond_wait_until()
is also fine here.

@cgwalters cgwalters force-pushed the syncfs-only branch 3 times, most recently from 6277c0c to 05c1ba0 Compare January 17, 2022 18:50
@jlebon
Copy link
Member

jlebon commented Jan 17, 2022

This seems sane to me, though OTOH if the goal is to eventually drop it then it seems like a good opportunity to try it out and see what breaks. E.g. we can add an env var or knob that disables the global sync before we make it the default and then enable it in FCOS rawhide and next.

https://bugzilla.redhat.com/show_bug.cgi?id=2003532

Basically there's a systemd bug where it's losing the `_netdev`
aspect of Ceph filesystem mounts.  This means the network is taken
down before Ceph is unmounted.  In turn, our invocation of `sync()`
blocks on Ceph, which won't succeed.

And this in turn manifests as a failure to transition to the new
deployment.

I initially did this patch to just rip out the global `sync()`.  I
am pretty sure we don't need it anymore.  We've been doing individual
`syncfs()` on `/sysroot` and `/boot` for a while now, and those
are the only filesystems we should be touching.  But *proving* that
is a whole other thing of course.

To be conservative, let's instead just add a timeout of 5s on
our invocation of `sync()`.  It doesn't return any information on
success/error anyways.

To allow testing without the `sync()` invocation, we also support
a new `OSTREE_SYSROOT_OPT_SKIP_SYNC=1` environment variable.  For
staged deployments, this needs to be injected via e.g. systemd unit
overrides into `ostree-finalize-staged.service`.

Implementing this is a bit hairy - we need to spawn a thread.  I
debated blocking in arecursive mainloop, but I think `g_cond_wait_until()`
is also fine here.
@cgwalters
Copy link
Member Author

E.g. we can add an env var or knob that disables the global sync before we make it the default and then enable it in FCOS rawhide and next.

Good idea! Done.

@lucab
Copy link
Member

lucab commented Jan 18, 2022

I did double-check the threading/synchronization part and it looks correct to me.
Leaving the last pass/stamp to @jlebon for the new phase out flag.

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Re. the env var, a good follow-up would be to make it a repo config knob indeed so we can more cleanly turn it on in some FCOS streams.

@cgwalters
Copy link
Member Author

Re. the env var, a good follow-up would be to make it a repo config knob indeed so we can more cleanly turn it on in some FCOS streams.

Isn't that actually better via a systemd drop-in?

@cgwalters cgwalters merged commit a05b02f into ostreedev:main Jan 18, 2022
@jlebon
Copy link
Member

jlebon commented Jan 18, 2022

Re. the env var, a good follow-up would be to make it a repo config knob indeed so we can more cleanly turn it on in some FCOS streams.

Isn't that actually better via a systemd drop-in?

Was more thinking of an image.yaml knob which turns on the repo option in create_disk.sh. I guess the question is whether we want to affect all nodes or just new nodes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants