systemd zvol target #9016
base: master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9016 +/- ##
==========================================
+ Coverage 75.17% 79.10% +3.92%
==========================================
Files 402 419 +17
Lines 128071 123696 -4375
==========================================
+ Hits 96283 97846 +1563
+ Misses 31788 25850 -5938
Flags with carried forward coverage won't be shown.
Continue to review full report at Codecov.
I did some scalability testing with this code.
Not related to this code, but I've noticed that /usr/lib/systemd/system-generators/zfs-mount-generator, introduced in #7329, is not a valid path for generators on Ubuntu, so I had to move it to /lib/systemd/system-generators/zfs-mount-generator to get this to work. I'll file a separate bug for that.
I've tested this on the same system as for #8975, which had ~1000 zvols. I noticed that many of the units timed out (269 in total) with this kind of message:
sh[6709]: zvol domain0/VOLS/zvol499 unavailable
We could probably extend the timeout if needed.
One scalability issue is that this feature requires enabling history_event-zfs-list-cacher.sh, which does a zfs list on each dataset addition or removal, so it may not be practical on systems with a large number of datasets. There are also scalability issues with systemd itself, but that's another matter.
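For context, a rough sketch of the kind of work the cacher does per event; this is an illustration only, not the actual zedlet, and the property list and the use of ZEVENT_POOL are simplified assumptions:

```sh
#!/bin/sh
# Illustration only: on every dataset create/destroy/rename event, the
# cacher re-lists the whole pool and rewrites the per-pool cache file
# that zfs-mount-generator reads at boot.  Cost grows with dataset count.
pool="${ZEVENT_POOL:-tank}"                    # "tank" is a placeholder
cache="/etc/zfs/zfs-list.cache/${pool}"
zfs list -H -t filesystem,volume -r -o name,mountpoint,canmount \
    "${pool}" > "${cache}.tmp" && mv "${cache}.tmp" "${cache}"
```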
There is definitely value in this feature for systems that have a lower number of zvols, so I think it's reasonable to keep this as an opt-in feature for now. I believe this feature should be able to coexist with #8975, which doesn't provide granularity but does provide scalability and a stand-alone script to wait for zvols to be ready (it can be called after a pool has been imported manually). To keep the naming a bit more consistent, I can rename zfs-volume.target to zfs-volumes.target.
DefaultDependencies=no

[Service]
ExecStart=sh -c 'timeout=30 ; while [ $timeout -gt 0 ] ; do [ -e "/dev/zvol/$0" ] && exit 0 ; sleep 1 ; timeout=$((timeout-1)); done ; echo "zvol $0 unavailable"' %I
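Here %I is systemd's unescaped instance name, which sh -c receives as $0, so the loop polls for /dev/zvol/&lt;instance&gt; once per second for up to 30 seconds before printing the "unavailable" message and exiting successfully anyway.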
Should be /bin/sh. systemd requires the full path for ExecStart scripts (at least on Ubuntu).
The man page (in Debian and Ubuntu) says that "It is thus safe to use just the executable name in case of executables located in any of the "standard" directories". It does recommend absolute paths to "avoid ambiguity". Could you confirm that this is actually broken in Ubuntu? If there's even one example where this could be a problem, I think it's worth adjusting.
Omfg. Ubuntu 12.04, 18.10, and 19.04 allow non-absolute paths, while 14.04, 16.04, and 18.04 require absolute paths (they fail with "Executable path is not absolute, ignoring: sh -c"). WTF are they doing...
In short: yes, ExecStart must be absolute here.
@c0d3z3r0 Strange! Thanks for testing, I've updated this to use /bin/sh.
Thanks for looking at this @pzakha!
That's a packaging issue as far as I can tell. The Debian source (which I've made sure stayed synchronized) has the correct settings (in …).
I was worried about this, actually. An earlier version of the zed scriptlet tried to only update the affected datasets. We dropped it at the time because it was getting complicated. Perhaps it can be revisited, given the demonstrated need.
I feel like …
@aerusso @pzakha I think we should strongly consider pulling this change and #8975 into a single PR. This would let us easily resolve any discrepancies; for example, we'll need to settle on …. My suggestion would be to lay this out the same way as the ….
Does that sound like a reasonable way forward for both of these PRs?
I was planning to make two changes to #8975 to make it consistent with this PR: …
I think that, given those changes, this PR could be a natural follow-up, as bundling them together in a single review might delay integration of #8975. @aerusso, what do you think?
Yes, you are right. It looks like we do not explicitly set …
This looks fine in my preliminary testing, although I'm wondering if we should extend the default timeout of 30 seconds.
@@ -46,6 +46,8 @@ fi
# avoid regressions, this dependency is reduced to "wants" rather than
# "requires". **THIS MAY CHANGE**
req_dir="${dest_norm}/local-fs.target.wants/"
zvol_reqdir="${dest_norm}/zfs-volumes.target.requires"
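As an illustration of how that directory would typically be consumed, here is a hypothetical continuation, not the actual generator code; the ${dataset} variable and the unit path are assumed:

```sh
# For each eligible zvol, add a Requires= symlink so zfs-volumes.target
# pulls in the corresponding template instance.  systemd-escape turns
# e.g. "pool/vol1" into "pool-vol1".
mkdir -p "${zvol_reqdir}"
instance="$(systemd-escape "${dataset}")"
ln -s /usr/lib/systemd/system/zfs-volume@.service \
    "${zvol_reqdir}/zfs-volume@${instance}.service"
```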
Now that we've integrated #8975, do you think we still need to add an additional requires on zfs-volumes.target?
I believe so (enabling the zfs-volume@ units should presumably allow the user to disable zfs-volume-wait.service).
I feel like I may be misunderstanding something.
My original idea was to only use zfs-volume-wait.service as a dependency for zfs-volumes.target, but that's mostly because using it along with zfs-volume@ as a dependency was redundant. Disabling zfs-volume-wait.service when we are using zfs-volume@ would be another way to do this.
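To make the two wirings concrete, a sketch; the unit names come from the thread, but the contents here are assumptions rather than what was merged:

```ini
# Option A (coarse-grained, as in #8975): one wait service, pulled in by
# the target via its install section, e.g. zfs-volume-wait.service with
[Install]
WantedBy=zfs-volumes.target

# Option B (this PR): per-zvol units generated into
# zfs-volumes.target.requires/ at boot, after which the coarse wait
# service is redundant and could be disabled:
#   systemctl disable zfs-volume-wait.service
```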
Force-pushed 692600b to bc3ce98
I've rebased on master and implemented the changes. I have not tested it to see how it interacts with #8975, but I've disabled the template units by default. Presumably, we should add some documentation about disabling …
That sounds good to me. I think we may also want to mention the trade-offs between the two implementations, especially regarding scalability. I'm also wondering if the 30-second timeout is too aggressive. If the …
etc/systemd/system/50-zfs.preset.in (outdated)
@@ -4,6 +4,8 @@ disable zfs-import-scan.service
enable zfs-import.target
enable zfs-mount.service
enable zfs-share.service
disable zfs-volumes.target
disable zfs-volume@.service
Not sure if that is needed, since the service only comes into effect when some zfs-volume@volname is enabled.
I will gladly enable this if this is the consensus. We seem to be moving forward very cautiously with this zfs-mount-generator project, and I don't want to be too aggressive.
I think it would be reasonable to enable this and the zfs-mount-generator by default in the next tagged release. Why don't we go ahead and merge this with it currently disabled, then open a new PR to enable the generators by default? This way it'll be enabled in the master branch by default well before anything is tagged.
I'm OK with the target disabled, but disabling the template is nonsense, as it does not disable anything ;-)
So, line 8 can be dropped, but line 7 must remain.
Ah, one more thing... zfs-volume@.service isn't even an installable unit, so there is no need for the disable line, because it can't be enabled anyway.
Yeah, disable zfs-volume@*.service might parse (the documentation on these preset files is pretty sparse), but it's also unnecessary, since the unit would only exist if you were trying to use this.
Line 7 should also go. I added that disable out of an abundance of caution (not creating and enabling a feature), but this patch doesn't create that .target anymore, so I'm not going to disable it.
DefaultDependencies=no

[Service]
ExecStart=/bin/sh -c 'timeout=30 ; while [ $timeout -gt 0 ] ; do [ -e "/dev/zvol/$0" ] && exit 0 ; sleep 1 ; timeout=$((timeout-1)); done ; echo "zvol $0 unavailable"' %I
This is systemd, no need for such a complicated construct :-)
BindsTo=%i.device
After=%i.device
Conflicts=shutdown.target # maybe not needed, needs testing
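For illustration, a minimal sketch of what a device-bound template along these lines could look like; the instance naming convention and file contents are assumptions, not necessarily what the PR ended up with:

```ini
[Unit]
Description=ZFS volume %I
DefaultDependencies=no
# Instantiated with the escaped device path, e.g.
# zfs-volume@dev-zvol-pool-vol1.service, so that %i.device resolves to
# dev-zvol-pool-vol1.device (the udev-provided unit for /dev/zvol/pool/vol1).
BindsTo=%i.device
After=%i.device

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
```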
Wow, that's a lot nicer!
I'm not super familiar with how the timeout on this works. If this device is slow to show up, will this give up and enter a failed state, or remain in a waiting state forever? Is this what JobTimeoutSec= does (and if so, when does the job become "queued")? The other timeout setting I see in man systemd.unit seems to apply only after the job has fired. Is my understanding correct?
Sorry for my late response, I will have another look later today.
JobTimeoutSec is there for a) setting a timeout other than the default (infinity) for normal units, and b) overriding the default device unit timeout (which is specified by DefaultTimeoutStartSec=). In our case b) applies, and we should not modify that upstream. DefaultTimeoutStartSec is 90s on Debian, for example.
When the timeout is reached, the unit gets aborted and enters the "failed" state. What happens then depends on the dependencies specified by "Wants" and "Requires". When the zvol target requires all zvol@.service units and the zvol target is required by multi-user.target, systemd will completely refuse to boot and go to recovery mode. This is not what we want, but since zfs-volumes.target is only WantedBy=zfs.target, it only gets marked as failed.
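If someone does need a longer wait for a particular zvol, a local drop-in on its device unit is probably the right place for that rather than changing the shipped defaults; a hypothetical example (the device unit name is invented):

```ini
# /etc/systemd/system/dev-zvol-pool-vol1.device.d/timeout.conf
[Unit]
# Allow up to 5 minutes for this zvol's device to appear.
JobTimeoutSec=300
```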
Force-pushed bc3ce98 to a667bc3
Currently it seems the generator does not create the target directory, but I don't know why yet.
@c0d3z3r0 Could you clarify what you mean by "does not create the target directory"? Are you calling the generator manually? This was not something I ever experienced during testing.
Sorry, that was unclear - I meant the zvol_reqdir doesn't get created.
@c0d3z3r0 I'm very confused how that's possible. Is the …
Not calling it manually; I will do a fresh zfs setup later and test this again.
@c0d3z3r0 I actually think I have a simpler way to do all of this. Let me get that up before wasting time testing an older patch.
Great, lmk when it's ready for testing :)
Force-pushed a667bc3 to 9fde4ad
@c0d3z3r0 Take a look at this version; I think you will be pleased. It drops the extra template unit entirely, opting just to ….
I also was unable to reproduce any issues creating the dependency directory here.
@aerusso hah, very nice solution! congrats :)) LGTM
It seems like we are now depending on …. I've manually created a new zvol …. Is there something else required to have …?
@pzakha I'm running systemd 243 (Debian unstable). I was worried I was going to have to do something, but creating the zvol "just worked for me." In …
and …
So this is supposedly "default" systemd behavior; does your distribution's man page reflect that?
@pzakha Were you testing this on a Delphix Engine? We disabled .device units for zvols because they were causing us to hit the hardcoded systemd limit for the number of units on a system: delphix/delphix-platform#80
Okay, thanks for the clarifications. I was not aware we had it disabled explicitly in the product, so sorry for the confusion. So, given that udev seems to already provide per-zvol units, which I believe was the main selling point for the original PR, what other advantages do we get with this new approach?
So in this case, the idea would be to manually edit the FSLIST to include only the zvols that the administrator cares about, and not use ….
Edit: The alternative would be to add the following statements to whatever service needs those zvols (which should work without this PR): …
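Presumably the statements in question would look something like the following; the device unit name here is purely illustrative:

```ini
[Unit]
# Make the consuming service wait for (and require) the zvol's device unit.
Requires=dev-zvol-pool-vol1.device
After=dev-zvol-pool-vol1.device
```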
This is not my intention with this PR. The idea is to have a "proper" integration into systemd that "does the right thing" automatically. I read through your changes to ….
The way I see it, if people want to fine-tune their systemd unit dependencies, they should depend on specific …
Okay, that makes more sense. Once we bring this to feature parity (excluding the right zvols), it would make sense to remove …. Keeping the ….
This will still have the disadvantage of not scaling well, but I think it should be fine for most folks, and others can keep using …
Can the code, instead of using polling alone, use a combination of inotify/fanotify and polling, or perhaps just inotify/fanotify alone, since it knows ahead of time which volumes are supposed to appear? Does this work well with ZFS encryption of subvolumes?
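A rough sketch of how an inotify-based wait could look, assuming inotifywait from inotify-tools is available; this is not part of the PR, just an illustration of the idea:

```sh
#!/bin/sh
# Wait for one zvol link using inotify instead of pure 1-second polling.
# Each wait has a short timeout so a missed event cannot hang us forever.
zvol="$1"                                   # e.g. pool/vol1 (illustrative)
dev="/dev/zvol/${zvol}"
deadline=$(( $(date +%s) + 30 ))
while [ ! -e "${dev}" ]; do
    if [ "$(date +%s)" -ge "${deadline}" ]; then
        echo "zvol ${zvol} unavailable" >&2
        exit 1
    fi
    # Block until something is created/moved under /dev/zvol, or 1s passes.
    inotifywait -qq -r -t 1 -e create -e moved_to /dev/zvol 2>/dev/null \
        || sleep 1
done
exit 0
```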
@aerusso if you still want to move forward with this tighter systemd integration, can you get this rebased and add the missing functionality?
@behlendorf I'm going to wait for #12138 to land, rather than force it to rebase on this patch (or lose the race and then rebase this a second time).
So, I have a (local, untested) version that is rebased on the new C version of the zfs-mount-generator:

list_zvols() {
    read -r default_volmode < /sys/module/zfs/parameters/zvol_volmode
    zfs list -t volume -H -o \
        name,volmode,receive_resume_token,redact_snaps |
    while IFS=" " read -r name volmode token redacted; do # IFS=\t here!
        # /dev links are not created for zvols with volmode = "none"
        # or for redacted zvols.
        [ "$volmode" = "none" ] && continue
        [ "$volmode" = "default" ] && [ "$default_volmode" = "3" ] &&
            continue
        [ "$redacted" = "-" ] || continue
        # We also ignore partially received zvols if it is
        # not an incremental receive, as those won't even have a block
        # device minor node created yet.
        if [ "$token" != "-" ]; then
            # Incremental receives create an invisible clone that
            # is not automatically displayed by zfs list.
            if ! zfs list "$name/%recv" >/dev/null 2>&1; then
                continue
            fi
        fi
        echo "$name"
    done
}

The problem is that … I'd like to bounce the following proposal off people: refactor zvol_wait to be able to act on a single dataset (or on all datasets, by default, with no command line options), and then call that from a systemd unit. Unfortunately, this gets rid of a lot of the beautiful simplicity we achieved in the last version (where everything was handled nicely by systemd). We also can't use …
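A sketch of the proposed wiring, assuming zvol_wait grows a single-dataset mode as described; the unit name and install path are assumptions:

```ini
# zfs-volume@.service (hypothetical)
[Unit]
Description=Wait for ZFS volume %I
DefaultDependencies=no

[Service]
Type=oneshot
# %I is the unescaped instance name, i.e. the dataset name of the zvol.
ExecStart=/usr/bin/zvol_wait %I
```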
Force-pushed fd6bb46 to 80452d5
Do not link the pthreads library into zfs-mount-generator.
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
zvol_wait provides a mechanism to check whether a zvol has been made available. By design, it only acts on zvols that are expected to become ready in the normal course of operation. This functionality is embedded inside an inner loop, making it unusable for other purposes. This patch isolates that logic and moves it into a common function.
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
This extends zvol_wait to be able to wait for a specific zvol.
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
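Assuming the extended script keeps its current name and simply accepts an optional dataset argument, usage would presumably be:

```sh
zvol_wait                # wait for all zvols expected to appear (existing behavior)
zvol_wait pool/vol1      # wait only for this zvol (pool/vol1 is a placeholder)
```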
Force-pushed 80452d5 to f91b83d
The zfs-mount-generator is used to produce granular systemd mount units that integrate the mounting of zfs datasets into systemd. Extend the zfs-mount-generator to also inform systemd about the initialization of zvols.
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Force-pushed f91b83d to f34b249
I took a quick look and the new logic seems to make sense to me. I'll try to do a full review later this week.
Motivation and Context
This is an alternative to #8975, implementing granular zfs-volume@.service units for each zvol.
Description
Introduce a zfs-volumes.target, which Requires= zfs-volume@.service units that are dynamically generated for each zvol by the existing systemd-generator infrastructure at boot time.
Each unit polls once per second, exiting as soon as it detects the zvol link and succeeding unconditionally after a 30-second timeout.
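For illustration, the generated dependencies under this design could be inspected roughly as follows; the generator output directory and instance names are assumptions:

```sh
# After boot (or a systemctl daemon-reload), list the per-zvol units the
# generator wired into the target.
ls -l /run/systemd/generator/zfs-volumes.target.requires/
systemctl list-dependencies zfs-volumes.target
```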
How Has This Been Tested?
It builds. The target takes 30 seconds to start if a zvol link is missing, and starts up within a second of the link being created. Appropriate systemd links were created after a systemctl daemon-reload. The modified version of zfs-list.cache did not cause interference with mismatched versions of zfs-mount-generator.
Types of changes
Checklist:
Signed-off-by.