
systemd.mount integration #7329

Merged (2 commits, Apr 6, 2018)
Conversation

aerusso
Contributor

@aerusso aerusso commented Mar 21, 2018

Implements the "systemd generator" protocol in zfs-mount-generator

Description

zfs-mount-generator implements the "systemd generator" protocol, producing systemd.mount units from the (possibly cached) output of zfs list during early boot, giving full systemd integration.

The most visible benefit of this is that /etc/fstab can safely refer to zfs mount points, because systemd will take care to mount filesystems in the correct order.
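For illustration, a dataset tank/data with mountpoint=/var/lib/data would yield a unit roughly like the following (a sketch only; names, paths, and exact fields are illustrative rather than the generator's literal output):

  # /run/systemd/generator/var-lib-data.mount (illustrative)
  [Unit]
  Before=local-fs.target

  [Mount]
  Where=/var/lib/data
  What=tank/data
  Type=zfs
  Options=defaults,zfsutil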

Motivation and Context

This PR takes a different approach from #6974, which modified /etc/fstab to reflect ZFS mountpoints. Here, instead, ZFS mounts are tracked by directly creating native systemd .mount units at early boot, from the output of zfs list -H -t filesystem -oname,mountpoint,canmount. Because pools may not yet be imported at that point, the output of this command can be saved in /etc/zfs/zfs-list.cache. If the pools are for some reason already imported at early boot (e.g., ZFS on root), this file can be omitted and the command will be run directly.
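For example, the cache can be seeded once while the pools are imported (a sketch using the command above, written to the path just described):

  zfs list -H -t filesystem -oname,mountpoint,canmount > /etc/zfs/zfs-list.cache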

This generator is not required; it does not interfere with zfs-mount.service, so anything missing from the cache file (or from an unimported pool) will be mounted as before.

As mentioned before, this allows for complex mount hierarchies (e.g., bind mounts that must happen after zfs mounts are made; any other filesystem mounted on top of any ZFS). Notice that ZFS on root users are most likely to want such features, and will not have to create the zfs-list.cache file.

How Has This Been Tested?

I've been using several incarnations of this generator for several months, allowing for some maturity in the patches. E.g., a dependency has been reduced from Requires to Wants to prevent filesystems from being unmounted when zfs systemd units are shuffled around during upgrades.
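One way a generator expresses the weaker dependency is by dropping a symlink into the target's .wants directory instead of writing Requires= into the unit (a sketch of the mechanism; unit name and paths illustrative):

  mkdir -p /run/systemd/generator/local-fs.target.wants
  ln -s ../var-lib-data.mount /run/systemd/generator/local-fs.target.wants/var-lib-data.mount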

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@aerusso aerusso force-pushed the pulls/systemd-generator branch 2 times, most recently from 4ed3050 to 8d27bc6 Compare March 22, 2018 06:40
@Fabian-Gruenbichler
Contributor

concept LGTM at a first glance - not yet tested though ;) hopefully I'll find some time soon.

I wonder whether we want some kind of integration for re-generating the cached dataset list? it does sound a bit cumbersome to have to remember to re-run it after creating or renaming a dataset.. OTOH I am not sure how we'd be able to automate that easily.. maybe some kind of (optional) zfs-zed integration?

@aerusso
Contributor Author

aerusso commented Mar 22, 2018

So, I took a look at cmd/zed/zed.d/README, and decided that I wasn't going to try to implement a new zed event (at least right now). However, I did fix up the script a bit (it works with bash and dash, instead of requiring bash). All of shellcheck.net's valid complaints have been addressed.

Also, I changed the semantics around a little bit:

  1. The dependencies of local-fs.target are reduced to Wants. This reduces the chance of unpleasant surprises if a mount fails. Maybe it gets raised to Requires after it gets a little wider testing.
  2. Instead of clobbering existing .mount units, the generator now aborts.
  3. zfs list is run unconditionally, and its output is always preferred to /etc/zfs/zfs-list.cache.

This should further reduce the need to populate zfs-list.cache (in fact, just running systemctl daemon-reload will re-run generators, producing the desired mount units).

It would be nice if someone running ZFS on root could give this script a try (just toss it in /etc/systemd/system-generators after replacing @sysconfdir@ and @sbindir@ with the correct values).
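Something along these lines should do it (a sketch; the input file name zfs-mount-generator.in and the /etc and /sbin values are assumptions, adjust for your build and distribution):

  sed -e 's|@sysconfdir@|/etc|g' -e 's|@sbindir@|/sbin|g' \
      zfs-mount-generator.in > /etc/systemd/system-generators/zfs-mount-generator
  chmod 755 /etc/systemd/system-generators/zfs-mount-generator
  systemctl daemon-reload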

@AttilaFueloep
Contributor

@aerusso Thanks for that, works like a charm!

I'm running root on ZFS with homegrown boot environments and a complex fs layout (see below). This is a current Arch Linux with zfs-git kmod. Up to now all mount points were set to legacy and got mounted via /etc/fstab. Setting all mount points to the appropriate directories, removing them from fstab and rebooting gave me, as expected, only a mounted /. Then I tossed in your generator and rebooted again. Now all fs got mounted properly. Nice!

I'm still seeing a systemd[1]: var.mount: Directory /var to mount over is not empty, mounting anyway message on boot (and a "failed to unmount /var" message on shutdown), but this was also the case with the old legacy mounts, so nothing new. Looks like systemd (journald?) is writing to /var too early. This has caused no havoc yet, so I can live with that.

zfs mount without generator:

rpool/ROOT/lx-4.15.10-1-ARCH_1   /

zfs mount with generator:

rpool/ROOT/lx-4.15.10-1-ARCH_1  /
rpool/home/root                 /root
rpool/var                       /var
rpool/home                      /home
rpool/opt                       /opt
rpool/vbox                      /vbox
rpool/var/cache                 /var/cache
rpool/home/me                   /home/me
rpool/var/temp                  /var/temp
rpool/vbox/me                   /vbox/me
rpool/tmp/me                    /home/me/tmp
rpool/vbox/me/vms               /vbox/me/vms
rpool/home/me/stuff             /home/me/stuff

I don't know much about systemd, but if you want me to review anyway I would happily do so. Please just drop me a note then.

Thanks again.

@aerusso
Contributor Author

aerusso commented Mar 24, 2018

@AttilaFueloep I'm glad it's working! Thanks for testing it. My setup is very similar, except root is not ZFS (yet). I also cannot get /var/log to unmount at system shutdown (but this is IMO a systemd journald bug/feature depending on how you feel about it).

As for the unclean mounting, I'd guess you have some residual files/directories under /var/ that haven't been cleaned up. systemd is usually pretty careful to not open /var/log/journal until the directory is mounted (check journalctl -b, look for "Starting Flush Journal to Persistent Storage"). If indeed that's happening after /var/ is mounted, you can inspect that by bind mounting / somewhere, and peeking inside /var: i.e.,

# mount --bind / /mnt
# cd /mnt/var
# ls

Then carefully remove/move anything inside there. The original motivation for this patch was the contamination of these mountpoints when things didn't get set up correctly and some services got started.

@Fabian-Gruenbichler
Contributor

Fabian-Gruenbichler commented Mar 25, 2018 via email

@aerusso aerusso force-pushed the pulls/systemd-generator branch 2 times, most recently from 600dc6f to 8778d62 Compare March 26, 2018 23:00
@aerusso
Contributor Author

aerusso commented Mar 26, 2018

I've reworked this again using @Fabian-Gruenbichler's suggestions:

  1. Instead of clobbering existing .mount units, the generator now aborts.

should we maybe log this? after all, this means there is a potential
conflict between a manually set up .mount unit and a generated one
(previously generated ones are cleared before the generator is called)

Done.

I wonder whether it would not be better to drop the call to zfs list
altogether and invest some energy into ZED integration to keep the cache
file current? we'd basically need to hook:

  • pool creation
  • pool import
  • pool export (or not?)
  • filesystem creation (including receive and clone)
  • filesystem destruction (or not?)
  • filesystem rename
  • filesystem mountpoint property changes
  • filesystem canmount property changes

No call to zfs list is made in the systemd generator anymore: if there is any concern over long hangs or serious bugs, those absolutely must not be allowed to interfere with system startup.

The new patch implements a history_event-zfs-list-cacher.sh "ZEDLET" updating zfs-list.cache. It tracks destroy, rename, mountpoint, and canmount changes. New datasets are never added automatically, keeping the administrator in control (how do we feel about this?).
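The event dispatch boils down to something like the following (a simplified sketch of the approach, not the exact zedlet):

  case "${ZEVENT_HISTORY_INTERNAL_NAME}" in
      destroy|rename)
          ;;  # always refresh: a recursive destroy or rename can affect child datasets
      set)
          # only refresh if mountpoint or canmount was changed
          printf '%s' "${ZEVENT_HISTORY_INTERNAL_STR}" |
              grep -q '^\(mountpoint\|canmount\)=' || exit 0
          ;;
      *)  exit 0 ;;
  esac
  # ...then regenerate the cache for ${ZEVENT_POOL}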

Contributor

@Fabian-Gruenbichler Fabian-Gruenbichler left a comment

I also wonder whether we want to deal with blacklisting datasets (besides canmount=off), since there can only be one unit for each mountpoint in systemd, but more than one dataset with the same mountpoint value in ZFS.

. "${ZED_ZEDLET_DIR}/zed-functions.sh"

zed_exit_if_ignoring_this_event

Contributor

add a zed_check_cmd "${ZFS}" with appropriate exit here

(and maybe for wc, sort, diff, xargs, grep, .. ? all of them are essential on Debian, not sure about other distros/ecosystems?)

# because the history event only reports the command issued, rather than
# the affected ZFS, a rename or recursive destroy can affect a child ZFS.
# Instead of trying to figure out if we're affected, we instead just
# regenerate the cache *for that pool* on every modification.
Contributor

given this limitation, wouldn't it make more sense to regenerate also when new datasets are added (e.g., zfs create or even zpool import)?

e.g., right now when a dataset gets renamed, it gets dropped from the cache file but not added again under the new name.

I'd rather have the distinction between "cache file is curated manually, no ZED interference" and "cache file is managed by ZED and regenerated on every potential change event". it might make sense to introduce a zed.rc variable to distinguish these modes, depending on which one gets made the default.

# only act if the mountpoint or canmount setting is altered
printf '%s' "${ZEVENT_HISTORY_INTERNAL_STR}" |
grep -q '^\(mountpoint\|canmount\)=' || exit 0
;;
Contributor

this seems very roundabout - is there no bash/dash compatible way of checking a substring from the start of the string? (I wish we could just write the whole thing in perl to be honest :-P)
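For the record, one portable alternative (a sketch) is a case pattern, which both bash and dash handle without spawning grep:

  case "${ZEVENT_HISTORY_INTERNAL_STR}" in
      mountpoint=*|canmount=*) ;;  # a property we care about, keep going
      *) exit 0 ;;                 # anything else: ignore this event
  esac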


# Get the information for the affected pool
grep "^${ZEVENT_POOL}[/$(printf '\t')]" "${FSLIST}" |
grep -o '[^'"$(printf '\t')"'*' |
Contributor

grep: Unmatched [ or [^

I think the intended semantics was that of cut -f 1?
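If that is the intent, the second pipeline stage could be written with cut instead (a sketch):

  grep "^${ZEVENT_POOL}[/$(printf '\t')]" "${FSLIST}" | cut -f1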

# Get the information for the affected pool
grep "^${ZEVENT_POOL}[/$(printf '\t')]" "${FSLIST}" |
grep -o '[^'"$(printf '\t')"'*' |
xargs @sbindir@/zfs list -H -oname,mountpoint,canmount \
Contributor

@sbindir@/zfs should be ${ZFS}

grep "^${ZEVENT_POOL}[/$(printf '\t')]" "${FSLIST}" |
grep -o '[^'"$(printf '\t')"'*' |
xargs @sbindir@/zfs list -H -oname,mountpoint,canmount \
>"${FSLIST_TMP}" 2>/dev/null || true
Contributor

depending on whether we want to react to any changes (see bigger comment above), the whole preceding section could be replaced with
${ZFS} list -H -t filesystem -oname,mountpoint,canmount -r ${ZEVENT_POOL} > ${FSLIST_TMP}

@aerusso aerusso force-pushed the pulls/systemd-generator branch 2 times, most recently from ae79617 to 2a71bed Compare March 28, 2018 00:13
@aerusso
Contributor Author

aerusso commented Mar 28, 2018

I'm possibly over-engineering this patch, but I've included the ability to specify regular expressions for datasets that should be automatically added. I.e., if you want everything you import, create, or receive to be tracked with this, you can put a line

+ .*

If you just want everything in some pool, say bigpool:

+ bigpool/.*

Also, renamed datasets should now be correctly followed, and inherit events are now also tracked (since they can change mountpoints). Note that renamed datasets remain tracked, even if they no longer match the regular expression that caused them to be tracked initially. Exported pools have their entries removed from the cache.

@codecov

codecov bot commented Mar 28, 2018

Codecov Report

Merging #7329 into master will increase coverage by 0.03%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #7329      +/-   ##
==========================================
+ Coverage   76.35%   76.39%   +0.03%     
==========================================
  Files         329      329              
  Lines      104214   104191      -23     
==========================================
+ Hits        79576    79599      +23     
+ Misses      24638    24592      -46
Flag Coverage Δ
#kernel 76.03% <ø> (-0.03%) ⬇️
#user 65.6% <ø> (-0.02%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1724eb6...61fbffb.

@Fabian-Gruenbichler
Contributor

ha, yes, that looks a bit over-engineered to me as well ;) how about the following:

  1. regenerate cache from scratch based on currently available filesystems, no matter which event triggered
  2. filter for those matched by a + ... expression
  3. remove those matched by a - ... expression

that would end up being a lot less code for almost the same result?

I wonder whether moving the expressions to two variables in zed.rc (whitelist and blacklist) and implying a whitelist of .* unless explicitly specified might not make matters easier? I guess the main use case is to exclude certain subtrees of a pool (e.g. where backups are received or cloned to).
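A minimal sketch of that approach, assuming hypothetical zed.rc variables ZED_ZFS_LIST_WHITELIST and ZED_ZFS_LIST_BLACKLIST (the names are illustrative, not part of this PR):

  "${ZFS}" list -H -t filesystem -oname,mountpoint,canmount -r "${ZEVENT_POOL}" |
      grep -E "^(${ZED_ZFS_LIST_WHITELIST:-.*})" > "${FSLIST_TMP}"
  if [ -n "${ZED_ZFS_LIST_BLACKLIST:-}" ]; then
      # drop blacklisted datasets; grep may exit 1 if everything is filtered out
      grep -Ev "^(${ZED_ZFS_LIST_BLACKLIST})" "${FSLIST_TMP}" > "${FSLIST_TMP}.filtered" || true
      mv "${FSLIST_TMP}.filtered" "${FSLIST_TMP}"
  fi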

minor unrelated nit: most other zedlets use 4 spaces for indentation, it probably makes sense to follow that style. the case statements are currently also pretty weirdly indented in parts..

@Fabian-Gruenbichler
Contributor

if we want to keep the current "try to update changed parts of existing cache file" approach, I would highly recommend moving to something other than shell (then we could just map the existing cache file to some hash/dict structure, update that as needed, and write it back out). not sure whether that is acceptable given that all the current zedlets are shell scripts?

maybe @behlendorf can chime in on that point?

@behlendorf
Contributor

The ZEDLET infrastructure was originally designed so they could be any manner of executable. However, the hope was that the majority of them would be simple enough that they could be written in shell so they'd be trivial to inspect and customize. These are getting a little on the long side, but I don't think they're too unwieldy yet!

@aerusso aerusso force-pushed the pulls/systemd-generator branch 4 times, most recently from f02608c to 59de246 Compare March 29, 2018 21:25
@aerusso
Contributor Author

aerusso commented Mar 29, 2018

Here's another stab.

  1. It doesn't try to track changes. It just runs zfs list on the affected pool.
  2. The "cache" files are really just a cache---there's no configuration information hiding inside the file. This was bothering me, and I think @Fabian-Gruenbichler was honing in on this, too.
  3. It's much simpler. It doesn't parse the output at all.

Also, I've tested it out with a broken entry in the cache. As expected, it doesn't interfere with the boot beyond showing that a unit failed (because the dependency is Wants rather than Requires). I honestly can't think of any good reason to disable this on any datasets, so I think the complicated policies allowing administrators to disable this feature at a very fine level are a waste.
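In other words, the core of the zedlet is now roughly the following (a sketch only; variable names as in the review hunks above, with FSLIST being the per-pool cache file):

  "${ZFS}" list -H -t filesystem -oname,mountpoint,canmount -r "${ZEVENT_POOL}" \
      > "${FSLIST_TMP}" 2>/dev/null
  # only replace the per-pool cache if the listing actually changed
  diff -q "${FSLIST_TMP}" "${FSLIST}" >/dev/null 2>&1 || mv "${FSLIST_TMP}" "${FSLIST}"
  rm -f "${FSLIST_TMP}"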

Contributor

@behlendorf behlendorf left a comment

I haven't had a chance to test this, I've only read through the PR but the approach looks reasonable. @Fabian-Gruenbichler do you have any additional concerns?

@aerusso, how would you like to proceed? Since the zedlet isn't enabled by default, merging this is low risk. That would make it a little easier to get some additional testing on other distributions.

@aerusso
Contributor Author

aerusso commented Apr 4, 2018

@behlendorf I'm happy with it, and the biggest weakness is the lack of testing, though I have it running on several machines, and the systemd generator seems robust on them. We can always enable the zedlet later, after wider testing.

@Fabian-Gruenbichler
Contributor

Fabian-Gruenbichler commented Apr 4, 2018 via email

zfs-mount-generator implements the "systemd generator" protocol,
producing systemd.mount units from the cached outputs of zfs list,
during early boot, integrating with systemd.

Each pool has an independent cache of the output of the command

  zfs list -H -oname,mountpoint,canmount -tfilesystem -r $pool

which is kept synchronized by the ZEDLET

  history_event-zfs-list-cacher.sh

Datasets not in the cache will be loaded later in the boot process by
zfs-mount.service, including pools without a cache.

Among other things, this allows for complex mount hierarchies.

Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
@AttilaFueloep
Contributor

@aerusso Thanks for your explanations. I really feel bad for not reading the whole thread thoroughly before posting; everything was there. In my defense, maybe my broken system made me a bit too nervous.

I've already added a toggle of the canmount property of a test filesystem on the root pool to my boot environment script. If I understand correctly, this should trigger the zed script to update the cache file. On a side note, while this works for updating the active BE, it won't work for updating a non-active BE, and I have no easy solution here. What would you think of making history_event-zfs-list-cacher.sh callable outside of zed? Maybe something along the lines of this?

@rlaager
Member

rlaager commented Apr 21, 2018 via email

@AttilaFueloep
Contributor

In my setup /boot is on the ZFS root filesystem and therefore not shared. I have separate filesystems for e.g. /home and /var, and the initramfs (initcpio) does not mount them. Before this PR I had all mountpoints set to legacy and handled mounting via /etc/fstab, which is quite inconvenient, so this PR is a substantial improvement. If I could call history_event-zfs-list-cacher.sh directly, I could script something like /newbe/etc/zfs/zed.d/history_event-zfs-list-cacher.sh rpool >/newbe/etc/zfs/zfs-list.cache/rpool. That would be more than enough.

@ccope

ccope commented Apr 30, 2018

Should this work on Ubuntu 16.04? I've copied this zedlet into /etc/zfs/zed.d (and chown+chmod to root:root, 755), but it doesn't fire when I run zpool import/export. Verbose output from zed:

Apr 30 16:47:50 htpc-nuc zed[11655]: Registered zedlet "history_event-zfs-list-cacher.sh"
Apr 30 16:47:50 htpc-nuc zed[11655]: Registered zedlet "resilver.finish-notify.sh"
Apr 30 16:47:50 htpc-nuc zed[11655]: Registered zedlet "scrub.finish-notify.sh"
Apr 30 16:47:50 htpc-nuc zed[11655]: Ignoring "zed.rc": not executable by user
Apr 30 16:47:50 htpc-nuc zed[11655]: Registered zedlet "all-syslog.sh"
Apr 30 16:47:50 htpc-nuc zed[11655]: Registered zedlet "io-notify.sh"
Apr 30 16:47:50 htpc-nuc zed[11655]: Registered zedlet "io-spare.sh"
Apr 30 16:47:50 htpc-nuc zed[11655]: Registered zedlet "data-notify.sh"
Apr 30 16:47:50 htpc-nuc zed[11655]: Registered zedlet "checksum-spare.sh"
Apr 30 16:47:50 htpc-nuc zed[11655]: Ignoring "zed-functions.sh": not executable by user
Apr 30 16:47:50 htpc-nuc zed[11655]: Registered zedlet "checksum-notify.sh"
Apr 30 16:47:50 htpc-nuc zed[11655]: ZFS Event Daemon 0.6.5.6-0ubuntu20 (PID 11655)
Apr 30 16:47:50 htpc-nuc zed[11655]: Processing events since eid=22
Apr 30 16:48:14 htpc-nuc zed[11655]: Invoking "all-syslog.sh" eid=23 pid=11715
Apr 30 16:48:14 htpc-nuc zed[11655]: Finished "all-syslog.sh" eid=23 pid=11715 exit=0

@aerusso
Contributor Author

aerusso commented May 1, 2018

You have to touch /etc/zfs/zfs-list.cache/POOLNAME to enable this for a given pool. Did you also grab the zfs-mount-generator and put it in the right systemd directory?
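For example, to enable it for a pool named tank (name illustrative), optionally seeding the cache by hand until the zedlet fires:

  touch /etc/zfs/zfs-list.cache/tank
  zfs list -H -t filesystem -oname,mountpoint,canmount -r tank > /etc/zfs/zfs-list.cache/tank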

Notice also that #7453 will change these formats relatively soon.

@ccope

ccope commented May 1, 2018

Ah, I missed the "touch" step. I generated the cache file by running the zfs list command by hand and then the generator started working. Thanks for the heads up, I'll subscribe to that PR. Edit: I copied the latest versions of the generator and zedlet from master today, so I think I'm good.

[ -d "${FSLIST}" ] || exit 0

do_fail() {
printf 'zfs-mount-generator.sh: %s\n' "$*" > /dev/kmsg
Contributor

@aerusso Shouldn't that read zfs-mount-generator instead of zfs-mount-generator.sh?

Contributor

This, of course, would also apply to all occurrences of zfs-mount-generator.sh inside this file.

Contributor Author

Good catch. I've got a queue of documentation typos, I'll add these to it.

Contributor

Yeah, it's just cosmetic. I now tend to use myname=$(basename "$0") in my scripts after tripping over it as well.

@rugubara
Contributor

May I suggest a minor edit to the man page?
The suggested command is missing the -r and -t filesystem options. If the pools have zvols, the generator fails with "invalid canmount". So the correct initialization command that worked for me:
zfs list -t filesystem -r -H -oname,mountpoint,canmount

The generator code doesn't provide any hint in the message when this fails. I suggest replacing
do_fail "invalid canmount"
with
do_fail "invalid canmount ${1} ${2} ${3}"

@aerusso
Contributor Author

aerusso commented Oct 22, 2018

@rugubara How did a zvol wind up in the zfs-list cache? That's a bug in the ZED-let.

@rugubara
Contributor

@aerusso, I couldn't get the zedlet working for me yet. I wrote /etc/zfs/zfs-list.cache/pool myself using the zfs list command from the man page.

@openzfs openzfs deleted a comment from rugubara Oct 22, 2018
@bunder2015
Contributor

Just a heads-up, GitHub is still having problems from last night; you might not see new posts, but I believe they are still going through.

@aerusso
Contributor Author

aerusso commented Oct 22, 2018

Thank you for confirming that I'm not the only one who cannot see all the replies!

The man page has a small but tricky caveat:

for datasets that should be mounted by systemd

though the real bug is that the zedlet isn't working for you. What happened there?

@openzfs openzfs deleted a comment from rugubara Oct 22, 2018
@openzfs openzfs deleted a comment from rugubara Oct 22, 2018
@openzfs openzfs deleted a comment from rugubara Oct 22, 2018
@openzfs openzfs deleted a comment from rugubara Oct 22, 2018
@rugubara
Contributor

@aerusso, I forgot to symlink history_event-zfs-list-cacher.sh to /etc/zfs/zed.d. Once symlinked, everything works ok.

@mskarbek
Contributor

Is this something that could go live with 0.7.12 or do we have to wait for 0.8.0?

@AttilaFueloep
Contributor

@aerusso

I also cannot get /var/log to unmount at system shutdown (but this is IMO a systemd journald bug/feature depending on how you feel about it).

In case you still have the unmounting problem, the solution is described in #8060 (comment). That fixed it for me.
