
ZFS FS support (for NFS exports) #55

Merged
8 commits merged into open-iscsi:master from the zfs-nfs-exports-squash branch on Aug 3, 2020

Conversation

dsonck92
Contributor

@dsonck92 dsonck92 commented Jul 28, 2020

Adds ZFS support to the filesystem pools.

  • Includes /proc/mounts parsing to determine which pools are supported
  • Extracts btrfs into its own driver package
  • Reduces the fs package to just the API and common helpers for said API
  • Adds ZFS backend implementing the fs API
  • Extends the test config with a ZFS fs pool

These changes allow ZFS as a backend for both iSCSI targets as well as NFS targets as discussed in #54.

Closes: #54

@dsonck92
Contributor Author

The failure at commit e515c4c is a bit sad. ZFS on Linux apparently fixed this use case somewhere between 0.7.5 and 0.8.4, so some thought should be given to which ZFS version targetd aims to support. But given that the "hack" introduced in 6f909ff does fix it, it's possible to support 0.7.5.

@triluch I believe the current state covers most use cases. I will go through the code once to find any spots where I should do some additional sanity checks but other than that, it looks done.

@dsonck92 dsonck92 marked this pull request as ready for review July 28, 2020 20:39
@dsonck92
Contributor Author

There are some outstanding points I did not address in here as I'm not familiar with the expectations of the API users:

  • fs_list will return the correct (nfs exportable) path based on the mountpoint, which is on par with what btrfs does. However in contrast to btrfs, its pool name is not a valid path but the path internal to ZFS. This simplifies further interaction with ZFS but relies on the fact that clients will not utilize the pool name as some location to get data.
  • ss_list currently only gives abstract locations about the snapshots. While it is possible to create a clone out of it, which in turn will have the correct path based on its mountpoint, care needs to be taken with the fact that clones lock the snapshot. It might be useful to get a "readonly path" location from ss_list which can then be served out of .zfs/snapshots and targetd_ss/ depending on the source being ZFS or btrfs.
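For illustration, the kind of "readonly path" I mean could look like this. The `.zfs/snapshot/<name>` directory is standard ZFS behavior; the helper and the btrfs layout below are purely illustrative and not code from this PR:

```python
import os

def snapshot_readonly_path(fs_mountpoint, ss_name, backend):
    """Illustrative only: where a snapshot's read-only contents could be served from."""
    if backend == "zfs":
        # ZFS exposes every snapshot read-only under the hidden .zfs directory of the mounted dataset
        return os.path.join(fs_mountpoint, ".zfs", "snapshot", ss_name)
    if backend == "btrfs":
        # targetd keeps btrfs snapshots as subvolumes under targetd_ss/ (layout simplified here)
        return os.path.join(fs_mountpoint, "targetd_ss", ss_name)
    raise ValueError("unsupported backend: %s" % backend)
```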

@tasleson
Member

There are some outstanding points I did not address in here as I'm not familiar with the expectations of the API users:

If the API docs are lacking https://github.com/open-iscsi/targetd/blob/master/API.md we should add additional information to make them clear.

* `fs_list` will return the correct (nfs exportable) path based on the mountpoint, which is on par with what btrfs does. However in contrast to btrfs, its pool name is not a valid path but the path internal to ZFS. This simplifies further interaction with ZFS but relies on the fact that clients will not utilize the pool name as some location to get data.

The pool name has no requirement to be some type of path. Actually, it likely shouldn't be, e.g. when you create a FS from a pool you should be able to list all the FSes and locate it by name, although name uniqueness should likely only be enforced within a specific pool, not across all pools.

* `ss_list` currently only gives abstract locations about the snapshots. While it is possible to create a clone out of it, which in turn will have the correct path based on its mountpoint, care needs to be taken with the fact that clones lock the snapshot. It might be useful to get a "readonly path" location from `ss_list` which can then be served out of `.zfs/snapshots` and `targetd_ss/` depending on the source being ZFS or btrfs.

From an API standpoint, a FS clone is a read/writable point-in-time copy of some FS. A FS snapshot is a read-only copy of some FS. Again, I would make the statement that the name need not be any kind of valid path. I think the assumption with the API was that a FS would always be exported before it's consumed, and you do specify an export path in that step. I suppose this does leave missing functionality with regards to using the API on a system and wanting to simply create a locally mounted file system. Likely in that case we should have an API call to mount a FS for local consumption which specifies a local path. The API doesn't really outline what it means to clone a snapshot, snapshot a clone, or snapshot a snapshot, or even whether those operations are necessarily valid. I'm not sure what it means to lock a snapshot on ZFS.

@tasleson
Member

@dsonck92 I'll take a look at this PR tomorrow or later this week. Thank you for adding to the client test in addition to the ZFS FS support changes.

One thing that I did see was the commit message sanity check. Please re-word this to something else, e.g. quick check, confidence check, coherence check, or precondition check. Thanks!

@dsonck92
Contributor Author

dsonck92 commented Jul 28, 2020

There are some outstanding points I did not address in here as I'm not familiar with the expectations of the API users:

If the API docs are lacking https://github.com/open-iscsi/targetd/blob/master/API.md we should add additional information to make them clear.

I will add some notes regarding the availability of full_path (which is currently absent from API.md)

* `fs_list` will return the correct (nfs exportable) path based on the mountpoint, which is on par with what btrfs does. However in contrast to btrfs, its pool name is not a valid path but the path internal to ZFS. This simplifies further interaction with ZFS but relies on the fact that clients will not utilize the pool name as some location to get data.

The pool name has no requirement to be some type of path. Actually, it likely shouldn't be, e.g. when you create a FS from a pool you should be able to list all the FSes and locate it by name, although name uniqueness should likely only be enforced within a specific pool, not across all pools.

Agreed. All backends now keep track of a list of pools they handle; different pools can contain the same volume name, but together with the pool name this is unique. Besides this, the UUIDs in use also offer uniqueness for most operations, and most even require their use.

* `ss_list` currently only gives abstract locations about the snapshots. While it is possible to create a clone out of it, which in turn will have the correct path based on its mountpoint, care needs to be taken with the fact that clones lock the snapshot. It might be useful to get a "readonly path" location from `ss_list` which can then be served out of `.zfs/snapshots` and `targetd_ss/` depending on the source being ZFS or btrfs.

From an API standpoint, a FS clone is a read/writable point-in-time copy of some FS. A FS snapshot is a read-only copy of some FS. Again, I would make the statement that the name need not be any kind of valid path. I think the assumption with the API was that a FS would always be exported before it's consumed, and you do specify an export path in that step. I suppose this does leave missing functionality with regards to using the API on a system and wanting to simply create a locally mounted file system.

The current nfs_export_add and nfs_export_remove seem to take a local path only (export_create and export_destroy do take pool and vol and as such don't have any issue here). Considering NFS is the scope of this patch, this is the main reason I brought it up. It could be that this hasn't yet seen a lot of thought. Hence it could be useful to present this same full_path in the snapshot listing too. Unless it is decided that it's better to abstract the NFS sharing too, and offer an nfs_export_add(pool, vol, host, virtual_path, options) API for it. This would give the backends freedom to use their optimal locations (like targetd_ss/ and .zfs/snapshots/) for exporting, but let the user of the API define logical export paths through virtual_path. The client does not actually need to care about the full_path in that case.

Likely in that case we should have an API call to mount a FS for local consumption which specifies a local path. The API doesn't really outline what it means to clone a snapshot, snapshot a clone, or snapshot a snapshot, or even whether those operations are necessarily valid. I'm not sure what it means to lock a snapshot on ZFS.

ZFS has the peculiarity that, once you create a clone based on a snapshot, this snapshot cannot be deleted until the clone is removed, which in turn also means that the original volume can't be removed. I'm not sure what btrfs does in this regard. This also has the side effect of requiring more space for the snapshot as the original volume diverges, and more space as the clone diverges.
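A minimal sketch of that behavior, shelling out to the zfs CLI the way the backend does (the dataset names here are made up for illustration):

```python
import subprocess

def zfs(*args):
    # Thin wrapper around the zfs CLI; raises CalledProcessError on failure.
    return subprocess.run(("zfs",) + args, check=True, capture_output=True, text=True)

# Hypothetical dataset names, purely for illustration.
zfs("snapshot", "tank/targetd/fs1@snap1")                      # read-only point-in-time copy
zfs("clone", "tank/targetd/fs1@snap1", "tank/targetd/clone1")  # writable copy backed by the snapshot

try:
    zfs("destroy", "tank/targetd/fs1@snap1")
except subprocess.CalledProcessError as err:
    # ZFS refuses: the snapshot has a dependent clone, so it (and transitively the
    # origin filesystem) stays pinned until the clone is destroyed or promoted.
    print(err.stderr)
```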

The local exports idea is interesting. In fact, considering that both btrfs and zfs have ways to mount their subvolumes at arbitrary paths, it could be possible to expose this. For ZFS this would be trivially handled through the zfs command, as you can set the mountpoint directly; btrfs might need some trickery with /etc/fstab or systemd-mount. However, this pull request was mainly focused on adding another type of FS volume (as opposed to block volumes that are shared through iSCSI).

I realize that this project, being part of open-iscsi, is diverging from its "iSCSI" roots, but NFS support had already been added prior to this. Given the addition of ZFS filesystem datasets besides ZFS volumes, people using ZFS as their main storage system can now create both block devices and filesystems and export those over iSCSI and NFS. As noted in #54, this could provide clusters that have external data storage with dynamically provisioned filesystems for both private (iSCSI) as well as shared (NFS) storage (where especially the shared variant is currently limited to complex or cloud-based solutions).

@dsonck92 dsonck92 changed the title ZFS FS support ZFS FS support (for NFS exports) Jul 29, 2020
@dsonck92
Contributor Author

As I've touched the client test with the fs tests anyway, I figured I might as well do the same procedure for the regular block tests, ensuring that at least basic testing of LVM and ZFS is performed.

Both tests now just loop over a list of volumes, creating a subvolume, a snapshot or copy of said volume, and, in the case of block devices, also an export. After that they destroy everything and verify that things have been cleaned up. This allows future expansion with other filesystems and block devices.
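Roughly the shape of that loop (illustrative pseudocode of the test flow, not the actual client test; rpc stands in for the JSON-RPC call helper and the parameter names are my reading of API.md):

```python
def fs_round_trip(rpc, fs_pools):
    """Illustrative: create, snapshot/copy, then clean up for every configured fs pool."""
    for pool in fs_pools:
        rpc("fs_create", pool_name=pool, name="test-vol", size_bytes=0)

        created = [v for v in rpc("fs_list") if v["pool"] == pool and v["name"] == "test-vol"]
        assert created, "fs_create did not show up in fs_list for %s" % pool

        # ... snapshot or copy the volume here, mirroring what the real test does ...

        for vol in created:
            rpc("fs_destroy", uuid=vol["uuid"])

        # verify everything was cleaned up again
        assert not any(v["name"] == "test-vol" for v in rpc("fs_list") if v["pool"] == pool)
```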

@dsonck92
Contributor Author

While I had some time to think, something dawned on me which might be a bit strange and might need a second opinion:

Initially, when implementing the backend-agnostic fs layer, I decided that fs_pools would simply list all mountpoints of the top volumes under which targetd creates its subvolumes. This is compatible with btrfs, as its command takes the mount path as argument. When I introduced the ZFS backend, I decided to inspect /proc/mounts to deduce which filesystem is in use (and as such it requires a specific subvolume to be mounted at the exact location listed, which is always the case for ZFS and can be arranged for btrfs in /etc/fstab too). This also provides the information the ZFS backend needs to recover the ZFS-internal name.

This all works, but since ZFS then prefers its internal top-pool/volume/subvolume name over /mnt/zfs-mount/subvolume, the internal name becomes top-pool/zfs-mount/subvolume (or wherever it actually resides). I didn't think much of it at first, but while writing the client test I noticed that, for ZFS, the pool name no longer corresponds to the pool name defined in the config file.

In the test setup this is not really noticeable: /zfs_targetd/fs_pool becomes zfs_targetd/fs_pool. But when admins start mounting at /mnt/zfs_targetd/fs_pool, this still becomes zfs_targetd/fs_pool. There are two ways to solve this:

  • Specify pools strictly by their internal name: zfs_targetd/fs_pool is defined as zfs_targetd/fs_pool. This creates the complication of determining which backend such a pool belongs to. A quick "hack" would be to look at "starts with /", but this might break with further filesystem support. Another possibility is to be explicit about ZFS, so zfs_fs_pools instead of fs_pools for ZFS-specific filesystem pools.
  • Specify pools by their mount point and map the pool name to the internal name in the backend. So a pool zfs_targetd/fs_pool mounted at /mnt/zfs_targetd/fs_pool is specified as /mnt/zfs_targetd/fs_pool in config and identified through /mnt/zfs_targetd/fs_pool during API interactions. This requires a mapping inside the ZFS backend from mountpoint names to their respective pool volumes, but won't require adding additional backend-specific configuration options (see the sketch below). Transparently, more backends can be added in the future that also resolve their (potential) naming conventions to absolute paths.
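To make the second option concrete, the ZFS backend would essentially keep a one-time mapping from configured mountpoints to internal dataset names, something like this (names are illustrative):

```python
# Built once at startup (from /proc/mounts): configured mountpoint -> ZFS dataset name.
pools_fs = {
    "/mnt/zfs_targetd/fs_pool": "zfs_targetd/fs_pool",   # illustrative entry
}

def dataset_for_pool(pool):
    """Resolve the API-facing pool name (a mountpoint) to the ZFS-internal name."""
    try:
        return pools_fs[pool]
    except KeyError:
        raise ValueError("pool %r is not a configured ZFS fs_pool" % pool)
```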

@dsonck92
Contributor Author

Well, as going for a walk always seems to have a good influence on the thought process, I've entertained the thought of mapping ZFS datasets transparently by their mount points (this concerns only the FS part, not block devices).

My reasoning (as stated in the commit): nfs_export_add takes full, NFS-reachable mountpoints. My interpretation of filesystem pools is for them to be exported through nfs_export_add, which takes the absolute path destined for /etc/exports. The initial /proc/mounts parsing allows targetd to resolve the local (NFS-reachable) mountpoints to their intended backends (btrfs/zfs). This means that an administrator can specify ["/local/pool"] in targetd, reference it in the configuration of its clients, and later transparently swap between btrfs or zfs as they choose. Clients then utilize the full_path to set up any dynamic NFS exports they require. Now I'm just looking for which code I overlooked that causes this check to fail.

@tasleson
Member

@dsonck92 I'm looking at this PR, trying to understand all of what you're saying. I spun up a VM and I created a couple of zfs pools. Shouldn't I be able to specify in the targetd.yaml zfs_block_pools: [zfs_block_pool] and fs_pools: [/mnt/btrfs, zfs_fs_pool] ? I'm also wondering if it would simply be better to use zpool list instead of looking at /proc/mounts? Also is it possible to use the same zfs pool to create both block devices and file systems?

FYI: I'm not that well versed with ZFS.

@dsonck92
Contributor Author

dsonck92 commented Jul 29, 2020

@dsonck92 I'm looking at this PR, trying to understand all of what you're saying. I spun up a VM and I created a couple of zfs pools. Shouldn't I be able to specify in the targetd.yaml zfs_block_pools: [zfs_block_pool] and fs_pools: [/mnt/btrfs, zfs_fs_pool] ?

This was the point I was trying to make 2 comments back. While it would be possible to deduce zfs_fs_pool being a zfs pool based on the lack of a leading slash, this might be troublesome in the long run when other filesystems pop up that offer similar features like subvols. However, this would be more in line with "all values supplied inside the targetd.yaml are in fact pool names"

I'm also wondering if it would simply be better to use zpool list instead of looking at /proc/mounts?

Using /proc/mounts has the benefit of mapping NFS paths (which currently do not use pool + volume addressing during creation) to managed pools regardless of how the tools interact. In addition, it makes an "fs_pool" actually mean "look at this location, I want to create mounted filesystems here", which, in my opinion, ties in with NFS. Imagine (roughly sketched in code after this list):

  • A system wants to configure a new NFS mount. An NFS mount means that it should create a filesystem, thus either a btrfs or zfs filesystem volume
  • The administrator has set, at some point, the base in the client to something like "/mnt/nfs/btrfs_pool" or "/mnt/nfs/zfs_pool" and the client can simply create a subvol "claim_xyz" under /mnt/nfs/btrfs_pool, and later retrieves full_path to be /mnt/nfs/btrfs_pool/targetd_fs/claim_xyz
  • It then sends out a request for nfs_export_add to add /mnt/nfs/btrfs_pool/targetd_fs/claim_xyz and supplies this to whatever system originally wanted this claim
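Very roughly, that client-side flow could look like the sketch below. The JSON-RPC endpoint, credentials and parameter names are assumptions on my side; API.md remains the authoritative reference:

```python
import requests  # assumption: a plain HTTP JSON-RPC client; a real deployment would use TLS

def call(method, **params):
    # Endpoint, port and credentials are placeholders for whatever the local targetd uses.
    r = requests.post("http://localhost:18700/targetrpc",
                      json={"jsonrpc": "2.0", "id": 1, "method": method, "params": params},
                      auth=("admin", "password"))
    r.raise_for_status()
    return r.json()["result"]

pool = "/mnt/nfs/btrfs_pool"  # one of the fs_pools from targetd.yaml
call("fs_create", pool_name=pool, name="claim_xyz", size_bytes=0)

fs = next(f for f in call("fs_list") if f["pool"] == pool and f["name"] == "claim_xyz")
# full_path is what ends up being exported, e.g. /mnt/nfs/btrfs_pool/targetd_fs/claim_xyz
call("nfs_export_add", host="192.168.1.0/24", path=fs["full_path"], options=["rw"])
```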

But other than that, zfs list does give a list of pools that exist and the backend could try to match this.

Also is it possible to use the same zfs pool to create both block devices and file systems?

Yes, I believe it should be possible to use the same pool for both block devices as well as file systems. As far as zfs is concerned, it can have mixed block devices and filesystems. The only question there is, what to do when they start having conflicting names.

While thinking about this part, I do realize that, if it's preferable to go with the zfs-style pool names, then zfs_block_pools could in fact be upgraded to zfs_pools, allowing it to host both filesystems and block devices. It would then basically either show up twice in the pool list (as both a filesystem and a block pool), or the API adapts to give multiple values for its type.

But I haven't thought about that much; it's more of a thought brought up by your question.

FYI: I'm not that well versed with ZFS.

In the end, the major differences between ZFS and btrfs/lvm are:

  • ZFS covers the roles of both btrfs and lvm, allowing it to host both block devices and filesystems
  • ZFS has some rules regarding snapshots which make them less ideal for cloning (clones prevent snapshots from being deleted)
  • ZFS prefers to work with <zfs_root_pool>/sub/volumes@snapshot paths and does not relate its filesystems to their mountpoints like btrfs does. This also means ZFS can be handled without being mounted.
  • ZFS allows block devices and filesystems to co-exist in the same dataset (volume), but I believe block devices can only belong to a filesystem dataset, so no block device under a block device

@dsonck92
Contributor Author

dsonck92 commented Jul 29, 2020

Well, given that the build now succeeds: it's now possible to specify filesystems (which must be mounted) in fs_pools, and targetd will properly deduce whether each is btrfs or zfs and map the names accordingly. Hence:

zfs create -o mountpoint=/mnt/zfs_fs_pool zfs_root/fs_pool
fs_pools: [/mnt/zfs_fs_pool]
pool_list() -> [{'name': '/mnt/zfs_fs_pool', 'size': 1024, 'free_size': 512, 'type': 'fs'}]

Subsequently, it's possible to call fs_create('/mnt/zfs_fs_pool','test-volume',256), noting that the size is not actually taken into account, much like with btrfs (however, zfs does support quotas and could enforce the given size; see the sketch further below). After that:

fs_list() -> [{'name': 'test-volume', 'pool': '/mnt/zfs_fs_pool', 'uuid': '1234', 'full_path': '/mnt/zfs_fs_pool/test-volume', ...}]
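As a side note on the quota remark above, enforcing the requested size on ZFS could be as simple as the hypothetical helper below (not part of this PR):

```python
import subprocess

def set_zfs_quota(dataset, size_bytes):
    """Hypothetical: cap a ZFS filesystem dataset at the size passed to fs_create."""
    subprocess.run(["zfs", "set", "quota=%d" % size_bytes, dataset], check=True)

# e.g. set_zfs_quota("zfs_root/fs_pool/test-volume", 256 * 1024 * 1024)
```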

I believe that, considering the filesystem pools' main use case is NFS exports (as iSCSI relies on block devices), this would be convenient for the end user of targetd. They only need to indicate which paths should be managed, and arrange on their system that either btrfs or zfs is mounted there.


That said, it makes use of the fact that zfs child datasets inherit their mountpoint from their parent, so they automatically mount under /mnt/zfs_fs_pool/ if the parent dataset is mounted at /mnt/zfs_fs_pool.

On an entirely tangential note (which I believe would warrant opening a new issue altogether if wanted): the other solution is to make NFS exports in line with iSCSI exports, as noted a few comments back. This would mean that the API takes a pool + volume in addition to a path (which it currently does not for NFS). With this mindset, it would be possible to leverage zfs's easy mountpoint management to let targetd manage mountpoints entirely, completely hiding how the zfs pools are named behind the scenes as far as API clients are concerned. But this would be much more work.

Hence I think the "map zfs volumes through /proc/mounts" approach is the most convenient, least invasive method with the most gain: fs_pools lists mountpoints, which are easily exportable, and the user can swap zfs/btrfs as they please by simply mounting another filesystem at the given fs_pool locations.

@dsonck92
Contributor Author

@triluch perhaps you also want to take a look at the solution as it's currently standing.

@tasleson
Member

Hence I think the "map zfs volumes through /proc/mounts" approach is the most convenient, least invasive method with the most gain: fs_pools lists mountpoints, which are easily exportable, and the user can swap zfs/btrfs as they please by simply mounting another filesystem at the given fs_pool locations.

The nfs handling code of targetd was written in such a way that an administrator can have nfs exports that are not managed by targetd. I haven't had a chance to fully review this PR, so I don't know if utilizing /proc/mounts could cause issues with that. The nfs code leaves /etc/exports to the end user and utilizes /etc/exports.d for targetd-configured nfs exports.

As for removing any ambiguity between "this fs pool is btrfs" and "this fs pool is zfs", I would be fine with adding a new designation in the config file to remove that ambiguity, especially if it makes things simpler in the code.

@dsonck92
Contributor Author

dsonck92 commented Jul 29, 2020

Hence I think the "map zfs volumes through /proc/mounts" approach is the most convenient, least invasive method with the most gain: fs_pools lists mountpoints, which are easily exportable, and the user can swap zfs/btrfs as they please by simply mounting another filesystem at the given fs_pool locations.

The nfs handling code of targetd was written in such a way that an administrator can have nfs exports that are not managed by targetd. I haven't had a chance to fully review this PR, so I don't know if utilizing /proc/mounts could cause issues with that. The nfs code leaves /etc/exports to the end user and utilizes /etc/exports.d for targetd-configured nfs exports.

Well, that's at least good for me to know, considering my last comment on #54, as it would make me prefer targetd even more (I rely on /etc/exports not changing). I think I might not have been clear regarding /proc/mounts. In essence, what I'm doing with it is:

  • the user specifies a filesystem in config (probably as this needs exporting somewhere down the line)
  • in order to deduce what backend should handle this filesystem, it reads (only reads) /proc/mounts to view which filesystem maps there
  • if it's either zfs or btrfs, it will store the device (/dev/sda or zfs-pool/subvol) and mountpoint (/mnt/btrfs_pool or /mnt/zfs-pool/subvol) to a list for the backends to consume later
  • lastly, when the backends are initialized, btrfs will just look at the mountpoint and ignore the device. ZFS will look at both the mountpoint and the device, and keep a mapping of which mountpoint relates to which volume. It will use this mapping to later relate /mnt/zfs-pool/subvol to zfs-pool/subvol whenever an API call is made

An added bonus to this /proc/mounts read is that it can error out very early once it has deduced that the given fs_pool path does not actually map to btrfs or zfs. Without it, btrfs would fail due to not being able to create the filesystem and snapshot locations, but that's essentially already too late. The error would only indirectly show it was due to the pool not being btrfs, whereas with the /proc/mounts check (which is also done just once at startup) it actually errors out with "Unsupported filesystem ext4 for pool /var/lib/exportd".

As for removing any ambiguity between "this fs pool is btrfs" and "this fs pool is zfs", I would be fine with adding a new designation in the config file to remove that ambiguity, especially if it makes things simpler in the code.

The added complexity of deducing paths through /proc/mounts is pretty minimal. There's a dedicated class written for parsing /proc/mounts that splits out the fields and provides constants to retrieve the names (this could be changed to a returned dict if that's preferred). The mapping of mount -> zfs is a simple dict. Given that mountpoints should be unique on Linux, any given key maps unambiguously to a zfs pool. Only a single zfs_pool = pools_fs[pool] lookup was necessary for each function accepting a single pool, and for those that go through the list, Python offers for pool, zfs_pool in pools_fs.items().
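For reference, the whole /proc/mounts handling boils down to something like this simplified sketch (not the actual class from this PR; names are illustrative):

```python
SUPPORTED_FS = ("btrfs", "zfs")

def parse_proc_mounts(path="/proc/mounts"):
    """Yield (device, mountpoint, fstype) for every mounted filesystem."""
    with open(path) as mounts:
        for line in mounts:
            device, mountpoint, fstype = line.split()[:3]
            yield device, mountpoint, fstype

def classify_fs_pools(fs_pools):
    """Map each configured fs_pool mountpoint to (fstype, device), failing early otherwise."""
    mounted = {mp: (fstype, dev) for dev, mp, fstype in parse_proc_mounts()}
    result = {}
    for pool in fs_pools:
        if pool not in mounted:
            raise ValueError("fs_pool %s is not a mountpoint" % pool)
        fstype, dev = mounted[pool]
        if fstype not in SUPPORTED_FS:
            raise ValueError("Unsupported filesystem %s for pool %s" % (fstype, pool))
        # btrfs only needs the mountpoint; zfs keeps dev as the dataset name for later mapping
        result[pool] = (fstype, dev)
    return result
```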

Whether that's considered complex by other contributors, I cannot judge, but the end result is less complexity in the configuration file. Perhaps playing around with it locally might shed some light on my chosen design.


I might add that, while btrfs errors out early if it can't create the required subvolumes, if an administrator actually made the directories beforehand, it would not error out at start but would subsequently fail to perform any btrfs tasks when an API call is made. So I think interfacing with a standard Linux file (which is documented) is justified, at least for error handling. And in my opinion, when it's already being read, why not utilize the additional information stored in there.

@tasleson
Member

@dsonck92 I'm in the process of writing a more comprehensive unit test against the whole API. Thus far I've found 2 bugs, neither of which have anything to do with this PR. Once I get the unit test done and assuming no issues are found pertaining to the PR we'll get it merged.

@dsonck92
Contributor Author

dsonck92 commented Jul 31, 2020

Good, I can rebase on the new unit test when it is done. I think it's then probably better to pull my unit test code out of the change for 2 reasons:

  1. If your unit tests are well written, then they should probably test zfs implicitly anyways. (So I do recommend taking this into consideration if you didn't do so already, say, a testcase list or something)
  2. It would prevent merge conflicts regarding the unit tests if it's living in that client thing. Unless you are strictly speaking about unit tests and not the integration test that's going on in client.

@dsonck92 dsonck92 closed this Jul 31, 2020
@dsonck92 dsonck92 reopened this Jul 31, 2020
@dsonck92
Contributor Author

On the phone UI they really should put up an "are you sure you want to close" message.

@tasleson tasleson mentioned this pull request Jul 31, 2020
@dsonck92
Contributor Author

dsonck92 commented Aug 1, 2020

And I found another mistake in the NFS support. I'm actually testing this in my private (but production) environment and was hit by that. I was a bit confused as API.md did not mention export_path. Once I manage to fully get it to work on my production environment, we can look at how we want to proceed with merging the tests and this. I'm also fine with this being rebased on top of the tests or vice versa (tests on top of this).

(Review comment on targetd/fs.py — outdated, resolved.)
@tasleson
Member

tasleson commented Aug 2, 2020

@dsonck92 Unless you would like to squash a bit, I think this is OK to merge and I'll do so tomorrow. I'm planning on expanding the unit tests to increase the code coverage and address anything else that might be present before a new release is made.

@dsonck92
Contributor Author

dsonck92 commented Aug 3, 2020

I'll do that, should be done in a few hours

- The original btrfs filesystem pools use mount path based pool names. ZFS differs from this by using special ZFS pool names starting with the top level ZFS pool name. This unifies the fs_pool configuration to always refer to mount points.

Rationale: as filesystem pools defined in targetd are likely to be exported over NFS, and since NFS paths are strongly tied to actual mount paths, it makes sense to specify filesystem pools in terms of their "soon to be exported" paths.
@dsonck92
Contributor Author

dsonck92 commented Aug 3, 2020

@tasleson if you wish, I could combine all the other pull requests I made into this one although they should not conflict when applied individually.

@tasleson tasleson merged commit da844b5 into open-iscsi:master Aug 3, 2020
@dsonck92 dsonck92 deleted the zfs-nfs-exports-squash branch August 3, 2020 14:22