Support filefrag #264
Comments
|
Brian, most Linux filesystems support FIEMAP today, including ext3/4 and Lustre. It was originally based on a similar XFS ioctl and has been adopted by other filesystems since then. While "filefrag" is packaged with e2fsprogs, it is a candidate to move into util-linux-ng at some point (along with lsattr/chattr), as have other e2fsprogs-developed libraries/tools like libblkid, libuuid, libcom_err, fsck, etc. The alternative to FIEMAP is FIBMAP, but that ioctl is limited to root, and it only returns a single block at a time. Not only is FIBMAP very inefficient for large files, but it also lacks some additional features that are nice to have on modern filesystems, like xattr mapping, unwritten extents, delalloc blocks, etc. |
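To make the FIEMAP side concrete, here is a minimal user-space sketch of the kind of call filefrag makes. It uses only the upstream `FS_IOC_FIEMAP` ioctl and `struct fiemap` from the kernel headers; the helper name `count_extents()` is made up for illustration, and the choice to pass `FIEMAP_FLAG_SYNC` is an assumption rather than something the thread prescribes.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FS_IOC_FIEMAP */
#include <linux/fiemap.h>	/* struct fiemap, FIEMAP_* constants */

/*
 * Return the number of extents the filesystem reports for fd, or -1
 * if the FIEMAP ioctl fails (for example EBADF, or EOPNOTSUPP on a
 * filesystem that does not implement FIEMAP).
 */
static long
count_extents(int fd)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof (fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm.fm_flags = FIEMAP_FLAG_SYNC;		/* flush dirty data first */
	fm.fm_extent_count = 0;			/* count only, copy nothing */

	if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
		return (-1);

	/* With fm_extent_count == 0 the kernel only fills in the count. */
	return ((long)fm.fm_mapped_extents);
}
```

A caller would typically use the returned count to size a second request that has room for the `fm_extents[]` array. Unlike FIBMAP, this is a single call for the whole file and does not require root.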
|
I would recommend starting by looking at __generic_block_fiemap() in the kernel, to get an idea of what needs to be done from the FIEMAP side of the code. Note that there is also support in very new kernels to add the SEEK_HOLE

Lustre uses a patched version of filefrag which allows returning the underlying device for each extent (fe_device), because the file is not located on a single LUN. For Lustre this is an index value (0-N). For ZFS one might either return the Linux block device (major << 16 | minor) or 32 bits of the VDEV GUID or similar.

struct fiemap_extent {
	__u64 fe_logical;	/* logical offset in bytes for the start of
				 * the extent from the beginning of the file */
	__u64 fe_physical;	/* physical offset in bytes for the start
				 * of the extent from the beginning of the disk */
	__u64 fe_length;	/* length in bytes for this extent */
	__u64 fe_reserved64[2];
	__u32 fe_flags;		/* FIEMAP_EXTENT_* flags for this extent */
	__u32 fe_device;	/* device number */
	__u32 fe_reserved[2];
};

One of the fe_reserved64[2] fields was (in my mind at least) reserved for returning the actual length of the extent, for compressed blocks. I haven't given a lot of thought to whether the existing fe_length should be treated as the physical length or the logical length (currently both are the same), so ideas are welcome. If anyone goes down that road, please send me a patch for e2fsprogs filefrag so I can submit it upstream.

As for the ZFS side, I'd recommend looking at the Solaris ZPL layer to see how they implement SEEK_HOLE and SEEK_DATA traversal of the blocks in the dnode. Don't forget about reporting dirty pages (ARC buffers) in memory, or copying a file that was just written will result in an empty copy. Even though these pages do not have blocks allocated yet, the extent should set FIEMAP_EXTENT_DELALLOC to indicate that the space is in use. |
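For comparison with FIEMAP, the SEEK_HOLE and SEEK_DATA traversal mentioned above looks roughly like this from user space. This is only a sketch: `count_data_bytes()` is a made-up helper, and the fallback values for `SEEK_DATA` and `SEEK_HOLE` are the Linux ones, assumed here in case the libc headers do not expose them.

```c
#ifndef _GNU_SOURCE
#define	_GNU_SOURCE		/* SEEK_DATA and SEEK_HOLE on glibc */
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SEEK_DATA		/* Linux values, assumed if libc hides them */
#define	SEEK_DATA	3
#define	SEEK_HOLE	4
#endif

/*
 * Sum the bytes that lseek(2) reports as data in fd by alternating
 * SEEK_DATA and SEEK_HOLE.  Returns -1 on error.
 */
static off_t
count_data_bytes(int fd)
{
	off_t total = 0, data, hole = 0;

	for (;;) {
		data = lseek(fd, hole, SEEK_DATA);
		if (data < 0) {
			/* ENXIO: no more data at or after this offset. */
			return (errno == ENXIO ? total : -1);
		}
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return (-1);
		total += hole - data;
	}
}
```

A tool that copies only the data regions depends on this answer being right for freshly written files, which is why dirty buffers need to be reported as data (or as FIEMAP_EXTENT_DELALLOC extents) even before blocks are allocated.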
|
The most recent (but as yet unlanded) patches for handling compressed extents in FIEMAP are at https://lwn.net/Articles/607552/ and http://linux-fsdevel.vger.kernel.narkive.com/S8u3GLaY/patch-0-6-v5-fiemap-introduce-data-compressed-and-phys-length-flags |
|
I'm the one who originally requested this on the zfsonlinux mailing list, in order to generate fragmentation reports. Knowing the actual extent of fragmentation is fairly important given how badly ZFS performance falls over when fragmentation becomes widespread (short version: it's pretty ugly, and you don't want to experience it). Brian, what's the state of play? |
|
@Stoatwblr this issue is waiting for a developer with enough available time and interest to tackle it. I agree it would be great to have. |
|
This is also a requirement for running the shake defragmenter. |
|
@behlendorf, I saw your FIEMAP project was listed as the runner-up project for the ZFS DevCon Hackathon. Awesome that you had the opportunity to work on this. I'd be interested to review the patch when it is available, and would be interested to discuss how to handle ditto blocks with FIEMAP, if you haven't already implemented this. It would also be very useful to add FIEMAP support to Lustre osd-zfs so that it can export this information to the client. |
|
@adilger thanks! I don't have any code ready to share just yet, but I hope to fairly soon. When I have something I'd love to get your input. The OpenZFS summit was a good opportunity to give some careful thought to how best to go about this. I was happy to see the FIEMAP interface already provides most of what's needed. Ditto blocks are one exception, as you mentioned; another concern is including the block device in the extent. |
|
For returning the block device in the extent, I would suggest using the same mechanism as Lustre does for returning the OST index to the caller. We return the OST index (__u32) in a reserved field: #define fe_device fe_reserved[0] that hasn't yet been reserved in the upstream kernel, though I should probably do that at some point. For ZFS it could either return the actual __u32 block device number ( The Lustre-patched For handling ditto blocks, the |
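As a rough illustration of the reserved-field approach, the sketch below aliases `fe_device` onto `fe_reserved[0]` of the upstream `struct fiemap_extent` and stores a block device packed as (major << 16 | minor), the encoding floated earlier in the thread. The helper names (`pack_dev()`, `dev_major()`, `dev_minor()`, `extent_set_device()`) are hypothetical, and limiting major and minor to 16 bits each is an assumption of this encoding, not a kernel rule.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/fiemap.h>

/* Lustre's alias as quoted above; it is not in the upstream headers. */
#ifndef fe_device
#define	fe_device	fe_reserved[0]
#endif

/* Pack a block device as (major << 16 | minor); each must fit in 16 bits. */
static uint32_t
pack_dev(uint32_t major, uint32_t minor)
{
	assert(major <= 0xffff && minor <= 0xffff);
	return ((major << 16) | minor);
}

static uint32_t dev_major(uint32_t dev) { return (dev >> 16); }
static uint32_t dev_minor(uint32_t dev) { return (dev & 0xffff); }

/* Record the device for one extent in the reserved field. */
static void
extent_set_device(struct fiemap_extent *fe, uint32_t major, uint32_t minor)
{
	fe->fe_device = pack_dev(major, minor);
}
```

An index value (as Lustre uses) or 32 bits of a vdev GUID would go into the same field; only the interpretation on the caller's side changes.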
|
@adilger that all makes good sense and I generally agree.
It would be great to get this reserved in the kernel. When looking at the existing interfaces I was surprised btrfs hasn't already done this. My feeling here is the
That's handy! Although if we're reporting the
The full vdev guid could additionally be returned in
What are your thoughts on adding a new
Handling mirrors is relatively straightforward since we can return the extent multiple times, once for each device. There are some complications with compression, though; more on that in a minute. As for RAIDZ and DRAID, in order to return correct physical information we need to add an extent for every vdev spanned by the stripe. That's potentially a large number of small extents, even assuming that we ignore the parity information. Will that cause any problems for
The
When ZFS splits a compressed or encrypted block when writing it, either because it is on RAIDZ or because a gang block is needed, there is no meaningful logical to physical offset mapping for the extent. Given the current interface I'm not sure what those extents should look like. One option would be to add the extent multiple times, with each entry referencing a different physical offset which includes a partial portion of the block. An alternative to all of this would be to report the extent from the perspective of the top-level RAIDZ vdev. That would push this issue up to the caller, but my feeling is that wouldn't be particularly useful since it would be difficult at best for them to calculate the physical offsets. |
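The mirror case described above, where the same logical extent is returned once per device, could be sketched as follows. Everything here is hypothetical: `struct copy_loc` and `emit_copies()` are illustration-only names, and the device ID is assumed to live in `fe_reserved[0]`. The split-block RAIDZ and gang cases are deliberately not covered.

```c
#include <stddef.h>
#include <string.h>
#include <linux/fiemap.h>

/* Hypothetical description of one mirror child's copy of a block. */
struct copy_loc {
	__u64 physical;		/* byte offset on the child device */
	__u32 device;		/* device ID, stored in fe_reserved[0] */
};

/*
 * Emit one extent per copy, all sharing the same logical range.
 * Returns the number of extents written to out (at most max).
 */
static size_t
emit_copies(__u64 logical, __u64 length, const struct copy_loc *copies,
    size_t ncopies, struct fiemap_extent *out, size_t max)
{
	size_t i;

	for (i = 0; i < ncopies && i < max; i++) {
		memset(&out[i], 0, sizeof (out[i]));
		out[i].fe_logical = logical;
		out[i].fe_physical = copies[i].physical;
		out[i].fe_length = length;
		out[i].fe_reserved[0] = copies[i].device;
	}
	return (i);
}
```

A caller that only wants one mapping per logical range has to skip entries whose `fe_logical` repeats, which is one reason reporting copies probably wants to be opt-in.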
|
@behlendorf see http://linux-fsdevel.vger.kernel.narkive.com/S8u3GLaY/patch-0-6-v5-fiemap-introduce-data-compressed-and-phys-length-flags for details on how compressed extents should be handled. The reserved64[0] field would become the physical length of the extent, and the existing length field would be the logical length. There should be a new extent flag for compressed blocks, but it makes sense to always fill in the physical length even if the blocks are not compressed. Ideally, that patch series could be refreshed and submitted upstream, to ensure that the flags/fields are fixed for the future. |
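To illustrate the proposed split between logical and physical length, here is a small sketch that treats `fe_reserved64[0]` as the physical length. The alias name `fe_phys_length` and the helper `extent_bytes_saved()` are assumptions for this example; neither exists in the upstream `linux/fiemap.h`, and treating 0 as "not filled in" is a choice made here for older kernels that leave the reserved field zeroed.

```c
#include <linux/fiemap.h>

/* Proposed (not upstream) use of fe_reserved64[0] as the physical length. */
#define	fe_phys_length	fe_reserved64[0]

/*
 * Bytes saved by compression for one extent: logical length minus
 * physical length.  A physical length of 0 is treated as "not filled
 * in", and an extent that did not shrink saves nothing.
 */
static __u64
extent_bytes_saved(const struct fiemap_extent *fe)
{
	if (fe->fe_phys_length == 0 || fe->fe_phys_length >= fe->fe_length)
		return (0);
	return (fe->fe_length - fe->fe_phys_length);
}
```

Because the physical length is filled in even for uncompressed extents, a caller can sum it across all extents to get on-disk usage without checking the compressed flag first.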
|
For all the ones who just want to display file fragmentation... it's not complete, and I'm not 100% sure it even works, but the script I wrote for #7110 might nevertheless be a good start. |
|
Brian, any news on this? Would it be possible to post a WIP/RFC patch so that it can be reviewed and (maybe) someone else can work on it? |
|
Nothing new to add. I wish there was a patch worth opening a PR for. But the prototype version I had was very basic having been thrown together for a hackathon. |
The FIEMAP ioctl is the standard Linux user space mechanism for inspecting the physical layout of a file on disk. The ioctl returns a list of extents, each of which describes a region of the file. Compatible blocks are merged into larger extents when they are physically contiguous and have identical flags.

The following per-extent flags are supported by ZFS.

  FIEMAP_EXTENT_LAST           - The last extent in the mapping
  FIEMAP_EXTENT_UNKNOWN        - Set on gang blocks
  FIEMAP_EXTENT_DELALLOC       - Dirty extents not yet written
  FIEMAP_EXTENT_ENCODED        - Set on compressed extents
  FIEMAP_EXTENT_DATA_ENCRYPTED - Set on encrypted extents
  FIEMAP_EXTENT_NOT_ALIGNED    - Extent is not block aligned
  FIEMAP_EXTENT_DATA_INLINE    - Set on embedded block pointers
  FIEMAP_EXTENT_UNWRITTEN      - Set on holes (normally not reported)
  FIEMAP_EXTENT_MERGED         - Multiple block pointers were merged
  FIEMAP_EXTENT_SHARED         - Set on deduplicated extents

The following flags are supported when requesting extents. Note that *_COPIES, *_NOMERGE, and *_HOLES are currently specific to ZFS but are applicable to other Linux file systems.

  FIEMAP_FLAG_SYNC    - Sync the file first
  FIEMAP_FLAG_COPIES  - Include all copies of an extent
  FIEMAP_FLAG_NOMERGE - Don't merge blocks into extents
  FIEMAP_FLAG_HOLES   - Include unwritten holes as an extent

Finally, the following reserved fields in the fiemap_extent structure have been utilized and should be reserved to prevent future conflicts.

  .fe_reserved[0]   - Unique device ID; top-level VDEV ID for ZFS
  .fe_reserved64[0] - Physical length of an extent

Future work:

* The FIEMAP_FLAG_XATTR flag is not supported. Implementing this should be relatively straightforward if it is needed. This would be a handy way to determine if the xattrs have been stored in the dnode, spill block, or an external object.

* The FIEMAP_FLAG_CACHE flag is not supported. We rely on the ARC to keep the right blocks cached.

* The lseek(2) SEEK_HOLE and SEEK_DATA flags still rely on the dmu_offset_next() function. The zpl_llseek() function can be updated to use FIEMAP to quickly determine the next extent. This has the advantage of correctly accounting for newly dirtied or freed blocks without forcing a txg sync.

The FIEMAP interface is fully described at:

* https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#264
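The merging rule in the commit message (physically contiguous blocks with identical flags become one extent) can be sketched as below. This is not the code from the patch: `merge_extents()` is a hypothetical helper that also requires logical contiguity, which seems implied, and it does not set FIEMAP_EXTENT_MERGED because that flag is not in the upstream headers.

```c
#include <stddef.h>
#include <linux/fiemap.h>

/*
 * Merge neighbouring extents in place when the second starts exactly
 * where the first ends, both logically and physically, and the flags
 * match.  Returns the new number of extents.
 */
static size_t
merge_extents(struct fiemap_extent *ext, size_t n)
{
	size_t i, out = 0;

	for (i = 1; i < n; i++) {
		struct fiemap_extent *prev = &ext[out];
		const struct fiemap_extent *cur = &ext[i];

		if (prev->fe_flags == cur->fe_flags &&
		    prev->fe_logical + prev->fe_length == cur->fe_logical &&
		    prev->fe_physical + prev->fe_length == cur->fe_physical) {
			prev->fe_length += cur->fe_length;	/* extend prev */
		} else {
			ext[++out] = *cur;	/* start a new extent */
		}
	}
	return (n == 0 ? 0 : out + 1);
}
```

Note that an extent carrying FIEMAP_EXTENT_LAST never merges with its predecessor under this strict flag comparison. A real implementation would presumably set that flag after merging, and would skip this step entirely when the caller passes FIEMAP_FLAG_NOMERGE.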
Signed-off-by: Pawan <pawan@mayadata.io>
…ge generated by test_connectivity (openzfs#264)
We haven't yet integrated ZFS with filefrag on Linux, but this is certainly something which is worth looking into. My initial understanding is that to make this work we would just need to implement the FIEMAP ioctl. However, filefrag is basically an e2fsprogs utility, so it may not be the right tool for the job.