Support filefrag #264

Open
behlendorf opened this issue Jun 6, 2011 · 16 comments
Labels
Type: Feature Feature request or new feature

Comments

@behlendorf
Contributor

We haven't yet integrated ZFS with filefrag on Linux, but this is certainly something worth looking into. My initial understanding is that to make this work we would just need to implement the FIEMAP ioctl. That said, filefrag is basically an e2fsprogs utility, so it may not be the right tool for the job.

@adilger
Contributor

adilger commented Jul 29, 2011

Brian, most Linux filesystems support FIEMAP today, including ext3/4 and Lustre, since it was originally based on a similar XFS ioctl, and has been adopted by other filesystems since then. While "filefrag" is packaged with e2fsprogs, it is a candidate to move into util-linux-ng at some point (along with lsattr/chattr), as have other e2fsprogs-developed libraries/tools like libblkid, libuuid, libcom_err, fsck, etc.

The other alternative to FIEMAP is FIBMAP, but that ioctl is limited to root, and it only allows returning a single block at a time. Not only is FIBMAP very inefficient for large files, but it also lacks features that are nice to have on modern filesystems, such as xattr mapping, unwritten extents, delalloc blocks, etc.
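To make the contrast with FIBMAP concrete, here is a minimal sketch of a FIEMAP call that asks only for the number of extents in a file. It uses the upstream `FS_IOC_FIEMAP` ioctl and `struct fiemap` from the kernel headers; the helper name `count_extents` is made up for illustration, and whether the call succeeds depends on the filesystem implementing FIEMAP.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, FIEMAP_* constants */

/*
 * Hypothetical helper: return the number of extents mapped for the
 * whole file, or -1 if the file cannot be opened or the filesystem
 * rejects the ioctl.
 */
long count_extents(const char *path)
{
    struct fiemap fm;
    int fd, rc;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;  /* map the entire file */
    fm.fm_flags = FIEMAP_FLAG_SYNC;    /* flush dirty data first */
    fm.fm_extent_count = 0;            /* 0 = only report the count */

    rc = ioctl(fd, FS_IOC_FIEMAP, &fm);
    close(fd);
    if (rc < 0)
        return -1;

    return (long)fm.fm_mapped_extents;
}
```

A single call covers the whole file and needs no special privileges, whereas a FIBMAP-based loop would issue one ioctl per block.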

@adilger
Contributor

adilger commented Nov 4, 2011

I would recommend starting by looking at __generic_block_fiemap() in the kernel to get an idea of what needs to be done on the FIEMAP side of the code. Note that very new kernels also support the SEEK_HOLE and SEEK_DATA arguments to llseek() (see generic_file_llseek()). For ZFS blocks, the implementation should set FIEMAP_EXTENT_MERGED on merged ranges of blocks.
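For readers unfamiliar with SEEK_HOLE and SEEK_DATA, the following sketch shows how user space can walk the data regions of a file with them. The function name `count_data_segments` is an assumption made for this example, and the fallback definitions of the two constants use the Linux values in case the C library headers do not expose them.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the libc headers did not define them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Hypothetical helper: count the contiguous data regions in an open
 * file by alternating SEEK_DATA and SEEK_HOLE. Returns -1 on error.
 */
long count_data_segments(int fd)
{
    long segments = 0;
    off_t pos = 0;

    for (;;) {
        off_t data, hole;

        data = lseek(fd, pos, SEEK_DATA);
        if (data < 0) {
            if (errno == ENXIO)  /* no more data past pos */
                break;
            return -1;
        }

        hole = lseek(fd, data, SEEK_HOLE);
        if (hole <= data)        /* error, or no forward progress */
            return -1;

        segments++;
        pos = hole;
    }
    return segments;
}
```

On filesystems without real hole tracking, the generic fallback reports the whole file as a single data region, so the result is a lower bound on fragmentation rather than a layout report.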

Lustre uses a patched version of filefrag which allows returning the underlying device for each extent (fe_device), because the file is not located on a single LUN. For Lustre this is an index value (0-N). For ZFS one might either return the Linux block device (major << 16 | minor) or 32 bits of the VDEV GUID or similar.

struct fiemap_extent {
        __u64 fe_logical;  /* logical offset in bytes for the start of
                            * the extent from the beginning of the file */
        __u64 fe_physical; /* physical offset in bytes for the start
                            * of the extent from the beginning of the disk */
        __u64 fe_length;   /* length in bytes for this extent */
        __u64 fe_reserved64[2];
        __u32 fe_flags;    /* FIEMAP_EXTENT_* flags for this extent */
        __u32 fe_device;   /* device number */
        __u32 fe_reserved[2];
};

One of the fe_reserved64[2] fields was (in my mind at least) reserved for returning the actual length of the extent, for compressed blocks. I haven't given a lot of thought to whether the existing fe_length should be treated as the physical length or the logical length (currently both are the same), so ideas are welcome. If anyone goes down that road, please send me a patch for e2fsprogs filefrag so I can submit it upstream.

As for the ZFS side, I'd recommend looking at the Solaris ZPL layer to see how they implement SEEK_HOLE and SEEK_DATA traversal of the blocks in the dnode.

Don't forget to report dirty pages (ARC buffers) in memory; otherwise, copying a file that was just written will result in an empty copy. Even though these buffers do not have blocks allocated yet, the implementation should set FIEMAP_EXTENT_DELALLOC on them to indicate that the space is in use.

@behlendorf behlendorf removed this from the 0.7.0 milestone Oct 3, 2014
@adilger
Contributor

adilger commented Aug 21, 2015

The most recent (but as yet unlanded) patches for handling compressed extents in FIEMAP are at https://lwn.net/Articles/607552/ and http://linux-fsdevel.vger.kernel.narkive.com/S8u3GLaY/patch-0-6-v5-fiemap-introduce-data-compressed-and-phys-length-flags

@Stoatwblr

I'm the one who originally requested this on the zfsonlinux mailing list, in order to generate fragmentation reports.

Knowing the actual extent of fragmentation is fairly important given how badly ZFS performance falls over when fragmentation becomes widespread. (Short version: it's pretty ugly. You don't want to experience it.)

Brian, what's the state of play?

@behlendorf
Contributor Author

@Stoatwblr this issue is waiting for a developer with enough available time and interest to tackle it. I agree it would be great to have.

@Baughn

Baughn commented Jul 8, 2016

This is also a requirement for running the shake defragmenter.

@adilger
Contributor

adilger commented Oct 31, 2017

@behlendorf, I saw your FIEMAP project was listed as the runner-up project for the ZFS DevCon Hackathon. Awesome that you had the opportunity to work on this. I'd be interested to review the patch when it is available, and would be interested to discuss how to handle ditto blocks with FIEMAP, if you haven't already implemented this.

It would also be very useful to add FIEMAP support to Lustre osd-zfs so that it can export this information to the client.

@behlendorf
Contributor Author

@adilger thanks! So I don't have any code ready to share just yet but I hope to fairly soon. When I have something I'd love to get your input. The OpenZFS summit was a good opportunity to give some careful thought as to how to best go about this.

I was happy to see the FIEMAP interface already provides most of what's needed. Ditto blocks are one exception you mentioned; another concern is including the block device in the extent.

@adilger
Contributor

adilger commented Nov 21, 2017

For returning the block device in the extent, I would suggest using the same mechanism that Lustre uses for returning the OST index to the caller. We return the OST index (__u32) in a reserved field:

#define fe_device fe_reserved[0]

that hasn't yet been reserved in the upstream kernel, though I should probably do that at some point. For ZFS it could either return the actual __u32 block device number (kdev_t, which would make it easier for users to locate), or it could e.g. return the low 32 bits of the VDEV GUID (which would be consistent across runs, but need another level of lookup to resolve to a specific disk).
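As a rough illustration of how a tool might consume that field, the sketch below applies the Lustre-style `fe_device` alias to the upstream `struct fiemap_extent` and counts the extents that live on one device. The alias is not part of the upstream header, and the helper name `extents_on_device` is made up for this example.

```c
#include <stddef.h>
#include <linux/fiemap.h>

/*
 * Lustre convention quoted above. Upstream linux/fiemap.h does not
 * define this alias, so the field is still formally "reserved".
 */
#ifndef fe_device
#define fe_device fe_reserved[0]
#endif

/*
 * Hypothetical helper: count how many of the n extents report the
 * given device number (an OST index on Lustre; a kdev_t or VDEV
 * identifier in the ZFS proposals discussed here).
 */
size_t extents_on_device(const struct fiemap_extent *ext, size_t n,
                         unsigned int dev)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (ext[i].fe_device == dev)
            count++;
    }
    return count;
}
```

Kernels and filesystems that do not fill in the reserved field leave it as zero, so a caller cannot distinguish "device 0" from "not reported" without knowing which filesystem it is talking to.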

The Lustre-patched filefrag also passes a flag FIEMAP_FLAG_DEVICE_ORDER=0x40000000 to indicate that the lower layers should return the extents in device order (i.e. all of OST0000 first, then OST0001, or wherever the stripes are located) so that they are not interleaved across devices at every stripe_size boundary.

$ filefrag -v /myth/tmp/2stripe
Filesystem type is: bd00bd0
File size of /myth/tmp/2stripe is 20971520 (20480 blocks of 1024 bytes)
 ext:     device_logical:        physical_offset: length:  dev: flags:
   0:        0..   10239: 3844867072..3844877311:  10240: 0002: net
   1:        0..    4095: 2938242048..2938246143:   4096: 0000: net
   2:     4096..    8191: 2938363904..2938367999:   4096: 0000: net
   3:     8192..   10239: 2938380288..2938382335:   2048: 0000: last,net

For handling ditto blocks, the FIEMAP_FLAG_DEVICE_ORDER option would also segregate ditto block copies between multiple VDEVs, and each ditto copy of a block would be returned with the same logical file offset. For display by filefrag this is fine, and most tools that use FIEMAP only care whether there is data at a given offset, so reporting the same logical offset two or three times should be fine.
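One consequence for tools that add up extent lengths: if the same logical range is reported once per ditto copy, a naive sum would double count. A minimal sketch of one way to cope, assuming the caller already holds the extents in an array, is to sort by logical offset and count only the bytes not covered by an earlier extent. The function name `logical_bytes_covered` is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <linux/fiemap.h>

/* qsort comparator: order extents by logical file offset. */
static int cmp_logical(const void *a, const void *b)
{
    const struct fiemap_extent *x = a;
    const struct fiemap_extent *y = b;

    if (x->fe_logical < y->fe_logical)
        return -1;
    if (x->fe_logical > y->fe_logical)
        return 1;
    return 0;
}

/*
 * Hypothetical helper: number of distinct logical bytes covered by
 * the extents, counting a range once even if several copies of it
 * were reported. Sorts the array in place.
 */
uint64_t logical_bytes_covered(struct fiemap_extent *ext, size_t n)
{
    uint64_t total = 0;
    uint64_t end = 0;  /* end of the logical range counted so far */

    qsort(ext, n, sizeof(*ext), cmp_logical);

    for (size_t i = 0; i < n; i++) {
        uint64_t lo = ext[i].fe_logical;
        uint64_t hi = lo + ext[i].fe_length;

        if (hi <= end)  /* fully covered by earlier extents */
            continue;
        if (lo < end)   /* partially covered: count only the tail */
            lo = end;
        total += hi - lo;
        end = hi;
    }
    return total;
}
```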

@behlendorf
Contributor Author

@adilger that all makes good sense and I generally agree.

> We return the OST index (__u32) in a reserved field:

It would be great to get this reserved in the kernel. When looking at the existing interfaces I was surprised btrfs hasn't already done this. My feeling here is the kdev_t would be the most useful thing to return. One potential gotcha I see with this approach is we'll need to return some reserved value when the device is faulted or missing.

> The Lustre-patched filefrag also passes a flag FIEMAP_FLAG_DEVICE_ORDER=0x40000000

That's handy! Although if we're reporting the kdev_t as the device, then what exactly is device order? Ordering numerically by kdev_t value won't match the zpool status ordering. But you're right, it would still prevent interleaving, which is probably sufficient.

> or it could e.g. return the low 32 bits of the VDEV GUID

The full vdev guid could additionally be returned in fe_reserved64[0] if there's a legitimate use for it. I don't think we want to be reporting truncated GUIDs which are very likely, but not guaranteed to be, unique.

ditto blocks

What are your thoughts on adding a new FIEMAP_EXTENT_DATA_DUPLICATE flag, which would be set for all extents that have multiple copies?

Mirrors and RAIDZ

Handling mirrors is relatively straightforward since we can return the extent multiple times, once for each device. There are some complications with compression, though; more on that in a minute.

As for RAIDZ and DRAID, in order to return correct physical information we need to add an extent for every vdev spanned by the stripe. That's potentially a large number of small extents, even assuming that we ignore the parity information. Will that cause any problems for filefrag?

Compression

The fiemap_extent structure only includes a single value for length, fe_length, which implies the logical and physical lengths are the same. With compression this simply isn't true. Is there an existing way to handle this that I'm not seeing?

Splitting blocks.

When ZFS splits a compressed or encrypted block while writing it, either because it's on RAIDZ or because a gang block is needed, there is no meaningful logical-to-physical offset mapping for the extent. Given the current interface, I'm not sure what those extents should look like. One option would be to add the extent multiple times, with each entry referencing a different physical offset that holds part of the block.

An alternative to all of this would be to report the extent from the perspective of the top-level RAIDZ vdev. That would push this issue up to the caller, but my feeling is that wouldn't be particularly useful, since it would be difficult at best for them to calculate the physical offsets.

@adilger
Contributor

adilger commented Nov 28, 2017

@behlendorf see http://linux-fsdevel.vger.kernel.narkive.com/S8u3GLaY/patch-0-6-v5-fiemap-introduce-data-compressed-and-phys-length-flags for details on how compressed extents should be handled. The reserved64[0] field would become the physical length of the extent, and the existing length field would be the logical length. There should be a new extent flag for compressed blocks, but it makes sense to always fill in the physical length even if the blocks are not compressed.
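To illustrate the proposed layout, here is a small sketch that derives a compression ratio from one extent, treating `fe_length` as the logical length and `fe_reserved64[0]` as the physical length. This follows the unlanded proposal described above and is not upstream behaviour; the alias `fe_phys_length` and the helper name are assumptions made for this example.

```c
#include <linux/fiemap.h>

/*
 * Assumed alias for the proposed "physical length" field. Upstream
 * kernels leave fe_reserved64[0] unused (zero).
 */
#define fe_phys_length fe_reserved64[0]

/*
 * Hypothetical helper: logical/physical size ratio for one extent.
 * Returns 0.0 when no physical length was reported.
 */
double extent_compress_ratio(const struct fiemap_extent *e)
{
    if (e->fe_phys_length == 0)
        return 0.0;
    return (double)e->fe_length / (double)e->fe_phys_length;
}
```

For example, a 128 KiB logical extent stored in 32 KiB on disk would yield a ratio of 4.0, while an uncompressed extent with both lengths filled in would yield 1.0.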

Ideally, that patch series could be refreshed and submitted upstream, to ensure that the flags/fields are fixed for the future.

@Baughn

Baughn commented Jan 31, 2018

For all the ones who just want to display file fragmentation... it's not complete, and I'm not 100% sure it even works, but the script I wrote for #7110 might nevertheless be a good start.

@adilger
Contributor

adilger commented Apr 11, 2018

Brian, any news on this? Would it be possible to post a WIP/RFC patch so that it can be reviewed and (maybe) someone else can work on it?

@behlendorf
Contributor Author

Nothing new to add. I wish there were a patch worth opening a PR for, but the prototype version I had was very basic, having been thrown together for a hackathon.

behlendorf added a commit to behlendorf/zfs that referenced this issue May 17, 2018
The FIEMAP ioctl is the standard Linux user space mechanism for
inspecting physical layout of a file on disk.  The ioctl returns
a list of extents each of which describes a region of the file.
Compatible blocks are merged in to larger extents when they are
physically contiguous and have identical flags.

The following per-extent flags are supported by ZFS.

  FIEMAP_EXTENT_LAST           - The last extent in the mapping
  FIEMAP_EXTENT_UNKNOWN        - Set on gang blocks
  FIEMAP_EXTENT_DELALLOC       - Dirty extents not yet written
  FIEMAP_EXTENT_ENCODED        - Set on compressed extents
  FIEMAP_EXTENT_DATA_ENCRYPTED - Set on encrypted extents
  FIEMAP_EXTENT_NOT_ALIGNED    - Extent is not block aligned
  FIEMAP_EXTENT_DATA_INLINE    - Set on embedded block pointers
  FIEMAP_EXTENT_UNWRITTEN      - Set on holes (normally not reported)
  FIEMAP_EXTENT_MERGED         - Multiple block pointers were merged
  FIEMAP_EXTENT_SHARED         - Set on deduplicated extents

The following flags are supported when requesting extents.  Note
that *_COPIES, *_NOMERGE, and *_HOLES are currently specific to
ZFS but are applicable to other Linux file systems.

  FIEMAP_FLAG_SYNC             - Sync the file first
  FIEMAP_FLAG_COPIES           - Include all copies of an extent
  FIEMAP_FLAG_NOMERGE          - Don't merge blocks in to extents
  FIEMAP_FLAG_HOLES            - Include unwritten holes as an extent

Finally, the following reserved fields in the fiemap_extent structure
have been utilized and should be reserved to prevent future conflicts.

  .fe_reserved[0]   - Unique device ID; top-level VDEV ID for ZFS
  .fe_reserved64[0] - Physical length of an extent

Future work:

* The FIEMAP_FLAG_XATTR flag is not supported.  Implementing this
  should be relatively straight forward if it is needed.  This would
  be a handy way to determine if the xattrs have been stored in the
  dnode, spill block, or an external object.

* The FIEMAP_FLAG_CACHE flag is not supported.  We rely on the ARC
  to keep the right blocks cached.

* The lseek(2) SEEK_HOLE and SEEK_DATA flags still rely on the
  dmu_offset_next() function.  The zpl_llseek() function can be
  updated to use FIEMAP to quickly determine the next extent.
  This has the advantage of correctly accounting for newly dirtied
  or freed blocks without forcing a txg sync.

The FIEMAP interface is fully described at:

* https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#264
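The merging rule in the commit message above (blocks are merged when they are physically contiguous and have identical flags) can be sketched in user-space terms as follows. This is not the code from the patch; it is a minimal illustration that assumes the extents are already sorted by logical offset, that logical and physical lengths are equal, and that FIEMAP_EXTENT_MERGED itself is ignored when comparing flags. The function name `merge_extents` is hypothetical.

```c
#include <stddef.h>
#include <linux/fiemap.h>

/*
 * Hypothetical helper: merge adjacent extents in place and return
 * the new extent count. Two neighbours are merged only if they are
 * contiguous both logically and physically and carry the same flags
 * (FIEMAP_EXTENT_MERGED excluded from the comparison, an assumption
 * of this sketch).
 */
size_t merge_extents(struct fiemap_extent *ext, size_t n)
{
    size_t out = 0;  /* index of the extent currently being grown */

    if (n == 0)
        return 0;

    for (size_t i = 1; i < n; i++) {
        struct fiemap_extent *prev = &ext[out];
        const struct fiemap_extent *cur = &ext[i];
        int same_flags =
            (prev->fe_flags & ~FIEMAP_EXTENT_MERGED) ==
            (cur->fe_flags & ~FIEMAP_EXTENT_MERGED);
        int logical_contig =
            prev->fe_logical + prev->fe_length == cur->fe_logical;
        int physical_contig =
            prev->fe_physical + prev->fe_length == cur->fe_physical;

        if (same_flags && logical_contig && physical_contig) {
            prev->fe_length += cur->fe_length;
            prev->fe_flags |= FIEMAP_EXTENT_MERGED;
        } else {
            ext[++out] = *cur;  /* start a new output extent */
        }
    }
    return out + 1;
}
```

Because the final extent carries FIEMAP_EXTENT_LAST, its flags differ from its predecessor's and it is never merged under this rule; a real implementation might handle that case differently.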
@behlendorf behlendorf mentioned this issue May 18, 2018
13 tasks
@behlendorf
Contributor Author

@adilger I've opened #7545 with a proper FIEMAP implementation. If you have any time to review or test it would be appreciated!
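For anyone testing such an implementation, a small decoder for the per-extent flags listed in the commit message can make raw `fe_flags` values easier to read. The sketch below covers only the flags that exist in the upstream `linux/fiemap.h`; the short names and the function `extent_flags_to_str` are made up for this example and are not the strings filefrag prints.

```c
#include <stdio.h>
#include <stddef.h>
#include <linux/fiemap.h>

struct flag_name {
    unsigned int bit;
    const char *name;
};

/* Upstream per-extent flags; the short names are illustrative only. */
static const struct flag_name extent_flag_names[] = {
    { FIEMAP_EXTENT_LAST,           "last" },
    { FIEMAP_EXTENT_UNKNOWN,        "unknown" },
    { FIEMAP_EXTENT_DELALLOC,       "delalloc" },
    { FIEMAP_EXTENT_ENCODED,        "encoded" },
    { FIEMAP_EXTENT_DATA_ENCRYPTED, "encrypted" },
    { FIEMAP_EXTENT_NOT_ALIGNED,    "not_aligned" },
    { FIEMAP_EXTENT_DATA_INLINE,    "inline" },
    { FIEMAP_EXTENT_UNWRITTEN,      "unwritten" },
    { FIEMAP_EXTENT_MERGED,         "merged" },
    { FIEMAP_EXTENT_SHARED,         "shared" },
};

/*
 * Hypothetical helper: write a comma-separated list of the flag
 * names set in `flags` into buf (always NUL-terminated when len > 0)
 * and return buf. Output is cut short if buf is too small.
 */
char *extent_flags_to_str(unsigned int flags, char *buf, size_t len)
{
    size_t used = 0;
    size_t count = sizeof(extent_flag_names) / sizeof(extent_flag_names[0]);

    if (len == 0)
        return buf;
    buf[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        int n;

        if (!(flags & extent_flag_names[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     used ? "," : "", extent_flag_names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* out of space: stop with what fits */
        used += (size_t)n;
    }
    return buf;
}
```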

rohan-puri pushed a commit to rohan-puri/zfs that referenced this issue Nov 5, 2019
rohan-puri pushed a commit to rohan-puri/zfs that referenced this issue Nov 9, 2019
fuhrmannb pushed a commit to fuhrmannb/cstor that referenced this issue Nov 3, 2020
Signed-off-by: Pawan <pawan@mayadata.io>
pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Apr 20, 2021
sdimitro pushed a commit to sdimitro/zfs that referenced this issue May 23, 2022
5 participants
@Baughn @behlendorf @adilger @Stoatwblr and others