-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FIEMAP support #7545
Add FIEMAP support #7545
Conversation
The FIEMAP ioctl is the standard Linux user space mechanism for inspecting physical layout of a file on disk. The ioctl returns a list of extents each of which describes a region of the file. Compatible blocks are merged in to larger extents when they are physically contiguous and have identical flags. The following per-extent flags are supported by ZFS. FIEMAP_EXTENT_LAST - The last extent in the mapping FIEMAP_EXTENT_UNKNOWN - Set on gang blocks FIEMAP_EXTENT_DELALLOC - Dirty extents not yet written FIEMAP_EXTENT_ENCODED - Set on compressed extents FIEMAP_EXTENT_DATA_ENCRYPTED - Set on encrypted extents FIEMAP_EXTENT_NOT_ALIGNED - Extent is not block aligned FIEMAP_EXTENT_DATA_INLINE - Set on embedded block pointers FIEMAP_EXTENT_UNWRITTEN - Set on holes (normally not reported) FIEMAP_EXTENT_MERGED - Multiple block pointers were merged FIEMAP_EXTENT_SHARED - Set on deduplicated extents The following flags are supported when requesting extents. Note that *_COPIES, *_NOMERGE, and *_HOLE are currently specific to ZFS but are applicable to other Linux file systems. FIEMAP_FLAG_SYNC - Sync the file first FIEMAP_FLAG_COPIES - Include all copies of an extent FIEMAP_FLAG_NOMERGE - Don't merge blocks in to extents FIEMAP_FLAG_HOLES - Include unwritten holes as an extent Finally, the following reserved fields in the fiemap_extent structure have been utilized and should be reserved to prevent future conflicts. .fe_reserved[0] - Unique device ID; top-level VDEV ID for ZFS .fe_reserved64[0] - Physical length of an extent Future work: * The FIEMAP_FLAG_XATTR flag is not supported. Implementing this should be relatively straight forward if it is needed. This would be a handy way to determine if the xattrs have been stored in the dnode, spill block, or an external object. * The FIEMAP_FLAG_CACHE flag is not supported. We rely on the ARC to keep the right blocks cached. * The lseek(2) SEEK_HOLE and SEEK_DATA flags still rely on the dmu_offset_next() function. The zpl_llseek() function can be updated to use FIEMAP to quickly determine the next extent. This has the advantage of correctly accounting for newly dirtied or freed blocks without forcing a txg sync. The FIEMAP interface is fully described at: * https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#264
Codecov Report
@@ Coverage Diff @@
## master #7545 +/- ##
==========================================
- Coverage 77.42% 77.41% -0.01%
==========================================
Files 336 339 +3
Lines 107529 108468 +939
==========================================
+ Hits 83251 83975 +724
- Misses 24278 24493 +215
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a native speaker but to me the use of 'in to' made it hard to
understand the documentation and with 'into' it would be more clear to me. Feel
free to ignore if I'm wrong or misunderstood.
Great to see this feature coming, btw! 🎉
* - FIEMAP_FLAG_SYNC - Sync extents before reporting | ||
* - FIEMAP_FLAG_XATTR - Report extents used by xattrs (Unsupported) | ||
* - FIEMAP_FLAG_COPIES - Report all data copies (ZFS-only) | ||
* - FIEMAP_FLAG_NOMERGE - Never merge blocks in to extents (ZFS-only) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in to -> into?
fiemap_remove | ||
|
||
# While the data is dirty only a single delalloc extent should be | ||
# reported. Once the dirty data is converted in to holes it should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in to -> into?
fiemap_verify -h -F "unwritten:all" | ||
fiemap_remove | ||
|
||
# Verify "merged" is set on blocks merged in to extents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in to -> into?
# Since it's unlikely that all the blocks can be merged due their | ||
# physical offset and device id a large number are written and only | ||
# a single merge is required. | ||
log_note "Verify 'merged' is set on data blocks merged in to extents" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in to -> into?
fiemap_verify -s -F "merged:>0" | ||
fiemap_remove | ||
|
||
# Verify "merged" is set on holes merged in to extents if requested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in to -> into?
|
||
# Verify "merged" is set on holes merged in to extents if requested. | ||
# Holes always merge so there must be only one. | ||
log_note "Verify 'merged' is set on holes merged in to extents" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in to -> into?
fiemap_verify -s -H 0:$((BS*8)):1 | ||
fiemap_remove | ||
|
||
# Write blocks in to holes: ...X..XX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in to -> into?
fiemap_remove | ||
|
||
# Write blocks in to holes: ...X..XX | ||
log_note "Write blocks in to holes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in to -> into?
#endif | ||
|
||
/* | ||
* Request that each block be reported and not merged in to an extent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in to -> into?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix
{ | ||
zfs_fiemap_entry_t *fe; | ||
|
||
fe = kmem_zalloc(sizeof (zfs_fiemap_entry_t), KM_SLEEP); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check fe != NULL
I wrote a small fuzz tester for the ioctl here. After hundreds of millions of iterations it never once crashed the kernel! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIEMAP_EXTENT_NOT_ALIGNED - Extent is not block aligned
What does that mean? I would think that we are reporting on ZFS blocks, which are by definition ZFS-block-aligned. Is there some other sort of block at play here?
Note that *_COPIES, *_NOMERGE, and *_HOLE are currently specific to ZFS but are applicable to other Linux file systems.
Does that mean that we need to get these flags merged into the linux kernel, otherwise someone else might use the flag value to mean something else?
.fe_reserved[0] - Unique device ID; top-level VDEV ID for ZFS
.fe_reserved64[0] - Physical length of an extent
Likewise for these, what if someone else decides to use these for a different purpose?
FIEMAP_FLAG_SYNC - Sync the file first
I assume that means that we would do a txg_wait_synced() if (and only if) the file is dirty? (looks like this is wrong, from reading the code; at a minimum we should document it more precisely).
The lseek(2) SEEK_HOLE and SEEK_DATA flags still rely on the dmu_offset_next() function. The zpl_llseek() function can be updated to use FIEMAP to quickly determine the next extent. This has the advantage of correctly accounting for newly dirtied or freed blocks without forcing a txg sync.
That would be awesome!
|
||
/* | ||
* The number of blocks times the block size may exceed the file size | ||
* when there are pending dirty blocks which have not be written. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took me a while to parse this... I think you mean:
The fill count reported by FIEMAP (times the block size) may be larger than the file size reported by stat() when there are pending dirty blocks which have not be written.
But I don't understand how that would be the case. There can't be more pending dirty blocks than the size of the file (as reported by stat()). That said, I guess there could be a race condition between the calls to fiemap and stat, where the file size legitimately shrinks between them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it can for this special case where FIEMAP_FLAG_NOMERGE
is passed and we return the fill count plus the pending dirty blocks. The dirty blocks might overwrite an existing block so we over report slightly. This might also be possible when the copies
property is set larger than 1.
Now that's fine since if the actual extents are requested you'll just allocate a slightly larger buffer than needed. The returned extents will be accurate.
*/ | ||
static void | ||
dbuf_add_dirty_map(list_t *list, range_tree_t *dirty_tree, | ||
range_tree_t *free_tree) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make an assertion about the appropriate locks being held in the dnode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
{ | ||
dbuf_dirty_record_t *dr; | ||
|
||
for (dr = list_head(list); dr != NULL; dr = list_next(list, dr)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can declare dr
in the loop:
for (dbuf_dirty_record_t *dr = ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
dmu_buf_impl_t *db = dr->dr_dbuf; | ||
|
||
if (db->db_buf == NULL) | ||
(void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? Seems like it could be pretty heavyweight... and potentially cause locking issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, yes I meant to remove this before pushing.
*/ | ||
boolean_t dirty = B_FALSE; | ||
for (dnode_t *mls_dn = multilist_sublist_head(mls); mls_dn != NULL; | ||
mls_dn = multilist_sublist_next(mls, mls_dn)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be pretty expensive, right? It's O(number of dirty dnodes). Could we use multilist_link_active(), or some other flag specific to the dnode to make this determination?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll see if I can find a better way to do this.
/* | ||
* When FIEMAP_FLAG_NOMERGE is set and the number of extents for the | ||
* entire file are being requested. Then in this special case walking | ||
* the entire indirect block tree is not required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about:
+ * If FIEMAP_FLAG_NOMERGE is set and the number of extents for the
+ * entire file are being requested, this is a special case where walking
+ * the entire indirect block tree is not required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
if (BP_GET_FILL(bp) > 0) { | ||
czb.zb_blkid = i; | ||
error = zfs_fiemap_visit_indirect(spa, dnp, bp, | ||
&czb, zfs_fiemap_cb, (void *)fm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could take arbitrarily long (hours+). No writes will be processed during that time, since we are holding the txg open. This seems like a huge performance problem (and denial-of-service attack). I think we need to rearchitect this to get the info in chunks, dropping locks periodically so that this does not pause the entire storage pool while in progress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The FIEMAP ioctl itself is allowed to return sub-regions of the file if needed (e.g. the extent array is not large enough) and is expected to be re-called until FIEMAP_EXTENT_LAST
is set, similar to read()
or readdir()
needing multiple calls.
* to fully complete in order to avoid using stale block pointers. | ||
*/ | ||
if (dirty_txg > syncing_txg) | ||
txg_wait_synced(spa_get_dsl(spa), syncing_txg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand this... I think it's saying that if the dnode is dirty in a txg after the currently-syncing txg, then we will wait for the currently-syncing txg to finish syncing. Why do we do that?
At a high level, can you explain how we deal with pending changes, in the syncing txg; in the quiescing txg; and in the open txg? It seems like it could be hard to determine what's going on in the syncing txg, since there's a bunch of i/os in flight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll extent the block comment above this function to explain at a high level how this is intended to work and why it's OK.
dn->dn_object, dnp->dn_nlevels - 1, 0); | ||
|
||
for (int i = 0; i < MIN(dnp->dn_nblkptr, fm->fm_copies); i++) { | ||
blkptr_t *bp = &dnp->dn_blkptr[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be running during spa_sync()? If so, we need the proper locks to prevent the dnode_phys from changing (specifically it's BP's). See dmu_buf_lock_parent().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now this is being protected by the dn_mtx
which should prevent spa_sync()
from changing the dnode_phys
. Which is pretty heavy handed and goes to your larger point of this potentially taking a long time. I'll see what can be done tp chunk this up so never end up stalling the entire pool.
/* | ||
* Recursively walk the indirect block tree for a dnode_phys_t and call | ||
* the provided callback for all block pointers traversed. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your idea that since we are walking via the BP's and ARC here, the DMU could be doing whatever it wants while we're in the middle of this? And we are sure that the blocks on disk won't be overwritten because we are holding the txg open so not enough txg's could pass to free and reallocate a block (due to TXG_DEFER_SIZE)? What if !defer_allowed
in metaslab_sync_done()
?
That isn't possible; `kmem_*alloc(KM_SLEEP)` never returns NULL.
…On Fri, May 18, 2018 at 2:09 PM, Tony Hutter ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In module/zfs/zfs_vnops.c
<#7545 (comment)>:
> +
+ return (error);
+}
+
+/*
+ * Allocate and insert a new extent. Duplicates are never allowed so make
+ * sure to clear the range with zfs_fiemap_clear() as needed.
+ */
+static void
+zfs_fiemap_add_impl(avl_tree_t *t, uint64_t logical_start,
+ uint64_t logical_len, uint64_t physical_start, uint64_t physical_len,
+ uint64_t vdev, uint64_t flags)
+{
+ zfs_fiemap_entry_t *fe;
+
+ fe = kmem_zalloc(sizeof (zfs_fiemap_entry_t), KM_SLEEP);
Check fe != NULL
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#7545 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAwxlDDGJSrZL5WmEwM7_szYrDz0siNdks5tzziKgaJpZM4UD8Vf>
.
|
* The following flags have been submitted for inclusion in future | ||
* Linux kernels and the filefrag(8) utility. | ||
*/ | ||
#define fe_device_reserved fe_reserved[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lustre uses this field as fe_device
. It probably makes sense to be consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will fix.
#endif | ||
|
||
/* | ||
* Extent is shared with other space. Introduced in 2.6.33 kernel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, this means that the extent is shared with "another file" for "reflinked" files. It doesn't mean that e.g. the data is shared in the same block as an inode or with an xattr. I don't think ZFS supports per-file reflink yet, but it probably could because it is already COW.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that this flag is specifically reserved for reflinked files? This flag is currently being set on extents which are shared by multiple files due to deduplication. Is that an approriate use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think dedupe files are fine. I think the implementation is OK, just the description here makes it sound like the data extent might be shared with some non-data (e.g. inline file data which should be FIEMAP_EXTENT_DATA_INLINE
, or different parts of the block are shared between different files like FIEMAP_EXTENT_DATA_TAIL
). Maybe something like:
* Extent data is deduplicated/referenced by multiple files. Introduced in 2.6.33 kernel.
* On success, zero is returned, the count argument is set to the number of | ||
* unallocated blocks (holes), and the bs argument is set to the block size | ||
* (if it is not NULL). On error, a non-zero errno is returned and the values | ||
* in count and bs are undefined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this needs to be implemented on non-ZFS filesystems that don't understand FIEMAP_FLAG_NOMERGE
it would be possible to retrieve the full FIEMAP info and then walk the extents to count the allocated blocks (or holes between the logical end of one extent and start of the next). It wouldn't be as efficient as just getting the fill count from ZFS, but would at least give you the "sparseness" of the file.
|
||
if (BP_IS_HOLE(bp)) { | ||
fe->fe_logical_len = fm->fm_block_size; | ||
fe->fe_flags |= FIEMAP_EXTENT_UNWRITTEN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed recently, FIEMAP_EXTENT_UNWRITTEN
is not really the same as a hole. An unwritten extent is for allocated blocks that have a flag in the extent/BP to indicate they should be read back as zero, but have not had any data/zeroes written to them.
Unwritten extents are used by fallocate()
to reserve disk space efficiently in non-COW filesystems (e.g. ext4 or XFS) for large databases or video files. However, that doesn't make sense for COW filesystems, since the "reserved" blocks could never be used directly (though I guess they could be used to prevent other files from consuming that space). I'd rather not conflate these concepts, especially since FIEMAP_EXTENT_UNWRITTEN
extents will normally reference allocated blocks, but that is not the case here.
fe->fe_flags |= FIEMAP_EXTENT_UNWRITTEN; | ||
} else if (BP_IS_EMBEDDED(bp)) { | ||
fe->fe_logical_len = BPE_GET_LSIZE(bp); | ||
fe->fe_physical_start = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this be the location of the dnode block itself? Since you are setting FIEMAP_EXTENT_NOT_ALIGNED
it could be the actual byte offset of the data, or at least the dnode within the block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch. I meant to go back and add this but skipped it initially.
* dirty blocks are assumed to fill holes, and free blocks are assumed | ||
* to overlap with existing free blocks. This is a safe worst case | ||
* estimate which may slightly over report the number of extents for | ||
* a file being actively overwritten. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine, since an actively-overwritten file is going to return slightly-incorrect/stale information to userspace regardless of how hard the kernel tries, so it doesn't make sense to try too hard to get it exactly correct.
* exactly how the block pointer will be merged. | ||
*/ | ||
if (fm->fm_extents_max == 0 && fm->fm_flags & FIEMAP_FLAG_NOMERGE && | ||
fm->fm_start == 0 && fm->fm_length == FIEMAP_MAX_OFFSET) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on how the fill count is reported (e.g. on a per-BP basis for every level in the tree), you might still be able to return the extent count for a sub-region of the file (i.e. fm_start != 0
or fm_length != FIEMAP_MAX_OFFSET
), but that may not be a common-enough case to optimize.
if (BP_GET_FILL(bp) > 0) { | ||
czb.zb_blkid = i; | ||
error = zfs_fiemap_visit_indirect(spa, dnp, bp, | ||
&czb, zfs_fiemap_cb, (void *)fm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. The FIEMAP ioctl itself is allowed to return sub-regions of the file if needed (e.g. the extent array is not large enough) and is expected to be re-called until FIEMAP_EXTENT_LAST
is set, similar to read()
or readdir()
needing multiple calls.
/* Incompatible ZFS-only flags masked out of compatibility check */ | ||
fei->fi_flags &= ~ZFS_FIEMAP_FLAGS_ZFS; | ||
|
||
error = fiemap_check_flags(fei, ZFS_FIEMAP_FLAGS_COMPAT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't have to call fiemap_check_flags()
if you don't want. Make a zpl_check_flags()
function that checks for all of the flags that the code currently understands. I made sure that this function was not called at the VFS layer, exactly so that we could have different flags on a per-filesystem basis.
|
||
while ((c = getopt(argc, argv, "achsvD:E:H:F:V:?")) != -1) { | ||
switch (c) { | ||
case 'a': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how a
relates to NOMERGE
? Maybe n
or m
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIEMAP_EXTENT_NOT_ALIGNED - Extent is not block aligned
What does that mean? I would think that we are reporting on ZFS blocks, which
are by definition ZFS->block-aligned. Is there some other sort of block at
play here?
@ahrens this gets set for data in embedded blocks pointers, but we don't
currently bother reporting the exact physical offsets. We'll probably also
need this if we added support for reporting xattrs.
Note that *_COPIES, *_NOMERGE, and *_HOLE are currently specific to ZFS but
are applicable to other Linux file systems.
Does that mean that we need to get these flags merged into the linux kernel
Ideally yes. For the flags and the reserved fields.
FIEMAP_FLAG_SYNC - Sync the file first
I assume that means that we would do a txg_wait_synced() if (and only if)
the file is dirty? (looks like > > this is wrong, from reading the code; at
a minimum we should document it more precisely).
Good catch. Will fix.
I wrote a small fuzz tester for the ioctl here. After hundreds of millions
of iterations it never once crashed the kernel!
@tonyhutter excellent thanks!
#endif | ||
|
||
/* | ||
* Request that each block be reported and not merged in to an extent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix
|
||
/* | ||
* The number of blocks times the block size may exceed the file size | ||
* when there are pending dirty blocks which have not be written. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it can for this special case where FIEMAP_FLAG_NOMERGE
is passed and we return the fill count plus the pending dirty blocks. The dirty blocks might overwrite an existing block so we over report slightly. This might also be possible when the copies
property is set larger than 1.
Now that's fine since if the actual extents are requested you'll just allocate a slightly larger buffer than needed. The returned extents will be accurate.
*/ | ||
static void | ||
dbuf_add_dirty_map(list_t *list, range_tree_t *dirty_tree, | ||
range_tree_t *free_tree) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea.
{ | ||
dbuf_dirty_record_t *dr; | ||
|
||
for (dr = list_head(list); dr != NULL; dr = list_next(list, dr)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
dmu_buf_impl_t *db = dr->dr_dbuf; | ||
|
||
if (db->db_buf == NULL) | ||
(void) dbuf_read(db, NULL, DB_RF_MUST_SUCCEED); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, yes I meant to remove this before pushing.
* convenient way to determine the range of TXGs to check. | ||
*/ | ||
rl_t *rl = zfs_range_lock(&zp->z_range_lock, 0, UINT64_MAX, RL_READER); | ||
open_txg = txg_hold_open(spa_get_dsl(spa), &th); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the time it didn't feel right to create and assign an empty tx
. Let me rework this and verifty that's OK.
* to fully complete in order to avoid using stale block pointers. | ||
*/ | ||
if (dirty_txg > syncing_txg) | ||
txg_wait_synced(spa_get_dsl(spa), syncing_txg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add the ASSERT. I think there might be one other place where this is done too.
* to fully complete in order to avoid using stale block pointers. | ||
*/ | ||
if (dirty_txg > syncing_txg) | ||
txg_wait_synced(spa_get_dsl(spa), syncing_txg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll extent the block comment above this function to explain at a high level how this is intended to work and why it's OK.
/* | ||
* When FIEMAP_FLAG_NOMERGE is set and the number of extents for the | ||
* entire file are being requested. Then in this special case walking | ||
* the entire indirect block tree is not required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
dn->dn_object, dnp->dn_nlevels - 1, 0); | ||
|
||
for (int i = 0; i < MIN(dnp->dn_nblkptr, fm->fm_copies); i++) { | ||
blkptr_t *bp = &dnp->dn_blkptr[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now this is being protected by the dn_mtx
which should prevent spa_sync()
from changing the dnode_phys
. Which is pretty heavy handed and goes to your larger point of this potentially taking a long time. I'll see what can be done tp chunk this up so never end up stalling the entire pool.
@@ -774,6 +813,7 @@ const struct inode_operations zpl_dir_inode_operations = { | |||
.permission = zpl_permission, | |||
#endif /* HAVE_GET_ACL | HAVE_CHECK_ACL | HAVE_PERMISSION */ | |||
#endif /* CONFIG_FS_POSIX_ACL */ | |||
.fiemap = zpl_fiemap, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This provides extent maps for directories, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
I built this patch against zfs 0.8.0-rc1 and I am trying to test it, but I am unable to find the fiemap command on centos? Is this a program you wrote or how would I go about getting it? Running filefrag gives me: [root@bl-vsnap-100 ~]# filefrag -v /vsnap/vpool1/fs1/debug.txt The fiemap under tests gives me: More information on the file:
Indirect blocks:
The reason, I want filefrag is to use it to determine the LBA of a file. Unless you know of a better way to get the LBA of a file? I tried using zdb, but it doesn't give me that information. |
@behlendorf It doesn't apply cleanly, but the conflicts were very minimal. It was mostly just moving stuff around and deleting tests cases that no long exists like the removal of grow_pool/grow_replicas and how it became just one grow interface. The only thing that wasn't just deleting lines was changing txg_verify to TXG_VERIFY(dn->dn_objset->os_spa, i); |
What version of What does strace show when running filefrag? It should try FIEMAP first, and only fall back to FIBMAP if there is an error. |
Yep, it shows that it tries FIEMAP and gets Operation not supported and then fails with the FIBMAP error. I also updated filefrag to 1.44, which I believe was the latest. |
@bgly this may sound like a silly question, but are you absolutely sure you loading the modified kmods? You can verify by checking |
@behlendorf Does the version change when you build your own? When i do the build the packages are tagged differently, and i run: sudo yum localinstall *.$(uname -p).rpm [root@bl-vsnap-100 zfs]# ls *.rpm These are my newly built rpm's |
@bgly yes, based on your packages you should see a |
@behlendorf Interesting... It's not showing the right version... I dont see how that is possible considering I ran: sudo yum localinstall *.$(uname -p).rpm which completed successfully and did the upgrade... Is there anything special I need to do to reload the modules? |
@bgly the upgrade won't automatically swap out the kmod's since they might be in use. So you're going to want to unload them with |
[root@bl-vsnap-100 zfs]# modinfo zfs [root@bl-vsnap-100 zfs]# modinfo spl here it shows the right version loaded...? |
Not quite, |
@behlendorf After rebooting the system, it seems like the zfs modules got replaced, which is good. The fiemap/filefrag commands now work correctly. The only thing that is odd is that /sys/module/zfs/version still shows the wrong version: |
Do you guys know why the values differ? Also, since this is ZFS, is the calculation from Physical offset to Block Number different than it would be for say an ext4? For example: (Test just sits on a regular ext4 filesystem)
Now if you take the physical offset and multiply by 8 since (4096/512) = 8 |
@bgly the discrepancy comes from that fact that There's also the additional complication with ZFS that the logical extent size is likely not the same as the physical extent size, thanks to compression. Again in your example, the logical extent size is 128k, but the data has been compressed down to a single 1k block. Currently I'd suggest using the |
@behlendorf any reason this didn’t go in? |
@bgly unfortunately, I've been unable to devote the time needed to address the review feedback so this can be wrapped up. This is definitely still functionality I'd like to see included when it's ready. |
Now all fiemap zfs-tests pass except 1, fiemap_free, which is a regression in master. ... home/rohan/zol-ws/zfs/tests/test-runner/bin/test-runner.py -c "tests/runfiles/fiemap.run" -T "functional" -i "/home/rohan/zol-ws/zfs/tests/zfs-tests" -I "1" Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/setup (run as root) [00:01] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_sync (run as root) [00:16] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_flags (run as root) [00:10] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_delalloc (run as root) [00:31] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_free (run as root) [00:01] [FAIL] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_copies (run as root) [00:02] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_nomerge (run as root) [00:03] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_holes_sanity (run as root) [01:08] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/cleanup (run as root) [00:00] [PASS] Results Summary PASS 8 FAIL 1 Running Time: 00:02:15 Percent passed: 88.9% ... Signed-off-by: Rohan Puri <rohan.puri15@gmail.com>
Now all fiemap zfs-tests pass except 1, fiemap_free, which is a regression in master. ... home/rohan/zol-ws/zfs/tests/test-runner/bin/test-runner.py -c "tests/runfiles/fiemap.run" -T "functional" -i "/home/rohan/zol-ws/zfs/tests/zfs-tests" -I "1" Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/setup (run as root) [00:01] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_sync (run as root) [00:16] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_flags (run as root) [00:10] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_delalloc (run as root) [00:31] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_free (run as root) [00:01] [FAIL] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_copies (run as root) [00:02] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_nomerge (run as root) [00:03] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_holes_sanity (run as root) [01:08] [PASS] Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/cleanup (run as root) [00:00] [PASS] Results Summary PASS 8 FAIL 1 Running Time: 00:02:15 Percent passed: 88.9% ... Signed-off-by: Rohan Puri <rohan.puri15@gmail.com>
Brian, what's the current state of this one? |
There hasn't been any new progress. It needs to be rebased on the latest code and the remaining review feedback needs to be addressed. |
Signed-off-by: Lars Thoms <lars.thoms@spacecafe.org>
This feature would be really helpful to Longhorn |
I noticed that this PR hasn’t been updated in some time. We would like to see this feature added, but it seems that it isn’t quite complete, so we’re going to close it for now. When you have time to revisit this work, feel free to reopen this PR or open a new one. Thanks for your contribution to OpenZFS and please continue to open PR’s to address issues that you encounter! |
Description
The FIEMAP ioctl is the standard Linux user space mechanism for inspecting the physical layout of a file on disk. The ioctl returns a list of extents each of which describes a region of the file. Compatible blocks are merged in to larger extents when they are physically contiguous and have identical flags.
The following per-extent flags are supported by ZFS.
The following flags are supported when requesting extents. Note that *_COPIES, *_NOMERGE, and *_HOLE are currently specific to ZFS but are applicable to other Linux file systems.
Finally, the following reserved fields in the fiemap_extent structure have been utilized and should be reserved to prevent future conflicts.
Future work:
The FIEMAP_FLAG_XATTR flag is not supported. Implementing this should be relatively straight forward if it is needed. This would be a handy way to determine if the xattrs have been stored in the dnode, spill block, or an external object.
The FIEMAP_FLAG_CACHE flag is not supported. We rely on the ARC to keep the right blocks cached.
The lseek(2) SEEK_HOLE and SEEK_DATA flags still rely on the dmu_offset_next() function. The zpl_llseek() function can be updated to use FIEMAP to quickly determine the next extent. This has the advantage of correctly accounting for newly dirtied or freed blocks without forcing a txg sync.
The FIEMAP interface is fully described at:
Motivation and Context
FIEMAP is a fundamental Linux interface and should be fully implemented. With this functionality added Linux utilites like
filefrag(8)
work as expected on a ZFS filesystem. See issue #264.One useful example of the kind of information provided by FIEMAP is the ability to easily check if a file is file is compressed on disk.
Or the layout of a larger file on disk.
How Has This Been Tested?
Nine new test cases have been added to the ZTS which test all of the supported
FIEMAP_FLAG_*
's which can be used, and verify all of the possibleFIEMAP_EXTENT_*
's flags which can be reported. Additionally, a new test utility calledfiemap
was added which can be used to verify a correct extent mapping was returned. The caller may specify the expected location of holes, data, extent flags, device ids, or number of extents. It's effectively a souped up version offilefrag(8)
designed for validation.Types of changes
Checklist:
Signed-off-by
.