Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fiemap[Still needs to address comments from original PR7545] #9554

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

rohan-puri
Copy link
Contributor

@rohan-puri rohan-puri commented Nov 5, 2019

Motivation and Context

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@codecov
Copy link

codecov bot commented Nov 6, 2019

Codecov Report

Merging #9554 into master will decrease coverage by 30.17%.
The diff coverage is 1.88%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #9554       +/-   ##
===========================================
- Coverage   79.02%   48.85%   -30.18%     
===========================================
  Files         418      276      -142     
  Lines      123686    95488    -28198     
===========================================
- Hits        97748    46646    -51102     
- Misses      25938    48842    +22904
Flag Coverage Δ
#kernel ?
#user 48.85% <1.88%> (-17.97%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6c7023a...cc9d510. Read the comment docs.

behlendorf and others added 4 commits November 10, 2019 00:15
The FIEMAP ioctl is the standard Linux user space mechanism for
inspecting physical layout of a file on disk.  The ioctl returns
a list of extents each of which describes a region of the file.
Compatible blocks are merged in to larger extents when they are
physically contiguous and have identical flags.

The following per-extent flags are supported by ZFS.

  FIEMAP_EXTENT_LAST           - The last extent in the mapping
  FIEMAP_EXTENT_UNKNOWN        - Set on gang blocks
  FIEMAP_EXTENT_DELALLOC       - Dirty extents not yet written
  FIEMAP_EXTENT_ENCODED        - Set on compressed extents
  FIEMAP_EXTENT_DATA_ENCRYPTED - Set on encrypted extents
  FIEMAP_EXTENT_NOT_ALIGNED    - Extent is not block aligned
  FIEMAP_EXTENT_DATA_INLINE    - Set on embedded block pointers
  FIEMAP_EXTENT_UNWRITTEN      - Set on holes (normally not reported)
  FIEMAP_EXTENT_MERGED         - Multiple block pointers were merged
  FIEMAP_EXTENT_SHARED         - Set on deduplicated extents

The following flags are supported when requesting extents.  Note
that *_COPIES, *_NOMERGE, and *_HOLE are currently specific to
ZFS but are applicable to other Linux file systems.

  FIEMAP_FLAG_SYNC             - Sync the file first
  FIEMAP_FLAG_COPIES           - Include all copies of an extent
  FIEMAP_FLAG_NOMERGE          - Don't merge blocks in to extents
  FIEMAP_FLAG_HOLES            - Include unwritten holes as an extent

Finally, the following reserved fields in the fiemap_extent structure
have been utilized and should be reserved to prevent future conflicts.

  .fe_reserved[0]   - Unique device ID; top-level VDEV ID for ZFS
  .fe_reserved64[0] - Physical length of an extent

Future work:

* The FIEMAP_FLAG_XATTR flag is not supported.  Implementing this
  should be relatively straight forward if it is needed.  This would
  be a handy way to determine if the xattrs have been stored in the
  dnode, spill block, or an external object.

* The FIEMAP_FLAG_CACHE flag is not supported.  We rely on the ARC
  to keep the right blocks cached.

* The lseek(2) SEEK_HOLE and SEEK_DATA flags still rely on the
  dmu_offset_next() function.  The zpl_llseek() function can be
  updated to use FIEMAP to quickly determine the next extent.
  This has the advantage of correctly accounting for newly dirtied
  or freed blocks without forcing a txg sync.

The FIEMAP interface is fully described at:

* https://www.kernel.org/doc/Documentation/filesystems/fiemap.txt

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#264
Now all fiemap zfs-tests pass except 1, fiemap_free,
which is a regression in master.

...
home/rohan/zol-ws/zfs/tests/test-runner/bin/test-runner.py  -c "tests/runfiles/fiemap.run" -T "functional" -i "/home/rohan/zol-ws/zfs/tests/zfs-tests" -I "1"
Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/setup (run as root) [00:01] [PASS]
Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_sync (run as root) [00:16] [PASS]
Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_flags (run as root) [00:10] [PASS]
Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_delalloc (run as root) [00:31] [PASS]
Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_free (run as root) [00:01] [FAIL]
Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_copies (run as root) [00:02] [PASS]
Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_nomerge (run as root) [00:03] [PASS]
Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/fiemap_holes_sanity (run as root) [01:08] [PASS]
Test: /home/rohan/zol-ws/zfs/tests/zfs-tests/tests/functional/fiemap/cleanup (run as root) [00:00] [PASS]

Results Summary
PASS       8
FAIL       1

Running Time:   00:02:15
Percent passed: 88.9%
...

Signed-off-by: Rohan Puri <rohan.puri15@gmail.com>
Signed-off-by: Rohan Puri <rohan.puri15@gmail.com>
The test was failing since fiemap range trees,
work on {offset, len}, while dn_free_ranges range trees
work on {blkid, nblks}, do the necessary conversion.

Signed-off-by: Rohan Puri <rohan.puri15@gmail.com>
@rohan-puri
Copy link
Contributor Author

rohan-puri commented Nov 9, 2019 via email

@snajpa
Copy link
Contributor

snajpa commented Nov 10, 2019

@rohan-puri never mind, I should have verified things before writing anything, it's .tmpfile that would be the other call in the wrapper ops struct, not .fiemap, thus completely irrelevant to this issue. It was a bit late and I've mistaken the two.

@rohan-puri
Copy link
Contributor Author

@snajpa : ok :)

@behlendorf behlendorf self-requested a review November 12, 2019 22:13
@behlendorf behlendorf added Status: Design Review Needed Architecture or design is under discussion Type: Feature Feature request or new feature labels Nov 12, 2019
@Ornias1993
Copy link
Contributor

@rohan-puri Good work taking this on!
Could you list the changes you made to the original PR?

Could you also fillin the details of your PR, instead of relying on people looking up the old one?
It think mostly a link to the old one and a shameless copy-paste would suffice.

@as-com
Copy link
Contributor

as-com commented Feb 17, 2022

Any update on this? This feature would be incredibly useful.

@jittygitty
Copy link

@rohan-puri @behlendorf @mmaybee Were you using Linux kernel's "generic_block_fiemap" anywhere? Just wondering if you saw no need for it since you're doing something different than I thought. Though I did just notice @adilger's post at #264 (comment)

And yes I know our friends removed the export free-for all generic_block_fiemap and replaced it with the GPL-only iomap_fiemap On July 27th 2021. (And they're just getting warmed up, hence some of my other recent issues I posted...)

fs: REMOVE generic_block_fiemap torvalds/linux@9acb9c4
"FIEMAP cleanups from Christoph transitioning all remaining filesystems supporting FIEMAP (ext2, hpfs) to iomap API and removing the old helper"
torvalds/linux@63b0c40

@behlendorf
Copy link
Contributor

@jittygitty thanks for taking an interest in this PR, but to be clear the kernel changes you referenced don't have any impact on the ability to implement this feature. What's needed is for someone to address the unresolved feedback provided in the original PR review. This still involves a significant amount of development work which I haven't had the time to pursue. If someone would like to pick up this work, I'm happy to help point them at what still needs to be done.

@jittygitty
Copy link

@behlendorf Yes it would be beneficial to explain the path of the implementation, I did not look at code much since I have only very basic C knowledge so I did not have full grasp if you avoided iomap usage due to exports or if there's no help in using. But either way yes if you could explain what needs to be done, I can try to hire someone to give a crack at it and would be good to point them here and read yours or other knowledgeable contributors explanation, instead as I've done before and try to bang my head and understand some basics to try and gain myself enough info on for example fiemap path for reflink implementing.

I'd rather not torture my head and potentially get things wrong. I had also asked for @pjd to provide what he needs for someone to finish implementing reflinks using "his BRT" method for Linux, I'm not against having two methods compete if one method/BRT is not going to waste anyway since will stay with BSD. Plus I'm not even "sure" if via fiemap I can bypass using BRT for reflinks, and use what, some kind of clone of xfs b+ refcounttree etc, not sure how BRT differs or if its actually same thing since I didn't "study the code of either" and prefer someone just spell it out. I want the feature for linux for myself, and simply wanted to gauge interest to help "crowd-fund" issues such as reflinks and offline-deduplication to get it done faster on linux.

@Ornias1993 I agree with you that there shouldn't be so much fear and avoidance of export_symbol_gpl, I would simply use anything we need. I've given some preliminary reasons why in #13415

But I should add that I still prefer being reserved for politeness sake, there's no reason openzfs can't be good friends with Linux Kernel guys, their real hatred is for CLOSED SOURCE binaries, OpenZFS complies with basically "all" of the "REASONS" why they went with GPL when they did, ie their intent. OpenZFS is truly open source.

(Though it is unfortunate with DDN and purchase of Nexenta..., had some questions and complaints to let out, but I'll refrain in case there's openzfs contributors on here from Nexenta that might get upset, plus this is Fiemap issue tracker after all.)

@behlendorf
Copy link
Contributor

The next steps for FIEMAP I believe are pretty straight forward. They code needs to be rebased against the current master branch and the outstanding feedback from comment in the PR needs to be resolved. In particular, there was one throrny design issue remaining regarding efficiently handling in-flight I/O which still needs to be resolve and that will take some careful work. Beyond that it of course needs to be tested and verified to work correctly. As always, if someone if willing to tackle this remaining work that would be great.

I'm not against having two methods compete if one method/BRT is not going to waste anyway since will stay with BSD.

These are complementary but different features, so developer time/interest permitting, I'd like to see both implemented. The vast majority of the BRT work is platform independent, this is the hard part of implementing the functionality for ZFS. Wiring it up for reflink on Linux should be relatively straight forward and is something we'll want to sort out in that PR.

@jittygitty
Copy link

@behlendorf Ok thanks, I think you missed one of my questions though, and I agree with you in general that cross-platform is better all else being equal. When I started looking into reflinks + offline-dedupe, I came to conclusion to do it via "FIEMAP".

But I thought implementing reflink via FIEMAP was "different" than how @pjd was doing it with his implementation via "BRT".
So I wanted your clarification if I was correct on that part, but yea regardless of the answer, I'm all for "both" BRT and Fiemap.

Sorry, I know its kind of a messy brainstorm but if you read my comment in #13349 especially the clonefiles.pdf link (see #13349 (comment) ), I think you'll see my thoughts and can correct me.

Anyway I will contact someone I had in mind to see about taking a shot at finishing this 9554 FIEMAP, but you are saying that the original issue #7545 needs to be looked at, yet is all the code from that one already in this one, ie work in which one now?

What I was looking for though was a short "onboarding" for would be coder on this who does not have prior ZFS knowledge but is simply a good C/C++ coder perhaps with some kernel experience. I wanted to know if there's any "primer" that would be very helpful for them to look at before jumping in on this FIEMAP, since again, imagine its someone who never saw ZFS code.

@behlendorf
Copy link
Contributor

But I thought implementing reflink via FIEMAP was "different" than how @pjd was doing it with his implementation via "BRT".
So I wanted your clarification if I was correct on that part.

I see what you're getting at. I think the confusion here is that FIEMAP is solely about reporting logical/physical file extents. Now user space applications like cp can do some interesting things with that information, but FIEMAP does not provide any reasonable mechanism to implement something like reflink. That's a significantly more challenging problem, which is of course where the BRT work fits in. Or looked at another way, the BRT work enables a performant reflink implementation and the FIEMAP work provides a read-only reporting mechanism to see exactly how the data was laid out.

but you are saying that the original issue #7545 needs to be looked at.

The original #7575 has all the relevant review comments and latest fully working version of the code. I'd suggest starting there. Unfortunately, one of the reasons this stalled out is there are some challenging design and implementation details to work out. In particular around reconciling file extents for uncommitted dirty data blocks and pending frees which have been modified in the not-yet-synced-to-disk transaction groups. While I wouldn't want to discourage any one from taking on this work, getting it right is going to be a little bit tricky and require some solid knowledge of the internals.

I wish I could point you at a primer, but I'm not aware of anything along those lines. There is a fair amount of high level design documentation available, but only the code is really going to have the level of detail needed.

@jittygitty
Copy link

@behlendorf Thanks ok I understand someone new to zfs will need read through the code carefully and learn it that way. I will try get someone to take a look, wish I hadn't ditched class to hack gopher systems in library...so I wouldn't have to 'hire' coders.

As far as FIEMAP I knew it provided extents interface to underlying blocks but I thought if we're going to use the Linux VFS ioctl FICLONE, FICLONERANGE and FIDEDUPERANGE, then I thought we had to speak "extents" through those ioctls since I thought they didn't talk blocks and we/zfs weren't extent based already since I saw this FIEMAP issue outstanding. I thought we could leverage existing reflink, dedupe, and copy_file_range code in the Linux kernel via those ficlone/ficlonerange/fideduperange and lessen the work on zfs side. I thought that @pjd had created some "custom ioctl" for zfs maybe and wasn't using ficlone.

BRT I had thought was like the XFS reference count B+tree for reflinks but maybe I'm wrong? At least for zfs I hear it can't be as straight-forward as it seems on XFS? But I can't figure out exactly why, as to what complicates things for us compared to xfs etc.

What was really depressing me a bit much was hearing it would basically be impossible for us to preserve reflinks over zfs send/receive and when I read that supposedly BTRFS "can" preserve their reflinks over btrfs send/receive, I was like "how"?

On a related note, will FIEMAP (and filefrag support) help us get better tools to "defrag" files via maybe background low 'scrub' priority rewriting? (I guess this could be done now with zdb zfs frag check utils?) I'm also looking for a good write-up as to why Block Pointer Rewrite is so difficult and why its really the "only way" some have said, to deal with long-term fragmentation.

Also know of anything about the Panzura ZFS dedupe tech that was supposed to be coming to ZFS?
(Apologies for the long post and questions, you can ignore if busy, no hurry. thx)

@adilger
Copy link
Contributor

adilger commented May 5, 2022

@behlendorf, if one of the main obstacles for landing FIEMAP support is the "inflight writes" handling (FIEMAP_EXTENT_DELALLOC), a relatively straight forward solution would be to always force FIEMAP_FLAG_SYNC for all files. Not the greatest for performance, but this should be implemented as a no-op for files that don't have any dirty pages (i.e. most files, and should not trigger a needless TXG sync/wait) then it would still be very useful for the majority of cases. For Lustre we ended up doing this because getting pending in-memory write information from all clients is nigh impossible. If there are concurrent write and FIEMAP calls on a single file then there will always be a race between what FIEMAP reports and the current state of the file that cannot be closed from userspace, so I don't think that should be a reasonable goal for this patch.

FIEMAP isn't used so often on still-dirty files, and in the cases that it is most users probably care about the on-disk allocations and not that X MB of unwritten data is still pending in memory. For those few that care about unwritten data, the complete lack of FIEMAP is equally non-functional, and that could be added incrementally as needed after the core on-disk allocation reporting is available.

@wibed
Copy link

wibed commented Jul 1, 2022

is there any update on this?

@mike-pisman
Copy link

Indeed, any update on this?

@ErrorNoInternet
Copy link
Contributor

rebased (badly, but compiles and works) on commit d98973d if anyone wants to clean it up, add back the tests and make a proper PR: fiemap-diff.txt

@adilger
Copy link
Contributor

adilger commented Apr 17, 2024

Note that there is development again on the upstream kernel to add compressed extent support for FIEMAP:

https://lore.kernel.org/linux-fsdevel/cover.1709918025.git.sweettea-kernel@dorminy.me/
https://lore.kernel.org/linux-fsdevel/2befe2c13065bdf3ca74cb8b701727940310fd2a.1712126039.git.sweettea-kernel@dorminy.me/

The patches so far are using the fe_reserved[0] field for fe_physical_length, which matches this patch usage. It will also set the flag FIEMAP_EXTENT_DATA_COMPRESSED on compressed extents, in addition to FIEMAP_EXTENT_ENCODED:

/* Data is compressed by fs. Sets EXTENT_ENCODED. */
#define FIEMAP_EXTENT_DATA_COMPRESSED	0x00000040

Note that this is all still under discussion and subject to change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Design Review Needed Architecture or design is under discussion Type: Feature Feature request or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet