
Implement fallocate FALLOC_FL_PUNCH_HOLE #2619

Closed
wants to merge 1 commit

Conversation

dweeezil
Contributor

Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2). Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add an autotools check for the truncate_pagecache_range() helper function
and a compatible implementation if the kernel doesn't support it.

Hole punching clears the relevant pagecache entries in zfs_space()
via truncate_pagecache_range(). For better source compatibility with
other implementations within the zfs_setattr() function, its call to
truncate_setsize() has been moved to zfs_space().

@dweeezil
Contributor Author

I'm running the full suite of xfstests now. I ran the relevant individual tests by hand which passed.

As indicated in the commit comment, I've moved the truncate_setsize() call in zfs_setattr() to zfs_freesp() in the same vicinity as its new call to truncate_pagecache_range(). This aligns the zfs_setattr() function a bit more closely with the upstream code. However, Illumos actually does its truncation-style page clearing in zfs_trunc(). We could probably do it there, too, however it seemed to make sense to have both of them in the same place.

@behlendorf
Contributor

I like where this is going but it looks like patch still has a few rough edges. The build results for the commit above show at least one minor build issue and a BUG while running fsx.

@dweeezil
Contributor Author

I'll be working on this over the weekend. Haven't had a chance to check the xfstests results yet. I'll do some fsx runs as well. What were the build issues? I'll be trying a 2.6 kernel as well but haven't done so yet.

@nedbass
Contributor

nedbass commented Aug 23, 2014

@dweeezil, you can see the build failures by clicking on the Details link below. That will take you to the buildbot console page. Click on the colored boxes to see a popup with the test results for that builder. The red ones represent failures. Look for failures that say "built zfs failed", then click on the "stdio" link to see the log.

In this case, the failures were:

cc1: warnings being treated as errors
/tmp/zfs-build--gM8QWDbM/BUILD/zfs-kmod-0.6.3/_kmod_build_2.6.32-5-amd64/module/zfs/../../../zfs-0.6.3/module/zfs/zpl_file.c: In function 'zpl_fallocate_common':
/tmp/zfs-build--gM8QWDbM/BUILD/zfs-kmod-0.6.3/_kmod_build_2.6.32-5-amd64/module/zfs/../../../zfs-0.6.3/module/zfs/zpl_file.c:489: error: unused variable 'cr'

@dweeezil
Contributor Author

Thanks. I forgot about the build bot.

@behlendorf
Contributor

Right. Just keep the buildbot green.

@dweeezil dweeezil force-pushed the fallocate branch 2 times, most recently from 2265d20 to aa3408d on August 26, 2014 at 12:54
@behlendorf
Contributor

Several of the builders encountered a kernel BUG during testing, see:

http://buildbot.zfsonlinux.org/builders/debian-7.2.0-amd64-builder/builds/1180/steps/shell_15/logs/stdio

@behlendorf behlendorf added this to the 0.6.4 milestone Aug 26, 2014
@dweeezil
Contributor Author

I was looking at the build bot results earlier while they were still running. In particular, it looks like xfstest 127 was hanging in some cases. I didn't notice the kernel BUG at the time. I'll look into these further tomorrow.

@dweeezil
Contributor Author

Ugh, it seems that prior to torvalds/linux@5a72039 (appearing in 3.11), truncate_inode_pages_range() required aligned ranges. I'm not keen on yet another autoconf test for this feature so I'll re-work it to always do the alignment.

@dweeezil
Contributor Author

My last commit, 55bcc51 is still likely to fail on kernels > 3.4 and < 3.11 but it does work properly for the older 3.2 kernel I was testing with. Unfortunately this is an area where the bundled filesystems have an advantage because their code is tied to a particular kernel source tree. The problem is that there are kernels with torvalds/linux@623e3db but without torvalds/linux@5a72039.

As much as I'd like to rely on the much nicer truncate_inode_pages_range() in the newer kernels, I think I'm going to punt and simply bundle code from my less-than-optimal implementation of truncate_pagecache_range() directly into the code in zfs_free_range(). It should work on all kernels and will eliminate the autotools check.

@dweeezil
Contributor Author

As suggested in my last comment, I pulled the code into zfs_free_range() as dweeezil/zfs@772b20a.

@dweeezil
Contributor Author

Dealing with the heads and tails of the truncated region in the page cache is going to be a bit trickier than I thought at first.

@dweeezil
Contributor Author

dweeezil commented Sep 1, 2014

This commit still has some trash in it and also not as much detail in its log message as I'd ultimately like to see. I pushed it so the buildbot could get its hands on it and try some kernels I don't have easy access to. I'm also looking for feedback on the approach taken to deal with the sub-page stubs of holes.

My approach was to borrow the logic framework from ext4_ext_punch_hole() in the 3.2 kernel to detect the corner cases and then to use an almost-duplicate of the existing update_pages() (which I called zfs_update_pages()) to handle the updating of the stubs in the page cache.

@behlendorf
Contributor

I'll get you some more detailed feedback tomorrow, but this looks like a reasonable approach. If it can be done cleanly, unifying update_pages and zfs_update_pages would be desirable. I also see that many of the fallocate xfstests aren't running for some reason. We should determine why.

@dweeezil
Contributor Author

dweeezil commented Sep 2, 2014

@behlendorf OK. The last thing I pushed seems to pass the buildbot's tests except for style. I just re-pushed a cleaned up version which has a more specialized punching helper called zfs_zero_partial_page(); it's a slimmed-down version of the previous function. Also added a better commit message. I think its logic is pretty solid. While testing it, I instrumented the kernel to display what type of sub-page zeroing was necessary: the standard fsx runs seemed to exercise all potential corner cases (punch begins in mid-page, ends in mid-page, begins and ends in mid-page, is a sub-page-sized punch).

Could many of the fallocate tests not be running because they're testing the actual "allocation" functionality of fallocate which we don't, of course, support?

I've done more extensive fsx tests by hand. In particular, I've run it with larger files, much longer run times, and different ZFS record sizes to test the actual punching at the filesystem level and any possible interactions with the page cache. That said, I've been concentrating on a 3.2 kernel on my test system because it most closely matches the older kernels without the newer, smarter truncate_inode_pages_range() as used by truncate_pagecache_range() on newer kernels.

@behlendorf
Contributor

@dweeezil Great work, this support has come together nicely! The changes look correct to me and I see that all the builders are now passing. I'm going to kick off some additional testing to be extra sure. But if you're happy with the latest version of the patch and nothing exciting comes up in my testing we might get this merged this week.

Could many of the fallocate tests not be running because they're testing the actual "allocation" functionality of fallocate which we don't, of course, support?

Yes, that's likely it.

@dweeezil
Contributor Author

dweeezil commented Sep 3, 2014

@behlendorf I'm pretty happy with it at this point.

As to the support for "mode 0" mentioned in #2649, I'm not clear whether we want to even pretend to support it. The guarantee of no-fail writes within the allocated region would require a whole new bit of on-disk format to support; some sort of per-dataset/per-file reservation along with all the requisite accounting. Given the infrastructure provided by this patch, however, we certainly could "pretend" to support mode 0 if we cared to in order to appease applications which use a "mode 0 test" to determine whether any fallocate operations, including hole punching, are possible. I'm inclined to leave this as-is.

@behlendorf
Contributor

I'm not clear whether we want to even pretend to support it.

We shouldn't. It's better to be honest with applications and return that this functionality isn't currently supported rather than pretend we provide this hard guarantee. This is definitely something we could support if someone wanted to work on it. In the meantime they can always set a reservation or a refreservation on the dataset.

So far so good with the additional testing.

@behlendorf
Contributor

@dweeezil unfortunately I'm still able to occasionally cause xfstests 127 to fail when operating in the mixed standard/memory mapped mode. This happens perhaps 1/100 times in my testing. I haven't investigated in detail yet but I have an educated guess as to the problem.

In order to ensure the data in the page cache pages always matches what's in the ARC the pages and ARC buffers must both be updated under the zfs_range_lock. Otherwise there's a tiny window where they might be inconsistent. Somehow the logic after out: which truncates the page from the cache needs to be pulled under the lock. It looks like that's going to be complicated somewhat by the fact that zfs_extend(), zfs_trunc(), and zfs_free_range() and independently take the lock.

127  - output mismatch (see 127.out.bad)
--- 127.out 2014-09-02 20:08:06.858769002 -0700
+++ 127.out.bad 2014-09-02 20:28:10.302769121 -0700
@@ -10,4 +10,1041 @@
 === FSX Standard Mode, No Memory Mapping ===
 All operations completed A-OK!
 === FSX Standard Mode, Memory Mapping ===
-All operations completed A-OK!
+ltp/fsx -f -q -l 262144 -o 65536 -S 191110531 -N 100000 fsx_std_mmap
+READ BAD DATA: offset = 0xf307, size = 0xeba2, fname = /tank/xfstests/fsx_std_mmap
+OFFSET GOOD    BAD RANGE
+0x f307    0x0000  0x9021  0x    0
+operation# (mod 256) for the bad data may be 33
+0x f308    0x0000  0x21b5  0x    1
+operation# (mod 256) for the bad data may be 33
...

@dweeezil
Contributor Author

dweeezil commented Sep 3, 2014

@behlendorf I was afraid something like that might be possible. When I was examining other file systems, they all did their pagecache truncation much earlier, but it seemed like it ought to be possible to do it later in our case ("out:"). I really like having both the truncate_setsize() and the new code in zfs_freesp() for consistency with upstream code. It sounds like we may need to push the new code down into zfs_free_range(). I'm going to try that.

EDIT: Hmm, that would put it inside the transaction.

@dweeezil
Contributor Author

dweeezil commented Sep 4, 2014

@behlendorf I was able to duplicate this by severely constraining the memory on my test VM, letting fsx run indefinitely, and then placing more pressure on the memory while it was running. I'm going to try to rig up the system to reproduce the problem more reliably and quickly, and will then try to move the page cache truncation code under the range lock.

@behlendorf
Contributor

EDIT: Hmm, that would put it inside the transaction.

This should be OK as long as you perform the page cache update under the range lock. The code in zfs_write() behaves the same way.

@dweeezil
Contributor Author

dweeezil commented Sep 5, 2014

The 4f92732 commit looks like it's going to pass the buildbot tests (assuming we can ignore the hopefully spurious ztest vdev DTL error) and it has also survived several hours of wall time in my "torture test" environment in which I'm forcing lots of swapping to occur on a severely memory-constrained system while multiple instances of fsx are running. Previously, I could trigger an error in this environment within 5 or 10 minutes.

I split the work into two halves: complete page truncations which run at the end of zfs_freesp() and partial truncations (page stub zeroing) which run under a range lock in zfs_free_range().

@behlendorf
Contributor

@dweeezil That sounds like a reasonable approach. Let's make sure it holds up with a weekend of testing. It would be nice if you could rebase the patch on the latest master. I've just merged the AIO support which is now passing all the test cases as well.

Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2).  Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation.  It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.

Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.

Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.
@behlendorf
Contributor

@dweeezil My test results look good, I'm no longer seeing any failures which were caused by this change. The one buildbot failure above was clearly unrelated. Please, let me know if your testing has turned up anything. If not I'll get it merged.

@dweeezil
Contributor Author

dweeezil commented Sep 8, 2014

@behlendorf I ran fsx for quite a while over the weekend on my stress-testing rig and it never failed. I think it's good at this point.

Labels: Type: Feature (Feature request or new feature)

3 participants