
Implement fallocate FALLOC_FL_PUNCH_HOLE #2619

Closed
wants to merge 1 commit

Conversation

dweeezil
Contributor

Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2). Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add an autotools check for the truncate_pagecache_range() helper function
and a compatible implementation if the kernel doesn't support it.

Hole punching clears the relevant pagecache entries in zfs_space()
via truncate_pagecache_range(). For better source compatibility with
other implementations within the zfs_setattr() function, its call to
truncate_setsize() has been moved to zfs_space().

@dweeezil
Contributor Author

I'm running the full suite of xfstests now. I ran the relevant individual tests by hand which passed.

As indicated in the commit comment, I've moved the truncate_setsize() call in zfs_setattr() to zfs_freesp() in the same vicinity as its new call to truncate_pagecache_range(). This aligns the zfs_setattr() function a bit more closely with the upstream code. However, Illumos actually does its truncation-style page clearing in zfs_trunc(). We could probably do it there, too, however it seemed to make sense to have both of them in the same place.

@behlendorf
Contributor

I like where this is going but it looks like patch still has a few rough edges. The build results for the commit above show at least one minor build issue and a BUG while running fsx.

@dweeezil
Contributor Author

I'll be working on this over the weekend. Haven't had a chance to check the xfstests results yet. I'll do some fsx runs as well. What were the build issues? I'll be trying a 2.6 kernel as well but haven't done so yet.

@nedbass
Contributor

nedbass commented Aug 23, 2014

@dweeezil, you can see the build failures by clicking on the Details link below. That will take you to the buildbot console page. Click on the colored boxes to see a popup with the test results for that builder. The red ones represent failures. Look for failures that say "built zfs failed", then click on the "stdio" link to see the log.

In this case, the failures were:

cc1: warnings being treated as errors
/tmp/zfs-build--gM8QWDbM/BUILD/zfs-kmod-0.6.3/_kmod_build_2.6.32-5-amd64/module/zfs/../../../zfs-0.6.3/module/zfs/zpl_file.c: In function 'zpl_fallocate_common':
/tmp/zfs-build--gM8QWDbM/BUILD/zfs-kmod-0.6.3/_kmod_build_2.6.32-5-amd64/module/zfs/../../../zfs-0.6.3/module/zfs/zpl_file.c:489: error: unused variable 'cr'

@dweeezil
Contributor Author

Thanks. I forgot about the build bot.

@behlendorf
Contributor

Right. Just keep the buildbot green.

@dweeezil dweeezil force-pushed the fallocate branch 2 times, most recently from 2265d20 to aa3408d on August 26, 2014 at 12:54
@behlendorf
Contributor

Several of the builders encountered a kernel BUG during testing, see:

http://buildbot.zfsonlinux.org/builders/debian-7.2.0-amd64-builder/builds/1180/steps/shell_15/logs/stdio

@behlendorf behlendorf added this to the 0.6.4 milestone Aug 26, 2014
@dweeezil
Contributor Author

I was looking at the build bot results earlier while they were still running. In particular, it looks like xfstest 127 was hanging in some cases. I didn't notice the kernel BUG at the time. I'll look into these further tomorrow.

@dweeezil
Contributor Author

Ugh, it seems that prior to torvalds/linux@5a72039 (appearing in 3.11), truncate_inode_pages_range() required aligned ranges. I'm not keen on yet another autoconf test for this feature so I'll re-work it to always do the alignment.

@dweeezil
Contributor Author

My last commit, 55bcc51 is still likely to fail on kernels > 3.4 and < 3.11 but it does work properly for the older 3.2 kernel I was testing with. Unfortunately this is an area where the bundled filesystems have an advantage because their code is tied to a particular kernel source tree. The problem is that there are kernels with torvalds/linux@623e3db but without torvalds/linux@5a72039.

As much as I'd like to rely on the much nicer truncate_inode_pages_range() in the newer kernels, I think I'm going to punt and simply bundle code from my less-than-optimal implementation of truncate_pagecache_range() directly into the code in zfs_free_range(). It should work on all kernels and will eliminate the autotools check.

@dweeezil
Contributor Author

As suggested in my last comment, I pulled the code into zfs_free_range() as dweeezil/zfs@772b20a.

@dweeezil
Contributor Author

Dealing with the heads and tails of the truncated region in the page cache is going to be a bit trickier than I thought at first.

@dweeezil
Contributor Author

dweeezil commented Sep 1, 2014

This commit still has some trash in it and also not as much detail in its log message as I'd ultimately like to see. I pushed it so the buildbot could get its hands on it and try some kernels I don't have easy access to. I'm also looking for feedback on the approach taken to deal with the sub-page stubs of holes.

My approach was to borrow the logic framework from ext4_ext_punch_hole() in the 3.2 kernel to detect the corner cases and then to use an almost-duplicate of the existing update_pages() (which I called zfs_update_pages()) to handle the updating of the stubs in the page cache.

@behlendorf
Contributor

I'll get you some more detailed feedback tomorrow, but this looks like a reasonable approach. If it can be done cleanly, unifying update_pages and zfs_update_pages would be desirable. I also see that many of the fallocate xfstests aren't running for some reason. We should determine why.

@dweeezil
Contributor Author

dweeezil commented Sep 2, 2014

@behlendorf OK. The last thing I pushed seems to pass the buildbot's tests except for style. I just re-pushed a cleaned up version which has a more specialized punching helper called zfs_zero_partial_page(); it's a slimmed-down version of the previous function. Also added a better commit message. I think its logic is pretty solid. While testing it, I instrumented the kernel to display what type of sub-page zeroing was necessary: the standard fsx runs seemed to exercise all potential corner cases (punch begins in mid-page, ends in mid-page, begins and ends in mid-page, is a sub-page-sized punch).

Could many of the fallocate tests not be running because they're testing the actual "allocation" functionality of fallocate which we don't, of course, support?

I've done more extensive fsx tests by hand. In particular, I've run it with larger files, much longer run times, and different ZFS record sizes to test the actual punching at the filesystem level and any possible interactions with the page cache. That said, I've been concentrating on a 3.2 kernel on my test system because it most closely matches the older kernels without the newer, smarter truncate_inode_pages_range() as used by truncate_pagecache_range() on newer kernels.

@behlendorf
Contributor

@dweeezil Great work, this support has come together nicely! The changes look correct to me and I see that all the builders are now passing. I'm going to kick off some additional testing to be extra sure. But if you're happy with the latest version of the patch and nothing exciting comes up in my testing we might get this merged this week.

Could many of the fallocate tests not be running because they're testing the actual "allocation" functionality of fallocate which we don't, of course, support?

Yes, that's likely it.

@dweeezil
Contributor Author

dweeezil commented Sep 3, 2014

@behlendorf I'm pretty happy with it at this point.

As to the support for "mode 0" mentioned in #2649, I'm not clear whether we want to even pretend to support it. The guarantee of no-fail writes within the allocated region would require a whole new bit of on-disk format to support; some sort of per-dataset/per-file reservation along with all the requisite accounting. Given the infrastructure provided by this patch, however, we certainly could "pretend" to support mode 0 if we cared to in order to appease applications which use a "mode 0 test" to determine whether any fallocate operations, including hole punching, are possible. I'm inclined to leave this as-is.

@behlendorf
Contributor

I'm not clear whether we want to even pretend to support it.

We shouldn't. It's better to be honest with applications and return that this functionality isn't currently supported rather than pretend we provide this hard guarantee. This is definitely something we could support if someone wanted to work on it. In the meantime they can always set a reservation or a refreservation on the dataset.

So far so good with the additional testing.

@behlendorf
Contributor

@dweeezil unfortunately I'm still able to occasionally cause xfstests 127 to fail when operating in the mixed standard/memory mapped mode. This happens perhaps 1/100 times in my testing. I haven't investigated in detail yet but I have an educated guess as to the problem.

In order to ensure the data in the page cache pages always matches what's in the ARC the pages and ARC buffers must both be updated under the zfs_range_lock. Otherwise there's a tiny window where they might be inconsistent. Somehow the logic after out: which truncates the page from the cache needs to be pulled under the lock. It looks like that's going to be complicated somewhat by the fact that zfs_extend(), zfs_trunc(), and zfs_free_range() and independently take the lock.

127  - output mismatch (see 127.out.bad)
--- 127.out 2014-09-02 20:08:06.858769002 -0700
+++ 127.out.bad 2014-09-02 20:28:10.302769121 -0700
@@ -10,4 +10,1041 @@
 === FSX Standard Mode, No Memory Mapping ===
 All operations completed A-OK!
 === FSX Standard Mode, Memory Mapping ===
-All operations completed A-OK!
+ltp/fsx -f -q -l 262144 -o 65536 -S 191110531 -N 100000 fsx_std_mmap
+READ BAD DATA: offset = 0xf307, size = 0xeba2, fname = /tank/xfstests/fsx_std_mmap
+OFFSET GOOD    BAD RANGE
+0x f307    0x0000  0x9021  0x    0
+operation# (mod 256) for the bad data may be 33
+0x f308    0x0000  0x21b5  0x    1
+operation# (mod 256) for the bad data may be 33
...

@dweeezil
Contributor Author

dweeezil commented Sep 3, 2014

@behlendorf I was afraid something like that might be possible. When I was examining other file systems, they all did their pagecache truncation much earlier, but it seemed like it ought to be possible to do it later in our case ("out:"). I really like having both the truncate_setsize() and the new code in zfs_freesp() for consistency with upstream code. It sounds like we may need to push the new code down into zfs_free_range(). I'm going to try that.

EDIT: Hmm, that would put it inside the transaction.

@dweeezil
Contributor Author

dweeezil commented Sep 4, 2014

@behlendorf I was able to duplicate this by severely constraining the memory on my test VM, letting fsx run indefinitely, and then placing more pressure on the memory while it was running. I'm going to try to rig up the system to reproduce the problem more reliably and quickly, and will then try to move the page cache truncation code under the range lock.

@behlendorf
Contributor

EDIT: Hmm, that would put it inside the transaction.

This should be OK as long as you perform the page cache update under the range lock. The code in zfs_write() behaves the same way.

@dweeezil
Contributor Author

dweeezil commented Sep 5, 2014

The 4f92732 commit looks like it's going to pass the buildbot tests (assuming we can ignore the hopefully spurious ztest vdev DTL error) and it has also survived several hours of wall time in my "torture test" environment in which I'm forcing lots of swapping to occur on a severely memory-constrained system while multiple instances of fsx are running. Previously, I could trigger an error in this environment within 5 or 10 minutes.

I split the work into two halves: complete page truncations which run at the end of zfs_freesp() and partial truncations (page stub zeroing) which run under a range lock in zfs_free_range().

@behlendorf
Contributor

@dweeezil That sounds like a reasonable approach. Let's make sure it holds up with a weekend of testing. It would be nice if you could rebase the patch on the latest master. I've just merged the AIO support which is now passing all the test cases as well.

Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2).  Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation.  It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.

Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.

Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.
@behlendorf
Contributor

@dweeezil My test results look good, I'm no longer seeing any failures which were caused by this change. The one buildbot failure above was clearly unrelated. Please, let me know if your testing has turned up anything. If not I'll get it merged.

@dweeezil
Contributor Author

dweeezil commented Sep 8, 2014

@behlendorf I ran fsx for quite a while over the weekend on my stress-testing rig and it never failed. I think it's good at this point.

Labels: Type: Feature (Feature request or new feature)

3 participants