-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement fallocate FALLOC_FL_PUNCH_HOLE #2619
Conversation
I'm running the full suite of xfstests now. I ran the relevant individual tests by hand which passed. As indicated in the commit comment, I've moved the truncate_setsize() call in zfs_setattr() to zfs_freesp() in the same vicinity as its new call to truncate_pagecache_range(). This aligns the zfs_setattr() function a bit more closely with the upstream code. However, Illumos actually does its truncation-style page clearing in zfs_trunc(). We could probably do it there, too, however it seemed to make sense to have both of them in the same place. |
I like where this is going but it looks like patch still has a few rough edges. The build results for the commit above show at least one minor build issue and a BUG while running fsx. |
I'll be working on this over the weekend. Haven't had a chance to check the xfstests results yet. I'll do some fsx runs as well. What were the build issues? I'll be trying a 2.6 kernel as well but haven't done so yet. |
@dweezil, you can see the build failures by clikcing on the Details link below. That will take you to the buildbot console page. Click on the colored boxes to see a popup with the test results for that builder. The red ones represent failures. Look for failures that say "built zfs failed", then click on the "stdio" link to see the log. In this case, the failures were:
|
Thanks. I forgot about the build bot. |
Right. Just keep the buildbot green. |
2265d20
to
aa3408d
Compare
Several of the builders encountered a kernel BUG during testing, see: |
I was looking at the build bot results earlier while they were still running. In particular, it looks like xfstest 127 was hanging in some cases. I didn't notice the kernel BUG at the time. I'll look into these further tomorrow. |
Ugh, it seems that prior to torvalds/linux@5a72039 (appearing in 3.11), |
My last commit, 55bcc51 is still likely to fail on kernels > 3.4 and < 3.11 but it does work properly for the older 3.2 kernel I was testing with. Unfortunately this is an area where the bundled filesystems have an advantage because their code is tied to a particular kernel source tree. The problem is that there are kernels with torvalds/linux@623e3db but without torvalds/linux@5a72039. As much as I'd like to rely on the much nicer |
As suggested in my last comment, I pulled the code into |
Dealing with the heads and tails of the truncated region in the page cache is going to be a bit trickier than I thought at first. |
This commit still has some trash in it and also not as much detail in its log message as I'd ultimately like to see. I pushed it so the buildbot could get its hands on it and try some kernels I don't have easy access to. I'm also looking for feedback on the approach taken to deal with the sub-page stubs of holes. My approach was to borrow the logic framework from |
I'll get you some more detailed feedback tomorrow but this looks like a reason approach. If it can be cleanly done unifying |
@behlendorf OK. The last thing I pushed seems to pass the buildbot's tests except for style. I just re-pushed a cleaned up version which has a more specialized punching helper called Could many of the fallocate tests not be running because they're testing the actual "allocation" functionality of fallocate which we don't, of course, support? I've done more extensive fsx tests by hand. In particular, I've run it with larger files, much longer run times and also with different ZFS record sizes to test the actual punching at the filesystem level any any possible interactions with the page cache. That said, I've been concentrating on a 3.2 kernel on my test system because it most closely matches the older kernels without the newer, smarter |
@dweeezil Great work, this support has come together nicely! The changes look correct to me and I see that all the builders are now passing. I'm going to kick off some additional testing to be extra sure. But if you're happy with the latest version of the patch and nothing exciting comes up in my testing we might get this merged this week.
Yes, that's likely it. |
@behlendorf I'm pretty happy with it at this point. As to the support for "mode 0" mentioned in #2649, I'm not clear whether we want to even pretend to support it. The guarantee of no-fail writes within the allocated region would require a whole new bit of on-disk format to support; some sort of per-dataset/per-file reservation along with all the requisite accounting. Given the infrastructure provided by this patch, however, we certainly could "pretend" to support mode 0 if we cared to in order to appease applications which use a "mode 0 test" to determine whether any fallocate operations, including hole punching, are possible. I'm inclined to leave this as-is. |
We shouldn't. It's better to be honest with the applications and return that this functionality isn't currently supported rather than pretend we provide this hard guarantee. This is definitely something we could support if someone wanted to work on it. In the meanwhile they can always set a reserveration or a refreservation on the dataset. So far so good with the additional testing. |
@dweeezil unfortunately I'm still able to occasionally cause xfstests 124 to fail when operating in the mixed standard/memory mapped mode. This happens perhaps 1/100 times in my testing. I haven't investigated in detail yet but I have an educated guess as to the problem. In order to ensure the data in the page cache pages always matches what's in the ARC the pages and ARC buffers must both be updated under the
|
@behlendorf I was afraid something like that might be possible. When I was examining other file systems, I they all did their pagecache truncation [EDIT] much earlier [end EDIT] but it seemed like it ought to be possible to do it later in our case ("out:"). I really like having both the EDIT: Hmm, that would put it inside the transaction. |
@behlendorf I was able to duplicate this by severely constraining the memory on my test VM, letting fxs run indefinitely and the placing more pressure on the memory while it was running. I'm going to try to rig up the system to reproduce the problem more reliably and quickly and will then try to move the page cache truncation code under the range lock. |
This should be OK as long as you perform the page cache update under the range lock. The code in zfs_write() behaves the same way. |
The 4f92732 commit looks like it's going to pass the buildbot tests (assuming we can ignore the hopefully spurious ztest vdev DTL error) and it has also survived several hours of wall time in my "torture test" environment in which I'm forcing lots of swapping to occur on a severely memory-constrained system while multiple instances of fsx are running. Previously, I could trigger an error in this environment within 5 or 10 minutes. I split the work into two halves: complete page truncations which run at the end of zfs_freesp() and partial truncations (page stub zeroing) which run under a range lock in zfs_free_range(). |
@dweeezil That sounds like a reasonable approach. Let's make sure it holds up with a weekend of testing. It would be nice if you could rebase the patch on the latest master. I've just merged the AIO support which is now passing all the test cases as well. |
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of fallocate(2). Mimic the behavior of other native file systems such as ext4 in cases where the file might be extended. If the offset is beyond the end of the file, return success without changing the file. If the extent of the punched hole would extend the file, only the existing tail of the file is punched. Add the zfs_zero_partial_page() function, modeled after update_page(), to handle zeroing partial pages in a hole-punching operation. It must be used under a range lock for the requested region in order that the ARC and page cache stay in sync. Move the existing page cache truncation via truncate_setsize() into zfs_freesp() for better source structure compatibility with upstream code. Add page cache truncation to zfs_freesp() and zfs_free_range() to handle hole punching.
@dweeezil My test results look good, I'm no longer seeing any failures which were caused by this change. The one buildbot failure above was clearly unrelated. Please, let me know if your testing has turned up anything. If not I'll get it merged. |
@behlendorf I ran fsx for quite awhile over the weekend on my stress testing rig and it never failed. I think it's good at this point. |
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2). Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.
Add an autotools check for the truncate_pagecache_range() helper function
and a compatible implementation if the kernel doesn't support it.
Hole punching clears the relevent pagecache entries in zfs_space()
via truncate_pagecache_range(). For better source compatibility with
other implementations within the zfs_setattr() function, its call to
truncate_setsize() has been moved to zfs_space().