-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deadlocks resulting from the post-kmem rework merge #3225
Conversation
This is intended to fix the issues outlined in #3183. It's a WIP branch but since it appears others have been grabbing patches from it, I think it's time to get it posted as a pull request. The "testing this one now" change to |
Unfortunately, it seems I'm going to try the more heavy-handed approach of putting |
@dweeezil |
@tuxoko I'll be running tests with z_hold_mtx under MUTEX_FSTRANS shortly. I expect it will fix this current case of deadlocks. There are clearly too many code paths to fix, individually, otherwise. |
@dweeezil @behlendorf If we are going to keep the MUTEX_FSTRANS thing as it is, we need to be very careful of the lock/unlock sequence of such mutexes. spl_fstrans_mark would also have effect in this. But personally, unless we could fix the MUTEX_FSTRANS thing, I think we should stick to spl_fstrans_mark, because it is more explicit. |
@tuxoko Sorry, I had been meaning to read the comment you referenced in #3183 more closely but since I had not been leaning on MUTEX_FSTRANS, I didn't bother. Your comment is, of course, correct and is certainly a potential problem with z_hold_mtx in which the entries are aliased among potentially many unrelated objects. Using MUTEX_FSTRANS as a fix for z_hold_mtx isn't going to work as it stands now. |
@dweeezil |
fc43636
to
8b5bec7
Compare
My latest commit simply marks/unmarks where the z_hold_mtx array is used. This seems to cure all the cases of its related deadlocks I've been able to produce so far. I've got one small related tweak I'll push later today but I think this is a good start. Unfortunately, I can still deadlock my test system with the stress test. I've not analyzed the stack traces and lock debugging fully yet but it looks like it may be related to the kernel's per-inode |
8b5bec7
to
7045329
Compare
I've pushed another version of this with the comments cleaned up and one small superfluous change reverted. Further testing shows that the deadlocks I was still able to create are somewhat expected on systems with very little memory, no swap and on which tmpfs is being used. After removing all uses of tmpfs, I was no longer able to create the same deadlock. That said, I'm going to continue stress testing over the next day or so. |
7045329
to
6d617e7
Compare
Here's an update after chasing deadlocks all day. Most of the last class of deadlocks I was seeing were due to direct reclaim during the various dmu_tx_hold family of functions. Unfortunately, after locking all of those paths down, I'm now able generate deadlocks involving the |
6d617e7
to
d85a1f7
Compare
I've backed off of all the dmu_tx_hold lockdowns. Instead, as @tuxoko rightly pointed out, memory allocations when db_mtx is held can cause deadlocks. The latest commit puts it under MUTEX_FSTRANS but since I've got a good set of stack traces involving this deadlock, it might be possible to do this in a more granular manner with With this patch in place, however, I can still lock up my test system in which it spins trying to free memory in the SPL slab shrinker. I'm going to get this rebased on current master code now in order to get the latest ARC changes (2cbb06b in particular). |
hm, several git fetch/pull fails due to github being DDOS, failed test imports
whatever that means and http://buildbot.zfsonlinux.org/builders/centos-7.0-x86_64-builder/builds/1764/steps/shell_17
http://oss.sgi.com/archives/xfs/2014-07/msg00395.html looks like another buildbot issue @dweeezil I'll give your latest patch a try, although I doubt I'll run into any serious lockups due to my (compared to others) minor load Thanks for your continued work on this issue ! |
d85a1f7
to
1c05d84
Compare
I've pushed a master-rebase of dweeezil/zfs@d85a1f7 (it's dweeezil/zfs@1c05d84). So far, it's looking very good. I've not been able to deadlock my test system but I have been able to live-lock it to the point only a SysRq-e will get it going again. I'm concerned that @tuxoko's observation about the ordering issue with MUTEX_FSTRANS may be happening along the sync path because I'm seeing reclaim entered from
I have a feeling this patch's use of |
@dweeezil "I've backed off of all the dmu_tx_hold lockdowns. Instead, as @tuxoko rightly pointed out, memory allocations when db_mtx is held can cause deadlocks." Hmm, that rings a bell... https://www.illumos.org/issues/5056 - ZFS deadlock on db_mtx and dn_holds According to git log, hasn't been applied to ZoL? |
@chrisrd That patch is going to take a bit of digesting; it's pretty extensive. The stacks shown in https://www.illumos.org/issues/5056 don't look like anything I've seen as part of working on this issue (at least insofar as the umount path is concerned). That said, I may try a quick port of the patch to see how well it fits into our current code. |
@tuxoko sorry about the long delayed comment on #3183 (comment). You're exactly right, that's a legitimate case which can happen and I should have commented on it explicitly. At the time I didn't think that could occur in the existing code but given the stacks observed by @dweeezil it seems it may be possible. Direct reclaim should never ever ever be possible in the context of the You mentioned the only way to handle this transparently is to keep a reference count in the task. Unfortunately, that's not really something we can do because we can't add a field to the task struct. We could add a bit of thread specific data (TSD) but I suspect the performance hit there would be unacceptable. We may not be able to transparently hide this case. Which is a shame because I'd really like to keep the Linux specific bits to a minimum. I completely agree with @dweeezil's comments above:
So with that in mind let me throw out a couple possible alternate approaches we could take. Better ideas are welcome! I'd love to put this issue behind us once and for all.
|
@behlendorf Before addressing your points above, here are a few other details. First I want to characterize the workload of my torture tests (see https://github.com/dweeezil/stress-tests) to get idea why I'm able to create deadlocks so quickly. All my testing is done with "mem=4g" to simulate a system that's tight on memory from the beginning. It's much, much harder to deadlock even at the 8GiB level with all other things equal. The filesystem is populated with about 1 million files in a 3-level directory hierarchy consisting of 1000 directories. I've not been re-populating it (right now, there are 899K inodes used on my test system). A set of 20 concurrent file-creation with random overwrite processes are repeatedly run and at the same time, a set of 20 concurrent random file removal processes are run. The random removal uses This is a mostly metadata-heavy workload and on a memory constrained system, the ARC quickly fills to its automatically-calculated 1GiB cap (and overshoots periodically). While all this is running, I run a single memory hogging program which nibbles away at a pre-specified amount of memory in 10MiB chunks, dirtying each chunk at it goes. I'm currently using a 2000MiB cap (somewhat arbitrarily). The memory hog is allowed to run to completion and is then re-run after a 4-second delay to allow the ARC to refill. When I launch the memory hog, I watch the ARC via With this patch in its current state, I can usually wake the system up by doing a SysRq-e to SIGTERM all the processes. The trouble seems to stem from the sync task going into reclaim. I do want to try to back off of using MUTEX_FSTRAMS for As to your points above:
My overriding concern is the timing of this issue with respect to a (from what I gather) long-overdue tagged release. If we miss marking something with I'd like to add one other note regarding the case of reclaim I spotted in the sync thread: My first idea was that something was turning off PF_FSTRANS so I put a test and a |
@dweeezil it sounds like we're on the same page about this. I think the short term goal here needs to be putting together some patch which addresses most/all of the known deadlocks. Otherwise we're almost certainly going to have people updating to the next released version and encountering this issue. That's something I definitely want to avoid. However, this is the last remaining blocker on the list holding up a tag. So in the interests of getting something out the door, I agree that we either need to adopt the Have you perhaps already put together a patch for the If you could provide a trivial wrapper script in the short term for your test case I could get some additional testing done done myself for these patches. And speaking of testing thanks for describing your workload and providing a pointer to the test scripts. It would be ideal if at some point we could some form of your testing to the buildbot, perhaps as part of the zfsstress tests @nedbass put together, https://github.com/nedbass/zfsstress. These tests are very similar is spirit to the kind of testing you've been doing. |
@behlendorf Kind of in reverse order here: I've put together a wrapper script for my current testing regimen and cleaned up the (very crude) scripts just a bit. You'll need both https://github.com/dweeezil/stress-tests and https://github.com/dweeezil/junkutils (or your own memory hogging and dirtying program). The former repository has the stress testing scripts and I've not yet done the This should give you enough information to get the test case running. Hopefully you can duplicate the problem as easily as I am able to (which, BTW, I'm strictly running bare metal right now). |
@dweeezil when you refresh the SPL patch can you also revert the MUTEX_FSTRANS patch openzfs/spl@d0d5dd7. Let's get rid of that interface since we've decided not to use it. As for the |
@dweeezil @tuxoko the latest version of the patch stack continues to hold together well for me in all but the most abusive testing. In my estimation it is behaving at least as well as, if not better than, the existing 0.6.3 code when in a low memory situation. There's certainly room for us to continue improving things, and we should, but in the interests of getting this our due tag I'd like to merge this. Then we can get a final round of testing in over the weekend before the tag. Do you guys have any reservations or last minute improvements you'd like to get in to this change? What's your general impression of the state of things with this patch applied to master? |
@behlendorf Insofar as fixing the original issues described in this thread and some other related issues, I'm happy with this patch as it stands. What's your take on the SPL patch? I've been doing all my testing with it as it helps things a lot in this particular case. My impression is that things are in pretty good shape with this patch applied. I'm planning doing some more tests this weekend on bigger hardware configurations along with my (delayed by this issue) testing of pull request 3115 (purposely not hash-referenced; It was this issue which stopped me in my tracks while working on that patch). |
@dweeezil I've been using the SPL patch as well and I think it helps considerably in certain situations. I was going to include it when I merge this change. Thus far it's worked well for me after making that one minor fix. OK, then I'll get this merged today and queue up some additional weekend testing on the new master. |
…3115_WIP_clean sync against latest upstream master from April 3rd to benefit from the changes from openzfs#3225
…3115_WIP_clean sync against latest upstream master from April 3rd to benefit from the changes from openzfs#3225
Hi,
I am currently connected to my machine, which is quite unresponsive, top showing a load average of 18, but no high cpu usage nor disk usage (wa = 0.9). I've setup a crontab that logs free memory and arcstats every 3 minutes until it will crash. Current stats are
I get more and more load average and will probably loose "the hand" on the machine. Running CentOS 7.1 Linux backupmaster.siege.local 3.10.0-229.1.2.el7.x86_64 #1 SMP Fri Mar 27 03:04:26 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux on a Fujitsu TX140S1P with 2x4Tb RE4 drives and 8Gb RAM, no deduplication. PS: sorry if this isn't the right issue to post to, i just read a lot of deadlock stuff and this one just looks like mine. |
There should be more than one stack trace. Can you paste (or pastebin, or gist, etc) some more from the same time period? Should all be logged within a 2 minute window. Also note that another commit went in just recently to improve the situation further. It's not a 100% fix but every little bit helps. Pulling the next update would be my suggestion. |
Well there is a zfs-0.6.3-261 package, but it seems it isn't public or built yet (neither was it yesterday). The full stack trace from last deadlock can be found here: https://gist.github.com/deajan/78757af1066b7d77a73d |
@deajan Sorry about that. The EPEL zfs-testing repository will be fixed today. |
@behlendorf Thanks a lot, and congrats for release 0.6.4 :)) |
Prevent deadlocks by disabling direct reclaim during all NFS calls. This is related to 40d06e3. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#3225
Prevent deadlocks by disabling direct reclaim during all NFS and xattr calls. This is related to 40d06e3. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#3225
Commit @40d06e3 removed a scheme to avoid reacquiring a mutex in zfs_zinactive. It turns out this the scheme is necessary. Reinstate it. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue openzfs#3225 Closes openzfs#3304
Prevent deadlocks by disabling direct reclaim during all NFS, xattr, ctldir, and super function calls. This is related to 40d06e3. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#3225
Prevent deadlocks by disabling direct reclaim during all NFS, xattr, ctldir, and super function calls. This is related to 40d06e3. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue openzfs#3225
It would be better to identify each lock that can be accessed during direct reclaim, convert the entry/exit routines into preprocessor macro wrappers and insert |
Put the following functions under PF_FSTRANS to prevent deadlocks
due to re-entering the filesystem during reclaim:
Testing this one now:
dmu_zfetch_fetch() - z_hold_mtx[] via zget and zf_rwlock via dmu_zfetch