[master] panic during zfs send #197
It appears to happen when a "zfs send -I" and a long-running "zfs list" overlap:

My guess is that because the zfs list takes a long time (there are many datasets and snapshots), the zfs list is still running when the automatic holds are removed at the end of the zfs send -I.

---
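A minimal two-terminal sketch of the suspected overlap; the dataset and snapshot names are borrowed from a later comment and purely illustrative:

```sh
# Terminal 1: incremental send of everything between two snapshots.
# Per the report, the send places automatic (temporary) holds that are
# removed when it finishes.
zfs send -I anypool/data/set@foo anypool/data/set@bar > /dev/null

# Terminal 2, started while the send runs: a recursive snapshot listing
# that is slow when there are many datasets and snapshots, so it is still
# walking the pool when the send's holds are released.
zfs list -r -t snapshot anypool
```

---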
During a "zfs send -I anypool/data/set@foo anypool/data/set@bar", I ran
and got this (different) panic
|
It looks like this with the latest vdev-iokit. Script holds-ssd:

In a terminal window, run

and in another terminal window

BOOM. An excellent trick on any system which does automatically scheduled backups using zfs send -I.

---
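A sketch of the setup; the body of holds-ssd is an assumption based on how the script is described later in this thread:

```sh
#!/bin/sh
# holds-ssd (assumed body): fetch the hold list for every snapshot in the
# pool, which takes a while when there are many snapshots.
zfs list -H -t snapshot -o name -r ssdpool | xargs zfs holds
```

Then run something like `while :; do ./holds-ssd; done` in one terminal while the scheduled `zfs send -I` runs in the other.

---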
(note to self for later: can this be reproduced in light of the master/spl changes in issue #201?) (will try in some hours...)

---
Nope. This panic happened during "zfs list -r -t snap ssdpool/DATA/opt" (MacPorts /opt) while the looping holds-ssd script (#197 (comment)) was running.

---
Since it appears to be random memory corruption, it's unlikely you can trigger the same panic again, but in case you can, I added some guard words around the unique struct and changed SPL to dirty memory allocations (those that aren't zalloc). But really, we probably need lldb connected and a backtrace to get a feel for how it dies.

---
Can you not reproduce this? It isn't pool, pool geometry, or system dependent as far as I can tell, and it's easy for me to reproduce. (I'm about to try again with the issue197 zfs and issue201 spl.)

---
Got a hopefully more informative panic. With issue197 zfs and issue201 spl, following the method in issuecomment-46371332 (#197 (comment)):

Terminal 1:

Terminal 2:

Result, panic:

---
I have put ssdholds in a while loop, as well as

My test pool may be too small, though.

---
The snapshots that are involved are much smaller (and more numerous), so the race (assuming there is one) is probably in adding or removing the temporary holds. Below is an example of one dataset that will trigger the panic reliably. This issue originated when I started noticing problems with a job that does incrementals across 52 datasets; doing those incrementals while running the while loop also causes a panic. Most of the datasets are changed only irregularly, so the changes from one snapshot to the next are about zero.

---
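For scale, a hedged sketch of the kind of scheduled job described; the dataset layout, snapshot names, and output path are all hypothetical:

```sh
#!/bin/sh
# Hypothetical backup job: incremental sends across many datasets on one
# pool, most of which barely change between snapshots.
for ds in $(zfs list -H -o name -r anypool/data); do
  zfs send -I "${ds}@lastbackup" "${ds}@thisbackup" \
    > "/backup/$(echo "$ds" | tr / _).zstream"
done
```

---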
@rottegift I wonder if this would happen with a slightly different set of scripts: holds-ssd in a while loop to repeatedly get the zfs hold list, and, rather than zfs send, a while loop that does zfs hold and zfs release repeatedly. That may be easier to reproduce, and may narrow down whether the issue relates only to the hold/release or to the send itself.

---
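A sketch of that variant; the loop bodies and the hold tag are assumptions:

```sh
# Terminal 1: repeatedly fetch the hold list, as holds-ssd is described
# as doing (the exact script body is an assumption).
while :; do
  zfs list -H -t snapshot -o name -r ssdpool | xargs zfs holds
done

# Terminal 2: instead of a zfs send, add and release a recursive hold in a
# tight loop (the tag name "test" is an assumption).
while :; do
  zfs hold -r test ssdpool@a
  zfs release -r test ssdpool@a
done
```

---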
I remember seeing the panic log some time ago and wondering if you were running zpool iostat or zfs list in a second terminal - and actually it's this while loop with

On another note, any reason not to use

---
Yep. That panicked. Details in a moment.

Sent from my iPhone

---
Plus you could try

(just removing the holds step) to see if it still panics. I'm suspecting the issue is in

---
A long moment. The panic left the boot volume (jhfs+) corrupted. The scripts aren't "real" - they were an attempt to get a reduced test case. Roughly: zfs snapshot -r ssdpool@a, then a loop doing zfs hold -r test ssdpool@a ; zfs release -r test ssdpool@a made three or four iterations, then kaboom.

Sent from my iPhone

---
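Put together as a single reproducer, that reduced test case looks roughly like this (the tag name and loop shape are reconstructed from the description above):

```sh
#!/bin/sh
# Reduced test case: create a recursive snapshot, then hold/release it in a
# loop. Reportedly panicked after three or four iterations on the affected
# builds.
zfs snapshot -r ssdpool@a
while :; do
  zfs hold -r test ssdpool@a
  zfs release -r test ssdpool@a
done
```

---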
Ok, with a send traversing over 50 snapshots, it hung. But no lldb attach; it probably trashed the stack. At least I have something to work on :)

---
Yeah, my panic left no stack trace at all. Glad you can reproduce it. :-)

---
@evansus With the zfs holds removed from the script:

then the repeated hold/release loop

then the zfs list threw an ASSERT, which was lost in the immediately subsequent panic:

---
ok could you try out

---

Should I try out the mem-fix branch or just the openzfsonosx/spl@220a3ec changes on the issue201 branch?

---

We determined it didn't fix anything, so no need.

---

Just check out branch mem-fix on spl, run whatever zfs branch you want, and see. I added 2 commits after my last (faulty) message.

---

Ok. crosses fingers :)

---

It's surviving serious I/O load so far.

---

@rottegift So is it still all OK?

---

Yes, it's been up and fine for 26 hours under substantial workload, and I can no longer repeat the panic with the holds looping, so closing.

---
[master today]
It was doing a "zfs send -I dataset@somesnap dataset@someothersnap" at the time.
I seem to be able to reproduce this, but I haven't nailed down whether it's one precise dataset or snapshot. It happens during a job that does incremental sends of quite a few datasets on one pool.