zfs causes machine to lock up when doing a "zfs diff" #2139
Comments
It has been a while, and there isn't any comment here yet. Did I submit to the wrong place? Is there a mailing list or something I should have used instead? I am not trying to rush anyone (I don't actually need to use the zfs diff). I just want to be sure that I have reported the issue in the right way, and reiterate that I would be happy to give more information or try to help track down the problem. For example, if it would help, I could set up a VM to try to reproduce the issue in.
@crowtrobot You certainly posted your problem report to the correct place. I suspect the only reason you've not yet gotten a reply is that all the developers are busy dealing with other things at the moment. A peek at your stack traces shows some pretty deep recursion within traverse_visitbp(), which could be causing problems. That and many, many other issues have been addressed since the 0.6.2 release. My suggestion for the moment would be to try running the current master code, if you feel comfortable doing so, and see whether the problem persists.
@crowtrobot Yes, you're certainly in the right place. I meant to comment on this when you first submitted it, but I must have been dragged off to look at something else before I was able to. My suggestion would be the same as @dweeezil's. If you're comfortable with running the latest code, please try that; there have been a large number of improvements made.
Great. Thanks guys. I will try upgrading to the latest code over the weekend and will let you know what comes of that.
I am sorry, it took me longer to get back to this than I had hoped. I did get zfs and spl from git, and compiled and installed them last week. The kmod deb packages were spl_0.6.2-1_amd64.deb It had no problem with my relatively small zraid1/virtualbox file system, but on zraid1/home it choked again. For context, my zfs list: I don't think anything in the first 4 items I posted above has changed. I did notice something weird this time. I had htop running while doing the zfs diff, and at the same moment the HDD light on the computer stopped blinking, htop stopped refreshing. Another htop that I started after the hard drive activity stopped worked fine, but the one that stopped refreshing never started working again.
Closing this; it is believed to have been resolved in master.
I tried to do a zfs diff for the first time a few days ago, and after it churned away and showed about a dozen changed files, everything stopped. My computer locked up and I had to use the magic SysRq keystrokes to reboot. I ran a scrub, which found no problems, so I tried it again, and again it locked up. Hoping to keep the system from locking up on the next attempt, I logged out of X and into a text terminal. I turned off my swap (which is on a zvol), thinking that if zfs is misbehaving and breaking swap, that could upset the kernel.
The snapshot I was originally trying to diff had been deleted (auto-snapshot). But when I tried to diff another snapshot, I got a partial diff (637 lines) followed by "Unable to determine path or stats for object 8363 in zraid1/home@zfs-auto-snap_daily-2014-01-20-0737: No such file or directory". I got similar errors from other snapshots (with different object numbers), and the fourth snapshot I tried seemed to lock up disk I/O. I couldn't read or write files on zfs, but root was still usable (xfs on an SSD). I left it in this locked-up state for several minutes and then tried a normal reboot, which got stuck, so again I used the magic SysRq reboot. I did another scrub, which again reported no problems.
I searched through the issues reported here and didn't see anything that looked related, but I did find a guide for reporting system hangs. It asks for the following information:
```
alain@neon:/tmp$ sudo zdb
zraid1:
    version: 5000
    name: 'zraid1'
    state: 0
    txg: 375816
    pool_guid: 15338212067028546992
    hostname: 'neon'
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 15338212067028546992
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 4779923121281758510
            path: '/dev/disk/by-partuuid/886c9e72-cba2-4dfc-a97f-a7574b4cca2f'
            whole_disk: 0
            metaslab_array: 37
            metaslab_shift: 33
            ashift: 12
            asize: 1685820276736
            is_log: 0
            DTL: 4559
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 791129017847427098
            path: '/dev/disk/by-partuuid/eba4140f-98ac-4af1-8363-d9278987b4f4'
            whole_disk: 0
            metaslab_array: 34
            metaslab_shift: 33
            ashift: 12
            asize: 1685820276736
            is_log: 0
            DTL: 4558
            create_txg: 4
    features_for_read:
```
kernel version, ZFSOnLinux ZFS and SPL versions:
Linux Mint 15 "Olivia"
```
$ sudo uname -a
Linux neon 3.8.0-35-generic #50-Ubuntu SMP Tue Dec 3 01:24:59 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
```
The zfs-dkms and SPL package versions are both 0.6.2-1~raring.
`zcat /proc/config.gz`: https://gist.github.com/crowtrobot/9165135
`for i in /proc/*/stack; do echo $i; cat $i; done;` and `ps -ef` during the hang:
https://gist.github.com/crowtrobot/db940338e6f4df3ba596
https://gist.github.com/crowtrobot/9177539
I didn’t think to grab that, but here is the syslog from that time:
https://gist.github.com/crowtrobot/9165178
`/proc/spl/kmem/slab` during the hang:
https://gist.github.com/crowtrobot/9177578
https://gist.github.com/crowtrobot/9177610
`zfs get all` and the specific names of the datasets/zvols involved in your workload would also be useful in enabling the developers to understand your system configuration and diagnose issues.
https://gist.github.com/crowtrobot/9165212
And I was trying to run `zfs diff zraid1/home@zfs-auto-snap_weekly-2014-01-19-1532`.
I can’t think of anything else, but I would be happy to help in any way a non-developer-type can.
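The hang-time items requested by the guide (the per-process kernel stacks, `ps -ef`, and `/proc/spl/kmem/slab`) have to be captured while the system is still wedged, so it may help to have them scripted ahead of time. The following is a minimal sketch of such a helper; it is not part of the hang-reporting guide, and the script, its default output directory under `/tmp`, and the output file names are all hypothetical choices for illustration. Because ZFS I/O was blocked during the hang while the xfs root stayed usable, the output directory should be on a non-ZFS filesystem.

```shell
#!/bin/sh
# Hypothetical helper (not from the hang-reporting guide): capture the
# hang-time diagnostics into one timestamped directory.
# Assumption: the output directory lives on a non-ZFS filesystem, since
# ZFS I/O may be blocked while the hang is in progress.
# Reading /proc/<pid>/stack normally requires root, so run it via sudo.

outdir="${1:-/tmp/zfs-hang-$(date +%Y%m%d-%H%M%S)}"
mkdir -p "$outdir" || exit 1

# Kernel stack of every process (same loop as in the guide); unreadable
# entries are skipped silently instead of aborting the capture.
for i in /proc/*/stack; do
    echo "$i"
    cat "$i" 2>/dev/null
done > "$outdir/stacks.txt"

# Process list at the time of the hang.
ps -ef > "$outdir/ps-ef.txt" 2>&1

# SPL slab statistics; this file only exists while the spl module is loaded.
if [ -r /proc/spl/kmem/slab ]; then
    cat /proc/spl/kmem/slab > "$outdir/spl-slab.txt"
fi

# Recent kernel messages; may need root when dmesg access is restricted.
dmesg > "$outdir/dmesg.txt" 2>/dev/null || true

# Print where everything was written.
echo "$outdir"
```

Run it from a text console as root, for example `sudo sh collect-hang-info.sh /root/hang1` (the script name is hypothetical), before resorting to a SysRq reboot, then attach the resulting files to the issue.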