zfs causes machine to lock up when doing a "zfs diff" #2139

Closed
crowtrobot opened this issue Feb 23, 2014 · 6 comments

Comments

@crowtrobot

I tried to do a zfs diff for the first time a few days ago, and after it churned away and showed about a dozen changed files, everything stopped. My computer locked up and I had to use the magic SysRq keystrokes to reboot. I ran a scrub, which found no problems, so I tried the diff again, and again it locked up. Hoping to keep the system from locking up on the next attempt, I logged out of X and into a text terminal. I also turned off my swap (which is on a zvol), thinking that if zfs was misbehaving and breaking swap, that could piss the kernel off.

The snapshot I was originally trying to diff had been deleted (auto-snapshot). But when I tried to diff another snapshot I was able to get a partial diff (637 lines) followed by "Unable to determine path or stats for object 8363 in zraid1/home@zfs-auto-snap_daily-2014-01-20-0737: No such file or directory". I got similar errors from other snapshots (different object numbers), and the fourth snapshot I tried seemed to lock up disk I/O. I could neither read nor write files on zfs, but root was still usable (xfs on an SSD). I left it in this locked-up state for several minutes and then tried to do a normal reboot, which got stuck, so again I gave it the magic SysRq reboot. I did another scrub, which again reported no problems.

I searched through the issues reported here, and I didn’t see anything that looked to me to be related, but did find a guide for reporting system hangs. It asks for the following information:

  1. Workload: Nothing but the zfs diff was going on.
  2. Pool configuration: Nothing fancy here
    alain@neon:/tmp$ sudo zdb
    zraid1:
    version: 5000
    name: 'zraid1'
    state: 0
    txg: 375816
    pool_guid: 15338212067028546992
    hostname: 'neon'
    vdev_children: 2
    vdev_tree:
    type: 'root'
    id: 0
    guid: 15338212067028546992
    create_txg: 4
    children[0]:
    type: 'disk'
    id: 0
    guid: 4779923121281758510
    path: '/dev/disk/by-partuuid/886c9e72-cba2-4dfc-a97f-a7574b4cca2f'
    whole_disk: 0
    metaslab_array: 37
    metaslab_shift: 33
    ashift: 12
    asize: 1685820276736
    is_log: 0
    DTL: 4559
    create_txg: 4
    children[1]:
    type: 'disk'
    id: 1
    guid: 791129017847427098
    path: '/dev/disk/by-partuuid/eba4140f-98ac-4af1-8363-d9278987b4f4'
    whole_disk: 0
    metaslab_array: 34
    metaslab_shift: 33
    ashift: 12
    asize: 1685820276736
    is_log: 0
    DTL: 4558
    create_txg: 4
    features_for_read:
  3. The Linux distribution, the release version of the Linux distribution, Linux
    kernel version, ZFSOnLinux ZFS and SPL versions:
    Linux Mint 15 "Olivia"
    $ sudo uname -a
    Linux neon 3.8.0-35-generic #50-Ubuntu SMP Tue Dec 3 01:24:59 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
    The zfs-dkms and SPL package versions are both 0.6.2-1~raring
  4. If possible, the kernel configuration file. e.g. zcat /proc/config.gz:
    https://gist.github.com/crowtrobot/9165135
  5. If possible, the output of for i in /proc/*/stack; do echo $i; cat $i; done; and ps -ef during the hang.
    https://gist.github.com/crowtrobot/db940338e6f4df3ba596
    https://gist.github.com/crowtrobot/9177539
  6. Dmesg output:
    I didn’t think to grab that, but here is the syslog from that time:
    https://gist.github.com/crowtrobot/9165178
  7. If possible, the contents of /proc/spl/kstat/zfs/arcstats and
    /proc/spl/kmem/slab during the hang.
    https://gist.github.com/crowtrobot/9177578
    https://gist.github.com/crowtrobot/9177610
  8. If you are comfortable providing it, the output of zfs get all and the
    specific names of the datasets/zvols involved in your workload would also be
    useful in enabling the developers to understand your system configuration and
    diagnose issues.
    https://gist.github.com/crowtrobot/9165212
    And I was trying to run "zfs diff zraid1/home@zfs-auto-snap_weekly-2014-01-19-1532"
  9. Any other information that appears to be relevant.
    I can’t think of anything else, but I would be happy to help in any way a non-developer-type can.
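
The collection steps in items 5 and 7 above could be scripted so they can be run quickly from a second terminal during a hang. This is a minimal sketch in Python, assuming it runs as root (reading /proc/*/stack normally requires it). The function name `collect_hang_info` and the output file names are made up for illustration, and `proc_root` is a parameter only so the function can be pointed at a test directory.

```python
import glob
import os
import subprocess


def collect_hang_info(out_dir, proc_root="/proc"):
    """Save kernel stacks, ARC stats and SPL slab info under out_dir."""
    os.makedirs(out_dir, exist_ok=True)

    # Equivalent of: for i in /proc/*/stack; do echo $i; cat $i; done
    with open(os.path.join(out_dir, "stacks.txt"), "w") as out:
        for path in sorted(glob.glob(os.path.join(proc_root, "*", "stack"))):
            out.write(path + "\n")
            try:
                with open(path) as f:
                    out.write(f.read())
            except OSError as err:
                # Processes can exit mid-scan; reading stacks needs root.
                out.write("  (unreadable: %s)\n" % err)

    # Copy the two files requested in item 7, if they are present.
    for rel, name in (("spl/kstat/zfs/arcstats", "arcstats.txt"),
                      ("spl/kmem/slab", "slab.txt")):
        src = os.path.join(proc_root, rel)
        if os.path.exists(src):
            with open(src) as f, open(os.path.join(out_dir, name), "w") as out:
                out.write(f.read())

    # ps -ef only makes sense when looking at the real /proc.
    if proc_root == "/proc":
        with open(os.path.join(out_dir, "ps.txt"), "w") as out:
            subprocess.run(["ps", "-ef"], stdout=out, check=False)
    return out_dir
```

The output directory should be on a filesystem outside the pool (for example the xfs root mentioned above), since writes to zfs may block while the hang is in progress.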
@crowtrobot
Author

It has been a while, and there isn't any comment here yet. Did I submit to the wrong place? Is there a mailing list or something I should have used instead? I am not trying to rush anyone (I don't actually need to use zfs diff); I just want to be sure that I have reported the issue in the right way, and to reiterate that I would be happy to give more information or try to help track down the problem. For example, if it would help, I could set up a VM to try to reproduce the issue in.

@dweeezil
Contributor

@crowtrobot You certainly posted your problem report to the correct place. I suspect the only reason you've not yet gotten a reply is because all the developers are busy dealing with other things at the moment. A peek at your stack traces shows some pretty deep recursion within traverse_visitbp() which could be causing problems. That and many, many, other issues have been addressed since the 0.6.2 release. My suggestion for the moment would be to try to run the current master code if you feel comfortable doing so and to see whether the problem persists.
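
To put a rough number on the recursion mentioned above, the saved `/proc/*/stack` output can be scanned for how many times `traverse_visitbp` appears within a single task's stack. This is only a sketch, assuming the usual `[<address>] symbol+0xoff/0xlen [module]` frame format and that any line not in that format (such as a `/proc/<pid>/stack` header) marks the start of a new stack. The helper name `max_frames_per_stack` is hypothetical.

```python
import re

# Matches frames such as "[<ffffffffa01b2c3d>] traverse_visitbp+0x1a3/0x700 [zfs]"
FRAME_RE = re.compile(r"\]\s+(?:\?\s+)?([\w.]+)\+0x")


def max_frames_per_stack(stack_text, symbol="traverse_visitbp"):
    """Return the largest number of `symbol` frames found in any one stack."""
    best = count = 0
    for line in stack_text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            # Not a frame line (e.g. a "/proc/123/stack" header): new stack.
            count = 0
        elif m.group(1) == symbol:
            count += 1
            best = max(best, count)
    return best
```

Frames from other functions in between do not reset the count, so indirect recursion through helper functions is still counted per stack.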

@behlendorf behlendorf added this to the 0.6.5 milestone Mar 21, 2014
@behlendorf behlendorf added the Bug label Mar 21, 2014
@behlendorf
Contributor

@crowtrobot Yes, you're certainly in the right place. I meant to comment on this when you first submitted it but I must have been dragged off to look at something else before I was able to. My suggestion would be the same as @dweeezil's. If you're comfortable with running the latest code, please try that; there have been a large number of improvements made.

@crowtrobot
Author

Great. Thanks guys. I will try upgrading to the latest code over the weekend and will let you know what comes of that.

@crowtrobot
Author

I am sorry, it took me longer to get back to this than I had hoped. I got zfs and spl from git, then compiled and installed them last week. The kmod deb packages were spl_0.6.2-1_amd64.deb and zfs_0.6.2-1_amd64.deb, even though I expected higher version numbers. Anyway, proceeding on the assumption that the spl-master and zfs-master I downloaded from github were in fact current code, I tried the zfs diff command again.
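
One way to check which build the kernel actually loaded, independent of the .deb file names, might be to read the version strings the modules export through sysfs. This is a small sketch, assuming the `zfs` and `spl` modules are loaded and expose `/sys/module/<name>/version`. `loaded_module_versions` is an illustrative name, and `sysfs_root` is a parameter only to make the function testable.

```python
import os


def loaded_module_versions(names=("spl", "zfs"), sysfs_root="/sys/module"):
    """Map each module name to its reported version, or None if unavailable."""
    versions = {}
    for name in names:
        path = os.path.join(sysfs_root, name, "version")
        try:
            with open(path) as f:
                versions[name] = f.read().strip()
        except OSError:
            # Module not loaded, or it does not export a version string.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in loaded_module_versions().items():
        print(name, version or "not loaded / no version exported")
```

If an older module is still loaded (for example from the previous dkms package), the string reported here would show that regardless of what was just installed.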

It had no problem with my relatively small zraid1/virtualbox file system, but on zraid1/home, it choked again.  For context, my zfs list:
alain@neon:~$ sudo zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
zraid1             1.91T  1.11T   136K  /zraid1
zraid1/home        1.63T  1.11T  1.52T  /home
zraid1/home/old    22.5G  1.11T  22.3G  /home/old
zraid1/swap1       17.0G  1.11T  8.91G  -
zraid1/torrents     190G  1.11T  22.3G  -
zraid1/virtualbox  75.8G  1.11T  37.5G  /home/virtualbox

I don’t think anything in the first 4 items I posted above has changed.  
5.  The new ps -ef output:
https://gist.github.com/crowtrobot/10561908
   and the stacks
https://gist.github.com/crowtrobot/10561954
6.  Dmesg output  https://gist.github.com/crowtrobot/10562026
7.  Arcstats
https://gist.github.com/crowtrobot/10562045
 and slab
https://gist.github.com/crowtrobot/10562099
8.  New zfs get all
https://gist.github.com/crowtrobot/10562225

I did notice something weird happened this time.  I had htop running while doing the zfs diff, and at the same time that the HDD light on the computer stopped blinking, htop stopped refreshing.  Another htop I started after the hard drive activity stopped, worked fine.  But the one that stopped refreshing never started working again.  

@behlendorf
Contributor

Closing; this is believed to have been resolved in master.
