Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Null pointer dereference at arc_read #2171

Closed
jvsalo opened this issue Mar 10, 2014 · 12 comments
Closed

Null pointer dereference at arc_read #2171

jvsalo opened this issue Mar 10, 2014 · 12 comments
Milestone

Comments

@jvsalo
Copy link

jvsalo commented Mar 10, 2014

I hit this one today. It's Debian stock 3.2.0-4-amd64 kernel with ZoL 0.6.2 from the ZoL Debian packages. The machine has 64GB of ECC RAM and ZFS is loaded with default options. This zpool has 3x 6-disk raidz2 with 2x 240GB L2ARC SSDs.

The server runs BackupPC (rsync) and Netatalk for Apple Time Machine. It also sends and receives streams from other ZFS machines. Many of these things were going on at the time of the crash.

This host has been running without problems for quite some time. I don't think it will be easy for me to reproduce it.

Could this be https://www.illumos.org/issues/4089 ?

[9318327.840130] BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
[9318327.840210] IP: [<ffffffffa0857e79>] arc_read+0x5fe/0x709 [zfs]
[9318327.840271] PGD 0 
[9318327.840308] Oops: 0000 [#1] SMP 
[9318327.840353] CPU 5 
[9318327.840362] Modules linked in: ipt_REJECT sha256_generic iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack xt_multiport rfcomm parport_pc ppdev lp parport bnep cpufreq_conservative cpufreq_powersave bluetooth cpufreq_userspace rfkill cpufreq_stats crc16 binfmt_misc nfsd nfs nfs_acl auth_rpcgss fscache lockd sunrpc bonding iptable_filter ip_tables x_tables 8021q garp stp hwmon_vid ipmi_si ipmi_msghandler loop fuse dm_crypt zfs(P) psmouse zunicode(P) zavl(P) zcommon(P) znvpair(P) spl(O) zlib_deflate coretemp serio_raw pcspkr evdev joydev i2c_i801 shpchp i2c_core iTCO_wdt iTCO_vendor_support ioatdma acpi_cpufreq mperf wmi processor thermal_sys button container xfs dm_mod raid1 md_mod usbhid hid ses enclosure sg sd_mod crc_t10dif crc32c_intel ghash_clmulni_intel aesni_intel aes_x86_64 aes_generic cryptd ahci libahci libata ehci_hcd mpt2sas usbcore raid_class scsi_transport_sas scsi_mod usb_common igb dca [last unloaded: scsi_wait_scan]
[9318327.841330] 
[9318327.841363] Pid: 18329, comm: zfs Tainted: P           O 3.2.0-4-amd64 #1 Debian 3.2.35-2 Supermicro X9DRW/X9DRW
[9318327.841451] RIP: 0010:[<ffffffffa0857e79>]  [<ffffffffa0857e79>] arc_read+0x5fe/0x709 [zfs]
[9318327.841531] RSP: 0018:ffff88048c8639d8  EFLAGS: 00010282
[9318327.841572] RAX: ffff8800067aeda0 RBX: ffff88037fc91210 RCX: 0000000000000000
[9318327.841639] RDX: 0000000000000000 RSI: ffffffffa08fda60 RDI: ffff88006120f600
[9318327.841706] RBP: ffff8806d1733eb0 R08: ffff88006120f600 R09: 0000000000000200
[9318327.841773] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9002412b840
[9318327.841840] R13: ffff88048c863af4 R14: ffff88103cb56000 R15: 0000000000000000
[9318327.841908] FS:  00007f315c4bbc40(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
[9318327.841978] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[9318327.842020] CR2: 0000000000000014 CR3: 000000093e8b9000 CR4: 00000000000406e0
[9318327.842087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[9318327.842154] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[9318327.842221] Process zfs (pid: 18329, threadinfo ffff88048c862000, task ffff880654349160)
[9318327.842290] Stack:
[9318327.842322]  ffffffffa085648e ffff88102cdb9980 ffff881000000000 ffff881000040640
[9318327.842406]  0000000000000000 0000000000000003 ffff881025f79cc0 ffff88103c655000
[9318327.842491]  000000330a681800 ffff88000c9e7200 0000000000000200 0000000000000000
[9318327.842576] Call Trace:
[9318327.842622]  [<ffffffffa085648e>] ? arc_read_done+0x217/0x217 [zfs]
[9318327.842676]  [<ffffffffa085c238>] ? dbuf_read+0x388/0x4ba [zfs]
[9318327.842731]  [<ffffffffa0861ca8>] ? dmu_buf_hold+0xac/0xed [zfs]
[9318327.842793]  [<ffffffffa08abfab>] ? zap_lockdir+0x48/0x497 [zfs]
[9318327.842839]  [<ffffffff8134c508>] ? mutex_lock+0xd/0x2d
[9318327.842896]  [<ffffffffa08ac576>] ? zap_cursor_retrieve+0x4e/0x1c2 [zfs]
[9318327.842942]  [<ffffffff810363d8>] ? should_resched+0x5/0x23
[9318327.842989]  [<ffffffff8134bd27>] ? _cond_resched+0x7/0x1c
[9318327.843030]  [<ffffffff8134c508>] ? mutex_lock+0xd/0x2d
[9318327.843088]  [<ffffffffa08a4b50>] ? vdev_config_sync+0x13f/0x13f [zfs]
[9318327.843146]  [<ffffffffa0866d44>] ? dmu_dir_list_next+0x74/0x115 [zfs]
[9318327.843195]  [<ffffffffa02ebd7a>] ? kmem_free_debug+0xc5/0x10d [spl]
[9318327.843242]  [<ffffffff810eaed0>] ? kfree+0x5b/0x6c
[9318327.843285]  [<ffffffffa02ebd7a>] ? kmem_free_debug+0xc5/0x10d [spl]
[9318327.843329]  [<ffffffff810363d8>] ? should_resched+0x5/0x23
[9318327.843373]  [<ffffffff8134bd27>] ? _cond_resched+0x7/0x1c
[9318327.843414]  [<ffffffff810363d8>] ? should_resched+0x5/0x23
[9318327.843457]  [<ffffffff8134bd27>] ? _cond_resched+0x7/0x1c
[9318327.843498]  [<ffffffff8134c508>] ? mutex_lock+0xd/0x2d
[9318327.843553]  [<ffffffffa086e10e>] ? dnode_rele+0x3d/0x62 [zfs]
[9318327.843605]  [<ffffffffa085c960>] ? dbuf_rele_and_unlock+0x16/0x19f [zfs]
[9318327.843662]  [<ffffffffa08644f3>] ? dmu_objset_prefetch+0xea/0xf2 [zfs]
[9318327.853523]  [<ffffffffa08b82e0>] ? zfs_ioc_dataset_list_next+0xf3/0x1c3 [zfs]
[9318327.853604]  [<ffffffffa08bc3f7>] ? zfsdev_ioctl+0x103/0x159 [zfs]
[9318327.853648]  [<ffffffff8110706d>] ? do_vfs_ioctl+0x459/0x49a
[9318327.853690]  [<ffffffff811070f9>] ? sys_ioctl+0x4b/0x72
[9318327.853733]  [<ffffffff8100ee86>] ? math_state_restore+0x4b/0x55
[9318327.853777]  [<ffffffff81352012>] ? system_call_fastpath+0x16/0x1b
[9318327.853818] Code: c7 04 24 8e 64 85 a0 89 54 24 18 8b 54 24 5c 89 54 24 10 48 8b 54 24 40 e8 c0 b3 07 00 48 8b 93 f0 00 00 00 48 c7 c6 60 da 8f a0 <48> 63 7a 14 48 89 44 24 30 e8 79 c1 ff ff 41 f6 45 00 04 48 8b 
[9318327.854203] RIP  [<ffffffffa0857e79>] arc_read+0x5fe/0x709 [zfs]
[9318327.854254]  RSP <ffff88048c8639d8>
[9318327.854289] CR2: 0000000000000014
[9318327.854883] ---[ end trace 9a59537db9a68d12 ]---
@Ukko-Ylijumala
Copy link

This looks like the same bug as #2165.

edit: call paths up to dbuf_read differ, so looks like we can arrive at the same bug via different routes.

@behlendorf behlendorf added this to the 0.6.4 milestone Mar 20, 2014
@behlendorf behlendorf added the Bug label Mar 20, 2014
@behlendorf
Copy link
Contributor

Yes, it looks very likely that this is the same as Illumos 4089. The NULL dereference occurred in basically that same chunk of code. If you still have the zfs.ko module from the failure and it has debugging symbols we could get the exact line to confirm this.

Regardless we should pick up the upstream fix, I'll open a pull request.

$ gdb module/zfs/zfs.ko
GNU gdb (GDB) Fedora 7.6.1-46.fc19
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/behlendo/src/git/zfs/module/zfs/zfs.ko...done.

(gdb) list *(arc_read+0x5fe)
...

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Mar 20, 2014
4089 NULL pointer dereference in arc_read()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4089
  illumos/illumos-gate@57815f6

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2171
@behlendorf behlendorf modified the milestones: 0.6.3, 0.6.4 Mar 20, 2014
@jvsalo
Copy link
Author

jvsalo commented Mar 22, 2014

Unfortunately I don't have debug symbols in the module (issue #2173). Thanks for making the pull request! I don't think I can verify the fix, because I don't have an easy reproducer.

@josephglanville
Copy link

I think we hit this doing torture tests on Ubuntu 12.04 with Linux 3.2 kernel and ZFS 0.6.2.
Here is the stack trace from our environment:
https://gist.github.com/josephglanville/56795868175bb63851f4

Unfortunately I don't have the module built with debug symbols either. :(
We can however reproduce the bug so I am going to rebuild with debug and attempt to do that.

behlendorf pushed a commit that referenced this issue Mar 24, 2014
4089 NULL pointer dereference in arc_read()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4089
  illumos/illumos-gate@57815f6

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2171
Issue #2165
Closes #2198
@behlendorf
Copy link
Contributor

@josephglanville The expected fix has been merged to master. It would be helpful if you could verify you're unable to recreate the issue using the latest source from Github.

@josephglanville
Copy link

@behlendorf I have gotten master installed. This takes about 24-48hrs to reproduce under load, will let you know how it goes.

@josephglanville
Copy link

I had some other issues with master so I backported the patch to 0.6.2.
I'm not sure if I applied the patch wrong (or it was dependent on other changes) but I still hit the same or similar NPD as seen here:

Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.511900] BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.519847] IP: [<ffffffffa0268d81>] arc_read+0x3f1/0x860 [zfs]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.525844] PGD 60810f067 PUD 612ca8067 PMD 0
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.530381] Oops: 0000 [#1] SMP
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.533681] CPU 0
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.535522] Modules linked in: iscsi_trgt(O) bonding serio_raw joydev i7core_edac edac_core mac_hid zf
s(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl(O) zlib_deflate xfs ses enclosure igb usbhid hid ixgbe mpt2sas scsi_transport_sas dca mdi
o raid_class [last unloaded: iscsi_trgt]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.561372]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.563800] Pid: 614, comm: zvol/24 Tainted: P           O 3.2.0-53-generic #81-Ubuntu Quanta
        QSSC-S99Q             /S99Q
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.577848] RIP: 0010:[<ffffffffa0268d81>]  [<ffffffffa0268d81>] arc_read+0x3f1/0x860 [zfs]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.586296] RSP: 0018:ffff88060dae1b10  EFLAGS: 00010282
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.591646] RAX: ffff8801ad820900 RBX: ffff88014c3f7cc0 RCX: 0000000000000000
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.591646] RAX: ffff8801ad820900 RBX: ffff88014c3f7cc0 RCX: 0000000000000000
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.598822] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8801ad820cb0
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.605997] RBP: ffff88060dae1bc0 R08: ffffc90057d63000 R09: 0000000000002000
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.613173] R10: ffff880592f96da0 R11: ffff8806133fd000 R12: ffffc9015602f700
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.620350] R13: ffff8803b205c120 R14: 0000000000002000 R15: ffff88060dae1c4c
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.627527] FS:  0000000000000000(0000) GS:ffff880627200000(0000) knlGS:0000000000000000
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.635685] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.641476] CR2: 0000000000000014 CR3: 0000000611487000 CR4: 00000000000006f0
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.648656] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.655831] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.663008] Process zvol/24 (pid: 614, threadinfo ffff88060dae0000, task ffff880612162e00)
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.671340] Stack:
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.673390]  ffffffffa0266fd0 ffff8806124bec00 ffff880600000000 0000000000040640
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.680934]  ffff880600000000 00000000000ab16e ffff880608572be0 ffff8806133fd000
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.688474]  000000000dae1bc0 ffff880592f96da0 ffffffffa026ebf0 ffff88060da86000
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.696014] Call Trace:
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.698502]  [<ffffffffa0266fd0>] ? arc_read_done+0x2f0/0x2f0 [zfs]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.704826]  [<ffffffffa026ebf0>] ? dbuf_do_evict+0x70/0x70 [zfs]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.710971]  [<ffffffffa026f4fd>] dbuf_read+0x25d/0x860 [zfs]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.716768]  [<ffffffffa0277073>] dmu_buf_hold_array_by_dnode+0x163/0x550 [zfs]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.724151]  [<ffffffffa01d139a>] ? kmem_alloc_debug+0x14a/0x3d0 [spl]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.730735]  [<ffffffffa0277605>] dmu_buf_hold_array+0x65/0x90 [zfs]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.737137]  [<ffffffffa00ab18b>] ? avl_find+0x5b/0xb0 [zavl]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.742935]  [<ffffffffa0278856>] dmu_read_req+0x56/0x160 [zfs]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.748916]  [<ffffffffa0312494>] zvol_read+0x64/0xb0 [zfs]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.754532]  [<ffffffffa01d6626>] taskq_thread+0x236/0x4b0 [spl]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.760582]  [<ffffffff810608e0>] ? try_to_wake_up+0x200/0x200
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.766457]  [<ffffffffa01d63f0>] ? task_done+0x160/0x160 [spl]
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.772419]  [<ffffffff8108b64c>] kthread+0x8c/0xa0
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.777338]  [<ffffffff8166b474>] kernel_thread_helper+0x4/0x10
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.783304]  [<ffffffff8108b5c0>] ? flush_kthread_worker+0xa0/0xa0
Mar 28 20:06:44 blk3.b2.z1 kernel: [91356.789526]  [<ffffffff8166b470>] ? gs_change+0x13/0x13

ryao pushed a commit to ryao/zfs that referenced this issue Apr 9, 2014
4089 NULL pointer dereference in arc_read()

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4089
  illumos/illumos-gate@57815f6

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#2171
Issue openzfs#2165
Closes openzfs#2198

Conflicts:
	module/zfs/arc.c
@tuxoko
Copy link
Contributor

tuxoko commented Apr 15, 2014

Hi @josephglanville
I don't think you applied your patch properly.
Your kernel oops is identical with the first one, meaning the code wasn't changed.
Could you double check your backport or try master and test again.

Thanks

@josephglanville
Copy link

@tuxoko I think I did apply the patch correctly but it didn't fix the problem. However I then applied all the patches in @ryao's ZFS fork here: https://github.com/ryao/zfs/tree/zfs-0.6.2-r4 and that fixed my problems. I haven't had time to bisect and find out exactly which other patch fixed the problem but I suspect it was ryao@74db5bf

@behlendorf
Copy link
Contributor

Just for reference the patch you referenced has been merged to master.

@ryao
Copy link
Contributor

ryao commented Apr 22, 2014

Did the patch merged to master resolve this issue for everyone?

@behlendorf
Copy link
Contributor

Since we've now merged the patches referenced above in to master which fixed the issue I'm going to close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants