tsd_hash_search() GPF #174
This is the first report I've seen of problems in the tsd code. Is there a particular workload you're using to trigger this issue? |
No, I'm afraid I've not been able to pin it down. It seems random, maybe once every three months per machine .. I mainly use AMD chips, but I've seen it on a Dell R710 (Intel/16 core) too ... [I have 1 machine thus far I've not seen it on, but that's only been up for 6 weeks ..] The only thing I can think of is that they're all running KVM instances to varying degrees .. this particular machine is serving data via two VPS's running gluster. Other machines are running KVM off QCOW2 images sat directly on ZFS partitions. All machines use SSD root disks .. machines are all between 1 & 3 years old, mostly single chip / 6-core, although the Dell was dual chip / 8-core. Nothing particularly exotic; all have multiple NICs of different makes, a mixture of different motherboards and disk controllers .. mixture of disks .. all are running Ubuntu 11.04 and 12.04, which means it spans different kernels .. Don't know if any of that helps? FYI, previously most of these machines were running software RAID10 for 1 year+ without incident ... |
Ok, another one .. different box with a similar specification ... the frequency seems to be after about 20 days of uptime ...
Oct 10 15:05:47 data2 kernel: [1823996.336021] BUG: soft lockup - CPU#3 stuck for 22s! [kvm:14086] |
Hey guys, as directed from the mailing list, here are the details of my case, which seems to be related to this same issue. We've recently been playing around with a new hardware configuration here, and as part of that I have an easily reproducible kernel panic/oops. Kernel oops:
Setup: SuperMicro X9DRE-TF+ motherboard, 2 x E5-2620, 24 x 8G Kingston DDR3. I have 2 raidz2 pools: ssd1, comprised of all 8 SSDs, and sata1, comprised of the SATA disks. Steps to reproduce: I run sysbench fileio random read/write tests for about 10-15 minutes and I get the above oops.
cd /ssd1
sysbench --num-threads=12 --max-requests=50000 --test=fileio --file-num=4096 --file-block-size=64K --file-test-mode=rndrw --file-fsync-freq=1 prepare
sysbench --num-threads=12 --max-requests=50000 --test=fileio --file-num=4096 --file-block-size=64K --file-test-mode=rndrw --file-fsync-freq=1 run
You might have to execute the second run line a few times in a row in order to get the oops. It doesn't always happen the first time. Between runs I do zpool export ssd1 && zpool import ssd1 to clear the cache. We tested the following kernels with ZFS incorporated into the kernel (not as a module). With the ZFS 0.6.0-rc12 sources we tried kernels: The crash seems to be the same regardless of the kernel. Another issue I noticed while running this comes from the test itself. It complains with this:
It could be the test itself but I decided to post it here since it might be related. Notice the garbage in the brackets! I would appreciate it if someone could take a look at this. I am ready to provide further details if needed. Regards, Rumen Telbizov |
@telbizov I've been unable to reproduce the issue locally, even with your test case. However, the above patch should resolve the issue since it effectively retires the offending code, which isn't needed. Can you please apply the patch and verify the GPF is resolved? |
Hi Brian, Thank you very much for providing this patch. Due to the semi-intermittent nature of the problem we decided to run it over the weekend; I'll report back here on Monday. Hopefully the news will be just as good. I wonder if other participants involved in this issue might want to give it a try as well. Once again, thank you very much for your quick response and help. Regards, Rumen Telbizov |
Brian, Here's the update that I promised. It would be nice to see other people's experience with this patch, but as far as I can tell the problem is resolved. Brian, do you intend to merge this change into the master branch? Thank you once again for your prompt response and the patch. Rumen Telbizov, |
@telbizov That's good news. A little more digging has turned up that this code was originally added to reduce lock contention on the zl_lock from zil_commit(). But the test workload at the time was running the filebench varmail tests and observing very slow fsync() times. Would it be possible for you to test this workload on your system with and without the patch so we know what the impact of this will be? Original Sun issue: 6535160 lock contention on zl_lock from zil_commit |
It's my understanding that the zfs_fsyncer_key TSD was added as a performance optimization to reduce contention on the zl_lock from zil_commit(). This issue manifested itself as very long (100+ms) fsync() system call times for fsync()-heavy workloads. However, under Linux I'm not seeing the same contention that was originally described. Therefore, I'm removing this code in order to wean ourselves off any dependence on TSD. If the original performance issue reappears on Linux we can revisit fixing it without resorting to TSD. This just leaves one small ZFS TSD consumer. If it can be cleanly removed from the code we'll be able to shed the SPL TSD implementation entirely. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs/spl#174
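(For illustration only: a minimal userspace sketch of the per-thread counter pattern this commit message describes, modeled with POSIX thread-specific data rather than the SPL's tsd_get()/tsd_set() API. The function and variable names are hypothetical; this is not the actual ZFS code.)

```c
/*
 * Sketch of the per-thread fsync counter pattern being removed above,
 * using POSIX TSD in place of the SPL tsd API.  All names are made up.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static pthread_key_t fsync_key;        /* stands in for zfs_fsyncer_key */

/* fsync(2) path: bump this thread's private counter. */
static void note_fsync(void)
{
    uintptr_t cnt = (uintptr_t)pthread_getspecific(fsync_key);
    pthread_setspecific(fsync_key, (void *)(cnt + 1));
}

/*
 * Write-logging path: a non-zero counter marks the thread as fsync-heavy,
 * so the log record is issued synchronously instead of piling work onto
 * the shared commit lock later.
 */
static int write_should_be_sync(void)
{
    uintptr_t cnt = (uintptr_t)pthread_getspecific(fsync_key);

    if (cnt != 0)
        pthread_setspecific(fsync_key, (void *)(cnt - 1));
    return cnt != 0;
}

int main(void)
{
    pthread_key_create(&fsync_key, NULL);
    note_fsync();
    printf("sync write? %d\n", write_should_be_sync());  /* 1: fsync seen */
    printf("sync write? %d\n", write_should_be_sync());  /* 0: counter drained */
    pthread_key_delete(&fsync_key);
    return 0;
}
```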
Merged into master as commit openzfs/zfs@31f2b5a. I wasn't able to observe a significant difference in fsync(2) performance with or without the patch. If others notice a significant performance regression we can revisit this issue then. |
Sorry for my late response. I see that you've closed the issue but I did run the suggested test against both my ssd and sata zpools using kernels with and without your patch. Here are the details:
Kernel 2.6.32-279.14.1.el6 [compiled with your patch, i.e. against issue-174]
sata1, raidz2 of 8 sata disks
ssd1, raidz2 of 4 ssd disks
Kernel 3.7.0 [compiled without your patch - straight against github master before you merged issue-174]
sata1, raidz2 of 8 sata disks
ssd1, raidz2 of 4 ssd disks
So, it does look like you are reverting some sort of a "performance fix" related to bugid 6535160 with your issue-174 branch. At least that's how it seems to me.
plain:
issue-174:
The SSD zpool performance doesn't really change much between tests, but on the other hand without your patch (issue-174) the SATA zpool performs almost as fast as the ssd pool! This is strange and seems unreal. It's as if there is some heavy caching effect taking place before your patch. Let me know what you think. It definitely looks like a regression, although I am not sure if it's for better or worse. I'll run additional tests if needed. Maybe we need to reopen this ticket? Thank you Brian. Rumen Telbizov |
@telbizov Thanks for running these benchmarks, you're absolutely right there's a regression here so I'm reopening the issue. I've posted a new patch which re-adds the optimization as per-file instead of per-thread. Can you please verify it fixes the performance regression? |
Thank you for reopening the issue and continuing to look into it. I will recompile a new kernel against the latest patch and run the tests again. Another thing I keep wondering is: can you explain this insane performance boost that I see when testing SATA disks? How could it possibly be anywhere close to SSD, unless there's heavy caching involved? Cheers, Rumen Telbizov |
@telbizov Due to the performance regression I'm going to revert the original fix for this issue from the master branch. Once we're happy that the revised fix doesn't suffer from a similar problem we'll merge it. Until then you'll want to run with a locally patched version to avoid the panic, which your workload seems particularly good at hitting.
Thank you.
It won't. The panic you were hitting was in the thread specific data implementation itself. With the new patch applied that code is still no longer used so we can't really hit that issue. Things may in fact perform slightly better because the new approach sheds some additional locking overhead.
The issue here is really lock contention in zil_commit() on the zl_lock and not disk speed. This optimization is designed to minimize that contention by falling back to synchronous write I/Os when it detects a thread which is calling fsync(2) constantly. This effectively reduces the contention and improves the throughput. The updated patch simply changes the existing logic to detect contention on a per-file instead of per-thread basis. |
The zfs_fsyncer_key TSD was added as a performance optimization to reduce contention on the zl_lock from zil_commit(). This issue manifested itself as very long (100+ms) system call times for fsync(2)-heavy workloads. However, we'd like to eliminate the usage of TSD in ZoL while keeping this optimization. There is no native Linux kernel equivalent of TSD and the SPL's implementation is primitive. It would simply be best to remove all the TSD consumers. To achieve this the zfs_fsyncer_key TSD counter has been reworked to be per-file instead of per-thread. This is not expected to behave significantly differently than the original implementation for real workloads. This just leaves one small ZFS TSD consumer. If it can be cleanly removed from the code we'll be able to shed the SPL TSD implementation entirely. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs/spl#174
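(Again purely illustrative: a small sketch of the per-file variant described above, with the counter kept in a hypothetical per-file structure instead of TSD. It is not the real znode/ZIL code; the types and field names are made up to show the idea.)

```c
/*
 * Sketch of the per-file rework: the fsync counter lives in the file's
 * in-memory node, so no thread-specific data is needed.  Hypothetical names.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct file_node {                     /* stands in for a znode-like structure */
    atomic_uint sync_cnt;              /* fsync calls currently in flight */
};

/* fsync(2) path: mark the file as having an fsync in flight. */
static void file_fsync_enter(struct file_node *fn)
{
    atomic_fetch_add(&fn->sync_cnt, 1);
}

static void file_fsync_exit(struct file_node *fn)
{
    atomic_fetch_sub(&fn->sync_cnt, 1);
}

/* Write-logging path: issue the record synchronously for fsync-heavy files. */
static bool log_record_sync(struct file_node *fn)
{
    return atomic_load(&fn->sync_cnt) != 0;
}

int main(void)
{
    struct file_node fn = { .sync_cnt = 0 };

    file_fsync_enter(&fn);
    printf("sync? %d\n", log_record_sync(&fn));   /* 1: fsync in flight */
    file_fsync_exit(&fn);
    printf("sync? %d\n", log_record_sync(&fn));   /* 0: back to async */
    return 0;
}
```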
Thanks for your continuing efforts, Brian. Here are the results of my tests.
Performance test: probing for regression in fsync(2) times compared to the plain test above
sata1 (raidz2, 8 sata 2.5" disks)
So it seems to me that yes, there is some performance improvement but it's not on par with the initial "plain" kernel without your patches. Here are the extracts:
plain
fsync(2) per file patch
your original patch (TSD code removed)
ssd1 (raidz2 4 ssd OCZ disks)
So SSD test - as good as always.
Stability testing
I'll run my original sysbench test, which was causing kernel panics in the first place, over the weekend and make sure that we haven't regressed in this aspect either. I'll report my results on Monday, 24th. Regards. Rumen Telbizov |
That's surprising, but that's why we run the tests. OK, so there's clearly more work to do here. Perhaps I will look into just fixing the TSD issue. |
Brian, Just wanted to report on the second part of the last test - stability. It seems like on this front we're still OK. I've been running my sysbench stability test over the weekend and the machine has been just fine - no kernel panics. Merry Christmas, Rumen Telbizov |
Hello, I can confirm that without that patch my 2 servers are crashing. Once, I managed to get a stack trace (included below). The problem is easy to reproduce: just start dbench with 300-400 processes and wait 15-20 minutes. Both servers are exactly the same:
System Hardware:
|
@telbizov @azaroth Thanks for the dbench reproducer. I was able to recreate the issue fairly easily with the right options.
1d6e830 Fix tsd_get/set() race with tsd_exit/destroy() |
This fix has been merged into master.
6ef94aa Fix tsd_get/set() race with tsd_exit/destroy() |
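(To make the class of bug named by this fix easier to picture, here is a generic, hypothetical C sketch of a lookup that drops its lock before the caller uses the entry, the kind of window a concurrent exit/destroy path can exploit. It is not the SPL implementation; every name here is invented.)

```c
/*
 * Generic illustration of a lookup/destroy race: the entry returned by
 * the lookup can be freed by another thread once the lock is dropped.
 */
#include <pthread.h>
#include <stdlib.h>

struct entry { int key; void *value; struct entry *next; };

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *head;

/* BROKEN pattern: the entry may be freed by destroy_all() after we unlock. */
static void *lookup_unsafe(int key)
{
    struct entry *e;

    pthread_mutex_lock(&hash_lock);
    for (e = head; e != NULL; e = e->next)
        if (e->key == key)
            break;
    pthread_mutex_unlock(&hash_lock);      /* race window opens here */
    return e ? e->value : NULL;            /* possible use-after-free */
}

/* Destroy path running on another thread. */
static void destroy_all(void)
{
    pthread_mutex_lock(&hash_lock);
    while (head != NULL) {
        struct entry *e = head;
        head = e->next;
        free(e);
    }
    pthread_mutex_unlock(&hash_lock);
}

int main(void)
{
    /* A fixed version would hold hash_lock (or take a reference on the
     * entry) across the final dereference, closing the window above. */
    (void)lookup_unsafe(42);
    destroy_all();
    return 0;
}
```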
@behlendorf Thanks for reproducing the issue yourself and, even better, fixing it. Unfortunately at this time I no longer have the test setup that I was playing with before and cannot reproduce it right now. Soon enough though I might get a similar machine and will be able to test the current master branch, which contains your fix. Once again, thank you for your efforts. Regards, Rumen Telbizov |
Hi, I've now seen this issue on 4 different servers. It seems to happen at random, with no pattern / frequency that I can spot - other than they all see the problem.
Using Ubuntu 11.04 and Ubuntu 12.04, PPA.
spl-dkms (2 0.6.0.71)
Tends to render the system "unhappy": it keeps running, although all ZFS volumes seem to die ...