
task txg_sync blocked for more than 120 seconds #2934

Closed
ioquatix opened this issue Nov 26, 2014 · 48 comments

Comments

@ioquatix

I've seen this issue a couple of times.

I'm not really sure what is causing it. It sometimes seems to occur during moderate, continuous IO. I'm not sure if the first line is relevant; I've included it just in case.

[20542.629415] perf interrupt took too long (2504 > 2495), lowering kernel.perf_event_max_sample_rate to 50100
[421587.664157] INFO: task txg_sync:582 blocked for more than 120 seconds.
[421587.664295]       Tainted: P           O   3.17.3-1-ARCH #1
[421587.664372] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[421587.664478] txg_sync        D 0000000000000000     0   582      2 0x00000000
[421587.664486]  ffff88021ad13b80 0000000000000046 ffff88021cbf64a0 00000000000145c0
[421587.664493]  ffff88021ad13fd8 00000000000145c0 ffff88021cbf6eb0 ffff88021cbf64a0
[421587.664499]  ffff88002777e108 0000000000000000 0000000000000046 0000000000000001
[421587.664505] Call Trace:
[421587.664519]  [<ffffffff8109b9b2>] ? default_wake_function+0x12/0x20
[421587.664528]  [<ffffffff810af265>] ? __wake_up_common+0x55/0x90
[421587.664535]  [<ffffffff815392c9>] schedule+0x29/0x70
[421587.664541]  [<ffffffff815395a4>] io_schedule+0x94/0xf0
[421587.664565]  [<ffffffffa0201c5c>] cv_wait_common+0xac/0x1b0 [spl]
[421587.664572]  [<ffffffff810af950>] ? __wake_up_sync+0x20/0x20
[421587.664584]  [<ffffffffa0201db8>] __cv_wait_io+0x18/0x20 [spl]
[421587.664620]  [<ffffffffa032156b>] zio_wait+0x11b/0x1f0 [zfs]
[421587.664655]  [<ffffffffa02aad11>] dsl_pool_sync+0xc1/0x470 [zfs]
[421587.664692]  [<ffffffffa02c5970>] spa_sync+0x470/0xb90 [zfs]
[421587.664700]  [<ffffffff8101fa46>] ? ___preempt_schedule+0x56/0xb0
[421587.664737]  [<ffffffffa02d6d5a>] txg_sync_thread+0x3da/0x680 [zfs]
[421587.664775]  [<ffffffffa02d6980>] ? txg_delay+0x150/0x150 [zfs]
[421587.664787]  [<ffffffffa01f9cea>] thread_generic_wrapper+0x7a/0x90 [spl]
[421587.664798]  [<ffffffffa01f9c70>] ? __thread_exit+0xa0/0xa0 [spl]
[421587.664804]  [<ffffffff8108e2da>] kthread+0xea/0x100
[421587.664810]  [<ffffffff8108e1f0>] ? kthread_create_on_node+0x1b0/0x1b0
[421587.664817]  [<ffffffff8153d83c>] ret_from_fork+0x7c/0xb0
[421587.664822]  [<ffffffff8108e1f0>] ? kthread_create_on_node+0x1b0/0x1b0
[421827.748558] INFO: task txg_sync:582 blocked for more than 120 seconds.
[421827.748715]       Tainted: P           O   3.17.3-1-ARCH #1
[421827.748812] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[421827.748945] txg_sync        D 0000000000000000     0   582      2 0x00000000
[421827.748956]  ffff88021ad13b80 0000000000000046 ffff88021cbf64a0 00000000000145c0
[421827.748964]  ffff88021ad13fd8 00000000000145c0 ffff88008dbed080 ffff88021cbf64a0
[421827.748971]  ffff88021a528ca8 0000000000000000 0000000000000046 0000000000000001
[421827.748979] Call Trace:
[421827.748996]  [<ffffffff8109b9b2>] ? default_wake_function+0x12/0x20
[421827.749006]  [<ffffffff810af265>] ? __wake_up_common+0x55/0x90
[421827.749016]  [<ffffffff815392c9>] schedule+0x29/0x70
[421827.749023]  [<ffffffff815395a4>] io_schedule+0x94/0xf0
[421827.749052]  [<ffffffffa0201c5c>] cv_wait_common+0xac/0x1b0 [spl]
[421827.749061]  [<ffffffff810af950>] ? __wake_up_sync+0x20/0x20
[421827.749076]  [<ffffffffa0201db8>] __cv_wait_io+0x18/0x20 [spl]
[421827.749120]  [<ffffffffa032156b>] zio_wait+0x11b/0x1f0 [zfs]
[421827.749163]  [<ffffffffa02aad11>] dsl_pool_sync+0xc1/0x470 [zfs]
[421827.749210]  [<ffffffffa02c5970>] spa_sync+0x470/0xb90 [zfs]
[421827.749258]  [<ffffffffa02d6d5a>] txg_sync_thread+0x3da/0x680 [zfs]
[421827.749306]  [<ffffffffa02d6980>] ? txg_delay+0x150/0x150 [zfs]
[421827.749321]  [<ffffffffa01f9cea>] thread_generic_wrapper+0x7a/0x90 [spl]
[421827.749334]  [<ffffffffa01f9c70>] ? __thread_exit+0xa0/0xa0 [spl]
[421827.749342]  [<ffffffff8108e2da>] kthread+0xea/0x100
[421827.749349]  [<ffffffff8108e1f0>] ? kthread_create_on_node+0x1b0/0x1b0
[421827.749358]  [<ffffffff8153d83c>] ret_from_fork+0x7c/0xb0
[421827.749365]  [<ffffffff8108e1f0>] ? kthread_create_on_node+0x1b0/0x1b0
hinoki% dmesg | grep SPL
[    7.786300] SPL: Loaded module v0.6.3-44_g46c9367
[   15.514547] SPL: using hostid 0x000a4700
hinoki% dmesg | grep ZFS
[    7.873330] ZFS: Loaded module v0.6.3-130_g0ec0724, ZFS pool version 5000, ZFS filesystem version 5
@inevity

inevity commented Nov 26, 2014

I also hit this issue, but we found that if we offline the one disk in the pool that is at 100% utilisation, the problem disappears.
We also got a core dump.
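
For anyone wanting to try the same workaround, a minimal sketch, assuming a redundant (mirror/raidz) pool named tank and a suspect device sdX (both placeholders; substitute your own):

# take the misbehaving disk offline; the pool stays available as long as
# the remaining vdev members provide redundancy
zpool offline tank sdX

# later, bring it back (or replace it) and check the pool state
zpool online tank sdX
zpool status tank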

@behlendorf
Contributor

This indicates that ZFS was waiting a long time for IO to complete. As @inevity observed, this can happen if you have a drive which hasn't failed yet but is behaving badly. You can look for this with iostat -mx.
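
For example, sampling over short intervals rather than relying on the since-boot averages tends to make a misbehaving drive stand out; a sketch (the device name is a placeholder, and smartctl assumes smartmontools is installed):

# six reports at 10-second intervals; watch await/w_await, svctm and %util
iostat -mx 10 6

# SMART attributes can also hint at a drive that is starting to fail
smartctl -a /dev/sdX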

@ioquatix
Author

ioquatix commented Dec 2, 2014

@behlendorf Thanks for this insight, I will monitor the disks today and see if anything odd shows up.

@ioquatix
Author

ioquatix commented Dec 3, 2014

Okay, so I've done zpool scrub tank to get some constant IO:

  pool: tank
 state: ONLINE
  scan: scrub in progress since Wed Dec  3 10:19:26 2014
    416G scanned out of 10.3T at 43.0M/s, 66h46m to go
    0 repaired, 3.96% done

This does seem a bit slower than I've seen before, but this isn't exactly a fast server. I don't see any errors in the dmesg log either.

Similarly, iostat doesn't show a high utilisation:

hinoki# iostat -mx
Linux 3.17.3-1-ARCH (hinoki)    03/12/14    _x86_64_    (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.36    0.00    1.21   15.72    0.00   80.72

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    7.27   13.42     0.25     0.23    47.23     0.25   12.02    5.90   15.34  10.35  21.42
sdb               0.00     0.00    8.24   13.10     0.25     0.22    45.84     0.24   11.45    5.21   15.38   9.85  21.03
sdd               0.00     0.00    7.53   13.10     0.25     0.22    46.88     0.25   11.95    5.87   15.44  10.29  21.22
sdc               0.00     0.00    6.90   13.42     0.24     0.23    47.83     0.27   13.09    6.79   16.33  11.14  22.63
sde               0.00     0.00    0.00    0.00     0.00     0.00    22.95     0.00    7.06    7.47    4.32   2.67   0.00
sdf               0.01     0.12    0.02    0.32     0.00     0.00    19.83     0.17  508.79    5.46  545.90  67.33   2.30

The drives are reasonable and should be able to read around 60-100 Mbytes/s, so I'm not sure whether this is expected behaviour or not.

I will try to generate higher throughput by backing up some large files, and see if I can reproduce the issue at higher utilisation.

If there is anything else I can do that is within my reach and will help, please let me know.

@behlendorf
Contributor

@ioquatix You're going to want to replace sdf. Notice that the average wait time for writes (w_await) for that device is a massive 545ms. From the iostat(1) man page:

              w_await
                     The average time (in milliseconds) for write requests
                     issued to the device to be served. This includes the time
                     spent by the requests in queue and the time spent
                     servicing them.
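
As a rough way to spot such outliers automatically, the extended iostat output can be filtered for devices whose w_await exceeds some threshold. A sketch only: the 100 ms cut-off is arbitrary, and column 12 is w_await in the iostat -mx layout pasted above (sysstat versions differ, so check the header first):

# print devices whose average write wait exceeds 100 ms
iostat -mx | awk 'NR > 6 && $12+0 > 100 { print $1, "w_await=" $12 "ms" }'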

@ioquatix
Author

ioquatix commented Dec 3, 2014

@behlendorf Thanks for that, I didn't know that.

sdf isn't involved with ZFS, but I suppose it's possible for a stall on sdf to lock up the IO subsystem? sdf is actually used as the boot drive. In any case, it's a bit of an odd hybrid drive: http://www.newegg.com/Product/Product.aspx?Item=N82E16822148837 - I've had a pretty bad experience with these drives in general. Perhaps the long wait time is normal? I don't know.

If you can confirm that you think this issue is related to sdf, we can go ahead and close it. Alternatively, I could try replacing sdf and report back in a few weeks on whether the issue pops up again.

@behlendorf
Contributor

If the drives share a common controller or bus then it's absolutely possible that the sdf device is still responsible.
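
One quick way to check whether two disks sit behind the same controller is to compare their sysfs device paths (sda and sdf here are just the devices from the iostat output above):

# the symlink targets show the PCI/host path each block device hangs off
ls -l /sys/block/sda /sys/block/sdf

# or, if lsscsi is installed, list the host:channel:target:lun per disk
lsscsi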

@ioquatix
Author

ioquatix commented Dec 3, 2014

@behlendorf You have been most helpful. I do believe they are on the same bus. I will rip out that drive and perform a ritual sacrifice with it, to bless the new drive :) I'll report back.

@inevity

inevity commented Dec 9, 2014

@behlendorf, since upgrading to 0.6.3 we have suffered this issue on three machines.
Only high utilisation leads to the hang.
Since we have created a RAID-1 mirror, is there a way to handle this situation without hanging, or should we just attribute the issue to the bad disk?

Below are sar stats at a 10 minute interval. At 04:20:01 PM the machine hung and panic-rebooted.
12:04:01 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
10:50:01 AM dev8-16 56.02 411.58 7856.07 147.58 0.61 10.88 6.70 37.52
11:00:01 AM dev8-16 93.20 397.66 17028.24 186.98 1.71 18.34 9.54 88.93
11:10:01 AM dev8-16 82.41 397.60 16959.02 210.62 2.02 24.53 11.05 91.05
11:20:01 AM dev8-16 71.90 411.93 12283.86 176.58 1.35 18.79 10.95 78.70
11:30:01 AM dev8-16 79.39 397.87 14343.63 185.67 1.42 17.84 11.04 87.61
11:40:01 AM dev8-16 90.15 478.25 17497.40 199.41 1.59 17.64 10.38 93.59
11:50:01 AM dev8-16 105.33 679.31 21987.18 215.19 2.16 20.56 8.67 91.28
12:00:01 PM dev8-16 110.20 890.97 20680.89 195.74 1.89 17.10 9.50 104.71
12:10:01 PM dev8-16 135.05 778.02 30192.45 229.33 2.74 20.26 9.36 126.41
12:20:01 PM dev8-16 155.13 783.66 29624.86 196.02 4.36 28.09 8.04 124.72
12:30:01 PM dev8-16 183.86 756.37 36029.53 200.08 2.39 13.02 6.89 126.67
12:40:01 PM dev8-16 153.73 1093.49 26868.86 181.90 2.00 12.99 7.46 114.69
12:50:01 PM dev8-16 94.07 1270.50 14318.81 165.72 1.86 19.75 10.23 96.22
01:00:01 PM dev8-16 80.18 528.60 11708.12 152.62 2.10 26.03 12.26 98.32
01:10:01 PM dev8-16 117.04 905.46 17570.03 157.85 2.87 24.66 9.05 105.90
01:20:01 PM dev8-16 101.65 1061.43 15399.91 161.94 1.33 13.04 8.90 90.48
01:30:01 PM dev8-16 114.39 496.99 18776.93 168.49 1.33 11.62 9.24 105.72
01:40:01 PM dev8-16 103.08 1268.19 16400.93 171.41 1.82 17.63 10.10 104.09
01:50:01 PM dev8-16 107.46 690.11 18151.35 175.33 1.59 14.83 9.35 100.47
02:00:01 PM dev8-16 119.45 934.10 19723.72 172.94 3.54 29.70 9.92 118.53
02:10:01 PM dev8-16 122.19 731.50 24481.29 206.34 1.86 15.21 10.13 123.80
02:20:01 PM dev8-16 127.00 827.35 26575.92 215.77 2.16 17.02 10.06 127.80
02:30:01 PM dev8-16 133.85 766.00 31880.89 243.91 2.27 16.94 10.10 135.22
02:40:01 PM dev8-16 119.05 1205.18 23588.52 208.26 2.06 17.29 10.50 125.05
02:50:01 PM dev8-16 115.03 656.91 23416.80 209.28 2.68 23.29 10.29 118.39
03:00:01 PM dev8-16 88.90 780.27 15853.98 187.12 1.57 17.72 11.99 106.55
03:10:01 PM dev8-16 92.72 741.05 15856.22 179.01 4.43 47.80 12.21 113.22
03:20:01 PM dev8-16 136.65 1208.51 26437.38 202.31 1.77 12.98 8.59 117.43
03:30:01 PM dev8-16 133.16 625.53 26582.39 204.33 1.42 10.66 8.79 117.04
03:40:01 PM dev8-16 125.59 1056.71 26449.32 219.02 2.02 16.10 10.17 127.67
03:50:01 PM dev8-16 114.59 865.78 23915.81 216.26 4.67 40.78 10.58 121.29
04:00:01 PM dev8-16 151.64 984.38 33029.32 224.31 2.08 13.75 8.69 131.77
04:10:01 PM dev8-16 151.57 1451.64 31048.61 214.42 2.57 16.99 8.40 127.39
04:20:01 PM dev8-16 96.01 1272.85 25299.97 276.76 5.23 53.77 14.00 134.45

@behlendorf
Contributor

@inevity It's also possible the deadlock you're seeing is related to #2523, for which a fix was recently proposed and is being tested. But it's impossible to say for certain with just the stacks in this bug.

@kernelOfTruth
Contributor

@behlendorf I'm pretty sure I saw these error messages with a drive that was behaving somewhat badly; a few weeks (or 1-2 months) later it wound up showing thousands of defective/reallocated sectors and then became unusable (continually producing errors, taking ages to do anything).

So that coincides with what @inevity observed.

I remember a discussion about how the problem with failing drives was the timeout and how Linux handles it compared to the BSD operating systems, but I couldn't find it again after some searching.

Perhaps it will turn up later; if so, I'll post a reference here.

@ioquatix
Author

So, I've replaced that drive with one of a completely different brand that is known to be (reasonably) good, and the await time is still a bit high.

I've tried three different ports on two controllers. For some reason I haven't seen the await time get less than about 150ms.

This is a completely different drive, so I'm just wondering whether that's normal for a boot drive with a bunch of requests queued against it.

@ioquatix
Author

By the way, I haven't tested ZFS yet, but I'll see if it makes any difference and report back.

@ioquatix
Author

Hmm, okay. So, I've been running iostat -x 60 10 and at the same time did a system upgrade (packages, mostly on sdf) and a backup via rsync (all to the ZFS backup partition). Here is a typical dump:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.23    0.00   14.41   64.80    0.00   17.56

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdd               0.00     0.00   37.75   16.92   362.67   740.13    40.35     0.81   14.88   12.04   21.22   9.55  52.20
sdb               0.00     0.00   37.70   16.95   366.13   740.40    40.50     0.77   14.16   11.44   20.19   9.40  51.35
sdc               0.05     0.00   38.00   16.92   370.00   740.13    40.43     0.92   16.81   13.00   25.37  10.51  57.73
sda               0.05     0.00   37.68   16.95   368.27   740.40    40.59     0.79   14.45   11.71   20.56   9.48  51.82
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf               0.00     0.43    0.00    0.53     0.00    22.93    86.00     0.02   30.31    0.00   30.31  20.84   1.11

Mostly looks okay IMHO.

A few minutes into the rsync backup:

[ 1800.885411] INFO: task txg_sync:580 blocked for more than 120 seconds.
[ 1800.885535]       Tainted: P           O   3.18.6-1-ARCH #1
[ 1800.885610] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1800.885713] txg_sync        D 0000000000000000     0   580      2 0x00000000
[ 1800.885720]  ffff88021bd5bb18 0000000000000046 ffff88021d6d3c60 0000000000013640
[ 1800.885725]  ffff88021bd5bfd8 0000000000013640 ffff8800ccf41420 ffff88021d6d3c60
[ 1800.885729]  ffff88022096f2b0 ffff88022096f298 0000000000000000 0000000000000003
[ 1800.885734] Call Trace:
[ 1800.885746]  [<ffffffff8109e972>] ? default_wake_function+0x12/0x20
[ 1800.885752]  [<ffffffff810b2785>] ? __wake_up_common+0x55/0x90
[ 1800.885758]  [<ffffffff810b29e8>] ? __wake_up+0x48/0x60
[ 1800.885763]  [<ffffffff81551599>] schedule+0x29/0x70
[ 1800.885768]  [<ffffffff81551878>] io_schedule+0x98/0x100
[ 1800.885787]  [<ffffffffa01d2b45>] __cv_broadcast+0xe5/0x160 [spl]
[ 1800.885792]  [<ffffffff810b2ea0>] ? __wake_up_sync+0x20/0x20
[ 1800.885800]  [<ffffffffa01d2c18>] __cv_wait_io+0x18/0x180 [spl]
[ 1800.885819]  [<ffffffffa030a68b>] zio_wait+0x11b/0x200 [zfs]
[ 1800.885841]  [<ffffffffa0291c71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 1800.885848]  [<ffffffffa01cd0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 1800.885872]  [<ffffffffa02acb60>] spa_sync+0x480/0xbd0 [zfs]
[ 1800.885878]  [<ffffffff810b2eb6>] ? autoremove_wake_function+0x16/0x40
[ 1800.885902]  [<ffffffffa02be30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 1800.885925]  [<ffffffffa02bdf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 1800.885932]  [<ffffffffa01ce07a>] __thread_exit+0x9a/0xb0 [spl]
[ 1800.885939]  [<ffffffffa01ce000>] ? __thread_exit+0x20/0xb0 [spl]
[ 1800.885944]  [<ffffffff81090e6a>] kthread+0xea/0x100
[ 1800.885949]  [<ffffffff81090d80>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1800.885955]  [<ffffffff8155563c>] ret_from_fork+0x7c/0xb0
[ 1800.885959]  [<ffffffff81090d80>] ? kthread_create_on_node+0x1c0/0x1c0

@ioquatix
Author

iotop:

Total DISK READ :     328.59 K/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:     328.59 K/s | Actual DISK WRITE:       4.65 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                           
  507 be/7 root      328.59 K/s    0.00 B/s  0.00 % 88.83 % [z_wr_iss/0]

Sometimes the speed drops for a brief period, then goes back up. I'm not sure if this coincides with the dmesg output, but since the last message I've got the following:

[ 2040.970184] INFO: task txg_sync:580 blocked for more than 120 seconds.
[ 2040.970338]       Tainted: P           O   3.18.6-1-ARCH #1
[ 2040.970458] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2040.970626] txg_sync        D 0000000000000000     0   580      2 0x00000000
[ 2040.970636]  ffff88021bd5bb18 0000000000000046 ffff88021d6d3c60 0000000000013640
[ 2040.970644]  ffff88021bd5bfd8 0000000000013640 ffff8800ccfb1420 ffff88021d6d3c60
[ 2040.970651]  ffff88006c4d8838 ffffffff00000001 ffff88006c4d87f0 00000000b2e50d18
[ 2040.970658] Call Trace:
[ 2040.970676]  [<ffffffff810b29e8>] ? __wake_up+0x48/0x60
[ 2040.970686]  [<ffffffff81551599>] schedule+0x29/0x70
[ 2040.970694]  [<ffffffff81551878>] io_schedule+0x98/0x100
[ 2040.970722]  [<ffffffffa01d2b45>] __cv_broadcast+0xe5/0x160 [spl]
[ 2040.970731]  [<ffffffff810b2ea0>] ? __wake_up_sync+0x20/0x20
[ 2040.970743]  [<ffffffffa01d2c18>] __cv_wait_io+0x18/0x180 [spl]
[ 2040.970771]  [<ffffffffa030a68b>] zio_wait+0x11b/0x200 [zfs]
[ 2040.970807]  [<ffffffffa0291c71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 2040.970819]  [<ffffffffa01cd0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 2040.970857]  [<ffffffffa02acb60>] spa_sync+0x480/0xbd0 [zfs]
[ 2040.970866]  [<ffffffff810b2eb6>] ? autoremove_wake_function+0x16/0x40
[ 2040.970904]  [<ffffffffa02be30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 2040.970942]  [<ffffffffa02bdf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 2040.970953]  [<ffffffffa01ce07a>] __thread_exit+0x9a/0xb0 [spl]
[ 2040.970964]  [<ffffffffa01ce000>] ? __thread_exit+0x20/0xb0 [spl]
[ 2040.970973]  [<ffffffff81090e6a>] kthread+0xea/0x100
[ 2040.970980]  [<ffffffff81090d80>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2040.970989]  [<ffffffff8155563c>] ret_from_fork+0x7c/0xb0
[ 2040.970995]  [<ffffffff81090d80>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2161.009962] INFO: task txg_sync:580 blocked for more than 120 seconds.
[ 2161.010116]       Tainted: P           O   3.18.6-1-ARCH #1
[ 2161.010236] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2161.010403] txg_sync        D 0000000000000000     0   580      2 0x00000000
[ 2161.010413]  ffff88021bd5bb18 0000000000000046 ffff88021d6d3c60 0000000000013640
[ 2161.010421]  ffff88021bd5bfd8 0000000000013640 ffff8800ccfb1420 ffff88021d6d3c60
[ 2161.010428]  ffff88006c4d8838 ffffffff00000001 ffff88006c4d87f0 00000000b2e50d18
[ 2161.010435] Call Trace:
[ 2161.010453]  [<ffffffff810b29e8>] ? __wake_up+0x48/0x60
[ 2161.010463]  [<ffffffff81551599>] schedule+0x29/0x70
[ 2161.010471]  [<ffffffff81551878>] io_schedule+0x98/0x100
[ 2161.010499]  [<ffffffffa01d2b45>] __cv_broadcast+0xe5/0x160 [spl]
[ 2161.010508]  [<ffffffff810b2ea0>] ? __wake_up_sync+0x20/0x20
[ 2161.010520]  [<ffffffffa01d2c18>] __cv_wait_io+0x18/0x180 [spl]
[ 2161.010549]  [<ffffffffa030a68b>] zio_wait+0x11b/0x200 [zfs]
[ 2161.010585]  [<ffffffffa0291c71>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 2161.010597]  [<ffffffffa01cd0af>] ? spl_kmem_cache_free+0x10f/0x4a0 [spl]
[ 2161.010636]  [<ffffffffa02acb60>] spa_sync+0x480/0xbd0 [zfs]
[ 2161.010644]  [<ffffffff810b2eb6>] ? autoremove_wake_function+0x16/0x40
[ 2161.010684]  [<ffffffffa02be30c>] txg_delay+0x4ec/0xa60 [zfs]
[ 2161.010721]  [<ffffffffa02bdf90>] ? txg_delay+0x170/0xa60 [zfs]
[ 2161.010733]  [<ffffffffa01ce07a>] __thread_exit+0x9a/0xb0 [spl]
[ 2161.010744]  [<ffffffffa01ce000>] ? __thread_exit+0x20/0xb0 [spl]
[ 2161.010752]  [<ffffffff81090e6a>] kthread+0xea/0x100
[ 2161.010759]  [<ffffffff81090d80>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2161.010768]  [<ffffffff8155563c>] ret_from_fork+0x7c/0xb0
[ 2161.010775]  [<ffffffff81090d80>] ? kthread_create_on_node+0x1c0/0x1c0

@ioquatix
Author

# uname -a
Linux hinoki 3.19.3-3-ARCH #1 SMP PREEMPT Wed Apr 8 14:10:00 CEST 2015 x86_64 GNU/Linux

# pacman -Q | grep -E "zfs|spl"
spl-git 0.6.4_r0_gcd69f02_3.19.3_3-1
spl-utils-git 0.6.4_r0_gcd69f02_3.19.3_3-1
zfs-git 0.6.4_r0_gd07a163_3.19.3_3-1
zfs-utils-git 0.6.4_r0_gd07a163_3.19.3_3-1
[ 2521.049969] INFO: task txg_sync:762 blocked for more than 120 seconds.
[ 2521.050108]       Tainted: P           O   3.19.3-3-ARCH #1
[ 2521.050184] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2521.050290] txg_sync        D ffff8800cd7b3b08     0   762      2 0x00000000
[ 2521.050299]  ffff8800cd7b3b08 ffff8802214589f0 0000000000013f00 ffff8800cd7b3fd8
[ 2521.050306]  0000000000013f00 ffff88021eae89f0 ffff8802214589f0 ffff88022510a4b0
[ 2521.050312]  ffff88022510a498 0000000000000000 0000000000000003 ffff8800cd7b3a58
[ 2521.050318] Call Trace:
[ 2521.050335]  [<ffffffff8109f652>] ? default_wake_function+0x12/0x20
[ 2521.050344]  [<ffffffff810b3c95>] ? __wake_up_common+0x55/0x90
[ 2521.050351]  [<ffffffff810b3ef8>] ? __wake_up+0x48/0x60
[ 2521.050359]  [<ffffffff8155eb59>] schedule+0x29/0x70
[ 2521.050365]  [<ffffffff8155ee1e>] io_schedule+0x8e/0xd0
[ 2521.050391]  [<ffffffffa01e508e>] cv_wait_common+0xae/0x140 [spl]
[ 2521.050397]  [<ffffffff810b4460>] ? wait_woken+0x90/0x90
[ 2521.050408]  [<ffffffffa01e5178>] __cv_wait_io+0x18/0x20 [spl]
[ 2521.050445]  [<ffffffffa0318943>] zio_wait+0x123/0x210 [zfs]
[ 2521.050481]  [<ffffffffa029fbe1>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 2521.050520]  [<ffffffffa02baf80>] spa_sync+0x480/0xbf0 [zfs]
[ 2521.050526]  [<ffffffff810b4476>] ? autoremove_wake_function+0x16/0x40
[ 2521.050564]  [<ffffffffa02cce06>] txg_sync_thread+0x386/0x630 [zfs]
[ 2521.050602]  [<ffffffffa02cca80>] ? txg_quiesce_thread+0x3a0/0x3a0 [zfs]
[ 2521.050614]  [<ffffffffa01e0561>] thread_generic_wrapper+0x71/0x80 [spl]
[ 2521.050624]  [<ffffffffa01e04f0>] ? __thread_exit+0x20/0x20 [spl]
[ 2521.050631]  [<ffffffff81091828>] kthread+0xd8/0xf0
[ 2521.050637]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2521.050644]  [<ffffffff81562918>] ret_from_fork+0x58/0x90
[ 2521.050649]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2761.132245] INFO: task txg_sync:762 blocked for more than 120 seconds.
[ 2761.132355]       Tainted: P           O   3.19.3-3-ARCH #1
[ 2761.132406] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2761.132477] txg_sync        D ffff8800cd7b3b08     0   762      2 0x00000000
[ 2761.132483]  ffff8800cd7b3b08 ffff8802214589f0 0000000000013f00 ffff8800cd7b3fd8
[ 2761.132488]  0000000000013f00 ffffffff81818540 ffff8802214589f0 ffff88022510a4b0
[ 2761.132492]  ffff88022510a498 0000000000000000 0000000000000003 ffff8800cd7b3a58
[ 2761.132496] Call Trace:
[ 2761.132509]  [<ffffffff8109f652>] ? default_wake_function+0x12/0x20
[ 2761.132515]  [<ffffffff810b3c95>] ? __wake_up_common+0x55/0x90
[ 2761.132520]  [<ffffffff810b3ef8>] ? __wake_up+0x48/0x60
[ 2761.132525]  [<ffffffff8155eb59>] schedule+0x29/0x70
[ 2761.132529]  [<ffffffff8155ee1e>] io_schedule+0x8e/0xd0
[ 2761.132549]  [<ffffffffa01e508e>] cv_wait_common+0xae/0x140 [spl]
[ 2761.132553]  [<ffffffff810b4460>] ? wait_woken+0x90/0x90
[ 2761.132561]  [<ffffffffa01e5178>] __cv_wait_io+0x18/0x20 [spl]
[ 2761.132587]  [<ffffffffa0318943>] zio_wait+0x123/0x210 [zfs]
[ 2761.132611]  [<ffffffffa029fbe1>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 2761.132637]  [<ffffffffa02baf80>] spa_sync+0x480/0xbf0 [zfs]
[ 2761.132641]  [<ffffffff810b4476>] ? autoremove_wake_function+0x16/0x40
[ 2761.132666]  [<ffffffffa02cce06>] txg_sync_thread+0x386/0x630 [zfs]
[ 2761.132692]  [<ffffffffa02cca80>] ? txg_quiesce_thread+0x3a0/0x3a0 [zfs]
[ 2761.132700]  [<ffffffffa01e0561>] thread_generic_wrapper+0x71/0x80 [spl]
[ 2761.132707]  [<ffffffffa01e04f0>] ? __thread_exit+0x20/0x20 [spl]
[ 2761.132712]  [<ffffffff81091828>] kthread+0xd8/0xf0
[ 2761.132716]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2761.132721]  [<ffffffff81562918>] ret_from_fork+0x58/0x90
[ 2761.132725]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3361.340538] INFO: task txg_sync:762 blocked for more than 120 seconds.
[ 3361.340759]       Tainted: P           O   3.19.3-3-ARCH #1
[ 3361.340856] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3361.340988] txg_sync        D ffff8800cd7b3b08     0   762      2 0x00000000
[ 3361.340999]  ffff8800cd7b3b08 ffff8802214589f0 0000000000013f00 ffff8800cd7b3fd8
[ 3361.341008]  0000000000013f00 ffff880225c7cf80 ffff8802214589f0 ffff88022510a4b0
[ 3361.341015]  ffff88022510a498 0000000000000000 0000000000000003 ffff8800cd7b3a58
[ 3361.341022] Call Trace:
[ 3361.341042]  [<ffffffff8109f652>] ? default_wake_function+0x12/0x20
[ 3361.341053]  [<ffffffff810b3c95>] ? __wake_up_common+0x55/0x90
[ 3361.341062]  [<ffffffff810b3ef8>] ? __wake_up+0x48/0x60
[ 3361.341072]  [<ffffffff8155eb59>] schedule+0x29/0x70
[ 3361.341079]  [<ffffffff8155ee1e>] io_schedule+0x8e/0xd0
[ 3361.341109]  [<ffffffffa01e508e>] cv_wait_common+0xae/0x140 [spl]
[ 3361.341116]  [<ffffffff810b4460>] ? wait_woken+0x90/0x90
[ 3361.341131]  [<ffffffffa01e5178>] __cv_wait_io+0x18/0x20 [spl]
[ 3361.341177]  [<ffffffffa0318943>] zio_wait+0x123/0x210 [zfs]
[ 3361.341222]  [<ffffffffa029fbe1>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 3361.341270]  [<ffffffffa02baf80>] spa_sync+0x480/0xbf0 [zfs]
[ 3361.341278]  [<ffffffff810b4476>] ? autoremove_wake_function+0x16/0x40
[ 3361.341353]  [<ffffffffa02cce06>] txg_sync_thread+0x386/0x630 [zfs]
[ 3361.341401]  [<ffffffffa02cca80>] ? txg_quiesce_thread+0x3a0/0x3a0 [zfs]
[ 3361.341416]  [<ffffffffa01e0561>] thread_generic_wrapper+0x71/0x80 [spl]
[ 3361.341428]  [<ffffffffa01e04f0>] ? __thread_exit+0x20/0x20 [spl]
[ 3361.341437]  [<ffffffff81091828>] kthread+0xd8/0xf0
[ 3361.341445]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3361.341453]  [<ffffffff81562918>] ret_from_fork+0x58/0x90
[ 3361.341460]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3481.382563] INFO: task txg_sync:762 blocked for more than 120 seconds.
[ 3481.382718]       Tainted: P           O   3.19.3-3-ARCH #1
[ 3481.382813] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3481.382945] txg_sync        D ffff8800cd7b3b08     0   762      2 0x00000000
[ 3481.382956]  ffff8800cd7b3b08 ffff8802214589f0 0000000000013f00 ffff8800cd7b3fd8
[ 3481.382964]  0000000000013f00 ffff880225c7cf80 ffff8802214589f0 ffff88022510a4b0
[ 3481.382971]  ffff88022510a498 0000000000000000 0000000000000003 ffff8800cd7b3a58
[ 3481.382979] Call Trace:
[ 3481.382998]  [<ffffffff8109f652>] ? default_wake_function+0x12/0x20
[ 3481.383009]  [<ffffffff810b3c95>] ? __wake_up_common+0x55/0x90
[ 3481.383018]  [<ffffffff810b3ef8>] ? __wake_up+0x48/0x60
[ 3481.383028]  [<ffffffff8155eb59>] schedule+0x29/0x70
[ 3481.383034]  [<ffffffff8155ee1e>] io_schedule+0x8e/0xd0
[ 3481.383064]  [<ffffffffa01e508e>] cv_wait_common+0xae/0x140 [spl]
[ 3481.383071]  [<ffffffff810b4460>] ? wait_woken+0x90/0x90
[ 3481.383086]  [<ffffffffa01e5178>] __cv_wait_io+0x18/0x20 [spl]
[ 3481.383130]  [<ffffffffa0318943>] zio_wait+0x123/0x210 [zfs]
[ 3481.383175]  [<ffffffffa029fbe1>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 3481.383223]  [<ffffffffa02baf80>] spa_sync+0x480/0xbf0 [zfs]
[ 3481.383230]  [<ffffffff810b4476>] ? autoremove_wake_function+0x16/0x40
[ 3481.383279]  [<ffffffffa02cce06>] txg_sync_thread+0x386/0x630 [zfs]
[ 3481.383327]  [<ffffffffa02cca80>] ? txg_quiesce_thread+0x3a0/0x3a0 [zfs]
[ 3481.383341]  [<ffffffffa01e0561>] thread_generic_wrapper+0x71/0x80 [spl]
[ 3481.383354]  [<ffffffffa01e04f0>] ? __thread_exit+0x20/0x20 [spl]
[ 3481.383362]  [<ffffffff81091828>] kthread+0xd8/0xf0
[ 3481.383370]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3481.383378]  [<ffffffff81562918>] ret_from_fork+0x58/0x90
[ 3481.383386]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3601.424836] INFO: task txg_sync:762 blocked for more than 120 seconds.
[ 3601.424997]       Tainted: P           O   3.19.3-3-ARCH #1
[ 3601.425093] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3601.425225] txg_sync        D ffff8800cd7b3b08     0   762      2 0x00000000
[ 3601.425236]  ffff8800cd7b3b08 ffff8802214589f0 0000000000013f00 ffff8800cd7b3fd8
[ 3601.425245]  0000000000013f00 ffff880225c7cf80 ffff8802214589f0 ffff88022510a4b0
[ 3601.425252]  ffff88022510a498 0000000000000000 0000000000000003 ffff8800cd7b3a58
[ 3601.425259] Call Trace:
[ 3601.425279]  [<ffffffff8109f652>] ? default_wake_function+0x12/0x20
[ 3601.425290]  [<ffffffff810b3c95>] ? __wake_up_common+0x55/0x90
[ 3601.425299]  [<ffffffff810b3ef8>] ? __wake_up+0x48/0x60
[ 3601.425309]  [<ffffffff8155eb59>] schedule+0x29/0x70
[ 3601.425316]  [<ffffffff8155ee1e>] io_schedule+0x8e/0xd0
[ 3601.425346]  [<ffffffffa01e508e>] cv_wait_common+0xae/0x140 [spl]
[ 3601.425352]  [<ffffffff810b4460>] ? wait_woken+0x90/0x90
[ 3601.425367]  [<ffffffffa01e5178>] __cv_wait_io+0x18/0x20 [spl]
[ 3601.425411]  [<ffffffffa0318943>] zio_wait+0x123/0x210 [zfs]
[ 3601.425456]  [<ffffffffa029fbe1>] dsl_pool_sync+0xc1/0x480 [zfs]
[ 3601.425505]  [<ffffffffa02baf80>] spa_sync+0x480/0xbf0 [zfs]
[ 3601.425512]  [<ffffffff810b4476>] ? autoremove_wake_function+0x16/0x40
[ 3601.425560]  [<ffffffffa02cce06>] txg_sync_thread+0x386/0x630 [zfs]
[ 3601.425608]  [<ffffffffa02cca80>] ? txg_quiesce_thread+0x3a0/0x3a0 [zfs]
[ 3601.425622]  [<ffffffffa01e0561>] thread_generic_wrapper+0x71/0x80 [spl]
[ 3601.425635]  [<ffffffffa01e04f0>] ? __thread_exit+0x20/0x20 [spl]
[ 3601.425643]  [<ffffffff81091828>] kthread+0xd8/0xf0
[ 3601.425651]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3601.425660]  [<ffffffff81562918>] ret_from_fork+0x58/0x90
[ 3601.425667]  [<ffffffff81091750>] ? kthread_create_on_node+0x1c0/0x1c0

@voidzero

I just had a similar problem. My kernel has grsecurity and rsync seemed to stall while I could hear a lot of disk IO. But then I tried sysctl kernel.pax.softmode=1 and bam, rsync transferred data at a high speed again. I know grsecurity is not supported by the zfs team, but for those who are also using it, it might be a useful tip anyway.

@discostur

Hello,

I'm having the same issue. I'm running ZFS 0.6.5 on CentOS 7.1. After rebooting the system I'm trying to import the pool. The pool import works:

  pool: dpool01
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 5h16m with 0 errors on Fri Aug 14 17:45:52 2015
config:
    NAME                        STATE     READ WRITE CKSUM
    dpool01                     ONLINE       0     0     0
      mirror-0                  ONLINE       0     0     0
        scsi-350014ee30012ec8c  ONLINE       0     0     0
        scsi-350014ee355683358  ONLINE       0     0     0
errors: No known data errors

Mounting the filesystem has been running for 2 hours now and is still processing. Mounting the pool read-only works without problems.

[Mi Sep 16 13:58:28 2015] INFO: task txg_sync:4015 blocked for more than 120 seconds.
[Mi Sep 16 13:58:28 2015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mi Sep 16 13:58:28 2015] txg_sync        D ffff880ec5c13680     0  4015      2 0x00000080
[Mi Sep 16 13:58:28 2015]  ffff8800d9897b90 0000000000000046 ffff880e5349cfa0 ffff8800d9897fd8
[Mi Sep 16 13:58:28 2015]  ffff8800d9897fd8 ffff8800d9897fd8 ffff880e5349cfa0 ffff880ec5c13f48
[Mi Sep 16 13:58:28 2015]  ffff880c42bccd88 ffff880c42bccdc8 ffff880c42bccdb0 0000000000000001
[Mi Sep 16 13:58:28 2015] Call Trace:
[Mi Sep 16 13:58:28 2015]  [] io_schedule+0x9d/0x130
[Mi Sep 16 13:58:28 2015]  [] cv_wait_common+0xae/0x150 [spl]
[Mi Sep 16 13:58:28 2015]  [] ? wake_up_bit+0x30/0x30
[Mi Sep 16 13:58:28 2015]  [] __cv_wait_io+0x18/0x20 [spl]
[Mi Sep 16 13:58:28 2015]  [] zio_wait+0x123/0x210 [zfs]
[Mi Sep 16 13:58:28 2015]  [] dsl_pool_sync+0xbf/0x420 [zfs]
[Mi Sep 16 13:58:28 2015]  [] spa_sync+0x388/0xb70 [zfs]
[Mi Sep 16 13:58:28 2015]  [] ? autoremove_wake_function+0x2b/0x40
[Mi Sep 16 13:58:28 2015]  [] txg_sync_thread+0x3cc/0x640 [zfs]
[Mi Sep 16 13:58:28 2015]  [] ? txg_fini+0x2a0/0x2a0 [zfs]
[Mi Sep 16 13:58:28 2015]  [] thread_generic_wrapper+0x71/0x80 [spl]
[Mi Sep 16 13:58:28 2015]  [] ? __thread_exit+0x20/0x20 [spl]
[Mi Sep 16 13:58:28 2015]  [] kthread+0xcf/0xe0
[Mi Sep 16 13:58:28 2015]  [] ? kthread_create_on_node+0x140/0x140
[Mi Sep 16 13:58:28 2015]  [] ret_from_fork+0x58/0x90
[Mi Sep 16 13:58:28 2015]  [] ? kthread_create_on_node+0x140/0x140
[Mi Sep 16 13:58:28 2015] INFO: task mount.zfs:4464 blocked for more than 120 seconds.
[Mi Sep 16 13:58:28 2015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mi Sep 16 13:58:28 2015] mount.zfs       D ffff880ec5c13680     0  4464   4463 0x00000080
[Mi Sep 16 13:58:28 2015]  ffff880e3e8e3860 0000000000000086 ffff880e465816c0 ffff880e3e8e3fd8
[Mi Sep 16 13:58:28 2015]  ffff880e3e8e3fd8 ffff880e3e8e3fd8 ffff880e465816c0 ffff880e4540db68
[Mi Sep 16 13:58:28 2015]  ffff880e4540da20 ffff880e4540db70 ffff880e4540da48 0000000000000000
[Mi Sep 16 13:58:28 2015] Call Trace:
[Mi Sep 16 13:58:28 2015]  [] schedule+0x29/0x70
[Mi Sep 16 13:58:28 2015]  [] cv_wait_common+0x125/0x150 [spl]
[Mi Sep 16 13:58:28 2015]  [] ? wake_up_bit+0x30/0x30
[Mi Sep 16 13:58:28 2015]  [] __cv_wait+0x15/0x20 [spl]
[Mi Sep 16 13:58:28 2015]  [] txg_wait_open+0xc3/0x110 [zfs]
[Mi Sep 16 13:58:28 2015]  [] dmu_tx_wait+0x3a8/0x3c0 [zfs]
[Mi Sep 16 13:58:28 2015]  [] dmu_tx_assign+0x9a/0x510 [zfs]
[Mi Sep 16 13:58:28 2015]  [] zfs_rmnode+0x161/0x350 [zfs]
[Mi Sep 16 13:58:28 2015]  [] ? mutex_lock+0x12/0x2f
[Mi Sep 16 13:58:28 2015]  [] zfs_zinactive+0x168/0x180 [zfs]
[Mi Sep 16 13:58:28 2015]  [] zfs_inactive+0x67/0x240 [zfs]
[Mi Sep 16 13:58:28 2015]  [] ? truncate_pagecache+0x59/0x60
[Mi Sep 16 13:58:28 2015]  [] zpl_evict_inode+0x43/0x60 [zfs]
[Mi Sep 16 13:58:28 2015]  [] evict+0xa7/0x170
[Mi Sep 16 13:58:28 2015]  [] iput+0xf5/0x180
[Mi Sep 16 13:58:28 2015]  [] zfs_unlinked_drain+0xb5/0xf0 [zfs]
[Mi Sep 16 13:58:28 2015]  [] ? finish_wait+0x56/0x70
[Mi Sep 16 13:58:28 2015]  [] ? taskq_create+0x228/0x370 [spl]
[Mi Sep 16 13:58:28 2015]  [] ? wake_up_bit+0x30/0x30
[Mi Sep 16 13:58:28 2015]  [] ? zfs_get_done+0x70/0x70 [zfs]
[Mi Sep 16 13:58:28 2015]  [] ? zil_open+0x42/0x60 [zfs]
[Mi Sep 16 13:58:28 2015]  [] zfs_sb_setup+0x138/0x170 [zfs]
[Mi Sep 16 13:58:28 2015]  [] zfs_domount+0x2d3/0x360 [zfs]
[Mi Sep 16 13:58:28 2015]  [] ? zpl_kill_sb+0x20/0x20 [zfs]
[Mi Sep 16 13:58:28 2015]  [] zpl_fill_super+0x2c/0x40 [zfs]
[Mi Sep 16 13:58:28 2015]  [] mount_nodev+0x4d/0xb0
[Mi Sep 16 13:58:28 2015]  [] zpl_mount+0x52/0x80 [zfs]
[Mi Sep 16 13:58:28 2015]  [] mount_fs+0x39/0x1b0
[Mi Sep 16 13:58:28 2015]  [] vfs_kern_mount+0x5f/0xf0
[Mi Sep 16 13:58:28 2015]  [] do_mount+0x24e/0xa40
[Mi Sep 16 13:58:28 2015]  [] ? __get_free_pages+0xe/0x50
[Mi Sep 16 13:58:28 2015]  [] SyS_mount+0x96/0xf0
[Mi Sep 16 13:58:28 2015]  [] system_call_fastpath+0x16/0x1b

iotop shows the txg_sync process consuming >90% of IO:

  PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                                                                                                                                                          
 4015 be/4 root          0.00 B      0.00 B  0.00 % 92.78 % [txg_sync]
 4464 be/4 root       1391.75 M      0.00 B  0.00 %  0.22 % mount.zfs dpool01/d39c6e3cd45c7a712ffca0b94c16f86e /var/lib/lxc/d39c6e3cd45c7a712ffca0b94c16f86e -o rw,noatime,xattr,zfsutil

Greets
Kilian

@kernelOfTruth
Contributor

@discostur how full (or empty) is your pool?

@discostur

@kernelOfTruth

my pool is quite empty:

NAME USED AVAIL REFER MOUNTPOINT
dpool01 59,2G 840G 6,55G /dpool01

@chrisrd
Contributor

chrisrd commented Sep 17, 2015

@discostur The zfs_unlinked_drain in your stack trace identifies the likely issue: when mounting, ZFS will clean up files removed before the previous unmount but not yet completely removed (see module/zfs/zfs_vfsops.c:zfs_sb_setup() if interested). I've had systems sit in that state for many, many hours. There's nothing to do but wait for it to finish.
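
If you want some reassurance that the mount is progressing rather than deadlocked, about the best you can do is watch the IO of the mount.zfs and txg_sync processes, much like the iotop output @discostur pasted above; a sketch, assuming iotop is installed:

# batch mode, only processes currently doing IO, accumulated totals, 5 s refresh
iotop -b -o -a -d 5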

@behlendorf
Contributor

I believe the original issue here has been addressed so I'm closing this issue. Please open a new issue if a similar problem is reproducible with zfs-0.6.5.1 or newer.

@discostur @chrisrd the zfs_unlinked_drain issue is indeed different, and I've been hit by it myself fairly recently. Today there's nothing to do but wait; it will finish. That said, I created #3814 as a feature request to make this cleanup asynchronous. Technically it doesn't have to be done synchronously during the mount; with a little care it can be done asynchronously after the mount completes and the filesystem is available. That would be a nice improvement for the next release.

@ioquatix
Author

Just to let you know I've been seeing this issue as recently as 2-3 months ago. At that time I reduced the amount of memory available to ZFS significantly and I stopped seeing the issue. I can check again to see if it's still happening.

@behlendorf
Contributor

@ioquatix if you can verify with 0.6.5.1 that would be helpful, because it's my understanding that this has been resolved. It'd be good to know if I'm wrong about that.
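
For anyone unsure which version they are actually running, the loaded module versions can be checked directly; a sketch (assumes the standard spl/zfs module names):

# versions reported by the loaded kernel modules
dmesg | grep -E 'SPL|ZFS'
modinfo spl | grep -i version
modinfo zfs | grep -i version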

@lnxbil

lnxbil commented Nov 2, 2015

Hi,

I also experience this problem:

[    5.741656] SPL: Loaded module v0.6.5-351_gbe2a2c4
[    5.850469] ZFS: Loaded module v0.6.5.2-47_g7c033da, ZFS pool version 5000, ZFS filesystem version 5

and get the same "stuck" behavior:

[231633.797491] INFO: task txg_sync:37987 blocked for more than 120 seconds.
[231633.797547] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[231633.797593] txg_sync        D ffff881007c73640     0 37987      2 0x00000000
[231633.797599]  ffff8817f682fb08 0000000000000046 ffff8817ecb25010 ffff8817f682ffd8
[231633.797611]  ffff8817f682ffd8 ffff8817f682ffd8 ffff881ffc0d16e0 ffff8817ecb25010
[231633.797617]  ffff8817f682fb08 ffff8817ecb25010 ffff881007c73f08 ffff881dd7d5c0a8
[231633.797623] Call Trace:
[231633.797635]  [<ffffffff8163cd39>] schedule+0x29/0x70
[231633.797639]  [<ffffffff8163ce0f>] io_schedule+0x8f/0xd0
[231633.797672]  [<ffffffffa03716ba>] cv_wait_common+0xaa/0x130 [spl]
[231633.797677]  [<ffffffff81095260>] ? wake_up_bit+0x40/0x40
[231633.797688]  [<ffffffffa0371758>] __cv_wait_io+0x18/0x20 [spl]
[231633.797774]  [<ffffffffa0770083>] zio_wait+0x123/0x210 [zfs]
[231633.797813]  [<ffffffffa06f98ef>] dsl_pool_sync+0xaf/0x3f0 [zfs]
[231633.797852]  [<ffffffffa0713e4f>] spa_sync+0x39f/0xb30 [zfs]
[231633.797857]  [<ffffffff8109d9f5>] ? __wake_up_common+0x55/0x90
[231633.797863]  [<ffffffff8101b1d9>] ? read_tsc+0x9/0x10
[231633.797902]  [<ffffffffa07265ff>] txg_sync_thread+0x3ef/0x680 [zfs]
[231633.797909]  [<ffffffff81632f58>] ? __slab_free+0xff/0x250
[231633.797948]  [<ffffffffa0726210>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[231633.797957]  [<ffffffffa036c948>] thread_generic_wrapper+0x78/0x90 [spl]
[231633.797965]  [<ffffffffa036c8d0>] ? spl_vmem_fini+0x10/0x10 [spl]
[231633.797970]  [<ffffffff810948f0>] kthread+0xc0/0xd0
[231633.797974]  [<ffffffff81094830>] ? kthread_create_on_node+0x120/0x120
[231633.797978]  [<ffffffff816465d8>] ret_from_fork+0x58/0x90
[231633.797982]  [<ffffffff81094830>] ? kthread_create_on_node+0x120/0x120

@mailinglists35

Still getting this on 0.6.5.3. Please note that I'm not very tolerant: my kernel.hung_task_timeout_secs has been lowered from the default 120 seconds to 15 seconds.
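
For reference, a minimal sketch of lowering (or silencing) that threshold at runtime via sysctl:

# warn after 15 seconds instead of the default 120
sysctl -w kernel.hung_task_timeout_secs=15

# setting it to 0 disables the warning entirely, as the message itself notes
sysctl -w kernel.hung_task_timeout_secs=0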

[Thu Dec 10 02:06:33 2015] INFO: task txg_sync:2226 blocked for more than 15 seconds.
[Thu Dec 10 02:06:33 2015] Tainted: P O 3.16.0-4-amd64 #1
[Thu Dec 10 02:06:33 2015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Dec 10 02:06:33 2015] txg_sync D ffff8800d9fce4e8 0 2226 2 0x00000000
[Thu Dec 10 02:06:33 2015] ffff8800d9fce090 0000000000000046 0000000000012f00 ffff8800cfc87fd8
[Thu Dec 10 02:06:33 2015] 0000000000012f00 ffff8800d9fce090 ffff88011bc937b0 ffff880068d71090
[Thu Dec 10 02:06:33 2015] ffff880068d710d0 0000000000000001 ffff8800d1c9c000 0000000000000000
[Thu Dec 10 02:06:33 2015] Call Trace:
[Thu Dec 10 02:06:33 2015] [] ? io_schedule+0x99/0x120
[Thu Dec 10 02:06:33 2015] [] ? cv_wait_common+0x90/0x100 [spl]
[Thu Dec 10 02:06:33 2015] [] ? prepare_to_wait_event+0xf0/0xf0
[Thu Dec 10 02:06:33 2015] [] ? zio_wait+0x10b/0x1e0 [zfs]
[Thu Dec 10 02:06:33 2015] [] ? dsl_pool_sync+0xaa/0x460 [zfs]
[Thu Dec 10 02:06:33 2015] [] ? spa_sync+0x366/0xb30 [zfs]
[Thu Dec 10 02:06:33 2015] [] ? txg_sync_thread+0x3d1/0x680 [zfs]
[Thu Dec 10 02:06:33 2015] [] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
[Thu Dec 10 02:06:33 2015] [] ? thread_generic_wrapper+0x6b/0x80 [spl]
[Thu Dec 10 02:06:33 2015] [] ? __thread_exit+0x20/0x20 [spl]
[Thu Dec 10 02:06:33 2015] [] ? kthread+0xbd/0xe0
[Thu Dec 10 02:06:33 2015] [] ? kthread_create_on_node+0x180/0x180
[Thu Dec 10 02:06:33 2015] [] ? ret_from_fork+0x58/0x90
[Thu Dec 10 02:06:33 2015] [] ? kthread_create_on_node+0x180/0x180

@behlendorf
Contributor

@mailinglists35 are they transient?

@mailinglists35

Yes, only during peak disk activity. For example, this occurred when robocopy was writing data while sdelete was simultaneously wiping free space on a zvol exported via iSCSI to a Windows client.

@voidzero

Just a thought... could depleted entropy pools be an issue here? Maybe in relation to kernel.randomize_va_space and/or LUKS; a lack of entropy can sometimes cause long stalls.

@mailinglists35

I experience this on pools without LUKS

@ioquatix
Author

Is there some way we can test this hypothesis? Can the kernel log when it is running out of entropy?
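
One simple way to test it, if the hypothesis seems worth chasing, is to watch the kernel's available entropy estimate while a stall is happening. A sketch; what counts as "low" is debatable, but sustained values near zero would be suspicious:

# current entropy estimate, refreshed every second
watch -n1 cat /proc/sys/kernel/random/entropy_avail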

@ioquatix
Author

Okay, so this might be odd, but what the hell: I usually experience this problem when running Minecraft on said server. Memory usage? CPU usage? I'm not really sure. It never seems to be out of memory or pegged.

@jdb2

jdb2 commented Jan 23, 2016

I just started experiencing the same problem mentioned in this issue yesterday. I'm running Linux Mint Cinnamon 17.3 64-bit.

"uname -a" produces :

Linux jdb2-Satellite-M305 3.19.0-32-generic #37~14.04.1-Ubuntu SMP Thu Oct 22 09:41:40 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

"lsb_release -a" produces :

LSB Version:    core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch:core-4.1-amd64:core-4.1-noarch:security-4.0-amd64:security-4.0-noarch:security-4.1-amd64:security-4.1-noarch
Distributor ID: LinuxMint
Description:    Linux Mint 17.3 Rosa
Release:    17.3
Codename:   rosa

I have several Windows and Linux VMware VMs (VMware® Workstation 12 Pro 12.1.0 build-3272444) installed on an external Western Digital USB 3.0 4TB My Book hard drive which holds a ZFS pool named "zfs-backup". In the hope of fixing some inconsistencies in my 32-bit Windows 10 VM (so that I could expand the virtual disk size), I created a snapshot, reverted it, and then deleted it. The process was very IO-intensive on the zfs-backup pool and brought my system (at least the UI/GUI) to a halt. VMware was stuck at 60% "cleaning up deleted files" yesterday morning, so I switched to a virtual terminal to kill and restart the Cinnamon desktop manager and regain control of the GUI. While I was in the virtual terminal, all of a sudden a load of error messages started being spewed from the kernel to virtual terminal 1. I caught sight of what looked to be backtraces, but the text was scrolling too fast to read. My system then shut itself off. I don't know what happened or whether there was a kernel panic (my usual experience with kernel panics is a white-on-black screen filled with text describing the panic and a blinking Caps Lock key). In any case, nothing from the crash was recorded to /var/log as far as I could tell.

Anyhow, when I booted my system back up and tried to access my zfs-backup pool, almost every zfs or zpool command would hang/block.

Here is my ZFS version from dmesg:

[    4.547631] SPL: Loaded module v0.6.5.4-1~trusty
[    4.612372] ZFS: Loaded module v0.6.5.4-1~trusty, ZFS pool version 5000, ZFS filesystem version 5

Here is a sample of the type of kernel errors I was seeing:

[  890.857154] BTRFS info (device sdr3): disk space caching is enabled
[ 1320.084196] INFO: task txg_sync:13532 blocked for more than 120 seconds.
[ 1320.084204]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 1320.084206] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.084209] txg_sync        D ffff880181c2fbc8     0 13532      2 0x00000000
[ 1320.084214]  ffff880181c2fbc8 ffff8801f81c2740 0000000000013e80 ffff880181c2ffd8
[ 1320.084219]  0000000000013e80 ffff8802365ef5c0 ffff8801f81c2740 ffff880058421d68
[ 1320.084222]  ffff88023fd14778 ffff880058421d68 ffff880058421da8 ffff880058421d90
[ 1320.084226] Call Trace:
[ 1320.084236]  [<ffffffff817b2d40>] io_schedule+0xa0/0x130
[ 1320.084301]  [<ffffffffc0343b4f>] cv_wait_common+0x9f/0x120 [spl]
[ 1320.084306]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 1320.084319]  [<ffffffffc0343c28>] __cv_wait_io+0x18/0x20 [spl]
[ 1320.084389]  [<ffffffffc04d1ad3>] zio_wait+0x123/0x210 [zfs]
[ 1320.084429]  [<ffffffffc0475db5>] spa_sync+0x3b5/0xb50 [zfs]
[ 1320.084432]  [<ffffffff810b4668>] ? __wake_up_common+0x58/0x90
[ 1320.084473]  [<ffffffffc04882cf>] txg_sync_thread+0x3bf/0x630 [zfs]
[ 1320.084515]  [<ffffffffc0487f10>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[ 1320.084527]  [<ffffffffc033eec1>] thread_generic_wrapper+0x71/0x80 [spl]
[ 1320.084539]  [<ffffffffc033ee50>] ? __thread_exit+0x20/0x20 [spl]
[ 1320.084543]  [<ffffffff81093822>] kthread+0xd2/0xf0
[ 1320.084547]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1320.084550]  [<ffffffff817b6d18>] ret_from_fork+0x58/0x90
[ 1320.084553]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1440.084138] INFO: task txg_sync:13532 blocked for more than 120 seconds.
[ 1440.084144]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 1440.084146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1440.084148] txg_sync        D ffff880181c2fbc8     0 13532      2 0x00000000
[ 1440.084154]  ffff880181c2fbc8 ffff8801f81c2740 0000000000013e80 ffff880181c2ffd8
[ 1440.084157]  0000000000013e80 ffff8802365ef5c0 ffff8801f81c2740 ffff880058421d68
[ 1440.084161]  ffff88023fd14778 ffff880058421d68 ffff880058421da8 ffff880058421d90
[ 1440.084164] Call Trace:
[ 1440.084173]  [<ffffffff817b2d40>] io_schedule+0xa0/0x130
[ 1440.084224]  [<ffffffffc0343b4f>] cv_wait_common+0x9f/0x120 [spl]
[ 1440.084229]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 1440.084242]  [<ffffffffc0343c28>] __cv_wait_io+0x18/0x20 [spl]
[ 1440.084304]  [<ffffffffc04d1ad3>] zio_wait+0x123/0x210 [zfs]
[ 1440.084344]  [<ffffffffc0475db5>] spa_sync+0x3b5/0xb50 [zfs]
[ 1440.084347]  [<ffffffff810b4668>] ? __wake_up_common+0x58/0x90
[ 1440.084389]  [<ffffffffc04882cf>] txg_sync_thread+0x3bf/0x630 [zfs]
[ 1440.084430]  [<ffffffffc0487f10>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[ 1440.084442]  [<ffffffffc033eec1>] thread_generic_wrapper+0x71/0x80 [spl]
[ 1440.084454]  [<ffffffffc033ee50>] ? __thread_exit+0x20/0x20 [spl]
[ 1440.084458]  [<ffffffff81093822>] kthread+0xd2/0xf0
[ 1440.084462]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1440.084465]  [<ffffffff817b6d18>] ret_from_fork+0x58/0x90
[ 1440.084468]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1491.001888] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
[ 1560.084180] INFO: task txg_sync:13532 blocked for more than 120 seconds.
[ 1560.084189]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 1560.084192] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1560.084196] txg_sync        D ffff880181c2fbc8     0 13532      2 0x00000000
[ 1560.084209]  ffff880181c2fbc8 ffff8801f81c2740 0000000000013e80 ffff880181c2ffd8
[ 1560.084213]  0000000000013e80 ffff8802365ef5c0 ffff8801f81c2740 ffff880058421d68
[ 1560.084217]  ffff88023fd14778 ffff880058421d68 ffff880058421da8 ffff880058421d90
[ 1560.084220] Call Trace:
[ 1560.084229]  [<ffffffff817b2d40>] io_schedule+0xa0/0x130
[ 1560.084281]  [<ffffffffc0343b4f>] cv_wait_common+0x9f/0x120 [spl]
[ 1560.084286]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 1560.084299]  [<ffffffffc0343c28>] __cv_wait_io+0x18/0x20 [spl]
[ 1560.084364]  [<ffffffffc04d1ad3>] zio_wait+0x123/0x210 [zfs]
[ 1560.084403]  [<ffffffffc0475db5>] spa_sync+0x3b5/0xb50 [zfs]
[ 1560.084406]  [<ffffffff810b4668>] ? __wake_up_common+0x58/0x90
[ 1560.084448]  [<ffffffffc04882cf>] txg_sync_thread+0x3bf/0x630 [zfs]
[ 1560.084489]  [<ffffffffc0487f10>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[ 1560.084502]  [<ffffffffc033eec1>] thread_generic_wrapper+0x71/0x80 [spl]
[ 1560.084513]  [<ffffffffc033ee50>] ? __thread_exit+0x20/0x20 [spl]
[ 1560.084517]  [<ffffffff81093822>] kthread+0xd2/0xf0
[ 1560.084521]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1560.084524]  [<ffffffff817b6d18>] ret_from_fork+0x58/0x90
[ 1560.084527]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1680.084269] INFO: task txg_sync:13532 blocked for more than 120 seconds.
[ 1680.084279]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 1680.084281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1680.084284] txg_sync        D ffff880181c2fbc8     0 13532      2 0x00000000
[ 1680.084290]  ffff880181c2fbc8 ffff8801f81c2740 0000000000013e80 ffff880181c2ffd8
[ 1680.084312]  0000000000013e80 ffff8802365ef5c0 ffff8801f81c2740 ffff880058421d68
[ 1680.084315]  ffff88023fd14778 ffff880058421d68 ffff880058421da8 ffff880058421d90
[ 1680.084319] Call Trace:
[ 1680.084331]  [<ffffffff817b2d40>] io_schedule+0xa0/0x130
[ 1680.084407]  [<ffffffffc0343b4f>] cv_wait_common+0x9f/0x120 [spl]
[ 1680.084412]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 1680.084425]  [<ffffffffc0343c28>] __cv_wait_io+0x18/0x20 [spl]
[ 1680.084514]  [<ffffffffc04d1ad3>] zio_wait+0x123/0x210 [zfs]
[ 1680.084553]  [<ffffffffc0475db5>] spa_sync+0x3b5/0xb50 [zfs]
[ 1680.084557]  [<ffffffff810b4668>] ? __wake_up_common+0x58/0x90
[ 1680.084598]  [<ffffffffc04882cf>] txg_sync_thread+0x3bf/0x630 [zfs]
[ 1680.084639]  [<ffffffffc0487f10>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[ 1680.084651]  [<ffffffffc033eec1>] thread_generic_wrapper+0x71/0x80 [spl]
[ 1680.084663]  [<ffffffffc033ee50>] ? __thread_exit+0x20/0x20 [spl]
[ 1680.084668]  [<ffffffff81093822>] kthread+0xd2/0xf0
[ 1680.084671]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1680.084675]  [<ffffffff817b6d18>] ret_from_fork+0x58/0x90
[ 1680.084678]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1800.084146] INFO: task txg_sync:13532 blocked for more than 120 seconds.
[ 1800.084153]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 1800.084154] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1800.084157] txg_sync        D ffff880181c2fbc8     0 13532      2 0x00000000
[ 1800.084162]  ffff880181c2fbc8 ffff8801f81c2740 0000000000013e80 ffff880181c2ffd8
[ 1800.084166]  0000000000013e80 ffff8802365ef5c0 ffff8801f81c2740 ffff880058421d68
[ 1800.084169]  ffff88023fd14778 ffff880058421d68 ffff880058421da8 ffff880058421d90
[ 1800.084173] Call Trace:
[ 1800.084182]  [<ffffffff817b2d40>] io_schedule+0xa0/0x130
[ 1800.084233]  [<ffffffffc0343b4f>] cv_wait_common+0x9f/0x120 [spl]
[ 1800.084238]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 1800.084251]  [<ffffffffc0343c28>] __cv_wait_io+0x18/0x20 [spl]
[ 1800.084314]  [<ffffffffc04d1ad3>] zio_wait+0x123/0x210 [zfs]
[ 1800.084354]  [<ffffffffc0475db5>] spa_sync+0x3b5/0xb50 [zfs]
[ 1800.084357]  [<ffffffff810b4668>] ? __wake_up_common+0x58/0x90
[ 1800.084398]  [<ffffffffc04882cf>] txg_sync_thread+0x3bf/0x630 [zfs]
[ 1800.084440]  [<ffffffffc0487f10>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[ 1800.084452]  [<ffffffffc033eec1>] thread_generic_wrapper+0x71/0x80 [spl]
[ 1800.084464]  [<ffffffffc033ee50>] ? __thread_exit+0x20/0x20 [spl]
[ 1800.084468]  [<ffffffff81093822>] kthread+0xd2/0xf0
[ 1800.084471]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1800.084475]  [<ffffffff817b6d18>] ret_from_fork+0x58/0x90
[ 1800.084478]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1920.084178] INFO: task txg_sync:13532 blocked for more than 120 seconds.
[ 1920.084186]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 1920.084188] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1920.084190] txg_sync        D ffff880181c2fbc8     0 13532      2 0x00000000
[ 1920.084195]  ffff880181c2fbc8 ffff8801f81c2740 0000000000013e80 ffff880181c2ffd8
[ 1920.084200]  0000000000013e80 ffff8802365ef5c0 ffff8801f81c2740 ffff880058421d68
[ 1920.084203]  ffff88023fd14778 ffff880058421d68 ffff880058421da8 ffff880058421d90
[ 1920.084207] Call Trace:
[ 1920.084216]  [<ffffffff817b2d40>] io_schedule+0xa0/0x130
[ 1920.084270]  [<ffffffffc0343b4f>] cv_wait_common+0x9f/0x120 [spl]
[ 1920.084275]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 1920.084288]  [<ffffffffc0343c28>] __cv_wait_io+0x18/0x20 [spl]
[ 1920.084354]  [<ffffffffc04d1ad3>] zio_wait+0x123/0x210 [zfs]
[ 1920.084393]  [<ffffffffc0475db5>] spa_sync+0x3b5/0xb50 [zfs]
[ 1920.084397]  [<ffffffff810b4668>] ? __wake_up_common+0x58/0x90
[ 1920.084438]  [<ffffffffc04882cf>] txg_sync_thread+0x3bf/0x630 [zfs]
[ 1920.084479]  [<ffffffffc0487f10>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[ 1920.084492]  [<ffffffffc033eec1>] thread_generic_wrapper+0x71/0x80 [spl]
[ 1920.084503]  [<ffffffffc033ee50>] ? __thread_exit+0x20/0x20 [spl]
[ 1920.084508]  [<ffffffff81093822>] kthread+0xd2/0xf0
[ 1920.084511]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 1920.084515]  [<ffffffff817b6d18>] ret_from_fork+0x58/0x90
[ 1920.084518]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2040.084144] INFO: task txg_sync:13532 blocked for more than 120 seconds.
[ 2040.084151]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 2040.084153] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2040.084155] txg_sync        D ffff880181c2fbc8     0 13532      2 0x00000000
[ 2040.084161]  ffff880181c2fbc8 ffff8801f81c2740 0000000000013e80 ffff880181c2ffd8
[ 2040.084165]  0000000000013e80 ffff8802365ef5c0 ffff8801f81c2740 ffff880058421d68
[ 2040.084168]  ffff88023fd14778 ffff880058421d68 ffff880058421da8 ffff880058421d90
[ 2040.084172] Call Trace:
[ 2040.084181]  [<ffffffff817b2d40>] io_schedule+0xa0/0x130
[ 2040.084231]  [<ffffffffc0343b4f>] cv_wait_common+0x9f/0x120 [spl]
[ 2040.084236]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 2040.084249]  [<ffffffffc0343c28>] __cv_wait_io+0x18/0x20 [spl]
[ 2040.084312]  [<ffffffffc04d1ad3>] zio_wait+0x123/0x210 [zfs]
[ 2040.084351]  [<ffffffffc0475db5>] spa_sync+0x3b5/0xb50 [zfs]
[ 2040.084355]  [<ffffffff810b4668>] ? __wake_up_common+0x58/0x90
[ 2040.084396]  [<ffffffffc04882cf>] txg_sync_thread+0x3bf/0x630 [zfs]
[ 2040.084437]  [<ffffffffc0487f10>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[ 2040.084450]  [<ffffffffc033eec1>] thread_generic_wrapper+0x71/0x80 [spl]
[ 2040.084461]  [<ffffffffc033ee50>] ? __thread_exit+0x20/0x20 [spl]
[ 2040.084466]  [<ffffffff81093822>] kthread+0xd2/0xf0
[ 2040.084469]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2040.084472]  [<ffffffff817b6d18>] ret_from_fork+0x58/0x90
[ 2040.084475]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2040.084499] INFO: task zfs:23416 blocked for more than 120 seconds.
[ 2040.084512]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 2040.084514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2040.084515] zfs             D ffff880100d2bb88     0 23416  23415 0x00000000
[ 2040.084519]  ffff880100d2bb88 ffff88005e17b110 0000000000013e80 ffff880100d2bfd8
[ 2040.084523]  0000000000013e80 ffffffff81c1d4e0 ffff88005e17b110 ffff880100d2bb98
[ 2040.084526]  ffff8801f53c3ad8 ffff8801f53c3a20 ffff8801f53c3ae0 ffff8801f53c3a48
[ 2040.084529] Call Trace:
[ 2040.084533]  [<ffffffff817b2a39>] schedule+0x29/0x70
[ 2040.084547]  [<ffffffffc0343b95>] cv_wait_common+0xe5/0x120 [spl]
[ 2040.084550]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 2040.084563]  [<ffffffffc0343be5>] __cv_wait+0x15/0x20 [spl]
[ 2040.084604]  [<ffffffffc04871af>] txg_wait_synced+0xef/0x140 [zfs]
[ 2040.084639]  [<ffffffffc045ceb0>] ? dsl_prop_inherit+0x60/0x60 [zfs]
[ 2040.084675]  [<ffffffffc0462ae7>] dsl_sync_task+0x177/0x270 [zfs]
[ 2040.084710]  [<ffffffffc045e4e0>] ? dsl_props_set_sync_impl+0x170/0x170 [zfs]
[ 2040.084745]  [<ffffffffc045ceb0>] ? dsl_prop_inherit+0x60/0x60 [zfs]
[ 2040.084780]  [<ffffffffc045e4e0>] ? dsl_props_set_sync_impl+0x170/0x170 [zfs]
[ 2040.084815]  [<ffffffffc045cd60>] dsl_props_set+0x50/0x60 [zfs]
[ 2040.084859]  [<ffffffffc04b0d27>] zfs_set_prop_nvlist+0x2a7/0x370 [zfs]
[ 2040.084904]  [<ffffffffc04b1bdf>] zfs_ioc_set_prop+0xdf/0x120 [zfs]
[ 2040.084948]  [<ffffffffc04afe9d>] zfsdev_ioctl+0x49d/0x4e0 [zfs]
[ 2040.084952]  [<ffffffff8120b2de>] ? m_start+0x1e/0xb0
[ 2040.084955]  [<ffffffff8121043c>] ? seq_read+0x1bc/0x370
[ 2040.084958]  [<ffffffff811ffc58>] do_vfs_ioctl+0x2f8/0x510
[ 2040.084962]  [<ffffffff811ece38>] ? __vfs_read+0x18/0x50
[ 2040.084965]  [<ffffffff811ecef6>] ? vfs_read+0x86/0x140
[ 2040.084968]  [<ffffffff811ffef1>] SyS_ioctl+0x81/0xa0
[ 2040.084971]  [<ffffffff817b6dcd>] system_call_fastpath+0x16/0x1b
[ 2160.084157] INFO: task txg_sync:13532 blocked for more than 120 seconds.
[ 2160.084163]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 2160.084165] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2160.084168] txg_sync        D ffff880181c2fbc8     0 13532      2 0x00000000
[ 2160.084173]  ffff880181c2fbc8 ffff8801f81c2740 0000000000013e80 ffff880181c2ffd8
[ 2160.084177]  0000000000013e80 ffff8802365ef5c0 ffff8801f81c2740 ffff880058421d68
[ 2160.084180]  ffff88023fd14778 ffff880058421d68 ffff880058421da8 ffff880058421d90
[ 2160.084184] Call Trace:
[ 2160.084193]  [<ffffffff817b2d40>] io_schedule+0xa0/0x130
[ 2160.084245]  [<ffffffffc0343b4f>] cv_wait_common+0x9f/0x120 [spl]
[ 2160.084249]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 2160.084262]  [<ffffffffc0343c28>] __cv_wait_io+0x18/0x20 [spl]
[ 2160.084326]  [<ffffffffc04d1ad3>] zio_wait+0x123/0x210 [zfs]
[ 2160.084365]  [<ffffffffc0475db5>] spa_sync+0x3b5/0xb50 [zfs]
[ 2160.084369]  [<ffffffff810b4668>] ? __wake_up_common+0x58/0x90
[ 2160.084411]  [<ffffffffc04882cf>] txg_sync_thread+0x3bf/0x630 [zfs]
[ 2160.084452]  [<ffffffffc0487f10>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[ 2160.084465]  [<ffffffffc033eec1>] thread_generic_wrapper+0x71/0x80 [spl]
[ 2160.084476]  [<ffffffffc033ee50>] ? __thread_exit+0x20/0x20 [spl]
[ 2160.084480]  [<ffffffff81093822>] kthread+0xd2/0xf0
[ 2160.084484]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2160.084487]  [<ffffffff817b6d18>] ret_from_fork+0x58/0x90
[ 2160.084490]  [<ffffffff81093750>] ? kthread_create_on_node+0x1c0/0x1c0
[ 2160.084513] INFO: task zfs:23416 blocked for more than 120 seconds.
[ 2160.084515]       Tainted: P           OE  3.19.0-32-generic #37~14.04.1-Ubuntu
[ 2160.084516] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2160.084518] zfs             D ffff880100d2bb88     0 23416  23415 0x00000000
[ 2160.084522]  ffff880100d2bb88 ffff88005e17b110 0000000000013e80 ffff880100d2bfd8
[ 2160.084525]  0000000000013e80 ffffffff81c1d4e0 ffff88005e17b110 ffff880100d2bb98
[ 2160.084528]  ffff8801f53c3ad8 ffff8801f53c3a20 ffff8801f53c3ae0 ffff8801f53c3a48
[ 2160.084532] Call Trace:
[ 2160.084535]  [<ffffffff817b2a39>] schedule+0x29/0x70
[ 2160.084549]  [<ffffffffc0343b95>] cv_wait_common+0xe5/0x120 [spl]
[ 2160.084552]  [<ffffffff810b4e30>] ? prepare_to_wait_event+0x110/0x110
[ 2160.084565]  [<ffffffffc0343be5>] __cv_wait+0x15/0x20 [spl]
[ 2160.084606]  [<ffffffffc04871af>] txg_wait_synced+0xef/0x140 [zfs]
[ 2160.084641]  [<ffffffffc045ceb0>] ? dsl_prop_inherit+0x60/0x60 [zfs]
[ 2160.084677]  [<ffffffffc0462ae7>] dsl_sync_task+0x177/0x270 [zfs]
[ 2160.084712]  [<ffffffffc045e4e0>] ? dsl_props_set_sync_impl+0x170/0x170 [zfs]
[ 2160.084747]  [<ffffffffc045ceb0>] ? dsl_prop_inherit+0x60/0x60 [zfs]
[ 2160.084782]  [<ffffffffc045e4e0>] ? dsl_props_set_sync_impl+0x170/0x170 [zfs]
[ 2160.084817]  [<ffffffffc045cd60>] dsl_props_set+0x50/0x60 [zfs]
[ 2160.084862]  [<ffffffffc04b0d27>] zfs_set_prop_nvlist+0x2a7/0x370 [zfs]
[ 2160.084906]  [<ffffffffc04b1bdf>] zfs_ioc_set_prop+0xdf/0x120 [zfs]
[ 2160.084951]  [<ffffffffc04afe9d>] zfsdev_ioctl+0x49d/0x4e0 [zfs]
[ 2160.084955]  [<ffffffff8120b2de>] ? m_start+0x1e/0xb0
[ 2160.084958]  [<ffffffff8121043c>] ? seq_read+0x1bc/0x370
[ 2160.084961]  [<ffffffff811ffc58>] do_vfs_ioctl+0x2f8/0x510
[ 2160.084965]  [<ffffffff811ece38>] ? __vfs_read+0x18/0x50
[ 2160.084968]  [<ffffffff811ecef6>] ? vfs_read+0x86/0x140
[ 2160.084971]  [<ffffffff811ffef1>] SyS_ioctl+0x81/0xa0
[ 2160.084974]  [<ffffffff817b6dcd>] system_call_fastpath+0x16/0x1b

I waited several hours for the above command to unblock, but now it appears that my zfs-backup pool is up, running and mounted again, with no errors.

Here is the output of "zpool status":

pool: zfs-backup
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: scrub in progress since Fri Jan 22 12:43:52 2016
    1.09T scanned out of 1.70T at 18.9M/s, 9h17m to go
    0 repaired, 64.42% done
config:

    NAME                                                      STATE     READ WRITE CKSUM
    zfs-backup                                                ONLINE       0     0     0
      usb-WD_My_Book_1230_57434334453359413945594E-0:0-part3  ONLINE       0     0     0

errors: No known data errors

If anyone has any insight on what the cause of my problems was/is, or even better, a fix for this bug, then I'd be appreciative.

Regards,

jdb2

@bobobo1618

I get this occasionally with 0.6.5.7, making extensive use of LUKS.

So far, during the times I've noticed it, one particular disk has always been at 100% utilization, so I have a feeling it's a bad disk more than anything else. In case it's an entropy problem, I've also installed and activated rng-tools.
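
For anyone trying to separate those two suspects, a couple of standard Linux diagnostics may help (generic tools, not taken from this report):

# Per-device I/O statistics; a disk pinned near 100 %util with high await points at the drive itself
iostat -x 1 5

# Available kernel entropy; persistently low values suggest entropy starvation rather than a disk problem
cat /proc/sys/kernel/random/entropy_avail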

@voidzero

Haveged is also useful. A bad disk, yeah, that could be it.

@ioquatix
Author

@behlendorf I'm still experiencing this issue: no disks have failed in over a year and await times are low, yet txg_sync keeps hanging. What should I do?

@SebastianRusek

I also experience this problem while removing a directory with 3.5 million files.

Ubuntu server 16.10, kernel: 4.8.0-34-generic, zfsutils-linux: 0.6.5.8-0ubuntu4.1

The zpool runs in a KVM VM, on an LVM volume on top of MD RAID, with lz4 compression and deduplication enabled:
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
backups01 3.99T 2.78T 1.21T - 61% 69% 2.14x ONLINE -

A few minutes after the rm -rf starts, I get "INFO: task txg_sync:1748 blocked for more than 120 seconds." and the zpool becomes unusable.

Do we have any workaround? Has anyone tried something like a loop that removes the files in small batches (see the sketch below)?

How can I help you in debugging this issue?
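
One untested way to try that batching idea, sketched here with an arbitrary batch size, pause, and placeholder path (none of these values are verified to avoid the hang):

# Hypothetical workaround: delete files in batches of 1000, pausing between
# batches so txg_sync has a chance to catch up.
find /backups01/data/bigdir -type f -print0 \
  | xargs -0 -n 1000 sh -c 'rm -f -- "$@"; sleep 1' _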

@lnxbil

lnxbil commented Jan 12, 2017

I have not encountered this problem in over a year now, but I disabled deduplication (by recreating my pool without it). Since then, it has run smoothly.

To achieve something similar to deduplication, I use the COW capability to keep incremental differences between backups. It is not perfect deduplication, but it saves me a factor of at least 10 compared to non-ZFS-based backups.
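
For reference, the snapshot-based incremental scheme described above looks roughly like this (pool and dataset names are placeholders):

# Take a new snapshot of the source dataset
zfs snapshot tank/data@2017-01-12

# Send only the differences since the previous snapshot to the backup pool
zfs send -i tank/data@2017-01-11 tank/data@2017-01-12 | zfs recv backup/data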

@ioquatix
Author

So, I had a boot drive (the one with the high await time). I've replaced it with an SSD. I'll report back if this alleviates the problem.

@ioquatix
Author

Relates to #4011 and #2611.

@ioquatix
Author

I've come to the conclusion that there was something wrong with one of the CPU cores in this server. So, some of the issues I've reported here may be due to faulty hardware.

@ghost

ghost commented Jun 25, 2017

We also encounter this issue during heavy load.
@ioquatix How did you identify the problem with the CPU core?

@ioquatix
Author

I've disabled one of the CPU cores and since then haven't had a single problem.

When both CPU cores were enabled, I had a number of correctable and uncorrectable ECC errors. memtest86 failed systematically at the 9-minute mark on every run, for no obvious reason. After swapping the memory and running with one DIMM and then the other, I eventually tried disabling one of the CPU cores, and the machine has been rock solid ever since. My best guess is that the motherboard itself has become faulty, perhaps in the physical traces between the CPU and the DIMM sockets.

It's certainly an odd situation.
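
For anyone wanting to run the same experiment, a core can be taken offline at runtime through sysfs (the core number below is just an example, and the change does not persist across reboots):

# Take logical CPU 1 offline; cpu0 usually cannot be offlined
echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online

# Bring it back online later
echo 1 | sudo tee /sys/devices/system/cpu/cpu1/online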

@dreamcat4

@ioquatix thanks for this insight regarding the faulty CPU core. It may not be the cause of my own issue, but it does point in general to underlying hardware errors.

@ioquatix
Author

I definitely had an issue with one of the drives too, but it was corrected after running a destructive badblocks read/write check and recreating the array. I think the disk was non-TLER and was taking its time trying to read a faulty sector.
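
For context, the destructive badblocks pass and a TLER check look roughly like this (the device name is a placeholder, and the -w mode erases everything on the disk):

# Destructive read/write surface test (destroys all data on the device)
badblocks -wsv /dev/sdX

# Query the drive's error recovery control (TLER); not all drives support it
smartctl -l scterc /dev/sdX

# If supported, cap read/write error recovery at 7 seconds
smartctl -l scterc,70,70 /dev/sdX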

@darkpixel

I don't think it's bad hardware. I was hitting this on 30 servers in a test environment running Ubuntu, if I recall correctly; I don't remember the specific version. After a few weeks or months it went away, possibly due to a kernel or ZFS package upgrade.

@ioquatix
Author

Interesting. I could still reproduce the issue with those drives (WD Green 3TB), but I can't reproduce it on exactly the same server with WD Red 4TB drives. I'll have to do another full backup and see if I can reproduce it: I'm doing a zfs send/recv from the 6x RAID10 WD Reds to the 4x RAID10 WD Greens, and that usually triggers the issue.
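
A simple way to watch what the pool is doing while such a transfer runs, as a general sketch (the pool name is a placeholder, and the txgs kstat path can differ between releases):

# Per-vdev I/O statistics, refreshed every second
zpool iostat -v tank 1

# Transaction group history exposed via the SPL/ZFS kstats
cat /proc/spl/kstat/zfs/tank/txgs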

@dreamcat4

dreamcat4 commented Jul 13, 2017

I got this on Ubuntu 16.04 just yesterday. However, I also had SATA bus communication errors, a bad controller, and non-TLER drives, so it's really hard for me to know which one is the cause.

I thought there was some fix or solution for Ubuntu systems, but I cannot remember where I saw it. [edit] The person was talking about snapshots being taken with zfs-auto-snapshot (which I also use here).
