task kswapd blocked for more than 120 seconds, zpl_evict_inode in backtrace. #2125
Here is the process log at 11am; note that some processes appear to be stuck in ZFS-related functions: https://gist.github.com/ioquatix/2764a81d4d4928fb6c8b |
Forgot to include kernel version:
|
@ioquatix Thanks for opening a new issue for this. Can you include what version of ZFS you were using? |
@behlendorf How do I find the ZFS version from the packages that have been installed? I've also set up a job to collect memory usage stats from the server every 10 minutes, so if something goes wrong again I'll have some records. Apart from |
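For reference, a few common ways to check which ZFS version is installed and running; the Arch package name below is an assumption (the archzfs repository has used names like zfs-git):

```
# Version banner printed by the module at load time:
dmesg | grep "ZFS: Loaded module"

# Module metadata, which includes the version string:
modinfo zfs | grep -i '^version'

# Installed package version on Arch (package name may vary):
pacman -Qi zfs-git | grep Version
```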
Here is my pool configuration:
Here are the mount points:
|
So, after running for a while, it seems like ZFS is forcing other parts of the system to use the swap disk:
Is this normal behaviour? |
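One quick way to see which processes are actually being pushed out to swap, as a sketch using standard /proc accounting (nothing ZFS-specific; run as root to see all processes):

```
# Per-process swap usage in kB, largest consumers first:
grep VmSwap /proc/[0-9]*/status | sort -t: -k3 -nr | head
```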
Can you post the output of /proc/spl/kstat/zfs/arcstats? It should shed some light on this. |
arcstats
free -h
|
@ioquatix For current versions of master you can get the running ZFS version from … As for this issue, unfortunately I don't have a quick solution for you. The arcstats you posted don't show anything obviously wrong. We're going to have to dig into this to find the root cause, and it may take us a while to get a chance to investigate. |
Running rsync tonight; I'm starting to see some strange numbers in
Surely |
I'm running a second rsync task from a remote server to increase the pressure. The kernel seems to be under significant memory pressure, judging by the churn I see when running
|
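The exact command used above wasn't captured; here is a minimal sketch of one way to watch that churn, assuming standard procps tools and the /proc/spl interface:

```
# Print free memory and a few key ARC counters every second:
while sleep 1; do
  free -m
  awk '$1 ~ /^(size|c|c_max|arc_meta_used)$/' /proc/spl/kstat/zfs/arcstats
done
```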
Slowly but surely, over the past 10 minutes, the
The system is now swapping to disk, and swap usage is growing by about 5 MB/minute.
|
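Swap growth like this is easy to confirm with standard tooling; the si/so columns in vmstat show pages swapped in and out per interval:

```
# Sample memory and swap activity every 5 seconds (watch the si/so columns):
vmstat 5

# Current usage per swap device:
swapon -s
```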
I found I could free up a significant chunk of memory by doing the following:
However, I get the feeling this will only delay the issue. |
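The command used above wasn't captured; the usual way to force this kind of reclaim is the drop_caches knob, which also triggers the kernel's registered slab shrinkers (including the ARC's). This is only a guess at what was run:

```
# Flush dirty pages first, then drop the page cache plus dentries and inodes
# (3 = pagecache + reclaimable slab objects):
sync
echo 3 > /proc/sys/vm/drop_caches
```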
Seems like this also forced the
|
@ioquatix Just out of curiosity, what does |
Can you post the full arcstats when the system is in this state? Pull request #2110 was in part designed to address issues with metadata-heavy workloads like rsync. It would be very helpful if you could run your workload with the patches in #2110 applied and log arcstats every 15 seconds; you should see a significant improvement. These changes weren't designed to address the deadlock you're seeing, but the improved behavior of the ARC might avoid the issue. If we can get some more real-world testing on these changes it would be helpful. The changes are safe to apply to your existing pool, and I was hoping to merge them to master this week. A few more data points would be welcome. |
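A minimal way to capture arcstats every 15 seconds, as requested (the log path is arbitrary):

```
# Append a timestamped arcstats snapshot to a log file every 15 seconds:
while true; do
  { date; cat /proc/spl/kstat/zfs/arcstats; } >> /root/arcstats.log
  sleep 15
done
```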
I'm logging arcstats now. I'm not sure I'm willing to apply the patch, as this is a reasonably important backup server. Is there anything else I can do to help? I couldn't see the
I haven't attempted to reproduce the deadlock; I'm carefully watching the free memory usage. It's a moderately important backup server, so I'd rather avoid deadlocks if possible. |
@ioquatix My suggestion would be to just wait a bit until 0.6.3 comes out, then update to the official tag and we'll see if the issue remains. There's a decent chance this will be addressed. |
Seems like I've run into the same issue:
Probably can't do anything disk-I/O related. dmesg looks pretty similar.
Arcstats:
Kernel:
|
I haven't seen the issue happen again. It happened when I did a fresh rsync to a new ZFS partition. Since then, the memory usage has stayed above around 1 GB free. |
I've got a similar problem. I woke the system up from suspend, and after approximately 3 minutes it locked up completely... Note that I have not enabled any swap, so I wonder why the task
|
Since when does ZFS support suspend? And if so, since when has it been fully usable? I'm currently also using it on my laptop and have (out of fear of trouble) avoided suspending :/ |
Oh, so ZFS does not support suspend? |
If I remember correctly, it does not. It's mentioned in another issue and was classified as low priority; it was also mentioned that it's non-trivial to implement. |
I haven't had this issue show up since upgrading to the latest Arch package, and I've been running lots of data through it over the past couple of months, so I'm going to close it. |
@behlendorf @ioquatix Not sure if this should be reopened, but I just saw very similar behavior on (note: I'm really new to ZFS administration)
free -h (although I couldn't grab this until the system had come back; it was completely frozen)
While doing a
After reading through #1657, some suspect entries from
|
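For readers trying to interpret these counters: ZFS on Linux ships a summarizer that renders the raw arcstats in a readable form (installed as arc_summary.py in releases of this era, arc_summary in newer ones):

```
# Human-readable summary of /proc/spl/kstat/zfs/arcstats:
arc_summary.py
```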
Similar issue here, though in my case the hang is in XFS inode reclaim:

```
Apr 22 20:06:57 AI02 kernel: [94488.814125] INFO: task kswapd0:139 blocked for more than 120 seconds.
Apr 22 20:06:57 AI02 kernel: [94488.814132] Tainted: P OE 4.15.0-96-generic #97-Ubuntu
Apr 22 20:06:57 AI02 kernel: [94488.814134] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 20:06:57 AI02 kernel: [94488.814137] kswapd0 D 0 139 2 0x80000000
Apr 22 20:06:57 AI02 kernel: [94488.814141] Call Trace:
Apr 22 20:06:57 AI02 kernel: [94488.814151] __schedule+0x24e/0x880
Apr 22 20:06:57 AI02 kernel: [94488.814158] ? blk_flush_plug_list+0xea/0x270
Apr 22 20:06:57 AI02 kernel: [94488.814161] schedule+0x2c/0x80
Apr 22 20:06:57 AI02 kernel: [94488.814165] schedule_timeout+0x1cf/0x350
Apr 22 20:06:57 AI02 kernel: [94488.814219] ? _xfs_buf_ioapply+0x396/0x4e0 [xfs]
Apr 22 20:06:57 AI02 kernel: [94488.814223] wait_for_completion+0xba/0x140
Apr 22 20:06:57 AI02 kernel: [94488.814227] ? wake_up_q+0x80/0x80
Apr 22 20:06:57 AI02 kernel: [94488.814267] ? xfs_bwrite+0x24/0x60 [xfs]
Apr 22 20:06:57 AI02 kernel: [94488.814302] xfs_buf_submit_wait+0x81/0x210 [xfs]
Apr 22 20:06:57 AI02 kernel: [94488.814335] xfs_bwrite+0x24/0x60 [xfs]
Apr 22 20:06:57 AI02 kernel: [94488.814373] xfs_reclaim_inode+0x327/0x350 [xfs]
Apr 22 20:06:57 AI02 kernel: [94488.814407] xfs_reclaim_inodes_ag+0x1eb/0x340 [xfs]
Apr 22 20:06:57 AI02 kernel: [94488.814416] ? check_preempt_curr+0x2d/0x90
Apr 22 20:06:57 AI02 kernel: [94488.814418] ? ttwu_do_wakeup+0x1e/0x140
Apr 22 20:06:57 AI02 kernel: [94488.814421] ? ttwu_do_activate+0x77/0x80
Apr 22 20:06:57 AI02 kernel: [94488.814424] ? try_to_wake_up+0x59/0x4a0
Apr 22 20:06:57 AI02 kernel: [94488.814427] ? wake_up_process+0x15/0x20
Apr 22 20:06:57 AI02 kernel: [94488.814462] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
Apr 22 20:06:57 AI02 kernel: [94488.814505] xfs_fs_free_cached_objects+0x19/0x20 [xfs]
Apr 22 20:06:57 AI02 kernel: [94488.814510] super_cache_scan+0x165/0x1b0
Apr 22 20:06:57 AI02 kernel: [94488.814514] shrink_slab.part.51+0x1e7/0x440
Apr 22 20:06:57 AI02 kernel: [94488.814518] shrink_slab+0x29/0x30
Apr 22 20:06:57 AI02 kernel: [94488.814521] shrink_node+0x11e/0x300
Apr 22 20:06:57 AI02 kernel: [94488.814525] kswapd+0x2ae/0x730
Apr 22 20:06:57 AI02 kernel: [94488.814529] kthread+0x121/0x140
Apr 22 20:06:57 AI02 kernel: [94488.814532] ? mem_cgroup_shrink_node+0x190/0x190
Apr 22 20:06:57 AI02 kernel: [94488.814534] ? kthread_create_worker_on_cpu+0x70/0x70
Apr 22 20:06:57 AI02 kernel: [94488.814538] ret_from_fork+0x35/0x40
Apr 22 20:06:57 AI02 kernel: [94488.814568] INFO: task kubelet:3361 blocked for more than 120 seconds.
Apr 22 20:06:57 AI02 kernel: [94488.814571] Tainted: P OE 4.15.0-96-generic #97-Ubuntu
Apr 22 20:06:57 AI02 kernel: [94488.814573] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 20:06:57 AI02 kernel: [94488.814575] kubelet D 0 3361 1 0x00000000
Apr 22 20:06:57 AI02 kernel: [94488.814578] Call Trace:
Apr 22 20:06:57 AI02 kernel: [94488.814582] __schedule+0x24e/0x880
Apr 22 20:06:57 AI02 kernel: [94488.814587] ? ___slab_alloc+0xf2/0x4b0
Apr 22 20:06:57 AI02 kernel: [94488.814589] schedule+0x2c/0x80
Apr 22 20:06:57 AI02 kernel: [94488.814592] schedule_preempt_disabled+0xe/0x10
```

One thing worth noting is that I had disabled swap beforehand. |
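As the log itself notes, the 120-second warning is diagnostic and can be tuned or silenced via sysctl; this only suppresses the messages, it does not fix the underlying stall:

```
# Show the current hung-task timeout in seconds (120 by default):
sysctl kernel.hung_task_timeout_secs

# Disable the warnings entirely (0 turns the check off):
echo 0 > /proc/sys/kernel/hung_task_timeout_secs
```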
Hi, my apologies if this bug has already been reported. I looked through quite a few existing issues to find something related but wasn't really sure, so I decided to file a new bug report.
I was running a fairly large rsync task last night (20-30 GB) from a remote server to my local backup machine, which is running the latest ZFS available in Arch Linux. I have 4x3 TB hard drives in a raidz1 and a 1x2 TB removable backup drive, both formatted using ZFS. The OS drive is a separate spinning disk running ext4, and the system is an HP MicroServer with 8 GB of RAM.
The system uses the default limits for the ARC, which appear to be 4 GB of memory (half of RAM). There are also 4 GB of swap available.
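For reference, the ARC ceiling can be pinned explicitly rather than left at the default; a sketch for an 8 GB machine, with the 2 GiB value purely illustrative:

```
# Persistent cap via module option, read when the zfs module loads:
#   /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=2147483648

# Or adjust at runtime (value in bytes), no reboot needed:
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
```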
I've never had problems with rsync in the past, but last night the entire system ground to a halt and became completely unresponsive by the time I got to it at 11 AM. I'm still not particularly sure what the actual problem was: all my SSH sessions locked up, and physical access to the machine was no better. I started the backup around 2:15 AM, and it appears to have started having problems around 3 AM.
I've also got a process log from 11AM, I'll try to attach it separately as it is quite long.