
Reproducible machine lockup with BTRFS backend on OOM kill #15654

Closed
Soulou opened this issue Aug 18, 2015 · 8 comments

@Soulou (Contributor) commented Aug 18, 2015

Server version: 1.7.0
Server API version: 1.19
Go version (server): go1.4.2
Git commit (server): 0baf609
OS/Arch (server): linux/amd64
Containers: 177
Images: 369
Storage Driver: btrfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.19.0-25-generic
Operating System: Ubuntu 14.04.3 LTS
CPUs: 6
Total Memory: 31.42 GiB
Name: dph-scahos01
ID: AZNH:RYE6:EPU2:OM3H:RQGL:E6GL:3PCA:BJJP:JVQO:ILYS:G7B4:TWOZ
Username: soulou
Registry: https://index.docker.io/v1/

It seems related to btrfs, according to the stack traces I can find in /var/log/syslog:

Aug 18 08:41:16 kernel: [462822.145964] INFO: task kworker/u12:1:31792 blocked for more than 120 seconds.
Aug 18 08:41:16 kernel: [462822.146724]       Tainted: G         C     3.19.0-25-generic #26~14.04.1-Ubuntu
Aug 18 08:41:16 kernel: [462822.147513] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 18 08:41:16 kernel: [462822.148254] kworker/u12:1   D ffff8802c64836c8     0 31792      2 0x00000000
Aug 18 08:41:16 kernel: [462822.148261] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-2)
Aug 18 08:41:16 kernel: [462822.148263]  ffff8802c64836c8 ffff88024cadebf0 0000000000013e80 ffff8802c6483fd8
Aug 18 08:41:16 kernel: [462822.148265]  0000000000013e80 ffffffff81c1d4e0 ffff88024cadebf0 ffff88083ffa4de8
Aug 18 08:41:16 kernel: [462822.148267]  ffff88083fc14778 ffff88083ffa4de8 ffff8802c6483770 0000000000000002
Aug 18 08:41:16 kernel: [462822.148269] Call Trace:
Aug 18 08:41:16 kernel: [462822.148272]  [<ffffffff817b2e20>] ? bit_wait+0x50/0x50
Aug 18 08:41:16 kernel: [462822.148274]  [<ffffffff817b25f0>] io_schedule+0xa0/0x130
Aug 18 08:41:16 kernel: [462822.148276]  [<ffffffff817b2e4c>] bit_wait_io+0x2c/0x50
Aug 18 08:41:16 kernel: [462822.148278]  [<ffffffff817b2c1b>] __wait_on_bit_lock+0x4b/0xb0
Aug 18 08:41:16 kernel: [462822.148281]  [<ffffffff8117588e>] __lock_page+0xae/0xb0
Aug 18 08:41:16 kernel: [462822.148285]  [<ffffffff810b4e50>] ? autoremove_wake_function+0x40/0x40
Aug 18 08:41:16 kernel: [462822.148314]  [<ffffffffc0460d8d>] lock_delalloc_pages+0x13d/0x1d0 [btrfs]
Aug 18 08:41:16 kernel: [462822.148327]  [<ffffffffc0461046>] ? find_delalloc_range.constprop.47+0xa6/0x160 [btrfs]
Aug 18 08:41:16 kernel: [462822.148339]  [<ffffffffc046352b>] find_lock_delalloc_range.constprop.46+0x14b/0x1f0 [btrfs]
Aug 18 08:41:16 kernel: [462822.148350]  [<ffffffffc0463d20>] ? end_extent_writepage+0xa0/0xa0 [btrfs]
Aug 18 08:41:16 kernel: [462822.148361]  [<ffffffffc046364b>] writepage_delalloc.isra.32+0x7b/0x150 [btrfs]
Aug 18 08:41:16 kernel: [462822.148371]  [<ffffffffc04647eb>] __extent_writepage+0xbb/0x2a0 [btrfs]
Aug 18 08:41:16 kernel: [462822.148381]  [<ffffffffc0464d0a>] extent_write_cache_pages.isra.28.constprop.45+0x33a/0x3f0 [btrfs]
Aug 18 08:41:16 kernel: [462822.148393]  [<ffffffffc046673d>] extent_writepages+0x4d/0x70 [btrfs]
Aug 18 08:41:16 kernel: [462822.148404]  [<ffffffffc0449ec0>] ? btrfs_submit_direct+0x7a0/0x7a0 [btrfs]
Aug 18 08:41:16 kernel: [462822.148415]  [<ffffffffc0446eb8>] btrfs_writepages+0x28/0x30 [btrfs]
Aug 18 08:41:16 kernel: [462822.148417]  [<ffffffff811820be>] do_writepages+0x1e/0x40
Aug 18 08:41:16 kernel: [462822.148420]  [<ffffffff81215250>] __writeback_single_inode+0x40/0x220
Aug 18 08:41:16 kernel: [462822.148422]  [<ffffffff81215d73>] writeback_sb_inodes+0x263/0x430
Aug 18 08:41:16 kernel: [462822.148424]  [<ffffffff81215fdf>] __writeback_inodes_wb+0x9f/0xd0
Aug 18 08:41:16 kernel: [462822.148426]  [<ffffffff81216253>] wb_writeback+0x243/0x2c0
Aug 18 08:41:16 kernel: [462822.148429]  [<ffffffff81218813>] bdi_writeback_workfn+0x113/0x440
Aug 18 08:41:16 kernel: [462822.148432]  [<ffffffff8108db6f>] process_one_work+0x14f/0x400
Aug 18 08:41:16 kernel: [462822.148434]  [<ffffffff8108e308>] worker_thread+0x118/0x510
Aug 18 08:41:16 kernel: [462822.148436]  [<ffffffff8108e1f0>] ? rescuer_thread+0x3d0/0x3d0
Aug 18 08:41:16 kernel: [462822.148439]  [<ffffffff81093802>] kthread+0xd2/0xf0
Aug 18 08:41:16 kernel: [462822.148441]  [<ffffffff81093730>] ? kthread_create_on_node+0x1c0/0x1c0
Aug 18 08:41:16 kernel: [462822.148444]  [<ffffffff817b65d8>] ret_from_fork+0x58/0x90
Aug 18 08:41:16 kernel: [462822.148446]  [<ffffffff81093730>] ? kthread_create_on_node+0x1c0/0x1c0

Aug 18 08:41:16 kernel: [462822.148461] INFO: task kworker/u12:10:23202 blocked for more than 120 seconds.
Aug 18 08:41:16 kernel: [462822.149186]       Tainted: G         C     3.19.0-25-generic #26~14.04.1-Ubuntu
Aug 18 08:41:16 kernel: [462822.150010] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 18 08:41:16 kernel: [462822.150813] kworker/u12:10  D ffff880008e53b48     0 23202      2 0x00000000
Aug 18 08:41:16 kernel: [462822.150828] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Aug 18 08:41:16 kernel: [462822.150829]  ffff880008e53b48 ffff88075d4d1d70 0000000000013e80 ffff880008e53fd8
Aug 18 08:41:16 kernel: [462822.150831]  0000000000013e80 ffff88081beff5c0 ffff88075d4d1d70 ffff880008e53b58
Aug 18 08:41:16 kernel: [462822.150833]  ffff880008e53c80 7fffffffffffffff ffff880008e53c78 ffff88075d4d1d70
Aug 18 08:41:16 kernel: [462822.150835] Call Trace:
Aug 18 08:41:16 kernel: [462822.150838]  [<ffffffff817b22e9>] schedule+0x29/0x70
Aug 18 08:41:16 kernel: [462822.150841]  [<ffffffff817b53ac>] schedule_timeout+0x20c/0x280
Aug 18 08:41:16 kernel: [462822.150843]  [<ffffffff8108bdaa>] ? __queue_delayed_work+0xaa/0x1a0
Aug 18 08:41:16 kernel: [462822.150845]  [<ffffffff8108c121>] ? try_to_grab_pending+0xc1/0x170
Aug 18 08:41:16 kernel: [462822.150847]  [<ffffffff817b3014>] wait_for_completion+0xa4/0x170
Aug 18 08:41:16 kernel: [462822.150850]  [<ffffffff810a0a70>] ? wake_up_state+0x20/0x20
Aug 18 08:41:16 kernel: [462822.150852]  [<ffffffff8121559f>] writeback_inodes_sb_nr+0x7f/0xb0
Aug 18 08:41:16 kernel: [462822.150862]  [<ffffffffc042c530>] flush_space+0x430/0x4d0 [btrfs]
Aug 18 08:41:16 kernel: [462822.150871]  [<ffffffffc042c0b4>] ? can_overcommit+0xa4/0xf0 [btrfs]
Aug 18 08:41:16 kernel: [462822.150880]  [<ffffffffc042c73b>] btrfs_async_reclaim_metadata_space+0x16b/0x1f0 [btrfs]
Aug 18 08:41:16 kernel: [462822.150882]  [<ffffffff8108db6f>] process_one_work+0x14f/0x400
Aug 18 08:41:16 kernel: [462822.150884]  [<ffffffff8108e308>] worker_thread+0x118/0x510
Aug 18 08:41:16 kernel: [462822.150886]  [<ffffffff8108e1f0>] ? rescuer_thread+0x3d0/0x3d0
Aug 18 08:41:16 kernel: [462822.150889]  [<ffffffff81093802>] kthread+0xd2/0xf0
Aug 18 08:41:16 kernel: [462822.150891]  [<ffffffff81093730>] ? kthread_create_on_node+0x1c0/0x1c0
Aug 18 08:41:16 kernel: [462822.150894]  [<ffffffff817b65d8>] ret_from_fork+0x58/0x90
Aug 18 08:41:16 kernel: [462822.150896]  [<ffffffff81093730>] ? kthread_create_on_node+0x1c0/0x1c0

The whole machine was impacted; I'm looking for hints on how to avoid this. Thanks
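The kernel messages above come from the hung-task detector, and its behaviour can be inspected before a machine wedges completely. A minimal sketch, assuming a standard Linux /proc layout (knob availability and values vary per kernel build):

```shell
# The log above references hung_task_timeout_secs; these knobs control
# when the kernel reports (or panics on) a task stuck in uninterruptible
# sleep ("D" state) for too long.
HUNG_TASK_DIR=/proc/sys/kernel
for knob in hung_task_timeout_secs hung_task_panic; do
  f="$HUNG_TASK_DIR/$knob"
  # Not every kernel exposes both knobs, so check readability first.
  if [ -r "$f" ]; then
    printf '%s = %s\n' "$knob" "$(cat "$f")"
  fi
done
```

Setting `hung_task_panic` to 1 would turn these warnings into a panic (and a reboot with panic_on_oops configured), which some operators prefer over a silently wedged host; that is a trade-off, not a fix for the underlying btrfs issue.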

@GordonTheTurtle commented Aug 18, 2015

Hi!

Please read this important information about creating issues.

If you are reporting a new issue, make sure that we do not have any duplicates already open. You can ensure this by searching the issue list for this repository. If there is a duplicate, please close your issue and add a comment to the existing issue instead.

If you suspect your issue is a bug, please edit your issue description to include the BUG REPORT INFORMATION shown below. If you fail to provide this information within 7 days, we cannot debug your issue and will close it. We will, however, reopen it if you later provide the information.

This is an automated, informational response.

Thank you.

For more information about reporting issues, see https://github.com/docker/docker/blob/master/CONTRIBUTING.md#reporting-other-issues


BUG REPORT INFORMATION

Use the commands below to provide key information from your environment:

docker version:
docker info:
uname -a:

Provide additional environment details (AWS, VirtualBox, physical, etc.):

List the steps to reproduce the issue:
1.
2.
3.

Describe the results you received:

Describe the results you expected:

Provide additional info you think is important:

----------END REPORT ---------

#ENEEDMOREINFO

@Soulou Soulou changed the title Machine lockup with BTRFS backend on important load Machine lockup with BTRFS backend on important I/O load Aug 18, 2015
@juliantaylor (Contributor) commented Aug 18, 2015

So was the machine completely dead, or did it recover?
These backtraces are not unusual with btrfs under high load; it's usually just a task taking a long time, and it will recover. I regularly see these in my logs on a filesystem not used by Docker.

@Soulou (Contributor, Author) commented Aug 18, 2015

It did not recover at all; I had to hard-reboot the server. Docker was completely unresponsive, and some of the containers running on it were completely frozen/I/O-blocked.

@Soulou (Contributor, Author) commented Aug 18, 2015

I have some news about that: I think I've identified how such a lockup happened in my use case.

  • Heavy I/O
  • The OOM killer kills a process
  • Btrfs is left holding locked pages
  • The driver/kernel freezes
@Soulou Soulou changed the title Machine lockup with BTRFS backend on important I/O load Machine reproducible lockup with BTRFS backend on important I/O load Aug 18, 2015
@Soulou (Contributor, Author) commented Aug 18, 2015

OK, I eventually managed to reproduce the exact same situation. The image I was using was scalingo/builder:v18, but the issue has nothing to do with the image itself.

  1. I started 3 containers, each running a Java deployment (Maven download/build/package, etc.)
  2. Containers were limited to 512 MB of memory and 1024 MB of memory+swap
  3. All three deployments ran out of memory, because at least 2-3 GB is needed to complete the build
  4. The OOM killer started running and killed all of Maven's Java processes.
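The steps above can be sketched as a shell loop. The image name comes from this report; `build-command` is a placeholder for whatever memory-hungry build the container runs, so this is a sketch of the setup, not a verified reproduction script:

```shell
# Three concurrent containers, each capped at 512 MB RAM and 1024 MB
# RAM+swap, so builds needing 2-3 GB are OOM-killed under heavy I/O.
IMAGE="scalingo/builder:v18"
for i in 1 2 3; do
  # Guarded so the sketch is a no-op on hosts without Docker.
  if command -v docker >/dev/null 2>&1; then
    docker run -d -m 512m --memory-swap 1024m "$IMAGE" build-command
  fi
done
```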

OOM logs: http://pastebin.com/Lb4Gybay

  5. Then the btrfs workers froze immediately:

Here are all the logs, including a final SysRq-w at the end. From this point Docker was not working anymore; the BTRFS driver was "broken".

ps aux | grep ' D '
root      1643  0.0  0.0      0     0 ?        D    aug18   0:01 [kworker/u12:1]

Syslog + Sysrq-w logs: http://pastebin.com/qd1yB8u2
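The `grep ' D '` above can match unrelated lines; filtering on the STAT column itself is more reliable for spotting tasks stuck in uninterruptible sleep, like the kworker shown. A small sketch using standard ps/awk:

```shell
# List tasks whose state begins with "D" (uninterruptible sleep),
# i.e. processes blocked on I/O; NR > 1 skips the header row.
ps -eo stat,pid,comm | awk 'NR > 1 && $1 ~ /^D/ { print }'
```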

I'll post on the BTRFS mailing list; if you have any ideas, I'll take them!

@Soulou Soulou changed the title Machine reproducible lockup with BTRFS backend on important I/O load Reproducible machine lockup with BTRFS backend on important I/O load Aug 19, 2015
@Soulou Soulou changed the title Reproducible machine lockup with BTRFS backend on important I/O load Reproducible machine lockup with BTRFS backend on OOM kill Sep 1, 2015
@mward29 commented Sep 29, 2016

We just ran into this in a rather harsh way, with an almost identical scenario. Any movement?

@etiennebec commented Sep 29, 2016

We were experiencing the same issue (high I/O, btrfs volume unresponsive, server completely frozen), and we finally switched to devicemapper (direct-lvm) to address this problem.

@Soulou (Contributor, Author) commented Jun 25, 2018

This is old and hasn't happened in a long time; I suppose a Linux update/Docker update did the trick.

@Soulou closed this Jun 25, 2018