
Issue Identified with Daemon and CentOS 7.4 #821

Closed

FrankSealover opened this issue Dec 21, 2017 · 8 comments

FrankSealover commented Dec 21, 2017

  • Panel or Daemon: Daemon
  • Version of Panel/Daemon: 0.4.3 (This issue was also reported with 0.4.5)
  • Server's OS: CentOS 7.4
  • Your Computer's OS & Browser: Windows 10 Pro & Google Chrome

Add Details Below:

  • Docker version: Docker version 17.11.0-ce, build 1caf76c
  • Kernel Version: 3.10.0-693.2.2.el7.x86_64

This issue was identified after experiencing stability problems with the daemon following roughly 90 days of uptime. Severe performance degradation took place and effectively rendered the daemon unusable without a reboot. Upon further investigation, and after a report from another host, the issue has been identified as much more severe and appears to be related to the way the daemon and Docker inter-operate with each other. It is worth noting the following:

  • This does not affect all nodes; it appears only after roughly 80-90 days of uptime.
  • The issue is not solely bound to the Docker update that users seem to have been having issues with.
  • It also seems to correlate with the number of active servers on a single node; it only occurs with roughly 25-35 servers on a single daemon (see the quick check sketched after this list).
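
For anyone trying to determine whether a node matches that window, a quick check with standard tooling (nothing Pterodactyl-specific is assumed here) might look like this:

```bash
# Rough check of whether a node falls in the reported risk window:
# long uptime plus a few dozen running containers.
uptime -p                    # how long the node has been up
docker ps -q | wc -l         # number of running containers on this daemon
# Have any soft lockups been logged yet? (requires read access to the log)
grep -c "soft lockup" /var/log/messages
```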

Kernel Logs from /var/log/messages

NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [kworker/2:0:23909]
NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [ksoftirqd/5:33]
NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [kworker/2:1:20761]
NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [java:16709]
NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [swapper/5:0]
NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [kworker/2:1:20761]
NMI watchdog: BUG: soft lockup - CPU#2 stuck for 24s! [kworker/2:1:20761]
NMI watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [swapper/3:0]
NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [du:15885]
NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/2:0]

It is worth noting that, throughout this time, the server is not loaded anywhere near the levels at which these messages would be expected. The server also becomes completely unresponsive.

Based on the logs, I have assumed this is primarily an issue between Docker and the daemon specifically (as seen here with [du:15885]; presumably the daemon uses du to determine the directory size of the container upon server initialization).
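
As a point of reference, a per-container disk-usage check of the kind assumed above would have roughly the following shape; the exact path and flags the daemon actually uses are not confirmed here, this is only an illustration using the container ID from this report:

```bash
# Hypothetical sketch of a per-container disk-usage check; the real daemon
# may walk a different directory with different options.
CONTAINER_ID=9cd527dd5e1a39ddc876e23563ac23e13244e42530eccbd9e3df1843d6433225
du -sb "/var/lib/docker/containers/${CONTAINER_ID}"
# A du walk over a large directory tree is exactly the kind of I/O-heavy
# task that could surface as [du:15885] in a soft lockup trace.
```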

It is also worth noting that Java is not installed on the host machine, so the java process in the traces must belong to a container, which rules out this being an issue with the host machine itself.

Upon further investigation and collaboration, we also discovered the following in the kernel logs from the time the server becomes unresponsive; it has been placed in a hastebin for easier readability:

https://hastebin.com/lituniledo.sql

Within this, the one thing that stands out most is the following line:

kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(935:9cd527dd5e1a39ddc876e23563ac23e13244e42530eccbd9e3df1843d6433225)

We decided to investigate further, as the issues appear right after Docker initializes a container.

Additional Information

df -h | grep 9cd527dd5e1
shm              64M     0   64M   0% /var/lib/docker/containers/9cd527dd5e1a39ddc876e23563ac23e13244e42530eccbd9e3df1843d6433225/shm

This is an active Docker container.
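
For anyone reproducing this, the hash inside the kmem_cache name can be mapped back to a container with standard Docker commands (the ID below is the one from this report):

```bash
# Confirm that the hash from the slab-cache warning is a live container ID.
docker ps --no-trunc | grep 9cd527dd5e1a39ddc876e23563ac23e13244e42530eccbd9e3df1843d6433225
# Short prefixes work too; print the container name and current state.
docker inspect --format '{{.Name}} {{.State.Status}}' 9cd527dd5e1
```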

I will be doing my own investigation into this to see if I can come up with any more information, and will post it here. However, this is an issue that I feel should be given some attention because it has caused stability problems.

@fabri2000779

I got this issue with Skynode as well on one of our nodes, and we are running Ubuntu 16.04 with the same kernel log. Our solution was a re-installation of the daemon, but recently we got an unresponsive machine again on another of our nodes.

@DaneEveritt added the bug and P: High labels Dec 21, 2017

FrankSealover commented Dec 21, 2017

> our solution was a re-installation of the daemon

A re-installation is not required.


DaneEveritt commented Dec 21, 2017

Yeah, this doesn't look like something that would be solved by reinstalling the daemon; it appears to be an issue with how Docker is being utilized somewhere.

@vipesz can you provide Docker logs from the times you have above, as well as any daemon logs if they appear to contain relevant information?

There appears to be a similar Docker issue, moby/moby#19758; maybe give that a read through and see if it matches what you're seeing?
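
For reference, on a systemd host the engine and kernel logs for a given window can usually be pulled as shown below; the times are just an example taken from the report above:

```bash
# Docker engine logs around the incident window (systemd hosts).
journalctl -u docker --since "2017-12-19 18:00" --until "2017-12-19 21:00" --no-pager
# Kernel messages for the same window, for correlation.
journalctl -k --since "2017-12-19 18:00" --until "2017-12-19 21:00" --no-pager
```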


FrankSealover commented Dec 22, 2017

@DaneEveritt

Kernel Log Analysis

**Dec 19 18:30:25** na03 kernel: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:2:8207]

Grepping through /var/log/messages*, I found the Docker container causing the issue, with this log output:

**Dec 19 18:30:25** na03 kernel: cache_from_obj: Wrong slab cache. kmalloc-256 but object is from kmem_cache(3532:ef171cc37418c03757281d3336e9574b4127d2fa20906aea14740c7d611b48d0)

Both of these timestamps are the same, which ties the two events together.
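
That correlation can be double-checked by grepping the archived logs for both messages at the shared timestamp, for example:

```bash
# Pull both the soft lockup and the slab-cache warning for the same second.
grep -h "Dec 19 18:30:25" /var/log/messages* | grep -E "soft lockup|cache_from_obj"
```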

A few moments later, for the same Docker container, this is the output Docker produced (as requested):

Dec 19 20:29:09 na03 dockerd: time="2017-12-19T20:29:09.171525074-05:00" level=info msg="Removing stale sandbox ebf910be49548a381045504c5326ced4cf911802d82e2464ac805754c7025438 (ef171cc37418c03757281d3336e9574b4127d2fa20906aea14740c7d611b48d0)"

This container has not been problematic on this specific node since, and has been stable for 3 days (this was after the cleanup process, which involved stopping the daemon, then stopping and rebuilding all server containers).

Update:

Additional Information

  • A separate problem arose on CentOS 7.4/7.3 based systems running EXT4 file systems with the overlayfs driver, causing several containers to fail to rebuild because the container remained mounted by a random rogue process. This has since been resolved by deploying nodes with XFS file systems.
  • Nodes with XFS are also experiencing this issue.
  • The issue referenced above (Docker hang after intensive run/remove operations, moby/moby#19758) is NOT exactly the same: that issue is reported on Ubuntu-based systems, and overlayfs does not appear to be as much of a culprit here as it is there. The only related entry in the log is the following:

Dec 19 20:28:45 na03 dockerd: time="2017-12-19T20:28:45-05:00" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." module=containerd type=io.containerd.snapshotter.v1

As indicated in the previous report:

Feb 22 06:56:59 ip-10-10-2-15 kernel: Modules linked in: xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter nf_nat nf_conntrack isofs **overlay** cirrus ttm drm_kms_helper drm ppdev 8250_fintek input_leds syscopyarea sysfillrect parport_pc sysimgblt i2c_piix4 parport fb_sys_fops pcspkr ceph libceph fscache ip_tables xfs libcrc32c ata_generic pata_acpi crct10dif_pclmul crc32_pclmul crc32c_intel aesni_intel lrw xen_blkfront xen_netfront gf128mul glue_helper ablk_helper ata_piix cryptd serio_raw libata floppy fjes sunrpc dm_mirror dm_region_hash dm_log dm_mod
Feb 22 06:56:59 ip-10-10-2-15 kernel: CPU: 12 PID: 8327 Comm: docker Not tainted 4.4.44-1.el7.elrepo.x86_64 #1
Feb 22 06:56:59 ip-10-10-2-15 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016


schrej commented Dec 22, 2017

Please use ``` around multi-line source code/logs for better formatting ;)

@parkervcp
Member

OverlayFS on CentOS is still very much experimental; I recommend against it. I have seen fewer issues with XFS and no issues with ZFS. This appears to be a Docker issue, though.


FrankSealover commented Dec 25, 2017

> OverlayFS on CentOS is still very much experimental; I recommend against it. I have seen fewer issues with XFS and no issues with ZFS. This appears to be a Docker issue, though.

I'll see if changing the storage driver to overlay2, along with a kernel upgrade, resolves this issue. I'll report back my findings after the holidays.
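
For context, a rough sketch of that change on CentOS 7 is below, assuming Docker is configured through /etc/docker/daemon.json and the newer kernel comes from ELRepo's kernel-ml package; note that switching the storage driver makes existing images and containers inaccessible until they are re-pulled and rebuilt:

```bash
# Sketch only: move Docker to the overlay2 storage driver and install a
# newer mainline kernel. This overwrites any existing daemon.json.
systemctl stop docker
cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "overlay2"
}
EOF
# Mainline kernel from ELRepo (exact release RPM version may differ).
yum -y install https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
yum --enablerepo=elrepo-kernel -y install kernel-ml
grub2-set-default 0
systemctl start docker
reboot
```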


FrankSealover commented Jan 13, 2018

Updated to overlay2 with XFS d_type enabled and kernel version 4.14; no problems since.
Feel free to close this issue or ask any questions you may have.

I suggest putting this into the documentation for future reference.

Docker Versions:
Docker version 17.11.0-ce, build 1caf76c
Docker version 17.12.0-ce, build c97c6d6
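
For anyone verifying the same setup, the relevant checks look roughly like this (assuming /var/lib/docker sits on the XFS volume in question):

```bash
# Confirm the storage driver, d_type support, backing filesystem and kernel.
docker info 2>/dev/null | grep -Ei "storage driver|supports d_type|backing filesystem"
xfs_info /var/lib/docker | grep ftype   # ftype=1 means d_type is enabled
uname -r                                # running kernel version
```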

@DaneEveritt removed the bug and P: High labels May 12, 2018