
High CPU Utilization Openshift process (Origin v1.4) #14428

Closed
olsib opened this issue May 31, 2017 · 21 comments
Labels
area/performance, component/storage, kind/bug, priority/P2

Comments


olsib commented May 31, 2017

Abnormal OpenShift node process utilization.
We're experiencing unusually high CPU utilization on a few of our OpenShift nodes, sometimes consuming more than half of the node's CPU (see the stats provided below for details).
There is no clear evidence as to where the load within the origin-node container comes from. Pods are spread evenly across nodes, so we would expect roughly the same utilization on all nodes.
Our non-prod environments are running OpenShift Origin 1.4.
Prod is running OCP v3.4. We need to make sure this bug is not replicated in OCP before we do any deployments there, so your quick assistance would be appreciated.

Version

oc v1.4.0+208f053
kubernetes v1.4.0+776c994
features: Basic-Auth GSSAPI Kerberos SPNEGO

Steps To Reproduce
  1. Unknown; we can't reproduce this on other nodes.
Current Result

Container name:
openshift/node:v1.4.0

Docker stats from Node experiencing issue:
CONTAINER      CPU %     MEM USAGE / LIMIT       MEM %   NET I/O     BLOCK I/O             PIDS
edfa59a0981a   226.59%   661.9 MiB / 27.79 GiB   2.33%   0 B / 0 B   487.6 MB / 707.9 MB   31

Expected Result

Docker stats from a healthy node:
CONTAINER      CPU %     MEM USAGE / LIMIT       MEM %   NET I/O     BLOCK I/O             PIDS
9c5325114fe1   10.04%    231.4 MiB / 27.79 GiB   0.81%   0 B / 0 B   261.5 MB / 11.51 GB   36
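
One-shot snapshots like the two above can be captured with the non-streaming form of docker stats:

docker stats --no-stream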

Additional Information

Deployment type: Containerized
Docker version: 1.12.5
OS: Red Hat 7.3 (kernel 3.10.0-514.6.1.el7.x86_64)

Cluster Info: 3 Master + 12 Nodes
Node Resources: 4 CPU/30.5 GB of RAM
Average Pods/Node: ~50

Faulty node docker info:
docker info
Server Version: 1.12.5
Storage Driver: devicemapper
Pool Name: docker--vg-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 25.48 GB
Data Space Total: 85.46 GB
Data Space Available: 59.98 GB
Metadata Space Used: 11.42 MB
Metadata Space Total: 218.1 MB
Metadata Space Available: 206.7 MB
Thin Pool Minimum Free Space: 8.546 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: host bridge null overlay
Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.16.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 4
Total Memory: 27.79 GiB
ID: W6NH:D2OP:JCBO:SLNW:Q4FX:FUOJ:2JIS:L5SU:IBNY:RZI2:ORCH:6KR2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
Insecure Registries:
172.30.0.0/16
127.0.0.0/8
Registries: registry.access.redhat.com (secure), docker.io (secure)

Healthy node docker info:
Server Version: 1.12.5
Storage Driver: devicemapper
Pool Name: docker--vg-docker--pool
Pool Blocksize: 524.3 kB
Base Device Size: 10.74 GB
Backing Filesystem: xfs
Data file:
Metadata file:
Data Space Used: 41.87 GB
Data Space Total: 214.5 GB
Data Space Available: 172.6 GB
Metadata Space Used: 19.18 MB
Metadata Space Total: 109.1 MB
Metadata Space Available: 89.87 MB
Thin Pool Minimum Free Space: 21.45 GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: false
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: systemd
Plugins:
Volume: local
Network: bridge null host overlay
Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp selinux
Kernel Version: 3.10.0-514.10.2.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 4
Total Memory: 27.79 GiB
ID: AJCW:VLRA:Z3PB:VKGN:RMJ3:DKD6:32EN:NDRC:EEIW:4KII:DK6H:STPG
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
Insecure Registries:
172.30.0.0/16
127.0.0.0/8
Registries: registry.access.redhat.com (secure), docker.io (secure)

stevekuznetsov added the area/performance and kind/bug labels on May 31, 2017
stevekuznetsov (Contributor) commented:

/cc @smarterclayton

smarterclayton (Contributor) commented:

If you are using OCP, please file a bug so we can track it. Many fixes are in OCP 3.4 that are not in Origin 1.4.

You'd probably need to grab a profile from the node process to identify the issue.


olsib commented Jun 1, 2017

I added the following to the /etc/sysconfig/origin-node config:
OPENSHIFT_PROFILE=web
and then restarted origin-node, but when I try to capture the profile with:

go tool pprof /usr/bin/openshift http://127.0.0.1:6060/debug/pprof/profile

I get:
Local symbolization failed for openshift: open /usr/bin/openshift: no such file or directory

probably because origin-node is running in a container, so pprof can't find the openshift binary locally.
Any idea how to get a profile for an origin node running in a container?
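
A minimal sketch of two ways around the missing local binary, assuming the node container is named origin-node and that the go toolchain is present inside the image:

# Option 1: download the raw 30-second CPU profile; the binary is only
# needed later, for symbolization.
curl -o node.pprof http://127.0.0.1:6060/debug/pprof/profile

# Option 2: run pprof inside the node container, where /usr/bin/openshift exists.
docker exec -it origin-node \
    go tool pprof /usr/bin/openshift http://127.0.0.1:6060/debug/pprof/profile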


olsib commented Jun 1, 2017

It turned out I had to run pprof from within the origin-node container. Please download the profile from the following URL:
http://tinyurl.com/y7gnytb8


smarterclayton commented Jun 1, 2017

Thanks, will take a look

smarterclayton (Contributor) commented:

----------------------------------------------------------+-------------
                                            19.70s   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet.(*Kubelet).cleanupOrphanedPodDirs
         0     0%     0%     19.70s 72.53%                | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet.(*Kubelet).getPodVolumeNameListFromDisk
                                            19.68s 99.90% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util.ReadDirNoExit
                                             0.02s   0.1% |   io/ioutil.ReadDir
----------------------------------------------------------+-------------
                                            19.68s   100% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/kubelet.(*Kubelet).getPodVolumeNameListFromDisk
     0.21s  0.77%  0.77%     19.68s 72.46%                | github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util.ReadDirNoExit
                                            15.56s 80.21% |   os.Lstat
                                             2.33s 12.01% |   os.(*File).Readdirnames
                                             1.44s  7.42% |   runtime.concatstring3
                                             0.07s  0.36% |   runtime.makeslice
----------------------------------------------------------+-------------
                                            15.56s 99.42% |   github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util.ReadDirNoExit
                                             0.09s  0.58% |   os.(*File).readdir
     0.06s  0.22%  0.99%     15.65s 57.62%                | os.Lstat
                                            13.52s 86.95% |   syscall.Lstat
                                             1.44s  9.26% |   runtime.newobject
                                             0.59s  3.79% |   os.fillFileStatFromSys
----------------------------------------------------------+-------------
                                            13.52s   100% |   os.Lstat
     0.09s  0.33%  1.33%     13.52s 49.78%                | syscall.Lstat
                                            11.46s 85.33% |   syscall.Syscall
                                             1.97s 14.67% |   syscall.BytePtrFromString
----------------------------------------------------------+-------------
                                            11.46s 91.97% |   syscall.Lstat
                                                1s  8.03% |   syscall.Getdents
    11.79s 43.41% 44.73%     12.47s 45.91%                | syscall.Syscall
                                             0.37s 54.41% |   runtime.exitsyscall
                                             0.31s 45.59% |   runtime.entersyscall
----------------------------------------------------------+-------------

@childsb @rootfs did we fix a CPU issue in 1.4/3.4 around cleanup of orphan pod dirs where we processed way too many? Basically we're spending all of our time in syscalls to read the volumes directory.
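
A minimal way to corroborate that the time really is going to filesystem syscalls, assuming you can attach strace to the node process on the host (<node-pid> is a placeholder):

# Tally syscalls for a while, then Ctrl-C to print the per-call summary.
strace -c -f -p <node-pid>

A summary dominated by lstat and getdents calls would match the profile above.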

@olsib can you count how many mounts, files, and symlinks you have under your openshift volumes directory? An abnormal number, or a recursive set of links could cause this.
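
A minimal sketch of one way to gather those counts, using the volumes path from the containerized deployment:

find /var/lib/origin/openshift.local.volumes/pods -type d | wc -l   # directories
find /var/lib/origin/openshift.local.volumes/pods -type f | wc -l   # regular files
find /var/lib/origin/openshift.local.volumes/pods -type l | wc -l   # symlinks
mount | wc -l                                                       # active mounts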


olsib commented Jun 2, 2017

Below are recursive counts of what's under the /var/lib/origin/openshift.local.volumes/pods dir:

  • 318314 dirs
  • 235497 files
  • 279 symlinks

Also, 79 active mounts in total (as returned by the mount command).


rootfs commented Jun 7, 2017

@olsib can you upload the node log?


olsib commented Jun 8, 2017

@rootfs please download it here:
http://tinyurl.com/ybpltytb


rootfs commented Jun 9, 2017

@olsib this looks similar to kubernetes/kubernetes#28750 and kubernetes/kubernetes#45935

# grep "Error: UnmountVolume.TearDown failed for volume" messages |grep "device or resource busy" |wc -l
164854
# grep "Error: UnmountVolume.TearDown failed for volume" messages |grep "device or resource busy" |grep deleting~ |wc -l
159737


olsib commented Jun 9, 2017

@rootfs Was this supposed to be fixed in 1.4? We're getting those errors on another cluster as well, which is running the same version (1.4).


rootfs commented Jun 9, 2017

@olsib "Please file a bug so we can track it if you are using OCP. Many fixes are in OCP 3.4 that are not in Origin 1.4."


olsib commented Jun 9, 2017

@rootfs we are using Origin; does that mean we won't get a fix for this?

stevekuznetsov (Contributor) commented:

I'm not sure if we have an authoritative stance on this, but at this point Origin 1.4 updates will be limited to massive functionality issues and/or serious security issues.


olsib commented Jun 9, 2017

@stevekuznetsov is this particular issue fixed in any version of Origin?

stevekuznetsov (Contributor) commented:

@childsb would know what changes introduced the fix, from which we can determine which releases of Origin contain it.


childsb commented Jun 14, 2017

The fix is here: #14549


olsib commented Jun 14, 2017

Thanks @childsb. @stevekuznetsov so it seems that, in the best case, we will have to wait for Origin v1.6? Did you replicate this in OCP 3.4 as well?

stevekuznetsov (Contributor) commented:

Yes, 1.6 will be the first Origin release with this fix.


olsib commented Aug 8, 2017

Is the fix for this pulled into Origin v3.6.0?

stevekuznetsov (Contributor) commented:

$ git tag --contains 8cd58f56901daa453bcadf72d4a5710e23e6b154
v3.6.0
v3.6.0-rc.0
v3.7.0-alpha.0

Looks like yes.
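
For a single release, an equivalent check is git merge-base, which succeeds only if the commit is an ancestor of the tag:

git merge-base --is-ancestor 8cd58f56901daa453bcadf72d4a5710e23e6b154 v3.6.0 && echo "fix is in v3.6.0"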
