Node hostname changes to the name of a k8s daemonset pod #70543

Closed
bhperry opened this issue Nov 1, 2018 · 16 comments

Comments

@bhperry commented Nov 1, 2018

What happened:
A few of the nodes in my cluster changed their transient hostname to the name of the ClamAV pod running on them. ClamAV mounts / to /host so that it can run scans, but in read-only mode, so I don't think it is directly changing the hostname. Only the transient hostname changed; /etc/hostname stays the same. This has happened on 3 nodes so far. I can change it back using hostnamectl, but there is no guarantee it will stay that way. I originally thought it might have something to do with DHCP (that is where my research about transient hostnames changing initially led me), but the dhcpcd client service is not even running on the nodes, and I don't see how it would pick up a pod name anyway.

Discovered this issue from an OSSEC (Wazuh) alert.
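
For context on why only the transient name moves: the transient hostname is just the kernel's hostname, and any privileged process that still shares the host's UTS namespace can overwrite it with sethostname(2). A minimal Go sketch of that suspected leak path (purely illustrative, not CRI-O's actual code; the target name is just the pod name that shows up in the uname output below):

```go
// Illustration only: how a pod name could end up as a node's transient
// hostname if something privileged sets a sandbox hostname without first
// unsharing the UTS namespace. Needs CAP_SYS_ADMIN to succeed.
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	podSandboxName := "clamav-2dr4d" // hypothetical sandbox/pod name

	// Inside a fresh UTS namespace this would only rename the container.
	// In the host's UTS namespace it renames the node itself, which is
	// exactly the symptom described above.
	if err := unix.Sethostname([]byte(podSandboxName)); err != nil {
		log.Fatalf("sethostname: %v", err)
	}
}
```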

What you expected to happen:
Hostname should not become the name of a pod.

How to reproduce it (as minimally and precisely as possible):
I set up a secondary test cluster using the same config, and so far its nodes have not hit this. The original cluster is running a ClamAV DaemonSet on CoreOS with CRI-O and Kubernetes 1.11.3. I'm not sure how to reproduce it deterministically: the first two nodes changed in one night, then later the next day another one triggered the OSSEC alert.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T18:02:47Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-09T17:53:03Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    AWS
  • OS (e.g. from /etc/os-release):
    NAME="Container Linux by CoreOS"
    ID=coreos
    VERSION=1688.5.3
    VERSION_ID=1688.5.3
    BUILD_ID=2018-04-03-0547
    PRETTY_NAME="Container Linux by CoreOS 1688.5.3 (Rhyolite)"
    ANSI_COLOR="38;5;75"
    HOME_URL="https://coreos.com/"
    BUG_REPORT_URL="https://issues.coreos.com"
    COREOS_BOARD="amd64-usr"
  • Kernel (e.g. uname -a):
    Linux clamav-2dr4d 4.14.32-coreos #1 SMP Tue Apr 3 05:21:26 UTC 2018 x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz GenuineIntel GNU/Linux
  • Install tools:
    Custom. Kubelet running as systemd unit, all other control plane components are kubelet manifest pods.
  • Others:

/kind bug

@neolit123 (Member) commented Nov 1, 2018

It would be great if you could try reproducing this with the latest stable release, 1.12.x.
/sig node

@k8s-ci-robot k8s-ci-robot added sig/node and removed needs-sig labels Nov 1, 2018

@bhperry (Author) commented Nov 1, 2018

My company just upgraded from 1.9.7 to 1.11.3, and we will not be doing another version upgrade for a little while (probably until 1.13). As I said, I haven't even been able to reproduce this with exactly the same cluster config, so trying 1.12 and not seeing the bug wouldn't really tell me anything. I will post an update if I do manage to reproduce it.

Any guidance you might have on this issue would be greatly appreciated. I'm just digging in the dark for potential causes at the moment.

@neolit123 (Member) commented Nov 1, 2018

I haven't seen reports about this before. Maybe someone from sig-node will have historical context.

@steven-sheehy commented Nov 15, 2018

I saw this happen as well in Kubernetes v1.11.4. In my case it changed to a random StatefulSet pod name.

@steven-sheehy commented Nov 19, 2018

@neolit123 This occurred again in a production environment. I was able to capture the kubelet logs from journalctl, and the output clearly shows the hostname changing. It changes to the name of the pod (edge-mongodb-1) that is crashing and appears in the logs. I can also tell you that the pod was crashing because CRI-O had pids_limit set too low, so the container could not create new pthreads after a while.

Nov 16 13:04:28 fsprdce1c02 kubelet[957]: E1116 13:04:28.756467     957 dns.go:180] CheckLimitsForResolvConf: Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
Nov 16 13:04:37 fsprdce1c02 kubelet[957]: E1116 13:04:37.452214     957 dns.go:131] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.3.10.11 10.3.10.12 10.110.10.13
Nov 16 13:04:39 fsprdce1c02 kubelet[957]: E1116 13:04:39.452200     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:04:39 fsprdce1c02 kubelet[957]: E1116 13:04:39.749313     957 remote_runtime.go:187] CreateContainer in sandbox "62f9a6586cb9bc9607f99726e319c1ed845c03cc6f363a1f052aadb213219ee8" from runtime service failed: rpc error: code = Unknown desc = container create failed: container_linux.go:330: creating new parent process caused "container_linux.go:1759: running lstat on namespace path \"/proc/21054/ns/ipc\" caused \"lstat /proc/21054/ns/ipc: no such file or directory\""
Nov 16 13:04:39 fsprdce1c02 kubelet[957]: E1116 13:04:39.749383     957 kuberuntime_manager.go:733] container start failed: CreateContainerError: container create failed: container_linux.go:330: creating new parent process caused "container_linux.go:1759: running lstat on namespace path \"/proc/21054/ns/ipc\" caused \"lstat /proc/21054/ns/ipc: no such file or directory\""
Nov 16 13:04:39 fsprdce1c02 kubelet[957]: E1116 13:04:39.749420     957 pod_workers.go:186] Error syncing pod c2b4c200-e92b-11e8-a646-0050568593cc ("edge-mongodb-1_production(c2b4c200-e92b-11e8-a646-0050568593cc)"), skipping: failed to "StartContainer" for "mongodb" with CreateContainerError: "container create failed: container_linux.go:330: creating new parent process caused \"container_linux.go:1759: running lstat on namespace path \\\"/proc/21054/ns/ipc\\\" caused \\\"lstat /proc/21054/ns/ipc: no such file or directory\\\"\"\n"
Nov 16 13:04:39 fsprdce1c02 kubelet[957]: I1116 13:04:39.452092     957 kuberuntime_manager.go:513] Container {Name:mongodb Image:mongo:3.2.18 Command:[mongod] Args:[--config=/data/configdb/mongod.conf --dbpath=/data/db --replSet=rs0 --port=27017 --bind_ip=0.0.0.0 --auth --keyFile=/data/configdb/key.txt] WorkingDir: Ports:[{Name:mongodb HostPort:0 ContainerPort:27017 Protocol:TCP HostIP:}] EnvFrom:[] Env:[] Resources:{Limits:map[cpu:{i:{value:1500 scale:-3} d:{Dec:<nil>} s:1500m Format:DecimalSI} memory:{i:{value:3221225472 scale:0} d:{Dec:<nil>} s:3Gi Format:BinarySI}] Requests:map[memory:{i:{value:3221225472 scale:0} d:{Dec:<nil>} s:3Gi Format:BinarySI} cpu:{i:{value:1500 scale:-3} d:{Dec:<nil>} s:1500m Format:DecimalSI}]} VolumeMounts:[{Name:datadir ReadOnly:false MountPath:/data/db SubPath: MountPropagation:<nil>} {Name:configdir ReadOnly:false MountPath:/data/configdb SubPath: MountPropagation:<nil>} {Name:workdir ReadOnly:false MountPath:/work-dir SubPath: MountPropagation:<nil>} {Name:default-token-dxzpm ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:&ExecAction{Command:[mongo --eval db.adminCommand('ping')],},HTTPGet:nil,TCPSocket:nil,},InitialDelaySeconds:30,TimeoutSeconds:4,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:&Probe{Handler:Handler{Exec:&ExecAction{Command:[mongo --eval db.adminCommand('ping')],},HTTPGet:nil,TCPSocket:nil,},InitialDelaySeconds:5,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Nov 16 13:04:39 fsprdce1c02 kubelet[957]: I1116 13:04:39.452261     957 kuberuntime_manager.go:757] checking backoff for container "mongodb" in pod "edge-mongodb-1_production(c2b4c200-e92b-11e8-a646-0050568593cc)"
Nov 16 13:04:40 fsprdce1c02 kubelet[957]: E1116 13:04:40.452310     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:04:47 fsprdce1c02 kubelet[957]: E1116 13:04:47.453519     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:04:49 fsprdce1c02 kubelet[957]: E1116 13:04:49.452197     957 dns.go:131] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.3.10.11 10.3.10.12 10.110.10.13
Nov 16 13:04:52 fsprdce1c02 kubelet[957]: E1116 13:04:52.452395     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:04:52 fsprdce1c02 kubelet[957]: I1116 13:04:52.452262     957 kuberuntime_manager.go:513] Container {Name:mongodb Image:mongo:3.2.18 Command:[mongod] Args:[--config=/data/configdb/mongod.conf --dbpath=/data/db --replSet=rs0 --port=27017 --bind_ip=0.0.0.0 --auth --keyFile=/data/configdb/key.txt] WorkingDir: Ports:[{Name:mongodb HostPort:0 ContainerPort:27017 Protocol:TCP HostIP:}] EnvFrom:[] Env:[] Resources:{Limits:map[memory:{i:{value:3221225472 scale:0} d:{Dec:<nil>} s:3Gi Format:BinarySI} cpu:{i:{value:1500 scale:-3} d:{Dec:<nil>} s:1500m Format:DecimalSI}] Requests:map[memory:{i:{value:3221225472 scale:0} d:{Dec:<nil>} s:3Gi Format:BinarySI} cpu:{i:{value:1500 scale:-3} d:{Dec:<nil>} s:1500m Format:DecimalSI}]} VolumeMounts:[{Name:datadir ReadOnly:false MountPath:/data/db SubPath: MountPropagation:<nil>} {Name:configdir ReadOnly:false MountPath:/data/configdb SubPath: MountPropagation:<nil>} {Name:workdir ReadOnly:false MountPath:/work-dir SubPath: MountPropagation:<nil>} {Name:default-token-dxzpm ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:&ExecAction{Command:[mongo --eval db.adminCommand('ping')],},HTTPGet:nil,TCPSocket:nil,},InitialDelaySeconds:30,TimeoutSeconds:4,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:&Probe{Handler:Handler{Exec:&ExecAction{Command:[mongo --eval db.adminCommand('ping')],},HTTPGet:nil,TCPSocket:nil,},InitialDelaySeconds:5,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:nil Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Nov 16 13:04:52 fsprdce1c02 kubelet[957]: I1116 13:04:52.452449     957 kuberuntime_manager.go:757] checking backoff for container "mongodb" in pod "edge-mongodb-1_production(c2b4c200-e92b-11e8-a646-0050568593cc)"
Nov 16 13:04:52 fsprdce1c02 kubelet[957]: W1116 13:04:52.627537     957 container.go:507] Failed to update stats for container "/libcontainer_21049_systemd_test_default.slice": open /sys/fs/cgroup/cpu,cpuacct/libcontainer_21049_systemd_test_default.slice/cpuacct.usage_percpu: no such file or directory, continuing to push stats
Nov 16 13:04:53 edge-mongodb-1 kubelet[957]: E1116 13:04:53.764055     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:04:54 edge-mongodb-1 kubelet[957]: E1116 13:04:54.453702     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:04:58 edge-mongodb-1 kubelet[957]: E1116 13:04:58.453070     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:04:58 edge-mongodb-1 kubelet[957]: E1116 13:04:58.454798     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:04:58 edge-mongodb-1 kubelet[957]: E1116 13:04:58.756806     957 dns.go:180] CheckLimitsForResolvConf: Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
Nov 16 13:05:11 edge-mongodb-1 kubelet[957]: E1116 13:05:11.454002     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:05:12 edge-mongodb-1 kubelet[957]: E1116 13:05:12.453992     957 dns.go:131] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.3.10.11 10.3.10.12 10.110.10.13
Nov 16 13:05:17 edge-mongodb-1 kubelet[957]: E1116 13:05:17.452281     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:05:19 edge-mongodb-1 kubelet[957]: E1116 13:05:19.453795     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com
Nov 16 13:05:27 edge-mongodb-1 kubelet[957]: E1116 13:05:27.452297     957 dns.go:131] Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.3.10.11 10.3.10.12 10.110.10.13
Nov 16 13:05:28 edge-mongodb-1 kubelet[957]: E1116 13:05:28.757075     957 dns.go:180] CheckLimitsForResolvConf: Resolv.conf file '/etc/resolv.conf' contains search line consisting of more than 3 domains!
Nov 16 13:05:29 edge-mongodb-1 kubelet[957]: E1116 13:05:29.452455     957 dns.go:121] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: production.svc.cluster.local svc.cluster.local cluster.local corp.kwiktrip.com kwiktrip.com dmz.kwiktrip.com

This bug causes major problems in the cluster: the node with the bad transient hostname gets marked NotReady and all of its pods are evicted. Unfortunately, I don't have the ability to reproduce this manually or try it on Kubernetes v1.12.
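
Until the root cause is found, the only mitigation I can think of is a crude node-level watchdog; here is a sketch (my own assumption, not an official fix) that restores the static hostname from /etc/hostname whenever the transient one drifts. It has to run directly on the node, or at least in the host UTS namespace with CAP_SYS_ADMIN:

```go
// Workaround sketch: periodically compare the kernel (transient) hostname
// with the static one in /etc/hostname and restore it if something has
// overwritten it.
package main

import (
	"bytes"
	"log"
	"os"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	want, err := os.ReadFile("/etc/hostname")
	if err != nil {
		log.Fatalf("read /etc/hostname: %v", err)
	}
	want = bytes.TrimSpace(want)

	for range time.Tick(30 * time.Second) {
		got, err := os.Hostname()
		if err != nil {
			log.Printf("hostname: %v", err)
			continue
		}
		if got != string(want) {
			log.Printf("transient hostname drifted to %q, restoring %q", got, want)
			if err := unix.Sethostname(want); err != nil {
				log.Printf("sethostname: %v", err)
			}
		}
	}
}
```

Separately, raising pids_limit under [crio.runtime] in crio.conf should at least stop the pthread-exhaustion crash loops that seem to precede the hostname flip (assuming your CRI-O version exposes that setting).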

@mrunalp (Contributor) commented Nov 20, 2018

We (CRI-O team) are looking into this. Thanks!

Please tag us or open an issue on the CRI-O issue tracker so we get notified more quickly. :)

@k8s-ci-robot (Contributor) commented Nov 20, 2018

@neolit123: Closing this issue.

In response to this:

thanks doesn't seem to be a kubeadm bug. @mrunalp

@bhperry @steven-sheehy

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123 (Member) commented Nov 20, 2018

/reopen
oops wrong issue. :]

@k8s-ci-robot k8s-ci-robot reopened this Nov 20, 2018

@k8s-ci-robot (Contributor) commented Nov 20, 2018

@neolit123: Reopened this issue.

In response to this:

/reopen
oops wrong issue. :]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bartelsb commented Dec 5, 2018

Hi @mrunalp, do you have an issue for this opened in the https://github.com/kubernetes-sigs/cri-o project? If so, please add a link to it so that we can follow the discussion and be alerted when the fix is available. Thanks!

@fejta-bot commented Mar 5, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@falencastro commented Mar 8, 2019

This is still happening on Kubernetes 1.13 / CRI-O 1.13. I too would like to know if there's an issue open with the CRI-O team so we can keep track.

@steven-sheehy commented Mar 8, 2019

@falencastro There doesn't appear to be one. You might want to create it.

@fejta-bot commented Apr 7, 2019

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@fejta-bot commented May 7, 2019

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor) commented May 7, 2019

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
