
node-problem-detector does not work if I change the kubelet hostname to the node IP #23

Closed
wangyumi opened this issue Jul 11, 2016 · 4 comments

wangyumi commented Jul 11, 2016

I am running k8s 1.3.0 with the Docker image node-problem-detector:v0.1.

I set kubelet --hostname-override to the node IP on each minion:

root@SZX1000116607:~# cat /etc/default/kubelet
KUBELET_OPTS='--hostname-override=10.22.109.119 --api-servers=http://10.22.109.119:8080,http://10.22.69.237:8080,http://10.22.117.82:8080 --pod-infra-container-image=xxxxx/kubernetes/pause:latest --cluster-dns=192.168.1.1  --cluster-domain=test1 --low-diskspace-threshold-mb=2048 --cert-dir=/var/run/kubelet --allow-privileged=true'

When I start node-problem-detector as a DaemonSet, I get a lot of error messages:

2016-07-11T07:49:17.310422203Z I0711 07:49:17.309761       1 kernel_monitor.go:86] Finish parsing log file: {WatcherConfig:{KernelLogPath:/log/kern.log} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:UnregisterNetDeviceIssue Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+}]}
2016-07-11T07:49:17.310510490Z I0711 07:49:17.310026       1 kernel_monitor.go:93] Got system boot time: 2016-07-08 01:40:54.310019188 +0000 UTC
2016-07-11T07:49:17.311878174Z I0711 07:49:17.311436       1 kernel_monitor.go:102] Start kernel monitor
2016-07-11T07:49:17.311907663Z I0711 07:49:17.311654       1 kernel_log_watcher.go:173] unable to parse line: "", can't find timestamp prefix "kernel: [" in line ""
2016-07-11T07:49:17.311926515Z I0711 07:49:17.311696       1 kernel_log_watcher.go:110] Start watching kernel log
2016-07-11T07:49:17.311942118Z I0711 07:49:17.311720       1 problem_detector.go:60] Problem detector started
2016-07-11T07:49:17.314020808Z 2016/07/11 07:49:17 Seeked /log/kern.log - &{Offset:0 Whence:0}
2016-07-11T07:49:18.355974460Z E0711 07:49:18.355712       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:19.314395255Z E0711 07:49:19.314110       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:20.314302138Z E0711 07:49:20.313982       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:21.315151897Z E0711 07:49:21.314849       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:22.313977681Z E0711 07:49:22.313834       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:23.313906286Z E0711 07:49:23.313619       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:24.314428608Z E0711 07:49:24.314141       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:25.316626549Z E0711 07:49:25.316326       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:26.314346471Z E0711 07:49:26.314142       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:27.313901894Z E0711 07:49:27.313759       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:28.314356686Z E0711 07:49:28.314198       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:29.314747334Z E0711 07:49:29.314450       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:30.314562756Z E0711 07:49:30.314235       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found

It seems node-problem-detector still uses the original OS hostname to look up the node, instead of the overridden hostname that is registered in etcd.

/wangyumi
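
For illustration, here is a minimal Go sketch of the suspected behavior described above (an assumption about what v0.1 does, not the actual node-problem-detector source): without any override, the agent falls back to os.Hostname(), which returns the OS hostname (e.g. SZX1000116591) rather than the node name registered with the API server (10.22.109.119):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Suspected behavior (illustration only): with no override configured,
	// fall back to the OS hostname. On this machine that is "SZX1000116591",
	// but the node object was registered as "10.22.109.119", so every
	// condition update fails with `nodes "SZX1000116591" not found`.
	name, err := os.Hostname()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot determine hostname:", err)
		os.Exit(1)
	}
	fmt.Println("updating conditions for node:", name)
}
```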


Random-Liu commented Jul 11, 2016

@wangyumi Node problem detector also needs a hostname-override flag. I'll add it soon.
Thanks for reporting the issue! :)

It seems node-problem-detector still uses the original OS hostname to look up the node, instead of the overridden hostname that is registered in etcd.

node-problem-detector needs the node name to access the node resource in etcd, but it would have to read that resource to learn the node name, so there is a chicken-and-egg problem here.
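
A minimal sketch of what such a flag could look like (the flag name and wiring here are illustrative assumptions, not the actual change): an explicit override avoids any API lookup, which is what breaks the chicken-and-egg dependency:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// hostnameOverride is an illustrative flag; the real flag added to
// node-problem-detector may be named or plumbed differently.
var hostnameOverride = flag.String("hostname-override", "",
	"If non-empty, use this as the node name instead of the OS hostname.")

func main() {
	flag.Parse()

	name := *hostnameOverride
	if name == "" {
		// Only fall back to the OS hostname when no override is given,
		// mirroring how kubelet's --hostname-override behaves.
		h, err := os.Hostname()
		if err != nil {
			fmt.Fprintln(os.Stderr, "cannot determine node name:", err)
			os.Exit(1)
		}
		name = h
	}
	fmt.Println("reporting conditions for node:", name)
}
```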

@Random-Liu

@wangyumi It seems that kube-proxy also has this problem if you don't manually change the salt configuration to override the hostname. Does kube-proxy work in your cluster after you override the hostname? :)


KeithTt commented Nov 2, 2017

I am hitting the same issue:

Nov  2 05:48:49 uy05-13 kubelet[29322]: E1102 05:48:49.083813   29322 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:390: Failed to list *v1.Node: Get https://192.168.5.42:6443/api/v1/nodes?fieldSelector=metadata.name%3D192.168.5.42&resourceVersion=0: dial tcp 192.168.5.42:6443: getsockopt: connection refused
Nov  2 05:48:49 uy05-13 kubelet[29322]: E1102 05:48:49.281283   29322 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:382: Failed to list *v1.Service: Get https://192.168.5.42:6443/api/v1/services?resourceVersion=0: dial tcp 192.168.5.42:6443: getsockopt: connection refused

I modified the kubelet unit file installed by kubeadm, but it does not work; the same errors keep appearing.


KeithTt commented Nov 2, 2017

Furthermore, it always dials TCP port 6443, but that port is not listening.
