socat locks up when attempting port-forwarding #740

Closed
weikinhuang opened this Issue Oct 19, 2017 · 19 comments

Contributor

weikinhuang commented Oct 19, 2017

I deployed a single-node bootkube@v0.8.0 cluster on CoreOS@1548.2.0 with the kubelet-wrapper systemd unit, using --network-provider=experimental-calico.

When trying to use helm/tiller or kubectl port-forward, socat locks up the system at 100% CPU usage when using the kubelet image quay.io/coreos/hyperkube@v1.8.0_coreos.0 or quay.io/coreos/hyperkube@v1.8.1_coreos.0. However, if I downgrade just the kubelet image to quay.io/coreos/hyperkube@v1.7.8_coreos.0 and keep all the other components on v1.8.1, it works fine.

I don't really know whether this is an issue with bootkube, the CoreOS hyperkube image, or the regular hyperkube image, but I'm setting up my cluster with bootkube, which is why I'm asking here first.
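
Roughly, this is what I'm hitting (a sketch; pod name and port are placeholders):

kubectl port-forward <some-pod> 8080:8080 &
curl 127.0.0.1:8080
# the curl just hangs; on the node, a socat process under the kubelet pins one core at 100%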

Collaborator

dghubble commented Oct 19, 2017

I deployed v1.8.1 clusters (with Calico and CL 1520.6.0) on AWS, GCE, and bare-metal. Checking kubectl port-forward some-kubernetes-dashboard-pod -n kube-system 9090 works as usual for me on AWS and bare-metal.

I do see port-forward connections hang on GCE (edit: with both flannel and Calico), but I'm investigating some other issues there which are probably the root problem. I think the CPU spikes are a red herring; I can't reproduce them, unless perhaps your nodes were on the cusp of being overloaded.

Contributor

weikinhuang commented Oct 19, 2017

I don't think so. I was running htop, which showed I had plenty of resources free: socat took one full core (out of 4) at 100%, and I still had 2 GB of RAM left. I even rebuilt and redeployed a new VM (running on VMware) completely, and it still happened. However, another cluster of 5 VMs isn't showing the issue.

I was port-forwarding echoserver in the default namespace, but I don't think that mattered, since Helm was deployed in kube-system.

Contributor

weikinhuang commented Oct 19, 2017

On my single node cluster:

There are no logs for the kubelet (journalctl -f kubelet.service).

When I run the port-forward, it looks like it's connecting, but nothing actually happens:

kubectl port-forward -n foo-ns internal-pod 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080
Handling connection for 8080

However, on the server side:

[screenshot omitted]

Even after I kill the port-forward command, socat is still running on the kubelet.
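
A quick way to spot the leftover processes on the node (a sketch; exact output will vary):

ps -eo pid,ppid,%cpu,args | grep '[s]ocat'
# still lists socat children of the kubelet at ~100% CPU after the port-forward client is gone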

Contributor

weikinhuang commented Oct 20, 2017

I just added a second node to that cluster, and it's still broken.

Collaborator

dghubble commented Oct 20, 2017

What platform are you on? Also, your journalctl command requires a -u for the unit. We may be talking about different issues; it works as expected for me.
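
For example (assuming the unit is named kubelet.service):

journalctl -f -u kubelet.service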

Contributor

weikinhuang commented Oct 20, 2017

Yes, that was a mistype; I did run journalctl -f -u kubelet.service. I see the usual logs about container health, but nothing about networking or other issues.

Distro info:

$ cat /etc/os-release 
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1548.2.0
VERSION_ID=1548.2.0
BUILD_ID=2017-10-12-0514
PRETTY_NAME="Container Linux by CoreOS 1548.2.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

The socat process is inside the kubelet container. htop makes that a little hard to see, but in tree view it's definitely within the kubelet process tree.

This is how I'm running the kubelet:

[Unit]
Description=Kubelet via Hyperkube ACI
[Service]
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/run/kubelet-pod.uuid \
  --volume=resolv,kind=host,source=/etc/resolv.conf \
  --mount volume=resolv,target=/etc/resolv.conf \
  --volume var-lib-cni,kind=host,source=/var/lib/cni \
  --mount volume=var-lib-cni,target=/var/lib/cni \
  --volume opt-cni-bin,kind=host,source=/opt/cni/bin \
  --mount volume=opt-cni-bin,target=/opt/cni/bin \
  --volume var-log,kind=host,source=/var/log \
  --mount volume=var-log,target=/var/log"
EnvironmentFile=/etc/kubernetes/kubelet.env
ExecStartPre=/bin/mkdir -p /etc/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /srv/kubernetes/manifests
ExecStartPre=/bin/mkdir -p /etc/kubernetes/cni/net.d
ExecStartPre=/bin/mkdir -p /etc/kubernetes/checkpoint-secrets
ExecStartPre=/bin/mkdir -p /etc/kubernetes/inactive-manifests
ExecStartPre=/bin/mkdir -p /var/lib/cni
ExecStartPre=/bin/mkdir -p /opt/cni/bin
ExecStartPre=/usr/bin/bash -c "grep 'certificate-authority-data' /etc/kubernetes/kubeconfig | awk '{print $2}' | base64 -d > /etc/kubernetes/ca.crt"
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/run/kubelet-pod.uuid
ExecStart=/usr/lib/coreos/kubelet-wrapper \
  --allow-privileged \
  --anonymous-auth=false \
  --client-ca-file=/etc/kubernetes/ca.crt \
  --cluster_dns=10.3.0.10 \
  --cluster_domain=cluster.local \
  --cni-conf-dir=/etc/kubernetes/cni/net.d \
  --exit-on-lock-contention \
  --kubeconfig=/etc/kubernetes/kubeconfig \
  --lock-file=/var/run/lock/kubelet.lock \
  --network-plugin=cni \
  --node-labels=node-role.kubernetes.io/master,master=true \
  --pod-manifest-path=/etc/kubernetes/manifests \
  --eviction-hard=memory.available<5% \
  --eviction-soft=memory.available<7% \
  --eviction-soft-grace-period=memory.available=2m \
  --eviction-pressure-transition-period=5m \
  --require-kubeconfig
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/run/kubelet-pod.uuid
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
$ cat /etc/kubernetes/kubelet.env 
KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.8.1_coreos.0
Contributor

weikinhuang commented Oct 20, 2017

For a more concrete example: after deploying the bootkube cluster and running helm init.

On the node itself:

core@n101 ~ $ curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 49.8M  100 49.8M    0     0  4201k      0  0:00:12  0:00:12 --:--:-- 3328k
core@n101 ~ $ chmod +x kubectl
core@n101 ~ $ KUBECONFIG=/etc/kubernetes/kubeconfig ./kubectl get po --all-namespaces
NAMESPACE     NAME                                                   READY     STATUS    RESTARTS   AGE
kube-system   calico-node-6mdmd                                      2/2       Running   0          36m
kube-system   calico-node-gqkf6                                      2/2       Running   0          41m
kube-system   kube-apiserver-4z6ls                                   1/1       Running   0          41m
kube-system   kube-controller-manager-5dbc5d8c6b-96cwp               1/1       Running   0          41m
kube-system   kube-controller-manager-5dbc5d8c6b-9p2nw               1/1       Running   0          41m
kube-system   kube-dns-598c789574-h7p89                              3/3       Running   0          41m
kube-system   kube-proxy-4lxwh                                       1/1       Running   0          41m
kube-system   kube-proxy-hbbcz                                       1/1       Running   0          36m
kube-system   kube-scheduler-bdb68cdc-gjjlg                          1/1       Running   0          41m
kube-system   kube-scheduler-bdb68cdc-wn6j9                          1/1       Running   0          41m
kube-system   pod-checkpointer-bh9f2                                 1/1       Running   0          41m
kube-system   pod-checkpointer-bh9f2-n101.node.k8s.weikinhuang.com   1/1       Running   0          41m
kube-system   tiller-deploy-cffb976df-qwj98                          1/1       Running   0          34m
core@n101 ~ $ KUBECONFIG=/etc/kubernetes/kubeconfig ./kubectl port-forward -n kube-system tiller-deploy-cffb976df-qwj98 44134:44134
Forwarding from 127.0.0.1:44134 -> 44134
Handling connection for 44134

In another tab:

core@n101 ~ $ curl 127.0.0.1:44134
# just hangs

[screenshot omitted]


Edit: I just tried downgrading the kubelet to KUBELET_IMAGE_TAG=v1.7.8_coreos.2, which was built 2 days ago, and it seems to work. It looks like it's something with the 1.8.x hyperkube images.
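
For reference, the kubelet.env values for the working downgrade (same file as shown above, only the tag changed):

KUBELET_IMAGE_URL=quay.io/coreos/hyperkube
KUBELET_IMAGE_TAG=v1.7.8_coreos.2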

Collaborator

dghubble commented Oct 20, 2017

I can't reproduce the CPU spike issue, and port-forward is OK on my AWS, bare-metal, and DigitalOcean test clusters. Can you try using the upstream hyperkube gcr.io/google_containers/hyperkube:v1.8.1? Starting in v1.8, bootkube should be able to use either just fine; the quay.io/coreos/hyperkube patches are only relevant for Tectonic.

Another shot in the dark: socat has a fairly recent changelog fix for high CPU load when name-resolution problems occur, and hyperkube's socat is older. Can you resolve node names if you're using FQDNs?
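
For example, something like this from the node would show whether names resolve inside the kubelet's rkt pod (a sketch; the UUID comes from rkt list, and it assumes getent is present in the image):

sudo rkt list                                          # find the kubelet pod UUID
sudo rkt enter <kubelet-pod-uuid> getent hosts localhost
sudo rkt enter <kubelet-pod-uuid> getent hosts $(hostname -f)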

Contributor

weikinhuang commented Oct 20, 2017

I can't seem to start the gcr.io/google_containers/hyperkube:v1.8.1 image via kubelet-wrapper:

Oct 20 19:02:11 n101.example.com systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Oct 20 19:02:11 n101.example.com systemd[1]: Stopped Kubelet via Hyperkube ACI.
Oct 20 19:02:11 n101.example.com systemd[1]: Starting Kubelet via Hyperkube ACI...
Oct 20 19:02:11 n101.example.com rkt[2263]: rm: unable to resolve UUID from file: open /var/run/kubelet-pod.uuid: no such file or directory
Oct 20 19:02:11 n101.example.com rkt[2263]: rm: failed to remove one or more pods
Oct 20 19:02:11 n101.example.com systemd[1]: Started Kubelet via Hyperkube ACI.
Oct 20 19:02:11 n101.example.com kubelet-wrapper[2271]: + exec /usr/bin/rkt run --uuid-file-save=/var/run/kubelet-pod.uuid --volume=resolv,kind=host,source=/etc/resolv.conf --mount volume=resolv,target=/etc/resolv.conf --volume var-lib-cni,kind=host,source=/var/lib/cni --mount volume=var-lib-cni,target=/var/lib/cni --volume opt-cni-bin,kind=host,source=/opt/cni/bin --mount volume=opt-cni-bin,target=/opt/cni/bin --volume var-log,kind=host,source=/var/log --mount volume=var-log,target=/var/log --volume coreos-etc-kubernetes,kind=host,source=/etc/kubernetes,readOnly=false --volume coreos-etc-ssl-certs,kind=host,source=/etc/ssl/certs,readOnly=true --volume coreos-usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume coreos-var-lib-docker,kind=host,source=/var/lib/docker,readOnly=false --volume coreos-var-lib-kubelet,kind=host,source=/var/lib/kubelet,readOnly=false,recursive=true --volume coreos-var-log,kind=host,source=/var/log,readOnly=false --volume coreos-os-release,kind=host,source=/usr/lib/os-release,readOnly=true --volume coreos-run,kind=host,source=/run,readOnly=false --volume coreos-lib-modules,kind=host,source=/lib/modules,readOnly=true --mount volume=coreos-etc-kubernetes,target=/etc/kubernetes --mount volume=coreos-etc-ssl-certs,target=/etc/ssl/certs --mount volume=coreos-usr-share-certs,target=/usr/share/ca-certificates --mount volume=coreos-var-lib-docker,target=/var/lib/docker --mount volume=coreos-var-lib-kubelet,target=/var/lib/kubelet --mount volume=coreos-var-log,target=/var/log --mount volume=coreos-os-release,target=/etc/os-release --mount volume=coreos-run,target=/run --mount volume=coreos-lib-modules,target=/lib/modules --stage1-from-dir=stage1-fly.aci gcr.io/google_containers/hyperkube:v1.8.1 --exec=/kubelet -- --allow-privileged --anonymous-auth=false --client-ca-file=/etc/kubernetes/ca.crt --cluster_dns=10.3.0.10 --cluster_domain=cluster.local --cni-conf-dir=/etc/kubernetes/cni/net.d --exit-on-lock-contention --hostname-override=n101.example.com --kubeconfig=/etc/kubernetes/kubeconfig --lock-file=/var/run/
Oct 20 19:02:11 n101.example.com kubelet-wrapper[2271]: lock/kubelet.lock --network-plugin=cni --node-labels=node-role.kubernetes.io/master,master=true --pod-manifest-path=/etc/kubernetes/manifests '--eviction-hard=memory.available<5' '--eviction-soft=memory.available<7' --eviction-soft-grace-period=memory.available=2m --eviction-pressure-transition-period=5m --require-kubeconfig
Oct 20 19:02:12 n101.example.com kubelet-wrapper[2271]: run: discovery failed
Oct 20 19:02:12 n101.example.com systemd[1]: kubelet.service: Main process exited, code=exited, status=254/n/a
Oct 20 19:02:12 n101.example.com systemd[1]: kubelet.service: Unit entered failed state.
Oct 20 19:02:12 n101.example.com systemd[1]: kubelet.service: Failed with result 'exit-code'.

Edit: I got the image to run, but it's the same problem: it hangs with a full core at 100% usage.

My node's hostname is in a public DNS provider, and resolv.conf is as follows:

nameserver 8.8.8.8
nameserver 8.8.4.4
Collaborator

dghubble commented Oct 20, 2017

Oh, you have to use docker://gcr.io/google_containers/hyperkube:v1.8.1 and set the rkt argument --insecure-options=image; see https://coreos.com/rkt/docs/latest/running-docker-images.html.
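
With the env file and unit quoted above, that would look roughly like this (a sketch; --insecure-options=image goes into the rkt arguments, e.g. the RKT_RUN_ARGS value in the unit):

$ cat /etc/kubernetes/kubelet.env
KUBELET_IMAGE_URL=docker://gcr.io/google_containers/hyperkube
KUBELET_IMAGE_TAG=v1.8.1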

Contributor

weikinhuang commented Oct 20, 2017

I figured it out! It looks like inside the rkt pod for the kubelet, it can't resolve localhost. Your suggestion about FQDNs was the correct answer.

It looks like kubelet-wrapper is broken for me when running the hyperkube image as the kubelet image.

When running the hyperkube images with Docker:

core@n101 ~ $ docker run -it --rm quay.io/coreos/hyperkube:v1.8.1_coreos.0 cat /etc/hosts
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.2      dfb09a09e951

core@n101 ~ $ docker run -it --rm quay.io/coreos/hyperkube:v1.7.8_coreos.2 cat /etc/hosts
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
172.17.0.2      4fa5a0dfc07c

Running the image with kubelet-wrapper:

core@n101 ~ $ rkt list
UUID            APP             IMAGE NAME                                      STATE   CREATED         STARTED         NETWORKS
38d3d778        hyperkube       gcr.io/google_containers/hyperkube:v1.8.1       running 3 seconds ago   3 seconds ago
f420bb4a        etcd            quay.io/coreos/etcd:v3.2.9                      running 37 minutes ago  37 minutes ago

core@n101 ~ $ sudo rkt enter 38d3d778
enter: no command specified, assuming "/bin/bash"
# cat /etc/hosts
# 

However, when running with the quay.io/coreos/hyperkube:v1.7.8_coreos.2 image, the hosts file is correctly set:

core@n101 ~ $ rkt list
UUID            APP             IMAGE NAME                                      STATE   CREATED         STARTED         NETWORKS
37ea1711        hyperkube       quay.io/coreos/hyperkube:v1.7.8_coreos.2        running 1 minute ago    1 minute ago
f420bb4a        etcd            quay.io/coreos/etcd:v3.2.9                      running 43 minutes ago  43 minutes ago
core@n101 ~ $ sudo rkt enter 37ea1711
enter: no command specified, assuming "/bin/bash"
root@n101:/# cat /etc/hosts
127.0.0.1       localhost
::1             localhost ip6-localhost ip6-loopback
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters

root@n101:/# 
Contributor

weikinhuang commented Oct 20, 2017

Should I open an issue somewhere else? Or should I now mount /etc/hosts in the rkt options?

Collaborator

dghubble commented Oct 20, 2017

cc @euank

Contributor

euank commented Oct 20, 2017

@weikinhuang

I suspect the root cause of the hosts-file difference is this Kubernetes change: kubernetes/kubernetes#48535. rkt and Docker do indeed handle the existence or non-existence of such a file in the image quite differently.

Rather than bind-mounting /etc/hosts through, the rkt option --hosts-entry=host should copy over the hosts entries you have in /etc/hosts.

We'd be happy to take a PR against kubelet-wrapper in the coreos-overlay repo if you'd like to make that the default (note that you'll need to bump the ebuild revision too if you make such a PR).

Other fixes could be for the upstream hyperkube image to continue including a proper /etc/hosts file, or for the kubelet to use only 127.0.0.1 instead of relying on localhost resolving to it.

I might make that last change when I get time.

I do think that this is not a bootkube issue but rather a kubernetes+hyperkube or container-linux+kubelet-wrapper issue.
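
Concretely, in the kubelet.service unit quoted earlier, the flag would go inside the existing RKT_RUN_ARGS value (a sketch, reusing the arguments from that unit with one line added):

Environment="RKT_RUN_ARGS=--uuid-file-save=/var/run/kubelet-pod.uuid \
  --hosts-entry=host \
  --volume=resolv,kind=host,source=/etc/resolv.conf \
  --mount volume=resolv,target=/etc/resolv.conf \
  --volume var-lib-cni,kind=host,source=/var/lib/cni \
  --mount volume=var-lib-cni,target=/var/lib/cni \
  --volume opt-cni-bin,kind=host,source=/opt/cni/bin \
  --mount volume=opt-cni-bin,target=/opt/cni/bin \
  --volume var-log,kind=host,source=/var/log \
  --mount volume=var-log,target=/var/log"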

Contributor

weikinhuang commented Oct 20, 2017

Yep, I'll see if I can make a PR to coreos-overlay when I have a moment.

Contributor

weikinhuang commented Oct 20, 2017

I can confirm that adding --hosts-entry=host fixes the issue.
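
To double-check after restarting the kubelet with that flag (a sketch; the UUID comes from rkt list):

sudo rkt enter <kubelet-pod-uuid> cat /etc/hosts   # now shows the localhost entries copied from the host
# and the kubectl port-forward / curl test above responds instead of hanging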

Contributor

weikinhuang commented Oct 20, 2017

Will open a PR on coreos-overlay

Collaborator

dghubble commented Oct 20, 2017

This actually does have some bearing on the GCE port-forward issue I was describing. I never found any CPU spikes, but rkt-entering the kubelet-wrapper kubelet does show that the /etc/hosts file is empty.

This does not cause a port-forward problem on AWS, DO, and bare-metal (at least on my networks) because the upstream network DNS resolves localhost queries to 127.0.0.1 for the kubelet despite the missing /etc/hosts. On Google Cloud specifically, their resolver does not, so port-forward doesn't work.

The fix to the rkt args is the same and should be applied on all platforms.

Contributor

weikinhuang commented Oct 20, 2017

Yes, that's also why it didn't break my other cluster: localhost is defined on my internal network's DNS server.

dghubble added a commit to coreos/matchbox that referenced this issue Oct 21, 2017

Add hosts-entry=host to kubelet-wrapper rkt args
* Kubernetes v1.8.x hyperkube no longer has its own /etc/hosts
* Fixes potential port-forward issues on networks which do not resolve localhost for nodes
* See kubernetes-incubator/bootkube#740