Kubernetes nodes lose InternalIP and ExternalIP temporarily #280

Closed
stieler-it opened this issue Sep 1, 2018 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@stieler-it

stieler-it commented Sep 1, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: InternalIP and ExternalIP are lost for some nodes (in our case: 2 of 5 are broken). I am not sure if this is related to the cloud provider.

>kubectl describe node intern-master1
...
Addresses:
  Hostname:  intern-master1

>kubectl describe node intern-worker1
...
Addresses:
  InternalIP:  10.0.0.15
  ExternalIP:  (hidden)
  Hostname:    intern-worker1

This leads to problems, e.g. when trying to create an L4 load balancer:
Error creating load balancer (will retry): failed to ensure load balancer for service gitlab/gitlab-nginx-ingress-controller: error getting address for node intern-master1: no address found for host
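
For reference, one quick way to see which nodes are missing address entries is a jsonpath query (a sketch using plain kubectl, nothing cluster-specific assumed):

  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[*].type}{"\n"}{end}'

On an affected node this prints only Hostname, while healthy nodes also list InternalIP and ExternalIP.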

What you expected to happen: I expect both InternalIP and ExternalIP to stay set as long as the infrastructure doesn't change.

How to reproduce it (as minimally and precisely as possible): I can't tell; it just "happened". Last week it worked, today it doesn't.

Anything else we need to know?: I saw a similar bug report here: Azure/acs-engine#3503

Environment:

  • openstack-cloud-controller-manager version: N/A
  • OS (e.g. from /etc/os-release): RancherOS v1.4.0
  • Kernel (e.g. uname -a): Linux intern-worker1 4.14.32-rancher2 #1 SMP Fri May 11 11:30:31 UTC 2018 x86_64 GNU/Linux
  • Install tools: Rancher 2.0.x with RKE
  • Others: Kubernetes: Client v1.11.2, Server v1.11.1
@k8s-ci-robot added the kind/bug label on Sep 1, 2018
@stieler-it
Author

stieler-it commented Sep 4, 2018

After starting new nodes a few days ago to replace the broken ones, I noticed that the state has changed:

  • master1 had no internal IP two days ago, but now shows it again
  • master2 was started a few days ago and has now lost its IP addresses
  • master3 was started a few days ago and still has its IP addresses

So I thought: maybe a failing refresh task that can fix itself?

And in fact, on the failed node master2 I saw the following errors (master3 has no such messages):

E0901 10:08:10.638009    4644 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "intern-master2": Get https://127.0.0.1:6443/api/v1/nodes/intern-master2?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0901 10:08:10.638283    4644 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "intern-master2": Get https://127.0.0.1:6443/api/v1/nodes/intern-master2?timeout=10s: write tcp 127.0.0.1:47466->127.0.0.1:6443: use of closed network connection
(last error repeated at the same time daily for some days?)
E0901 10:08:10.639301    4644 server.go:222] Unable to authenticate the request due to an error: Post https://127.0.0.1:6443/apis/authentication.k8s.io/v1beta1/tokenreviews: read tcp 127.0.0.1:47466->127.0.0.1:6443: use of closed network connection; some request body already written
E0901 10:08:41.067689    4644 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "intern-master2": Get https://127.0.0.1:6443/api/v1/nodes/intern-master2?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
(last error repeated at the same time daily for some days?)
I0901 10:24:09.045364    4644 logs.go:49] http: TLS handshake error from 10.0.0.13:42112: EOF

OK, but the really interesting message has been showing up repeatedly since yesterday:

W0903 17:35:57.479608    4644 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Internal Server Error
W0903 17:38:36.630387    4644 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Internal Server Error
W0903 22:02:39.594404    4644 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s

On a node without problems I found a message that points to an unavailable OpenStack API:
Expected HTTP response code [200 204 300] when accessing [GET https://compute.our.openstack.provider/v2.1/servers/detail?name=%5Eintern-master3%24], but got 502 instead

This seems related to kubernetes/kubernetes#46969 - but instead of the node being marked as "NotReady" it loses required attributes.

Workaround: docker restart the kubelet container.
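
A rough sketch of that workaround (the container name kubelet is RKE's default; on systemd-managed installs the equivalent is a service restart):

  docker restart kubelet        # RKE / RancherOS: kubelet runs as a container
  systemctl restart kubelet     # installs where kubelet is a systemd service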

@stieler-it changed the title from "Kubernetes node loses InternalIP and ExternalIP after a while" to "Kubernetes nodes lose InternalIP and ExternalIP temporarily" on Sep 4, 2018
@stieler-it
Author

I just got word from someone else that they did not see this issue with K8s 1.10.5, but started seeing it after upgrading to 1.11.1.

@alexandrem

We have experienced this as well on some internal clusters running 1.11.1.

@stieler-it
Author

Looks like there might have been changes to the OpenStack cloud provider in K8s 1.11.x: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#sig-openstack

@FengyunPan2
Contributor

/cc

@zetaab
Member

zetaab commented Sep 8, 2018

We have this issue in OpenShift 3.9 as well (Kubernetes 1.9). We are fixing nodes by deleting them and restarting the kubelet; after that, the addresses come back. However, the weird thing is why this happens at all: I tried, for instance, blocking access to the OpenStack API with a firewall, but the IP addresses were still there.
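
Roughly, that fix-up looks like the following (a sketch; the node name is a placeholder, and it assumes the kubelet re-registers the Node object with fresh addresses on restart):

  kubectl delete node <node-name>   # drop the stale Node object
  systemctl restart kubelet         # kubelet re-registers the node on startup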

@desaintmartin
Member

desaintmartin commented Sep 26, 2018

I have been experiencing a related issue since upgrading to 1.11 (1.10.x -> 1.11.3), using OpenStack.

I get repeating logs on a few machines (interestingly, only on about a third of the cluster's nodes):
Sep 26 08:45:57 kubernetesnode16 kubelet[29922]: W0926 08:45:57.215477 29922 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider that matches ip: 51.38.44.227

That seems to cause some intermittent network outage within the cluster (kubedns on this machine is sometimes inaccessible).

I STILL have all ExternalIP/InternalIP set.

Restarting kubelet solves the problem (temporarily? Let's see).

It looks like this has the same cause, which is why I'm posting here instead of creating a new issue.

Also note that during the upgrade we switched from iptables to ipvs. It should not be related, but I'm noting it nonetheless.

Edit: network issues do not seem related to this problem.

@alexandrem

They might have fixed this bug in kubernetes/kubernetes#65226.

Haven't tested the new code yet.

@mitchellmaler

A workaround would be to use the config drive and set the metadata search order so that only the config drive is consulted. This way it won't query the API and fail.
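
A sketch of what that looks like in cloud.conf, assuming the in-tree OpenStack provider's [Metadata] section (whether the setting is actually honored is discussed further down):

  [Metadata]
  search-order = configDrive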

@openstacker
Contributor

We have the same issue with cloud-provider-openstack and K8s v1.11.2.

@openstacker
Contributor

@mitchellmaler Hi Mitchell, would you mind explaining the workaround in more detail? Are you talking about this

  [Metadata]
  search-order = configDrive

in cloud-config? Thanks.

@mitchellmaler

@openstacker I tried that and thought it was working, but after a few days I hit the same issue again.
I was premature :)

@mitchellmaler

Still just hoping to get the fix backported to 1.11

@openstacker
Contributor

openstacker commented Oct 23, 2018

Ah, then yes, that perfectly matches what we just discussed.

(11:12:22) flwang: i have asked in  https://github.com/kubernetes/kubernetes/pull/65226 to backport this to v1.11
(11:12:37) flwang: and many people want that in v1.11 as well
(11:13:36) strigazi: I think because of this https://github.com/kubernetes/kubernetes/blob/13705ac81e00f154434b5c66c1ad92ac84960d7f/pkg/cloudprovider/providers/openstack/openstack_volumes.go#L503
(11:13:47) strigazi: it always uses the metadata service
(11:14:09) strigazi: it says in the comments: We're avoiding using cached metadata (or the configdrive),
(11:14:27) strigazi: relying on the metadata service.
(11:15:29) flwang: so though there is a config, the code just always skip it?
(11:16:05) strigazi: that's my understanding

@mitchellmaler

@openstacker Oh wow! That is good to know

@stieler-it
Author

We upgraded to K8s v1.11.5 a few days ago and the bug seems to be gone.

@prein

prein commented Feb 8, 2019

I'm experiencing this in v1.11.5. According to kubernetes/kubernetes#68270 (comment), the fix was merged into v1.11.6, not v1.11.5.
