Kubernetes nodes lose InternalIP and ExternalIP temporarily #280

Closed
stieler-it opened this issue Sep 1, 2018 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@stieler-it

stieler-it commented Sep 1, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: InternalIP and ExternalIP are lost for some nodes (in our case: 2 of 5 are broken). I am not sure if this is related to the cloud provider.

>kubectl describe node intern-master1
...
Addresses:
  Hostname:  intern-master1

>kubectl describe node intern-worker1
...
Addresses:
  InternalIP:  10.0.0.15
  ExternalIP:  (hidden)
  Hostname:    intern-worker1

This leads to problems, e.g. when trying to create an L4 load balancer:
Error creating load balancer (will retry): failed to ensure load balancer for service gitlab/gitlab-nginx-ingress-controller: error getting address for node intern-master1: no address found for host
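
For reference, one quick way to see which nodes are missing address entries is a jsonpath query (a sketch using plain kubectl, nothing cluster-specific assumed):

  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[*].type}{"\n"}{end}'

On an affected node this prints only Hostname, while healthy nodes also list InternalIP and ExternalIP.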

What you expected to happen: I expect both InternalIP and ExternalIP to stay set as long as the infrastructure doesn't change.

How to reproduce it (as minimally and precisely as possible): I can't tell; it just "happened". Last week it worked, today it doesn't.

Anything else we need to know?: I saw a similar bug report here: Azure/acs-engine#3503

Environment:

  • openstack-cloud-controller-manager version: N/A
  • OS (e.g. from /etc/os-release): RancherOS v1.4.0
  • Kernel (e.g. uname -a): Linux intern-worker1 4.14.32-rancher2 #1 SMP Fri May 11 11:30:31 UTC 2018 x86_64 GNU/Linux
  • Install tools: Rancher 2.0.x with RKE
  • Others: Kubernetes: Client v1.11.2, Server v1.11.1
@k8s-ci-robot added the kind/bug label on Sep 1, 2018
@stieler-it
Author

stieler-it commented Sep 4, 2018

After starting new nodes a few days ago to replace the broken ones, I noticed that the state has changed:

  • master1 had no internal IP two days ago, but now shows it again
  • master2 was started a few days ago and has now lost its IP addresses
  • master3 was started a few days ago and still has its IP addresses

So I thought: maybe a failing refresh task that can fix itself?

And in fact, on the failed node master2 I saw the following errors (master3 has no such messages):

E0901 10:08:10.638009    4644 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "intern-master2": Get https://127.0.0.1:6443/api/v1/nodes/intern-master2?timeout=10s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0901 10:08:10.638283    4644 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "intern-master2": Get https://127.0.0.1:6443/api/v1/nodes/intern-master2?timeout=10s: write tcp 127.0.0.1:47466->127.0.0.1:6443: use of closed network connection
(last error repeated at the same time daily for some days?)
E0901 10:08:10.639301    4644 server.go:222] Unable to authenticate the request due to an error: Post https://127.0.0.1:6443/apis/authentication.k8s.io/v1beta1/tokenreviews: read tcp 127.0.0.1:47466->127.0.0.1:6443: use of closed network connection; some request body already written
E0901 10:08:41.067689    4644 kubelet_node_status.go:391] Error updating node status, will retry: error getting node "intern-master2": Get https://127.0.0.1:6443/api/v1/nodes/intern-master2?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
(last error repeated at the same time daily for some days?)
I0901 10:24:09.045364    4644 logs.go:49] http: TLS handshake error from 10.0.0.13:42112: EOF

OK, but the really interesting message has been showing up repeatedly since yesterday:

W0903 17:35:57.479608    4644 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Internal Server Error
W0903 17:38:36.630387    4644 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Internal Server Error
W0903 22:02:39.594404    4644 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s

On a node without problems I found a message that points to an unavailable OpenStack API:
Expected HTTP response code [200 204 300] when accessing [GET https://compute.our.openstack.provider/v2.1/servers/detail?name=%5Eintern-master3%24], but got 502 instead

This seems related to kubernetes/kubernetes#46969 - but instead of the node being marked as "NotReady" it loses required attributes.

Workaround: docker restart the kubelet container.
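
A rough sketch of that workaround (the container name kubelet is RKE's default; on systemd-managed installs the equivalent is a service restart):

  docker restart kubelet        # RKE / RancherOS: kubelet runs as a container
  systemctl restart kubelet     # installs where kubelet is a systemd service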

@stieler-it changed the title from "Kubernetes node loses InternalIP and ExternalIP after a while" to "Kubernetes nodes lose InternalIP and ExternalIP temporarily" on Sep 4, 2018
@stieler-it
Author

I just got word from someone else that they did not see this issue with K8s 1.10.5, but started seeing it after upgrading to 1.11.1.

@alexandrem

We have experienced this as well on some internal clusters running 1.11.1.

@stieler-it
Author

Looks like there might have been changes to the OpenStack cloud provider in K8s 1.11.x: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#sig-openstack

@FengyunPan2
Contributor

/cc

@zetaab
Member

zetaab commented Sep 8, 2018

We have this issue in OpenShift 3.9 as well (Kubernetes 1.9). We are fixing nodes by deleting them and restarting the kubelet; after that, the addresses come back. However, the weird thing is why this happens at all: I tried, for instance, blocking access to the OpenStack API with a firewall, but the IP addresses were still there.
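
Roughly, that fix-up looks like the following (a sketch; the node name is a placeholder, and it assumes the kubelet re-registers the Node object with fresh addresses on restart):

  kubectl delete node <node-name>   # drop the stale Node object
  systemctl restart kubelet         # kubelet re-registers the node on startup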

@desaintmartin
Member

desaintmartin commented Sep 26, 2018

I have been experiencing a related issue since upgrading to 1.11 (1.10.x -> 1.11.3), using OpenStack.

I get repeating logs on a few machines (interestingly, only on about a third of the cluster's nodes):
Sep 26 08:45:57 kubernetesnode16 kubelet[29922]: W0926 08:45:57.215477 29922 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider that matches ip: 51.38.44.227

That seems to cause some intermittent network outage within the cluster (kubedns on this machine is sometimes inaccessible).

I STILL have all ExternalIP/InternalIP set.

Restarting kubelet solves the problem (temporarily? Let's see).

It looks like this has the same cause, which is why I'm posting here instead of creating a new issue.

Also note that during the upgrade we switched from iptables to ipvs. It should not be related, but I'm noting it nonetheless.

Edit: network issues do not seem related to this problem.

@alexandrem

They might have fixed this bug in kubernetes/kubernetes#65226.

Haven't tested the new code yet.

@mitchellmaler

A workaround would be to use the config drive and set the metadata search order so that only the config drive is consulted. This way it won't query the API and fail.
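
A sketch of what that looks like in cloud.conf, assuming the in-tree OpenStack provider's [Metadata] section (whether the setting is actually honored is discussed further down):

  [Metadata]
  search-order = configDrive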

@openstacker
Contributor

We have the same issue with cloud-provider-openstack and K8s v1.11.2.

@openstacker
Contributor

@mitchellmaler Hi Mitchell, would you mind explaining the workaround in more detail? Are you talking about this

  [Metadata]
  search-order = configDrive

in cloud-config? Thanks.

@mitchellmaler

@openstacker I tried that and thought it was working, but after a few days I hit the same issue again.
I was premature :)

@mitchellmaler

Still just hoping to get the fix backported to 1.11

@openstacker
Contributor

openstacker commented Oct 23, 2018

Ah, then yes, that perfectly matches what we just discussed.

(11:12:22) flwang: i have asked in  https://github.com/kubernetes/kubernetes/pull/65226 to backport this to v1.11
(11:12:37) flwang: and many people want that in v1.11 as well
(11:13:36) strigazi: I think because of this https://github.com/kubernetes/kubernetes/blob/13705ac81e00f154434b5c66c1ad92ac84960d7f/pkg/cloudprovider/providers/openstack/openstack_volumes.go#L503
(11:13:47) strigazi: it always uses the metadata service
(11:14:09) strigazi: it says in the comments: We're avoiding using cached metadata (or the configdrive),
(11:14:27) strigazi: relying on the metadata service.
(11:15:29) flwang: so though there is a config, the code just always skip it?
(11:16:05) strigazi: that's my understanding

@mitchellmaler

@openstacker Oh wow! That is good to know

@stieler-it
Author

We upgraded to K8s v1.11.5 a few days ago and the bug seems to be gone.

@prein

prein commented Feb 8, 2019

I'm experiencing this in v1.11.5. According to kubernetes/kubernetes#68270 (comment), the fix was merged into v1.11.6, not v1.11.5.
