Kubernetes nodes lose InternalIP and ExternalIP temporarily #280
Comments
After starting new nodes a few days ago to replace the corrupt ones, I noticed that the state has changed: master1 had no internal IP two days ago, but now shows it again. So I thought, maybe a failing refresh task that can fix itself? And in fact, on the failed node master2 I saw the following errors (master3 has no such messages):
OK, but the actually interesting message has appeared repeatedly since yesterday:
On a node without problems I found a message that points to an unavailable OpenStack API. This seems related to kubernetes/kubernetes#46969, but instead of the node being marked as "NotReady", it loses required attributes. Workaround:
I just got the information from someone else that they did not experience this issue with K8s 1.10.5, but did after upgrading to 1.11.1.
We have experienced this as well on some internal clusters that are at 1.11.1.
Looks like there might have been changes to the OpenStack cloud provider in K8s 1.11.x: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#sig-openstack
/cc
We have this issue in OpenShift 3.9 as well (Kubernetes 1.9). We are fixing nodes by deleting them and restarting kubelet (sketched below); after that, the addresses are back again. However, the weird thing is why this is happening at all. I tried, for instance, blocking access to the OpenStack API with a firewall, but the IP address was still there.
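For concreteness, a sketch of that recovery sequence (the node name is a placeholder, and this assumes kubelet runs as a systemd unit):

```sh
# Remove the stale Node object so the node can re-register with
# fresh addresses when kubelet comes back up.
kubectl delete node <node-name>

# Then, on the affected node itself, restart kubelet.
systemctl restart kubelet
```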
I have experienced a related issue since upgrading to 1.11 (1.10.x -> 1.11.3), using OpenStack. I get repeating logs on a few machines (interestingly, only about a third of the cluster's nodes). That seems to cause some intermittent network outage within the cluster (kube-dns on such a machine is sometimes inaccessible). I STILL have all ExternalIP/InternalIP addresses set. Restarting kubelet solves the problem (temporarily? Let's see). It looks like this has the same cause, which is why I am posting here instead of creating a new issue. Also note that during the upgrade we switched from iptables to ipvs; it should not be related, but I mention it nonetheless. Edit: the network issues do not seem related to this problem.
They might have fixed this bug: kubernetes/kubernetes#65226. I haven't tested the new code yet.
A workaround would be to use the config drive and set up the metadata search to only use that. This way it won't use the API and fail.
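For reference, a minimal sketch of what this workaround would look like in the cloud provider's cloud.conf, assuming the in-tree OpenStack provider's `[Metadata]` section (the default search order is `configDrive,metadataService`):

```ini
# cloud.conf (OpenStack cloud provider) - a sketch, not a verified fix:
# restrict metadata lookups to the config drive so they never depend
# on the (sometimes unavailable) metadata service.
[Metadata]
search-order = configDrive
```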
We have the same issue with cloud-provider-openstack and K8s v1.11.2.
@mitchellmaler Hi Mitchell, would you mind explaining the workaround in more detail? Are you talking about this in cloud-config? Thanks.
@openstacker I tried that, and I thought it was working, but after a few days I had the same issue.
Still just hoping to get the fix backported to 1.11.
Ah, then yes, that perfectly matches what we just discussed:

```
(11:12:22) flwang: i have asked in https://github.com/kubernetes/kubernetes/pull/65226 to backport this to v1.11
(11:12:37) flwang: and many people want that in v1.11 as well
(11:13:36) strigazi: I think because of this https://github.com/kubernetes/kubernetes/blob/13705ac81e00f154434b5c66c1ad92ac84960d7f/pkg/cloudprovider/providers/openstack/openstack_volumes.go#L503
(11:13:47) strigazi: it always uses the metadata service
(11:14:09) strigazi: it says in the comments: We're avoiding using cached metadata (or the configdrive),
(11:14:27) strigazi: relying on the metadata service.
(11:15:29) flwang: so though there is a config, the code just always skip it?
(11:16:05) strigazi: that's my understanding
```
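For anyone debugging this path: a quick way to check whether the metadata service mentioned above is reachable from an affected node (standard OpenStack metadata endpoint; the exact path is assumed from the usual layout):

```sh
# Probe the Nova metadata service directly from the node. If this hangs
# or fails, the hard-coded metadata-service lookup discussed above will
# fail too, regardless of the search order set in cloud.conf.
curl -sf http://169.254.169.254/openstack/latest/meta_data.json | head -c 300
```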
@openstacker Oh wow! That is good to know.
We upgraded to K8s v1.11.5 a few days ago and the bug seems to be gone.
I'm experiencing this in v1.11.5. According to kubernetes/kubernetes#68270 (comment), it was merged into v1.11.6, not v1.11.5.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened: InternalIP and ExternalIP are lost for some nodes (in our case, 2 of 5 nodes are broken). I am not sure if this is related to the cloud provider.
This leads to problems, e.g. when trying to create an L4 load balancer:
```
Error creating load balancer (will retry): failed to ensure load balancer for service gitlab/gitlab-nginx-ingress-controller: error getting address for node intern-master1: no address found for host
```
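To see the symptom directly, the addresses currently stored for each node can be listed like this (a generic check, not part of the original report):

```sh
# List each node with the addresses the kubelet last reported;
# affected nodes are missing their InternalIP/ExternalIP entries.
kubectl get nodes -o custom-columns='NAME:.metadata.name,ADDRESSES:.status.addresses[*].address'
```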
What you expected to happen: I expect both InternalIP and ExternalIP not to change as long as the infrastructure doesn't change.
How to reproduce it (as minimally and precisely as possible): I can't tell; it just "happened". Last week it worked; today it doesn't.
Anything else we need to know?: I saw a similar bug report here: Azure/acs-engine#3503
Environment:
- Kernel (e.g. `uname -a`): Linux intern-worker1 4.14.32-rancher2 #1 SMP Fri May 11 11:30:31 UTC 2018 x86_64 GNU/Linux