Destination returns expired IPs for endpoints in the kube-system namespace #4055

Closed
humboldt-xie opened this issue Feb 14, 2020 · 12 comments

@humboldt-xie

humboldt-xie commented Feb 14, 2020

The key code:

func (ew *EndpointsWatcher) addEndpoints(obj interface{}) {
	endpoints := obj.(*corev1.Endpoints)
	if endpoints.Namespace == kubeSystem {
		return
	}
	id := ServiceID{
		Namespace: endpoints.Namespace,
		Name:      endpoints.Name,
	}

	sp := ew.getOrNewServicePublisher(id)

	sp.updateEndpoints(endpoints)
}

Full steps to reproduce:

# run the demo-client
$ kubectl exec -it demo-client-7bd774d7d8-2xvp2 bash
[root@demo-client-7bd774d7d8-2xvp2 client]# while true; do curl example.kube-system.svc.cluster.local; sleep 1; done

# in another terminal

$ kubectl  delete pod -n linkerd linkerd-destination-6bb5d8685c-bft49
pod "linkerd-destination-6bb5d8685c-bft49" deleted

$ kubectl  get endpoints example -n kube-system
NAME         ENDPOINTS                         AGE
example   172.20.14.96:80,172.20.3.140:80   665d

$ linkerd endpoints example.kube-system.svc.cluster.local
NAMESPACE     IP             PORT   POD                                   SERVICE
kube-system   172.20.3.140   80     example-canary-855889ff7c-ctths    example.kube-system
kube-system   172.20.14.96   80     example-release-757d7c9cb9-g7kml   example.kube-system

$ kubectl delete pod example-canary-855889ff7c-ctths -n kube-system
pod "example-canary-855889ff7c-ctths" deleted

$ kubectl  get endpoints example -n kube-system
NAME         ENDPOINTS                          AGE
example   172.20.10.253:80,172.20.14.96:80   665d

$ linkerd endpoints example.kube-system.svc.cluster.local
NAMESPACE     IP             PORT   POD                                   SERVICE
kube-system   172.20.14.96   80     example-release-757d7c9cb9-g7kml   example.kube-system
kube-system   172.20.3.140   80     example-canary-855889ff7c-ctths    example.kube-system

172.20.3.140 is invalid: the pod behind it was deleted, and kubectl now lists 172.20.10.253 in its place.
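To make the mismatch explicit, the two address lists above can be compared as sets. The following is a minimal sketch, not part of Linkerd or kubectl: `diffAddrs` is a hypothetical helper, and the addresses are copied by hand from the two outputs above as `ip:port` strings.

```go
package main

import (
	"fmt"
	"sort"
)

// diffAddrs is a hypothetical helper for this illustration.
// stale:   addresses the mesh still reports but Kubernetes no longer lists.
// missing: addresses Kubernetes lists but the mesh does not report.
func diffAddrs(actual, reported []string) (stale, missing []string) {
	inActual := make(map[string]bool, len(actual))
	for _, a := range actual {
		inActual[a] = true
	}
	inReported := make(map[string]bool, len(reported))
	for _, r := range reported {
		inReported[r] = true
		if !inActual[r] {
			stale = append(stale, r)
		}
	}
	for _, a := range actual {
		if !inReported[a] {
			missing = append(missing, a)
		}
	}
	// Sort so the result does not depend on input order.
	sort.Strings(stale)
	sort.Strings(missing)
	return stale, missing
}

func main() {
	// From `kubectl get endpoints example -n kube-system` after the pod was deleted.
	fromKubectl := []string{"172.20.10.253:80", "172.20.14.96:80"}
	// From `linkerd endpoints example.kube-system.svc.cluster.local` at the same point.
	fromLinkerd := []string{"172.20.14.96:80", "172.20.3.140:80"}

	stale, missing := diffAddrs(fromKubectl, fromLinkerd)
	fmt.Printf("stale=%v missing=%v\n", stale, missing)
	// Prints: stale=[172.20.3.140:80] missing=[172.20.10.253:80]
}
```

A non-empty `stale` list means the mesh is still routing to an address that Kubernetes has already dropped.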

I recommend using the service IP when the endpoints are in kube-system.

@grampelberg
Contributor

grampelberg commented Feb 14, 2020

The service IP would break load balancing, which would be a weird gotcha. Do you know why the endpoint is stale in k8s for this?

Also, the service IP would try to load balance to the wrong IP anyways, resulting in the same problem =/

@humboldt-xie
Author

> The service IP would break load balancing, which would be a weird gotcha. Do you know why the endpoint is stale in k8s for this?
>
> Also, the service IP would try to load balance to the wrong IP anyways, resulting in the same problem =/

It's different. See the code:

if endpoints.Namespace == kubeSystem {
	return
}

Because of this check, the endpoint IPs are never updated until the destination service is restarted.
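The effect of that early return can be modeled in a few lines. This is a toy sketch, not Linkerd's code: the `watcher` and `endpoints` types, the `seed` step, and the `skipKubeSystem` toggle are all hypothetical, and it assumes the cache is filled once at startup and afterwards changes only through the event handler. The addresses are taken from the reproduction steps in this issue.

```go
package main

import "fmt"

const kubeSystem = "kube-system"

// endpoints is a pared-down, hypothetical stand-in for corev1.Endpoints.
type endpoints struct {
	Namespace string
	Name      string
	Addrs     []string
}

// watcher is a toy cache keyed by "namespace/name".
// It is not Linkerd's EndpointsWatcher.
type watcher struct {
	skipKubeSystem bool
	cache          map[string][]string
}

func newWatcher(skipKubeSystem bool) *watcher {
	return &watcher{skipKubeSystem: skipKubeSystem, cache: map[string][]string{}}
}

// seed models the state read once at startup (an assumption of this sketch).
func (w *watcher) seed(ep endpoints) {
	w.cache[ep.Namespace+"/"+ep.Name] = ep.Addrs
}

// onUpdate models the event handler, including the early return quoted above.
func (w *watcher) onUpdate(ep endpoints) {
	if w.skipKubeSystem && ep.Namespace == kubeSystem {
		return // event dropped: the cache keeps the old addresses
	}
	w.cache[ep.Namespace+"/"+ep.Name] = ep.Addrs
}

func (w *watcher) lookup(namespace, name string) []string {
	return w.cache[namespace+"/"+name]
}

func main() {
	before := endpoints{Namespace: kubeSystem, Name: "example",
		Addrs: []string{"172.20.14.96", "172.20.3.140"}}
	after := endpoints{Namespace: kubeSystem, Name: "example",
		Addrs: []string{"172.20.10.253", "172.20.14.96"}}

	for _, skip := range []bool{true, false} {
		w := newWatcher(skip)
		w.seed(before)
		w.onUpdate(after) // the pod was replaced, so Kubernetes emits an update
		fmt.Printf("skipKubeSystem=%v -> %v\n", skip, w.lookup(kubeSystem, "example"))
	}
}
```

With the check enabled, the lookup keeps returning 172.20.3.140 after the update; with it disabled, the lookup returns the new address set. In this model, only rebuilding the watcher (the analogue of restarting destination) refreshes the stale entry.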

@grampelberg
Contributor

It sounds like we just need to remove those lines? Why should we be using the service ip and how would any of this help the invalid output that you're getting from kubectl?

@humboldt-xie
Author

> It sounds like we just need to remove those lines? Why should we be using the service ip and how would any of this help the invalid output that you're getting from kubectl?

The IPs returned by kubectl get are valid.

The IP returned by linkerd, 172.20.14.124, is invalid.

@grampelberg
Contributor

From your paste:

$ kubectl get endpoints -n kube-system k8s-deploy
NAME         ENDPOINTS                          AGE
example   172.20.14.124:80,172.20.14.96:80   659d

That contains 172.20.14.124, right?

@humboldt-xie
Author

humboldt-xie commented Feb 20, 2020

> From your paste:
>
> $ kubectl get endpoints -n kube-system k8s-deploy
> NAME         ENDPOINTS                          AGE
> example   172.20.14.124:80,172.20.14.96:80   659d
>
> That contains 172.20.14.124, right?

Sorry, I made a mistake in the issue. In my paste I had stopped the mesh, so linkerd returned the right result.

Follow the steps below to reproduce the problem:

# run the demo-client
$ kubectl exec -it demo-client-7bd774d7d8-2xvp2 bash
[root@demo-client-7bd774d7d8-2xvp2 client]# while true; do curl example.kube-system.svc.cluster.local; sleep 1; done

# in another terminal

$ kubectl  delete pod -n linkerd linkerd-destination-6bb5d8685c-bft49
pod "linkerd-destination-6bb5d8685c-bft49" deleted

$ kubectl  get endpoints example -n kube-system
NAME         ENDPOINTS                         AGE
example   172.20.14.96:80,172.20.3.140:80   665d

$ linkerd endpoints example.kube-system.svc.cluster.local
NAMESPACE     IP             PORT   POD                                   SERVICE
kube-system   172.20.3.140   80     example-canary-855889ff7c-ctths    example.kube-system
kube-system   172.20.14.96   80     example-release-757d7c9cb9-g7kml   example.kube-system

$ kubectl delete pod example-canary-855889ff7c-ctths -n kube-system
pod "example-canary-855889ff7c-ctths" deleted

$ kubectl  get endpoints example -n kube-system
NAME         ENDPOINTS                          AGE
example   172.20.10.253:80,172.20.14.96:80   665d

$ linkerd endpoints example.kube-system.svc.cluster.local
NAMESPACE     IP             PORT   POD                                   SERVICE
kube-system   172.20.14.96   80     example-release-757d7c9cb9-g7kml   example.kube-system
kube-system   172.20.3.140   80     example-canary-855889ff7c-ctths    example.kube-system



@grampelberg
Contributor

Ahha, that makes way more sense! Up for a PR? @adleong any ideas why we'd just not bother doing discovery on kube-system?

@adleong
Member

adleong commented Feb 25, 2020

I did some digging into the history and that kube-system exclusion has been there from the beginning. I can't recall why we put it in but as far as I can tell it should be safe to remove.

@adleong
Member

adleong commented Feb 25, 2020

@humboldt-xie are you interested in working on this?

humboldt-xie added a commit to humboldt-xie/linkerd2 that referenced this issue Mar 4, 2020
…stem linkerd#4055

Signed-off-by: humboldt <humboldt_xie@163.com>
@humboldt-xie
Author

> @humboldt-xie are you interested in working on this?

I have removed the code that skips the kube-system namespace.

Please review.

@ericsuhong

ericsuhong commented Mar 28, 2020

We were affected by this issue. We had two services running in the kube-system namespace talking to each other (via the Linkerd proxy), and whenever some pods restarted, they stopped communicating with each other because Linkerd could not find the proper endpoints.

Logs from Linkerd-proxy:
outbound:accept{peer.addr=10.240.1.149:38896}:source{target.addr=10.0.184.69:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out

Simply moving services out of kube-system namespace resolved this issue.

@stale

stale bot commented Jun 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jun 26, 2020
@stale stale bot closed this as completed Jul 11, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021