ServiceEntry has stale endpoints #35404
I added some debug logs and found that the issue is due to the informer getting events out of order.
However, istiod gets events in this order:
From my testing, the new pod event comes at
In theory, the informer events are in order. Thanks for your investigation. I need to have a closer look.
I suspect they are in order. But the new pod legitimately starts with the new IP before the old pod is fully deleted. Just a guess, though.
Actually, as long as the pod starts terminating, it is marked as a delete event here and should be removed from the service registry. But that event comes later than the new pod's add event. I think the IP is released after the pod starts terminating, so the terminating event should come before the new pod's start-up event, which indicates to me that the informer events are out of order. (By the way, to reproduce in a cluster with a small load, I ran another deployment with 300 pods that is constantly restarting to mimic load.)
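The race described above can be sketched with a simplified model. This is not Istio's actual code; the names (`instances_by_ip`, `endpoints`, `on_add`, `on_delete`) are invented stand-ins for a registry that indexes workload instances by IP and treats a delete as redundant when the IP already maps to a newer instance:

```python
# Simplified model of the suspected race -- NOT Istio's implementation.
instances_by_ip = {}  # ip -> pod name (the problematic one-ip-one-instance assumption)
endpoints = set()     # endpoints the registry would push to proxies

def on_add(pod, ip):
    instances_by_ip[ip] = pod
    endpoints.add(pod)

def on_delete(pod, ip):
    # The IP now maps to the new pod, so this delete looks redundant
    # and is skipped; the old pod's endpoint is never removed.
    if instances_by_ip.get(ip) != pod:
        return
    del instances_by_ip[ip]
    endpoints.discard(pod)

on_add("old-pod", "10.0.0.5")
on_add("new-pod", "10.0.0.5")     # add for the new pod arrives first
on_delete("old-pod", "10.0.0.5")  # skipped: the IP points at new-pod

print(sorted(endpoints))  # ['new-pod', 'old-pod'] -- 'old-pod' is stale
```

With the events in this order, the old pod's endpoint survives forever, matching the stale-endpoint symptom in the issue.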
This CL fixes the problem described in this issue: istio#35404. workloadInstancesByIP is replaced by workloadInstancesByName, i.e. we don't assume a one-to-one correspondence between a workload instance and an IP. workloadInstancesIPsByName is removed since it is covered by the new map. There is a refactor of the ServiceEntryStore in progress that should also fix the problem: istio#35369, but that probably won't be backported to 1.11, so I created this smaller fix to be patched onto 1.11.
Change-Id: I62992082979cb12ff5fecef4d85f20e3c9fd435a
Reviewed-on: https://gerrit.musta.ch/c/public/istio/+/1690
Reviewed-by: Weibo He <weibo.he@airbnb.com>
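The idea behind keying by name rather than by IP can be sketched in the same simplified style. Again, these are hypothetical stand-in names, not the actual data structures from the CL:

```python
# Sketch of the fix's idea -- NOT the actual Istio change: index
# workload instances by name, so events for different pods that
# happen to share an IP cannot shadow each other.
instances_by_name = {}  # pod name -> ip; one entry per instance

def on_add(pod, ip):
    instances_by_name[pod] = ip

def on_delete(pod, ip):
    # Keyed by name, the old pod's delete removes exactly the old
    # pod's entry, regardless of which pod currently holds its IP.
    instances_by_name.pop(pod, None)

on_add("old-pod", "10.0.0.5")
on_add("new-pod", "10.0.0.5")     # same out-of-order sequence as in the issue
on_delete("old-pod", "10.0.0.5")

print(instances_by_name)  # {'new-pod': '10.0.0.5'} -- no stale entry
```

Even with the delete arriving after the new pod's add, the registry converges to only the live instance.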
Fix is merged.
* istio: fix service entry stale endpoints
This CL fixes the problem described in this issue: #35404. workloadInstancesByIP is replaced by workloadInstancesByName, i.e. we don't assume a one-to-one correspondence between a workload instance and an IP. workloadInstancesIPsByName is removed since it is covered by the new map. There is a refactor of the ServiceEntryStore in progress that should also fix the problem: #35369, but that probably won't be backported to 1.11, so I created this smaller fix to be patched onto 1.11.
Change-Id: I62992082979cb12ff5fecef4d85f20e3c9fd435a
Reviewed-on: https://gerrit.musta.ch/c/public/istio/+/1690
Reviewed-by: Weibo He <weibo.he@airbnb.com>
* lint
* add release notes
Co-authored-by: ying-zhu <ying.zhu@airbnb.com>
Bug Description
We have a ServiceEntry in an external Istiod cluster that selects k8s pods in a remote cluster. When the pods are restarted, sometimes there are stale endpoints in the cluster corresponding to the ServiceEntry. Looking at istiod's debug/endpointz, all of the istiod pods have the same stale state, and restarting the istiod pods fixes the problem. To reproduce more easily, I scale the deployment to 100 replicas and then back down to 1; sometimes some endpoints are not correctly deleted and there is more than one endpoint under the cluster corresponding to the ServiceEntry. It does appear to be easier to reproduce in more heavily loaded k8s clusters.
Here is the code for reproduction:
In the external Istiod cluster:
In the remote cluster:
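The original reproduction manifests were not captured in this page. As a stand-in, a ServiceEntry selecting pods via a workloadSelector might look roughly like the following; the hostname, namespace, port, and labels are invented placeholders, not the reporter's actual values:

```yaml
# Hypothetical reproduction sketch -- placeholders, not the original manifests.
# In the external Istiod cluster:
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: test-se
  namespace: test-ns
spec:
  hosts:
  - test.internal.example.com
  location: MESH_INTERNAL
  resolution: STATIC
  ports:
  - number: 8080
    name: http
    protocol: HTTP
  workloadSelector:
    labels:
      app: test-app
```

In the remote cluster, the counterpart would be an ordinary Deployment whose pod template carries the matching `app: test-app` label, scaled up to 100 replicas and back down to 1 to trigger the churn described above.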
Version
Additional Information
If I turn on debug logging, for the stale endpoints I don't see log lines like this one:
2021-09-28T18:02:19.160451Z debug Handle event delete for service instance (from 100.117.83.50) in namespace airmesh-test-ying
The log comes from here: https://github.com/istio/istio/blob/master/pilot/pkg/serviceregistry/serviceentry/servicediscovery.go#L444
So it looks like either the delete event is missing, or redundantEventForPod is true (which it probably is not)?