
Race condition getting pod IP and sending listeners list #11779

Closed
rvansa opened this issue Feb 15, 2019 · 5 comments

Comments

@rvansa commented Feb 15, 2019

Describe the bug
When I start many nodes concurrently, some of them get stuck and the sidecar never boots. This happens because the pod's IP:port is missing from the listeners list: pilot/pkg/serviceregistry/kube/controller.go:GetProxyServiceInstances() tries to get the pod info from the PodCache on the controller, but gets nil at that point. After adding some debug logs I found that the pilot/pkg/serviceregistry/kube/pod.go:event(...) call that updates the IP -> pod mapping arrives too late. It also seems that the list of listeners is not generated again afterwards (not sure if that is the actual problem).
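
To illustrate the ordering problem, here is a minimal sketch (the types and names below are simplified stand-ins, not Istio's actual code): the proxy asks for its service instances before the pod event has populated the IP -> pod mapping, so the lookup misses and the listeners are built without the pod's own IP:port.

```go
package main

import (
	"fmt"
	"sync"
)

// podCache is a simplified stand-in for the controller's IP -> pod mapping;
// the names here are illustrative, not Istio's actual types.
type podCache struct {
	mu   sync.RWMutex
	byIP map[string]string // pod IP -> pod name
}

func (pc *podCache) getPodByIP(ip string) (string, bool) {
	pc.mu.RLock()
	defer pc.mu.RUnlock()
	name, ok := pc.byIP[ip]
	return name, ok
}

// onPodEvent is roughly what pod.go:event(...) does: it records the
// IP -> pod mapping once the pod informer event finally arrives.
func (pc *podCache) onPodEvent(ip, name string) {
	pc.mu.Lock()
	defer pc.mu.Unlock()
	pc.byIP[ip] = name
}

func main() {
	pc := &podCache{byIP: map[string]string{}}

	// The proxy connects and its service instances are resolved before the
	// pod event has populated the cache: the lookup returns nothing, so the
	// generated listener list omits the pod's own IP:port.
	if _, ok := pc.getPodByIP("10.0.0.7"); !ok {
		fmt.Println("pod not in cache yet -> listeners generated without this pod")
	}

	// The pod event arrives too late; unless something triggers a new push,
	// the sidecar is stuck with the incomplete configuration.
	pc.onPodEvent("10.0.0.7", "my-pod")
}
```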

Expected behavior
All sidecars boot.

Steps to reproduce the bug
Create 20 deployment configurations that spawn 20 pods concurrently; usually at least one of them gets stuck.

Version
Istio master as of 2019-2-14

Installation
Helm charts.

Environment
Openshift 3.10

@rvansa (Author) commented Feb 18, 2019

It seems I can fix this by calling pc.c.XDSUpdater.ConfigUpdate(true) instead of pc.c.XDSUpdater.WorkloadUpdate(...) in pod.go:event(...); I doubt this is the correct fix, though, as it could result in excessive pushes. I could limit the full push to cases where the IP is set in the PodCache for the first time; would that be acceptable?
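
Something along these lines is what I have in mind (an illustrative sketch only: the interface below does not match the real XDSUpdater signatures, and the first-seen tracking is hypothetical):

```go
package main

import "fmt"

// xdsUpdater captures the two push paths mentioned above; this interface is
// illustrative and does not match Istio's actual XDSUpdater signatures.
type xdsUpdater interface {
	ConfigUpdate(full bool)   // full push: listeners are regenerated for all proxies
	WorkloadUpdate(ip string) // incremental push keyed on the workload IP
}

type logUpdater struct{}

func (logUpdater) ConfigUpdate(full bool)   { fmt.Println("full config push, full =", full) }
func (logUpdater) WorkloadUpdate(ip string) { fmt.Println("workload update for", ip) }

// onPodIP sketches the suggested compromise: escalate to a full push only the
// first time an IP shows up in the pod cache, so proxies that connected before
// the pod event get their listeners regenerated; later events stay incremental.
func onPodIP(u xdsUpdater, seen map[string]bool, ip string) {
	if !seen[ip] {
		seen[ip] = true
		u.ConfigUpdate(true)
		return
	}
	u.WorkloadUpdate(ip)
}

func main() {
	seen := map[string]bool{}
	onPodIP(logUpdater{}, seen, "10.0.0.7") // first sighting -> full push
	onPodIP(logUpdater{}, seen, "10.0.0.7") // subsequent events -> incremental
}
```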

@andraxylia (Contributor) commented Feb 19, 2019

This is a known timing issue: Pod and Endpoint events can be received in any order.

I thought this was taken care of. cc @costinm who can comment more on the XDSUpdater.

Please keep in mind that we will move to synthetic service entries generated in Galley (see PR #11293), and we will ultimately get rid of this code in Pilot.

@rvansa (Author) commented Feb 20, 2019

@andraxylia Thanks for the response. IIUC, even when Pilot listens to updates from Galley rather than directly from Kubernetes, the timing issue will still be there, won't it?
The fix is pretty simple: rvansa@2fdcf58 - should I file this as a PR, or wait until #11293 gets integrated (and test then)?

@stale commented May 21, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 21, 2019

@stale commented Jun 20, 2019

This issue has been automatically closed because it has not had activity in the last month and a half. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this Jun 20, 2019
