Pod fails to start: Application container unable to access network before sidecar ready #4341

Closed
2 tasks
mandarjog opened this issue Mar 16, 2018 · 15 comments

@mandarjog
Contributor

When a pod starts, the sidecar and the application containers all start together.

If an application container attempts to access a network service before the sidecar is ready, the connection fails.

  1. Access can fail completely if no listener is present on the sidecar.
  2. Access fails with a 404 or 503 if a listener is present but no routes are available yet.

If the application tolerates its dependencies being temporarily unavailable, then this is not an issue. The application will keep retrying until the connection can be established.
However, if the application uses a network endpoint during startup and treats a failure to reach that endpoint as a fatal error, the application container will die.
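
As an illustration of the "resilient" case described above, here is a minimal sketch of a startup connection that retries instead of exiting. The function name, the attempt count, and the delay are assumptions made for this example, not anything Istio provides. It only covers the refused-connection case (no listener yet); a 404 or 503 returned by a listener without routes would need a similar retry at the HTTP level.

```python
import socket
import time


def connect_with_retry(host, port, attempts=10, delay=1.0,
                       connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, retrying while the sidecar is still starting.

    `connect` and `sleep` are injectable only so the sketch can be tested
    without a network; the defaults use the standard library.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect((host, port), timeout=5)
        except OSError as err:  # ConnectionRefusedError is a subclass of OSError
            last_error = err
            if attempt < attempts:
                sleep(delay)  # wait before the next attempt
    raise ConnectionError(
        f"could not reach {host}:{port} after {attempts} attempts"
    ) from last_error
```

An application that calls something like `connect_with_retry("db.example.internal", 5432)` (hypothetical host) during startup would survive the window in which the sidecar has no listener, rather than depending on a container restart.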

As long as restartPolicy is OnFailure (or Always), k8s will restart the container while the sidecar gets ready.

  • Test that this really works
  • Document mitigation
@ZackButcher
Contributor

I have users that have confirmed both that this problem exists and is painful, and that setting a restartPolicy "fixes" it. Really this is just the user-visible side of #4363.

@stale

stale bot commented Jul 22, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 2 weeks unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 22, 2018
@costinm
Contributor

costinm commented Aug 7, 2018

Readiness issue.

@stale stale bot removed the stale label Aug 7, 2018
@bluk

bluk commented Aug 24, 2018

I've also encountered this issue when running a Kubernetes job which immediately tries to connect to a PostgreSQL instance. The job container failed with an ERROR: connect ECONNREFUSED 10.99.214.72:5432. I was thinking that it was because I enabled mutual TLS with Istio but I eventually found that just having the sidecar injected would cause this issue. If I ran the job without the sidecar being injected, the job would succeed.

Setting restartPolicy: OnFailure mitigates the issue, as noted. Is there a recommended way to tell whether the sidecar is ready?
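
The thread does not name a readiness endpoint, so the following is only a sketch of the general approach: poll an HTTP health URL until it answers 200 or a deadline passes. The URL in the usage note is an assumption and should be checked against the Istio version in use; `http_ok` and `wait_for_sidecar` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=1.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response (HTTPError).
        return False


def wait_for_sidecar(url, deadline_seconds=30.0, interval=0.5,
                     probe=http_ok, sleep=time.sleep, clock=time.monotonic):
    """Poll `url` until `probe` succeeds; return False once the deadline passes."""
    start = clock()
    while True:
        if probe(url):
            return True
        if clock() - start >= deadline_seconds:
            return False
        sleep(interval)
```

A job could call `wait_for_sidecar("http://127.0.0.1:15021/healthz/ready")` (assumed address) before opening its database connection and exit with a clear error if it returns False.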

@violetgo

I have also encountered this problem. Our current workaround is to sleep for a few seconds before the service starts and only then connect to the network.

@ZackButcher
Contributor

#8983, which will be in 1.1, should help address this (by letting applications call out to other services while Envoy is starting up).

@stale

stale bot commented Dec 30, 2018

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale

stale bot commented Feb 12, 2019

This issue has been automatically closed because it has not had activity in the last month and a half. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed Feb 12, 2019
@hzxuzhonghu hzxuzhonghu reopened this Feb 13, 2019
@stale stale bot removed the stale label Feb 13, 2019
@hzxuzhonghu
Member

/remove stale
since this is not addressed

@stale

stale bot commented May 14, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 14, 2019
@stale

stale bot commented Jun 13, 2019

This issue has been automatically closed because it has not had activity in the last month and a half. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed Jun 13, 2019
@rlenglet rlenglet modified the milestones: 1.4, 1.3 Jul 9, 2019
@chris922

I am using Istio v1.5.2 and still have this issue. Unfortunately, the app container doesn't fail, so the workaround with the restart policy doesn't work for me.

Any other ideas for a workaround or plans to solve this? We are evaluating Istio right now for our project and this seems to be a blocking issue.

The only solution I see is to add something like a sleep in the app container before the real app starts, but I would expect Istio not to need changes to the app itself in order to work properly.

@ZackButcher
Contributor

The full solution to this in Kubernetes is for k8s to support sidecar containers as a first-class concept, starting them up entirely before starting the application container. We'd been hopeful this would land in the latest k8s release, but it has since been put on indefinite hold by the k8s community and will not ship with k8s 1.19 (at this point we can hope for 1.20, but I haven't been following k8s closely enough to see if that's realistic).

Other organizations I've worked with have solved this problem by adding a sleep to the app container. The base framework for services we use at Tetrate incorporates a sleep at startup to paper over this pain too, for example. It's not clean, and violates the design goal of the mesh being transparent, but until there's better support for container lifecycles in underlying platforms that Istio runs on there's not too much we can do here.

@chris922

chris922 commented May 2, 2020

Thanks for the detailed information @ZackButcher!

What do you think is the best place to put the "sleep"?

@ZackButcher
Contributor

I wrote our sleep to be literally the first thing that the application does at startup. It's effectively the first line of code that executes in a shared main method that all of our services use, which lets us make sure there are standard flags for configuring the startup delay, etc. Making it absolutely the first thing that happens prevents developers from accidentally attempting to do stuff that could fail without a sidecar (like opening connections to the database or reading some online config store).

Anecdotally, with a 5-second delay at startup we've not seen any startup failures due to waiting on the sidecar in our continuous testing environments. How long the typical startup delay is in your system is mainly a function of Pilot load (number of services in the system, rate of change of pods and services, number of sidecars connected, etc).
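
For readers who want to try the approach described above, a minimal sketch of a configurable startup delay might look like the following. The environment variable name `STARTUP_DELAY_SECONDS` and both function names are assumptions for illustration (the comment mentions flags but does not say how they are named); the 5-second default mirrors the figure quoted above.

```python
import math
import os
import time


def startup_delay_seconds(environ=os.environ, default=5.0):
    """Read the delay from STARTUP_DELAY_SECONDS (hypothetical name).

    Falls back to `default` when the value is missing, unparsable,
    negative, or not finite.
    """
    raw = environ.get("STARTUP_DELAY_SECONDS", "")
    try:
        delay = float(raw)
    except ValueError:
        return default
    if not math.isfinite(delay) or delay < 0:
        return default
    return delay


def main(sleep=time.sleep, environ=os.environ):
    # First statement in the shared entry point: wait for the sidecar
    # before anything touches the network.
    sleep(startup_delay_seconds(environ))
    # ... only now open database connections, read remote config, etc.
```

Calling `main()` as the process entry point keeps the delay ahead of any other startup work, and setting the variable to `0` disables it in environments without a sidecar.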
