App container unable to connect to network before sidecar is fully running #11130

Open
linsun opened this issue Jan 21, 2019 · 29 comments

Member

@linsun linsun commented Jan 21, 2019

Describe the feature request
We have had users spend a very long time debugging why their app container fails at startup when the sidecar is used in Istio. They found that the app container could not reach the network for simple things, like cloning a file from GitHub, before the Envoy proxy was ready and running. This is hard to debug because when they exec into the container after the deployment is running, everything works fine.

Describe alternatives you've considered
The current workaround people use is to put a long sleep, such as 20 or 30 seconds, in their app container to give Envoy enough time to start up.

This is fine once they discover the issue and understand how istio works better, but it can take days for them to discover the issue.

How can we make the experience better?
Can we provide a startup hook so that the app container won't start until the Envoy sidecar is ready, for cases where the app container starts very quickly and requires network connectivity?

Member Author

@linsun linsun commented Jan 21, 2019

@esnible pls feel free to add things I missed. cc @GregHanson

Contributor

@esnible esnible commented Jan 23, 2019

Currently I tell people to put the following into their .yaml:

command: ["/bin/bash", "-c"]
args: ["until curl --head localhost:15000 ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; ./startup.sh"] # replace startup.sh with actual startup command.

It would be better if networking was ready when the app container started.

A novel approach would be to slow down the app container until networking was available. A hook could set the CPU for containers other than the sidecar to use spec.containers[].resources.requests.cpu: 1m (a milli-CPU). A tool like the Network CNI would raise the CPU to an original/default value after networking started. This should starve anything compute-bound giving Envoy more time to start.

Another idea is to have the init container include pilot-agent and fetch /etc/istio/proxy/envoy-rev0.json before any non-init containers start, allowing Envoy to be configured with real values immediately instead of waiting for Pilot while the app container is starting.

Contributor

@esnible esnible commented Feb 2, 2019

This may be a duplicate of #9454

Contributor

@esnible esnible commented Feb 3, 2019

This may be a duplicate of #4341

Member

@jackkleeman jackkleeman commented Apr 18, 2019

Hey, we here at Monzo have open sourced our solution to this sequencing problem:
https://github.com/monzo/envoy-preflight
The idea is that it's a wrapper around your main application which ensures the application starts after Envoy is live, and shuts Envoy down when it's done. You'll still need to prevent SIGTERMs from reaching Envoy.

@esnible esnible removed their assignment Apr 30, 2019
Contributor

@esnible esnible commented Apr 30, 2019

Removing myself because I am not a sidecar networking guru. That is what we need for this item.

Member

@howardjohn howardjohn commented May 31, 2019

Long term fix is #11366 or maybe kubernetes/kubernetes#65502

Member

@hzxuzhonghu hzxuzhonghu commented Jun 25, 2019

As we have ALLOW_ANY, is this still a big problem?

Member

@idouba idouba commented Jun 28, 2019

Consider configuring a postStart hook on the app container to check Envoy's status, such as:
httpGet:
  path: /healthz/ready

Member

@hzxuzhonghu hzxuzhonghu commented Jun 28, 2019

I think this makes sense.

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: xxx
    command: 
    lifecycle:
      postStart:           # same check as the istio-proxy readiness probe; if this hook fails, the app container is restarted
        httpGet:
          path: /healthz/ready
          port: 15020

@xiaozhongliu xiaozhongliu commented Jul 10, 2019

esnible's solution worked for us for a long time. Unfortunately the issue has started occurring again, and it is even worse.
Our external database can be unreachable for more than 8 seconds after Envoy is ready, on top of the extra 5-second sleep ...

until curl -s localhost:15000 > /dev/null; do echo '>>> Waiting for sidecar'; sleep 2 ; done ; echo '>>> Sidecar available'; sleep 5 ; ...

Could anyone shed light on this?

@stale stale bot commented Oct 18, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@Jonathan34 Jonathan34 commented Nov 19, 2019

It's been about a year since this issue (or related issues) was opened.

It would be nice if deployments did not have to know that they need to wait for the mesh's sidecar to be ready. That coupling should not exist. Waiting for the sidecar is becoming a best practice, and it's a known, common problem that makes for poor UX and rough onboarding for new users.

@shamsher31 shamsher31 removed this from the Nebulous Future milestone Jan 13, 2021
@shamsher31 shamsher31 added this to the 1.9 milestone Jan 13, 2021
@shamsher31 shamsher31 moved this from P0 to P1 in Prioritization Jan 13, 2021
@howardjohn howardjohn removed this from the 1.9 milestone Jan 27, 2021
@howardjohn howardjohn added this to the Backlog milestone Jan 27, 2021

@florianakos florianakos commented Feb 9, 2021

FYI, this may be useful for those running Istio versions older than v1.7 (v1.7 has holdApplicationUntilProxyStarts).
The snippet shared by esnible uses localhost:15000, but that is the admin endpoint, which comes up before the Envoy proxy is operational. Using localhost:15020/healthz/ready instead helped me a lot:

until [ $(curl --fail --silent --output /dev/stderr --write-out "%{http_code}" localhost:15020/healthz/ready) -eq 200 ]; do
  echo Waiting for proxy...
  sleep 1
done
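
The loop above waits forever if the proxy never becomes ready. A bounded variant might look like the sketch below. `wait_until` and `proxy_ready` are hypothetical helper names, not part of Istio, and the 60-attempt limit and `./startup.sh` are placeholders; the probe URL is the one from the snippet above.

```shell
#!/bin/sh
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of Istio): retries COMMAND until it succeeds,
# giving up after MAX_ATTEMPTS tries. Returns 0 on success, 1 on timeout.
wait_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Probe based on the snippet above: with --fail, curl exits non-zero unless
# the sidecar's readiness endpoint answers with a success status.
proxy_ready() {
  curl --fail --silent --output /dev/null localhost:15020/healthz/ready
}

# Example entrypoint (commented out; ./startup.sh is a placeholder):
# wait_until 60 1 proxy_ready || { echo "proxy never became ready" >&2; exit 1; }
# exec ./startup.sh
```

Failing the container after a bounded wait lets Kubernetes restart it and makes the cause visible in the logs, rather than leaving the app hanging silently.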

@arocki7 arocki7 commented Feb 15, 2021

If the pod/app is very sensitive to network connectivity at startup, I would recommend adding the following annotation to the application, as @AntonySmirnoff stated.

annotations:
  proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'

It ensures that the proxy is up and functioning before the application starts. It solved the problem for me.
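
For a Deployment that already exists, the same annotation could be applied with a merge patch instead of editing the manifest. This is a minimal sketch, assuming the annotation goes on the pod template (sidecar injection reads pod annotations). `hold_patch` is a hypothetical helper name and `my-app` is a placeholder Deployment name.

```shell
#!/bin/sh
# Emit a JSON merge patch that sets the annotation shown above on a
# Deployment's pod template. The annotation value is itself a JSON string,
# so its inner quotes are backslash-escaped.
hold_patch() {
  cat <<'EOF'
{"spec":{"template":{"metadata":{"annotations":{"proxy.istio.io/config":"{ \"holdApplicationUntilProxyStarts\": true }"}}}}}
EOF
}

# Example (commented out; "my-app" is a placeholder Deployment name):
# kubectl patch deployment my-app --type merge -p "$(hold_patch)"
```

Because the patch changes the pod template, it triggers a rollout, so the new pods are created with the annotation in place.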

@omerfsen omerfsen commented Mar 31, 2021

+1

Member Author

@linsun linsun commented Mar 31, 2021

Yes, this should be closed now I think! :)

@linsun linsun closed this Mar 31, 2021
Prioritization automation moved this from P1 to Done Mar 31, 2021
Member

@howardjohn howardjohn commented Apr 9, 2021

I don't think this is done, it still doesn't support init containers or use stable APIs. Can we keep this open to track the long term solution?

Contributor

@ldemailly ldemailly commented Apr 21, 2021

This may be a FAQ, but how does holdApplicationUntilProxyStarts work? Looking at the code, it seems that it puts the Envoy proxy container first in the pod (instead of last). How does that hold the application? AFAIK containers aren't guaranteed to start sequentially (kubernetes/kubernetes#65502 is still open), or am I missing something? Are there probe interactions that achieve the goal of starting the app only after Envoy is fully up and ready?

Member

@howardjohn howardjohn commented Apr 21, 2021

Contributor

@ldemailly ldemailly commented Apr 21, 2021

Thanks a lot @howardjohn. So if I understand correctly, the istio-proxy container has a hook in which the readiness check is disguised as an am-I-started hook. (I didn't know about this hack, that k8s waits for some feedback at all before starting the next container, so I thought this was just changing/improving the race condition.) Good to know it can be relied upon!

Any known downsides (apart from the kubectl exec thing, which seems to ask for a pod anyway)? If there is no downside, why isn't it the default? I guess some apps may do a lot of internal, perhaps disk-based, work at startup and thus don't need to wait for the network, or can benefit from running in parallel without waiting for Pilot/Envoy, but it seems they could turn that on; most apps/services probably need to contact other systems during startup.

@mmerickel mmerickel commented Apr 21, 2021

kubectl run is one scenario where you cannot select a container, and if the hold is enabled then you cannot get into the application itself. You can opt out via annotations in an override spec, but this is pretty tedious to do from the CLI. Changing which container is the "default" in a pod is not a great thing to do by default, but to be clear, you can do it in the meshConfig on your cluster if you desire. Most apps I've run across do not need to talk to other services in the mesh as part of their startup, but that depends on what you're up to, I imagine.

Contributor

@ldemailly ldemailly commented Apr 21, 2021

It's not just in-cluster traffic, it's all networking, and it helps adoption not to have gotchas.
