
Network not available first milliseconds of the pod #9454

Closed

ahmetb opened this issue Oct 22, 2018 · 7 comments

ahmetb commented Oct 22, 2018

Describe the bug

When I deploy Istio 1.0.2 on GKE, I notice that the network is unavailable during the first milliseconds of my main application container's lifetime.

This is causing issues with some of the libraries I use. For example, if I import the Stackdriver Tracing exporter library, it tries to make a network call to the GCE metadata server (169.254.169.254); since that call fails, the exporter assumes I'm running outside GCE and never exports anything.

(For me this is primarily an issue with the Stackdriver client libraries for Tracing, Profiler, Debugger, etc., but I assume it would impact other programs as well.)
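
For context, here is roughly how that detection goes wrong (a minimal sketch, not the exporter's exact code; it assumes the cloud.google.com/go/compute/metadata package, which is one common way Go programs detect GCE):

```go
package main

import (
	"fmt"

	"cloud.google.com/go/compute/metadata"
)

func main() {
	// metadata.OnGCE probes the metadata server. If the sidecar's
	// iptables rules are already in place but Envoy isn't ready yet,
	// the probe fails, the library concludes it isn't running on GCE,
	// and exporters gated on this check stay disabled for the lifetime
	// of the process.
	if !metadata.OnGCE() {
		fmt.Println("not on GCE: exporter stays disabled")
		return
	}
	fmt.Println("on GCE: exporter can read project ID, zone, etc.")
}
```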

I think this is happening because there's no startup ordering between the istio-proxy container and my main application container.

Expected behavior
Network should be available when my program starts executing.

Steps to reproduce the bug
I don't have a minimal repro at the moment, but I could spend a few hours putting one together if the problem is unclear or not reproducible on your end. Because of this, we ended up adding retries to such network calls at the beginning of our programs in the https://github.com/GoogleCloudPlatform/microservices-demo/ repository (a sketch of that retry loop follows the notes below).

  • This is noticeable only in Go programs (probably because they start up very fast). My other services (Java, Python, C#) don't expose the problem, presumably because their startup takes long enough that the first milliseconds don't matter. That said, I haven't tried C/C++/Rust etc., but I assume languages with similarly low startup overhead would be affected too.

  • This problem doesn't happen on GKE without Istio. It only shows up with Istio, and it's fairly reproducible (>50% of the time). I suspect the OS/process scheduling is non-deterministic: sometimes istio-proxy starts faster than my main process, so the app correctly detects it's on GCE and works. But about half the time it doesn't.
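
The retries we added look roughly like this (a sketch; initTracing is a placeholder for whatever call needs the network):

```go
package main

import (
	"errors"
	"log"
	"time"
)

// initTracing is a placeholder for whatever startup call needs the
// network (e.g. creating a Stackdriver trace exporter).
func initTracing() error {
	return errors.New("metadata server unreachable")
}

// initWithRetry retries network-dependent initialization a few times,
// so that a sidecar that isn't ready yet doesn't permanently disable
// tracing for the process.
func initWithRetry() error {
	var err error
	for attempt := 1; attempt <= 5; attempt++ {
		if err = initTracing(); err == nil {
			return nil
		}
		log.Printf("init attempt %d failed: %v; retrying", attempt, err)
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return err
}

func main() {
	if err := initWithRetry(); err != nil {
		log.Printf("giving up on tracing init: %v", err)
	}
}
```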

Version
Kubernetes 1.9, Istio 1.0.2

Installation
Create a vanilla GKE 1.9 cluster and apply istio-demo.yaml from the release tarball.

Environment
Google Kubernetes Engine

Cluster state
Since I applied the generic istio-demo.yaml on an empty/stock cluster, I'm omitting the cluster dump for now. Let me know if it's needed.

costinm (Contributor) commented Oct 30, 2018

Yes, we are well aware of this issue and working on a fix, but it will take some time.
There are discussions on the Kubernetes side (@nmittler is pushing) to add container start ordering or something similar.
We're also working on a CNI plugin, which can solve the problem as well.

The root problem is that the sidecar and the app start at the same time. There are a few workarounds and ways to reduce the impact, but no full fix yet.

esnible (Contributor) commented Feb 2, 2019

A workaround until this functionality is implemented: put the following into your .yaml:

command: ["/bin/bash", "-c"]
args: ["until curl --head localhost:15000 ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; ./startup.sh"] # (replace startup.sh with actual startup command.)
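
In context, this override goes on the application container in the pod template, roughly like this (a sketch; the container name, image, and ./startup.sh are placeholders):

```yaml
# Sketch: wait for the Envoy admin port (15000) before starting the app.
# Container name, image, and ./startup.sh are placeholders.
spec:
  containers:
  - name: myapp
    image: gcr.io/my-project/myapp:latest
    command: ["/bin/bash", "-c"]
    args:
    - |
      until curl --head localhost:15000; do
        echo Waiting for Sidecar
        sleep 3
      done
      echo Sidecar available
      ./startup.sh
```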

mandarjog (Contributor) commented:

You have two options:

  1. Exclude 169.254.169.254 from sidecar capture by using the traffic.sidecar.istio.io/excludeOutboundIPRanges: "169.254.169.254" annotation (see the example below).
  2. Run a startup script before your app that checks whether the endpoints you require are up, and waits or crashes the app if they are not.
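
For option 1, the annotation goes on the pod template, roughly like this (a sketch; the Deployment/container names and image are placeholders, and some Istio versions expect the value in CIDR form, e.g. 169.254.169.254/32):

```yaml
# Sketch: exclude the metadata server IP from sidecar traffic capture.
# Deployment/container names and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # CIDR form; the bare IP from the comment above may also be accepted.
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "169.254.169.254/32"
    spec:
      containers:
      - name: myapp
        image: gcr.io/my-project/myapp:latest
```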

stale bot commented Jun 18, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 18, 2019

ahmetb (Author) commented Jun 18, 2019

Please add the label to prevent closing; this is still an issue.
Sadly, it looks like I can't self-serve here with a /remove-lifecycle rotten comment the way I can in the Kubernetes repos.

sdake (Member) commented Jul 20, 2019

Network-specific problem; removing the environments label.

sdake added the area/networking/cni (Istio CNI-related issues) label Jul 20, 2019
howardjohn (Member) commented:

This is a duplicate of #11130. It's definitely something we want to fix and are working on fixing, but let's track it in a single issue to keep things simple.
