
Network not available first milliseconds of the pod #9454

Closed

ahmetb opened this issue Oct 22, 2018 · 7 comments

ahmetb commented Oct 22, 2018

Describe the bug

When I deploy Istio 1.0.2 on GKE, I notice that the network is unavailable during the first milliseconds of my main application container's lifetime.

This is causing issues with some of the libraries I use. For example, if I import the Stackdriver Tracing exporter library, it tries to make a network call to the GCE metadata server (169.254.169.254); since that call fails, the exporter assumes I'm running outside GCE and never exports anything.

(For me this is primarily an issue with the Stackdriver client libraries for Tracing, Profiler, Debugger, etc., but I assume it would impact other programs as well.)
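
For context, here is roughly how that detection goes wrong (a minimal sketch, not the exporter's exact code; it assumes the cloud.google.com/go/compute/metadata package, which is one common way Go programs detect GCE):

```go
package main

import (
	"fmt"

	"cloud.google.com/go/compute/metadata"
)

func main() {
	// metadata.OnGCE probes the metadata server. If the sidecar's
	// iptables rules are already in place but Envoy isn't ready yet,
	// the probe fails, the library concludes it isn't running on GCE,
	// and exporters gated on this check stay disabled for the lifetime
	// of the process.
	if !metadata.OnGCE() {
		fmt.Println("not on GCE: exporter stays disabled")
		return
	}
	fmt.Println("on GCE: exporter can read project ID, zone, etc.")
}
```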

I think this is happening because there's no startup ordering between the istio-proxy container and my main application container.

Expected behavior
Network should be available when my program starts executing.

Steps to reproduce the bug
I don't have a minimal repro at the moment, but I could spend a few hours putting one together if the problem is unclear or not reproducible on your end. Because of this, we ended up adding retries to such network calls at the beginning of our programs in the https://github.com/GoogleCloudPlatform/microservices-demo/ repository (a sketch of that retry loop follows the notes below).

  • This is noticeable only in Go programs (probably because they start up very fast). My other services (Java, Python, C#) don't expose the problem, presumably because their startup takes long enough that the first milliseconds don't matter. That said, I haven't tried C/C++/Rust etc., but I assume languages with similarly low startup overhead would be affected too.

  • This problem doesn't happen on GKE without Istio. It only shows up with Istio, and it's fairly reproducible (>50% of the time). I suspect the OS/process scheduling is non-deterministic: sometimes istio-proxy starts faster than my main process, so the app correctly detects it's on GCE and works. But about half the time it doesn't.
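
The retries we added look roughly like this (a sketch; initTracing is a placeholder for whatever call needs the network):

```go
package main

import (
	"errors"
	"log"
	"time"
)

// initTracing is a placeholder for whatever startup call needs the
// network (e.g. creating a Stackdriver trace exporter).
func initTracing() error {
	return errors.New("metadata server unreachable")
}

// initWithRetry retries network-dependent initialization a few times,
// so that a sidecar that isn't ready yet doesn't permanently disable
// tracing for the process.
func initWithRetry() error {
	var err error
	for attempt := 1; attempt <= 5; attempt++ {
		if err = initTracing(); err == nil {
			return nil
		}
		log.Printf("init attempt %d failed: %v; retrying", attempt, err)
		time.Sleep(time.Duration(attempt) * time.Second)
	}
	return err
}

func main() {
	if err := initWithRetry(); err != nil {
		log.Printf("giving up on tracing init: %v", err)
	}
}
```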

Version
Kubernetes 1.9, Istio 1.0.2

Installation
Create a vanilla GKE 1.9 cluster and apply istio-demo.yaml from the release tarball.

Environment
Google Kubernetes Engine

Cluster state
Since I applied the generic istio-demo.yaml on an empty/stock cluster, I'm omitting the cluster dump for now. Let me know if it's needed.

costinm (Contributor) commented Oct 30, 2018

Yes, we are well aware of this issue and working on a fix, but it will take some time.
There are discussions on the Kubernetes side (@nmittler is pushing) to add container start ordering or something similar.
We're also working on a CNI plugin, which can solve the problem as well.

The root problem is that the sidecar and the app start at the same time. There are a few workarounds and ways to reduce the impact, but no full fix yet.

esnible (Contributor) commented Feb 2, 2019

A workaround until this functionality is implemented: put the following into your .yaml:

command: ["/bin/bash", "-c"]
args: ["until curl --head localhost:15000 ; do echo Waiting for Sidecar; sleep 3 ; done ; echo Sidecar available; ./startup.sh"] # (replace startup.sh with actual startup command.)
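
In context, this override goes on the application container in the pod template, roughly like this (a sketch; the container name, image, and ./startup.sh are placeholders):

```yaml
# Sketch: wait for the Envoy admin port (15000) before starting the app.
# Container name, image, and ./startup.sh are placeholders.
spec:
  containers:
  - name: myapp
    image: gcr.io/my-project/myapp:latest
    command: ["/bin/bash", "-c"]
    args:
    - |
      until curl --head localhost:15000; do
        echo Waiting for Sidecar
        sleep 3
      done
      echo Sidecar available
      ./startup.sh
```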

mandarjog (Contributor) commented:

You have two options:

  1. Exclude 169.254.169.254 from sidecar capture by using the traffic.sidecar.istio.io/excludeOutboundIPRanges: "169.254.169.254" annotation (see the example below).
  2. Run a startup script before your app that checks whether the endpoints you require are up, and waits or crashes the app if they are not.
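
For option 1, the annotation goes on the pod template, roughly like this (a sketch; the Deployment/container names and image are placeholders, and some Istio versions expect the value in CIDR form, e.g. 169.254.169.254/32):

```yaml
# Sketch: exclude the metadata server IP from sidecar traffic capture.
# Deployment/container names and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        # CIDR form; the bare IP from the comment above may also be accepted.
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "169.254.169.254/32"
    spec:
      containers:
      - name: myapp
        image: gcr.io/my-project/myapp:latest
```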

stale bot commented Jun 18, 2019

This issue has been automatically marked as stale because it has not had activity in the last 90 days. It will be closed in the next 30 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 18, 2019

ahmetb (Author) commented Jun 18, 2019

Please add the label to prevent closing; this is still an issue.
Sadly, it looks like I can't self-serve here with a /remove-lifecycle rotten comment the way I can in the Kubernetes repos.

sdake (Member) commented Jul 20, 2019

Network-specific problem; removing the environments label.

sdake added the area/networking/cni (Istio CNI-related issues) label Jul 20, 2019
howardjohn (Member) commented:

This is a duplicate of #11130. It's definitely something we want to fix and are working on fixing, but let's track it in a single issue to keep things simple.
