
Application experiences connection refused for all outbound requests during startup #2704

Closed
dwj300 opened this issue Apr 16, 2019 · 11 comments

Comments

dwj300 (Contributor) commented Apr 16, 2019

Bug Report

What is the issue?

(More of a question, or a request to improve the docs.) During initialization, our application makes a handful of HTTP calls to external services. While the proxy is still initializing (acquiring its certificate, etc.), these calls all fail with a connection refused error. We do have retries in place, but we exhaust them and then fail fatally. We could simply add more retries, but we were wondering what the recommended pattern is for waiting for the proxy to be ready before making these calls. Ideally, we could instruct Kubernetes not to start our application container until the proxy gives the OK, but I don't believe that is supported today. Should we probe the pod's health endpoint until it is ready? Is there a better way to do this?
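For instance, would something along these lines be the intended approach? Just a sketch, not something we run today: it assumes curl is available in the application image, that the proxy's admin port is the default 4191, and /path/to/app stands in for our real entrypoint.

command: ["/bin/sh", "-c"]
args:
  - |
    # Poll the proxy's readiness endpoint before exec'ing the real entrypoint.
    until curl -fsS http://localhost:4191/ready > /dev/null; do
      echo "waiting for linkerd-proxy..."
      sleep 1
    done
    exec /path/to/app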

How can it be reproduced?

Deploy a pod that makes some HTTP outbound calls (in our case, to the open internet).

Logs, error output, etc

Get https://<URL>: dial tcp <IP>:<PORT>: connect: connection refused

linkerd check output

linkerd check
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ control plane namespace exists
√ controller pod is running
√ can initialize the client
√ can query the control plane API

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ no invalid service profiles

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Environment

  • Kubernetes Version: 1.12
  • Cluster Environment: AKS
  • Host OS: Ubuntu 16.04
  • Linkerd version: edge-19.4.4


olix0r (Member) commented Apr 16, 2019

Thanks @dwj300. We're hopeful that Kubernetes will provide a better primitive so that there's a principled way to achieve this, but I have a workaround you can try in the meantime:

A few weeks ago, I wrote a little tool that you can use to prevent your application from running until the proxy is ready: https://github.com/olix0r/linkerd-await

I've tested it for myself and it seems to work well, but since this needs to be added to your application container images, it might not be an ideal solution... Ideally, we'd improve this to only block on readiness when linkerd is injected, but we don't yet add environment variables to non-proxy containers.

Let us know if this workaround is feasible for you and if you think it would be useful to pull into the project more officially.

campbel (Contributor) commented Apr 19, 2019

I'm also seeing this issue. It's not urgent, but fixing it would make deployments cleaner. Currently my container boots, fails to connect to a RabbitMQ server, and then restarts; the second attempt usually succeeds.

dwj300 (Contributor, Author) commented May 2, 2019

@olix0r sorry for the slow reply; I finally got around to trying this and it works great! Maybe just add a snippet to the README about recommended usage in Kubernetes. For us, we now build our images with COPY --from=olix0r/linkerd-await:v0.1.0 /linkerd-await / and then in our Kubernetes YAML we add:

command: ["/linkerd-await"]
args: ["--", "someBinary", "--some", "--args"]
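Spelled out a bit more, it lands in the pod spec roughly like this (just a sketch; the image name is a placeholder, and it assumes the COPY step above put /linkerd-await at the image root):

containers:
  - name: app
    image: registry.example.com/some-app:latest     # placeholder
    command: ["/linkerd-await"]                      # wait for the proxy, then exec the wrapped binary
    args: ["--", "someBinary", "--some", "--args"]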

stale bot commented Jul 31, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Jul 31, 2019
stale bot closed this as completed on Aug 15, 2019
poochwashere (Contributor) commented

Hi @olix0r, I was wondering if there is any update on this. We experience the same connection reset errors in our apps. I was reviewing the workaround, but I really do not want to put the binary in all of our container images.

grampelberg (Contributor) commented

@poochwashere unfortunately, there's not much Linkerd can do here. The options are:

eeeeeta commented Jul 10, 2020

Would it not be possible for the linkerd-init container to block until the sidecar was initialized? That seems like a simple solution that would fix this issue, but maybe I'm missing something here...

alpeb (Member) commented Jul 10, 2020

> Would it not be possible for the linkerd-init container to block until the sidecar was initialized? That seems like a simple solution that would fix this issue, but maybe I'm missing something here...

When a pod is launched, init containers like linkerd-init run first, and only when they're done are the normal containers started (the linkerd-proxy and your app containers). The problem is ensuring a start sequence (and also a shutdown sequence) among the normal containers; that's what the options outlined above try to address.
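To illustrate the ordering, an abridged sketch of an injected pod (image names are placeholders, not an actual manifest):

spec:
  initContainers:
    - name: linkerd-init        # runs to completion (iptables setup) before any container below starts
      image: init-image         # placeholder
  containers:
    - name: linkerd-proxy       # started alongside the app; no guarantee it is ready first
      image: proxy-image        # placeholder
    - name: app
      image: app-image          # placeholder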

Tolsto commented Jan 24, 2021

Would it be possible to add a post-start lifecycle hook to the linkerd-proxy container that blocks until the proxy is ready to serve requests?

olix0r (Member) commented Jan 24, 2021

@Tolsto The issue is that we need to block application containers from starting until the proxy has started. I don't think post-start hooks can do this. Per the docs:

> This hook is executed immediately after a container is created. However, there is no guarantee that the hook will execute before the container ENTRYPOINT.

So, even if we modified application containers with a hook that depends on the proxy, I don't think this could reliably block the container from starting.

That said, if you find something that works, please share an example!

Tolsto commented Jan 25, 2021

@olix0r I didn't mean using post-start hooks for the application containers, only for the linkerd-proxy container. Kubernetes starts containers sequentially, and a post-start hook blocks that process until the hook has completed (see the sketch below). This post describes the idea: https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74

The relevant code is: https://github.com/kubernetes/kubernetes/blob/537a602195efdc04cdf2cb0368792afad082d9fd/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L827-L830

and

https://github.com/kubernetes/kubernetes/blob/4c9e96c2388c51a48ea530b111ba7381872e7d7a/pkg/kubelet/kuberuntime/kuberuntime_container.go#L212-L227
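Applied to Linkerd, the idea would look roughly like this. Only a sketch: it assumes the proxy image ships a shell and curl (which may not be true) and that the admin port is the default 4191.

containers:
  - name: linkerd-proxy
    image: proxy-image          # placeholder
    lifecycle:
      postStart:
        exec:
          command:
            - /bin/sh
            - -c
            # the kubelet runs this hook to completion before starting the next container in the list
            - until curl -fsS http://localhost:4191/ready; do sleep 1; done
  - name: app
    image: app-image            # placeholder; only started once the hook above returns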
