Linkerd proxy randomly failing to start on a fresh linkerd install #5681
Comments
Adding more logs from a fresh Alicloud install.
Local (Minikube) install:
So the proxy fails to get certified in Alicloud. I don't see any ERROR or WARN in the logs of any of the linkerd pods.
@Patanouk I'm not familiar with alicloud container kubernetes. If you have access to the kubernetes api-server logs, can you share those? It might also help to see the output from
I tried again. Here is my current pod status:

```
NAME                                    READY   STATUS    RESTARTS   AGE
linkerd-controller-7fd676b57c-jfbpd     1/2     Running   0          14m
linkerd-destination-5b987c797f-8944p    1/2     Running   0          14m
linkerd-grafana-595b8f95b-5mxk5         1/2     Running   0          14m
linkerd-identity-7698cc6b64-hstpl       2/2     Running   0          14m
linkerd-prometheus-674695458c-j7kch     1/2     Running   0          14m
linkerd-proxy-injector-d5c75475-p8fpb   1/2     Running   0          14m
linkerd-sp-validator-8f794b4fd-bt7fn    1/2     Running   0          14m
linkerd-tap-744784cf94-n2zvv            2/2     Running   0          14m
linkerd-web-7c86967466-mn8br            2/2     Running   0          14m
```
Thanks for the report, @Patanouk! At a glance, this might be related to another issue we are investigating, #5599, which also reports issues connecting to pods that start slowly. @hawkw does this look related to you? Do you think this might be reproducible by adding a sleep to the control plane pods?
@Patanouk do you happen to have logs from the proxy in the
Yes, I do
Logs from the identity container:
Logs from the

Adding the logs from the
@Patanouk thanks for sharing the detailed logs. It appears that the controller proxy is unable to resolve the SRV record via DNS. For instance:
I don't know enough about alicloud's default DNS configuration to guess why we'd be getting NXDomain responses for these SRV record lookups, but this is definitely the problem.
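For context on what such a lookup is asking for: per the Kubernetes DNS spec, a named port on a (headless) service gets an SRV record of the form `_port._proto.service.namespace.svc.<cluster-domain>`. The sketch below only builds such a name to show the shape; the service and port names used are illustrative, not necessarily the exact record the proxy queries:

```python
# Sketch: shape of the SRV names Kubernetes cluster DNS serves for a
# service's named port (per the Kubernetes DNS spec). The service/port
# names in the test below are illustrative examples.
def k8s_srv_name(port, proto, service, namespace, cluster_domain="cluster.local"):
    return f"_{port}._{proto}.{service}.{namespace}.svc.{cluster_domain}"
```

If the cluster DNS serves records under a different cluster domain than the one the proxy assumes, every lookup of this form will come back NXDomain.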
Do you know if your cluster has a custom domain? I can reproduce this issue by creating a cluster with a custom domain (i.e. other than the default). Otherwise, I'd suggest trying to run
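One way to check whether the cluster uses a custom domain (a sketch, assuming you can read a pod's `/etc/resolv.conf`): the kubelet writes the cluster domain into the `search` suffixes, so parsing that line reveals it. The helper name and the sample content in the test are my own, not from this thread:

```python
# Sketch: infer the cluster domain from a pod's /etc/resolv.conf.
# The kubelet puts "<ns>.svc.<domain> svc.<domain> <domain>" on the
# search line, so the suffix after "svc." is the cluster domain.
def cluster_domain_from_resolv(text):
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "search":
            for name in parts[1:]:
                if name.startswith("svc."):
                    return name[len("svc."):]
    return None  # no search line found; domain unknown
```

A value other than `cluster.local` here would point at the custom-domain reproduction described above.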
Thanks everyone for the help. It's Chinese New Year here, so I will check back next Thursday. @olix0r I don't think the cluster has a custom domain name. I already checked that, since I saw other open tickets related to a custom domain name.
Quick update here
The issue is probably something related to Alicloud, but I'm not knowledgeable enough to debug it further :/ We ultimately went with Istio (sorry), so this ticket can most likely be closed. Thx everyone for your help here.
Appreciate the update @Patanouk!
Bug Report
What is the issue?
Installation method: `linkerd install`, `helm install`, and a local `helm install` with the fetched chart all trigger the same behaviour.

The `linkerd-proxy` containers are randomly failing their readiness check. They have two endpoints for probes, `/live` and `/ready`:
- `/live` always returns a 200 status code
- `/ready` returns a 503 status code for some of the pods

See below for the pods in the `linkerd` namespace. Doing a rollout restart of the pods with a non-started proxy doesn't help.
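The `/live` vs `/ready` behaviour described above can be sketched with a toy server. This is an illustration of Kubernetes probe semantics only, not Linkerd's actual proxy code; the `ProxyState.certified` flag merely stands in for the proxy having finished its startup (here, obtaining its identity):

```python
# Sketch (not Linkerd's code): /live answers 200 whenever the process is
# up, while /ready answers 503 until a startup condition is met.
import http.server
import threading

class ProxyState:
    certified = False  # stands in for "identity certified" having happened

class ProbeHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/live":
            self.send_response(200)  # liveness: the process is running
        elif self.path == "/ready":
            # readiness: 200 only once startup has completed
            self.send_response(200 if ProxyState.certified else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

def start_server():
    # Bind an ephemeral port and serve probes in a background thread.
    server = http.server.HTTPServer(("127.0.0.1", 0), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With this split, a pod whose startup condition never completes stays alive (no restarts from the liveness probe) but never becomes ready, which matches the `1/2 Running` pods reported above.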
The pods with `2/2` containers running are not always the same ones, but the `linkerd-identity` pod always has a correctly started proxy.

How can it be reproduced?
Hard to say, considering the installation works fine locally with the same helm chart. It seems to be related to the startup speed of the pod: the slowest pods get a non-functional `linkerd-proxy`.

Logs, error output, etc

`linkerd check` output

Environment
Possible solution

The `identity` component fails to validate the identity for some of the `linkerd-proxy` containers. Here is a line of log from the `linkerd-proxy` container of the `controller` pod:

The output of `grep -q "Certified identity"` matches the status of the proxy (pods with this line of log have a correctly started `linkerd-proxy`).
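That grep check can be scripted across pods. A minimal sketch, assuming the proxy logs have already been fetched as strings (e.g. via `kubectl logs <pod> linkerd-proxy`) and keyed by pod name; `certified_pods` is a hypothetical helper, not a Linkerd tool:

```python
# Sketch: mirrors `grep -q "Certified identity"` over in-memory logs.
# logs_by_pod maps pod name -> that pod's linkerd-proxy log text.
def certified_pods(logs_by_pod):
    return {pod for pod, log in logs_by_pod.items()
            if "Certified identity" in log}
```

Pods missing from the returned set are the ones whose `/ready` probe would be expected to keep failing.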
Additional context
The issue seems related to the startup speed of the pods. According to my non-scientific tests, the `linkerd-proxy` containers start correctly if the pod has a 'fast' startup (e.g. less than 10 seconds?). I also tried to fiddle with the `initialDelaySeconds` values of the livenessProbe checks in the helm chart, but that didn't seem to help.
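For reference, the fields being tuned live in the pod spec's probe configuration. The values below are illustrative examples, not the chart's defaults, and port 4191 is assumed here as the proxy's admin port:

```yaml
# Illustrative probe config (field names from the Kubernetes pod spec);
# values are examples only, not the linkerd2 chart's defaults.
readinessProbe:
  httpGet:
    path: /ready
    port: 4191
  initialDelaySeconds: 10
livenessProbe:
  httpGet:
    path: /live
    port: 4191
  initialDelaySeconds: 10
```

Note that `initialDelaySeconds` only postpones the first probe; it does not help if `/ready` keeps returning 503 indefinitely, which matches the observation that tuning it made no difference.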