Traffic got routed to an endpoint that does not belong to the target service #11763

Closed
Wenliang-CHEN opened this issue Dec 14, 2023 · 9 comments

Comments

@Wenliang-CHEN
Contributor

What is the issue?

Hey all, here is what we have seen so far: our alerting system detected some failing traffic.

From the linkerd-proxy logs of the outbound pod, it seems there is a problem with target resolution.

A pod that does not belong to the target service was resolved as a target endpoint.

From the same linkerd-proxy logs, we could also see that the proxy was trying a lot of different endpoints at the same time.

We are not sure where to look yet, and it doesn't happen often. We also have an alert for this situation, so I will update here if it happens again.

Meanwhile, please let me know what you think. Thanks!

How can it be reproduced?

We are not completely sure, but it seems this issue happens after a larger reshuffling of internal IPs, e.g. after deploying a workload that has 400 pods.

Logs, error output, etc

{
	"content": {
		"timestamp": "2023-12-14T10:00:13.454Z",
		"message": "Failed to connect",
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},
				{
					"ns": "prod",
					"port": "80",
					"name": "service"
				},
				{
					"name": "endpoint",
					"addr": "10.250.155.107:80"
				}
			],
			"level": "WARN",
			"fields": {
				"error": "Connection refused (os error 111)"
			},
			"timestamp": "[ 74858.318395s]",
			"target": "linkerd_reconnect"
		}
	}
}

The "proxy"

				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},

here is the IP of the target service, service A.

And the "endpoint"

				{
					"name": "endpoint",
					"addr": "10.250.155.107:80"
				}

here is a pod that belongs to a different service, service B.

When this happened, the endpoints_available metrics showed no change for either service A or service B.
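
One way to cross-check which workload actually owns the resolved endpoint IP is something like the following (just a sketch; "service-a" stands in for the real name of service A, and the namespace is assumed to be prod as in the log above):

# Which pod currently holds the endpoint IP from the log?
# (status.podIP is a supported field selector for pods)
kubectl get pods -A -o wide --field-selector status.podIP=10.250.155.107

# Which endpoints does Kubernetes itself list for the target service?
kubectl get endpointslices -n prod -l kubernetes.io/service-name=service-a -o wide

If the first command returns a pod of service B while the EndpointSlices of service A never contained that IP, that would point at stale discovery state on the Linkerd side rather than at the Kubernetes Endpoints themselves.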

Output of linkerd check -o short:

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all pods
√ cluster networks contains all services

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-12-28T07:32:15Z
    see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2024-01-01T13:43:33Z
    see https://linkerd.io/2.14/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
√ sp-validator webhook has valid cert
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2024-01-04T08:56:13Z
    see https://linkerd.io/2.14/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
√ policy-validator webhook has valid cert
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2023-12-28T09:31:39Z
    see https://linkerd.io/2.14/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    is running version 2.14.3 but the latest stable version is 2.14.6
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
    is running version 2.14.3 but the latest stable version is 2.14.6
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-7565484496-67p6h (v2.210.2)
	* linkerd-destination-7565484496-6bm7j (v2.210.2)
	* linkerd-destination-7565484496-f5lbf (v2.210.2)
	* linkerd-destination-7565484496-kfbpc (v2.210.2)
	* linkerd-destination-7565484496-ld969 (v2.210.2)
	* linkerd-identity-78fb66464f-5ptjq (v2.210.2)
	* linkerd-identity-78fb66464f-6h256 (v2.210.2)
	* linkerd-identity-78fb66464f-9qjvf (v2.210.2)
	* linkerd-identity-78fb66464f-t5h6w (v2.210.2)
	* linkerd-identity-78fb66464f-vbrwm (v2.210.2)
	* linkerd-proxy-injector-5c55846f8c-26ksp (v2.210.2)
	* linkerd-proxy-injector-5c55846f8c-7q4ch (v2.210.2)
	* linkerd-proxy-injector-5c55846f8c-9h9mc (v2.210.2)
	* linkerd-proxy-injector-5c55846f8c-qmhkq (v2.210.2)
	* linkerd-proxy-injector-5c55846f8c-sfwfd (v2.210.2)
	* linkerd-sp-validator-85575f7c47-27fvt (v2.186.0-light-metrics)
	* linkerd-sp-validator-85575f7c47-2t8v7 (v2.186.0-light-metrics)
	* linkerd-sp-validator-85575f7c47-dn2qh (v2.186.0-light-metrics)
	* linkerd-sp-validator-85575f7c47-dt57f (v2.186.0-light-metrics)
	* linkerd-sp-validator-85575f7c47-h2dnt (v2.186.0-light-metrics)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-7565484496-67p6h running v2.210.2 but cli running stable-2.14.3
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
√ multiple replicas of control plane pods

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ can initialize the client
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
‼ linkerd-viz pods are injected
    could not find proxy container for prometheus-scrape-1-5585795fbd-fpcxq pod
    see https://linkerd.io/2.14/checks/#l5d-viz-pods-injection for hints
‼ viz extension pods are running
    container "linkerd-proxy" in pod "prometheus-scrape-1-5585795fbd-fpcxq" is not ready
    see https://linkerd.io/2.14/checks/#l5d-viz-pods-running for hints
√ viz extension proxies are healthy
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* grafana-76f7b4b7f9-9pbg4 (v2.210.2)
	* metrics-api-dcb7f4875-jjzj4 (v2.210.2)
	* metrics-api-dcb7f4875-xbkg6 (v2.210.2)
	* tap-7dfcf4686-76m9p (v2.210.2)
	* tap-injector-699b466bbd-whdzr (v2.210.2)
	* web-59b65979b5-5hjxg (v2.210.2)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    grafana-76f7b4b7f9-9pbg4 running v2.210.2 but cli running stable-2.14.3
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints
√ prometheus is installed and configured correctly
√ viz extension self-check

linkerd-smi
-----------
√ linkerd-smi extension Namespace exists
√ SMI extension service account exists
√ SMI extension pods are injected
√ SMI extension pods are running
√ SMI extension proxies are healthy

Status check results are √

Environment

  • EKS
  • Kubernetes 1.25
  • Linkerd 2.14.3

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe

@adleong
Member

adleong commented Dec 14, 2023

Hi @Wenliang-CHEN

This sounds a bit similar to an issue we had where the destination controller could become locked and stop processing service discovery updates. However, this bug was fixed in stable-2.14.2 and should not affect you in stable-2.14.3. In order to rule out that possibility, you could take a look at the endpoints_updates counter metric exposed by the destination controller:

linkerd diagnostics controller-metrics | grep endpoints_updates

You should see this counter incremented when the endpoints of a service change. If, instead, this counter remains at the same value, it means that the destination controller is not processing updates for some reason.
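
For example, to watch that counter over time while endpoints are churning, a simple loop is enough (just a sketch):

# Sample the destination controller's endpoints_updates counter every 30s;
# the value should keep increasing whenever Endpoints/EndpointSlices change.
while true; do
  date
  linkerd diagnostics controller-metrics | grep endpoints_updates
  sleep 30
done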

In stable-2.14.4 we added *_informer_lag_secs histogram metrics to the destination controller for even more visibility. If you upgrade to stable-2.14.4 or later you can use these histograms to see if there is a substantial lag between when endpoints are updated in Kubernetes vs when the destination controller processes those updates.
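
Once on stable-2.14.4 or later, something along these lines should surface those histograms (sketch; the grep pattern is assumed from the *_informer_lag_secs naming above):

# Dump the informer lag histograms from the destination controller.
# Counts landing only in the high "le" buckets would indicate the controller
# is processing updates well after Kubernetes records them.
linkerd diagnostics controller-metrics | grep informer_lag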

@Wenliang-CHEN
Contributor Author

Hey @adleong , thanks for the reply.

And yes, I do see the endpoints_updates counter incrementing after the deployment of the target service (service A). Based on that, I assume the destination controller was processing updates.

A couple of things worth mentioning:

  • The issue happens about 20 minutes after the deployment of the target service.
  • If we restart the deployment that owns the outbound pod, the issue is resolved.

Does it change anything?

As an action item, we will try to upgrade to stable-2.14.4 and take a look at the *_informer_lag_secs metrics as well.

Meanwhile, if we find anything new, we will report back in this thread.

Thanks!

@kflynn
Member

kflynn commented Dec 21, 2023

@Wenliang-CHEN Any joy trying with stable-2.14.4? 🙂

@Wenliang-CHEN
Contributor Author

Wenliang-CHEN commented Dec 21, 2023

Hey @kflynn, not yet, as we are around the Christmas holidays. I will let you know 😄

There has not been another instance since I reported the issue, but to be safe, we are still observing...

@kflynn
Member

kflynn commented Dec 21, 2023

@Wenliang-CHEN Keeping fingers crossed for you -- enjoy the holiday! 🙂

@kflynn
Member

kflynn commented Jan 4, 2024

@Wenliang-CHEN Happy new year!! Just wanted to make sure this was still on your radar. 🙂

@Wenliang-CHEN
Contributor Author

Hey @kflynn happy new year!

And yes, we have not forgotten about this. We just upgraded to 2.14.9, and so far we have not received any reports of the same issue.

Hopefully the upgrade somehow fixes it. We will keep monitoring throughout February; if there are no further reports, I think we can close this for now. Thanks!

@Wenliang-CHEN
Contributor Author

Okay, the issue has happened again.

We are now able to get the linkerd.endpoints_updates, linkerd.endpointslices_informer_lag_seconds.bucket, and linkerd.endpoints_informer_lag_seconds.bucket metrics.

It seems they follow a pattern: linkerd.endpointslices_informer_lag_seconds.bucket moves together with linkerd.endpoints_updates:

[Screenshots, 2024-02-08 14:54: linkerd.endpointslices_informer_lag_seconds.bucket and linkerd.endpoints_updates over the same time window]

And linkerd.endpoints_informer_lag_seconds.bucket is always 0:

[Screenshot, 2024-02-08 14:54: linkerd.endpoints_informer_lag_seconds.bucket, flat at 0]

We are not sure how to interpret this. Do they mean anything in particular, or are they totally normal?


stale bot commented May 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label May 10, 2024
@stale stale bot closed this as completed May 26, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 26, 2024