When Topology Aware Hints are disabled, kube-proxy shouldn't spam the logs #124341

diranged · 2024-04-16T22:05:22Z

What happened?

We have a mix of smaller and larger clusters, and for our larger clusters it is critical that Topology Aware Routing is used on some of our high-volume services. It works great, and its fallback mode is reasonable.

The problem is this line: https://github.com/kubernetes/kubernetes/blob/v1.28.9/pkg/proxy/topology.go#L168-L171

With this line in place, on a reasonably sized cluster, we emit hundreds of millions of these log messages daily:

I0416 21:50:58.359353       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.360686       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.360944       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.503272       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.504681       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.505831       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"

A LogQL query shows that on one cluster, we had over 1.2 billion log events from this message. (Screenshot not preserved.)

Here's an hour of the data. (Screenshot not preserved.)

What did you expect to happen?

I understand that it's useful to tell people that zone-aware routing isn't working, but doing it in log messages seems less than useful. I don't know anyone who operates clusters and monitors these log messages for errors; operators would rather use metrics to alert on such behavior.

I expect that the Kubernetes Service will report that Topology Aware Routing is or is not working (which it does) via the kubectl describe command - and that's it. Other than that, I expect kube-proxy to consider this as some kind of a debug message and not emit it as an info level message.

How can we reproduce it (as minimally and precisely as possible)?

Create a service with service.kubernetes.io/topology-mode: auto in the annotations, and only put up one or two pods. Then start sending traffic to the service and check the kube-proxy logs.

Anything else we need to know?

We're going to drop these messages with Promtail filtering, but we really don't want them at all: writing them to local disk takes up time on the hosts, they make the kubectl logs kube-proxy-... command nearly useless, and overall they are a waste of space.
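As an illustration of the stopgap described above, here is a minimal sketch in Go of dropping the noisy message from a log stream. It is not Promtail configuration and not part of kube-proxy; `dropNoisy` and `noisyMsg` are hypothetical names, and the sketch assumes that a plain substring match on the message text is enough to identify the lines. For simplicity it buffers all input in memory, which would not suit a real high-volume stream.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// noisyMsg is the message text quoted in this issue.
const noisyMsg = "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"

// dropNoisy (hypothetical helper) returns the lines that do not contain
// noisyMsg, plus a count of the lines that were dropped.
func dropNoisy(lines []string) (kept []string, dropped int) {
	for _, line := range lines {
		if strings.Contains(line, noisyMsg) {
			dropped++
			continue
		}
		kept = append(kept, line)
	}
	return kept, dropped
}

func main() {
	// Read all of stdin; a real filter would stream instead of buffering.
	var lines []string
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	kept, dropped := dropNoisy(lines)
	for _, line := range kept {
		fmt.Println(line)
	}
	// Report the drop count on stderr so stdout stays a clean log stream.
	fmt.Fprintf(os.Stderr, "dropped %d lines\n", dropped)
}
```

Piping the output of kubectl logs kube-proxy-... through a filter like this makes the remaining log lines readable again, but it does nothing about the cost of writing the messages on the host, which is why the filter is only a workaround.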

Kubernetes version

% kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.7-eks-b9c9ed7

Cloud provider

AWS EKS 1.28

OS version

AWS Bottlerocket 1.19.2

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

mengjiao-liu · 2024-04-17T02:53:36Z

/sig network

Someone else ran into the same problem and raised this message's log verbosity to 2 in PR #123322. The change will be included in Kubernetes 1.30, which is due to be released in one day. Once 1.30 is released and you have upgraded the cluster to it, the message will not appear as long as kube-proxy's log verbosity is below 2.
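For readers unfamiliar with log verbosity levels, here is a minimal sketch in Go of the gating idea described above. It does not use klog or the kube-proxy source; `verboseLogger`, `logAt`, and `simulateCalls` are hypothetical names, and the only assumption is that a message logged at level N is written when the configured verbosity is at least N and skipped otherwise.

```go
package main

import (
	"fmt"
	"io"
)

// verboseLogger is a hypothetical stand-in for a verbosity-gated logger.
type verboseLogger struct {
	verbosity int       // configured verbosity, analogous to a --v flag
	out       io.Writer // destination for emitted lines
	emitted   int       // number of lines actually written
}

// logAt writes msg only if the configured verbosity is at least level.
func (l *verboseLogger) logAt(level int, msg string) {
	if l.verbosity < level {
		return // gated out: nothing is written
	}
	l.emitted++
	fmt.Fprintln(l.out, msg)
}

const missingHintMsg = "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"

// simulateCalls logs the message `calls` times at the given level and
// returns how many lines were emitted. Output is discarded.
func simulateCalls(verbosity, level, calls int) int {
	l := &verboseLogger{verbosity: verbosity, out: io.Discard}
	for i := 0; i < calls; i++ {
		l.logAt(level, missingHintMsg)
	}
	return l.emitted
}

func main() {
	// Level 0 behaves like an ungated info message: always emitted.
	fmt.Println("level 0, verbosity 0:", simulateCalls(0, 0, 1000))
	// At level 2 the message is silent until verbosity reaches 2.
	fmt.Println("level 2, verbosity 1:", simulateCalls(1, 2, 1000))
	fmt.Println("level 2, verbosity 2:", simulateCalls(2, 2, 1000))
}
```

Under this model, moving the message from level 0 to level 2 removes it entirely for any cluster running with verbosity below 2, while keeping it available to anyone who raises the verbosity to debug topology hints.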

mengjiao-liu · 2024-04-17T03:00:38Z

/area logging

diranged · 2024-04-17T13:43:39Z

> /sig network
>
> Someone has encountered the same problem and increased the log verbosity to 2 in PR #123322. This change will be included in Kubernetes version 1.30. You can wait 1 day for the 1.30 version to be released. After version 1.30 is released, you can upgrade the cluster to version 1.30, then as long as the log verbosity is less than 2 this problem can be avoided.

Any chance we can see this back-ported to 1.28 and 1.29? It seems like a relatively safe thing to do, and 1.30 is going to take a while to get upgraded to (especially because we use EKS, which of course is a little bit delayed on the release cycle).

aojea · 2024-04-17T16:49:34Z

> Any chance we can see this back-ported to 1.28 and 1.2

Is anyone willing to take this on?

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md

/help
/triage accepted

k8s-ci-robot · 2024-04-17T16:49:36Z

@aojea:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

Why are we solving this issue?
To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
Does this issue have zero to low barrier of entry?
How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

liangyuanpeng · 2024-04-18T01:43:21Z

/assign

aojea · 2024-04-30T07:17:37Z

/close

cherry picks were merged

k8s-ci-robot · 2024-04-30T07:17:42Z

@aojea: Closing this issue.

diranged · 2024-04-30T18:13:36Z

Thanks everyone!

diranged added the kind/bug label Apr 16, 2024

k8s-ci-robot added the needs-sig and needs-triage labels Apr 16, 2024

k8s-ci-robot added the sig/network label and removed the needs-sig label Apr 17, 2024

k8s-ci-robot added the area/logging label Apr 17, 2024

k8s-ci-robot assigned liangyuanpeng Apr 18, 2024

This was referenced Apr 18, 2024

Automated cherry pick of #123322: add log verbosity to endpoint topology hint #124355

Merged

Automated cherry pick of #123322: add log verbosity to endpoint topology hint #124354

Merged

k8s-ci-robot closed this as completed Apr 30, 2024
