When Topology Aware Hints are disabled, kube-proxy shouldn't spam the logs #124341

Closed
diranged opened this issue Apr 16, 2024 · 9 comments
Labels: area/logging, help wanted, kind/bug, sig/network, triage/accepted

@diranged

What happened?

We have a mix of smaller and larger clusters, and for our larger clusters it is critical that Topology Aware Routing is used on some of our high-volume services. It works great, and its fallback mode is reasonable.

The problem is this line: https://github.com/kubernetes/kubernetes/blob/v1.28.9/pkg/proxy/topology.go#L168-L171

With this line in place, on a reasonably sized cluster, we emit hundreds of millions of these log messages daily:

I0416 21:50:58.359353       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.360686       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.360944       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.503272       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.504681       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"
I0416 21:50:58.505831       1 topology.go:169] "Skipping topology aware endpoint filtering since one or more endpoints is missing a zone hint"

A LogQL query shows that on one cluster, we had over 1.2 billion log events from this message:
[Screenshot: LogQL query result]

Here's an hour of the data:
[Screenshot: one hour of the data]

What did you expect to happen?

I understand that it's useful to tell people that zone-aware routing isn't working, but doing it in log messages seems less than useful. I don't know anyone who operates clusters and monitors these log messages for errors; operators would rather use metrics to alert on such behavior.

I expect that the Kubernetes Service will report that Topology Aware Routing is or is not working (which it does) via the kubectl describe command - and that's it. Other than that, I expect kube-proxy to consider this as some kind of a debug message and not emit it as an info level message.

How can we reproduce it (as minimally and precisely as possible)?

Create a Service with service.kubernetes.io/topology-mode: auto in its annotations and run only one or two pods behind it. Then start sending traffic to the Service and check the kube-proxy logs.
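A minimal sketch of the reproduction manifest, generated from Python so it can be piped to kubectl as JSON. The Service name, the `app` selector label, and port 80 are assumptions made for illustration; only the annotation comes from the report above.

```python
import json


def build_service(name: str, port: int = 80) -> dict:
    """Return a Service manifest that opts in to Topology Aware Routing.

    The name, selector label, and port are hypothetical placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            # The annotation named in the reproduction steps.
            "annotations": {"service.kubernetes.io/topology-mode": "auto"},
        },
        "spec": {
            # Assumes the backing pods carry the label app=<name>.
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(build_service("topology-demo"), indent=2))
```

Saved under a hypothetical name such as make_service.py, the output could be applied with `python make_service.py | kubectl apply -f -`. With only one or two pods behind the Service, some endpoints should end up without zone hints, which is the condition the log message reports.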

Anything else we need to know?

We're going to drop these messages with Promtail filtering, but we really don't want them emitted at all: writing them to local disk takes time on the hosts, they make the kubectl logs kube-proxy-... command nearly useless, and overall they are a waste of space.
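The idea of dropping the message downstream can be sketched as a plain line filter. This is not Promtail configuration, only an illustration of the same substring match applied to a log stream; the function name is hypothetical.

```python
import sys
from typing import Iterable, Iterator

# The exact message text from the kube-proxy log lines quoted above.
NOISY = (
    "Skipping topology aware endpoint filtering since one or more "
    "endpoints is missing a zone hint"
)


def drop_noisy(lines: Iterable[str], needle: str = NOISY) -> Iterator[str]:
    """Yield every line that does not contain the noisy message."""
    for line in lines:
        if needle not in line:
            yield line


if __name__ == "__main__":
    # Read log lines from stdin and write the surviving lines to stdout.
    for kept in drop_noisy(sys.stdin):
        sys.stdout.write(kept)
```

It could be used as `kubectl logs <kube-proxy-pod> | python drop_noisy.py`, where the pod name and script name are placeholders. A filter like this only hides the lines from the reader; kube-proxy still formats and writes every message, which is the cost the paragraph above objects to.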

Kubernetes version

% kubectl version
Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.7-eks-b9c9ed7

Cloud provider

AWS EKS 1.28

OS version

AWS Bottlerocket 1.19.2

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@diranged added the kind/bug label Apr 16, 2024
@k8s-ci-robot added the needs-sig and needs-triage labels Apr 16, 2024
@mengjiao-liu
Member

mengjiao-liu commented Apr 17, 2024

/sig network

Someone encountered the same problem and raised the verbosity level of this message to 2 in PR #123322. The change will be included in Kubernetes 1.30, which is due to be released in one day. Once you upgrade the cluster to 1.30, the problem is avoided as long as kube-proxy's log verbosity is below 2.
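How a verbosity threshold suppresses the message can be sketched with a toy logger. kube-proxy itself is written in Go and uses klog, where the gated call has the form `klog.V(2).InfoS(...)`; the Python class and function names below are hypothetical and only mimic that behavior.

```python
from typing import List

MSG = (
    "Skipping topology aware endpoint filtering since one or more "
    "endpoints is missing a zone hint"
)


class VerbosityLogger:
    """Toy logger: a message is kept only if its level <= the configured verbosity."""

    def __init__(self, verbosity: int) -> None:
        self.verbosity = verbosity
        self.emitted: List[str] = []

    def info(self, level: int, msg: str) -> bool:
        """Record msg if the logger is verbose enough; return whether it was kept."""
        if level > self.verbosity:
            return False
        self.emitted.append(msg)
        return True


def simulate(verbosity: int, calls: int) -> int:
    """Count how many level-2 messages survive at the given verbosity."""
    logger = VerbosityLogger(verbosity)
    for _ in range(calls):
        logger.info(2, MSG)
    return len(logger.emitted)


if __name__ == "__main__":
    for v in (0, 1, 2):
        print(f"verbosity={v}: {simulate(v, calls=1000)} messages emitted")
```

In this sketch a logger configured with verbosity 0 or 1 records nothing for a level-2 message, while verbosity 2 or higher records every call, which matches the statement that the problem is avoided when the verbosity is below 2.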

@k8s-ci-robot added the sig/network label and removed the needs-sig label Apr 17, 2024
@mengjiao-liu
Member

/area logging

@diranged
Author

> /sig network
>
> Someone has encountered the same problem and increased the log verbosity to 2 in PR #123322. This change will be included in Kubernetes version 1.30. You can wait 1 day for the 1.30 version to be released. After version 1.30 is released, you can upgrade the cluster to version 1.30, then as long as the log verbosity is less than 2 this problem can be avoided.

Any chance we can see this back-ported to 1.28 and 1.29? It seems like a relatively safe thing to do, and 1.30 is going to take a while to get upgraded to (especially because we use EKS, which of course is a little bit delayed on the release cycle).

@aojea
Member

aojea commented Apr 17, 2024

> Any chance we can see this back-ported to 1.28 and 1.2

Is anyone willing to take this on?

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md

/help
/triage accepted

@k8s-ci-robot
Contributor

@aojea:
This request has been marked as needing help from a contributor.



@k8s-ci-robot added the triage/accepted and help wanted labels and removed the needs-triage label Apr 17, 2024
@liangyuanpeng
Contributor

/assign

@aojea
Member

aojea commented Apr 30, 2024

/close

The cherry-picks were merged.

@k8s-ci-robot
Contributor

@aojea: Closing this issue.


@diranged
Author

Thanks everyone!


5 participants