
Empty response warning with latest kube-client (v0.26) #146

Closed
JorTurFer opened this issue Jan 24, 2023 · 16 comments · Fixed by kubernetes/kubernetes#115978
Labels
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@JorTurFer
Contributor

JorTurFer commented Jan 24, 2023

Recent changes in https://github.com/kubernetes/client-go have changed the behavior of the discovery client.
These changes were released as part of v0.26 and produce a scenario where tooling like kubectl or helm shows a warning during resource discovery:

E1212 15:54:39.565250   21101 memcache.go:255] couldn't get resource list for external.metrics.k8s.io/v1beta1: Got empty response for: external.metrics.k8s.io/v1beta1

As a first action, I checked whether we were doing something wrong in KEDA, but after debugging helm and kubectl I found that the cause isn't related to KEDA itself (prometheus-adapter produces the same error, but about custom.metrics).
After the changes, the discovery client lists all the API groups and also all the resources inside each group, so a fresh installation of KEDA/prometheus-adapter without any metric exposed automatically shows this warning in every up-to-date tool, which is annoying (and looks hideous).

In KEDA, we return the currently available metrics in response to listing requests, and I have seen that other tools like prometheus-adapter do the same for external and custom metrics.

Is this approach correct? I mean, should we return the metrics instead of the metric type (like metrics.k8s.io does)? Exposing all available metrics as part of listing requests can produce timeouts if the number of metrics is huge, and it produces this annoying (and unrelated) warning when no metrics for scaling are available.
I can see that this approach is also used in the example code, but if this is how we are supposed to expose metrics, the discovery client should ignore these errors.
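
For context, the dynamic listing is roughly something like this today (a simplified sketch, not KEDA's actual code; it assumes the ExternalMetricsProvider interface from sigs.k8s.io/custom-metrics-apiserver and an illustrative externalProvider type):

import "sigs.k8s.io/custom-metrics-apiserver/pkg/provider"

type externalProvider struct {
	registeredMetrics map[string]struct{} // metric name -> registered (illustrative)
}

// Every registered metric becomes its own entry in the discovery document,
// so the list is empty on a fresh install and can grow very large on a busy cluster.
func (p *externalProvider) ListAllExternalMetrics() []provider.ExternalMetricInfo {
	infos := make([]provider.ExternalMetricInfo, 0, len(p.registeredMetrics))
	for name := range p.registeredMetrics {
		infos = append(infos, provider.ExternalMetricInfo{Metric: name})
	}
	return infos
}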

Could you provide any help/guidance on this topic? We have received some notifications about this warning in the KEDA project, and users think it's caused by a KEDA problem, but we follow the available example to develop our metrics server.

Would it be correct, from a custom-metrics-apiserver usage point of view, to return a fixed collection of generic types (in our case, scaledobjects), like the metrics.k8s.io API does (with 'pods' and 'nodes')? This change would solve the timeouts in huge clusters and also the issue when no metric is available. Considering that these metrics aren't browsable like metrics.k8s.io's are, maybe this could be acceptable from the custom-metrics-apiserver implementation's point of view.

@JorTurFer
Contributor Author

@olivierlemasle @dgrisonnet FYI

@jbilliau-rcd

Agreed. We run Keda 2.7.1 and prometheus-adapter on our clusters and this is what a Helm upgrade command looks like while it's running:

[screenshot: Helm upgrade output interleaved with these warnings]

Developers freak out because they think something is hosed... not a very good user experience.

@dgrisonnet
Member

Is this approach correct? I mean, should we return the metrics instead of the metric type (like metrics.k8s.io does)? Exposing all available metrics as part of listing requests can produce timeouts if the number of metrics is huge, and it produces this annoying (and unrelated) warning when no metrics for scaling are available.

This is the approach we are taking everywhere else for custom and external metrics, so KEDA is definitely doing things properly.

That said, I am unsure this is the correct approach from a Kubernetes standpoint. AFAIK we are the only ones doing something like this in the Kubernetes ecosystem. Aggregated APIs usually have a known list of Resources they want to expose; that list rarely changes, and it never changes dynamically (e.g. PodMetrics/NodeMetrics). But in our case we are exposing metrics as if they were Kubernetes types, and we are doing that dynamically, meaning that at runtime our API may grow without having to restart the aggregated apiserver. We are essentially doing CRDs the wrong way.

In the past I questioned that approach when investigating kubernetes-sigs/prometheus-adapter#292, which is a bug about the infamous spammy log line:

apiserver was unable to write a JSON response: http2: stream closed

My findings were that this error started to appear at around 34 Resources exposed by the API. I never got the time to investigate the implications further, but it definitely didn't sound right.

--

To answer the question of whether we could just expose the ExternalMetricValue type instead of the list of all external metrics: the answer is no in the current state of things. We are currently relying on the fact that the metric name is passed in the URL to be able to determine which metric the HPA is trying to get:
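
A rough sketch of that lookup (hypothetical demoProvider; it assumes the ExternalMetricsProvider interface from sigs.k8s.io/custom-metrics-apiserver, and exact signatures vary between versions):

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/metrics/pkg/apis/external_metrics"
	"sigs.k8s.io/custom-metrics-apiserver/pkg/provider"
)

type demoProvider struct {
	metrics map[string]external_metrics.ExternalMetricValue // metric name -> last value (illustrative)
}

// The only hint about which metric the HPA wants is the metric name taken from
// the request URL (/apis/external.metrics.k8s.io/v1beta1/namespaces/<ns>/<metric-name>),
// which the apiserver library hands to the provider as info.Metric.
func (p *demoProvider) GetExternalMetric(ctx context.Context, namespace string,
	metricSelector labels.Selector, info provider.ExternalMetricInfo) (*external_metrics.ExternalMetricValueList, error) {
	v, ok := p.metrics[info.Metric]
	if !ok {
		return nil, fmt.Errorf("external metric %q not found", info.Metric)
	}
	return &external_metrics.ExternalMetricValueList{
		Items: []external_metrics.ExternalMetricValue{v},
	}, nil
}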

I can hardly see a way around that with the way aggregated APIs work today, but there is definitely something wrong with the current approach.

I will investigate what can be done to improve the situation, but maybe we could get a first quick win by silencing these logs for our APIs or moving them to a higher verbosity level.

@JorTurFer
Contributor Author

JorTurFer commented Jan 24, 2023

Yes, but I'm talking only about using a fixed value for listing operations. I have done this draft based on another contributor's PR.
If you look at the changes:
[screenshot of the draft diff]
Basically, we return a fixed slice with always the same content. This solves both the timeout issue and the zero-length issue, and it works, at least in KEDA (I have tested the change with e2e tests):

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1"
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"external.metrics.k8s.io/v1beta1","resources":[{"name":"scaledobjects","singularName":"","namespaced":true,"kind":"ExternalMetricValueList","verbs":["get"]}]}

The routing based on the metric name hasn't changed, just the response of ListAllExternalMetrics
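
In code, the draft boils down to something like this (a minimal sketch with an illustrative provider type, assuming the ExternalMetricsProvider interface from sigs.k8s.io/custom-metrics-apiserver):

// Discovery always advertises a single fixed resource, independent of how many
// metrics are currently registered, so the response is never empty and never
// grows with the number of metrics.
func (p *externalProvider) ListAllExternalMetrics() []provider.ExternalMetricInfo {
	return []provider.ExternalMetricInfo{{Metric: "scaledobjects"}}
}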

@dgrisonnet
Member

dgrisonnet commented Jan 24, 2023

I guess it makes sense that it works, because the HPA controller most likely doesn't care about discovery at all. I would expect it to just send a request to the endpoint of the metric that the HPA targets.

I wonder if that change would be possible to make here. I have no clue if some users actually rely on the discovery endpoint.

I revived a related Kubernetes issue and pinged some people; let's hear what they say, maybe we can find a solution that wouldn't break any behavior.

@JorTurFer
Contributor Author

I revived a related Kubernetes issue and pinged some people; let's hear what they say, maybe we can find a solution that wouldn't break any behavior.

Sure, I think this topic is too complex to solve in only a few minutes, as it is part of discovery. Let's hear other opinions so we can solve this properly.

I have no clue if some users actually rely on the discovery endpoint.

I don't think so, as the endpoint itself doesn't support listing; the only allowed verb is get. AFAIK, there isn't any way to introspect the resources just using the namespaces, as the list verb isn't allowed, just get.

BTW, thanks for your help 🙇

@dgrisonnet
Member

There is a specific /apis endpoint for discovery that is available to any client with the right certificates.

@JorTurFer
Contributor Author

JorTurFer commented Jan 25, 2023

Do you mean executing kubectl get --raw "/apis/external.metrics.k8s.io"?

I can't see any /apis inside external.metrics.k8s.io. What I meant is that even if you discover external.metrics.k8s.io from /apis, the response of /apis/external.metrics.k8s.io/v1beta1 is a list of available metrics with their supported verbs, and the only supported verb is get, so discovery ends there: you can't do kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/xxxxxx" to list the metrics available in a namespace, like you can with pods/nodes. For example, in an arbitrary namespace:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/keda/random-metric-name"     
Error from server (NotFound): the server could not find the requested resource

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/keda/pods"     
{"kind":"PodMetricsList","apiVersion":"metrics.k8s.io/v1beta1","metadata":{},"items":[{"metadata":{"name":"keda-admission-5ff995d9b5-t7ss8","namespace":"keda","creationTimestamp":"2023-01-25T13:20:10Z","labels":{"app":"keda-admission-webhooks","name":"keda-admission-webhooks","pod-template-hash":"5ff995d9b5"}},"timestamp":"2023-01-25T13:19:28Z","window":"1m0.394s","containers":[{"name":"keda-admission-webhooks","usage":{"cpu":"173025n","memory":"16964Ki"}}]},{"metadata":{"name":"keda-metrics-apiserver-8c6db6798-fp7kc","namespace":"keda","creationTimestamp":"2023-01-25T13:20:10Z","labels":{"app":"keda-metrics-apiserver","pod-template-hash":"8c6db6798"}},"timestamp":"2023-01-25T13:19:29Z","window":"1m3.599s","containers":[{"name":"keda-metrics-apiserver","usage":{"cpu":"3448706n","memory":"43872Ki"}}]},{"metadata":{"name":"keda-operator-545fbd6565-kwxfp","namespace":"keda","creationTimestamp":"2023-01-25T13:20:10Z","labels":{"app":"keda-operator","name":"keda-operator","pod-template-hash":"545fbd6565"}},"timestamp":"2023-01-25T13:19:34Z","window":"1m12.775s","containers":[{"name":"keda-operator","usage":{"cpu":"2060106n","memory":"54264Ki"}}]}]}

This is what I meant by:

I don't think so, as the endpoint itself doesn't support listing; the only allowed verb is get. AFAIK, there isn't any way to introspect the resources just using the namespaces, as the list verb isn't allowed, just get.

The metrics we are registering aren't available in all namespaces.

@logicalhan

/triage accepted
/assign @dgrisonnet

@felixrb86

The same thing happens to me, just running a simple command...

[screenshot: kubectl output showing the same warning]

@JorTurFer
Contributor Author

I think we need to find a solution soon, because the amount of noise will increase over time; helm is already affected in the latest version (because they have bumped their deps).

@secustor

I think we need to find a solution soon, because the amount of noise will increase with the time, helm/helm#11772 affected in latest version (because they have bumped their deps)

I'm seeing this with plain Kubectl too.

@tonedefdev

I think we need to find a solution soon, because the amount of noise will increase with the time, helm/helm#11772 affected in latest version (because they have bumped their deps)

I'm seeing this with plain Kubectl too.

We're also seeing it with any version of kubectl greater than v1.24 in clusters where KEDA is installed.

@JorTurFer
Contributor Author

Independently of the action taken upstream, what do you think about exposing a fixed metric based on metric types instead of dynamically calculating the available metrics, @dgrisonnet @logicalhan?

This could also solve the timeout problem when there are many metrics.

@JorTurFer
Contributor Author

Sorry for being a pain, but we will release KEDA v2.10 and we would like to patch this somehow. Currently we have 2 options:

Could you suggest something to us, @dgrisonnet @logicalhan?

@wamak9

wamak9 commented Mar 6, 2023

Can we keep this ticket open until the fix is deployed and in working condition?
