
keda-operator memory leak when the prometheus scaler has errors #5248

Closed
Tracked by #5275
GoaMind opened this issue Dec 4, 2023 · 29 comments · Fixed by #5293

@GoaMind

GoaMind commented Dec 4, 2023

Report

When the prometheus scaler has errors while fetching metrics, memory on the keda-operator starts to grow until it gets OOMKilled.

The behaviour is as follows:
[Grafana screenshot]

Installation of KEDA is done via the plain manifest: https://github.com/kedacore/keda/releases/download/v2.11.2/keda-2.11.2.yaml

Expected Behavior

Memory does not grow when any of the scalers have errors.

Actual Behavior

Memory grows when the prometheus scaler has errors (for example, when it fails to fetch metrics from prometheus).

Steps to Reproduce the Problem

  1. Deploy a service with a prometheus scaler whose address does not exist
        - type: prometheus
          metadata:
            query: sum(rate(rabbitmq_client_messages_published_total{service_name=~'kafka-api-events-to-rabbitmq'}[2m]))
            threshold: '200'
            serverAddress: https://non-existing-prometheus-url # that returns 404
  2. keda-operator will start pushing errors to stderr
  3. memory usage will start to grow (see the command sketch below)
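
To make the growth visible while reproducing, the operator's memory usage can be watched with something like the following (a sketch, assuming the default keda namespace and the app=keda-operator label from the official install manifest):

# Refresh keda-operator CPU/memory every 30s while the failing trigger is deployed
watch -n 30 kubectl top pod -n keda -l app=keda-operator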

Logs from KEDA operator

2023-12-04T14:38:18Z	ERROR	prometheus_scaler	error executing prometheus query	{"type": "ScaledObject", "namespace": "tooling", "name": "debug-service", "error": "prometheus query api returned error. status: 404 response: "}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/prometheus_scaler.go:359
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:139
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).GetScaledObjectMetrics
	/workspace/pkg/scaling/scale_handler.go:508
github.com/kedacore/keda/v2/pkg/metricsservice.(*GrpcServer).GetMetrics
	/workspace/pkg/metricsservice/server.go:47
github.com/kedacore/keda/v2/pkg/metricsservice/api._MetricsService_GetMetrics_Handler
	/workspace/pkg/metricsservice/api/metrics_grpc.pb.go:99
google.golang.org/grpc.(*Server).processUnaryRPC
	/workspace/vendor/google.golang.org/grpc/server.go:1343
google.golang.org/grpc.(*Server).handleStream
	/workspace/vendor/google.golang.org/grpc/server.go:1737
google.golang.org/grpc.(*Server).serveStreams.func1.1
	/workspace/vendor/google.golang.org/grpc/server.go:986

KEDA Version

2.12.1

Kubernetes Version

1.26

Platform

Amazon Web Services

Scaler Details

prometheus

Anything else?

No response

@GoaMind added the bug label Dec 4, 2023
@JorTurFer
Member

JorTurFer commented Dec 4, 2023

Hello,
Are you using default values for memory & cpu?
I'm trying v2.12.1 and I can't reproduce the issue (maybe it's solved, or the default values need to be updated). How many ScaledObjects do you have in the cluster? How many are failing?

I have deployed 10 ScaledObjects with pollingInterval: 1 using the prometheus scaler (all of them return 404). The memory looks stable:
[screenshot]

I'm going to keep it running all night just in case; it could need some hours. Is there any other step I can take to reproduce it?

I've been profiling the pod, looking for something weird, but I haven't seen anything

@JorTurFer
Member

After 8 hours, nothing has changed:
[screenshot]

The memory is stable even though I added 10 more failing ScaledObjects.

Now I'm going to downgrade KEDA to v2.11.2 to check whether I can reproduce the issue there and double-check whether it has since been solved.

@GoaMind
Author

GoaMind commented Dec 5, 2023

Good day Jorge,

This happens in all 10 clusters that we have, and the number of ScaledObjects varies from 12 to 110.
Memory starts to grow even if only 1 failing ScaledObject is added.
We observed this behaviour with KEDA versions 2.9.2 and 2.11.2 as well.
For the keda-operator we have increased the requested memory from 100Mi to 200Mi; the other params remain the same as in the original manifest:

        resources:
          limits:
            cpu: 1000m
            memory: 1000Mi
          requests:
            cpu: 100m
            memory: 200Mi
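
For reference, a bump like that can be applied on top of the stock manifest with something like this (a sketch; it assumes the Deployment is named keda-operator in the keda namespace, as in the official manifest):

# Raise only the operator's memory request; the limits from the manifest stay untouched
kubectl -n keda set resources deployment keda-operator --requests=memory=200Mi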

As an experiment I took one cluster that already had 12 ScaledObjects and added one more (with a failing prometheus trigger):

NAME                       SCALETARGETKIND      SCALETARGETNAME            MIN   MAX   TRIGGERS   AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
onboarding-debug-service   apps/v1.Deployment   onboarding-debug-service   3     10    cpu                         True    False    False      Unknown   10m

Full manifest for this ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  annotations:
    meta.helm.sh/release-name: onboarding-debug-service
    meta.helm.sh/release-namespace: tooling
  creationTimestamp: "2023-12-05T09:29:56Z"
  finalizers:
  - finalizer.keda.sh
  generation: 2
  labels:
    app: onboarding-debug-service
    app.kubernetes.io/managed-by: Helm
    repository: onboarding-debug-service
    scaledobject.keda.sh/name: onboarding-debug-service
  name: onboarding-debug-service
  namespace: tooling
  resourceVersion: "400763468"
  uid: 0809533e-9fa5-40a4-b310-80fa69e9df4d
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 10
    scalingModifiers: {}
  cooldownPeriod: 300
  maxReplicaCount: 10
  minReplicaCount: 3
  pollingInterval: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: onboarding-debug-service
  triggers:
  - metadata:
      type: Utilization
      value: "70"
    type: cpu
  - metadata:
      type: Utilization
      value: "10"
    type: memory
  - metadata:
      query: sum(rate(rabbitmq_client_messages_published_total{service_name=~'kafka-api-events-to-rabbitmq'}[2m]))
      serverAddress: https://prometheus-404-endpoint
      threshold: "200"
    type: prometheus
status:
  conditions:
  - message: ScaledObject is defined correctly and is ready for scaling
    reason: ScaledObjectReady
    status: "True"
    type: Ready
  - message: Scaling is not performed because triggers are not active
    reason: ScalerNotActive
    status: "False"
    type: Active
  - message: No fallbacks are active on this scaled object
    reason: NoFallbackFound
    status: "False"
    type: Fallback
  - status: Unknown
    type: Paused
  externalMetricNames:
  - s2-prometheus
  health:
    s2-prometheus:
      numberOfFailures: 48
      status: Failing
  hpaName: keda-hpa-onboarding-debug-service
  originalReplicaCount: 10
  resourceMetricNames:
  - cpu
  - memory
  scaleTargetGVKR:
    group: apps
    kind: Deployment
    resource: deployments
    version: v1
  scaleTargetKind: apps/v1.Deployment

And right after adding this one single scaler, memory utilisation started to grow steadily (from 35% to 43% in 15m):
[Grafana screenshot: KEDA Operator monitoring - K8s Resource Management dashboard, 2023-12-05 11:45]

@JorTurFer
Member

JorTurFer commented Dec 6, 2023

Thanks for the info,
I kept KEDA deployed with the same scenario and after 36h it looks exactly the same. It uses a stable ~180Mi (16 ScaledObjects, 8 of them getting a 404 from the endpoint and the other 8 with an invalid URL):
[screenshot]

And right after adding this one single scaler, memory utilisation started to grow steadily (from 35% to 43% in 15m)

Is this over the requests or over the limits? The original 40% could be 40Mi or 400Mi, and the current 7% could be 14Mi or 70Mi xD
I mean, a 14Mi memory increase could be just because it tries to reconnect, increasing the memory usage due to allocating resources for regenerating the internal cache.

I'm going to test the same scenario with KEDA v2.9.2 because we introduced several performance improvements and maybe 2.9.2 was affected by something.

@JorTurFer
Member

After 2 hours with v2.9.2, it's almost the same:
[screenshot]

@JorTurFer
Member

The latest KEDA version has an option for enabling the profiling port.
You can do it by setting an extra arg: --profiling-bind-address=:PORT
If you could enable the profiler, export the heap and send it to us, we could go deeper into your case. Don't enable the profiler on huge clusters because profiling can have a performance impact.
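
For example, with the port set to 8082 (any free port works), the heap could be exported roughly like this (a sketch; it assumes the Deployment is named keda-operator in the keda namespace and that the endpoint serves the standard Go net/http/pprof handlers):

# 1. Add the extra container arg to the operator Deployment: --profiling-bind-address=:8082
# 2. Forward the port and save the heap profile so it can be shared
kubectl -n keda port-forward deploy/keda-operator 8082:8082

# In another terminal:
curl -s http://localhost:8082/debug/pprof/heap -o keda-heap.pprof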

@JorTurFer
Member

Same behavior using KEDA v2.9.2; we'd need the memory dump to check the root cause.

@zroubalik
Member

@GoaMind Thanks for reporting! Could you please also share an example Deployment for the workload that you are scaling?

@JorTurFer do you have the same configuration of ScaledObjects?

@JorTurFer
Member

IDK, I hope so,
This is an example:

spec:
  cooldownPeriod: 1
  maxReplicaCount: 2
  minReplicaCount: 0
  pollingInterval: 1
  scaleTargetRef:
    name: prometheus-test-deployment
  triggers:
    - metadata:
        activationThreshold: '20'
        metricName: http_requests_total
        query: >-
          sum(rate(http_requests_total{app="prometheus-test-monitored-app"}[2m]))
        serverAddress: http://20.238.174.237
        threshold: '20'
      type: prometheus

@GoaMind , Could you confirm that this is similar to yours? The IP is public (and mine) so you can try the ScaledObject in your cluster if you want

@zroubalik
Member

IDK, I hope so, This is an example: […] Could you confirm that this is similar to yours? The IP is public (and mine) so you can try the ScaledObject in your cluster if you want

That would be great!

@GoaMind
Author

GoaMind commented Dec 8, 2023

Hey,

Here is the behaviour with memory consumption data points:
[Grafana screenshot: KEDA Operator monitoring (memory leak investigation) - K8s Resource Management dashboard, 2023-12-08 14:29]

To verify that the graph is accurate, I also checked on the K8s side. At one point:

NAME                                      CPU(cores)   MEMORY(bytes)
keda-admission-7744888c69-2sphk           1m           40Mi
keda-metrics-apiserver-599b5f957c-pfmxq   3m           31Mi
keda-operator-6d5686ff7c-94xnb            7m           967Mi

And after OOMKill:

NAME                                      CPU(cores)   MEMORY(bytes)
keda-admission-7744888c69-2sphk           2m           40Mi
keda-metrics-apiserver-599b5f957c-pfmxq   3m           31Mi
keda-operator-6d5686ff7c-94xnb            8m           71Mi

Describing the pod shows:

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 05 Dec 2023 17:14:47 +0200
      Finished:     Fri, 08 Dec 2023 11:12:28 +0200
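
(For reference, the outputs above come from the usual kubectl commands, roughly like this, assuming KEDA runs in the keda namespace:)

kubectl top pods -n keda
kubectl describe pod -n keda keda-operator-6d5686ff7c-94xnb | grep -A 5 'Last State'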

I will deploy the scaler provided by @JorTurFer to monitor whether KEDA behaves the same,
and will play with profiling at the beginning of next week.

@zroubalik
Member

@GoaMind great, thanks for the update!

@GoaMind
Author

GoaMind commented Dec 8, 2023

I have deployed the trigger proposed by @JorTurFer and there was no memory leak visible.

But after changing serverAddress from the IP (20.238.174.237) to a random DNS name that returns 404, the memory started to grow again:
[Grafana screenshot: KEDA Operator monitoring (memory leak investigation) - K8s Resource Management dashboard, 2023-12-08 17:16]

@zroubalik
Member

@GoaMind what happens if you put a random IP there that returns 404s?

@JorTurFer
Member

Could you share an example of the random DNS name? I'd like to replicate your case as closely as possible.

@JorTurFer
Member

I've added a ScaledObject like this:

    - metadata:
        activationThreshold: '20'
        metricName: http_requests_total
        query: >-
          sum(rate(http_requests_total{app="prometheus-test-monitored-app"}[2m]))
        serverAddress: https://google.es
        threshold: '20'
      type: prometheus

and the memory has increased a bit but it's still stable:
[screenshot]

Could you share with us a ScaledObject that I can use to replicate the scenario please?

@JorTurFer mentioned this issue Dec 11, 2023
@GoaMind
Author

GoaMind commented Dec 12, 2023

I was a bit wrong that it is reproducible with a random DNS name.

In fact, I was able to reproduce it only when calling our internal prometheus server, which is not available from outside.

I have tried to configure my personal server https://kedacore-test.hdo.ee/test to respond the same way. However, even though I was able to align all the headers and set wildcard SSL certs, I still cannot reproduce it against this server. I have compared ciphers and other configs, but still cannot figure out what the difference is.

Here is what I have checked so far (but without luck).

The full response of the internal prometheus server that causes the memory leak in keda-operator:

*   Trying xxx.xxx.xxx.xxx:443...
* Connected to XXXX (xxx.xxx.xxx.xxx) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=*.XXXX.XXX
*  start date: Nov  6 06:57:35 2023 GMT
*  expire date: Feb  4 06:57:34 2024 GMT
*  subjectAltName: host "XXXX" matched cert's "*.XXXX.XXX"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://XXXX/test
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: XXXX]
* [HTTP/2] [1] [:path: /test]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET /test HTTP/2
> Host: XXXX
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/2 404
HTTP/2 404
< content-length: 0
content-length: 0
< date: Tue, 12 Dec 2023 19:05:26 GMT
date: Tue, 12 Dec 2023 19:05:26 GMT

<
* Connection #0 to host XXXX left intact

Full spec of the SO that I used:

spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 10
    scalingModifiers: {}
  cooldownPeriod: 1
  maxReplicaCount: 2
  minReplicaCount: 0
  pollingInterval: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: onboarding-debug-service
  triggers:
  - metadata:
      activationThreshold: "20"
      query: sum(rate(http_requests_total{app="prometheus-test-monitored-app"}[2m]))
      serverAddress: https://XXXXXXX # With https://kedacore-test.hdo.ee/test it is not reproducible for now
      threshold: "20"
    type: prometheus

To replicate the internal server response with https://kedacore-test.hdo.ee/test I have configured Nginx (nginx-extras needs to be installed to make it work).
nginx.conf, in addition to the defaults:

http {
    server_tokens off;
    more_clear_headers Server;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers off;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256;
}

/etc/nginx/sites-enabled/test full conf:

server {
   listen 443 ssl http2;
   server_name kedacore-test.hdo.ee;
   ssl_certificate /etc/letsencrypt/live/kedacore-test.hdo.ee/fullchain.pem;
   ssl_certificate_key /etc/letsencrypt/live/kedacore-test.hdo.ee/privkey.pem;

   location / {
     more_clear_headers 'Content-Type';
     more_clear_headers 'last-modified';
     more_clear_headers 'etag';
     more_clear_headers 'accept-ranges';
     more_set_headers "Content-Length: 0"
     return 404;
   }
   location /api {
     more_clear_headers 'Content-Type';
     more_set_headers "Content-Length: 0"
     return 404;
   }
}

But no luck, despite the responses differing only in hosts, IPs and cipher.
I'm out of ideas for now.

I will try to play with profiling in the next few days and will drop an update once anything is figured out.

@JorTurFer
Member

We'd appreciate a profile, as with it we can go deeper into the issue.
The easiest way is using the main branch and enabling the profiler as an argument on the pod: https://github.com/kedacore/keda/blob/main/cmd/operator/main.go#L93
But in case this isn't possible (because you use another KEDA version, for example), you can take a look at this post: https://dev.to/tsuyoshiushio/enabling-memory-profiling-on-keda-v2-157g
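
(Once a heap profile has been captured, the standard Go tooling can be used to inspect it before sending it over; a rough sketch, assuming the profile was saved to keda-heap.pprof:)

# Show the top allocations in the terminal
go tool pprof -top keda-heap.pprof
# Or open an interactive view (flame graph, top table) in the browser
go tool pprof -http=:9999 keda-heap.pprof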

@GoaMind
Author

GoaMind commented Dec 15, 2023

@JorTurFer thank you for the information, I will check it shortly.

Could you please check if you can reproduce it with this trigger:

  - metadata:
      activationThreshold: "20"
      customHeaders: Host=abc.pipedrive.tools
      query: sum(rate(http_requests_total{app="prometheus-test-monitored-app"}[2m]))
      serverAddress: https://pimp.pipedrive.tools
      threshold: "20"
    type: prometheus

The key thing here is that you need to get a 404 without a body; that's why we use customHeaders here.
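
For reference, the kind of response this trigger gets can be checked with curl like this (a sketch; /api/v1/query is the Prometheus query API path the scaler calls, and the Host override mirrors what customHeaders does):

# Expect an HTTP 404 with content-length: 0 and no body
curl -si -H 'Host: abc.pipedrive.tools' 'https://pimp.pipedrive.tools/api/v1/query?query=up' | head -n 10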

@JorTurFer
Member

Let me check your trigger :)

@JorTurFer
Member

JorTurFer commented Dec 15, 2023

I'm not 100% sure... maybe it's not related... xD
[screenshot]

Now seriously, thanks for the report and the reproduction path. I confirm that I can reproduce the issue; I'll check it in depth later on.

@GoaMind
Author

GoaMind commented Dec 15, 2023

Thank you for the prompt checking. 🙇
I'm still curious why this trigger produces such behaviour, while the manually prepared address with a 404 error (https://kedacore-test.hdo.ee/test) does not leave such a footprint.

@JorTurFer
Member

It looks like after some time the consumption is stable:
[screenshot]

But it definitely looks weird, and I'll profile the workload to detect the cause.

@JorTurFer self-assigned this Dec 15, 2023
@JorTurFer
Member

I think I've found a possible problem. I'll draft a PR later on, but before merging it, would you be willing to test the fix if I build an image that contains it, @GoaMind?

@GoaMind
Author

GoaMind commented Dec 15, 2023

Hey, sure, I can test it once the docker image is available, if it is possible to go this way.

@JorTurFer
Member

Yeah, I'm preparing the PR and once I open it, I'll give you the docker tag 😄
thanks!

@JorTurFer
Member

Hey @GoaMind
This tag, ghcr.io/kedacore/keda-test:pr-5293-fe19d3a3233bef79ac7f53ba4f967a58b569f5f8, has been created based on this PR (so basically it's main + my PR).
Could you give it a try and tell us if the problem is solved?
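
One way to try it is to point the existing operator Deployment at the test image (a sketch; it assumes the Deployment and container are both named keda-operator, as in the official manifest):

kubectl -n keda set image deployment/keda-operator \
  keda-operator=ghcr.io/kedacore/keda-test:pr-5293-fe19d3a3233bef79ac7f53ba4f967a58b569f5f8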

@GoaMind
Author

GoaMind commented Dec 19, 2023

@JorTurFer apologies for the late reply.

I can confirm that I do not observe the memory leak with the image you provided.
Thank you so much for digging into this problem 🙇

@JorTurFer
Member

Nice!
The fix is already merged, so it'll be included as part of the next release 😁
