
Kubernetes SD consuming too much/leaking memory #2685

Closed
fabxc opened this Issue May 8, 2017 · 19 comments

fabxc commented May 8, 2017

We added support for 1.6 to the Prometheus operator and attempted switching to it.
This turned out to be very problematic. With a target heap size of 1.5GB we have a memory limit of 2GB. About 20 (stable) targets are monitored at a load of 2k samples/sec and ~30k in-memory series.

Prometheus 1.6 consistently grows its memory usage to 2GB and gets OOM killed. The urgency score is very unstable and randomly shoots up to 1, sometimes staying there for hours without any apparent reason (again, the environment is very stable), and as a consequence it stops ingesting any data.

[screenshot: memory usage graph]

The bottom line in the graph is a 1.5 server, for comparison. I'm not sure whether this relates to GOGC=40 or some other changes that were made, but practically speaking, this version seems completely unusable for some reason. As we haven't gotten significant other reports yet, I wonder whether it's something we are doing wrong, but given the environment, I have no idea what that could possibly be.
What supports that is that 1.5 also gets OOM killed way more often than I'd expect at that load – but it recovers more or less immediately and doesn't block ingestion for hours. All memory issues of <=1.5 aside, even then >2GB seems excessive for that load.

This is the 1.6 pod manifest in case it shows some misconfiguration: https://gist.github.com/brancz/ae4f6e6767999718a09d83d52c882c49

fabxc commented May 8, 2017

We have had similar reports in the operator repo and told people to just allocate more resources. But as we have now experienced it ourselves, something else must be going wrong: coreos/prometheus-operator#326

beorn7 commented May 8, 2017

I think the urgency score behaves reasonably, given the circumstances, which are that this Prometheus server is using too much RAM. So urgency goes up to 1 and ingestion stops, all to reduce memory usage and prevent OOMing. The urgency calculation in 1.6 works completely differently from 1.5. The latter does not incorporate heap size at all, so it will happily OOM without a dent in the urgency score.

What we have to find out is why so much RAM is used even with this modest load. It could be memory-hungry queries, a memory-hungry LevelDB, … The first step is to look at process_resident_memory_bytes and at max_over_time and min_over_time of go_memstats_alloc_bytes. Then compare those to prometheus_local_storage_open_head_chunks, prometheus_local_storage_chunks_to_persist, and prometheus_local_storage_memory_chunks. If you post the plots here, I'll help interpret them.
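Concretely, the alloc queries would be something like max_over_time(go_memstats_alloc_bytes[1h]) and min_over_time(go_memstats_alloc_bytes[1h]); the 1h window is an arbitrary choice, pick whatever matches your plot resolution.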

It could also be a "Prometheus cannot count that low" issue. The container manifest has only a 1GiB target heap size. Perhaps, even with the modest load, there is so much baseline memory usage that a 1GiB heap size (and 1.5GiB or even 2GiB total RAM) is not enough. Having said that, we are running a couple of 1GiB target heap size Prometheis here at SC, and they seem to do just fine. The two smallest are configured at 1.08 GiB target heap size. They show quite different patterns:

  1. Around 1.0GiB peak heap size (as configured), around 1.2GiB peak RSS, about 400k memory chunks, 15k series in memory, 10 targets, 125 samples/s.
  2. Around 950MiB peak heap size (weird that it is less than configured; perhaps there are shorter peaks not caught by the plot), 1.0GiB peak RSS, 650k memory chunks, 8k series in memory, 5 targets, 260 samples/s.

In general, that is an even lighter load than in your case. I'm not sure that explains it already. The difference might also be in the queries (incl. rule evaluation), relabeling, a larger LevelDB size because of longer/more labels, whatever… As said, the memory plots described above would be helpful to look at.

beorn7 commented May 8, 2017

I found another one that gets a bit closer to your load:

Target heap size 1.3GiB, peak heap size just below that as expected, peak RSS 1.55GiB, 750k memory chunks, 60k series in memory, 60 targets, 1.4k samples/s.

This one is doing just fine, too. All examples are EC2 instances. Non-containerized…

brancz commented May 8, 2017

process_resident_memory_bytes: [screenshot]

max_over_time(go_memstats_alloc_bytes): [screenshot]

min_over_time(go_memstats_alloc_bytes): [screenshot]

prometheus_local_storage_open_head_chunks: [screenshot]

prometheus_local_storage_chunks_to_persist: [screenshot]

prometheus_local_storage_memory_chunks: [screenshot]

beorn7 commented May 8, 2017

First result (a byproduct): the prometheus_local_storage_open_head_chunks metric has a bug and goes negative... I guess that happens if a chunk is closed due to a time-out. Perhaps we decrement it twice or something. Will look into it... see #2687

beorn7 commented May 8, 2017

Preliminary analysis:

I have to guess a bit which one is 1.5 and which one is 1.6 (not visible in the legend), and I also have to guess when the OOMs happen, but I'm pretty sure I guessed right. So:

1.5 just goes up and up in RAM usage until it OOMs. This is correlated with an increase in memory chunks, but the RAM increases way more than can be justified by the memory chunks. I.e. you max out at about 125k memory chunks while using 2.0GiB RSS (heap ranging between 1.0GiB and 1.8GiB). That's a 14x overhead in terms of RSS, which is quite a lot. (I have only seen an overhead like that with an extremely large index due to an excessive number of labels and long label names and values, but not necessarily high series cardinality.) This is really weird; something must be going horribly wrong here.

First conclusion: this is a problem that exists with 1.5 as well. I don't know what is happening, but the different behavior of 1.6 can be explained by its different strategy for dealing with this problem. See the next paragraph.

So what does 1.6 do? RAM goes up and up and up, so when it gets close to the limits, 1.6 evicts all the chunks it can (you see the memory chunk count dropping). This helps a bit in terms of memory usage; you see the heap size drop. But it only makes a little dent, and memory usage still increases after the drop. Since there are no more chunks to evict, 1.6 simply stops ingestion completely, and the number of memory chunks eventually drops to 0. However, the heap still grows!!! No ingestion, no chunks in memory, but still something takes more and more heap space. Eventually, even 1.6 OOMs, and everything starts from the beginning.

You have to find out what takes so much heap, even when no ingestion is happening anymore and all memory chunks have been evicted. 1.6 makes that easy, as it actually gets into the state described above, while in 1.5 you still have the "noise" of ingestion and memory chunks. If you take a heap profile of a 1.6 server at the time it has practically no memory chunks anymore, you should clearly see where the heap is consumed.
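
For illustration, a minimal way to fetch such a profile programmatically (just a sketch: the localhost:9090 address is an assumption, and curl or go tool pprof pointed at /debug/pprof/heap does the same job):

// fetch_heap_profile.go: save a heap profile from a running Prometheus
// server for offline inspection with `go tool pprof heap.pprof`.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Adjust the URL to wherever your Prometheus web endpoint is reachable.
	resp, err := http.Get("http://localhost:9090/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// Write the raw profile to disk for later analysis.
	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
}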

brancz commented May 8, 2017

FTR prometheus-k8s-b-0 is v1.5.2 and prometheus-k8s-0 is v1.6.1.

fabxc commented May 8, 2017

Thanks for the thorough analysis @beorn7.
So we agree that something is fishy about the setup in general, but it doesn't seem to be configuration. The explanation of 1.6's behavior makes sense. I guess it leans more towards denial of service, but given the circumstances we are already in at this point, that is not a problem in itself.

We'll look into more detailed heap profiles then.

fabxc changed the title from "Prometheus 1.6 runs unstable" to "Kubernetes SD consuming too much/leaking memory" on May 10, 2017

fabxc commented May 10, 2017

We also ran a 2.0 server and it runs into the same issues. A memory profile there was rather interesting, because there's almost nothing to see aside from the k8s SD.

[screenshot: heap profile]

So there's obviously a problem with the SD. We are receiving about 15 updates/sec on service endpoints across 8 endpoint configs. That alone doesn't seem to justify the k8s client decoders holding over 800MB of memory, though.
prombench also uses endpoint configs and shows no indication whatsoever of using remotely that much memory for them, even though it has hundreds of targets instead of just a dozen.

So while we are certainly duplicating subscriptions across SD configs, that alone shouldn't really be responsible for what we are seeing here. I renamed the issue as it's obviously a different problem.

fabxc commented May 10, 2017

This seems to mostly occur when monitoring endpoints of k8s components (scheduler, controller manager, ...), which receive an update basically every second (without any actual changes in the endpoint list).

What's getting updated (every second or less) is an annotation in the control plane components:
control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"ip-172-20-35-164","leaseDurationSeconds":15,"acquireTime":"2017-05-10T10:04:34Z","renewTime":"2017-05-10T10:24:41Z","leaderTransitions":0}'

The current theory is that those updates trigger us to rebuild the targets from those endpoints, which should only hit our caches. But maybe the caches hold the raw data, every cache access triggers a decoding, and it's just that decoding is incredibly memory hungry. That would explain why we get endpoint events but the memory usage shows up in pods.
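
To illustrate that theory with a deliberately simplified sketch (this is not the actual client-go or Prometheus SD code; the types and names here are made up): a cache that stores raw bytes pays the decode cost and allocations on every read, while a cache that stores the decoded object pays them only once per update.

// Hypothetical sketch of the suspected difference, NOT the real cache code.
package endpointcache

import "encoding/json"

// Endpoints stands in for the real Kubernetes Endpoints object.
type Endpoints struct {
	Name      string   `json:"name"`
	Addresses []string `json:"addresses"`
}

// RawStore keeps raw bytes: every Get re-decodes, so frequent target
// rebuilds would allocate heavily even if nothing actually changed.
type RawStore struct{ raw map[string][]byte }

func NewRawStore() *RawStore { return &RawStore{raw: map[string][]byte{}} }

func (s *RawStore) Put(key string, b []byte) { s.raw[key] = b }

func (s *RawStore) Get(key string) (*Endpoints, error) {
	var ep Endpoints
	if err := json.Unmarshal(s.raw[key], &ep); err != nil {
		return nil, err
	}
	return &ep, nil
}

// DecodedStore keeps the decoded object: the decode cost is paid once per
// update, and Get is just a map lookup.
type DecodedStore struct{ objs map[string]*Endpoints }

func NewDecodedStore() *DecodedStore {
	return &DecodedStore{objs: map[string]*Endpoints{}}
}

func (s *DecodedStore) Put(key string, b []byte) error {
	var ep Endpoints
	if err := json.Unmarshal(b, &ep); err != nil {
		return err
	}
	s.objs[key] = &ep
	return nil
}

func (s *DecodedStore) Get(key string) *Endpoints { return s.objs[key] }

If the real cache behaves like the RawStore above, the per-second leader-election updates combined with target rebuilds would explain why decoder allocations dominate the profile.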

f0 commented May 18, 2017

I can confirm: with 1.6.2 the problem is gone.

fabxc commented May 18, 2017

Great, thanks!

fabxc closed this on May 18, 2017

ichekrygin commented Jun 15, 2017

Hello, I think I am seeing a "memory leak" issue in v1.6.2, though I'm not entirely sure.

[screenshot: memory usage]

Runtime & Build:
[screenshot]

Configuration:

global:
  scrape_interval: 30s
  scrape_timeout: 30s

rule_files:
- /etc/prometheus/alert.rules

scrape_configs:
- job_name: etcd
  static_configs:
    - targets:
      - 10.72.1.2:2379
      - 10.72.1.3:2379
      - 10.72.1.4:2379
      - 10.72.1.5:2379
      - 10.72.1.6:2379

- job_name: 'prometheus'
  static_configs:
    - targets: ['localhost:9090']

- job_name: 'kube-state-metrics'
  static_configs:
    - targets: ['prometheus-kube-state-metrics:8080']

- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https

- job_name: 'kube-dns-dnsmasq'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: http
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: kube-system;kube-dns;metrics-sidecar

- job_name: 'kube-dns-skydns'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: http
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: kube-system;kube-dns;metrics-kubedns

- job_name: 'kubernetes-nodes'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: container_cpu_usage_seconds_total(.*)
    action: drop

- job_name: 'kubernetes-service-endpoints'
  scheme: https
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: (.+)(?::\d+);(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_service_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name

- job_name: 'kubernetes-services'
  scheme: https
  metrics_path: /probe
  params:
    module: [http_2xx]
  kubernetes_sd_configs:
  - role: service
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
    action: keep
    regex: true
  - source_labels: [__address__]
    target_label: __param_target
  - target_label: __address__
    replacement: blackbox
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_service_namespace]
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    target_label: kubernetes_name

Please let me know if you want me to open a separate issue or if you need more info.

Thank you

fabxc commented Jun 15, 2017

@ichekrygin can you run go tool pprof -inuse_space -png http://<prometheus>:9090/debug/pprof/heap > out.png when it's at its peak usage and share the resulting image?

ichekrygin commented Jun 16, 2017

[heap profile screenshot]

fabxc commented Jun 16, 2017

That memory goes to the storage, so it's not k8s SD related. Which version did you upgrade from?
In 1.6 we introduced the flag for the storage's target heap size, which changes the storage's internal memory handling. I think, however, that the old flags should still have the same effect as before. @beorn7 might know more.

ichekrygin commented Jun 16, 2017

I see, so it could be that we are just packing too many time series into it. We upgraded from 1.5.1.

beorn7 commented Jun 19, 2017

@ichekrygin You need to check your flag values. The forward compatibility is just educated guesswork, i.e. Prometheus tries to guess from your configured value of -storage.local.memory-chunks how much memory you intend to use. The recommended way is to not use -storage.local.memory-chunks at all but to set -storage.local.target-heap-size directly. If you don't want to use more than the ~27GiB you used before, I recommend setting the flag to 18 to 19 GiB, e.g. -storage.local.target-heap-size=19000000000, which leaves roughly 8GiB of headroom for memory used outside the Go heap.

Further reading: https://prometheus.io/docs/operating/storage/
Further watching: https://www.youtube.com/watch?v=hPC60ldCGm8

If you still have questions, please mail the prometheus-users mailing list as this doesn't seem to have anything to do with the K8s issue this GitHub issue is about.

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on Mar 23, 2019
