Kubernetes: Memory usage continually increases. Process enters crash recovery loop. #1885

Closed
fluxrad opened this Issue Aug 11, 2016 · 9 comments

fluxrad commented Aug 11, 2016

Summary

3 million series appear to exhaust memory on a 32GB m4.2xlarge

I'm attempting to use Prometheus to monitor a 68-node Kubernetes cluster, with node-exporter and Kubernetes scrape configs. When I scrape container-level metrics from the node role, the number of series approaches ~3 million after about 24 hours. At that point, the process is killed by the kernel with an OOM error. Prometheus is running on a 32GB (AWS: m4.2xlarge) machine with no resource limits:

# Prometheus pod spec
    spec: 
      containers: 
      - image: prom/prometheus:latest
        imagePullPolicy: Always
        name: prometheus
        args:
          - -config.file=/prometheus-config/config/prometheus.yml
          - -storage.local.memory-chunks=10000000
          - -storage.local.max-chunks-to-persist=5000000
          - -alertmanager.url=http://alertmanager.kube-system.svc.cluster.local:9093
        volumeMounts:
          - name: prometheus-persistent-storage
            mountPath: /prometheus
          - name: prometheus-config
            mountPath: /prometheus-config
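
For reference, a hedged sketch (values purely illustrative, not a recommendation) of how an explicit memory request/limit could be added to the same pod spec, with the chunk flag sized below it:

# Hypothetical variant of the pod spec above (values illustrative)
    spec:
      containers:
      - image: prom/prometheus:latest
        name: prometheus
        resources:
          requests:
            memory: "28Gi"
          limits:
            memory: "28Gi"          # leaves headroom below the 32GB node
        args:
          - -config.file=/prometheus-config/config/prometheus.yml
          - -storage.local.memory-chunks=4000000   # sized well below the limit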

After approximately 24 hours, the kernel OOM-kills the Prometheus Docker process, at which point the container enters a crash loop while trying to recover metrics.

# Capture from `journalctl`
Aug 11 13:56:37 <redacted> kernel: Out of memory: Kill process 2464 (prometheus) score 1975 or sacrifice child
Aug 11 13:56:37 <redacted> kernel: Killed process 2464 (prometheus) total-vm:32158388kB, anon-rss:32083796kB, file-rss:0kB, shmem-rss:0kB

The only recovery mechanism from here is to clear storage and restart Prometheus. Neither the storage.local.memory-chunks flag nor the storage.local.max-chunks-to-persist flag seems to have any impact on how quickly the process runs out of memory, or on how much memory Prometheus consumes overall.

By my calculations, 3 million series should consume somewhere between 6 and 9 million chunks in memory, or 6-9GB. Even allowing for query overhead and general memory consumption, usage should still be well under the 32GB provided by the machine. Is this assumption accurate? If so, where is the rest of the memory going? I've mitigated the issue by dropping all container-level metrics from the Kubernetes node scrapes (see the sketch below).
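
A hedged sketch of that mitigation, assuming the container-level series are the cAdvisor container_* metrics (the metric name pattern is illustrative); a drop rule in metric_relabel_configs on the node job discards them at scrape time:

# Hypothetical mitigation sketch: drop container-level (cAdvisor) series from the node job
- job_name: 'kubernetes-nodes'
  # ... scheme, tls_config, kubernetes_sd_configs and relabel_configs as in the full config below ...
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'container_.*'
    action: drop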

Prometheus memory-hungry config:

# Prometheus config
rule_files:
  - <redacted>

scrape_configs:
- job_name: 'prometheus'

  # Override the global default and scrape targets from this job every 5 seconds.
  scrape_interval: 5s

  static_configs:
    - targets: ['localhost:9090']

- job_name: 'kubernetes-apiserver'

  scheme: https
  tls_config:
    blah

  kubernetes_sd_configs:
  - api_servers:
    - 'https://kubernetes.default.svc'
    in_cluster: true
    role: apiserver

  relabel_configs:
  - source_labels: [__meta_kubernetes_role]
    action: keep
    regex: apiserver
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_role]
    action: replace
    target_label: kubernetes_name


- job_name: 'kubernetes-nodes'
  # This job seems to scrape the lion's share of series

  scheme: https
  tls_config:
    blah

  kubernetes_sd_configs:
  - api_servers:
    - 'https://kubernetes.default.svc'
    in_cluster: true
    role: node

  relabel_configs:
  - source_labels: [__meta_kubernetes_role]
    action: keep
    regex: node
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_role]
    action: replace
    target_label: kubernetes_name

- job_name: 'kubernetes-service-endpoints'

  kubernetes_sd_configs:
  - api_servers:
    - 'https://kubernetes.default.svc'
    in_cluster: true
    role: endpoint

  relabel_configs:
  - source_labels: [__meta_kubernetes_role, __meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: endpoint;true
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: (.+)(?::\d+);(\d+)
    replacement: $1:$2
  - action: labelmap
    regex: __meta_kubernetes_service_label_(.+)
  - source_labels: [__meta_kubernetes_role]
    action: replace
    target_label: kubernetes_role
  - source_labels: [__meta_kubernetes_service_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name

- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - api_servers:
    - 'https://kubernetes.default.svc'
    in_cluster: true
    role: pod

  relabel_configs:
  - source_labels: [__meta_kubernetes_role, __meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: pod;true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: (.+)(?::\d+);(\d+)
    replacement: $1:$2
  - source_labels: [__meta_kubernetes_role]
    action: replace
    target_label: kubernetes_role
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_role]
    action: replace
    target_label: kubernetes_role
  - source_labels: [__meta_kubernetes_pod_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_name
  - source_labels: [__meta_kubernetes_pod_address]
    action: replace
    target_label: kubernetes_address
  - source_labels: [__meta_kubernetes_pod_container_name]
    action: replace
    target_label: kubernetes_container_name
  - source_labels: [__meta_kubernetes_pod_container_port_name]
    action: replace
    target_label: kubernetes_container_port_name

Host information

$ cat /etc/os-release
NAME=CoreOS
ID=coreos
VERSION=1010.5.0
VERSION_ID=1010.5.0
BUILD_ID=2016-05-26-2225
PRETTY_NAME="CoreOS 1010.5.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"

We do perform a number of production deploys per day, which I expect contributes significantly to the number of unique time series coming from the hosts as containers come and go. Even so, I'd expect to be able to handle a week's worth of metrics.

At this point, I can't tell whether we have a configuration issue or whether 3 million series simply won't fit on a 32GB machine. I'm hoping someone can advise.

  • What's the ballpark number of unique series that can reasonably be expected to fit on a 32GB machine? I've read issues #1836 and #455, but the numbers don't seem to match the experience above.
  • When does Prometheus purge series from memory so that they live on disk exclusively? Does series data for old containers stay in memory until it is purged by the default retention policy, or is it eventually evicted to disk first?

Thanks very much for your help.

fabxc (Member) commented Aug 15, 2016

3 million series is quite a bit and the memory usage corresponds to what I recall from production clusters.

@beorn7 does the calculation in the docs have to be adjusted?

At this scale I doubt you want to keep all of these millions of time series around for weeks; more likely you want only the meaningful aggregates you are interested in over a longer time. Especially as you might want to scale up even further, you should probably think about a sharded setup.

In the beginning this could just mean running one scraping Prometheus server with a retention of about 12h (or however long you are interested in ALL individual time series) and letting it evaluate a set of recording rules that produce meaningful aggregate time series. You can then run a second Prometheus server that federates these aggregates and stores them for a longer time. That server will likely end up with at least 100x fewer time series.
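
A hedged sketch of that two-tier layout (rule name, labels, and target address are illustrative, not from this thread): the scraping server evaluates a recording rule, and the long-retention server pulls only the aggregated series via the /federate endpoint:

# On the scraping server (short retention): a recording rule, Prometheus 1.x rule syntax
job:container_cpu_usage:rate5m = sum(rate(container_cpu_usage_seconds_total[5m])) by (job, kubernetes_namespace)

# On the long-retention server: federate only the aggregated (job:*) series
scrape_configs:
- job_name: 'federate'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~"job:.*"}'
  static_configs:
    - targets: ['prometheus-scraper.kube-system.svc.cluster.local:9090']   # hypothetical address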

If even one scraping Prometheus server gets too large for short-term ingestion and aggregation, you can scale it out into several actual shards using the hashmod relabeling action.
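
A hedged sketch of the hashmod approach (shard count and shard number are illustrative): each scraping server keeps only the targets whose address hashes to its own shard:

# Hypothetical relabeling for one of four scraping shards (this server is shard 1)
relabel_configs:
- source_labels: [__address__]
  modulus:       4
  target_label:  __tmp_hash
  action:        hashmod
- source_labels: [__tmp_hash]
  regex:         1
  action:        keep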

I hope this gives you a rough idea. Brian also wrote something on it in his blog: http://www.robustperception.io/scaling-and-federating-prometheus/

In the future, we of course aim for a storage layer that handles memory more gracefully and is less prone to OOMing.

fabxc (Member) commented Aug 15, 2016

Also thanks for the detailed issue. Very much appreciated.

@fabxc fabxc added the kind/question label Aug 15, 2016

beorn7 (Member) commented Aug 15, 2016

@beorn7 does the calculation in the docs have to be adjusted?

Given the various reports we have received, we should consider a more elaborate formula, something like a * (memory chunks) + b * (memory time series) + some baseline.

This would need some study to get a and b right. It would also help us eventually come up with automated management of memory consumption, cf. #455.

grobie (Member) commented Aug 15, 2016

Something I've thought about a few times: we could start anonymously collecting memory consumption data, chunk counts, and series counts. Even publishing them verbatim might already help people get a better idea of the requirements, and the data would help us come up with a better formula.


fabxc (Member) commented Sep 5, 2016

Closing here as the original cause is well known and tracked and not specific to Kubernetes.

@fabxc fabxc closed this Sep 5, 2016

fluxrad (Author) commented Sep 14, 2016

For what it's worth, I've federated my setup. With around 2MM series being scraped at a retention of 12 hours, and 20Gi of RAM reserved for the Prometheus process, I still see the process OOM-killed, and then killed repeatedly on recovery, at around 2.1MM series.

I understand I must need more RAM, but I was under the impression (from the documentation) that 3x the number of series for active chunks, and then 3x the number of chunks for total RAM usage, would give me a reasonable first approximation. This doesn't appear to be the case. It is still completely opaque to me how to even ballpark how much RAM is required for 2MM series. 64G? 128G?

beorn7 (Member) commented Sep 15, 2016

@fluxrad Usually, the most important parameter is not the number of time series but the number of configured memory chunks (although the latter is recommended to be at least 3x the former).

The number from the docs — 3x the configured number of 1KiB chunks — is a rule-of-thumb minimum. At SoundCloud, our "safe default" is 6x, i.e. on a 64GiB machine we configure ~10M memory chunks (which in turn is good for up to ~3M time series).
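
Applying that rule of thumb to the numbers in this thread (back-of-the-envelope only, assuming ~1KiB per chunk plus overhead, i.e. ~6KiB of RAM per configured memory chunk at the 6x "safe default", and at least 3 chunks per active series):

# Rough sizing sketch (assumptions as stated above, not authoritative)
32GiB machine: 32GiB / ~6KiB per configured chunk ≈ 5.5M memory chunks
               5.5M chunks / 3 chunks per series  ≈ 1.8M active series
20GiB of RAM:  20GiB / ~6KiB per configured chunk ≈ 3.5M memory chunks
               3.5M chunks / 3 chunks per series  ≈ 1.2M active series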

bretep commented Apr 17, 2017

@fabxc, where is this well-known issue being tracked? Can you link to an issue?

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 23, 2019
