Prometheus takes too many resources (RAM, disk) after 1 day running in a small Kubernetes cluster #2141

Closed
ntquyen opened this Issue Nov 1, 2016 · 8 comments

ntquyen commented Nov 1, 2016

What did you do?

I'm running Prometheus inside a Kubernetes cluster of ~20 VMs. There are normally ~200-250 containers/pods running in the cluster.

Prometheus deployment config:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      nodeSelector:
        type: cluster_monitoring
      containers:
      - name: prometheus
        image: quay.io/coreos/prometheus:v1.1.1
        args:
          - '-storage.local.retention=24h'
          - '-storage.local.memory-chunks=6000000'
          - '-config.file=/etc/prometheus/prometheus.yml'
          - '-storage.local.max-chunks-to-persist=3000000'
          - '-log.level=info'
          - '-storage.local.path="/prometheus/data"'
        resources:
          limits:
            memory: 18000Mi
        ports:
        - name: web
          containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus
        - name: alert-rules-volume
          mountPath: /etc/prometheus-alert-rules
        - name: prometheus-data-volume
          mountPath: "/prometheus"
      volumes:
      - name: prometheus-config-volume
        configMap:
          name: prometheus
      - name: prometheus-data-volume
        hostPath:
          path: "/data/prometheus"
      - name: alert-rules-volume
        configMap:
          name: prometheus-alert-rules

The scrape_interval is 60s and storage.local.retention is 24h, so local storage should stay small. No node-exporter is running, and in the config file (see below) I try to drop most of the metrics.

What did you see instead? Under which circumstances?
Prometheus's storage takes 12GB after 1 day, which is huge. Every query is very slow, and sometimes the container gets OOM-killed. After recovering from the OOM, the logs said 2141026 series loaded.

Checking the series counts by running topk(100, count by (__name__, job)({__name__=~".+"})), the largest metric has ~120k series, and there is no way they can all add up to 2M series:

container_cpu_usage_seconds_total{job="kubernetes-nodes-cadvisor"}	122268
container_memory_failures_total{job="kubernetes-nodes-cadvisor"}	10568
container_memory_rss{job="kubernetes-nodes-cadvisor"}	2642
container_start_time_seconds{job="kubernetes-nodes-cadvisor"}	2642
container_cpu_user_seconds_total{job="kubernetes-nodes-cadvisor"}	2642
container_memory_cache{job="kubernetes-nodes-cadvisor"}	2642
container_memory_failcnt{job="kubernetes-nodes-cadvisor"}	2642
container_cpu_system_seconds_total{job="kubernetes-nodes-cadvisor"}	2642
container_memory_working_set_bytes{job="kubernetes-nodes-cadvisor"}	2642
....

Is there something wrong with my configuration?

Environment

  • System information:

      Linux 4.7.0-coreos x86_64
    
  • Prometheus version:

    v1.1.1

  • Prometheus configuration file:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: kube-system
data:
  prometheus.yml: |-
    global:
      scrape_interval: 60s
    rule_files:
      - '/etc/prometheus-alert-rules/alert.rules'
    scrape_configs:
    # etcd is living outside of our cluster and we configure
    # it directly.
    - job_name: 'etcd'
      static_configs:
      - targets:
          - etcd-server-1:2379
          - etcd-server-2:2379
          - etcd-server-3:2379
      metric_relabel_configs:
        - source_labels: [__name__]
          action: drop
          regex: go_(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_storage_db_compaction_(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_snapshot_save(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_storage_db_compaction(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_storage_(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_store_(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_wal_(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_helper_(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_rafthttp_(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_request_(.*)
        - source_labels: [__name__]
          action: drop
          regex: etcd_server_(.*)
          
    - job_name: 'kubernetes-apiserver-cadvisor'
      kubernetes_sd_configs:
        - api_servers:
            - 'http://k8s-apiserver-1:8080'
            - 'http://k8s-apiserver-2:8080'
          in_cluster: true
          role: apiserver
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - source_labels: [__meta_kubernetes_role]
          action: replace
          target_label: kubernetes_role
      metric_relabel_configs:
        - source_labels: [__name__]
          action: drop
          regex: go_(.*)

    - job_name: 'kubernetes-nodes-cadvisor'
      kubernetes_sd_configs:
        - api_servers:
            - 'http://k8s-apiserver-1:8080'
            - 'http://k8s-apiserver-2:8080'
          in_cluster: true
          role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_role]
        action: replace
        target_label: kubernetes_role
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:10255'
        target_label: __address__
        
      metric_relabel_configs:
        ## Drop metrics
        - source_labels: [__name__]
          action: drop
          regex: go_(.*)
        - source_labels: [__name__]
          action: drop
          regex: kubelet_pleg_(.*)
        - source_labels: [__name__]
          action: drop
          regex: kubelet_(.*)
        - source_labels: [__name__]
          action: drop
          regex: container_spec_(.*)
        - source_labels: [__name__]
          action: drop
          regex: get_(.*)
        - source_labels: [__name__]
          action: drop
          regex: kubernetes_build_info
        - source_labels: [__name__]
          action: drop
          regex: container_fs_io_(.*)
        - source_labels: [__name__]
          action: drop
          regex: container_fs_(.*)_merged_total
        - source_labels: [__name__]
          action: drop
          regex: container_fs_sector_(.*)
        - source_labels: [__name__]
          action: drop
          regex: container_tasks_state
        - source_labels: [__name__]
          action: drop
          regex: container_last_seen
        - source_labels: [__name__]
          action: drop
          regex: container_fs_(.*)_seconds_total
          
        ## Remove labels
        - target_label: io_kubernetes_container_hash
          replacement: ''
        - target_label: io_kubernetes_container_name
          replacement: ''
        - target_label: io_kubernetes_container_restartCount
          replacement: ''
        - target_label: io_kubernetes_pod_name
          replacement: ''
        - target_label: io_kubernetes_pod_namespace
          replacement: ''
        - target_label: io_kubernetes_pod_terminationGracePeriod
          replacement: ''
        - target_label: io_kubernetes_pod_uid
          replacement: ''
        - target_label: io_kubernetes_container_terminationMessagePath
          replacement: ''          
          
    - job_name: 'kubernetes-services'
      kubernetes_sd_configs:
      - api_servers:
        - 'http://k8s-apiserver-1:8080'
        - 'http://k8s-apiserver-2:8080'
        in_cluster: true
        role: service
      relabel_configs:
      # We only monitor endpoints of services that were annotated with
      # prometheus.io/scrape=true in Kubernetes
      - source_labels: [__meta_kubernetes_role, __meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: endpoint;true
      # Rewrite the Kubernetes service name into the Prometheus job label.
      - source_labels: [__meta_kubernetes_service_name]
        target_label: job
      # Attach the namespace as a label to the monitoring targets.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      # Attach all service labels to the monitoring targets.
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      metric_relabel_configs:
        - source_labels: [__name__]
          action: drop
          regex: go_(.*)
ntquyen commented Nov 2, 2016

Update: as of now, there are 2700000 series. It seems that storage.local.retention=24h isn't taking effect.

brancz commented Nov 2, 2016

Can you make sure that you don't have label/value pairs with variable content? Adding, for example, a request UUID as a label can make the number of time series explode, as every new metric-name/label-value combination creates a new time series.
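
If you do find such a label, one option is to blank it out before ingestion, using the same empty-replacement pattern you already use for the io_kubernetes_* labels. A sketch, where the label name request_id is purely hypothetical:

metric_relabel_configs:
  # Hypothetical example: blank out a variable-content label (e.g. a
  # per-request id) so that each new value no longer creates a new
  # time series. Replace "request_id" with whatever label actually
  # carries variable content.
  - target_label: request_id
    replacement: ''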

ntquyen commented Nov 2, 2016

@brancz Thanks for your response. I don't think I have that kind of metric. Every metric is scraped from etcd and Kubernetes.

beorn7 commented Nov 2, 2016

@ntquyen The observed behavior might be completely normal. Note the following:

  • You have configured 6000000 memory chunks. 18GiB of RAM is really the lower bound for that. An expensive query, many time series, or just memory-hungry service discovery (K8s SD seems to be quite memory hungry, at least pre-1.3) will easily blow past your RAM limit. I'd go with twice as much RAM for 6M chunks, or with 3M chunks for 18GiB of RAM, to avoid the OOM (see the flag sketch after this list).
  • The topk(100, count by (__name__, job)({__name__=~".+"})) query only counts metrics that have received samples in the last 5 minutes. You can have far more series in RAM, and even more archived on disk. Whenever you do something like a re-deploy of your service, labels change and create new time series. The old ones don't receive samples anymore and will eventually be archived (and finally deleted). So you might legitimately have millions of time series, even if only ~100k are receiving fresh samples. (The case @brancz referred to is when you accidentally create more time series than intended.)
  • If you have many time series, Prometheus needs several hours or even a day to cycle through all of them for the retention cut-off, so setting a retention time of less than 24 hours has diminishing returns. Also take into account that with ~2M series, even a single chunk per series already takes 2GiB on disk, so 12GiB of disk space is really not a lot in your case.
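
For the first point, a sketch of what the adjusted flags could look like (halving memory-chunks and keeping your original 2:1 ratio of memory-chunks to max-chunks-to-persist; the exact numbers will need tuning for your workload):

args:
  # Sketch only: halve the in-memory chunk target so the 18GiB limit has
  # headroom for queries and service discovery.
  - '-storage.local.memory-chunks=3000000'
  # Keep roughly the original 2:1 ratio of memory-chunks to
  # max-chunks-to-persist.
  - '-storage.local.max-chunks-to-persist=1500000'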
ntquyen commented Nov 3, 2016

@beorn7 Thanks for your explanation, it's much clearer to me now!

In our k8s cluster, services get re-deployed almost all the time (we have our own pod auto-scheduler: pods come up when new messages come in and are stopped when they are done processing). So we can't help but produce a lot of new time series, even though I drop half of the exposed metrics.

I reduced -storage.local.memory-chunks to 1M and -storage.local.max-chunks-to-persist to 500K, but after 14h, Prometheus reports the following:

# HELP prometheus_local_storage_memory_chunkdescs The current number of chunk descriptors in memory.
# TYPE prometheus_local_storage_memory_chunkdescs gauge
prometheus_local_storage_memory_chunkdescs 2.451975e+06
# HELP prometheus_local_storage_memory_chunks The current number of chunks in memory, excluding cloned chunks (i.e. chunks without a descriptor).
# TYPE prometheus_local_storage_memory_chunks gauge
prometheus_local_storage_memory_chunks 1.154092e+06
# HELP prometheus_local_storage_memory_series The current number of series in memory.
# TYPE prometheus_local_storage_memory_series gauge
prometheus_local_storage_memory_series 2.033516e+06

RAM usage is now 16GiB. It looks like Prometheus is trying to keep every series in memory. What I expected from -storage.local.memory-chunks is that it caps how many chunks are kept in memory for active series. I know total memory usage can exceed what that implies, because there are other things to handle, but the number of chunks held in memory for active series should not exceed this value, right? If not, how can I limit memory usage when dealing with millions of time series?
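
One more thing I could still try, based on the topk output above, is dropping the single largest contributor in the same style as the existing drop rules. A sketch (whether container_cpu_usage_seconds_total can actually be spared is of course a judgment call):

metric_relabel_configs:
  # Sketch: drop the largest series contributor from the topk output
  # above (~122k series), in the same style as the other drop rules.
  # Only do this if per-container CPU usage is not needed.
  - source_labels: [__name__]
    action: drop
    regex: container_cpu_usage_seconds_total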

ntquyen commented Nov 3, 2016

Just found issue #455, which explains more about memory usage; we may switch to that one then.

ntquyen closed this Nov 3, 2016

beorn7 commented Nov 3, 2016

You need a lot of RAM to deal with millions of time series. There is no way around that.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019