High Resource Utilisation on Prometheus #4740

Closed
vaibhavkhurana2018 opened this Issue Oct 15, 2018 · 8 comments

vaibhavkhurana2018 commented Oct 15, 2018

Proposal

Use case. Why is this important?

The Prometheus app is crashing at regular intervals.

Bug Report

What did you do?
I'm running Prometheus inside a Kubernetes cluster, but the prometheus-app pod keeps crashing as its resource utilisation exceeds the node's capacity.

Prometheus Deployment config:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "9"
  creationTimestamp: null
  generation: 1
  labels:
    name: prometheus-app
  name: prometheus-app
  selfLink: /apis/extensions/v1beta1/namespaces/monitoring/deployments/prometheus-app
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: prometheus-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: prometheus-app
        restart: 24th-sept
    spec:
      containers:
      - args:
        - --storage.tsdb.retention=120d
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/data
        image: prom/prometheus:v2.4.2
        imagePullPolicy: IfNotPresent
        name: prometheus-app
        ports:
        - containerPort: 9090
          name: web
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/prometheus
          name: config-volume
        - mountPath: /data
          name: data-volume
      dnsPolicy: Default
      nodeSelector:
        node-role.kubernetes.io/node-ops: ""
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: prometheus-app
      serviceAccountName: prometheus-app
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: prometheus.yml
            path: prometheus.yml
          name: prometheus-app-config
        name: config-volume
      - name: data-volume
        persistentVolumeClaim:
          claimName: prod-prometheus-app-datastore
status: {}
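
Note that resources: {} above sets no requests or limits, so the container is free to use all of the node's memory. For reference, a sketch of what explicit requests/limits on the container could look like; the values below are placeholders, not a sizing recommendation:

        resources:
          requests:
            cpu: "2"          # placeholder values, tune to the node and workload
            memory: 16Gi
          limits:
            cpu: "4"
            memory: 24Gi

A memory limit would only turn the node-level OOM into the pod being OOMKilled; it would not reduce what Prometheus actually needs.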

What did you see instead? Under which circumstances?
The app is crashing at regular intervals. From the Grafana dashboard, I can see that memory and CPU utilisation climb to the point where the node goes OOM.

NOTE: The Prometheus pod is the only pod running on the node.

Environment

  • System information:
    v1.8.1+coreos.0
    It is an EC2 instance (r4.xlarge) with 4 vCPUs and 30.5 GiB RAM.

  • Prometheus version:

    v2.4.2, Docker image: prom/prometheus:v2.4.2

  • Prometheus configuration file:

# We point this to the statsd-exporter
global:
  scrape_interval: 30s
  external_labels:
    monitor: 'prod-api'
scrape_configs:
  - job_name: statsd-exporter
    scrape_interval: 30s
    honor_labels: true
    scrape_timeout: 25s
    static_configs:
      - targets: ["statsd-exporter.test.vpc:9102"]
  - job_name: statsd-exporter-gateway
    scrape_interval: 30s
    honor_labels: true
    scrape_timeout: 25s
    static_configs:
      - targets: ["statsd-exporter-gw.test.vpc:9102"]
  - job_name: pod-scraper
    kubernetes_sd_configs:
      - role: pod
    # Prometheus collects metrics from pods with "prometheus.app/scrape: true" label.
    # Prometheus gets 'hello_requests_total{status="500"} 1'
    # from hello:8000/metrics and adds "job" and "instance" labels, so it becomes
    # 'hello_requests_total{instance="10.16.0.10:8000",job="hello",status="500"} 1'.
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_app_scrape]
        regex: true
        action: keep
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: k8s_pod
        action: replace
  - job_name: 'prometheus-self'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9090']

(Two screenshots attached: Grafana dashboards captured 2018-10-15 showing the Prometheus resource-usage panels.)

Kindly suggest any solution to limit the utilisation, or let me know if I am doing anything wrong in the setup.

spirrello commented Oct 15, 2018

Are you using any top-n dashboards? I was previously, and this was tanking our Prometheus instance, causing it to hit the memory limits.

vaibhavkhurana2018 commented Oct 16, 2018

Thanks @spirrello for the response. I think you are referring to topk; please correct me if I am wrong. And yes, we are using topk in one of our dashboards; I will try excluding it and check.

Still, it is used in only 3 queries and should not have that much of an effect, in my opinion. Any other thoughts that might be effective?

spirrello commented Oct 16, 2018

Yes, topk can drive memory utilization heavily, and it tapered off once I killed those queries. That's the only thing I've seen drive heavy memory utilization in my environment.

simonpasquier commented Oct 16, 2018

From your dashboards, your Prometheus holds almost 2M series in the head. A rough estimate is that every series needs about 8kB of memory, so in your case that would be roughly 2,000,000 × 8kB ≈ 16GB, and you need to add some room for handling queries too.
If you want to get a better picture of where the memory is spent, check https://www.robustperception.io/analysing-prometheus-memory-usage

As for how to deal with the situation, you would need to get a machine with more RAM, increase your sampling interval, or reduce the number of collected metrics.
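
As a sketch of that last option, series can be dropped at scrape time with metric_relabel_configs; the metric name pattern below is a placeholder for whatever families you never query:

  - job_name: statsd-exporter
    scrape_interval: 30s
    honor_labels: true
    scrape_timeout: 25s
    static_configs:
      - targets: ["statsd-exporter.test.vpc:9102"]
    metric_relabel_configs:
      # Drop metric families that are never queried (placeholder regex).
      - source_labels: [__name__]
        regex: 'unused_metric_prefix_.*'
        action: drop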

vaibhavkhurana2018 commented Oct 17, 2018

I have removed the panels using topk as suggested by @spirrello, but still no luck; the utilisation remains high.

@simonpasquier Thanks for the response; I will run the profiler and share the results.

simonpasquier commented Nov 9, 2018

@vaibhavkhurana2018

vaibhavkhurana2018 commented Nov 14, 2018

@simonpasquier Apologies for the late reply. I didn't get anything specific from the profiler, as the app went down every time the profiler was run.

One observation: I have seen CPU spike whenever a query is run from the Grafana dashboard, which leads to increased memory utilisation and finally the app becoming unresponsive.

Anyway, as a workaround I have split Prometheus into separate instances per application. I wanted to know what system spec is recommended for this kind of infrastructure, as from my point of view I will be facing the same issue again in the future.

simonpasquier commented Nov 14, 2018

Unfortunately there's no general formula to assess the amount of memory that Prometheus will use, as it depends on too many factors. When you reach the limits of a single process, the right thing to do is indeed to split targets across multiple Prometheus servers. Note that some TSDB improvements are in development which should eventually reduce memory usage a bit (especially when compacting data).
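
As a sketch of splitting targets, one common approach is hash-based sharding with relabel_configs, where each server keeps a different subset of the same job (the modulus and shard number below are placeholders for however many servers you run; your existing relabel rules, such as the prometheus_app_scrape keep, would stay alongside it):

  - job_name: pod-scraper
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash each target address into one of 2 shards; this server keeps shard 0,
      # the second server would keep shard 1.
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep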

I'm closing the issue for now. Feel free to reopen if it happens again.
