prometheus memory leak and keeps restarting #5013

Closed
FANLONGFANLONG opened this Issue Dec 19, 2018 · 6 comments

@FANLONGFANLONG commented Dec 19, 2018


Bug Report

What did you do?
Rebooted the host running Prometheus.

What did you expect to see?
Normal operation.

What did you see instead? Under which circumstances?
Memory leak; Prometheus keeps restarting.
[screenshot attached]

Environment
Kubernetes 1.10

  • System information: (not provided)

  • Prometheus version:
    Prometheus 2.4.3 (we also tried 2.6.0-rc.1; the problem is the same)

  • Alertmanager version: (not provided)

  • Prometheus configuration file:

# my global config
global:
  scrape_interval:     30s # Set the scrape interval to every 30 seconds. Default is every 1 minute.
  evaluation_interval: 30s # Evaluate rules every 30 seconds. The default is every 1 minute.
  scrape_timeout: 30s
  external_labels:
    cluster: KC0
   # prometheus: prometheus-svc-2.monitor.svc.bcc-kc0.jd.local
  # scrape_timeout above overrides the default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
       #- 127.0.0.1:9093
       - alertmanager.monitor.svc.bcc-kc0.jd.local:80

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/export/prometheus/rules/*.yaml"


scrape_configs:

- job_name: 'prometheus'
  static_configs:
    - targets: ['prometheus-svc.monitor.svc.bcc-kc0.jd.local:80']
#- job_name: 'kubernetes-apiservers'




- job_name: 'kubernetes-cadvisor'
  scrape_interval:     2m # Override the global 30s interval: scrape cAdvisor every 2 minutes.

  # Default to scraping over https. If required, just disable this or change to
  # `http`.
  scheme: https


  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__address__]
    action: replace
    target_label: __address__
    regex: ([^:;]+):(\d+)
    replacement: ${1}:10255
  - source_labels: [__scheme__]
    action: replace
    target_label: __scheme__
    regex: https
    replacement: http
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /metrics/cadvisor

- job_name:  alert-test2
  static_configs:
    - targets: 
      - '10.196.82.165:18080'

  • Alertmanager configuration file: (not provided)

  • Logs:
[debug.tar.gz](https://github.com/prometheus/prometheus/files/2693168/debug.tar.gz)

@FANLONGFANLONG (Author) commented Dec 19, 2018

debug.tar.gz
These are the logs.

@simonpasquier (Member) commented Dec 20, 2018

You have too many metrics: your Prometheus is reporting 40M timeseries in the current head. There's no way a single Prometheus instance can cope with this load. You need either to drop some metrics or shard your targets across several Prometheus servers.
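
Both options live in the scrape configuration. As a rough sketch only (the job name, shard count, and metric regex below are hypothetical and not taken from this issue's config), high-cardinality series can be dropped with metric_relabel_configs, and targets can be split across servers with a hashmod relabel rule:

# Hypothetical sketch: keep only the targets assigned to shard 0 of 3 on this
# server, and drop per-filesystem/per-interface cAdvisor series by name.
- job_name: 'kubernetes-cadvisor'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  # hash each target address into one of 3 buckets...
  - source_labels: [__address__]
    modulus: 3
    target_label: __tmp_shard
    action: hashmod
  # ...and keep only bucket 0; the other servers keep buckets 1 and 2
  - source_labels: [__tmp_shard]
    regex: "0"
    action: keep
  metric_relabel_configs:
  # drop scraped series by metric name before they are ingested
  - source_labels: [__name__]
    regex: 'container_(fs|network)_.*'
    action: drop

Each server in the shard set would run the same config with a different regex value (0, 1, 2), so every target is scraped by exactly one Prometheus.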

@FANLONGFANLONG (Author) commented Dec 22, 2018

@simonpasquier, thanks. Merry Christmas!

May I know how many timeseries per single Prometheus instance is officially recommended?

@simonpasquier (Member) commented Jan 2, 2019

There's no official recommendation, but 10M series is probably the upper bound. Check this slide deck (p. 28 in particular).

@bamb00 commented Mar 4, 2019

Hi @simonpasquier,

How do I determine the number of timeseries in the current head? I'm getting OOMKilled (exit code 137); a resource limit of 2Gi was not enough.

Thanks.

@simonpasquier (Member) commented Mar 4, 2019

@bamb00 check the prometheus_tsdb_head_series metric.
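
That metric is exposed by Prometheus about itself, so it can be checked in the expression browser or via the HTTP API. A quick sketch (the second query is just a common follow-up for finding which metric names dominate the head, and can itself be expensive on a very large instance):

# number of series currently held in the TSDB head block
prometheus_tsdb_head_series

# top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))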
