Out of memory: Kill process 24355 (prometheus) score 945 or sacrifice child #2525

Closed
robsonpeixoto opened this Issue Mar 25, 2017 · 4 comments


robsonpeixoto commented Mar 25, 2017

I have a single Prometheus server that scrapes 366 targets. It runs out of memory (OOM) and never makes it through crash recovery.

I tried to run the pprof tool, but the process dies before it opens port 9090.

Environment

  • Dmesg

  • Command

     /opt/prometheus-1.5.2/prometheus \
       -config.file=/etc/prometheus/prometheus.yml \
       -storage.local.path=/var/lib/prometheus \
       -web.console.libraries=/opt/prometheus-1.5.2/console_libraries \
       -web.console.templates=/opt/prometheus-1.5.2/consoles \
       -storage.local.retention=168h \
       -storage.local.max-chunks-to-persist=2097152 \
       -storage.local.memory-chunks=3145728 \
       -storage.local.series-sync-strategy=never \
       -log.level=debug \
       -alertmanager.url=http://monitoring-1.mich2.prod.juscloud.com:9093
    
  • System information:

     # free -m
                  total       used       free     shared    buffers     cached
     Mem:         29901      29122        779          0       4180       2991
     -/+ buffers/cache:      21950       7951
     Swap:         1906        123       1783
    
     # uname -srm
     Linux 3.13.0-110-generic x86_64
    

    More detail

  • Prometheus version:

    Using version 1.5.2 plus patch cc3e859, built with the command:

     export GOOS=linux GOARCH=amd64
     make build
    
     prometheus, version 1.5.2 (branch: HEAD, revision: 278328b7b2ceff6b082ff0182dd3098144f3e4e6)
       build user:       robinho@robinho-notebook.local
       build date:       20170320-13:36:50
       go version:       go1.8
    
  • Prometheus configuration file:

     global:
       scrape_interval:     15s
       evaluation_interval: 15s
    
     rule_files:
       - "/etc/prometheus/rules/alerting/*.rules"
       - "/etc/prometheus/rules/recording/*.rules"
    
     scrape_configs:
       - job_name: 'prometheus'
         scrape_interval: 15s
         consul_sd_configs:
           - server: localhost:8500
             services: ['prometheus']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
    
       - job_name: 'pushgateway'
         scrape_interval: 15s
         consul_sd_configs:
           - server: localhost:8500
             services: ['pushgateway']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
    
       - job_name: 'alertmanager'
         scrape_interval: 15s
         consul_sd_configs:
           - server: localhost:8500
             services: ['alertmanager']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
    
       - job_name: 'node'
         scrape_interval: 15s
         consul_sd_configs:
           - server: localhost:8500
             services: ['node_exporter']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
           - source_labels: ['__meta_consul_node']
             regex: '([^.]+?)(-?\d+)[.].*'
             replacement: '${1}'
             target_label: 'cluster'
    
       - job_name: 'mesos-master'
         scrape_interval: 15s
         scrape_timeout: 5s
         static_configs:
           - targets: ['marathon-lb:10114']
    
       - job_name: 'mesos-agent'
         scrape_interval: 15s
         consul_sd_configs:
           - server: localhost:8500
             services: ['mesos_agent_exporter']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
         metric_relabel_configs:
           - source_labels: ['source']
             regex: '([^.]+)[.]([^.]+)'
             replacement: '${1}'
             target_label: 'app_name'
           - source_labels: ['source']
             regex: '([^.]+)[.]([^.]+)'
             replacement: '${2}'
             target_label: 'app_instance'
    
       - job_name: 'mesos'
         scrape_interval: 15s
         consul_sd_configs:
           - server: localhost:8500
             services: ['cadvisor-mesos']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
         metric_relabel_configs:
           - source_labels: ['container_env_mesos_task_id']
             regex: '([^.]+)[.]([^.]+)'
             replacement: '${1}'
             target_label: 'app_name'
           - source_labels: ['container_env_mesos_task_id']
             regex: '([^.]+)[.]([^.]+)'
             replacement: '${2}'
             target_label: 'app_instance'
    
       - job_name: 'marathon-lb'
         scrape_interval: 1s
         consul_sd_configs:
           - server: localhost:8500
             services: ['marathon-lb-exporter']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
         metric_relabel_configs:
           - source_labels: ['backend']
             regex: '(.*)_(\d+)'
             replacement: '${1}'
             target_label: 'app_name'
           - source_labels: ['backend']
             regex: '(.*)_(\d+)'
             replacement: '${2}'
             target_label: 'app_port'
           - source_labels: ['frontend']
             regex: '(.*)_(\d+)'
             replacement: '${1}'
             target_label: 'app_name'
           - source_labels: ['frontend']
             regex: '(.*)_(\d+)'
             replacement: '${2}'
             target_label: 'app_port'
    
       - job_name: 'tsuru'
         scrape_interval: 15s
         consul_sd_configs:
           - server: localhost:8500
             services: ['cadvisor-tsuru']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
         metric_relabel_configs:
           - source_labels: ['container_label_tsuru_app_name']
             target_label: 'app_name'
           - source_labels: ['container_label_tsuru_process_name']
             target_label: 'app_process'
           - source_labels: ['id']
             regex: '/docker/(.{12}).*'
             replacement: '${1}'
             target_label: 'app_instance'
    
       - job_name: 'hbase-regionserver'
         scrape_interval: 5m
         scrape_timeout: 5m
         consul_sd_configs:
           - server: localhost:8500
             services: ['hbase-regionserver-metrics']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
           - source_labels: ['__meta_consul_node']
             regex: '([^.]+?)(-?\d+)[.].*'
             replacement: '${1}'
             target_label: 'cluster'
    
       - job_name: 'elasticsearch'
         scrape_interval: 15s
         scrape_timeout: 5s
         static_configs:
           - targets: ['marathon-lb.service.jusbrasil:10045']
    
       - job_name: 'rabbitmq'
         scrape_interval: 15s
         scrape_timeout: 5s
         static_configs:
           - targets: ['marathon-lb.service.jusbrasil:10070']
    
       - job_name: 'mandioca-rabbitmq'
         scrape_interval: 15s
         scrape_timeout: 5s
         static_configs:
           - targets: ['marathon-lb.service.jusbrasil:10164']
    
       - job_name: 'kissmetrics-consumer'
         scrape_interval: 30s
         scrape_timeout: 5s
         static_configs:
           - targets: ['marathon-lb.service.jusbrasil:10056']
    
       - job_name: 'nginx'
         scrape_interval: 15s
         consul_sd_configs:
           - server: localhost:8500
             services: ['nginx-metrics']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
    
       - job_name: 'nginx-vts'
         scrape_interval: 5s
         consul_sd_configs:
           - server: localhost:8500
             services: ['nginx-vts-exporter']
         relabel_configs:
           - source_labels: ['__meta_consul_node', '__meta_consul_service_port']
             separator: ':'
             target_label: 'instance'
           - source_labels: ['__meta_consul_node']
             target_label: 'node'
           - source_labels: ['__meta_consul_address']
             target_label: 'address'
    
       - job_name: 'marathon'
         scrape_interval: 15s
         marathon_sd_configs:
           - servers:
               - 'http://marathon:8080'
         relabel_configs:
           - source_labels: ['__meta_marathon_app']
             target_label: 'marathon_app'
           - source_labels: ['__address__']
             regex: '([^:]+?)[:](.*)'
             replacement: '${1}'
             target_label: 'node'
           - source_labels: ['__address__']
             regex: '([^:]+?)[:](.*)'
             replacement: '${2}'
             target_label: 'app_port'
           - source_labels: ['__meta_marathon_task']
             regex: '([^.]+?)[.].*'
             replacement: '${1}'
             target_label: 'app_name'
           - source_labels: ['__meta_marathon_app_label_PROMETHEUS']
             regex: '1'
             action: 'keep'
           - source_labels: ['__meta_marathon_port_definition_label_PROMETHEUS_SKIP']
             regex: '1'
             action: 'drop'
           - source_labels: ['__meta_marathon_port_mapping_label_PROMETHEUS_SKIP']
             regex: '1'
             action: 'drop'
    
  • Logs


beorn7 commented Mar 26, 2017

Your server has a lot of metrics. The logs are truncated, so I cannot see how many series are currently active in memory, but 12630000 archived metrics is a lot. They all have to be indexed, and you OOM halfway through that. That's mostly a LevelDB problem, in my experience. If you have to index a lot in a short amount of time, it takes a ginormous amount of RAM.

You can try to GC more aggressively by setting the GOGC environment variable before starting Prometheus, something like

export GOGC=30
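
As a sketch of the GOGC suggestion above: the Go runtime reads `GOGC` from the environment at startup, and a value of 30 asks it to start a collection once the heap has grown 30% over the live heap left by the previous cycle (the default is 100). The helper name `run_with_gogc` is hypothetical and not part of Prometheus. It uses a prefix assignment so the variable reaches one process only, instead of being exported for the whole shell session.

```shell
#!/bin/sh
# run_with_gogc is a hypothetical helper, not part of Prometheus.
# usage: run_with_gogc PERCENT COMMAND [ARGS...]
run_with_gogc() {
  gogc_percent=$1
  shift
  # Prefix assignment: for an external command, GOGC is placed in the
  # environment of that one process and the calling shell is left as it was.
  GOGC="$gogc_percent" "$@"
}

# Harmless demonstration: the child process reports the value it received.
run_with_gogc 30 sh -c 'echo "child sees GOGC=$GOGC"'
```

To apply it to the command from the report, the call would start with `run_with_gogc 30 /opt/prometheus-1.5.2/prometheus -config.file=/etc/prometheus/prometheus.yml`, followed by the remaining flags. A lower GOGC trades extra CPU time for a smaller heap, so it may help crash recovery fit in RAM without changing how many series the server can hold.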

But even if you make it through crash recovery, your server might need way more RAM to cope with your number of time series. (Earlier in the logs, you can see how many in-memory time series you have. With 32GiB of RAM, I would not go beyond 2–3 million time series in memory. And even then, you need to tweak your flags. With your current flags, you should not have more than 1M time series for smooth operation.)
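
The "not more than 1M time series" figure can be roughly reproduced from the flags in the report. The sketch below assumes the Prometheus 1.x chunk size of 1024 bytes and a rule of thumb of about three memory chunks per active series; the factor of three is an assumption used for illustration, not something stated in this thread.

```shell
#!/bin/sh
# Value taken from -storage.local.memory-chunks in the command above.
memory_chunks=3145728
# Prometheus 1.x stores samples in chunks of 1024 bytes each.
chunk_bytes=1024
# Assumed rule of thumb: roughly three memory chunks per active series.
chunks_per_series=3

# Memory held by the chunk payload alone, excluding indexes and Go overhead.
chunk_mib=$(( memory_chunks * chunk_bytes / 1024 / 1024 ))
# Number of series the configured chunk budget comfortably covers.
max_series=$(( memory_chunks / chunks_per_series ))

echo "chunk payload alone: ${chunk_mib} MiB"
echo "series covered by the chunk budget: about ${max_series}"
```

Under these assumptions the configured budget covers about 1,048,576 series, which lines up with the 1M estimate. Going to 2-3 million series would mean raising `-storage.local.memory-chunks` accordingly, and actual RAM use sits well above the raw chunk payload.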


beorn7 commented Mar 26, 2017

I'm closing this as it doesn't appear to be a bug but the expected behavior. Should you need more support, please ask on the prometheus-users mailing list, where more people are available to help and other users can benefit from the answers.

beorn7 closed this Mar 26, 2017


robsonpeixoto commented Mar 26, 2017

Thanks @beorn7. I'll study a better way to reduce the number of time series.


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
