
OOM: killed process (prometheus), is there a memory leak? #1549

Closed
guanglinlv opened this Issue Apr 12, 2016 · 21 comments

guanglinlv commented Apr 12, 2016

Hi, I have a single Prometheus server that scrapes about 50+ targets. It gets OOM-killed after running for several hours. I'm confused by this; details below:

  • dmesg
[3907506.014018] [50093]     0 50093  8556396  8490383   16703        0             0 prometheus
[3907506.014031] Out of memory: Kill process 50093 (prometheus) score 947 or sacrifice child
[3907506.014035] Killed process 50093 (prometheus) total-vm:34225584kB, anon-rss:33961532kB, file-rss:0kB
[3920674.254981] [63061]     0 63061  8506886  8492690   16614        0             0 prometheus
[3920674.260250] Out of memory: Kill process 63061 (prometheus) score 947 or sacrifice child
[3920674.262081] Killed process 63061 (prometheus) total-vm:34027544kB, anon-rss:33970760kB, file-rss:0kB
[3958788.455016] [105674]     0 105674  8547989  8492511   16685        0             0 prometheus
[3958788.460257] Out of memory: Kill process 105674 (prometheus) score 947 or sacrifice child
[3958788.462060] Killed process 105674 (prometheus) total-vm:34191956kB, anon-rss:33970044kB, file-rss:0kB
[3970678.851899] [117374]     0 117374  8505681  8494867   16616        0             0 prometheus
[3970678.855538] Out of memory: Kill process 117374 (prometheus) score 947 or sacrifice child
[3970678.857368] Killed process 117374 (prometheus) total-vm:34022724kB, anon-rss:33979468kB, file-rss:0kB
  • system info
[15:23 root@prometheus-poc:/var/mwc/jobs] # cat /etc/redhat-release 
CentOS Linux release 7.1.1503 (Core) 
[15:23 root@prometheus-poc:/var/mwc/jobs] # uname -a
Linux prometheus-poc 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[15:23 root@prometheus-poc:/var/mwc/jobs] # free -bt
              total        used        free      shared  buff/cache   available
Mem:    35682078720 24212639744   372396032    94019584 11097042944 11069128704
Swap:             0           0           0
Total:  35682078720 24212639744   372396032
  • prometheus version
prometheus, version 0.17.0 (branch: release-0.17, revision: e11fab3)
  build user:       fabianreinartz@macpro
  build date:       20160302-17:48:43
  go version:       1.5.3
  • prometheus startup flags
prometheus -config.file=/var/mwc/jobs/prometheus/conf/prometheus.yml -storage.local.path=/mnt/prom_data -storage.local.memory-chunks=1048576 -log.level=debug -storage.remote.opentsdb-url=http://10.63.121.35:4242 -alertmanager.url=http://10.63.121.65:9093
  • prometheus scrape config
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    scrape_timeout: 10s
    target_groups:
      - targets: ['localhost:9090']
  - job_name: 'node'
    scrape_interval: 5s
    scrape_timeout: 10s
    target_groups:
      - targets: ['localhost:9100']

  - job_name: 'overwritten-default'
    scrape_interval: 5s
    scrape_timeout: 10s
    consul_sd_configs:
      - server: <consul_server>
        datacenter: "consul_dc"

    relabel_configs:
      - source_labels: ['__meta_consul_service_id']
        regex:         '(.*)'
        target_label:  'job'
        replacement:   '$1'
        action:        'replace'
      - source_labels: ['__meta_consul_service_address','__meta_consul_service_port']
        separator:     ';'
        regex:         '(.*);(.*)'
        target_label:  '__address__'
        replacement:   '$1:$2'
        action:        'replace'
      - source_labels: ['__meta_consul_service_id']
        regex:         '^prometheus_.*'
        action:        'keep'
  • prometheus process status
Name:   prometheus
State:  S (sleeping)
Tgid:   130923
Ngid:   0
Pid:    130923
PPid:   1
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 512
Groups: 
VmPeak: 19548872 kB
VmSize: 19548872 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:  19486532 kB
VmRSS:  19486532 kB
VmData: 19532964 kB
VmStk:       136 kB
VmExe:      6776 kB
VmLib:         0 kB
VmPTE:     38184 kB
VmSwap:        0 kB
Threads:    19
SigQ:   2/136048
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: fffffffe7fc1feff
CapInh: 0000000000000000
CapPrm: 0000001fffffffff
CapEff: 0000001fffffffff
CapBnd: 0000001fffffffff
Seccomp:    0
Cpus_allowed:   ff
Cpus_allowed_list:  0-7
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    1165275
nonvoluntary_ctxt_switches: 234755
  • graph of process_resident_memory_bytes (screenshot not reproduced here)
  • graph of prometheus_local_storage_memory_chunks (screenshot not reproduced here)
thanks.

brian-brazil (Member) commented Apr 12, 2016

That Prometheus should only be using ~3GB of RAM, but it looks like it'll top out at ~70GB.

Do you happen to have over 20M timeseries? If so you need a bigger box and to increase -storage.local.memory-chunks.
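A rough back-of-the-envelope check of where that ~3 GB estimate comes from, assuming the ~1 KiB chunk size of Prometheus 0.x local storage and the commonly cited rule of thumb of roughly 3x overhead on top of the raw chunk memory (both figures are assumptions here, not measurements from this setup):

    -storage.local.memory-chunks=1048576
    1,048,576 chunks x ~1 KiB/chunk                 ≈ 1 GiB of raw chunk data
    1 GiB x ~3 (series maps, indexes, GC headroom)  ≈ 3 GiB expected resident memory

The ~29-34 GB actually observed is an order of magnitude above that, so ordinary chunk memory alone cannot explain the usage.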

guanglinlv (Author) commented Apr 12, 2016

@brian-brazil thanks for your quick reply. What do you mean by 20M time series? Could you explain how the number of time series relates to memory usage? BTW, node_exporter and cAdvisor are the only exporters running there.

thanks a lot.

brian-brazil (Member) commented Apr 12, 2016

What's the value of prometheus_local_storage_memory_series?

guanglinlv (Author) commented Apr 12, 2016

  • metrics matching ^prometheus_local_storage_
[16:10 root@prometheus-poc:~] # curl -s http://10.63.121.65:9090/metrics | grep '^prometheus_local_storage'
prometheus_local_storage_checkpoint_duration_milliseconds 9878.559365
prometheus_local_storage_chunk_ops_total{type="clone"} 3
prometheus_local_storage_chunk_ops_total{type="create"} 433051
prometheus_local_storage_chunk_ops_total{type="drop"} 162003
prometheus_local_storage_chunk_ops_total{type="load"} 1
prometheus_local_storage_chunk_ops_total{type="persist"} 181705
prometheus_local_storage_chunk_ops_total{type="pin"} 60016
prometheus_local_storage_chunk_ops_total{type="transcode"} 619224
prometheus_local_storage_chunk_ops_total{type="unpin"} 60016
prometheus_local_storage_chunkdesc_ops_total{type="evict"} 734963
prometheus_local_storage_chunkdesc_ops_total{type="load"} 707473
prometheus_local_storage_chunks_to_persist 227787
prometheus_local_storage_fingerprint_mappings_total 0
prometheus_local_storage_inconsistencies_total 0
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.5"} 56.475166
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.9"} 194.745354
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.99"} 308.09004
prometheus_local_storage_indexing_batch_duration_milliseconds_sum 499634.47357500094
prometheus_local_storage_indexing_batch_duration_milliseconds_count 4559
prometheus_local_storage_indexing_batch_sizes{quantile="0.5"} 1
prometheus_local_storage_indexing_batch_sizes{quantile="0.9"} 1
prometheus_local_storage_indexing_batch_sizes{quantile="0.99"} 1
prometheus_local_storage_indexing_batch_sizes_sum 494623
prometheus_local_storage_indexing_batch_sizes_count 4559
prometheus_local_storage_indexing_queue_capacity 16384
prometheus_local_storage_indexing_queue_length 0
prometheus_local_storage_ingested_samples_total 4.324093e+07
prometheus_local_storage_invalid_preload_requests_total 0
prometheus_local_storage_maintain_series_duration_milliseconds{location="archived",quantile="0.5"} 5.956355
prometheus_local_storage_maintain_series_duration_milliseconds{location="archived",quantile="0.9"} 15.741466
prometheus_local_storage_maintain_series_duration_milliseconds{location="archived",quantile="0.99"} 57.316122
prometheus_local_storage_maintain_series_duration_milliseconds_sum{location="archived"} 46707.698125999865
prometheus_local_storage_maintain_series_duration_milliseconds_count{location="archived"} 4562
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.5"} 25.256813
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.9"} 101.369317
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.99"} 401.066975
prometheus_local_storage_maintain_series_duration_milliseconds_sum{location="memory"} 1.3564655638540094e+06
prometheus_local_storage_maintain_series_duration_milliseconds_count{location="memory"} 27084
prometheus_local_storage_max_chunks_to_persist 524288
prometheus_local_storage_memory_chunkdescs 2.664816e+06
prometheus_local_storage_memory_chunks 420526
prometheus_local_storage_memory_series 37271
prometheus_local_storage_out_of_order_samples_total 7772
prometheus_local_storage_persist_errors_total 0
prometheus_local_storage_persistence_urgency_score 0.4344348907470703
prometheus_local_storage_rushed_mode 0
prometheus_local_storage_series_ops_total{type="archive"} 6387
prometheus_local_storage_series_ops_total{type="create"} 8551
prometheus_local_storage_series_ops_total{type="maintenance_in_archive"} 4562
prometheus_local_storage_series_ops_total{type="maintenance_in_memory"} 27084
prometheus_local_storage_series_ops_total{type="purge_from_archive"} 4466
prometheus_local_storage_series_ops_total{type="unarchive"} 3
  • up 2h 30m, ~29GB
Process 'prometheus'
  status                            Running
  monitoring status                 Monitored
  pid                               130923
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            2h 37m 
  children                          0
  memory                            28.9 GB
  memory total                      28.9 GB
  memory percent                    86.9%
  memory percent total              86.9%
  cpu percent                       16.6%
  cpu percent total                 16.6%
  data collected                    Tue, 12 Apr 2016 16:10:43
brian-brazil (Member) commented Apr 12, 2016

37k timeseries, there's something very wrong here. Can you try the latest Prometheus, and get us a pprof memory profile?

guanglinlv (Author) commented Apr 12, 2016

OK, the latest is https://github.com/prometheus/prometheus/releases/tag/0.18.0rc1? And how do I get the pprof memory profile? I'm a newbie at profiling. Thanks.

brian-brazil (Member) commented Apr 12, 2016

See https://golang.org/pkg/net/http/pprof/; the heap profile is what we want.

juliusv (Member) commented Apr 12, 2016

Example:

$ go tool pprof http://demo.robustperception.io:9090/debug/pprof/heap
Fetching profile from http://demo.robustperception.io:9090/debug/pprof/heap
Saved profile in /home/julius/pprof/pprof.demo.robustperception.io:9090.inuse_objects.inuse_space.002.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) svg > heap.svg
Generating report in heap.svg
(pprof) 

Then send the heap.svg.

guanglinlv (Author) commented Apr 13, 2016

@brian-brazil @juliusv sorry for the late reply. I'm having trouble uploading the pprof file (about 40+ kB) because of my company's proxy security restrictions.

In the meantime, let me know if there is any other debug information or angle of investigation I can provide, until I get around the proxy limitation. Thanks a lot.

Memory is up to 16 GB in 1 hour; here is the top 20 from the heap profile:

9394.79MB of 9747.25MB total (96.38%)
Dropped 706 nodes (cum <= 48.74MB)
Showing top 20 nodes out of 64 (cum >= 6715.36MB)
      flat  flat%   sum%        cum   cum%
 5745.41MB 58.94% 58.94%  5745.41MB 58.94%  github.com/prometheus/prometheus/vendor/github.com/prometheus/common/model.Metric.Clone
 1849.07MB 18.97% 77.91%  1849.07MB 18.97%  github.com/prometheus/prometheus/vendor/github.com/golang/protobuf/proto.(*Buffer).DecodeStringBytes
  880.19MB  9.03% 86.94%  6625.60MB 67.97%  github.com/prometheus/prometheus/storage/remote.(*Storage).Append
  326.81MB  3.35% 90.30%   326.81MB  3.35%  github.com/prometheus/prometheus/storage/local.newDoubleDeltaEncodedChunk
  266.39MB  2.73% 93.03%   274.52MB  2.82%  github.com/prometheus/prometheus/storage/remote.(*StorageQueueManager).Run
  127.34MB  1.31% 94.34%   399.17MB  4.10%  github.com/prometheus/prometheus/storage/local.(*persistence).loadSeriesMapAndHeads
   82.88MB  0.85% 95.19%    82.88MB  0.85%  github.com/prometheus/prometheus/vendor/github.com/syndtr/goleveldb/leveldb/util.(*BufferPool).Get
   55.02MB  0.56% 95.75%    55.02MB  0.56%  runtime.malg
   45.17MB  0.46% 96.21%   150.27MB  1.54%  github.com/prometheus/prometheus/storage/local.(*memorySeries).add
    5.50MB 0.056% 96.27%   103.10MB  1.06%  github.com/prometheus/prometheus/storage/local.doubleDeltaEncodedChunk.add
    4.01MB 0.041% 96.31%  1869.71MB 19.18%  github.com/prometheus/prometheus/vendor/github.com/matttproud/golang_protobuf_extensions/pbutil.ReadDelimited
       3MB 0.031% 96.34%  1672.07MB 17.15%  github.com/prometheus/prometheus/vendor/github.com/golang/protobuf/proto.(*Buffer).dec_slice_struct
       2MB 0.021% 96.36%    84.88MB  0.87%  github.com/prometheus/prometheus/vendor/github.com/syndtr/goleveldb/leveldb/table.(*Reader).readBlock
    1.50MB 0.015% 96.38%  1850.57MB 18.99%  github.com/prometheus/prometheus/vendor/github.com/golang/protobuf/proto.(*Buffer).dec_string
    0.50MB 0.0051% 96.38%    61.01MB  0.63%  github.com/prometheus/prometheus/retrieval.recordScrapeHealth
         0     0% 96.38%  8705.09MB 89.31%  github.com/prometheus/prometheus/retrieval.(*Target).RunScraper
         0     0% 96.38%  8704.59MB 89.30%  github.com/prometheus/prometheus/retrieval.(*Target).scrape
         0     0% 96.38%    61.01MB  0.63%  github.com/prometheus/prometheus/retrieval.(*Target).scrape.func1
         0     0% 96.38%  6715.36MB 68.89%  github.com/prometheus/prometheus/retrieval.(*ruleLabelsAppender).Append
         0     0% 96.38%  6715.36MB 68.89%  github.com/prometheus/prometheus/retrieval.ruleLabelsAppender.Append
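A rough cross-check of this profile against numbers from earlier in the thread (the metrics dump and this run are not identical, so treat this as an order-of-magnitude sketch only):

    ~4.3e7 ingested samples / ~9,400 s uptime   ≈ ~4,600 samples/s
    16 GB of growth / 3,600 s                   ≈ ~4.8 MB/s of retained heap
    ~4.8 MB/s / ~4,600 samples/s                ≈ ~1 KB retained per ingested sample

Roughly a kilobyte retained per sample is consistent with the profile above being dominated by model.Metric.Clone under remote storage Append: every sample queued for the remote write path keeps a cloned copy of its full label set alive in memory.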
guanglinlv (Author) commented Apr 13, 2016

Hi, I'm now sure what causes the high memory usage, but I'm not sure whether it is a bug or not. The cause is that the remote OpenTSDB is down: the memory of the Prometheus process then grows endlessly until OOM. If I repair OpenTSDB and start it again, the memory usage of Prometheus is acceptable, up to 1.9 GB in 4 hours.

@brian-brazil @juliusv could you confirm this abnormal case, where the remote stays down indefinitely?

thanks.

brian-brazil (Member) commented Apr 13, 2016

That doesn't sound like it, as there are timeouts and other limits on that code path. Can you try without OpenTSDB configured, to be sure?

guanglinlv (Author) commented Apr 13, 2016

@brian-brazil, yes, I'm validating without OpenTSDB now.

BTW, the heap.svg above is from the run where memory grew to 16 GB in 1 hour; you can investigate it.

brian-brazil (Member) commented Apr 13, 2016

The heap graph indicates a leak in the remote storage code. Nothing is jumping out at me from the code.

guanglinlv (Author) commented Apr 25, 2016

@brian-brazil @juliusv, I have done the validation. Several points together cause the OOM:

  • the default remote timeout is 30 seconds
  • maxConcurrentSends is 10
  • the remote OpenTSDB is very busy or a zombie, so every send ends in an i/o timeout

As a result, the pending samples grow larger and larger until OOM (see the sketch below). Is that plausible? Any suggestions?

Regards.
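For illustration, here is a minimal, self-contained Go sketch of the failure mode described in the comment above. It is not the actual Prometheus remote-storage queue manager, and all names and numbers in it are made up for the sketch: samples are appended to an unbounded in-memory buffer at a few thousand per second, while ten senders each spend ~30 seconds timing out against a dead remote, so the buffer drains at a trickle and grows until OOM.

package main

import (
	"fmt"
	"sync"
	"time"
)

// sample stands in for a queued remote-write sample; the cloned label map is
// what makes each queued sample cost on the order of a kilobyte.
type sample struct {
	labels map[string]string
	value  float64
	ts     int64
}

// queue is an unbounded in-memory buffer: this is the part that grows forever
// when the remote is unreachable.
type queue struct {
	mu      sync.Mutex
	pending []sample
}

func (q *queue) append(s sample) {
	q.mu.Lock()
	q.pending = append(q.pending, s)
	q.mu.Unlock()
}

func (q *queue) takeBatch(n int) []sample {
	q.mu.Lock()
	defer q.mu.Unlock()
	if n > len(q.pending) {
		n = len(q.pending)
	}
	batch := q.pending[:n:n]
	q.pending = q.pending[n:]
	return batch
}

func (q *queue) size() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.pending)
}

func main() {
	q := &queue{}

	// Ingestion: ~5,000 samples/s, the order of magnitude seen in this issue.
	go func() {
		for {
			q.append(sample{
				labels: map[string]string{"__name__": "x", "instance": "a"},
				value:  1,
				ts:     time.Now().Unix(),
			})
			time.Sleep(200 * time.Microsecond)
		}
	}()

	// 10 concurrent senders, each stuck ~30s per attempt in an i/o timeout,
	// so at most 10 batches of 100 samples leave the queue every 30 seconds
	// (~33 samples/s out vs. ~5,000 samples/s in).
	for i := 0; i < 10; i++ {
		go func() {
			for {
				_ = q.takeBatch(100)
				time.Sleep(30 * time.Second) // simulated i/o timeout to the dead remote
			}
		}()
	}

	// Watch the backlog climb; this is the same shape as the RSS growth reported above.
	for {
		time.Sleep(5 * time.Second)
		fmt.Printf("pending samples: %d\n", q.size())
	}
}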

brian-brazil removed the question label Apr 25, 2016

brian-brazil (Member) commented Apr 25, 2016

Is Prometheus producing any log messages about remote storage?

guanglinlv (Author) commented Apr 25, 2016

@brian-brazil, only "i/o timeout" errors are printed. (screenshot of the timeout log not reproduced here)

brian-brazil (Member) commented Apr 25, 2016

Okay, so we're not dropping samples on the floor in the queue manager. Therefore I suspect the problem is in https://github.com/prometheus/prometheus/blob/master/storage/remote/opentsdb/client.go#L74-L131

fabxc added this to the v1.0.0 milestone Apr 25, 2016

fabxc added kind/bug and removed bug labels Apr 28, 2016

brian-brazil (Member) commented Apr 28, 2016

I just got a report from someone who implemented their own remote-storage writer based on the Graphite one, and they also see memory issues, but CPU issues as well. Did you see CPU issues?

This would hint that the issue isn't just with the opentsdb code.

guanglinlv (Author) commented Apr 29, 2016

As noted in #issuecomment-208775153 above, the CPU usage is normal, only up to about 16 percent.

brian-brazil (Member) commented May 20, 2016

#1643 should fix this for you; if not, please let us know.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
