
prometheus 2.3.0 becomes OOM when consul is unavailable #4253

Closed
Wing924 opened this Issue Jun 12, 2018 · 18 comments

Wing924 commented Jun 12, 2018

Bug Report

What did you do?
I use Consul SD in Prometheus.
I upgraded the Consul servers, but the failover failed and the Consul cluster was down for a few minutes.

What did you expect to see?
Prometheus running as usual.

What did you see instead? Under which circumstances?
Prometheus ran out of memory.

Environment

  • System information:

    Linux 3.10.0-693.2.2.el7.x86_64 x86_64

  • Prometheus version:

    prometheus, version 2.3.0 (branch: HEAD, revision: 290d717)
    build user: root@d539e167976a
    build date: 20180607-08:46:54
    go version: go1.10.2

  • Prometheus configuration file:

    global:
      scrape_interval:     15s
      evaluation_interval: 15s

    scrape_configs:
    - job_name: 'netdata'
      metrics_path: '/api/v1/allmetrics?format=prometheus'
      relabel_configs:
      - source_labels: ['__address__']
        modulus: 4
        target_label: __tmp_hash
        action: hashmod
      - source_labels: ['__tmp_hash']
        regex: '^0$'
        action: keep
      - source_labels: ['env']
        regex: '^(dev|stg)$'
        action: keep
      - source_labels: ['__address__']
        regex: '^([^:]+):\d+'
        target_label: 'fqdn'
        replacement: '$1'
      - source_labels: ['fqdn']
        regex: '^([^\.]+)\..+$'
        target_label: 'hostname'
        replacement: '$1'
      - source_labels: ['hostname']
        regex: '^([^.]+\d)(\d\d)(z|zd|c)?'
        target_label: 'host_group'
        replacement: '$1'
      file_sd_configs:
      - files:
        - /etc/prometheus/conf.d/netdata.yml

    - job_name: 'prometheus_exporter'
      relabel_configs:
      - source_labels: ['__meta_consul_address']
        modulus: 4
        target_label: __tmp_hash
        action: hashmod
      - source_labels: ['__tmp_hash']
        regex: '^0$'
        action: keep
      - source_labels: [__meta_consul_tags]
        regex: '.*,type-([^,]+),.*'
        replacement: '$1'
        target_label: job
      - source_labels: [__meta_consul_node]
        target_label: fqdn
      - source_labels: [fqdn]
        regex: '^([^\.]+)\..+$'
        replacement: '$1'
        target_label: 'hostname'
      - source_labels: [__meta_consul_metadata_env]
        target_label: 'env'
      - source_labels: [__meta_consul_metadata_host_group]
        target_label: 'host_group'
      consul_sd_configs:
      - server: localhost:8500
        services:
        - prometheus_exporter
  • Logs:
Jun 12 16:09:41  prometheus[12720]: level=info ts=2018-06-12T07:09:41.437914536Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.0, branch=HEAD, revision=290d71791a507a5057b9a099c9d48703d86dc941)"
Jun 12 16:09:41  prometheus[12720]: level=info ts=2018-06-12T07:09:41.437996743Z caller=main.go:223 build_context="(go=go1.10.2, user=root@d539e167976a, date=20180607-08:46:54)"
Jun 12 16:09:41  prometheus[12720]: level=info ts=2018-06-12T07:09:41.438016309Z caller=main.go:224 host_details="(Linux 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64  (none))"
Jun 12 16:09:41  prometheus[12720]: level=info ts=2018-06-12T07:09:41.438031205Z caller=main.go:225 fd_limits="(soft=1006500, hard=1006500)"
Jun 12 16:09:41  prometheus[12720]: level=info ts=2018-06-12T07:09:41.439396398Z caller=main.go:514 msg="Starting TSDB ..."
Jun 12 16:09:41  prometheus[12720]: level=info ts=2018-06-12T07:09:41.439664658Z caller=web.go:426 component=web msg="Start listening for connections" address=0.0.0.0:9090
Jun 12 16:09:41  prometheus[12720]: level=info ts=2018-06-12T07:09:41.443322109Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528768800000 maxt=1528776000000 ulid=01CFS5QNSYDD4V91EHVGXHHPC2
Jun 12 16:09:41  prometheus[12720]: level=info ts=2018-06-12T07:09:41.444473845Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528776000000 maxt=1528783200000 ulid=01CFSCKD20V0X9Z619CTJG0HC1
Jun 12 16:09:55  prometheus[12720]: level=info ts=2018-06-12T07:09:55.243619665Z caller=main.go:524 msg="TSDB started"
Jun 12 16:09:55  prometheus[12720]: level=info ts=2018-06-12T07:09:55.243684491Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Jun 12 16:09:55  prometheus[12720]: level=info ts=2018-06-12T07:09:55.244611863Z caller=main.go:500 msg="Server is ready to receive web requests."
Jun 12 16:09:55  prometheus[12720]: level=error ts=2018-06-12T07:09:55.246268501Z caller=consul.go:258 component="discovery manager scrape" discovery=consul msg="Error retrieving datacenter name" err="Unexpected response code: 403 (Permission denied)"
Jun 12 16:10:10 – 16:16:40  prometheus[12720]: [the same "Error retrieving datacenter name" 403 (Permission denied) error repeated every 15 seconds]
Jun 12 16:23:57  prometheus[12720]: fatal error: runtime: out of memory
Jun 12 16:23:57  prometheus[12720]: runtime stack:
Jun 12 16:23:57  prometheus[12720]: runtime.throw(0x1c859d7, 0x16)
Jun 12 16:23:57  prometheus[12720]: /usr/local/go/src/runtime/panic.go:616 +0x81
Jun 12 16:23:57  prometheus[12720]: runtime.sysMap(0xc6aef40000, 0x200000000, 0x0, 0x2b205f8)
Jun 12 16:23:57  prometheus[12720]: /usr/local/go/src/runtime/mem_linux.go:216 +0x20a
Jun 12 16:23:57  prometheus[12720]: runtime.(*mheap).sysAlloc(0x2b06ec0, 0x200000000, 0x470c90)
Jun 12 16:23:57  prometheus[12720]: /usr/local/go/src/runtime/malloc.go:470 +0xd4
Jun 12 16:23:57  prometheus[12720]: runtime.(*mheap).grow(0x2b06ec0, 0x100000, 0x0)
Jun 12 16:23:57  prometheus[12720]: /usr/local/go/src/runtime/mheap.go:907 +0x60
Jun 12 16:23:57  prometheus[12720]: runtime.(*mheap).allocSpanLocked(0x2b06ec0, 0x100000, 0x2b20608, 0xc41d6a48ad)
Jun 12 16:23:57  prometheus[12720]: /usr/local/go/src/runtime/mheap.go:820 +0x301
Jun 12 16:23:57  prometheus[12720]: runtime.(*mheap).alloc_m(0x2b06ec0, 0x100000, 0x101, 0x70)
Jun 12 16:23:57  prometheus[12720]: /usr/local/go/src/runtime/mheap.go:686 +0x118
Jun 12 16:23:57  prometheus[12720]: runtime.(*mheap).alloc.func1()
Jun 12 16:23:57  prometheus[12720]: /usr/local/go/src/runtime/mheap.go:753 +0x4d
Jun 12 16:23:57  prometheus[12720]: runtime.(*mheap).alloc(0x2b06ec0, 0x100000, 0xc420010101, 0x4140bc)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/runtime/mheap.go:752 +0x8a
Jun 12 16:23:58  prometheus[12720]: runtime.largeAlloc(0x200000000, 0x101, 0xc420054ec0)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/runtime/malloc.go:826 +0x94
Jun 12 16:23:58  prometheus[12720]: runtime.mallocgc.func1()
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/runtime/malloc.go:721 +0x46
Jun 12 16:23:58  prometheus[12720]: runtime.systemstack(0x0)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/runtime/asm_amd64.s:409 +0x79
Jun 12 16:23:58  prometheus[12720]: runtime.mstart()
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/runtime/proc.go:1175
Jun 12 16:23:58  prometheus[12720]: goroutine 112619 [running]:
Jun 12 16:23:58  prometheus[12720]: runtime.systemstack_switch()
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/runtime/asm_amd64.s:363 fp=0xc4ad17b0e0 sp=0xc4ad17b0d8 pc=0x457300
Jun 12 16:23:58  prometheus[12720]: runtime.mallocgc(0x200000000, 0x1a15ee0, 0x152a101, 0xc4ad17b1c0)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/runtime/malloc.go:720 +0x8a2 fp=0xc4ad17b180 sp=0xc4ad17b0e0 pc=0x410692
Jun 12 16:23:58  prometheus[12720]: runtime.makeslice(0x1a15ee0, 0x20000000, 0x20000000, 0x1565384, 0xc49296e6c0, 0x163f2deaa58)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/runtime/slice.go:61 +0x77 fp=0xc4ad17b1b0 sp=0xc4ad17b180 pc=0x441497
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/storage.(*sampleRing).add(0xc492968870, 0x163f2deaa58, 0x7ff0000000000002)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/storage/buffer.go:186 +0x10b fp=0xc4ad17b220 sp=0xc4ad17b1b0 pc=0x152696b
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/storage.(*BufferedSeriesIterator).Next(0xc492945d10, 0x163f2db0001)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/storage/buffer.go:96 +0x51 fp=0xc4ad17b248 sp=0xc4ad17b220 pc=0x1526641
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/storage.(*BufferedSeriesIterator).Seek(0xc492945d10, 0x163f2def39e, 0x493e0)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/storage/buffer.go:80 +0x53 fp=0xc4ad17b270 sp=0xc4ad17b248 pc=0x1526523
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/web.(*Handler).federation(0xc420229500, 0x1de1240, 0xc48200ab20, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/web/federate.go:97 +0xb6b fp=0xc4ad17b758 sp=0xc4ad17b270 pc=0x16ffe8b
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/web.(*Handler).(github.com/prometheus/prometheus/web.federation)-fm(0x1de1240, 0xc48200ab20, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/web/web.go:272 +0x48 fp=0xc4ad17b788 sp=0xc4ad17b758 pc=0x170af18
Jun 12 16:23:58  prometheus[12720]: net/http.HandlerFunc.ServeHTTP(0xc42034b560, 0x1de1240, 0xc48200ab20, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/net/http/server.go:1947 +0x44 fp=0xc4ad17b7b0 sp=0xc4ad17b788 pc=0x68c674
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP(0x1dd6c00, 0xc42034b560, 0x7f5c1a56f0a0, 0xc4459db720, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/util/httputil/compression.go:90 +0x7c fp=0xc4ad17b7e8 sp=0xc4ad17b7b0 pc=0x169e39c
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/util/httputil.(CompressionHandler).ServeHTTP-fm(0x7f5c1a56f0a0, 0xc4459db720, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/web/web.go:273 +0x57 fp=0xc4ad17b820 sp=0xc4ad17b7e8 pc=0x16e2fc7
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/web.(*Handler).testReady.func1(0x7f5c1a56f0a0, 0xc4459db720, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/web/web.go:401 +0x55 fp=0xc4ad17b880 sp=0xc4ad17b820 pc=0x17097a5
Jun 12 16:23:58  prometheus[12720]: net/http.HandlerFunc.ServeHTTP(0xc42052d600, 0x7f5c1a56f0a0, 0xc4459db720, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/net/http/server.go:1947 +0x44 fp=0xc4ad17b8a8 sp=0xc4ad17b880 pc=0x68c674
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/vendor/github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1(0x1de3480, 0xc48200ab00, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:196 +0xed fp=0xc4ad17b930 sp=0xc4ad17b8a8 pc=0x1689c3d
Jun 12 16:23:58  prometheus[12720]: net/http.HandlerFunc.ServeHTTP(0xc4204c34a0, 0x1de3480, 0xc48200ab00, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/net/http/server.go:1947 +0x44 fp=0xc4ad17b958 sp=0xc4ad17b930 pc=0x68c674
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/vendor/github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerDuration.func2(0x1de3480, 0xc48200ab00, 0xc4459f1100)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/client_golang/prometheus/promhttp/instrument_server.go:76 +0xb5 fp=0xc4ad17b9e0 sp=0xc4ad17b958 pc=0x16897e5
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/vendor/github.com/prometheus/common/route.(*Router).handle.func1(0x1de3480, 0xc48200ab00, 0xc4459f1000, 0x0, 0x0, 0x0)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/common/route/route.go:60 +0x222 fp=0xc4ad17ba98 sp=0xc4ad17b9e0 pc=0x169ce02
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/vendor/github.com/julienschmidt/httprouter.(*Router).ServeHTTP(0xc4203824c0, 0x1de3480, 0xc48200ab00, 0xc4459f1000)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/vendor/github.com/julienschmidt/httprouter/router.go:299 +0x6d1 fp=0xc4ad17bb78 sp=0xc4ad17ba98 pc=0x1698de1
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/vendor/github.com/prometheus/common/route.(*Router).ServeHTTP(0xc42052c640, 0x1de3480, 0xc48200ab00, 0xc4459f1000)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/common/route/route.go:98 +0x4c fp=0xc4ad17bba8 sp=0xc4ad17bb78 pc=0x169caac
Jun 12 16:23:58  prometheus[12720]: net/http.(*ServeMux).ServeHTTP(0xc42097e330, 0x1de3480, 0xc48200ab00, 0xc4459f1000)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/net/http/server.go:2337 +0x130 fp=0xc4ad17bbe8 sp=0xc4ad17bba8 pc=0x68e3e0
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/web.secureHeadersMiddleware.func1(0x1de3480, 0xc48200ab00, 0xc4459f1000)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/web/web.go:81 +0x195 fp=0xc4ad17bc30 sp=0xc4ad17bbe8 pc=0x17090e5
Jun 12 16:23:58  prometheus[12720]: net/http.HandlerFunc.ServeHTTP(0xc4204d5f20, 0x1de3480, 0xc48200ab00, 0xc4459f1000)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/net/http/server.go:1947 +0x44 fp=0xc4ad17bc58 sp=0xc4ad17bc30 pc=0x68c674
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/vendor/github.com/opentracing-contrib/go-stdlib/nethttp.Middleware.func2(0x1dead80, 0xc4203f3dc0, 0xc4459f0f00)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:74 +0x3ab fp=0xc4ad17bd30 sp=0xc4ad17bc58 pc=0x168217b
Jun 12 16:23:58  prometheus[12720]: net/http.HandlerFunc.ServeHTTP(0xc4204c24e0, 0x1dead80, 0xc4203f3dc0, 0xc4459f0f00)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/net/http/server.go:1947 +0x44 fp=0xc4ad17bd58 sp=0xc4ad17bd30 pc=0x68c674
Jun 12 16:23:58  prometheus[12720]: net/http.serverHandler.ServeHTTP(0xc420532a90, 0x1dead80, 0xc4203f3dc0, 0xc4459f0f00)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/net/http/server.go:2694 +0xbc fp=0xc4ad17bd88 sp=0xc4ad17bd58 pc=0x68f48c
Jun 12 16:23:58  prometheus[12720]: net/http.(*conn).serve(0xc4459e9220, 0x1decd80, 0xc4947f99c0)
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/net/http/server.go:1830 +0x651 fp=0xc4ad17bfc8 sp=0xc4ad17bd88 pc=0x68b691
Jun 12 16:23:58  prometheus[12720]: runtime.goexit()
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc4ad17bfd0 sp=0xc4ad17bfc8 pc=0x459d11
Jun 12 16:23:58  prometheus[12720]: created by net/http.(*Server).Serve
Jun 12 16:23:58  prometheus[12720]: /usr/local/go/src/net/http/server.go:2795 +0x27b
Jun 12 16:23:58  prometheus[12720]: goroutine 1 [chan receive, 14 minutes]:
Jun 12 16:23:58  prometheus[12720]: github.com/prometheus/prometheus/vendor/github.com/oklog/oklog/pkg/group.(*Group).Run(0xc4209a5c90, 0xc4206cc070, 0x8)
Jun 12 16:23:58  prometheus[12720]: /go/src/github.com/prometheus/prometheus/vendor/github.com/oklog/oklog/pkg/group/group.go:43 +0xec
Jun 12 16:23:58  prometheus[12720]: main.main()
...

brian-brazil commented Jun 12, 2018

Does the same happen without the Consul scrape config being listed? The consul log messages could be circumstantial.

iksaif commented Jun 12, 2018

looking

iksaif commented Jun 12, 2018

So, I can't reproduce it. The stack trace points at /federate. Could you try to isolate the issue (with and without federation, consul, netdata)?

Wing924 commented Jun 13, 2018

@iksaif
Yes, I use /federate.

I did a test:

Environment 1

  • prometheus cluster
    • server[01-04]: consul SD disabled (the - job_name: 'prometheus_exporter' block commented out)
    • server[05-08]: same config as posted above
  • consul cluster
    • version: 1.1.0
    • ACL blocks reading services

Environment 2

  • prometheus cluster
    • server[01-04]: consul SD disabled (the - job_name: 'prometheus_exporter' block commented out)
    • server[05-08]: same config as posted above
  • consul cluster
    • version: 1.1.0
    • ACL allows reading services

Environment 3

  • prometheus cluster
    • server[01-04]: consul SD disabled (the - job_name: 'prometheus_exporter' block commented out)
    • server[05-08]: same config as posted above
  • consul cluster
    • version: 1.0.1
    • ACL allows reading services

Result

  • Environment 1
    • servers[01-04] work well
    • servers[05-08] still go OOM and are restarted by systemd repeatedly
  • Environment 2
    • all servers[01-08] work well
  • Environment 3
    • all servers[01-08] work well

Wing924 commented Jun 13, 2018

I don't know whether this change in Consul is what makes Prometheus OOM:
https://www.consul.io/docs/upgrading.html#upgrade-from-version-1-0-6-to-higher
or whether it is this kernel bug:
https://access.redhat.com/solutions/3441101
I use CentOS 7.5.

iksaif commented Jun 13, 2018

We used 1.1.0 and I wasn't able to reproduce this issue.

Could you try to get a heap profile with go tool pprof http://localhost:9090/debug/pprof/heap ?
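For reference, here is a minimal Go sketch of one way to capture the raw profile so it can be attached to the issue. Assumptions: Prometheus is listening on localhost:9090 with the default /debug/pprof endpoints, and the names save_heap.go / heap.pprof are arbitrary. The saved file can be opened later with go tool pprof heap.pprof.

// save_heap.go: a small helper that downloads the heap profile and writes
// it to disk so it can be shared. Assumes Prometheus listens on
// localhost:9090 (the default); the output name heap.pprof is arbitrary.
package main

import (
    "io"
    "log"
    "net/http"
    "os"
)

func main() {
    resp, err := http.Get("http://localhost:9090/debug/pprof/heap")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    out, err := os.Create("heap.pprof")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    if _, err := io.Copy(out, resp.Body); err != nil {
        log.Fatal(err)
    }
    log.Println("wrote heap.pprof")
}

This just saves the endpoint's response body, so it keeps a copy of the profile from the moment memory is already growing.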

Wing924 commented Jun 13, 2018

[attached a pprof heap profile]

krasi-georgiev commented Jun 13, 2018

@Wing924 this doesn't show high RAM usage. Can you take the profile at the time it replicates the problem, or just before it is OOM killed?

Wing924 commented Jun 13, 2018

@krasi-georgiev Sorry, I uploaded the new pprof here:
prometheus (pprof attachment)

Wing924 commented Jun 13, 2018

This happened after I upgraded the consul server cluster from 1.0.1 to 1.1.0.
Yesterday I upgraded the consul server cluster on staging, and it happened.
Today I did the same thing on production, and it happened again.

krasi-georgiev commented Jun 13, 2018

That is weird, as the high mem usage points to federation. Does this behave normally in Prometheus 2.2?

Can you strip this down to the most minimal config that replicates the bug so I can also try it locally?

Wing924 commented Jun 13, 2018

@krasi-georgiev Half of my Prometheus servers are 2.2.1 and the others are 2.3.0. It happened on both versions.

Can you strip this down to the most minimal config that replicates the bug so I can also try it locally?

I can't run the test again, because this incident takes down the whole monitoring system.
I'll set up a test environment to do it, but that will take some time.

krasi-georgiev commented Jun 13, 2018

@Wing924 thanks, much appreciated. Ping me with the results and I will try to replicate with the minimal config as well.

brian-brazil commented Jun 13, 2018

That is weird, as the high mem usage points to federation.

Yes, that looks like it. There's a massive federation request being processed here. Can you share the configuration of the Prometheus sending it?

Wing924 commented Jun 13, 2018

@brian-brazil
Here is the upstream prometheus config:

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
- job_name: 'federate'
  honor_labels: true
  params:
    'match[]':
    - 'up{}'
    - 'netdata_info{}'
    - '{job="netdata", __name__=~"netdata_(apache|cpu_cpu|disk|ipv4_(sockstat_tcp_sockets|tcpsock)|memcached|nginx|rabbitmq|redis|springboot|system_(cpu|io|ipv4|load|ram|swap)|tomcat|users|web_log)_.*"}'
    - '{env=~".+", job!="netdata"}'
  relabel_configs:
  - source_labels: [metrics_path]
    target_label: __metrics_path__
  - regex: metrics_path
    action: labeldrop
  file_sd_configs:
  - files:
    - /etc/prometheus/conf.d/worker_groups.yml

brian-brazil commented Jun 13, 2018

So that's going to pull in basically an entire Prometheus worth of data via federation. I'm a bit confused as to why the sampleRing is so big though.
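To make that concrete, below is a simplified, hypothetical sketch (not the actual storage/buffer.go code) of how a delta-bounded sample buffer that grows by appending can balloon: if a sample with a far-future or otherwise bad timestamp sits at the front of the buffer, the eviction scan never drops anything and the backing slice keeps doubling. For scale, the makeslice call in the trace above asks for 0x20000000 (about 537 million) samples in one slice, roughly 8 GiB.

// ringdemo.go: illustrative only, not the Prometheus implementation.
package main

import "fmt"

type sample struct {
    t int64 // timestamp in ms
    v float64
}

// sampleRing keeps samples from the last delta milliseconds, measured
// against the timestamp of the most recently added sample.
type sampleRing struct {
    delta int64
    buf   []sample
}

func (r *sampleRing) add(s sample) {
    r.buf = append(r.buf, s) // append doubles the backing array as needed
    min := s.t - r.delta
    i := 0
    for i < len(r.buf) && r.buf[i].t < min {
        i++ // drop samples that fell out of the window
    }
    r.buf = r.buf[i:]
}

func main() {
    r := &sampleRing{delta: 5 * 60 * 1000} // 5m window
    // A single far-future timestamp at the front of the buffer is never
    // older than min, so the eviction scan stops at it immediately and
    // every later sample is retained; the buffer only ever grows.
    r.add(sample{t: 4102444800000, v: 1}) // ~year 2100 in ms
    for ts := int64(0); ts < 1000000; ts++ {
        r.add(sample{t: ts, v: 1})
    }
    fmt.Println("buffered samples:", len(r.buf))
}

This is only meant to show why a buffer that never drains translates directly into multi-gigabyte allocations; the actual root cause here was tracked down separately (see the follow-up below).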

brian-brazil commented Jun 22, 2018

This looks like it was #4254.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
