panic on unknown family type #2319

Closed
mberhault opened this Issue Jan 4, 2017 · 2 comments

mberhault commented Jan 4, 2017

What did you do?
running prometheus as usual against ~50 targets

What did you expect to see?
Working properly. This looks transient (no binary changes/upgrades/etc. around the time of the crash). It did not occur again, so this is not a persistent bad metric on our side.
I'm not sure there's much to do here, but a bit more information in the panic message would be helpful: at the very least, a "%v" of the bad value to help debugging (see the sketch below). With plumbed-in context (a lot more work), you could surface the target being processed too.
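
As an illustration of that suggestion, here is a minimal, self-contained sketch (not the actual prometheus/common code; the type and function names are made up) showing how the panic message could carry the offending value:

package main

import "fmt"

// metricType stands in for the protobuf enum (dto.MetricType in client_model);
// the values here are illustrative only.
type metricType int

const (
	counter metricType = iota
	gauge
	summary
	untyped
	histogram
)

// extractSamples mimics the shape of the decoder's type switch: on an
// unexpected enum value, panic with the value itself ("%v") instead of a
// bare "unknown metric family type" message.
func extractSamples(t metricType) {
	switch t {
	case counter, gauge, summary, untyped, histogram:
		// samples for known types would be extracted here
	default:
		panic(fmt.Sprintf("expfmt.extractSamples: unknown metric family type %v", t))
	}
}

func main() {
	extractSamples(metricType(42)) // crash output now includes the bad value (42)
}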

What did you see instead? Under which circumstances?
prometheus crashed with:

panic: expfmt.extractSamples: unknown metric family type

goroutine 115082585 [running]:
panic(0x179fc20, 0xc4b619bbe0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/prometheus/prometheus/vendor/github.com/prometheus/common/expfmt.extractSamples(0xc5095afa58, 0xc4b619bb10, 0x0, 0x0, 0x8)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/common/expfmt/decode.go:182 +0xdf
github.com/prometheus/prometheus/vendor/github.com/prometheus/common/expfmt.(*SampleDecoder).Decode(0xc5095afa40, 0xc44792db38, 0x7, 0xc8)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/common/expfmt/decode.go:156 +0x70
github.com/prometheus/prometheus/retrieval.(*targetScraper).scrape(0xc5033e7a40, 0x7fb4f7722028, 0xc446b37d40, 0xecfff03c6, 0x1f4f4b4, 0x2700300, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:345 +0x46c
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run(0xc497d7f3b0, 0x2540be400, 0x2540be400, 0x0)
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:423 +0x4af
created by github.com/prometheus/prometheus/retrieval.(*scrapePool).sync
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:240 +0x3e5

After restarting, prometheus entered recovery mode, then resumed as usual; no further crashes (so far).

Environment

  • System information:
Linux 4.8.0-30-generic x86_64
  • Prometheus version:
prometheus, version 1.4.1 (branch: master, revision: 2a89e8733f240d3cd57a6520b52c36ac4744ce12)
  build user:       root@e685d23d8809
  build date:       20161128-09:59:22
  go version:       go1.7.3
  • Alertmanager version:

N/A

  • Prometheus configuration file:

prometheus is invoked with:

prometheus -config.file=prometheus.yml -storage.local.path="/mnt/data/prometheus" -alertmanager.url="http://localhost:9093/alertmanager/" -web.listen-address="localhost:9090" -web.external-url="https://monitoring.gce.cockroachdb.com/prometheus/" -storage.local.retention=720h -log.level="debug"
global:
  scrape_interval: 10s
  evaluation_interval: 30s

rule_files:
- "rules/alerts.rules"
- "rules/aggregation.rules"

scrape_configs:
  - job_name: 'prometheus'
    metrics_path: '/prometheus/metrics'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter_prometheus'
    metrics_path: '/metrics'
    scheme: 'http'
    static_configs:
    - targets: ['localhost:9100']
      labels:
        cluster: 'prometheus'

  - job_name: 'cockroach'
    # metrics_path defaults to '/metrics'
    metrics_path: '/_status/vars'
    # scheme defaults to 'http'.
    scheme: 'https'
    tls_config:
      insecure_skip_verify: true

    static_configs:
    - targets: [
### REDACTED: 64 DNS names (all targets are down and unresolvable)
               ]
      labels:
        cluster: sky

    # Resolve DNS to find hosts.
    # *NOTE* this requires up-to-date DNS.
    dns_sd_configs:
    - names: [
### REDACTED: 8 or so dns names.
      ]
      type: 'A'
      port: 8080

    # Create a "cluster" label using the beginning of the dns name (eg: 'gamma') as the value.
    # *NOTE* this requires accurate DNS names.
    relabel_configs:
    - action: replace
      source_labels: [__meta_dns_name]
      target_label: cluster
      regex: ([^.]+)\.gce\.cockroachdb\.com
      replacement: $1
    # The DNS entry register.cockroachdb.com is for the registration endpoint, the nodes themselves
    # are a little bit different.
    - action: replace
      source_labels: [__meta_dns_name]
      target_label: cluster
      regex: register-nodes\.cockroachdb\.com
      replacement: register

  - job_name: 'node_exporter'
    # metrics_path defaults to '/metrics'
    metrics_path: '/metrics'
    # scheme defaults to 'http'.
    scheme: 'http'

    # Resolve DNS to find hosts.
    # *NOTE* this requires up-to-date DNS.
    dns_sd_configs:
    - names: [
### REDACTED: 8 or so dns names.
             ]
      type: 'A'
      port: 9100

    # Create a "cluster" label using the beginning of the dns name (eg: 'gamma') as the value.
    # *NOTE* this requires accurate DNS names.
    relabel_configs:
    - action: replace
      source_labels: [__meta_dns_name]
      target_label: cluster
      regex: ([^.]+)\.gce\.cockroachdb\.com
      replacement: $1
    # The DNS entry register.cockroachdb.com is for the registration endpoint, the nodes themselves
    # are a little bit different.
    - action: replace
      source_labels: [__meta_dns_name]
      target_label: cluster
      regex: register-nodes\.cockroachdb\.com
      replacement: register
  • Alertmanager configuration file:
    N/A

  • Logs:
    prometheus log: the DNS warnings are due to some odd Azure DNS setup. I had to override resolv.conf to force TCP DNS resolution, but it still tries UDP for some reason.
    The crash occurred about a minute after the last DNS warning.

time="2017-01-04T14:54:55Z" level=warning msg="DNS resolution failed." name=gamma.gce.cockroachdb.com reason="read udp 10.1.2.4:57405->8.8.8.8:53: i/o timeout" server=8.8.8.8 source="dns.go:190" suffix=zm4xtzk4ipyexovwm3ixtj4sqe.bx.internal.cloudapp.net
time="2017-01-04T14:54:55Z" level=warning msg="DNS resolution failed." name=cyan.gce.cockroachdb.com reason="read udp 10.1.2.4:50832->8.8.8.8:53: i/o timeout" server=8.8.8.8 source="dns.go:190" suffix=zm4xtzk4ipyexovwm3ixtj4sqe.bx.internal.cloudapp.net
time="2017-01-04T14:54:55Z" level=warning msg="DNS resolution failed." name=cobalt.gce.cockroachdb.com reason="read udp 10.1.2.4:46017->8.8.8.8:53: i/o timeout" server=8.8.8.8 source="dns.go:190" suffix=zm4xtzk4ipyexovwm3ixtj4sqe.bx.internal.cloudapp.net
time="2017-01-04T14:54:55Z" level=warning msg="DNS resolution failed." name=register-nodes.cockroachdb.com reason="read udp 10.1.2.4:49978->8.8.8.8:53: i/o timeout" server=8.8.8.8 source="dns.go:190" suffix=zm4xtzk4ipyexovwm3ixtj4sqe.bx.internal.cloudapp.net
panic: expfmt.extractSamples: unknown metric family type

goroutine 115082585 [running]:
panic(0x179fc20, 0xc4b619bbe0)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/prometheus/prometheus/vendor/github.com/prometheus/common/expfmt.extractSamples(0xc5095afa58, 0xc4b619bb10, 0x0, 0x0, 0x8)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/common/expfmt/decode.go:182 +0xdf
github.com/prometheus/prometheus/vendor/github.com/prometheus/common/expfmt.(*SampleDecoder).Decode(0xc5095afa40, 0xc44792db38, 0x7, 0xc8)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/common/expfmt/decode.go:156 +0x70
github.com/prometheus/prometheus/retrieval.(*targetScraper).scrape(0xc5033e7a40, 0x7fb4f7722028, 0xc446b37d40, 0xecfff03c6, 0x1f4f4b4, 0x2700300, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:345 +0x46c
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run(0xc497d7f3b0, 0x2540be400, 0x2540be400, 0x0)
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:423 +0x4af
created by github.com/prometheus/prometheus/retrieval.(*scrapePool).sync
        /go/src/github.com/prometheus/prometheus/retrieval/scrape.go:240 +0x3e5

supervisor logs showing start/stop time:

2016-12-23 14:31:44,910 INFO spawned: 'prometheus' with pid 8737
2016-12-23 14:31:47,068 INFO success: prometheus entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2017-01-04 14:55:02,398 INFO exited: prometheus (exit status 2; expected)
2017-01-04 14:57:47,728 INFO spawned: 'prometheus' with pid 50089
2017-01-04 14:57:49,803 INFO success: prometheus entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)

beorn7 commented Jan 5, 2017

This happens if the enum in the protobuf has an invalid value. It could happen due to data corruption, so I guess we should not panic but return an error so that callers can handle it.
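
A rough sketch of what that error-returning shape could look like (illustrative names only, not the real code; the actual change is tracked in prometheus/common#72):

package main

import (
	"fmt"
	"log"
)

// familyType stands in for the protobuf enum; illustrative only.
type familyType int

const (
	counter familyType = iota
	gauge
	summary
	untyped
	histogram
)

// extractSamples returns an error on an unexpected enum value instead of
// panicking, leaving the decision to the caller.
func extractSamples(t familyType) error {
	switch t {
	case counter, gauge, summary, untyped, histogram:
		return nil // samples for known types would be extracted here
	default:
		return fmt.Errorf("expfmt: unknown metric family type %v", t)
	}
}

func main() {
	// A caller such as the scrape loop can log the error and count the scrape
	// as failed instead of taking down the whole process.
	if err := extractSamples(familyType(42)); err != nil {
		log.Println("scrape failed:", err)
	}
}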

That has to happen in the prometheus/common repo. I have filed prometheus/common#72 for it.
Closing this one in favor of that issue.

beorn7 closed this Jan 5, 2017


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
