
Crash after Crash Recovery (len out of range) #2492

Closed
ichekrygin opened this Issue Mar 11, 2017 · 2 comments

ichekrygin commented Mar 11, 2017

What did you do?

Restarted Prometheus after an abrupt termination.

What did you expect to see?

Prometheus starts, performs crash recovery, and starts serving requests.

What did you see instead?

Prometheus starts, performs crash recovery, and crashes.

Environment

  • System information:

    Linux 4.7.3-coreos-r2 x86_64

  • Prometheus version:

    prometheus, version 1.5.2 (branch: master, revision: bd1182d)
    build user: root@1a01c5f
    build date: 20170210-16:23:28
    go version: go1.7.5

  • Alertmanager version:

    N/A

  • Prometheus configuration file:

    global:
      scrape_interval: 30s
      scrape_timeout: 30s

    rule_files:
    - /etc/prometheus/alert.rules

    scrape_configs:
    - job_name: etcd
      static_configs:
        - targets:
          - 10.72.132.6:2379
          - 10.72.134.15:2379
          - 10.72.146.27:2379
          - 10.72.145.36:2379
          - 10.72.144.241:2379

    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']

    - job_name: 'kube-state-metrics'
      static_configs:
        - targets: ['prometheus-kube-state-metrics:8080']

    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    - job_name: 'kube-dns-dnsmasq'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: http
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: kube-system;kube-dns;metrics-sidecar

    - job_name: 'kube-dns-skydns'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: http
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: kube-system;kube-dns;metrics-kubedns

    - job_name: 'kubernetes-nodes'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    - job_name: 'kubernetes-service-endpoints'
      scheme: https
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_service_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    - job_name: 'kubernetes-services'
      scheme: https
      metrics_path: /probe
      params:
        module: [http_2xx]
      kubernetes_sd_configs:
      - role: service
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_service_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
  • Alertmanager configuration file:

    N/A

  • Logs:

time="2017-03-11T01:14:58Z" level=info msg="13660000 metrics queued for indexing." source="crashrecovery.go:523"
time="2017-03-11T01:14:58Z" level=info msg="All requests for rebuilding the label indexes queued. (Actual processing may lag behind.)" source="crashrecovery.go:529"
time="2017-03-11T01:14:58Z" level=warning msg="Crash recovery complete." source="crashrecovery.go:152"
panic: runtime error: makeslice: len out of range

goroutine 1 [running]:
panic(0x18b2120, 0xc4714e2630)
        /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/prometheus/prometheus/storage/local/codable.getBuf(0x14aa82a7611b0582, 0xc42041e900, 0x14aa82a7611b0582, 0x0)
        /go/src/github.com/prometheus/prometheus/storage/local/codable/codable.go:62 +0xd6
github.com/prometheus/prometheus/storage/local/codable.decodeString(0x2686c00, 0xc42041e900, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/storage/local/codable/codable.go:142 +0xdc
github.com/prometheus/prometheus/storage/local/codable.(*Metric).UnmarshalFromReader(0xc47154d8f0, 0x2686c00, 0xc42041e900, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/storage/local/codable/codable.go:193 +0x13a
github.com/prometheus/prometheus/storage/local.(*headsScanner).scan(0xc420398460, 0xc4205d8270)
        /go/src/github.com/prometheus/prometheus/storage/local/heads.go:123 +0x130
github.com/prometheus/prometheus/storage/local.(*persistence).loadSeriesMapAndHeads(0xc4201091e0, 0xc4203c2580, 0x0, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/storage/local/persistence.go:828 +0x1ae
github.com/prometheus/prometheus/storage/local.(*MemorySeriesStorage).Start(0xc4205c8000, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/storage/local/storage.go:374 +0x1f5
main.Main(0x0)
        /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:181 +0x114f
main.main()
        /go/src/github.com/prometheus/prometheus/cmd/prometheus/main.go:43 +0x22
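
For context, the failure mode suggested by the trace can be illustrated with a minimal, hypothetical Go sketch (decodeString here is a simplification for illustration, not the actual storage/local/codable implementation): the string length is read as a signed varint, a corrupted checkpoint can decode it to a negative value, and make([]byte, n) then panics with exactly the "makeslice: len out of range" error above.

package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
)

// decodeString is a simplified stand-in for checkpoint string decoding:
// read a signed varint length, allocate a buffer of that size, then read
// the payload. On corrupt data the varint can decode to a negative value,
// and make([]byte, n) with n < 0 panics with "makeslice: len out of range".
func decodeString(r *bufio.Reader) (string, error) {
	length, err := binary.ReadVarint(r)
	if err != nil {
		return "", err
	}
	buf := make([]byte, length) // panics here when length is negative
	if _, err := r.Read(buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // runtime error: makeslice: len out of range
		}
	}()
	// Encode -1 as a signed varint to stand in for a corrupted length field.
	var lenBuf [binary.MaxVarintLen64]byte
	n := binary.PutVarint(lenBuf[:], -1)
	decodeString(bufio.NewReader(bytes.NewReader(lenBuf[:n])))
}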

ichekrygin changed the title from "Crash after Crash Recovery" to "Crash after Crash Recovery (len out of range)" on Mar 11, 2017

beorn7 commented Mar 13, 2017

This is almost certainly triggered by data corruption that crash recovery hasn't detected.

In general, the unmarshaling code should be resilient against that and quarantine the series instead of crashing. This seems to hit yet another corner case.

beorn7 self-assigned this Mar 13, 2017

beorn7 added the kind/bug label Mar 13, 2017

beorn7 added a commit that referenced this issue Apr 4, 2017

storage: Check for negative values from varint decoding
Sadly, we have a number of places where we use varint encoding for
numbers that cannot be negative. We could have saved a bit by using
uvarint encoding. On the bright side, we now have a 50% chance to
detect data corruption. :-/

Fixes #1800 and #2492.
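
For reference, here is a rough sketch of the kind of guard the commit message describes (assumed names, not the actual patch): decode the length with binary.ReadVarint and reject negative values before allocating, so a corrupted checkpoint surfaces as an error that lets the caller quarantine the series rather than panic.

package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readLength sketches the guard described in the commit: lengths are encoded
// as signed varints, so a negative decoded value can only mean corruption.
// Returning an error lets the caller quarantine the series instead of
// panicking in make(). (Unsigned varints would have avoided the sign bit
// entirely, but the existing on-disk format already uses signed varints.)
func readLength(r io.ByteReader) (int64, error) {
	n, err := binary.ReadVarint(r)
	if err != nil {
		return 0, err
	}
	if n < 0 {
		return 0, fmt.Errorf("found negative length %d, possible data corruption", n)
	}
	return n, nil
}

func main() {
	// Simulate a corrupted length field: -42 encoded as a signed varint.
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutVarint(buf[:], -42)
	_, err := readLength(bufio.NewReader(bytes.NewReader(buf[:n])))
	fmt.Println(err) // found negative length -42, possible data corruption
}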

beorn7 closed this Apr 4, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
