Prometheus crash when remote_storage_adapter comes up after failure (panic: runtime error: makeslice: len out of range) #2969

Closed
bkupidura opened this Issue Jul 19, 2017 · 2 comments

bkupidura commented Jul 19, 2017

What did you do?
We are using the remote_storage_adapter implementation (https://github.com/prometheus/prometheus/tree/master/documentation/examples/remote_storage/remote_storage_adapter) to push metrics to InfluxDB.

Sometimes when remote_storage_adapter comes back up after a failure (docker service scale remote_storage_adapter=0 && sleep 300 && docker service scale remote_storage_adapter=1), the Prometheus server crashes.

Logs
time="2017-07-18T15:14:20Z" level=warning msg="Error sending 100 samples to remote storage: Post http://remote_storage_adapter:9201/write: dial tcp: lookup remote_storage_adapter on 127.0.0.11:53: no such host" source="queue_manager.go:500"
time="2017-07-18T15:14:26Z" level=info msg="Remote storage resharding from 118 to 47 shards." source="queue_manager.go:351"
time="2017-07-18T15:14:36Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:14:46Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:14:56Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:15:06Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:15:16Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:15:26Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:15:36Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:15:46Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:15:46Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633"
time="2017-07-18T15:15:48Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1.235390018s." source="persistence.go:665"
time="2017-07-18T15:15:56Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:16:06Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:16:16Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:16:26Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:16:36Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:16:46Z" level=info msg="Remote storage resharding from 47 to 2 shards." source="queue_manager.go:351"
time="2017-07-18T15:16:56Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:17:06Z" level=info msg="Currently resharding, skipping." source="queue_manager.go:354"
time="2017-07-18T15:17:16Z" level=info msg="Remote storage resharding from 2 to -1 shards." source="queue_manager.go:351"
panic: runtime error: makeslice: len out of range

goroutine 193 [running]:
github.com/prometheus/prometheus/storage/remote.(*QueueManager).newShards(0xc42028a700, 0xffffffffffffffff, 0x1)
/go/src/github.com/prometheus/prometheus/storage/remote/queue_manager.go:396 +0x40
github.com/prometheus/prometheus/storage/remote.(*QueueManager).reshard(0xc42028a700, 0xffffffffffffffff)
/go/src/github.com/prometheus/prometheus/storage/remote/queue_manager.go:375 +0xcf
github.com/prometheus/prometheus/storage/remote.(*QueueManager).reshardLoop(0xc42028a700)
/go/src/github.com/prometheus/prometheus/storage/remote/queue_manager.go:364 +0x105
created by github.com/prometheus/prometheus/storage/remote.(*QueueManager).Start
/go/src/github.com/prometheus/prometheus/storage/remote/queue_manager.go:265 +0x85
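Reading the trace together with the log line "Remote storage resharding from 2 to -1 shards.", newShards is apparently called with -1 (0xffffffffffffffff interpreted as a signed int), and a Go make() with a negative length panics with exactly this message. A minimal standalone sketch (not the actual queue_manager.go code) that reproduces the same class of panic:

package main

// Stand-in for QueueManager.newShards: allocating the per-shard
// queues with make() panics at runtime when the requested shard
// count is negative, e.g. the -1 seen in the log above.
func newShards(numShards int) []chan int {
	// panic: runtime error: makeslice: len out of range when numShards < 0
	return make([]chan int, numShards)
}

func main() {
	newShards(-1)
}

A plausible guard, assuming the desired shard count is derived from observed send rates and can drop below zero, would be to clamp it to at least 1 (and at most the configured maximum) before calling newShards.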

Environment
System: Ubuntu 16.04
Prometheus is running on top of Docker Swarm.

Prometheus cmd line:
/opt/prometheus/prometheus -config.file /srv/prometheus/prometheus.yml -web.listen-address 0.0.0.0:9090 -storage.local.engine persisted -storage.local.retention 360h -storage.local.target-heap-size 3221225472 -storage.local.num-fingerprint-mutexes 4096 -storage.local.path /data/data/1/

Prometheus version:
prometheus, version 1.6.3 (branch: master, revision: c580b60)
build user: root@a6410e65f5c7
build date: 20170522-09:15:06
go version: go1.8.1

Prometheus config

global:
  evaluation_interval: 1m
  external_labels:
    region: region1
  scrape_interval: 15s
  scrape_timeout: 15s
alerting:
  alertmanagers:
    # docker_swarm_alertmanager
    - dns_sd_configs:
      - names: [tasks.alertmanager]
        type: A
        port: 9093
remote_write:
  # docker_remote_write
  - url: http://remote_storage_adapter:9201/write

rule_files:
- alerts.yml

scrape_configs:
  - job_name: telegraf

    static_configs:
    - targets: ['172.16.10.100:9126','172.16.10.106:9126','172.16.10.107:9126','172.16.10.109:9126','172.16.10.102:9126','172.16.10.108:9126','172.16.10.101:9126','172.16.10.105:9126','172.16.10.103:9126','172.16.10.110:9126','172.16.10.121:9126']
  - job_name: pushgateway
    dns_sd_configs:
    - names:
      - tasks.pushgateway
      type: A
      port: 9091
  - job_name: prometheus
    dns_sd_configs:
    - names:
      - tasks.server
      type: A
      port: 9090
  - job_name: alertmanager
    dns_sd_configs:
    - names:
      - tasks.alertmanager
      type: A
      port: 9093

tomwilkie commented Jul 19, 2017

Thanks for reporting! I'll take a look.

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
