
Prometheus 2.1 abnormally shutdown with SIGBUS #3781

Closed
roengram opened this Issue Feb 1, 2018 · 12 comments

roengram commented Feb 1, 2018

What did you do?
Federation test on Prometheus 2.1

What did you expect to see?
Prometheus running stably

What did you see instead? Under which circumstances?
Prometheus shut down with SIGBUS

Environment
Prometheus 2.1 was running on a Joyent container (64GB RAM, 800GB SSD)

  • System information:

    Linux 3.13.0 x86_64

  • Prometheus version:

    prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d8)
    build user: root@6e784304d3ff
    build date: 20180119-12:01:23
    go version: go1.9.2

  • Alertmanager version:

    N/A

  • Prometheus configuration file:

global:
  scrape_timeout: 60s
  external_labels:
    prom: prom-2.1

scrape_configs:

- job_name: job_1
  honor_labels: true
  metrics_path: /metrics_1000_15
  scrape_interval: 40s
  file_sd_configs:
    - files:
      - /griffin/prom/conf.d/targets_1000.yml

  • Alertmanager configuration file:

N/A

  • Logs:
unexpected fault address 0xe28970b000
fatal error: fault
unexpected fault address 0xe28949d000
fatal error: unexpected signal during runtime execution
unexpected fault address 0xe28974c020
unexpected fault address 0xe289746040
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
unexpected fault address 0xe289754020
fatal error: unexpected signal during runtime execution
unexpected fault address 0xe289721000
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
unexpected fault address 0xe289756008
fatal error: unexpected signal during runtime execution
unexpected fault address 0xe289750020
unexpected fault address 0xe28974e000
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
unexpected fault address 0xe28975a000
unexpected fault address 0xe289752020
unexpected fault address 0xe28974a008
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
unexpected fault address 0xe289333000
unexpected fault address 0xe289611000
unexpected fault address 0xe28975e008
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
unexpected fault address 0xe28975c008
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
unexpected fault address 0xe289758000
fatal error: unexpected signal during runtime execution
fatal error: unexpected signal during runtime execution
[signal SIGBUS: bus error code=0x3 addr=0xe28970b000 pc=0x1569fe0]

goroutine 614700 [running]:
runtime.throw(0x1bf0612, 0x5)
  /usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc49bdef1f0 sp=0xc49bdef1d0 pc=0x42bca5
runtime.sigpanic()
  /usr/local/go/src/runtime/signal_unix.go:364 +0x29d fp=0xc49bdef240 sp=0xc49bdef1f0 pc=0x44285d
github.com/prometheus/prometheus/storage.(*sampleRing).add(0xe28964d240, 0x16150be3889, 0x47c02767f3654bd7)
  /go/src/github.com/prometheus/prometheus/storage/buffer.go:188 +0x70 fp=0xc49bdef2a8 sp=0xc49bdef240 pc=0x1569fe0
github.com/prometheus/prometheus/storage.(*BufferedSeriesIterator).Next(0xe28953db00, 0x16150be3889)
  /go/src/github.com/prometheus/prometheus/storage/buffer.go:89 +0x51 fp=0xc49bdef2d0 sp=0xc49bdef2a8 pc=0x1569d41
github.com/prometheus/prometheus/storage.(*BufferedSeriesIterator).Seek(0xe28953db00, 0x16150c2539c, 0x493e0)
  /go/src/github.com/prometheus/prometheus/storage/buffer.go:73 +0x53 fp=0xc49bdef2f8 sp=0xc49bdef2d0 pc=0x1569c23
github.com/prometheus/prometheus/web.(*Handler).federation(0xc420824100, 0x28ff540, 0xc51b6d9e80, 0xc4aec98900)

Full log: https://drive.google.com/open?id=1NMifcYiNZXz8aQlkLRs6K3TOr3lv7APw

Detail

Prometheus was scraping 1000 metrics from 1000 targets at a 40-second interval. When I invoked the federation endpoint with match[]={label1=\"v1\"} continuously, scrapes started to fail, and 1-2 minutes later Prometheus died with the above log.
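
For illustration, the kind of continuous federation load described above could be generated with a small Go program along the lines of the sketch below. The server address and the lack of pacing between requests are assumptions for illustration, not the actual test setup.

// Hypothetical sketch of hammering the federation endpoint with a match[]
// selector, as described above. The address is an assumption.
package main

import (
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"net/url"
	"time"
)

func main() {
	q := url.Values{}
	q.Set("match[]", `{label1="v1"}`) // same selector as in the report

	target := "http://localhost:9090/federate?" + q.Encode() // assumed address

	for {
		resp, err := http.Get(target)
		if err != nil {
			log.Println("federate request failed:", err)
			time.Sleep(time.Second)
			continue
		}
		// Drain and close the body so the underlying connection can be reused.
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()
	}
}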

brian-brazil (Member) commented Feb 1, 2018

Were you running anything that could alter the files in the data directory?

roengram (Author) commented Feb 1, 2018

Nope. Another range query, sum_over_time(metric1[2d])/avg_over_time(metric1[2d]), was running side by side, but nothing except Prometheus was accessing the Prometheus data files.
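
For reference, that range query would normally be issued through the range-query HTTP API; a minimal Go sketch is below. The server address, time window, and step are assumptions, not values taken from this setup.

// Hypothetical sketch of issuing the side-by-side range query via the
// Prometheus HTTP API; address, window, and step are assumptions.
package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

func main() {
	end := time.Now()
	start := end.Add(-2 * 24 * time.Hour) // matches the 2d range in the expression

	q := url.Values{}
	q.Set("query", `sum_over_time(metric1[2d])/avg_over_time(metric1[2d])`)
	q.Set("start", strconv.FormatInt(start.Unix(), 10))
	q.Set("end", strconv.FormatInt(end.Unix(), 10))
	q.Set("step", "60s")

	resp, err := http.Get("http://localhost:9090/api/v1/query_range?" + q.Encode()) // assumed address
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Status, len(body), "bytes")
}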

brian-brazil (Member) commented Feb 1, 2018

Okay, so it's not mmap reading out of bounds. This is likely a hardware fault then.
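
For background on the mmap theory being ruled out here: on Linux, reading a mapped page that is no longer backed by the file (for example because the file was truncated after being mapped) raises SIGBUS rather than SIGSEGV. A minimal Go sketch of that mechanism, unrelated to Prometheus's own storage code:

// Demonstrates (assuming Linux) how reading an mmap'd page that has lost its
// file backing produces SIGBUS, the failure mode being ruled out above.
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"syscall"
)

func main() {
	f, err := ioutil.TempFile("", "mmap-sigbus")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	const page = 4096
	if err := f.Truncate(page); err != nil { // back one page with file data
		panic(err)
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, page, syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}

	fmt.Println("before truncate:", data[0]) // fine: the page is file-backed

	if err := f.Truncate(0); err != nil { // shrink the file under the mapping
		panic(err)
	}

	fmt.Println("after truncate:", data[0]) // faults with SIGBUS
}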

dfredell commented Feb 7, 2018

I have been getting the same kind of errors (I think). I'm running on a similar platform.

Environment

Prometheus 2.1 was running in a Joyent private-cloud Docker container on SmartOS. Prometheus runs via ContainerPilot, with consul-template updating the config and reloading Prometheus via SIGHUP.

Docker File

FROM alpine:3.7
RUN curl -LO https://github.com/prometheus/prometheus/releases/download/v2.1.0/2.1.0.linux-amd64.tar.gz
...
CMD ["/usr/local/bin/containerpilot"]

System information:

bash-4.4# uname -a
Linux 0a166278b71d 3.13.0 BrandZ virtual linux x86_64 Linux

Prometheus version:

prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d8)
build user: root@6e784304d3ff
build date: 20180119-12:01:23
go version: go1.9.2

Log

fatal error: unexpected signal during runtime execution
[signal SIGBUS: bus error code=0x3 addr=0xc42019f000 pc=0x459989]

2018-02-07T05:01:44.445533349Z unexpected fault address 0xc45cb27000
2018-02-07T05:01:44.44558765Z fatal error: fault
2018-02-07T05:01:44.456577767Z [signal SIGBUS: bus error code=0x3 addr=0xc45cb27000 pc=0x45d788]

If you want the full log I can get that too.

I was not poking Prometheus or Docker at the time of the failures. I have 6 separate instances of Prometheus running in an identical way, and only one is dying frequently like this. Last night, when no human was working, Prometheus kept getting SIGBUS as it tried to start. This went on for about an hour (roughly 20 attempts) before it was finally able to start. Prometheus had been running for 30 hours; I had SIGBUS errors two days ago and tried giving Docker more memory, which evidently didn't help.

brian-brazil (Member) commented Feb 7, 2018

Have you tried running that Prometheus on a different machine?

This smells like a Joyent issue, given that both of you are using that platform.

dfredell commented Feb 8, 2018

I haven't tried a different machine, and I don't think I really can. My scrape targets are in a private Triton network.

dfredell commented Feb 12, 2018

I tried downgrading to Prometheus 2.0.0 and that didn't totally help. One stack is happy, but a different one is now getting SIGBUS errors every few hours.

I also noticed that Prometheus was taking 1072% of the memory:

Mem: 18630K used, 1029946K free, 0K shrd, 0K buff, 0K cached
CPU:   1% usr   1% sys   0% nic  96% idle   0% io   0% irq   0% sirq
Load average: 0.14 0.32 0.42 56/187 0
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
80109 73736 root     R     4216   0%   0   0% {busybox} top
61257 18306 root     S   10974m 1072%  11   0% /bin/prometheus --config.file=/etc/prometheus/prometheus.yml --web.console.libraries=/etc/prometheus/console_libraries --web.console.templates=/etc/prometheus/consoles
58163 18306 root     S    46240   4%   0   0% /usr/local/bin/consul agent --data-dir /consul/data --config-dir /consul/config -rejoin -retry-join=consul.svc....
18306     1 root     S    18740   2%  48   0% /usr/local/bin/containerpilot
61071 18306 root     S    15364   1%  10   0% /usr/local/bin/consul-template -config /etc/app.hcl -consul-addr localhost:8500
    1     0 root     S    11744   1%  34   0% /usr/local/bin/containerpilot
73736     1 root     S     4368   0%  47   0% {busybox} sh

The Go issue golang/go#21586 talks about growslice and SIGBUS, so this might be fixed in Go 1.10.

I made a Prometheus build on go1.10rc2 to see if that fixes my issue.

dfredell commented Feb 13, 2018

I did try starting up the Docker container on a Prometheus build with go1.10rc2, and it seems to have the same issue. It started, and Prometheus would just sit there. The web page wasn't loading, and the process was taking up more and more memory, as if Prometheus was loading all the old metric data into memory. I had 6G in the /data directory. When I deleted the directory, Prometheus booted right up, no hesitation.

unexpected fault address 0xc420152008
fatal error: fault
[signal SIGBUS: bus error code=0x3 addr=0xc420152008 pc=0x466333]

goroutine 1382 [running]:
runtime.throw(0x8a4204, 0x5)
	/usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc4202c7bf0 sp=0xc4202c7bd0 pc=0x42ac85
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:364 +0x29d fp=0xc4202c7c40 sp=0xc4202c7bf0 pc=0x440ddd
syscall.Environ(0x0, 0x0, 0x0)
	/usr/local/go/src/syscall/env_unix.go:145 +0x133 fp=0xc4202c7cd8 sp=0xc4202c7c40 pc=0x466333
os.Environ(0xc42000c080, 0xc4201880a0, 0x0)
	/usr/local/go/src/os/env.go:117 +0x22 fp=0xc4202c7d00 sp=0xc4202c7cd8 pc=0x48a6d2
os/exec.(*Cmd).envv(0xc4204f2000, 0xc4201880b8, 0x0, 0x1)
	/usr/local/go/src/os/exec/exec.go:182 +0x51 fp=0xc4202c7d28 sp=0xc4202c7d00 pc=0x78a271
os/exec.(*Cmd).Start(0xc4204f2000, 0x8bfa80, 0x8a5736)
	/usr/local/go/src/os/exec/exec.go:366 +0x42c fp=0xc4202c7e90 sp=0xc4202c7d28 pc=0x78b74c
github.com/joyent/containerpilot/commands.(*Command).Run.func2(0xc4201ac070, 0xc4200e1540, 0xc4200bf540)
	/go/src/github.com/joyent/containerpilot/commands/commands.go:127 +0x10d fp=0xc4202c7fc8 sp=0xc4202c7e90 pc=0x7946cd
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2337 +0x1 fp=0xc4202c7fd0 sp=0xc4202c7fc8 pc=0x459241
created by github.com/joyent/containerpilot/commands.(*Command).Run
	/go/src/github.com/joyent/containerpilot/commands/commands.go:124 +0x24d

simonpasquier (Member) commented Aug 7, 2018

Do you still see the issue if you use the latest Prometheus version?

dfredell commented Aug 7, 2018

@simonpasquier
I'm no longer having this issue. I don't remember exactly what fixed it; it was either https://smartos.org/bugview/OS-6467 or joyent/containerpilot#536, I think.

simonpasquier (Member) commented Aug 8, 2018

@dfredell thanks!

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
