Prometheus stuck in repairing block #4003

Closed · auhlig opened this issue Mar 23, 2018 · 8 comments

auhlig commented Mar 23, 2018

What did you do?

Upgraded Prometheus from v2.1.0 to v2.2.1.

What did you expect to see?

Prometheus up and running after repairing the broken data on first start.

What did you see instead? Under which circumstances?

The logs (see below) show it repairing some blocks, but then it stops. It stayed like that for 30+ minutes, and the server was still neither accessible nor responsive.

Environment

  • System information:

    Linux 4.14.19-coreos x86_64

  • Prometheus version:

    prometheus, version 2.2.1 (branch: HEAD, revision: 
    bc6058c81272a8d938c05e75607371284236aadc)
    build user:       root@149e5b3f0829
    build date:       20180314-14:15:45
    go version:       go1.10
    
  • Prometheus configuration file:

Can be found here

  • Logs:
level=info ts=2018-03-23T09:31:00.286511629Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
level=info ts=2018-03-23T09:31:00.286647191Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
level=info ts=2018-03-23T09:31:00.286680366Z caller=main.go:222 host_details="(Linux 4.14.19-coreos #1 SMP Wed Feb 14 03:18:05 UTC 2018 x86_64 prometheus-frontend-4122203228-vrp5v (none))"
level=info ts=2018-03-23T09:31:00.286712074Z caller=main.go:223 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-03-23T09:31:00.290376952Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-03-23T09:31:00.290353555Z caller=main.go:504 msg="Starting TSDB ..."
level=info ts=2018-03-23T09:31:00.658993029Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C6ZE674K70J52DS6TBP97J2W
level=info ts=2018-03-23T09:31:00.975599882Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C702SC857XDXJ4P51ZHR9YRX
level=info ts=2018-03-23T09:31:01.254893204Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C70QCJR6BBDR6TH055CPSZ61
level=info ts=2018-03-23T09:31:01.530167271Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C71BZQXP329TDPYQCEKN525D
level=info ts=2018-03-23T09:31:01.889519884Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C720JXCQBWR1P4EC98QTZ6YF
level=info ts=2018-03-23T09:31:02.18960343Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C72N63BE1DKP60ZG74CB4FCQ
level=info ts=2018-03-23T09:31:02.45459546Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C739S94492EMQNFAYBJJF0G9
level=info ts=2018-03-23T09:31:02.727286709Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C73YCEVH96WAHC4NJ62YK9NA
level=info ts=2018-03-23T09:31:03.024455479Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C74JZMVHY0TS5WW66KEN9DNB
level=info ts=2018-03-23T09:31:03.305846851Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C757JTHNX49KWNK2EXSQMDXQ
level=info ts=2018-03-23T09:31:03.657477409Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C75W608SNDZK0SF0A2R7M4KQ
level=info ts=2018-03-23T09:31:03.966513557Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C76GS620V7KQ1PCP3QZWZAXM
level=info ts=2018-03-23T09:31:04.272400051Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C775CBDE1RSS9WWQXQVBW5WR
level=info ts=2018-03-23T09:31:04.61206357Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C77SZHCQ4PAM7X0AFP2DXSR7
level=info ts=2018-03-23T09:31:04.887706405Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C78EJQGQB63QRKCMJ3FYTKG2
level=info ts=2018-03-23T09:31:05.168543025Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7935Y0R5F0XJMAH74JP3Y2Y
level=info ts=2018-03-23T09:31:05.532052906Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C79QS3BZN82R966EXW408F8F
level=info ts=2018-03-23T09:31:05.844648474Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7ACC8Z0S4EQTBSGHRVZY3A7
level=info ts=2018-03-23T09:31:06.192159627Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7B0ZEGPHETPQZPFX3PK1S0M
level=info ts=2018-03-23T09:31:06.595666784Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7BNJN309K6WBM7EZJG35Z9X
level=info ts=2018-03-23T09:31:06.987977604Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7CA5TVCXY7NRGTE7TSE0J5W
level=info ts=2018-03-23T09:31:07.315650969Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7CYS0AZNA0WJ4N1MB7F3D92
level=info ts=2018-03-23T09:31:07.670215516Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7DKC5M0WYHG1SKP2ZJ2HTR2
level=info ts=2018-03-23T09:31:08.021387709Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7E7ZC3244X92KE5MHNSY3MY
level=info ts=2018-03-23T09:31:08.375164304Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7EWJHTZJ9CR1Q8G6WCFAFT8
level=info ts=2018-03-23T09:31:08.846532609Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7FH5QWC5CBP5EP62F33W2RS
level=info ts=2018-03-23T09:31:09.170340232Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7G5RV57RD30PK4H75D2KJSE
level=info ts=2018-03-23T09:31:09.523399902Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7GTCKK95AP6BQTMK5YHXSG9
level=info ts=2018-03-23T09:31:10.069858329Z caller=repair.go:41 component=tsdb msg="fixing broken block" ulid=01C7H17G2KR7Y2BCSPQXTFQHBX
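
To check whether the repair was still making progress, one can compare the last ULID logged against the block directories on disk. A minimal sketch, assuming the official image's default data directory /prometheus (adjust to your --storage.tsdb.path):

    # Block directories are named by ULID; compare against the last
    # "fixing broken block" line in the log above.
    ls -l /prometheus
    # Inspect the block the repair appeared to stall on (last ULID logged):
    du -sh /prometheus/01C7H17G2KR7Y2BCSPQXTFQHBX
    cat /prometheus/01C7H17G2KR7Y2BCSPQXTFQHBX/meta.json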
dannyk81 commented Mar 29, 2018

Just wondering if you ever got past this? I have a few servers running v2.1.0 that I was planning to upgrade to v2.2.1, but I'm having second thoughts after reading this...

auhlig commented Mar 29, 2018

I upgraded 3 instances and only one got stuck; I had to wipe its data to get it up again. The other 2 were repaired and up and running quite quickly. I haven't observed storage issues since then.
I can tell more after upgrading our remaining 17 Prometheus instances next week.
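
For reference, "wiping the data" came down to stopping the instance and removing the TSDB directory. A sketch only, assuming a systemd-managed server and a hypothetical storage path; adjust to your --storage.tsdb.path:

    # WARNING: this deletes ALL metric history on the instance.
    systemctl stop prometheus              # or stop the container/pod
    rm -rf /var/lib/prometheus/data        # hypothetical path; use your --storage.tsdb.path
    systemctl start prometheus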

dannyk81 commented Mar 31, 2018

@auhlig thanks for the update. We still haven't upgraded our instances; I figured I'll take a snapshot beforehand - just in case 😬
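
For anyone else doing this, a sketch of two ways to snapshot before upgrading. The admin API variant assumes the server was started with --web.enable-admin-api (the snapshot endpoint exists since v2.1); the paths in the copy variant are hypothetical:

    # Option 1: TSDB snapshot via the admin API (server keeps running):
    curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
    # On success it returns something like {"status":"success","data":{"name":"..."}}
    # and the snapshot is written under <storage.tsdb.path>/snapshots/<name>.

    # Option 2: stop Prometheus and copy the data directory wholesale:
    systemctl stop prometheus
    cp -a /var/lib/prometheus/data /var/lib/prometheus/data.bak   # hypothetical paths
    systemctl start prometheus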

dannyk81 commented May 15, 2018

Upgraded our Prometheus fleet (12 servers) from v2.1.0 to v2.2.1 and didn't encounter this issue 😅

simonpasquier commented Aug 7, 2018

I'm closing this issue as it seems to have been a flaky error. Feel free to reopen if this is still a problem for you.

auhlig commented Aug 7, 2018

Haven't seen it since we've been running v2.2.3.

simonpasquier commented Aug 7, 2018

thanks for the heads-up @auhlig

Hashfyre commented Feb 4, 2019

We are seeing this again: #4324 (comment)
