
Prometheus does not start: get segment range: segments are not sequential #5342

Closed
aclowkey opened this issue Mar 12, 2019 · 15 comments

@aclowkey

Not sure what the cause of this is.
Are there any workarounds?

Logs:
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.336550919Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.3, branch=HEAD, revision=167a4b4e73a8eca8df648d2d2043e21bdb9a7449)"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.338787238Z caller=main.go:239 build_context="(go=go1.11.1, user=root@1e42b46043e9, date=20181004-08:42:02)"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.338980452Z caller=main.go:240 host_details="(Linux 4.15.0-1027-gcp #28~16.04.1-Ubuntu SMP Fri Jan 18 10:10:51 UTC 2019 x86_64 butler-qn0s (none))"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.339011177Z caller=main.go:241 fd_limits="(soft=1024, hard=4096)"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.339033426Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.339994665Z caller=main.go:554 msg="Starting TSDB ..."
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.340044489Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.341969564Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1550858400000 maxt=1550923200000 ulid=01D4DDXJ92WCFJP8RJFYX4ET05
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.342464357Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1550923200000 maxt=1550988000000 ulid=01D4FBQ5N9HSJFFGAEZKCNWXPM
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.342846498Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1550988000000 maxt=1551052800000 ulid=01D4H9GPQTFHS03XYHAP74Q558
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.343210565Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551052800000 maxt=1551117600000 ulid=01D4K7A7N602VRBKTMBD22ZCT8
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.343620378Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551117600000 maxt=1551182400000 ulid=01D4N53RDG1KE02GDFBNZSE3XQ
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.343750482Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551182400000 maxt=1551247200000 ulid=01D4Q2X85K3T0ZEPWNNE8HHR9F
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.343852062Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551247200000 maxt=1551312000000 ulid=01D4S0PSVN7B8WAZ096FZ3VGGW
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.343953336Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551312000000 maxt=1551376800000 ulid=01D4TYGBJW8C7KYCCCVEYSYBEV
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344051367Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551376800000 maxt=1551441600000 ulid=01D4WW9WM2B627AHPN17Y9QCWX
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344146095Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551441600000 maxt=1551506400000 ulid=01D4YT3D930DEV1K5X19XPYS9N
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344251583Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551506400000 maxt=1551571200000 ulid=01D50QWYJF130YMQARP9DFK0PS
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344351732Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551571200000 maxt=1551636000000 ulid=01D52NPH0TETJD8FZQCDQ82CPV
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344444554Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551636000000 maxt=1551700800000 ulid=01D54KG15BHKCTEPCBH1Z2XDMG
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344547559Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551700800000 maxt=1551765600000 ulid=01D56H9JA1G0N1JSDSTKAFVMW9
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344639528Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551765600000 maxt=1551830400000 ulid=01D58F34Q5DT2RHG9RDYCSB0ZA
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344745998Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551830400000 maxt=1551895200000 ulid=01D5ACWPAZEEMJ63PRHX4QV43R
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344839812Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551895200000 maxt=1551960000000 ulid=01D5CAP77S8DZKXF3QJBZFVSS2
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.344934759Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551960000000 maxt=1552024800000 ulid=01D5E8FRH6V16F8YSBYPR56T8M
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.345024125Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1552024800000 maxt=1552089600000 ulid=01D5G69AF3S19647A4RJKXBQEF
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.345092905Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1552154400000 maxt=1552161600000 ulid=01D5J42RCDTWP90V88874ZMY95
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.345193102Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1552089600000 maxt=1552154400000 ulid=01D5J42TY24ETG0KWZP5QZYJF3
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347444501Z caller=main.go:423 msg="Stopping scrape discovery manager..."
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347480391Z caller=main.go:437 msg="Stopping notify discovery manager..."
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347493468Z caller=main.go:459 msg="Stopping scrape manager..."
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347508094Z caller=main.go:433 msg="Notify discovery manager stopped"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347541286Z caller=main.go:419 msg="Scrape discovery manager stopped"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347607052Z caller=main.go:453 msg="Scrape manager stopped"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347632362Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347646724Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347664062Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=info ts=2019-03-12T11:51:56.347687593Z caller=main.go:608 msg="Notifier manager stopped"
Mar 12 11:51:56 butler-qn0s prometheus[26322]: level=error ts=2019-03-12T11:51:56.347814188Z caller=main.go:617 err="opening storage failed: get segment range: segments are not sequential"
Mar 12 11:51:56 butler-qn0s systemd[1]: prometheus.service: Main process exited, code=exited, status=1/FAILURE
@brian-brazil
Contributor

Can you share a listing of the WAL directory?

@aclowkey
Author

01D4DDXJ92WCFJP8RJFYX4ET05
01D4FBQ5N9HSJFFGAEZKCNWXPM
01D4H9GPQTFHS03XYHAP74Q558
01D4K7A7N602VRBKTMBD22ZCT8
01D4N53RDG1KE02GDFBNZSE3XQ
01D4Q2X85K3T0ZEPWNNE8HHR9F
01D4S0PSVN7B8WAZ096FZ3VGGW
01D4TYGBJW8C7KYCCCVEYSYBEV
01D4WW9WM2B627AHPN17Y9QCWX
01D4YT3D930DEV1K5X19XPYS9N
01D50QWYJF130YMQARP9DFK0PS
01D52NPH0TETJD8FZQCDQ82CPV
01D54KG15BHKCTEPCBH1Z2XDMG
01D56H9JA1G0N1JSDSTKAFVMW9
01D58F34Q5DT2RHG9RDYCSB0ZA
01D5ACWPAZEEMJ63PRHX4QV43R
01D5CAP77S8DZKXF3QJBZFVSS2
01D5E8FRH6V16F8YSBYPR56T8M
01D5G69AF3S19647A4RJKXBQEF
01D5J42RCDTWP90V88874ZMY95
01D5J42TY24ETG0KWZP5QZYJF3

@brian-brazil
Contributor

There should be a wal directory too; what's in there?

@aclowkey
Author

aclowkey commented Mar 12, 2019

These files are in the wal/

@brian-brazil
Contributor

Sounds like you configured an incorrect storage path at some point, then; it should have been one directory up.
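
For reference, a minimal sketch of how the flag is normally set; the paths below are placeholders, not the reporter's actual layout. --storage.tsdb.path should point at the data directory that contains the block directories, the lock file, and the wal/ subdirectory, not at wal/ itself:

# Hypothetical layout: /var/lib/prometheus/data holds the 01D4... block
# directories, "lock", and "wal/".
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data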

@cstyan
Member

cstyan commented Mar 12, 2019

@aclowkey your data dir should look something like this:

01D4V68DXCV1WRTCP17CZTHX14
01D4V6A3YJCP3ME0DWH3C0VC8E.tmp  
01D4VCRZJDVN178KHWPY2Q97F1  
01D5NF51KERW8THK8QBM1CJ2EH  
01D5NF52XQPN8B8JW2CTESFNCD  
lock  
wal

And that inner wal dir:

00006852
00006853  
00006854  
checkpoint.006851

@aclowkey
Author

aclowkey commented Mar 13, 2019

It happened in our production environment as well. Here are the dirs:

01D4K7A7N602VRBKTMBD22ZCT8
01D4N53RDG1KE02GDFBNZSE3XQ
01D4Q2X85K3T0ZEPWNNE8HHR9F
01D4S0PSVN7B8WAZ096FZ3VGGW
01D4TYGBJW8C7KYCCCVEYSYBEV
01D4WW9WM2B627AHPN17Y9QCWX
01D4YT3D930DEV1K5X19XPYS9N
01D50QWYJF130YMQARP9DFK0PS
01D52NPH0TETJD8FZQCDQ82CPV
01D54KG15BHKCTEPCBH1Z2XDMG
01D56H9JA1G0N1JSDSTKAFVMW9
01D58F34Q5DT2RHG9RDYCSB0ZA
01D5ACWPAZEEMJ63PRHX4QV43R
01D5CAP77S8DZKXF3QJBZFVSS2
01D5E8FRH6V16F8YSBYPR56T8M
01D5G69AF3S19647A4RJKXBQEF
01D5J42TY24ETG0KWZP5QZYJF3
01D5M1WCVBATNVAXTGMYQSKQT6
01D5NZNXHVJ83XC9Z9EPYR7N6M
01D5QXFEP2Y5W142A1K5V8AZVR
01D5RJ2KFZDK2WD02H4CT1805P
01D5S6NQS7X645CEGJ68YX0BTQ
01D5S6NSAEAKK28ASYJ9DCBNSN
lock
wal

and inside wal

00000001
00000002
00000003
00000004
00000005
00000006
00000007
00000008
00000009
00000010
00000011
00000012
00000013
00000014
00000015
00000016
00000017
00000018
00000220
00000221
00000222
checkpoint.000219

The server was shut down during host maintenance, and the data is stored on NFS.
Any idea how to repair it?

@cstyan
Member

cstyan commented Mar 13, 2019

@aclowkey Can you post the output of ls -al for the wal directory? You shouldn't have segments 1-18 and then more starting at 220. Not sure what would cause this.

Any ideas, @gouthamve @krasi-georgiev?

@aclowkey
Author

drwxr-xr-x  3 me me      4096 Mar 12 12:29 .
drwxr-xr-x 26 me me      4096 Mar 12 15:00 ..
-rw-r--r--  1 me me 134217728 Mar  6 20:07 00000001
-rw-r--r--  1 me me 134217728 Mar  7 00:39 00000002
-rw-r--r--  1 me me 134217728 Mar  7 05:10 00000003
-rw-r--r--  1 me me 134217728 Mar  7 09:41 00000004
-rw-r--r--  1 me me 134217728 Mar  7 14:16 00000005
-rw-r--r--  1 me me 134217728 Mar  7 18:57 00000006
-rw-r--r--  1 me me 134217728 Mar  7 23:30 00000007
-rw-r--r--  1 me me 134217728 Mar  8 04:02 00000008
-rw-r--r--  1 me me 134217728 Mar  8 08:33 00000009
-rw-r--r--  1 me me 134217728 Mar  8 13:07 00000010
-rw-r--r--  1 me me 134217728 Mar  8 17:41 00000011
-rw-r--r--  1 me me 134217728 Mar  8 22:15 00000012
-rw-r--r--  1 me me 134217728 Mar  9 02:48 00000013
-rw-r--r--  1 me me 134217728 Mar  9 07:21 00000014
-rw-r--r--  1 me me 134217728 Mar  9 11:54 00000015
-rw-r--r--  1 me me 134217728 Mar  9 16:28 00000016
-rw-r--r--  1 me me 134217728 Mar  9 21:01 00000017
-rw-r--r--  1 me me 134217728 Mar 10 01:33 00000018
-rw-r--r--  1 me me 134217728 Mar 10 06:05 00000019
-rw-r--r--  1 me me 134217728 Mar 10 10:36 00000020
-rw-r--r--  1 me me 134217728 Mar 10 15:08 00000021
-rw-r--r--  1 me me 134217728 Mar 10 19:39 00000022
-rw-r--r--  1 me me 134217728 Mar 11 00:11 00000023
-rw-r--r--  1 me me 134217728 Mar 11 04:42 00000024
-rw-r--r--  1 me me 134217728 Mar 11 09:14 00000025
-rw-r--r--  1 me me 134217728 Mar 11 13:47 00000026
-rw-r--r--  1 me me 134217728 Mar 11 18:19 00000027
-rw-r--r--  1 me me 134217728 Mar 11 22:51 00000028
-rw-r--r--  1 me me 134217728 Mar 12 03:23 00000029
-rw-r--r--  1 me me 134217728 Mar 12 07:55 00000030
-rw-r--r--  1 me me 134217728 Mar 12 12:29 00000031
-rw-r--r--  1 me me 119930880 Mar 12 16:33 00000032
-rw-r--r--  1 me me 134217728 Mar  6 02:07 00000220
-rw-r--r--  1 me me 134217728 Mar  6 06:42 00000221
-rw-r--r--  1 me me 121540034 Mar  6 10:52 00000222
drwxr-xr-x  2 me me      4096 Mar  6 07:00 checkpoint.000219

@krasi-georgiev
Contributor

Did I understand right that you are using NFS to store the data?
If that is the case, this is most likely the cause. We have had many reports of data corruption when using NFS. If you can replicate it on non-NFS storage, please let me know the steps to reproduce and I will look into it.

@aclowkey
Author

aclowkey commented Mar 14, 2019 via email

@krasi-georgiev
Contributor

krasi-georgiev commented Mar 14, 2019

You will lose some data (probably 15-30 minutes), but if you delete the non-sequential WAL files it will start as normal. In this case that means the following files (see the sketch after the list):

00000220
00000221
00000222
checkpoint.000219
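
A minimal sketch of that cleanup, assuming a systemd-managed Prometheus and a placeholder data path; moving the files aside is safer than deleting them outright:

# Stop Prometheus first so nothing is writing to the WAL.
sudo systemctl stop prometheus

# Placeholder path: use whatever --storage.tsdb.path actually points to.
cd /var/lib/prometheus/data/wal
mkdir -p /tmp/wal-backup

# Move the non-sequential segments and the stale checkpoint aside.
mv 00000220 00000221 00000222 checkpoint.000219 /tmp/wal-backup/

sudo systemctl start prometheus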

@krasi-georgiev
Contributor

Closing now, as NFS is most likely the cause, but if you can replicate it on non-NFS storage, feel free to reopen with steps to reproduce.

Thanks.

@chinglinwen

I've encountered the same issue.

The problem is that I can't make it start (short of deleting the data, I haven't found any way to make it recover), because of:

Apr 23 07:00:46 kubernet-node.yunwei-90-208 kernel: VFS: file-max limit 655350 reached
Apr 23 07:00:46 kubernet-node.yunwei-90-208 kernel: VFS: file-max limit 655350 reached
Apr 23 07:03:40 kubernet-node.yunwei-90-208 prometheus[111397]: level=warn ts=2019-04-22T23:03:40.270599862Z caller=scrape.go:835 component="scrape manager" scrape_pool=node-exporter target=http://172.31.81.10:9100/metrics msg="append failed" err="write to WAL: log samples: create new segment file: open data/wal/00007749: too many open files"
Apr 23 07:03:40 kubernet-node.yunwei-90-208 kubelet[3402]: E0423 07:03:40.328642    3402 kubelet_volumes.go:128] Orphaned pod "32ed7e89-39a5-11e9-96fc-36174af6a80e" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Apr 23 07:03:40 kubernet-node.yunwei-90-208 kernel: VFS: file-max limit 655350 reached
Apr 23 07:03:41 kubernet-node.yunwei-90-208 prometheus[111397]: level=warn ts=2019-04-22T23:03:41.324831113Z caller=scrape.go:835 component="scrape manager" scrape_pool=node-exporter target=http://172.31.82.44:9100/metrics msg="append failed" err="write to WAL: log samples: create new segment file: open data/wal/00007749: too many open files"
Apr 23 07:03:41 kubernet-node.yunwei-90-208 prometheus[111397]: level=warn ts=2019-04-22T23:03:41.325068238Z caller=scrape.go:850 component="scrape manager" scrape_pool=node-exporter target=http://172.31.82.44:9100/metrics msg="appending scrape report failed" err="write to WAL: log samples: create new segment file: open data/wal/00007749: too many open files"
Apr 23 07:03:41 kubernet-node.yunwei-90-208 prometheus[111397]: level=warn ts=2019-04-22T23:03:41.397045674Z caller=scrape.go:835 component="scrape manager" scrape_pool=node-exporter target=http://172.31.83.21:9100/metrics msg="append failed" err="write to WAL: log samples: create new segment file: open data/wal/00007749: too many open files"

@krasi-georgiev
Contributor

krasi-georgiev commented Apr 23, 2019

Your issue is different and the error says it all :)

https://stackoverflow.com/questions/4986388/vfs-file-max-limit-1231582-reached

Please open a new issue if you still think it is a bug in Prometheus.
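
For completeness, the two limits involved in that error can usually be raised as follows; this is a sketch assuming a systemd-managed Prometheus on the node, not advice specific to this deployment:

# Per-process open file limit for the Prometheus unit ("too many open files").
sudo systemctl edit prometheus
# add to the override:
#   [Service]
#   LimitNOFILE=65536

# System-wide limit behind "VFS: file-max limit 655350 reached".
echo 'fs.file-max = 1000000' | sudo tee /etc/sysctl.d/99-file-max.conf
sudo sysctl --system

sudo systemctl daemon-reload
sudo systemctl restart prometheus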

prometheus locked and limited conversation to collaborators Apr 23, 2019