Prometheus 2.2.0 missing metrics after upgrade #3943
Comments
britcey commented on Mar 9, 2018
I'm seeing a similar issue; I updated to 2.2 at 2018-03-08 22:02:17 UTC and there's a big hole in my metrics until about 8 hours later. Fortunately, this is just a pilot, but it's concerning nonetheless. Nothing stands out in the logs other than, perhaps, this line right after the upgrade: Mar 8 22:02:19 xxx prometheus: level=warn ts=2018-03-08T22:02:19.54858458Z caller=head.go:320 component=tsdb msg="unknown series references in WAL samples" count=7
brian-brazil added the kind/bug, priority/P0, and component/local storage labels on Mar 10, 2018
@britcey can you share the meta.json files and the size of your index files in the storage dir? You are only experiencing data loss for some metrics, right? Could you verify whether this happens to metrics in lexicographic order - for example, all metrics starting with …? Could you also set the storage.tsdb.max-block-duration flag? @hgranillo the same would probably apply to you as well.
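For anyone gathering this information by hand: each block lives in its own directory (named by ULID) under the storage path, with a meta.json and an index file inside. A minimal sketch along those lines - an illustration only, not an official tool - prints each block's ULID, time range, compaction level, and index size:

```go
// metadump.go - print ULID, time range, compaction level and index size for every block.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

type blockMeta struct {
	ULID       string `json:"ulid"`
	MinTime    int64  `json:"minTime"`
	MaxTime    int64  `json:"maxTime"`
	Compaction struct {
		Level int `json:"level"`
	} `json:"compaction"`
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: metadump <tsdb-data-dir>")
		os.Exit(1)
	}
	dataDir := os.Args[1] // your --storage.tsdb.path

	metaFiles, err := filepath.Glob(filepath.Join(dataDir, "*", "meta.json"))
	if err != nil {
		panic(err)
	}
	for _, path := range metaFiles {
		raw, err := os.ReadFile(path)
		if err != nil {
			panic(err)
		}
		var m blockMeta
		if err := json.Unmarshal(raw, &m); err != nil {
			panic(err)
		}
		// The index file lives next to meta.json inside the block directory.
		var indexSize int64 = -1
		if fi, err := os.Stat(filepath.Join(filepath.Dir(path), "index")); err == nil {
			indexSize = fi.Size()
		}
		fmt.Printf("%s  minTime=%d  maxTime=%d  level=%d  indexSize=%d\n",
			m.ULID, m.MinTime, m.MaxTime, m.Compaction.Level, indexSize)
	}
}
```

Running it as go run metadump.go /path/to/data prints one line per block, which is the information being asked for here.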
Jean-Daniel commented on Mar 12, 2018
I have the same issue here. I didn't check all the series, but I can't find one that does not suffer the data loss. At startup, I see the same warning as above. I've attached a list of all my index dirs with the corresponding meta.json files.
@Jean-Daniel is this an upgrade or a fresh 2.2 Prometheus installation?
Can you share some more configs? It might help replicate it.
@fabxc It seems to affect all metrics. I also see a few .tmp dirs in one of my Prometheus instances - are these safe to delete? Here are a few metrics from A to Z. I have a second gap now, similar to @britcey.
britcey commented on Mar 12, 2018
Metafile info attached; it appears to be all metrics that are missing data. I'll set storage.tsdb.max-block-duration shortly. Not sure if it's a coincidence, but there was a restart of Prometheus sometime before the second data loss - not sure if the other folks were running continuously or not.
bwplotka referenced this issue on Mar 13, 2018: compact: Additional test for issue prometheus#3943 #298 (closed)
@Jean-Daniel The suspicion is that you all upgraded to 2.2.0 at around the same time (within a couple of hours of each other) - that is (maybe) why the data loss appears at the same time for everyone.
These logs are worrying: compaction somehow reduced a block O.o
gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue on Mar 13, 2018
OK, we found the bug and root cause. No block was deleted - compaction just wrongly ignores some blocks in certain cases. The ignored block (12h) is most likely still lying around, but nothing is using it. Fix in progress. EDIT: fix in review: prometheus/tsdb#299
bwplotka added a commit to bwplotka/tsdb that referenced this issue on Mar 13, 2018
bwplotka referenced this issue on Mar 13, 2018: compact: Assume fresh block to ignore by minTime not ULID order. Added tests. #299 (merged)
BTW, in theory we can still repair the wrong blocks and recover these metrics. That will need some extra work, but it might be possible. EDIT: that might actually be incorrect ):
The question is whether that should be carried in Prometheus or promtool.
@RichiH ...and if it's necessary. Depends on impact.
Impact will be hard to measure; one would hope that any organization that depends on data over the longer term would be more conservative about running .0 releases - I know we are. If it's done, it probably makes more sense to put the functionality in promtool. Otherwise, it becomes a game of "when can we drop this again?"
britcey commented on Mar 13, 2018
If it matters - mine's just a pilot setup, I don't need the data recovered (just a nice-to-have).
trnl commented on Mar 13, 2018
@Bplotka, are you going to make a 2.2.1 release with that?
While it would be nice to recover these metrics, it is not that critical - at least for me.
That's good, because on second glance at the meta files @hgranillo gave me, the "ignored" block might actually have been dropped. ): So I am no longer sure whether the missing block is recoverable.
@trnl yes, as far as I know, once my fix lands.
Hey @kamusin. Glad it all works for you now. I will try to explain here why we are not sure whether we can recover these gaps. The recovery potentially depends on when the issue happened vs. when you upgraded to the fix. When a new block appeared, a wrong compaction resulted in: … As you can see, we ended up in a situation where we have overlapping blocks in the system (a 54h block with a 12h gap, and the 12h block itself). These time ranges are from @hgranillo's case; the issue could have happened on other blocks in your case, but it was inevitable. Now the fun begins. Nothing is prepared for overlap, so we have all sorts of problems there, and TBH I am unable to predict what results one can have with these (: As we could see so far: … and there are no longer overlapping blocks visible on his disk (from what we could see from the meta files). We don't have enough logs; something bad seems to have happened after the log lines we have.
So I don't know whether this issue is generally recoverable in the general case, but we can go case by case. If you @kamusin can print all the meta files from your data dir, we can take a look.
Maybe writing some short tool/script for this would be helpful.
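A rough sketch of what such a script could look like - purely illustrative, and not the check Prometheus itself performs - reads every meta.json under the data directory, sorts the blocks by minTime, and flags any block whose range starts before the latest maxTime seen so far:

```go
// overlapcheck.go - flag blocks whose [minTime, maxTime) ranges overlap.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

type blockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"`
	MaxTime int64  `json:"maxTime"`
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: overlapcheck <tsdb-data-dir>")
		os.Exit(1)
	}
	dataDir := os.Args[1] // your --storage.tsdb.path

	paths, err := filepath.Glob(filepath.Join(dataDir, "*", "meta.json"))
	if err != nil {
		panic(err)
	}
	var blocks []blockMeta
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			panic(err)
		}
		var m blockMeta
		if err := json.Unmarshal(raw, &m); err != nil {
			panic(err)
		}
		blocks = append(blocks, m)
	}
	if len(blocks) == 0 {
		fmt.Println("no blocks found")
		return
	}

	// Sort by minTime; in a healthy DB each block starts where the previous one ends.
	sort.Slice(blocks, func(i, j int) bool { return blocks[i].MinTime < blocks[j].MinTime })

	found := false
	maxSeen, maxULID := blocks[0].MaxTime, blocks[0].ULID
	for _, b := range blocks[1:] {
		// Ranges are half-open, so a block starting exactly at maxSeen is fine.
		if b.MinTime < maxSeen {
			found = true
			fmt.Printf("overlap: %s [%d, %d) overlaps %s (which ends at %d)\n",
				b.ULID, b.MinTime, b.MaxTime, maxULID, maxSeen)
		}
		if b.MaxTime > maxSeen {
			maxSeen, maxULID = b.MaxTime, b.ULID
		}
	}
	if !found {
		fmt.Println("no overlapping blocks found")
	}
}
```

On a healthy data directory this should print "no overlapping blocks found"; any overlap line points at the kind of leftover block discussed above.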
kamusin commented on Mar 23, 2018
I think I should be able to post the data @Bplotka; however, if you have a snippet ready to share, please let me know. (So far so good with the upgrade; however, every time we need to restart Prometheus we have a small gap of no data that varies between 2-10 minutes. Some nodes seem not to be affected, though - but this might be a different problem.)
This was referenced Mar 23, 2018
hmmxp commented on Mar 23, 2018
Dear all, I am also facing a similar issue, as per my recently closed case #4002. What is the fix for the issue? Should I git clone the master branch and build the Prometheus binary with the fix provided by Bplotka, or do I need to make configuration changes?
@hmmxp 2.2.1 should include the fix, so let's move to your issue to investigate. @kamusin yes, a restart of Prometheus will always lose some data. Maybe you can make it faster, because 2-10 minutes sounds long, but still - unless you use HA (multiple Prometheus replicas), there will be some data missing. We use https://github.com/improbable-eng/thanos to provide Prometheus HA (it has a query layer with a deduplication feature).
hmmxp commented on Mar 23, 2018
@Bplotka At the moment, on version 2.2.1, I'm not seeing any missing data, but I'm still monitoring closely. As for the data that went missing - is it possible to recover it, or what options are available?
@hmmxp See a few comments above: #3943 (comment). If you check all your meta files on your disk and there is an overlapping block somewhere, we will be able to recover it. You can post them here and we can help.
dswarbrick commented on Mar 27, 2018
I saw the "block time ranges overlap" error with Prometheus 2.2.0 within a day or two after a fresh install. I upgraded to 2.2.1 and nuked the DB. It ran fine for about 36 hours, then I saw the error occur. The filesystem fills up during a compaction, and if I restart Prometheus at that point, it never comes back up. This bug seems to be a regression since 2.1.0.
Hi @dswarbrick, would you be able to share the logs? The bug that caused it in 2.2.0 has been identified and fixed in 2.2.1.
@dswarbrick which error exactly?
@dswarbrick would this summary be correct: …
dswarbrick commented on Mar 28, 2018
I installed 2.2.0 last Friday, around midday. It ran fine until Sunday, when the log started to report these:
Mar 25 11:00:00 fkb-prom prometheus[20055]: level=info ts=2018-03-25T09:00:00.371572247Z caller=compact.go:394 component=tsdb msg="compact blocks" count=1 mint=1521957600000 maxt=1521964800000
Mar 25 11:00:01 fkb-prom prometheus[20055]: level=info ts=2018-03-25T09:00:01.652055946Z caller=head.go:348 component=tsdb msg="head GC completed" duration=64.254505ms
Mar 25 11:00:02 fkb-prom prometheus[20055]: level=info ts=2018-03-25T09:00:02.26184745Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=609.038738ms
Mar 25 11:00:02 fkb-prom prometheus[20055]: level=info ts=2018-03-25T09:00:02.388959377Z caller=compact.go:394 component=tsdb msg="compact blocks" count=3 mint=1521936000000 maxt=1521957600000
Mar 25 11:00:04 fkb-prom prometheus[20055]: level=info ts=2018-03-25T09:00:04.082174394Z caller=compact.go:394 component=tsdb msg="compact blocks" count=2 mint=1521892800000 maxt=1521936000000
Mar 25 11:00:06 fkb-prom prometheus[20055]: level=info ts=2018-03-25T09:00:06.414524076Z caller=compact.go:394 component=tsdb msg="compact blocks" count=3 mint=1521799200000 maxt=1521957600000
Mar 25 11:00:11 fkb-prom prometheus[20055]: level=error ts=2018-03-25T09:00:11.083160017Z caller=db.go:281 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap (1521892800000, 1521957600000)"
Mar 25 11:00:12 fkb-prom prometheus[20055]: level=info ts=2018-03-25T09:00:12.09218052Z caller=compact.go:394 component=tsdb msg="compact blocks" count=2 mint=1521892800000 maxt=1521957600000
Mar 25 11:00:15 fkb-prom prometheus[20055]: level=error ts=2018-03-25T09:00:15.18509509Z caller=db.go:281 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap (1521799200000, 1521828000000)"
At one point it ran out of space on the 50 GB filesystem:
Mar 25 11:30:04 fkb-prom prometheus[20055]: level=info ts=2018-03-25T09:30:04.488714476Z caller=compact.go:394 component=tsdb msg="compact blocks" count=11 mint=1521892800000 maxt=1521957600000
Mar 25 11:34:33 fkb-prom prometheus[20055]: level=error ts=2018-03-25T09:34:33.128006388Z caller=db.go:281 component=tsdb msg="compaction failed" err="compact [/srv/prometheus/metrics2/01C9E67YYV73KNNCWD01TAYSFT /srv/prometheus/metrics2/01C9E65NKJ8AFP6NJSYQVPKNJX /srv/prometheus/metrics2/01C9E6SZNNAABH0KW6A4KZXV2J /srv/prometheus/metrics2/01C9E65XDW74BHDR7NPBNAGXJ2 /srv/prometheus/metrics2/01C9E70R0XAT2QG2BEENG2FD11 /srv/prometheus/metrics2/01C9E6A4H1FPAAGRW8SRKDWNTE /srv/prometheus/metrics2/01C9E6CKZ9452ERA52VPHWKK41 /srv/prometheus/metrics2/01C9E6FMJY0C2Y1KYJPNRHBXAT /srv/prometheus/metrics2/01C9E6KRHFMXV5K3SES7QX72X6 /srv/prometheus/metrics2/01C9E788ESAF5F2AT1YAX51HZY /srv/prometheus/metrics2/01C9E65KYME9TEKF8J212QK38J]: write compaction: write chunks: no space left on device"
This gradually became worse:
Mar 25 16:39:27 fkb-prom prometheus[20055]: level=info ts=2018-03-25T14:39:27.268797602Z caller=compact.go:394 component=tsdb msg="compact blocks" count=1 mint=1521964800000 maxt=1521972000000
Mar 25 16:39:28 fkb-prom prometheus[20055]: level=error ts=2018-03-25T14:39:28.925386435Z caller=db.go:281 component=tsdb msg="compaction failed" err="reload blocks: open block /srv/prometheus/metrics2/01C9EQRRJ6GW1GMCQYVBV5AGA7: mmap files: mmap: cannot allocate memory"
... until eventually it died:
Mar 25 17:21:22 fkb-prom prometheus[20055]: fatal error: runtime: cannot allocate memory
When I tried to restart it on Monday, it would not start and died with:
Mar 26 10:01:18 fkb-prom prometheus[6806]: level=error ts=2018-03-26T08:01:18.048223898Z caller=main.go:575 err="Opening storage failed invalid block sequence: block time ranges overlap (1521799200000, 1521957600000)"
Mar 26 10:01:18 fkb-prom prometheus[6806]: level=info ts=2018-03-26T08:01:18.048388413Z caller=main.go:577 msg="See you next time!"
This is the point at which I removed the entire contents of /srv/prometheus/metrics2, and upgraded to Prometheus 2.2.1. This obviously started fine with an empty DB. Later on Tuesday, I started to see the same "block time ranges overlap" error again, and the filesystem again started to rapidly fill, as if the compaction process was failing with leftover temporary files.
Mar 27 17:00:14 fkb-prom prometheus[14007]: level=error ts=2018-03-27T15:00:14.672804202Z caller=db.go:281 component=tsdb msg="compaction failed" err="reload blocks: invalid block sequence: block time ranges overlap (1522087200000, 1522152000000)"
I tried to restart Prometheus, and once again it failed to start with:
Mar 27 18:38:19 fkb-prom prometheus[7437]: level=error ts=2018-03-27T16:38:19.768740127Z caller=main.go:575 err="Opening storage failed invalid block sequence: block time ranges overlap (1522051200000, 1522152000000)"
Mar 27 18:38:19 fkb-prom prometheus[7437]: level=info ts=2018-03-27T16:38:19.768880142Z caller=main.go:577 msg="See you next time!"
At this point I again removed the contents of /srv/prometheus/metrics2, and started with a fresh DB again. So far it's been running ok for about 14 hours.
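For reference, the mint/maxt values in those "block time ranges overlap" errors are Unix timestamps in milliseconds. A throwaway sketch like the following (the hard-coded ranges are simply the pairs quoted in the logs above) converts them to UTC times and durations:

```go
// tsconv.go - convert the mint/maxt millisecond timestamps from the errors above to UTC.
package main

import (
	"fmt"
	"time"
)

func main() {
	// Ranges copied from the "block time ranges overlap" log lines quoted above.
	ranges := [][2]int64{
		{1521892800000, 1521957600000},
		{1521799200000, 1521828000000},
		{1521799200000, 1521957600000},
		{1522087200000, 1522152000000},
		{1522051200000, 1522152000000},
	}
	for _, r := range ranges {
		mint := time.UnixMilli(r[0]).UTC()
		maxt := time.UnixMilli(r[1]).UTC()
		fmt.Printf("%d - %d  =>  %s to %s (%s)\n",
			r[0], r[1], mint.Format(time.RFC3339), maxt.Format(time.RFC3339), maxt.Sub(mint))
	}
}
```

The first pair, for example, works out to an 18-hour window ending at 06:00 UTC on Sunday, Mar 25, which makes it easier to line the overlaps up against the block directories on disk.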
Can you share the logs up until the "compaction failed" error happened on 2.2.1?
dswarbrick commented on Mar 28, 2018
Here are my logs starting from the initial 2.2.0 install last week, up to the last few minutes.
bwplotka referenced this issue on Mar 28, 2018: compact: [Debug - do not merge] dswarbrick compact issue - cannot repro #311 (closed)
Hey @dswarbrick - thanks for your logs! I tried to repro your case locally, but it seems that 2.2.1 works perfectly for your case. Somehow your Prometheus behaved differently and produced an overlapping block. Details: prometheus/tsdb#311. It looks like there was some side/external effect that impacted your storage/Prometheus, but I'm not sure what. Keep us posted if you can repro the issue again. Also, you were quite lucky that Prometheus caught the overlapping blocks - in most cases they will be ignored (Prometheus keeps running despite overlapping blocks, as it did for the other folks here). This PR fixes the overlapping check: prometheus/tsdb#310
Curiously, it looks like the bug fix was not included in the Debian release. https://packages.debian.org/sid/all/golang-github-prometheus-tsdb-dev/download doesn't have this PR: prometheus/tsdb#299, which fixed the issue. cc @TheTincho @dswarbrick Could you try with a release from here and see if that fixes it: https://github.com/prometheus/prometheus/releases/tag/v2.2.1
Yup, let's double-check the Debian packaging, but if that's not it - there is one thing we don't check locally: the influence on compaction planning if …
dswarbrick commented on Mar 29, 2018
So far the Debian 2.2.1 package is still working correctly on my system, but I think it's been just a little over 24 hours in operation; the problem seems to kick in at around the 36-hour mark. If I see the problem occur today, I will install the vanilla release from upstream and clear the DB again. Thanks for the help so far! I think your suspicion of the Debian package is probably warranted.
Hey all, I can confirm that I missed this update in the tsdb library. I will update the package now.
I have just uploaded updated packages to Debian (2.2.1+ds-2); they should hit the mirrors in a few hours. Sorry for the mistake!
Thanks for solving this so quickly!
bwplotka referenced this issue on Apr 12, 2018: compact: Use sync-delay only for fresh blocks. Refactored halt, retry logic. #282 (merged)
bwplotka referenced this issue on Jun 28, 2018: Add tsdb.Scan() to unblock from a corrupted db. #320 (open)
krasi-georgiev referenced this issue on Aug 14, 2018: Add alerting rules and post alerts as GitHub comments on being triggered #32 (open)
lock bot commented on Mar 22, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
hgranillo commented on Mar 9, 2018 (edited)
What did you do?
Upgraded Prometheus from 2.1.0 to 2.2.0.
What did you expect to see?
All metrics, except the ones missing during the container reload and start - no more than a few minutes of gap.
What did you see instead? Under which circumstances?
After I updated, everything was running fine; new metrics were being collected and stored. A day later, a gap of metrics went missing. A few hours earlier I had still been able to see the now-missing metrics - they were there, and a second later they were not.
The really weird thing here is that I have 4 Prometheus servers running and all 4 have the same issue; they were upgraded at almost the same time. There's a long missing gap of metrics (all kinds of metrics from different exporters) and no errors in the logs.
These 4 Prometheus servers are running in separate environments in two different AWS regions.
Environment
Linux 4.4.0-1050-aws x86_64. I cut the scrape rules here; I can include some of them.
Production Ireland
The last metric from 2.1.0 is the green one, and it ends when I upgraded Prometheus (19:58). In the logs you can see that the container started at 19:58:19.
Staging Virginia
Prometheus Dashboard Production Ireland
I know this looks really sketchy, but I didn't have a restart or network error in the environments that run Prometheus. That's why I chose a Prometheus metric as an example, but all metrics were affected.