upgrade from 2.7.1 to 2.8.0 provokes consistently ~ 30% loss of metrics passing through remote_write #5389
Comments
(1.7M time series in this instance. Tried on different Prometheus instances and got the same results with a different setup.) Since the samples are dropped before the metrics are actually pushed, I don't think it's related to the software receiving the remote_write stream. Here it's metrictank.
ping @cstyan @tomwilkie @gouthamve
krasi-georgiev added the component/remote storage label Mar 21, 2019
Thanks for the report @beorn-! Looks like you have a corrupt WAL:
Can you grab screenshots of the following queries please?
(edit: added one more) Those missing series records from the WAL explain why the remote write code is dropping samples for those series. When the next checkpoint hits, the remote write code will pick this up and start sending entries for those series. I'm surprised this doesn't happen earlier, TBH. I will investigate.
BTW, we have been getting "unknown series references" for quite a while and never found the real culprit for that.
Here you go @tomwilkie. The green graph is 2.7.1; the others are 2.8.0 with a few restarts along the way.
Unfortunately there isn't much we can do in 2.8 regarding remote write if the WAL is corrupt. In prior versions, remote write got the samples to send by copying them from scrapes; now we read them from the WAL. Within the WAL there are series records (giving the metric name and labels) and samples records (each with a ref ID pointing to a series record, plus the timestamp and value for that sample). This means that in 2.8 we cache the results of the series records we read. With unknown series references, we see samples records whose ref ID we haven't seen or cached, so remote write can't send those samples. That's the
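To make the mechanism described above concrete, here is a minimal toy model of it in Python. This is only a sketch and not Prometheus code: the class and function names (`SeriesRecord`, `SampleRecord`, `ToyRemoteWriter`, `replay`) are hypothetical, and the real WAL record format and remote write implementation differ. It only illustrates why a sample whose ref ID has no cached series record cannot be sent, and why sending resumes once the series record is read.

```python
from dataclasses import dataclass, field


@dataclass
class SeriesRecord:
    ref: int      # reference ID assigned to the series
    labels: dict  # metric name and labels


@dataclass
class SampleRecord:
    ref: int      # should match the ref of a series record read earlier
    ts: int       # timestamp
    value: float


@dataclass
class ToyRemoteWriter:
    """Hypothetical stand-in for the WAL-reading side of remote write."""
    series_cache: dict = field(default_factory=dict)
    sent: list = field(default_factory=list)
    dropped: int = 0

    def read(self, record):
        if isinstance(record, SeriesRecord):
            # Cache the labels so later samples can be resolved by ref ID.
            self.series_cache[record.ref] = record.labels
        elif isinstance(record, SampleRecord):
            labels = self.series_cache.get(record.ref)
            if labels is None:
                # Unknown series reference: nothing to attach the sample to.
                self.dropped += 1
                return
            self.sent.append((labels, record.ts, record.value))


def replay(records):
    """Feed records to a fresh writer in order and return it."""
    writer = ToyRemoteWriter()
    for record in records:
        writer.read(record)
    return writer


if __name__ == "__main__":
    wal = [
        SeriesRecord(1, {"__name__": "up", "job": "node"}),
        SampleRecord(1, 1000, 1.0),
        SampleRecord(2, 1000, 0.5),  # series record for ref 2 not seen yet
        SeriesRecord(2, {"__name__": "up", "job": "other"}),
        SampleRecord(2, 2000, 0.7),  # now resolvable
    ]
    writer = replay(wal)
    print("sent:", len(writer.sent), "dropped:", writer.dropped)
```

In this toy run the sample for ref 2 at timestamp 1000 is dropped, while the one at timestamp 2000 is sent because its series record has been cached by then.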
So @cstyan, somehow flushing the WAL should solve my issue, right? Is there some fsck-ish tool somewhere? Can it be safely reset?
QQ: where is the Prometheus data stored? Is this a local or a network disk? Are you using a particular cloud provider?
Also, I suspect that if you leave it running for long enough the errors will subside. If possible, it would be worth a try.
It is stored locally on bare-metal servers. I've checked: the WAL keeps something like 5 hours. During the reload I still got
It's been days since my last upgrade test, so the WALs are brand new. According to the explanations/assumptions given here I expected to be out of the woods, but I've had the very same problem. A rollback instantly fixed the issue.
tomwilkie referenced this issue Apr 1, 2019: Release 2.8+ remote storage doesn't work on ext4 bare metal, running RH7 #5424 (Open)
brian-brazil added a commit that referenced this issue Apr 2, 2019
The fix works.

beorn- commented Mar 21, 2019 (edited)
Bug Report
What did you do?
After an upgrade from Prometheus 2.7.1 to Prometheus 2.8.0 I faced a big metric loss with the remote_write rewrite.
What did you expect to see?
Nothing special.
What did you see instead? Under which circumstances?
Losing roughly 30% of metrics because of "dropped sample for series that was not explicitly dropped via relabelling"
Environment
Debian stretch with github release binary.
Linux 4.18.0-0.bpo.1-amd64 x86_64
prometheus, version 2.8.0 (branch: HEAD, revision: 5936949)
build user: root@4c4d5c29b71f
build date: 20190312-07:46:58
go version: go1.11.5
prom28.txt
went fine. back to normal with 2.7.1 version.