'Error on ingesting out-of-order samples' in 0.18.0 #1585

Closed
RichiH opened this Issue Apr 25, 2016 · 17 comments

@RichiH
Member

RichiH commented Apr 25, 2016

As per IRC/mailing list:

After upgrading to 0.18.0, I am seeing tons of

WARN[194663] Error on ingesting out-of-order samples       numDropped=1967 source=scrape.go:459
WARN[194693] Error on ingesting out-of-order samples       numDropped=4573 source=scrape.go:459
WARN[194694] Error on ingesting out-of-order samples       numDropped=1184 source=scrape.go:459
WARN[194723] Error on ingesting out-of-order samples       numDropped=1967 source=scrape.go:459
WARN[194753] Error on ingesting out-of-order samples       numDropped=4573 source=scrape.go:459
WARN[194754] Error on ingesting out-of-order samples       numDropped=1184 source=scrape.go:459
WARN[194783] Error on ingesting out-of-order samples       numDropped=1967 source=scrape.go:459
WARN[194813] Error on ingesting out-of-order samples       numDropped=4573 source=scrape.go:459
WARN[194814] Error on ingesting out-of-order samples       numDropped=1184 source=scrape.go:459
WARN[194843] Error on ingesting out-of-order samples       numDropped=1967 source=scrape.go:459

From the ML, several others are seeing the same.

@RichiH

Member Author

RichiH commented Apr 25, 2016

@RichiH RichiH changed the title `Out of Order Samples` in 0.18.0 `Error on ingesting out-of-order samples` in 0.18.0 Apr 25, 2016

@RichiH RichiH changed the title `Error on ingesting out-of-order samples` in 0.18.0 'Error on ingesting out-of-order samples' in 0.18.0 Apr 25, 2016

@RichiH

Member Author

RichiH commented Apr 25, 2016

As a somewhat unexpected update:

All those messages came from three machines, all of them running CatOS (properly firewalled and kept around for a specific historical reason; otherwise I would have killed them ages ago). My gut feeling is that the messages are valid and that 0.18.0 simply exposes things that 0.17.0 gracefully ignored.

@RichiH

Member Author

RichiH commented Apr 25, 2016

As another update: I removed the patch and the CatOS machines and am running with vanilla 0.18.0 from source again. Will continue to monitor the situation.

@RichiH

Member Author

RichiH commented Apr 25, 2016

I re-enabled the machines to play with the situation a bit, and I think it would be an acceptable trade-off to print the instance and/or the job name along with numDropped. That way, users have something to dig into while STDOUT/STDERR is not flooded with crap.
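To illustrate the trade-off suggested above, here is a minimal Go sketch that tallies dropped samples per job and instance and emits one summary line per target instead of one line per sample. It is not Prometheus code: the `target` and `dropCounter` types, the output format, and the job and instance names are all assumptions made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// target identifies a scrape target by its job and instance labels.
// Hypothetical type for this sketch only.
type target struct {
	Job      string
	Instance string
}

// dropCounter tallies dropped samples per target (hypothetical helper).
type dropCounter map[target]int

// add records n dropped samples for the given target.
func (c dropCounter) add(t target, n int) { c[t] += n }

// lines renders one summary line per target, sorted for stable output.
func (c dropCounter) lines() []string {
	out := make([]string, 0, len(c))
	for t, n := range c {
		out = append(out, fmt.Sprintf("job=%q instance=%q numDropped=%d", t.Job, t.Instance, n))
	}
	sort.Strings(out)
	return out
}

func main() {
	c := dropCounter{}
	// Made-up targets; the counts are borrowed from the log excerpt only as sample numbers.
	c.add(target{Job: "node", Instance: "host-a:9100"}, 1967)
	c.add(target{Job: "node", Instance: "host-b:9100"}, 4573)
	c.add(target{Job: "node", Instance: "host-a:9100"}, 1184)
	for _, l := range c.lines() {
		fmt.Println(l)
	}
}
```

One line per target keeps the log volume bounded by the number of targets rather than the number of samples, which seems to be the point of the suggestion.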

@juliusv

Member

juliusv commented Apr 25, 2016

Could you share some of the log excerpts generated by https://gist.github.com/juliusv/dab35d4b17caf1937ff080238f1e7f20? I'm interested in the other fields also, not just what the metrics were...

@beorn7

Member

beorn7 commented Apr 25, 2016

Or we re-introduce the detailed logging, but on DEBUG level.

@RichiH

Member Author

RichiH commented Apr 25, 2016

@beorn7

Member

beorn7 commented Apr 25, 2016

Sure, that's what I meant.

@juliusv

Member

juliusv commented Apr 25, 2016

Got a private look at @RichiH's log excerpts. The problem is that samples are coming in with the same timestamp, but different values. So in that case, Prometheus is doing the right thing, but the "out of order" error message is a bit misleading. Should we have a separate error code for "hey, you can't change the value of an existing sample"?

@fabxc

Member

fabxc commented Apr 25, 2016

To clarify, this actually is from two targets being relabeled into the same thing?


@RichiH

Member Author

RichiH commented Apr 25, 2016

No, it's old machines answering one query twice and apparently computing the answer live, so they produce different values every time.
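For anyone trying to find out whether a target behaves this way, the following Go sketch scans the samples of a single scrape and reports every series that appears more than once with differing values. It is an illustration only: the `sample` type, the flattened series string, and `conflictingSeries` are hypothetical and not part of Prometheus. The reasoning is that samples without their own timestamp all get the scrape time, so a series repeated within one scrape ends up with the same timestamp but possibly different values.

```go
package main

import "fmt"

// sample is a simplified scraped sample: a series identity plus a value.
// Series is the metric name and labels flattened into one string (assumption).
type sample struct {
	Series string
	Value  float64
}

// conflictingSeries returns the series that occur more than once within a
// single scrape with a value differing from their first occurrence, in the
// order in which the conflict is first seen.
func conflictingSeries(scrape []sample) []string {
	first := make(map[string]float64) // first value seen per series
	reported := make(map[string]bool) // series already added to the result
	var out []string
	for _, s := range scrape {
		v, seen := first[s.Series]
		if !seen {
			first[s.Series] = s.Value
			continue
		}
		if v != s.Value && !reported[s.Series] {
			reported[s.Series] = true
			out = append(out, s.Series)
		}
	}
	return out
}

func main() {
	// Made-up scrape in which one series is reported twice with different values.
	scrape := []sample{
		{Series: `example_metric{port="1"}`, Value: 10},
		{Series: `example_metric{port="2"}`, Value: 20},
		{Series: `example_metric{port="1"}`, Value: 11},
	}
	for _, s := range conflictingSeries(scrape) {
		fmt.Println("conflicting duplicate:", s)
	}
}
```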

I can't say anything about the others seeing this effect, though.

@beorn7 beorn7 self-assigned this Apr 25, 2016

@juliusv

Member

juliusv commented Apr 25, 2016

I would bet that it's the same issue for the other people. In 0.17.0, we didn't check sample value equality yet, only whether the timestamps were the same:

0.17.0:

https://github.com/prometheus/prometheus/blob/0.17.0/storage/local/storage.go#L595-L606

Compare to 0.18.0:

https://github.com/prometheus/prometheus/blob/0.18.0/storage/local/storage.go#L607-L619
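The difference described here can be modelled with a small Go sketch. This is not the code behind the two links: the `series` type, both append functions, and the error names are hypothetical, and the lenient variant only encodes the behaviour as described in this thread (equal timestamps ignored without looking at the value). The strict variant also uses a distinct error for "same timestamp, different value", which is one way the suggested separate error code could look.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values for this sketch.
var (
	errOutOfOrder         = errors.New("sample timestamp is older than the last one")
	errDuplicateTimestamp = errors.New("same timestamp as last sample but different value")
)

// series tracks only the most recently appended sample.
type series struct {
	lastTime  int64
	lastValue float64
	hasLast   bool
}

// appendLenient models the older behaviour as described in the thread:
// a sample with the same timestamp is ignored without comparing values.
func (s *series) appendLenient(t int64, v float64) error {
	if s.hasLast && t == s.lastTime {
		return nil // dropped silently
	}
	if s.hasLast && t < s.lastTime {
		return errOutOfOrder
	}
	s.lastTime, s.lastValue, s.hasLast = t, v, true
	return nil
}

// appendStrict models the newer behaviour: an identical sample is a no-op,
// but the same timestamp with a different value is reported as an error.
// NaN never compares equal in Go, so NaN values would need extra care.
func (s *series) appendStrict(t int64, v float64) error {
	if s.hasLast && t == s.lastTime {
		if v == s.lastValue {
			return nil
		}
		return errDuplicateTimestamp
	}
	if s.hasLast && t < s.lastTime {
		return errOutOfOrder
	}
	s.lastTime, s.lastValue, s.hasLast = t, v, true
	return nil
}

func main() {
	lenient, strict := &series{}, &series{}
	// The same target answers twice for one scrape timestamp with different values.
	for _, v := range []float64{1, 2} {
		fmt.Println("lenient:", lenient.appendLenient(100, v))
		fmt.Println("strict: ", strict.appendStrict(100, v))
	}
	fmt.Println(errors.Is(strict.appendStrict(50, 3), errOutOfOrder))
}
```

Under the lenient check the second answer disappears without a trace, while the strict check surfaces it, which matches the observation that 0.18.0 exposes what 0.17.0 ignored.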

@beorn7

Member

beorn7 commented Apr 25, 2016

OK, improved logging merged.
For investigation, you can now switch to debug log level.

The duplicate metrics from node exporter apparently only happen with a pretty old version.

I think this can be closed.

@beorn7 beorn7 closed this Apr 25, 2016

@spinus


spinus commented Jun 29, 2016

Is it possible to know which samples are meant to be overridden (or which set of labels cause the problem)?

@RichiH

Member Author

RichiH commented Jun 29, 2016

@spinus


spinus commented Jun 29, 2016

Ah, sure, I should have tried that before asking. Thank you @RichiH.

@lock


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
