
logs filling up dockerd memory #1046

Closed
miekg opened this Issue Sep 1, 2015 · 20 comments

miekg commented Sep 1, 2015

In our prometheus setup we see a lot of

time="2015-09-01T10:51:02Z" level=warning msg="Ignoring sample with out-of-order timestamp for fingerprint 8ecd53ef091d7a0f .....: 1441104582.533 is not after 1441104582.533" file=storage.go line=564 
time="2015-09-01T10:51:02Z" level=warning msg="Ignoring sample with out-of-order timestamp for fingerprint 60499958e3784aaa ....1441104417.533 is not after 1441104417.533" file=storage.go line=564 

By "a lot" I mean that dockerd grows to 32GB and then gets OOM-killed. That is partly Docker's fault for caching this output in memory, but maybe this isn't worth outputting in the first place?

brian-brazil (Member) commented Sep 1, 2015

This is worthy of outputting, as it indicates a misconfiguration. Are you federating a timeseries that already exists in your prometheus?
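For context, federation means the aggregating Prometheus server scrapes the /federate endpoint of other Prometheus servers. A minimal sketch of such a scrape config looks roughly like this; the job name, match[] selector and targets are invented for illustration, and the exact keys in a 0.15.x config may differ slightly from current releases:

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true          # keep the labels as exposed by the children
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'        # hypothetical selector for the series to pull
    static_configs:
      - targets:
          - 'child-prometheus-1:9090'   # hypothetical child servers
          - 'child-prometheus-2:9090'
```

If two children expose the same series with identical label sets, the aggregator ends up ingesting the same timeseries from two places, which is the situation this question is about.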

fabxc (Member) commented Sep 1, 2015

Any chance this is the same as #1042?

miekg (Author) commented Sep 1, 2015

@brian-brazil we don't have 100% control over what rules are being pushed to our Prometheus. So yes, this may indicate a problem, but I'd rather have some timeseries not be correct than have the whole container die. So a flag to suppress this would be welcome. :)

fabxc (Member) commented Sep 1, 2015

@brian-brazil I don't see why this necessarily is a misconfiguration.
In 0.15.1 you will always get this error output if a federated metric updates less frequently than /federate is scraped – which should be a perfectly legal setup.
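Concretely, /federate exposes samples together with their original timestamps (in milliseconds), so two scrapes in a row can return exactly the same sample if the underlying series has not been updated in between. The metric name, labels and value below are invented; the timestamp reuses the one from the log lines above:

```
# First scrape of /federate:
slow_metric{job="batch",instance="10.0.0.1:9100"} 42 1441104582533

# Second scrape 15s later: the series was not updated in between, so the
# same timestamp comes back and 0.15.1 logs "... is not after ...":
slow_metric{job="batch",instance="10.0.0.1:9100"} 42 1441104582533
```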

miekg (Author) commented Sep 1, 2015

@fabxc yep, that definitely seems related.

fabxc (Member) commented Sep 1, 2015

Should be fixed by #973.

brian-brazil (Member) commented Sep 1, 2015

> I don't see why this necessarily is a misconfiguration.

A Prometheus server that's getting the same timeseries from two different places indicates that something is either up with your rules (rare), or that you're missing distinguishing labels when federating from multiple sources (common).
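The usual fix for the second case is to give each source its own distinguishing label, e.g. via external_labels on each child Prometheus. The label name and values below are purely illustrative, and recent Prometheus versions attach these labels to the series exposed on /federate:

```yaml
# prometheus.yml on child server A (hypothetical):
global:
  external_labels:
    replica: 'a'
---
# prometheus.yml on child server B (hypothetical):
global:
  external_labels:
    replica: 'b'
```

On the federating side, honor_labels: true (as in the sketch further up) keeps those labels intact, so the two children's series no longer collide.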

fabxc (Member) commented Sep 1, 2015

If that's the case, sure – but it's not the only possible or most likely cause.
It raises the question, though, whether there's a way to detect misconfiguration without false positives. Generally I don't feel it's Prometheus's job beyond the basics.

jimmidyson (Member) commented Sep 1, 2015

@miekg What version of docker are you running? Supposedly this is fixed in docker >= 1.7. See moby/moby#9139.

matthiasr (Contributor) commented Sep 1, 2015

In either case, one misconfiguration must not cause 32GiB of warnings. This will also drown out other problems, even if you're able to rotate the logs away quickly enough. Can we rate-limit this to once per timeseries, or less?
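As a standalone sketch of that kind of rate limiting (not code from the Prometheus storage layer; the fingerprint type, interval and log format are arbitrary), one could keep the time of the last warning per fingerprint and drop anything more frequent:

```go
package main

import (
	"log"
	"sync"
	"time"
)

// fingerprintWarner emits at most one warning per fingerprint per interval,
// so a persistently out-of-order series cannot flood the logs.
type fingerprintWarner struct {
	mu       sync.Mutex
	last     map[uint64]time.Time
	interval time.Duration
}

func newFingerprintWarner(interval time.Duration) *fingerprintWarner {
	return &fingerprintWarner{last: make(map[uint64]time.Time), interval: interval}
}

// warnf drops the message if this fingerprint was already warned about
// within the configured interval.
func (w *fingerprintWarner) warnf(fp uint64, format string, args ...interface{}) {
	w.mu.Lock()
	defer w.mu.Unlock()
	now := time.Now()
	if t, ok := w.last[fp]; ok && now.Sub(t) < w.interval {
		return
	}
	w.last[fp] = now
	log.Printf("level=warning msg="+format, args...)
}

func main() {
	w := newFingerprintWarner(time.Hour)
	for i := 0; i < 1000; i++ {
		// Only the first of these 1000 calls is actually logged.
		w.warnf(0x8ecd53ef091d7a0f, "%q", "Ignoring sample with out-of-order timestamp")
	}
}
```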

jimmidyson (Member) commented Sep 1, 2015

Wow, I didn't see the 32GB - must pay closer attention ;) That's going to fill up disk pretty quickly too.

fabxc (Member) commented Sep 1, 2015

@matthiasr With the fix, the warning will no longer be shown at all for these cases (correctly so, as there are too many false positives).
It will still pop up in some edge cases where you actually get a time series from different targets and have very (un)lucky timing.

brian-brazil (Member) commented Sep 1, 2015

> If that's the case, sure – but it's not the only possible or most likely cause.

What other causes have you seen?

> It raises the question, though, whether there's a way to detect misconfiguration without false positives.

I've never seen a case where this message was a false positive; in the tens of times I've investigated this over the years, it was always a misconfiguration or misuse.

Detecting this statically will be difficult, as you'd need to look at the label model of the entire set of federating Prometheus servers. Incorrect rules tend to be a bit easier to detect in the usual case.

> Generally I don't feel it's Prometheus's job beyond the basics.

When this happens the results can be surprising; this warning tells the user about a very fundamental error in their setup.

fabxc (Member) commented Sep 1, 2015

Please read my reply above and the PR I linked where this log was fixed (which you 👍 yourself).

juliusv (Member) commented Sep 1, 2015

👍 what @fabxc said. These are just identical timestamps, which are totally normal in the case of client-side timestamps that update less frequently than the scrape interval. As said, the next release will not log this anymore.

brian-brazil (Member) commented Sep 1, 2015

I'm always thinking of HEAD :)

miekg (Author) commented Sep 1, 2015

Thanks @juliusv and @fabxc. Any ETA on the next release?

fabxc (Member) commented Sep 1, 2015

We want to move towards an RC next week. The unclear socket leak is certainly a blocker for that.


juliusv (Member) commented Sep 14, 2015

Closing this as the issue is fixed in HEAD and will be in the next release.

juliusv closed this Sep 14, 2015

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
