WAL log samples: log series: write metrics/wal/000xxx: cannot allocate memory #4393
Comments
I've edited the original description to fix the formatting.
simonpasquier added the kind/enhancement label Aug 7, 2018
gouthamve added kind/bug and removed kind/enhancement labels Sep 26, 2018
This is a bug and not an enhancement. If there is no memory, the kernel should ideally be freeing it or OOMing. Now, having said that, I've checked the RSS when this happened, and there was 600M available to the container :/
I've tagged it as enhancement because of this remark from @iksaif.
simonpasquier added the component/local storage label Sep 26, 2018
After some debugging it turns out the node ran out of memory. Kubernetes killed a pod to free some memory and everything came back to normal, except Prometheus: it just continued dropping samples :/ Not sure if the new WAL fixes it.
kamaradclimber commented Sep 29, 2018
Had the same issue last night. When it happened the prometheus container was only using 1.6G out of the 2G allocated to the container. See https://snapshot.raintank.io/dashboard/snapshot/P39ucxFdCl7ObNfpXc6gX8db1EN44xHj?orgId=2 for a graph (memory is on the top right).

Details on the prometheus instance:
Version: 2.3.1
Revision: 188ca45
GoVersion: go1.10.3

I agree with @iksaif's proposal: prometheus /health should not pass on some occasions, to allow the orchestrator to kill and respawn the instance. During the issue (and until instance replacement), service discovery was broken (prometheus could not discover any target).
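For illustration, the orchestrator/side-agent half of that proposal could be a small liveness loop that kills the instance once the health endpoint stops answering. This is only a hypothetical sketch (the URL, polling interval, failure threshold and pkill-based restart are all assumptions), and it relies on an external supervisor such as systemd, Kubernetes or Marathon to respawn the process:

```go
// Hypothetical liveness watchdog: poll Prometheus's health endpoint and kill
// the process when it keeps failing, so the supervisor respawns it.
package main

import (
	"log"
	"net/http"
	"os/exec"
	"time"
)

func main() {
	const healthURL = "http://127.0.0.1:9090/-/healthy" // assumed local instance
	client := &http.Client{Timeout: 5 * time.Second}
	failures := 0

	for range time.Tick(15 * time.Second) {
		resp, err := client.Get(healthURL)
		if err != nil || resp.StatusCode != http.StatusOK {
			failures++
		} else {
			failures = 0
		}
		if resp != nil {
			resp.Body.Close()
		}
		if failures >= 3 {
			log.Println("health check failing, killing prometheus so the supervisor restarts it")
			// Assumes an external supervisor (systemd/k8s/marathon) brings it back.
			_ = exec.Command("pkill", "-TERM", "prometheus").Run()
			failures = 0
		}
	}
}
```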
I think there is an internal ticket with a bit more context. In particular there is a counter of failed memory allocations that could be used by mesos/k8s or any side-agent to kill everything in the container. The issue here is that even if there are a few (even hundreds of) megabytes available, some allocations will fail (if they try to allocate more than is available).
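To make the counter idea concrete, here is a minimal client_golang sketch; the metric name and the logSamples wrapper are made up for illustration and are not the actual tsdb code:

```go
// Illustrative only: expose a counter of failed WAL writes so that a
// side-agent (mesos/k8s) can watch it and restart the container when it grows.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric name, not the real tsdb metric.
var walWriteFailures = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "tsdb_wal_write_failures_total",
	Help: "Number of WAL writes that failed, e.g. with ENOMEM.",
})

func init() {
	prometheus.MustRegister(walWriteFailures)
}

// logSamples wraps a WAL write and counts failures (hypothetical hook).
func logSamples(write func() error) error {
	if err := write(); err != nil {
		walWriteFailures.Inc()
		return err
	}
	return nil
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9091", nil))
}
```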
Can you share more information about this? AFAIK there's no way to do it in Go (I've only found golang/go#16843). 2.4.x has a new implementation of the WAL, and since your original error was related to the WAL, it might be worth testing again with a newer version. It would be interesting to capture the
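For what it's worth, since Go offers no hook for a failed allocation (that is roughly what golang/go#16843 asks for), one rough workaround is to poll the runtime's own memory statistics against a known cgroup limit. The sketch below is purely an assumption-laden illustration (the 2 GiB limit, interval and nearMemoryLimit flag are made up), not something Prometheus currently does:

```go
// Rough workaround sketch: Go cannot report a failed allocation, but the
// runtime's memory stats can be compared against a known cgroup limit and
// the result fed into something like a health check. All values are
// illustrative.
package main

import (
	"runtime"
	"sync/atomic"
	"time"
)

var nearMemoryLimit atomic.Bool

func watchMemory(limitBytes uint64) {
	var ms runtime.MemStats
	for range time.Tick(10 * time.Second) {
		runtime.ReadMemStats(&ms)
		// HeapSys approximates the heap memory the Go runtime obtained from the OS.
		nearMemoryLimit.Store(ms.HeapSys > limitBytes*9/10)
	}
}

func main() {
	const assumedCgroupLimit = 2 << 30 // pretend the container limit is 2 GiB
	go watchMemory(assumedCgroupLimit)
	select {} // stand-in for the rest of the program
}
```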
See
@gouthamve in the end, do you think this is a problem with tsdb? Did you mean that Prometheus was restarted but continued dropping samples?
krasi-georgiev referenced this issue Nov 8, 2018: Fatal error handling (when writes to wal file fail) #247 (closed)
Closing for no activity. If you think we should revisit, please reopen and add some more updates to the issue.
iksaif commented Jul 17, 2018 • edited by simonpasquier
This happens when Prometheus (>2) runs in a cgroup and fails to allocate memory because there is no more memory available in the cgroup.

Maybe /-/health should fail in this case? This would allow marathon/kube to restart the instance (possibly freeing unused/leaked memory).
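A sketch of what that proposal could look like, purely illustrative (the storageWriteFailures counter and the recordWALError hook are assumptions, not Prometheus's actual code):

```go
// Illustrative sketch of a health endpoint that starts failing once storage
// writes fail (e.g. with ENOMEM), so an orchestrator liveness check can
// restart the instance.
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

var storageWriteFailures atomic.Int64

// recordWALError would be called wherever a WAL/storage write fails
// (hypothetical hook).
func recordWALError() { storageWriteFailures.Add(1) }

func healthHandler(w http.ResponseWriter, r *http.Request) {
	if storageWriteFailures.Load() > 0 {
		http.Error(w, "storage writes failing (possibly out of memory)", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("Healthy"))
}

func main() {
	http.HandleFunc("/-/health", healthHandler)
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```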