
WAL log samples: log series: write metrics/wal/000xxx: cannot allocate memory #4393

Closed
iksaif opened this Issue Jul 17, 2018 · 10 comments

@iksaif
Contributor

iksaif commented Jul 17, 2018

level=warn ts=2018-07-06T10:17:31.820321163Z caller=scrape.go:713 component="scrape manager" scrape_pool=job-rtb-us target="http://10.xxx.xx.xx:8082/metrics?contentType=application%2Fvnd.prometheus-overseer" msg="append failed" err="WAL log samples: log series: write metrics/wal/000261: cannot allocate memory"
level=warn ts=2018-07-06T10:17:31.820390888Z caller=scrape.go:713 component="scrape manager" scrape_pool=job-rtb-us target="http://10.xxx.xx.xx:8082/metrics?contentType=application%2Fvnd.prometheus-overseer" msg="append failed" err="WAL log samples: log series: write metrics/wal/000261: cannot allocate memory"

This happens when Prometheus (>2) runs in a cgroup and fails to allocate memory because there is no more memory available in the cgroup.

Maybe /-/health should fail in this case? This would allow Marathon/Kubernetes to restart the instance (possibly freeing unused/leaked memory).
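A minimal sketch of what that could look like (this is not Prometheus code; the failure counter and the threshold are hypothetical): a health handler that starts returning 503 once WAL appends keep failing, so a liveness probe would recycle the container instead of letting it silently drop samples.

```go
// Hypothetical sketch: a /-/healthy handler that reports unhealthy once WAL
// appends keep returning "cannot allocate memory", so an orchestrator
// (Marathon/Kubernetes) can kill and respawn the instance.
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// consecutiveWALFailures would be incremented by the append path every time a
// WAL write fails and reset to zero on success (not wired up in this sketch).
var consecutiveWALFailures int64

const walFailureThreshold = 100 // arbitrary threshold for this sketch

func healthyHandler(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt64(&consecutiveWALFailures) >= walFailureThreshold {
		// Fail the liveness probe so the orchestrator restarts the container.
		http.Error(w, "WAL appends failing (cannot allocate memory)", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "OK")
}

func main() {
	http.HandleFunc("/-/healthy", healthyHandler)
	http.ListenAndServe(":9090", nil)
}
```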

@simonpasquier

Member

simonpasquier commented Aug 7, 2018

I've edited the original description to fix the formatting.

@gouthamve

Member

gouthamve commented Sep 26, 2018

This is a bug and not an enhancement. If there is no memory, the kernel should ideally be freeing it or OOMing.

Now, having said that, I've checked the RSS when this happened, and there was 600M available to the container :/
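For reference, one way to double-check what the cgroup itself reports rather than trusting RSS alone (a sketch assuming the cgroup v1 memory controller is mounted at the usual /sys/fs/cgroup/memory path):

```go
// Print the cgroup's own view of memory usage vs. its limit.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func readBytes(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	usage, err := readBytes("/sys/fs/cgroup/memory/memory.usage_in_bytes")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	limit, err := readBytes("/sys/fs/cgroup/memory/memory.limit_in_bytes")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("cgroup usage: %d bytes, limit: %d bytes, headroom: %d bytes\n",
		usage, limit, limit-usage)
}
```

Note that cgroup v1 memory accounting includes page cache, so a container can be at its limit even when the RSS of the processes inside looks well below it; that could be one explanation for allocations failing despite apparent headroom.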

@simonpasquier

Member

simonpasquier commented Sep 26, 2018

I've tagged it as enhancement because of this remark from @iksaif:

Maybe /-/health should fail in this case?

@gouthamve

Member

gouthamve commented Sep 26, 2018

After some debugging, it turns out the node ran out of memory. Kubernetes then killed a pod to free some memory and everything came back to normal, except Prometheus: it just continued dropping samples :/ Not sure if the new WAL fixes it.

@kamaradclimber


kamaradclimber commented Sep 29, 2018

Had the same issue last night.

This is a bug and not an enhancement. If there is no memory, the kernel should ideally be freeing it or OOMing.

When it happened, the Prometheus container was only using 1.6G out of the 2G allocated to it.
See https://snapshot.raintank.io/dashboard/snapshot/P39ucxFdCl7ObNfpXc6gX8db1EN44xHj?orgId=2 for graph (memory is on the top right).

Details on the Prometheus instance:
Version: 2.3.1
Revision: 188ca45
GoVersion: go1.10.3

I agree with @iksaif's proposal: the Prometheus /health endpoint should fail in such situations so that the orchestrator can kill and respawn the instance.

During the issue (and until the instance was replaced), service discovery was broken (Prometheus could not discover any target).

@iksaif

Contributor Author

iksaif commented Sep 29, 2018

@simonpasquier

Member

simonpasquier commented Oct 3, 2018

I think there is an internal ticket with a bit more context. In particular, there is a counter of failed memory allocations that could be used by Mesos/Kubernetes or any side agent to kill everything in the container.

Can you share more information about this? AFAIK there's no way to do it in Go (I've only found golang/go#16843).

2.4.x has a new implementation of the WAL and since your original error was related to the WAL, it might be worth testing again with a newer version. It would be interesting to capture the go_memstats_* metrics when it happens.
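For instance, a small snapshot helper like the sketch below (the localhost:9090 URL is an assumption; point it at the affected instance) could dump the go_memstats_* series whenever the error shows up:

```go
// Fetch /metrics from a running Prometheus and print only the go_memstats_*
// lines, to capture a memory snapshot at the time of the failures.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024) // allow long metric lines
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "go_memstats_") {
			fmt.Println(line)
		}
	}
}
```

Running something like this from a sidecar or cron job when the "cannot allocate memory" errors appear would give a before/after picture of heap, released, and sys bytes.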

@iksaif

Contributor Author

iksaif commented Oct 3, 2018

@krasi-georgiev

Member

krasi-georgiev commented Oct 10, 2018

@gouthamve in the end, do you think this is a problem with tsdb?

After some debugging, it turns out the node ran out of memory. Kubernetes then killed a pod to free some memory and everything came back to normal, except Prometheus: it just continued dropping samples :/ Not sure if the new WAL fixes it.

Did you mean that Prometheus was restarted but continued dropping samples?

@krasi-georgiev

Member

krasi-georgiev commented Apr 2, 2019

Closing due to inactivity. If you think we should revisit this, please reopen and add more details to the issue.
