Prometheus fails to write WAL due to the `cannot allocate memory` error #4014

Closed
FUSAKLA opened this Issue Mar 26, 2018 · 9 comments

@FUSAKLA
Contributor

FUSAKLA commented Mar 26, 2018

What did you do?
Restarted Prometheus

What did you expect to see?
Prometheus to come up as usual

What did you see instead? Under which circumstances?
Compaction failed with `cannot allocate memory`, and afterwards Prometheus did not ingest any data because every WAL write failed with `write /prometheus/wal/000943: cannot allocate memory`.

After a restart it's OK again. I'm confused: the instance has resources available and there is no OOM kill from Kubernetes.
Nothing indicates there is any problem with memory.

The worst thing is that Prometheus keeps running but no alerts are dispatched (only those using `absent()`), because Prometheus has no data at all.

It has already happened twice.
Unfortunately I'm not able to reproduce it.

Environment
Running in Kubernetes with the data directory mounted as a hostPath volume.

Resource info from `kubectl describe node`:

Namespace                       Name                                       CPU Requests  CPU Limits  Memory Requests  Memory Limits
---------                       ----                                       ------------  ----------  ---------------  -------------
*****-monitoring-production     prometheus-collector-5474c5f965-7fmqx      1200m (5%)    3 (12%)     2148Mi (1%)      4596Mi (3%)
* System information: official Docker image

* Prometheus version: 2.2.1

* Logs:

level=info ts=2018-03-26T13:17:46.833820512Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
level=info ts=2018-03-26T13:17:46.833897458Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
level=info ts=2018-03-26T13:17:46.833932636Z caller=main.go:222 host_details="(Linux 4.14.13-infra #1 SMP Sat Jan 13 13:28:26 CET 2018 x86_64 prometheus-collector-5474c5f965-7fmqx (none))"
level=info ts=2018-03-26T13:17:46.833952256Z caller=main.go:223 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-03-26T13:17:46.837820241Z caller=main.go:504 msg="Starting TSDB ..."
level=info ts=2018-03-26T13:17:46.837902626Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-03-26T13:17:58.487508646Z caller=main.go:514 msg="TSDB started"
level=info ts=2018-03-26T13:17:58.487620424Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus-collector.yaml
level=info ts=2018-03-26T13:17:58.50430633Z caller=main.go:491 msg="Server is ready to receive web requests."
level=info ts=2018-03-26T13:49:48.570742031Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus-collector.yaml
level=info ts=2018-03-26T14:08:20.612340104Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus-collector.yaml
level=info ts=2018-03-26T15:00:02.478680282Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1522065600000 maxt=1522072800000
level=info ts=2018-03-26T15:00:11.639097618Z caller=head.go:348 component=tsdb msg="head GC completed" duration=556.833838ms
level=info ts=2018-03-26T15:00:15.238052892Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=3.598880962s
level=info ts=2018-03-26T15:00:15.640427335Z caller=compact.go:393 component=tsdb msg="compact blocks" count=3 mint=1522044000000 maxt=1522065600000
level=error ts=2018-03-26T15:00:21.266716379Z caller=db.go:281 component=tsdb msg="compaction failed" err="compact [/prometheus/01C9GRJ9J9JZ6SQZKJR3G3EW5K /prometheus/01C9GZE2K9J8BJ3GGV86W3ZNX7 /prometheus/01C9H69SWDA4M3FGE6YN85T0QS]: write compaction: add series: write series data: write /prometheus/01C9HD5XMRJHDC99KSDNEA2EYE.tmp/index: cannot allocate memory"
level=error ts=2018-03-26T15:00:27.290463545Z caller=wal.go:713 component=tsdb msg="sync failed" err="flush buffer: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:28.464523232Z caller=manager.go:398 component="rule manager" group=meta-monitoring msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:33.505556838Z caller=manager.go:398 component="rule manager" group=sspweb msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:33.506484408Z caller=manager.go:398 component="rule manager" group=sspweb msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:33.626388222Z caller=manager.go:398 component="rule manager" group=kube-backup-diff msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:33.62729855Z caller=manager.go:398 component="rule manager" group=kube-backup-diff msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:36.496418035Z caller=manager.go:398 component="rule manager" group=kube-deployment-ready msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"
level=error ts=2018-03-26T15:00:36.845430206Z caller=wal.go:713 component=tsdb msg="sync failed" err="flush buffer: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:37.523571052Z caller=scrape.go:697 component="scrape manager" scrape_pool=prometheus-ng-harvester-scif target="http://tt-k8s1-w3.ng.seznam.cz:32532/federate?match%5B%5D=%7Bjob%3D~%22%28.%2B%29%22%7D" msg="append failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:38.465933212Z caller=manager.go:398 component="rule manager" group=kube2world msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:42.418888605Z caller=scrape.go:697 component="scrape manager" scrape_pool=prometheus-ko-harvester-scif target="http://tt-k8s1-w3.ko.seznam.cz:32532/federate?match%5B%5D=%7Bjob%3D~%22%28.%2B%29%22%7D" msg="append failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"
level=warn ts=2018-03-26T15:00:42.777911277Z caller=manager.go:398 component="rule manager" group=barrels-alerts msg="rule sample appending failed" err="WAL log samples: log series: write /prometheus/wal/000943: cannot allocate memory"

Maybe it's not Prometheus's fault, but if so, could you suggest where to look?
I'm out of ideas about what could have caused this.
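
One note on the error text itself: `cannot allocate memory` is the Linux `ENOMEM` errno string surfaced through Go's `os` package, i.e. the kernel refused the write, not Prometheus's own heap allocator, which is also why the container can hit it without ever being OOM-killed. A minimal sketch (plain Go on Linux, reusing the WAL path from the logs purely as an example, not Prometheus code) of how such an error unwraps:

```go
// Sketch: the "cannot allocate memory" text is the Linux ENOMEM errno,
// surfaced through *os.PathError on the write path. It comes from the
// kernel, not from the Go heap allocator, so the cgroup OOM killer
// never gets involved.
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// isENOMEM unwraps an *os.PathError (or any wrapped error) down to the errno.
func isENOMEM(err error) bool {
	return errors.Is(err, syscall.ENOMEM)
}

func main() {
	// Path taken from the logs above purely as an illustration; on most
	// machines the open simply fails. The point is how the errno is detected.
	f, err := os.OpenFile("/prometheus/wal/000943", os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()

	if _, err := f.Write(make([]byte, 4096)); err != nil && isENOMEM(err) {
		fmt.Println("kernel returned ENOMEM on the write path:", err)
	}
}
```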

@YanjieGao

YanjieGao commented Mar 29, 2018

I'm hitting the same error. Is there a Prometheus parameter we could set to tune memory usage?

@FUSAKLA

Contributor Author

FUSAKLA commented Mar 30, 2018

There are no memory parameters in Prometheus 2.0, so there is nothing to tune.

It happened twice more. I tried flushing all the data, but with no result. Again, it happened without a restart of Prometheus.
There is no OOM kill from Kubernetes, so there is no problem with memory.

@brian-brazil

Member

brian-brazil commented Mar 30, 2018

Your machine is tight on memory, reduce usage or add more.

@FUSAKLA

Contributor Author

FUSAKLA commented Mar 30, 2018

There is 10 GB of free memory on the machine, and the average memory usage of the Prometheus instance is 2 GB. It has 400k time series. Is that really not sufficient?

The documentation no longer has a section on how to determine the required memory for the server. Prometheus has been running on this machine for months with no memory problems; this only started after the 2.2.1 upgrade.

@brian-brazil

Member

brian-brazil commented Mar 30, 2018

Prometheus uses a lot of virtual memory; it's possible something is up with your kernel and it doesn't like that.

@FUSAKLA

Contributor Author

FUSAKLA commented Apr 3, 2018

Thanks for the response... unfortunately I still cannot find that "something".

I have two identical instances (both suffering from this issue) running in different DCs, so I deployed 2.1.0 to one of them and watched both: the 2.2.1 instance failed again, while the issue did not occur on 2.1.0, even though its CPU, memory, and block-reload resource usage is higher due to known issues in 2.1.0.

Were there any changes that could cause a burst in memory usage during compaction? I'm confused by the fact that it started after the 2.2.1 upgrade.

@brian-brazil

Member

brian-brazil commented Apr 3, 2018

I see no changes that would make a difference on Linux. This looks like your kernel refusing to allocate virtual memory; check your overcommit settings.
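
A minimal sketch of what checking those overcommit settings can look like on Linux (it only reads the standard `/proc` interfaces; this is an illustration, not anything Prometheus ships):

```go
// Illustrative sketch: dump the Linux overcommit knobs and the current
// commit accounting, which the kernel consults before refusing an
// allocation with ENOMEM. Assumes the standard /proc paths.
package main

import (
	"fmt"
	"os"
	"strings"
)

func readTrim(path string) string {
	b, err := os.ReadFile(path)
	if err != nil {
		return "unreadable: " + err.Error()
	}
	return strings.TrimSpace(string(b))
}

func main() {
	// 0 = heuristic overcommit (default), 1 = always allow, 2 = strict accounting.
	fmt.Println("vm.overcommit_memory =", readTrim("/proc/sys/vm/overcommit_memory"))
	// Only used in mode 2: CommitLimit is roughly swap + overcommit_ratio% of RAM.
	fmt.Println("vm.overcommit_ratio  =", readTrim("/proc/sys/vm/overcommit_ratio"))

	// Comparing CommitLimit with Committed_AS shows how close the system is
	// to refusing further allocations under strict accounting.
	for _, line := range strings.Split(readTrim("/proc/meminfo"), "\n") {
		if strings.HasPrefix(line, "CommitLimit") || strings.HasPrefix(line, "Committed_AS") {
			fmt.Println(line)
		}
	}
}
```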

@FUSAKLA

Contributor Author

FUSAKLA commented Apr 6, 2018

So I think we can close this. It turned out to be my own silly mistake: I misread the data from cAdvisor (I had mixed together data from multiple namespaces). Sorry for bothering you with this.

After correcting the resources, everything seems to be fine again. Thanks!
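
For anyone else who misreads per-pod figures the same way, a hedged sketch of scoping the cAdvisor memory metric to a single namespace via the Prometheus HTTP API (the metric and label names, the pod regex, and the localhost address are assumptions about this kind of setup, not taken from this issue):

```go
// Illustrative only: query cAdvisor container memory through the Prometheus
// HTTP API, restricted to one namespace so figures from different namespaces
// cannot be mixed together.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Restrict to the monitoring namespace so only the collector pod is summed.
	query := `sum by (pod) (container_memory_working_set_bytes{namespace="monitoring-production", pod=~"prometheus-collector.*"})`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```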

@FUSAKLA FUSAKLA closed this Apr 6, 2018

@lock

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019
