Crash recovery OOM kills prometheus-server container #4833

Closed
estahn opened this Issue Nov 7, 2018 · 6 comments

estahn commented Nov 7, 2018

[Screenshot: container memory usage graph]

As shown, the container reaches its 30GB memory limit and is then OOM-killed.

Logs:

level=info ts=2018-11-06T23:54:07.953888964Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1539388800000 maxt=1539410400000 ulid=01CSPAG4D1S2PP200F5S3GJ88A
level=info ts=2018-11-06T23:54:07.954345953Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1539410400000 maxt=1539432000000 ulid=01CSPZ38NCPGFATD9TC6BYVDQK
level=info ts=2018-11-06T23:54:07.954769258Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1539432000000 maxt=1539453600000 ulid=01CSQKPF927H8YWZV020XCS58A
level=info ts=2018-11-06T23:54:07.955252927Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1539453600000 maxt=1539475200000 ulid=01CSR89QY9KFJZ5Q71CJY4N9QR
level=info ts=2018-11-06T23:54:07.95575467Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1539475200000 maxt=1539496800000 ulid=01CSRWWXW5HBBJT2G1EBW1WC02
...

Configuration:
[Screenshot: Prometheus container configuration]

WAL:

du -khs wal/
4.1G	wal/

Remediation:
Removing the wal directory and restarting the container. We lost around 5 hours of data.
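
For reference, a minimal sketch of that remediation on a Kubernetes-based install (the monitoring namespace, pod name, and /data mount path are assumptions; adjust them for your deployment):

# Remove the WAL so Prometheus skips crash recovery on the next start.
# Warning: this discards any samples not yet compacted into a persistent block.
kubectl exec -n monitoring prometheus-server-0 -- rm -rf /data/wal

# Restart the container so it comes back up with an empty WAL.
kubectl delete pod -n monitoring prometheus-server-0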

Possibly related to #2139

Originally posted by @estahn in #4047 (comment)

simonpasquier commented Nov 9, 2018

Do you have any logs before the crash? Do you have a copy of the wal directory that crashed the Prometheus server?

estahn commented Nov 12, 2018

@simonpasquier Please find all logs for that day attached.

search-results-2018-11-11T17_32_02.549-0800.csv.zip

krasi-georgiev commented Nov 12, 2018

I don't see any errors in the logs indicating a bug.
The only reason I can think of for this OOM is that your scraping load has increased or that you have a problem with high cardinality (a large variation in label values).

The WAL is almost an exact copy of the data currently loaded in RAM, so increased scraping load or high cardinality leads to both higher memory usage and a larger WAL. If the RAM demand is high enough that Prometheus is OOM-killed, it will load the same WAL again at the next startup, reach the same RAM usage, and as a result be killed again.
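
As a sketch for narrowing this down (the localhost:9090 address is an assumption; point it at your Prometheus server), you can query the HTTP API for the head series count and the metrics contributing the most series:

# Total series currently held in the head block (roughly what the WAL mirrors).
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_tsdb_head_series'

# The ten metric names with the most series, a common sign of a cardinality blow-up.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, count by (__name__)({__name__=~".+"}))'

A steadily growing prometheus_tsdb_head_series is usually the first hint that label cardinality, and with it the WAL and memory usage, is increasing.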

simonpasquier commented Nov 12, 2018

I also suspect label cardinality getting out of control. Is the issue recurring?

estahn commented Nov 12, 2018

@krasi-georgiev @simonpasquier Indeed, I don't see any errors either. The only explanation I can think of is that the data in the WAL expands to more than 6 times its on-disk size when loaded into memory.

After deleting the WAL directory it recovered nicely and we haven't had the issue since.

I don't remember making any changes affecting label cardinality.

If this cannot be solved or investigated with the current information, I will close this ticket and reopen it once it happens again.

simonpasquier commented Nov 13, 2018

Closing it for now to keep the backlog manageable. Please reopen if it happens again (having a copy of the buggy wal directory would help).
