Occasional unbounded memory growth (not disk growth) at large scales, leading to OOM #3640

Closed
smarterclayton opened this Issue Dec 30, 2017 · 9 comments


smarterclayton commented Dec 30, 2017

We run a number of large Prometheus instances on several large Kubernetes clusters (4GB to 20GB RSS, 1M to 3M series, 400k to 1.5M samples scraped, 160 to 600 scrape targets, between 10GB and 50GB of disk use for 3-day retention) on 2.0.0, and occasionally some of the instances transition from steady-state memory usage to memory growth that continues (without correspondingly large disk growth) until the server OOMs, at which point a restart shows the same growth behavior. The only "fix" at that point is to clear the historical data, after which the instance usually returns to its normal steady-state memory usage pattern.

This is a graph of Prometheus RSS across the 5 separate clusters over the last week. Each spike corresponds to a transition into the crash-looping state (since these run on Kubernetes, the instances are simply restarted every 5-10 minutes and OOM continuously).

[graph of memory use]

I suspect that this is due to unbounded series growth, or some other temporary phenomenon on the scrape targets, but I have not yet been able to capture a series count from a server that is experiencing the unbounded growth, and the logs are being purged by the container after the first OOM restart. I have a 4GB snapshot of the data directory of an instance that is experiencing the problem, but cannot share it publicly because it contains customer data.

On the largest server (after a restart with an empty datadir) I was seeing this recur, but the series count query count({__name__=~".+"}) appears to time out (even when queried locally). Is there a way to perform the same count from disk?
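
For reference, this is roughly how such a count can be issued against the HTTP API (a sketch only; it assumes the default localhost:9090 listen address):

$ curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=count({__name__=~".+"})'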

The logs from that restart (where RSS grew from 18GB to 24GB over a 10-minute window, with only 4GB on disk):

level=info ts=2017-12-30T21:54:43.134979197Z caller=main.go:215 msg="Starting Prometheus" version="(version=2.0.0, branch=HEAD, revision=0a74f98628a0463dddc90528220c94de5032d1a0)"
level=info ts=2017-12-30T21:54:43.135095621Z caller=main.go:216 build_context="(go=go1.8.3, user=root@prometheus-binary-7-build, date=20171108-16:47:03)"
level=info ts=2017-12-30T21:54:43.135131371Z caller=main.go:217 host_details="(Linux 3.10.0-693.12.1.el7.x86_64 #1 SMP Mon Nov 6 10:54:56 EST 2017 x86_64 prometheus-0 (none))"
level=info ts=2017-12-30T21:54:43.13998242Z caller=web.go:380 component=web msg="Start listening for connections" address=localhost:9090
level=info ts=2017-12-30T21:54:43.140359599Z caller=main.go:314 msg="Starting TSDB"
level=info ts=2017-12-30T21:54:43.140430794Z caller=targetmanager.go:71 component="target manager" msg="Starting target manager..."
level=info ts=2017-12-30T21:54:43.158353051Z caller=main.go:326 msg="TSDB started"
level=info ts=2017-12-30T21:54:43.158476328Z caller=main.go:394 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2017-12-30T21:54:43.161863296Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2017-12-30T21:54:43.164207946Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2017-12-30T21:54:43.181127738Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2017-12-30T21:54:43.183168902Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2017-12-30T21:54:43.185283111Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2017-12-30T21:54:43.187482513Z caller=kubernetes.go:100 component="target manager" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2017-12-30T21:54:43.190165246Z caller=main.go:371 msg="Server is ready to receive requests."
level=info ts=2017-12-30T21:57:00.492261749Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1514670840000 maxt=1514670960000
level=info ts=2017-12-30T21:58:33.259676136Z caller=head.go:345 component=tsdb msg="head GC completed" duration=4.688191043s
level=info ts=2017-12-30T21:58:38.318069405Z caller=head.go:354 component=tsdb msg="WAL truncation completed" duration=5.058271039s
level=info ts=2017-12-30T21:59:00.656197214Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1514670960000 maxt=1514671080000
level=info ts=2017-12-30T22:00:33.422101548Z caller=head.go:345 component=tsdb msg="head GC completed" duration=3.65449887s
level=info ts=2017-12-30T22:00:49.237691707Z caller=head.go:354 component=tsdb msg="WAL truncation completed" duration=15.815469424s
level=info ts=2017-12-30T22:01:00.652890237Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1514671080000 maxt=1514671200000
level=info ts=2017-12-30T22:03:04.667729942Z caller=head.go:345 component=tsdb msg="head GC completed" duration=5.020881136s
level=info ts=2017-12-30T22:03:28.236871879Z caller=head.go:354 component=tsdb msg="WAL truncation completed" duration=23.569014736s
level=info ts=2017-12-30T22:03:31.362141958Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1514671200000 maxt=1514671320000
level=info ts=2017-12-30T22:05:11.790237476Z caller=head.go:345 component=tsdb msg="head GC completed" duration=3.046542087s
level=info ts=2017-12-30T22:05:33.311326462Z caller=head.go:354 component=tsdb msg="WAL truncation completed" duration=21.520988142s
level=info ts=2017-12-30T22:05:36.378422528Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1514671320000 maxt=1514671440000
level=error ts=2017-12-30T22:06:56.530958212Z caller=stdlib.go:89 component=web caller="http: panic serving 127.0.0.1:37986" msg="runtime error: invalid memory address or nil pointer dereference"
level=info ts=2017-12-30T22:06:58.882405622Z caller=head.go:345 component=tsdb msg="head GC completed" duration=2.717897098s
level=info ts=2017-12-30T22:07:18.066909037Z caller=head.go:354 component=tsdb msg="WAL truncation completed" duration=19.184404672s
level=info ts=2017-12-30T22:07:25.623824749Z caller=compact.go:361 component=tsdb msg="compact blocks" count=1 mint=1514671440000 maxt=1514671560000

The panic appears to have occurred when querying the total series count. Subsequent series count queries also failed. When triggering a restart with a kill, I observed the following added to the logs:

level=warn ts=2017-12-30T22:43:57.660498333Z caller=main.go:377 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2017-12-30T22:43:57.663691399Z caller=main.go:384 msg="See you next time!"

but the process continued running without exiting, and I ended up having to force-terminate it. On restart, I experienced the same slow memory growth with high CPU usage (2+ cores, vs. 0.5-0.75 normally). A series count query failed with:

{"status":"error","errorType":"timeout","error":"query timed out in expression evaluation"}

smarterclayton commented Dec 30, 2017

On attempting to kill that subsequent restart, I saw:

level=warn ts=2017-12-30T23:00:07.581727457Z caller=main.go:377 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2017-12-30T23:00:07.582578736Z caller=main.go:384 msg="See you next time!"
level=error ts=2017-12-30T23:00:47.117678574Z caller=scrape.go:746 component="target manager" scrape_pool=kubernetes-nodes target=https://172.31.58.199:10250/metrics msg="stale append failed" err="out of bounds"
level=error ts=2017-12-30T23:00:47.203257319Z caller=scrape.go:746 component="target manager" scrape_pool=kubernetes-nodes target=https://172.31.49.42:10250/metrics msg="stale append failed" err="out of bounds"
level=error ts=2017-12-30T23:00:47.497900983Z caller=scrape.go:746 component="target manager" scrape_pool=kubernetes-nodes target=https://172.31.54.173:10250/metrics msg="stale append failed" err="out of bounds"
level=error ts=2017-12-30T23:00:48.012672694Z caller=scrape.go:746 component="target manager" scrape_pool=kubernetes-nodes target=https://172.31.54.77:10250/metrics msg="stale append failed" err="out of bounds"
... // many more

This process then ran out of memory and was restarted; I captured a pprof dump:
pprof.txt
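
For reference, a sketch of how such a heap dump can be grabbed while the server is still answering on localhost:9090, using the standard Go pprof endpoint that Prometheus exposes:

$ curl -s http://localhost:9090/debug/pprof/heap > heap.pprof
$ go tool pprof -top heap.pprof    # or: go tool pprof /path/to/prometheus heap.pprof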


simonpasquier commented Jan 2, 2018

> On the largest server (after a restart with an empty datadir) I was seeing this recur, but the series count query count({__name__=~".+"}) appears to time out (even when queried locally). Is there a way to perform the same count from disk?

There's the tsdb CLI, which can provide some insights, provided that you've got access to the data directory:

$ tsdb ls /data/
BLOCK ULID                  MIN TIME       MAX TIME       NUM SAMPLES  NUM CHUNKS  NUM SERIES
01C0VHCTFK60WYJ7A5YQP52EZG  1512432000000  1512496800000  0            0           0
01C0VHCTMC18CPHJ94VVC1VKAV  1512496800000  1512561600000  0            0           0
01C0VHCTR5XG72TK81283CKM5V  1512561600000  1512626400000  0            0           0
01C0VHCTVWAVHCRBVDENW5CESC  1512626400000  1512691200000  0            0           0
01C1T3GCRNZDKWKB4X8DDT8KZP  1512691200000  1512756000000  8658         669         669
01C1T3GD4DVVRXCQCVQJQ6C9F5  1512756000000  1512820800000  0            0           0
01C1T3GDE2K9ST676FN90VXGST  1512820800000  1512885600000  0            0           0
01C1T3GDMC8X5WVXWE9ABJG2SV  1512885600000  1512950400000  0            0           0
01C1T3GDSQE2337PJBDHVM6088  1512950400000  1513015200000  0            0           0
01C1T3GDYGSAQKX31NNG0C7HTF  1513015200000  1513080000000  0            0           0
01C1T3GE4NAZ9N9QGSHBQYSSZS  1513080000000  1513144800000  0            0           0
01C1T3GED1CN0QXVE40N8PF3JX  1513144800000  1513209600000  0            0           0
01C1T3GEN4ZEFYX6XVVJ9RNT5J  1513209600000  1513274400000  0            0           0
01C1T3GEYP4RYFG8N1DKT0QFRH  1513274400000  1513339200000  0            0           0
01C1T3GFABSTXJ0E6TB5AYV9D9  1513339200000  1513404000000  0            0           0
01C1T3GFHX5QDBCWBEHNN44AGP  1513404000000  1513468800000  0            0           0
01C1T3GFR6ERBWG1R15A6347BJ  1513468800000  1513533600000  0            0           0
01C1T3GG217NHSWZ5WF2WT29C9  1513533600000  1513598400000  0            0           0
01C1T3GGBH2SFEBJHFVYRZ99QD  1513598400000  1513663200000  0            0           0
01C1T3GGKH8H1Q4NE120DW98XG  1513663200000  1513728000000  0            0           0
01C1T3GC6MC0HG2K9XK97W5DBD  1513728000000  1513749600000  0            0           0
01C1T3GCCX475PB1XXHK9ZTR44  1513749600000  1513771200000  0            0           0
01C1T70QZXSZT3GJ0QW0X74AFP  1513771200000  1513778400000  5560         696         696

nipuntalukdar commented Jan 2, 2018

@simonpasquier just curious, do we really need the directories with zero samples and zero chunks? I did a small fix to avoid that, but I'm not sure if the problem of "empty sample" directories is that frequent.


simonpasquier commented Jan 2, 2018

@nipuntalukdar that is my local env, which is frequently started/stopped, so that explains the zero samples & chunks...


pgier commented Jan 2, 2018

Here is the output of "tsdb ls" on the problematic cluster:

BLOCK ULID                  MIN TIME       MAX TIME       NUM SAMPLES  NUM CHUNKS  NUM SERIES
01C2MPV6ECDJ31MM27Y7QJK459  1514670840000  1514670960000  1765718      1432583     1432583
01C2MPYVSGP52M25HKMAYHSBNM  1514670960000  1514671080000  2989243      1499671     1499671
01C2MQ2GZCSX6PTED5XRTY98WQ  1514671080000  1514671200000  2984685      1502045     1502045
01C2MQ7452T3P2KEVEC3KK9X9K  1514671200000  1514671320000  2964502      1488809     1488809
01C2MQAY7TFBDX7V5T73DRSC36  1514671320000  1514671440000  2984268      1497455     1497455
01C2MT4RFXBQ9YZFM00MXJ7NPC  1514671440000  1514671560000  2976643      1497198     1497198
01C2MTM45GBX7S262DRCFTVKYS  1514671560000  1514671680000  3000406      1514643     1514643
01C2MV34QQT7BXXK3QYSRX7Z5M  1514671680000  1514671800000  2960699      1490544     1490544

brian-brazil commented Jan 2, 2018

@pgier Why do you have 120s blocks? The default is 2 hours, and we don't expect users to change that.


smarterclayton commented Jan 3, 2018

That was legacy from a pre-2.0.0 release where the block size wasn't defaulted. Will test with it removed.
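
For context, a sketch of the kind of legacy override this refers to; the flags are the hidden tsdb block-duration flags in 2.x, and the exact values shown are illustrative assumptions rather than the actual manifest. Dropping them restores the 2h default block duration:

$ prometheus --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.min-block-duration=2m \
    --storage.tsdb.max-block-duration=2m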


smarterclayton commented Jan 9, 2018

After removing the block size setting (to take the default), we have gone 4 days without this issue across 5 clusters (previously it was occurring once every day or two). Closing, thanks for the spot.


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators on Mar 23, 2019
