
load_gen memory growth and ultimate out-of-memory failure. #937

Closed
jrgm opened this issue Jan 23, 2012 · 4 comments

Comments

@jrgm
Contributor

jrgm commented Jan 23, 2012

While running bin/load_gen from client4 against the stage servers over this past weekend, I noticed that RSS had grown into the GB range. (Sorry, I wasn't tracking RSS growth systematically; next time.)

Here is the related QPS graph for this period of time:
http://pencil.scl2.stage.svc.mozilla.com/dash/stage/bid_webhead_wsapi?start=2012-01-20+15%3A00%3A00&duration=48+hours&width=1000&height=400

I suspect that the growth in RSS and the decline in QPS (GH-875) are two sides of the same coin (at least partially).

Ultimately the load_gen run died with these last few messages (timestamps provided by a tai64n filter that was tailing the logfile output):
2012-01-22 13:18:03.938622500 200620.80 153967.20 182433.73 695 R, 24 S (32sp 12rs 6al 12rh 633sn 0iy 0cs)
2012-01-22 13:18:05.714453500 88231.68 140820.09 180863.70 695 R, 37 S (32sp 12rs 6al 12rh 632sn 0iy 1cs)
2012-01-22 13:18:09.957594500 393904.08 191436.89 184414.37 695 R, 53 S (32sp 12rs 6al 12rh 632sn 0iy 1cs)
2012-01-22 13:18:11.682048500 160496.64 185248.84 184015.74 695 R, 0 S (32sp 12rs 6al 12rh 632sn 0iy 1cs)
2012-01-22 13:18:15.943881500 27604.80 153720.03 181408.89 695 R, 15 S (33sp 12rs 6al 12rh 631sn 0iy 1cs)
2012-01-22 13:18:47.097894500 FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

@jrgm
Contributor Author

jrgm commented Jan 23, 2012

Two things to note in the graph:

  1. The gap in the graph at about 12am Saturday was due to a failure in the stage environment.
  2. At about 10am Sunday morning, QPS drops precipitously. I wasn't looking at that time, but it smells like GC thrash may have begun (/me waves hands).

ghost assigned lloyd Jan 25, 2012
@lloyd
Contributor

lloyd commented Jan 25, 2012

I think there is a high chance that this is the root cause of issue #875.

It would be useful to sample load_gen's RES size and CPU usage at hourly intervals. If the growing heap is causing load_gen to lose efficiency (we've seen this before), then we should see CPU usage climb toward 100% as time goes on.

I think the next step is to sample hourly, and then, based on the data, we'll figure out an approach.
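
As a rough illustration of the hourly sampling suggested above, a minimal sketch might look like the following (not part of this repo; the script name, PID argument, output file, and reliance on `ps` are all assumptions):

```ts
// sample_loadgen.ts -- hypothetical helper, not part of this repo.
// Samples the RSS (KB) and CPU% of a running load_gen process once an hour,
// assuming a Linux host where `ps` is on the PATH and the load_gen PID is
// passed as the first argument.
import { execFileSync } from "child_process";
import { appendFileSync } from "fs";

const pid = process.argv[2];
if (!pid) {
  console.error("usage: node sample_loadgen.js <load_gen pid>");
  process.exit(1);
}

const HOUR_MS = 60 * 60 * 1000;

function sample(): void {
  // `ps -o rss=,%cpu=` prints resident set size (KB) and CPU% with no header.
  const out = execFileSync("ps", ["-o", "rss=,%cpu=", "-p", pid], {
    encoding: "utf8",
  }).trim();
  const [rssKb, cpuPct] = out.split(/\s+/);
  appendFileSync(
    "loadgen_samples.log",
    `${new Date().toISOString()} rss_kb=${rssKb} cpu_pct=${cpuPct}\n`
  );
}

sample();                     // take one sample immediately
setInterval(sample, HOUR_MS); // then one per hour
```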

@seanmonstar
Contributor

Closing this. If it's still a problem, please reopen.

@jrgm
Copy link
Contributor Author

jrgm commented Sep 24, 2012

I ran some numbers a while back, and the steady growth appears to be largely accounted for by the growth in the in-memory user population (each user holds multiple certs in its struct). A simple workaround is to restart after a couple of days, before memory is exhausted, if I want longer-duration runs. (A fancier fix would be to use some out-of-process store for these items.) Verifying as "it is understood, and there is a workaround".
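
For illustration only, the growth pattern described above looks roughly like this (the names and shapes here are hypothetical, not load_gen's actual structures):

```ts
// Hypothetical sketch of the growth pattern described above; the real
// load_gen structures may differ. Every synthetic user created during the
// run is kept in an in-process map, and each user accumulates several certs,
// so the heap grows for as long as the run keeps adding users.
interface SimUser {
  email: string;
  certs: string[]; // multiple certs retained per user
}

const users = new Map<string, SimUser>(); // lives for the entire run

function addUser(email: string, certs: string[]): void {
  // Nothing is ever evicted, so a multi-day run grows until the V8 heap
  // limit is hit. Restarting every couple of days (the workaround above)
  // resets this map; an out-of-process store for these entries would keep
  // the generator's own heap bounded instead.
  users.set(email, { email, certs });
}
```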
