
load_gen memory growth and ultimate out-of-memory failure. #937

Closed
jrgm opened this issue Jan 23, 2012 · 4 comments

Comments

@jrgm
Contributor

jrgm commented Jan 23, 2012

While running bin/load_gen from client4 against the stage servers over this past weekend, I noticed that RSS had grown into the GB range. (Sorry, I wasn't tracking RSS growth systematically; next time.)

Here is the related QPS graph for this period of time:
http://pencil.scl2.stage.svc.mozilla.com/dash/stage/bid_webhead_wsapi?start=2012-01-20+15%3A00%3A00&duration=48+hours&width=1000&height=400

I suspect that the growth in RSS and the decline in QPS (GH-875) are two sides of the same coin (at least partially).

Ultimately the load_gen run died with these last few messages (timestamps provided by a tai64n filter that was tailing the logfile output):
2012-01-22 13:18:03.938622500 200620.80 153967.20 182433.73 695 R, 24 S (32sp 12rs 6al 12rh 633sn 0iy 0cs)
2012-01-22 13:18:05.714453500 88231.68 140820.09 180863.70 695 R, 37 S (32sp 12rs 6al 12rh 632sn 0iy 1cs)
2012-01-22 13:18:09.957594500 393904.08 191436.89 184414.37 695 R, 53 S (32sp 12rs 6al 12rh 632sn 0iy 1cs)
2012-01-22 13:18:11.682048500 160496.64 185248.84 184015.74 695 R, 0 S (32sp 12rs 6al 12rh 632sn 0iy 1cs)
2012-01-22 13:18:15.943881500 27604.80 153720.03 181408.89 695 R, 15 S (33sp 12rs 6al 12rh 631sn 0iy 1cs)
2012-01-22 13:18:47.097894500 FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory

@jrgm
Contributor Author

jrgm commented Jan 23, 2012

Two things to note in the graph:

  1. The gap in the graph at about 12am Saturday was due to a failure in the stage environment.
  2. At about 10am Sunday morning, QPS drops precipitously. I wasn't looking at that time, but it smells like GC thrash may have begun (/me waves hands).

ghost assigned lloyd Jan 25, 2012
@lloyd
Contributor

lloyd commented Jan 25, 2012

I think there is a high chance that this is the root cause of issue #875.

It would be useful to sample load_gen's RES size and CPU usage at hourly intervals. If the growing heap is causing load_gen to lose efficiency (we've seen this before), then we should see CPU usage climb toward 100% as time goes on.

I think the next step is to sample hourly, and then, based on the data, we'll figure out an approach.
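
As a rough illustration of the hourly sampling suggested above, a minimal sketch might look like the following (not part of this repo; the script name, PID argument, output file, and reliance on `ps` are all assumptions):

```ts
// sample_loadgen.ts -- hypothetical helper, not part of this repo.
// Samples the RSS (KB) and CPU% of a running load_gen process once an hour,
// assuming a Linux host where `ps` is on the PATH and the load_gen PID is
// passed as the first argument.
import { execFileSync } from "child_process";
import { appendFileSync } from "fs";

const pid = process.argv[2];
if (!pid) {
  console.error("usage: node sample_loadgen.js <load_gen pid>");
  process.exit(1);
}

const HOUR_MS = 60 * 60 * 1000;

function sample(): void {
  // `ps -o rss=,%cpu=` prints resident set size (KB) and CPU% with no header.
  const out = execFileSync("ps", ["-o", "rss=,%cpu=", "-p", pid], {
    encoding: "utf8",
  }).trim();
  const [rssKb, cpuPct] = out.split(/\s+/);
  appendFileSync(
    "loadgen_samples.log",
    `${new Date().toISOString()} rss_kb=${rssKb} cpu_pct=${cpuPct}\n`
  );
}

sample();                     // take one sample immediately
setInterval(sample, HOUR_MS); // then one per hour
```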

@seanmonstar
Contributor

Closing this. If it's still a problem, please reopen.

@jrgm
Copy link
Contributor Author

jrgm commented Sep 24, 2012

I ran some numbers a while back, and the steady growth appears to be largely accounted for by the growth in the in-memory user population (each user holds multiple certs in its struct). A simple workaround is to restart after a couple of days, before memory is exhausted, if I want longer-duration runs. (A fancier fix would be to use some out-of-process store for these items.) Verifying as "it is understood, and there is a workaround".
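
For illustration only, the growth pattern described above looks roughly like this (the names and shapes here are hypothetical, not load_gen's actual structures):

```ts
// Hypothetical sketch of the growth pattern described above; the real
// load_gen structures may differ. Every synthetic user created during the
// run is kept in an in-process map, and each user accumulates several certs,
// so the heap grows for as long as the run keeps adding users.
interface SimUser {
  email: string;
  certs: string[]; // multiple certs retained per user
}

const users = new Map<string, SimUser>(); // lives for the entire run

function addUser(email: string, certs: string[]): void {
  // Nothing is ever evicted, so a multi-day run grows until the V8 heap
  // limit is hit. Restarting every couple of days (the workaround above)
  // resets this map; an out-of-process store for these entries would keep
  // the generator's own heap bounded instead.
  users.set(email, { email, certs });
}
```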
