Reporter seems to get into a bad state #10
Comments
Bah worst kind of bug ever :) Is the data failing to show up in Metrics for this source also? |
Actually, there don't appear to be any gaps in the data. Next time it happens I'll leave it in that state for a while and verify. |
So it is now in a bad state on 2 of the 4 processes, and unfortunately there are now gaps in the data. |
Damn it, I guess now I have to look :) If you could take a thread dump of one of those processes and put it in a gist, I'd appreciate it.. Not sure it'll actually help, but there's always a chance.. Also, could you let me know what version of Netty is being pulled in? |
That will be a bit difficult since it's a Heroku app, unfortunately. Is it possibly "stuck" because the Librato API doesn't like the metrics it's posting and times out, and then the reporter continually retries the "poison message"? Maybe it should discard the current batch after X failures? It's a Play Framework 2.0.2 app, and the dependencies that get pulled in are AHC 1.7.4 and Netty 3.4.1.Final. |
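(A minimal sketch of the "discard after X failures" idea above, not the library's actual behavior; the Batch type and postToLibrato method are hypothetical stand-ins.)

```java
// Sketch only: give up on a batch after a few consecutive failures instead of
// retrying the same "poison" payload forever. Batch/postToLibrato are placeholders.
public class DiscardingPoster {
    private static final int MAX_ATTEMPTS = 3;

    public void post(Batch batch) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                postToLibrato(batch);   // may throw on timeout / 4xx / 5xx
                return;                 // success; done with this batch
            } catch (Exception e) {
                if (attempt == MAX_ATTEMPTS) {
                    // Discard the batch so the next reporting interval starts clean.
                    System.err.println("Dropping batch after " + attempt + " failed attempts: " + e);
                }
            }
        }
    }

    private void postToLibrato(Batch batch) throws Exception { /* placeholder */ }

    static class Batch { /* pending measurements */ }
}
```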
Or perhaps there are too many metrics being posted? Should the metrics post be chunked up into groups of 100 or so? |
Ahh, looks like it's doing that already... |
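(For illustration only, since the library already chunks as noted above: splitting a batch into posts of at most 100 measurements can look roughly like this. The class and method names are placeholders, not the library's API.)

```java
import java.util.List;

// Illustration: split a batch into posts of at most `chunkSize` entries,
// e.g. postInChunks(allMeasurements, 100). The library does something equivalent internally.
public final class Chunker {
    public static <T> void postInChunks(List<T> measurements, int chunkSize) {
        for (int start = 0; start < measurements.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, measurements.size());
            List<T> chunk = measurements.subList(start, end);
            // each `chunk` would become one HTTP POST to the metrics API
            System.out.println("Posting chunk of " + chunk.size() + " measurements");
        }
    }
}
```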
I'm pretty sure batches that error out get thrown away, but now that you mention it, maybe there's a bug in how that's handled?.. Let me summon @josephruscio and see if there are any error logs for your account. Would certainly explain why it degenerates over time. Does it still log how many metrics it attempted to post at each interval if there's an error? (don't have time to look at the code right now) It does batch already, as you found. |
Yeah, looks like it should, but unfortunately I don't have debug logging on atm... can turn that on |
Ok, the app's restarted, so it's not in a bad state now, and the batches are only 145 metrics |
scott@heroku.com thanks, the gap times were ~9:30 to 10:45am this morning |
Awesome, thanks, I'll watch for it on my side. |
@josephruscio one process is in a bad state since 2:27pm; are you seeing anything? |
@sclasen @mihasya just checked. Looks like there are maybe 4 dynos pushing in stats? Appears to be 4 IP addresses repeating in our logs. 3 of them are still reporting in, but one of them stopped making requests after 14:26:42 ... which would correspond with 2:27pm. So I think this is definitely wedged client-side somewhere. There are no errors preceding the cessation of requests. /cc @mheffner |
Thanks! This probably has nothing to do with it but...the process that is failing is the only one attempting to post more than 200 metrics, at 216. The rest are 199 or less. |
Hmm. Aha? I got another process over 200 metrics and it's barfing now too! |
Aha, pushed another over 200, now it's hosed |
aaaand pushed the last one over 200...now all hosed |
Excellent info @sclasen Questions.
|
OK, could you try setting the timeout to 25 seconds then? It's possible that I'm not handling the timeout error correctly, causing the client to get into a crappy state, and 200 might just be the magic number that causes your post to Librato to take longer than 5s. EDIT: if this fixes your particular problem, the bug changes to "Timeout errors eventually send the client to a shitty terrible death" |
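(A rough sketch of what a 25-second timeout looks like on async-http-client 1.7.x, which the reporter uses under the hood; whether this version of metrics-librato exposes a timeout setting or lets you inject a client is an assumption to verify.)

```java
import com.ning.http.client.AsyncHttpClient;
import com.ning.http.client.AsyncHttpClientConfig;

public class TimeoutConfigSketch {
    public static AsyncHttpClient buildClient() {
        // Raise the per-request timeout from ~5s to 25s (async-http-client 1.7.x API).
        AsyncHttpClientConfig config = new AsyncHttpClientConfig.Builder()
                .setRequestTimeoutInMs(25000)
                .build();
        return new AsyncHttpClient(config);
    }
}
```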
👍 Trying it out now. |
Gah, some are working over 200 but this just happened
|
Interesting.. so I just noticed one other detail: you're running |
⚡ that did it, so far so good anyhow:
2012-10-03T23:02:33+00:00 app[web.4]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 274 measurements |
I should add timing data to those logs so we can have an idea of how long a normal post actually takes. Joe seems to not be seeing anything above ~3sec for the posts. Let me know if it comes back. That was a very strange bug that I still don't fully understand; I think the way the http client handles basic auth is just buggy. The fix was to just manually generate the base64 auth header and set it instead of passing the creds to the client. Computers, how do they.. |
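(A minimal sketch of the manual basic-auth approach described above: build the Authorization header yourself rather than handing the credentials to the client. The endpoint URL, credential names, and use of DatatypeConverter are illustrative assumptions, not the library's exact code.)

```java
import com.ning.http.client.AsyncHttpClient;
import com.ning.http.client.Response;

import javax.xml.bind.DatatypeConverter;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class ManualAuthSketch {
    // Manually build the "Authorization: Basic ..." header instead of passing creds
    // to the HTTP client's realm/auth handling. Placeholders throughout.
    public static Response post(AsyncHttpClient client, String user, String token, String json)
            throws Exception {
        String creds = user + ":" + token;
        String authHeader = "Basic " + DatatypeConverter.printBase64Binary(creds.getBytes("UTF-8"));
        Future<Response> future = client.preparePost("https://metrics-api.librato.com/v1/metrics")
                .addHeader("Authorization", authHeader)
                .addHeader("Content-Type", "application/json")
                .setBody(json)
                .execute();
        return future.get(25, TimeUnit.SECONDS);
    }
}
```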
Thanks for the help! |
The reporter seems to get into a bad state from which it does not recover.
The below stack trace occurs on the reporting interval.
In this case there are 4 processes running and only one exhibits this behavior.
Without a restart it continues to produce this on the reporting interval, indefinitely. Upon restart it's fine.
2012-10-01T22:59:12+00:00 app[web.1]: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: No response received after 5
2012-10-01T22:59:12+00:00 app[web.1]: at com.ning.http.client.providers.netty.NettyResponseFuture.get(NettyResponseFuture.java:223) ~[async-http-client-1.7.4.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.postPortion(LibratoBatch.java:131) [librato-java-0.0.5.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.post(LibratoBatch.java:115) [librato-java-0.0.5.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at com.librato.metrics.LibratoReporter.run(LibratoReporter.java:75) [metrics-librato-2.1.2.3.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.lang.Thread.run(Thread.java:636) [na:1.6.0_20]
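For context, a minimal sketch (not the library's code) of how an async-http-client 1.7.x call surfaces this failure mode: with a roughly 5-second request timeout, a POST that never gets a response makes the future's get() throw an ExecutionException caused by a TimeoutException. The URL here is a placeholder.

```java
import com.ning.http.client.AsyncHttpClient;
import com.ning.http.client.AsyncHttpClientConfig;
import com.ning.http.client.Response;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class TimeoutRepro {
    public static void main(String[] args) throws Exception {
        AsyncHttpClientConfig config = new AsyncHttpClientConfig.Builder()
                .setRequestTimeoutInMs(5000)   // mirrors the ~5s limit in the trace above
                .build();
        AsyncHttpClient client = new AsyncHttpClient(config);
        Future<Response> future = client.preparePost("http://example.com/slow-endpoint")
                .setBody("{}")
                .execute();
        try {
            Response response = future.get();  // blocks; fails if nothing arrives within 5s
            System.out.println("status=" + response.getStatusCode());
        } catch (ExecutionException e) {
            // cause is e.g. TimeoutException: "No response received after 5"
            System.err.println("post failed: " + e.getCause());
        } finally {
            client.close();
        }
    }
}
```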