
Reporter seems to get into a bad state #10

Closed
sclasen opened this issue Oct 1, 2012 · 29 comments

Comments

@sclasen

sclasen commented Oct 1, 2012

The reporter seems to get into a bad state from which it does not recover.

The below stack trace occurs on the reporting interval.

In this case there are 4 processes running and only one exhibits this behavior.

Without a restart it continues to produce this on every reporting interval, indefinitely. Upon restart it's fine.

2012-10-01T22:59:12+00:00 app[web.1]: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: No response received after 5
2012-10-01T22:59:12+00:00 app[web.1]: at com.ning.http.client.providers.netty.NettyResponseFuture.get(NettyResponseFuture.java:223) ~[async-http-client-1.7.4.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.postPortion(LibratoBatch.java:131) [librato-java-0.0.5.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.post(LibratoBatch.java:115) [librato-java-0.0.5.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at com.librato.metrics.LibratoReporter.run(LibratoReporter.java:75) [metrics-librato-2.1.2.3.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.lang.Thread.run(Thread.java:636) [na:1.6.0_20]

@mihasya
Contributor

mihasya commented Oct 1, 2012

Bah worst kind of bug ever :) Is the data failing to show up in Metrics for this source also?

@sclasen
Author

sclasen commented Oct 1, 2012

Actually there don't appear to be any gaps in the data. Next time it happens I'll leave it in that state for a while and verify.

@sclasen
Author

sclasen commented Oct 3, 2012

So it is now in a bad state on 2 of the 4 processes, and unfortunately there are now gaps in the data.

@mihasya
Contributor

mihasya commented Oct 3, 2012

Damn it, I guess now I have to look :)

If you could take a thread dump of one of those processes and put it in a gist, I'd appreciate it.. Not sure it'll actually help, but there's always a chance..

Also, could you let me know what version of Netty is being pulled in?

@sclasen
Author

sclasen commented Oct 3, 2012

That will be a bit difficult since it's a Heroku app, unfortunately.

Is it possibly "stuck" because the Librato API doesn't like the metrics it's posting and times out, and then the reporter continually retries the "poison message"?

Maybe it should discard the current batch after X failures?

It's a Play Framework 2.0.2 app, and the dependencies that get pulled in are AHC 1.7.4 and Netty 3.4.1.Final.

@sclasen
Author

sclasen commented Oct 3, 2012

Or perhaps there are too many metrics being posted? Should the metrics post be chunked up into groups of 100 or so?

@sclasen
Author

sclasen commented Oct 3, 2012

Ahh, looks like it's doing that already...

@mihasya
Contributor

mihasya commented Oct 3, 2012

I'm pretty sure batches that error out get thrown away, but now that you mention it, maybe there's a bug in how that's handled? Let me summon @josephruscio and see if there are any error logs for your account. That would certainly explain why it degenerates over time.

Does it still log how many metrics it attempted to post at each interval if there's an error? (don't have time to look at the code right now)

It does batch already, as you found.
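
For illustration only, the chunked-posting-with-discard behavior being described might look roughly like the sketch below. This is not the actual librato-java code: the portion size, the Measurement type, and the postPortion helper are hypothetical stand-ins, and "drop a failed portion rather than retry it" is an assumption made for the example.

    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch only: Measurement and postPortion(...) are hypothetical stand-ins,
    // not the real librato-java types.
    public class BatchPostSketch {
        private static final int PORTION_SIZE = 100;     // assumed chunk size
        private static final long TIMEOUT_SECONDS = 25;  // assumed per-request timeout

        public void post(List<Measurement> measurements) {
            // Split the batch into fixed-size portions and post each one.
            for (int start = 0; start < measurements.size(); start += PORTION_SIZE) {
                int end = Math.min(start + PORTION_SIZE, measurements.size());
                List<Measurement> portion = measurements.subList(start, end);
                try {
                    postPortion(portion, TIMEOUT_SECONDS, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    // Discard the failed portion instead of retrying it forever,
                    // so one slow or bad request cannot wedge every later interval.
                } catch (Exception e) {
                    // Likewise drop the portion on any other error.
                }
            }
        }

        // Hypothetical helper that performs the HTTP POST for one portion,
        // waiting at most the given timeout for a response.
        private void postPortion(List<Measurement> portion, long timeout, TimeUnit unit)
                throws Exception {
            // ... build the JSON payload and POST it ...
        }

        // Hypothetical measurement type.
        static class Measurement {}
    }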

@sclasen
Author

sclasen commented Oct 3, 2012

Yeah, looks like it should, but unfortunately I don't have debug logging on at the moment... I can turn that on.

@sclasen
Author

sclasen commented Oct 3, 2012

OK, the app's restarted, so it's not in a bad state now, and the batches are only 145 metrics.

@josephruscio
Contributor

@sclasen what's the librato account email for this application? Per @mihasya's request, I can check in the logs on our side for anything funny.

@sclasen
Author

sclasen commented Oct 3, 2012

scott@heroku.com, thanks. The gap times were ~9:30 to 10:45am this morning.

@josephruscio
Contributor

@sclasen @mihasya there are no 4xx/5xx errors showing up in the logs from this morning. I just flagged this account on to log all 2xx responses as well, so if/when it happens again, we should at least be able to confirm that the requests stop hitting our API when it locks up. /cc @mheffner

@sclasen
Author

sclasen commented Oct 3, 2012

Awesome, thanks. I'll watch for it on my side.


@sclasen
Author

sclasen commented Oct 3, 2012

@josephruscio one process in a bad state, since 2:27pm are you seeing anything?

@josephruscio
Contributor

@sclasen @mihasya just checked. Looks like there are maybe 4 dynos pushing in stats? Appears to be 4 IP addresses repeating in our logs. 3 of them are still reporting in, but one of them stopped making requests after 14:26:42 ... which would correspond with 2:27pm. So I think this is definitely wedged client-side somewhere. There are no errors preceding the cessation of requests. /cc @mheffner

@sclasen
Author

sclasen commented Oct 3, 2012

Thanks! This probably has nothing to do with it but...the process that is failing is the only one attempting to post more than 200 metrics, at 216. The rest are 199 or less.

@sclasen
Author

sclasen commented Oct 3, 2012

Hmm, aha? I got another process over 200 metrics and it's barfing now too!

@sclasen
Author

sclasen commented Oct 3, 2012

Aha, pushed another over 200, now it's hosed.

@sclasen
Author

sclasen commented Oct 3, 2012

Aaaand pushed the last one over 200... now all hosed.

@mihasya
Contributor

mihasya commented Oct 3, 2012

Excellent info @sclasen

Questions.

  1. What's the interval at which you're posting? If it's substantially larger than 5sec, could you try instantiating the reporter with a longer timeout (see the configuration sketch after this list)? See https://github.com/librato/metrics-librato/blob/master/src/main/java/com/librato/metrics/LibratoReporter.java#L190
  2. When you say "pushed the last one over 200", do you mean your app is instantiating more metrics to report on, or the Reporter just keeps logging more and more metrics?
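
(For reference, a rough configuration sketch for point 1: the builder and enable calls below follow the linked source, but the exact method names, setTimeout in particular, are assumptions and may differ from the version you're running.)

    import java.util.concurrent.TimeUnit;
    import com.librato.metrics.LibratoReporter;

    // Sketch under assumptions: check LibratoReporter.java for the real
    // builder/enable signatures in your metrics-librato version.
    public class ReporterSetupSketch {
        public static void main(String[] args) {
            LibratoReporter.Builder builder =
                    LibratoReporter.builder("user@example.com", "api-token", "my-source")
                            .setTimeout(25, TimeUnit.SECONDS);  // assumed timeout setter

            // Enable reporting on a fixed interval (30 seconds here as an example).
            LibratoReporter.enable(builder, 30, TimeUnit.SECONDS);
        }
    }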

@sclasen
Author

sclasen commented Oct 3, 2012

  1. 30sec is the interval.
  2. I hit some other URLs in the app which are instrumented, and caused the total number of metrics being reported to go over 200.


@mihasya
Contributor

mihasya commented Oct 3, 2012

OK, could you try setting the timeout to 25 seconds then? It's possible that I'm not handling the timeout error correctly, causing the client to get into a crappy state, and 200 might just be the magic number that causes your post to Librato to take longer than 5s.

EDIT: if this fixes your particular problem, the bug changes to "Timeout errors eventually send the client to a shitty terrible death".
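
To make the hypothesis concrete: the reporter waits on the HTTP client's response future for a bounded time, and the suspicion is that the timeout path leaves things in a bad state. A defensive pattern, sketched below with a plain java.util.concurrent.Future, is to cancel the request outright when the timed get expires; whether librato-java actually did this at the time is not something this sketch claims.

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Generic sketch: responseFuture stands in for whatever the HTTP client
    // returns; cancelling on timeout is an assumption, not the library's
    // documented behavior.
    public class TimedPostSketch {
        public static <T> T getWithTimeout(Future<T> responseFuture, long timeout, TimeUnit unit) {
            try {
                // Wait at most `timeout` for the POST to complete.
                return responseFuture.get(timeout, unit);
            } catch (TimeoutException e) {
                // Give up on this request entirely so the next interval starts clean.
                responseFuture.cancel(true);
                return null;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                responseFuture.cancel(true);
                return null;
            } catch (ExecutionException e) {
                // The request itself failed; nothing left to wait for, move on.
                return null;
            }
        }
    }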

@sclasen
Author

sclasen commented Oct 3, 2012

👍

Trying it out now.

@sclasen
Author

sclasen commented Oct 3, 2012

Gah, some are working over 200 but this just happened

2012-10-03T22:30:46+00:00 app[web.1]: [metrics-librato-reporter-thread-1] ERROR com.librato.metrics.LibratoBatch - Unable to post to Librato API
2012-10-03T22:30:46+00:00 app[web.1]: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: No response received after 25
2012-10-03T22:30:46+00:00 app[web.1]: at com.ning.http.client.providers.netty.NettyResponseFuture.get(NettyResponseFuture.java:223) ~[async-http-client-1.7.4.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.postPortion(LibratoBatch.java:131) [librato-java-0.0.5.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.post(LibratoBatch.java:115) [librato-java-0.0.5.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: at com.librato.metrics.LibratoReporter.run(LibratoReporter.java:75) [metrics-librato-2.1.2.3.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.lang.Thread.run(Thread.java:636) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: Caused by: java.util.concurrent.TimeoutException: No response received after 25
2012-10-03T22:30:46+00:00 app[web.1]: at com.ning.http.client.providers.netty.NettyResponseFuture.get(NettyResponseFuture.java:215) ~[async-http-client-1.7.4.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: ... 11 common frames omitted
2012-10-03T22:30:46+00:00 app[web.1]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 223 measurements

@mihasya
Contributor

mihasya commented Oct 3, 2012

Interesting... So I just noticed one other interesting detail: you're running metrics-librato-2.1.2.3 and librato-java-0.0.5. Both libs have since been revved due to a very similar-looking bug involving authentication :D Try bumping to metrics-librato-2.1.2.4, which will pull in the correct version of librato-java as well.

@sclasen
Author

sclasen commented Oct 3, 2012

⚡ that did it, so far so good anyhow

2012-10-03T23:02:33+00:00 app[web.4]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 274 measurements
2012-10-03T23:02:35+00:00 app[web.3]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 274 measurements
2012-10-03T23:02:35+00:00 app[web.2]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 291 measurements
2012-10-03T23:02:46+00:00 app[web.1]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 264 measurements

@mihasya
Copy link
Contributor

mihasya commented Oct 3, 2012

I should add timing data to those logs so we can have an idea of how long a normal post actually takes. Joe doesn't seem to be seeing anything above ~3sec for the POST requests in his logs, so it'll be nice to see how long the Reporter thinks it's taking as opposed to what's being seen server side.

Let me know if it comes back. That was a very strange bug that I still don't fully understand; I think the way the HTTP client handles basic auth is just buggy. The fix was to just manually generate the Base64 auth header and set it instead of passing the creds to the client. Computers, how do they...
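
For context, manually building a Basic auth header rather than handing the credentials to the client looks roughly like the sketch below. It's a generic illustration using async-http-client 1.7-era calls and java.util.Base64 (Java 8+); it's not claimed to be the exact change that went into librato-java, and the URL, credentials, and body are placeholders.

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.concurrent.TimeUnit;
    import com.ning.http.client.AsyncHttpClient;
    import com.ning.http.client.Response;

    // Sketch: placeholder URL, credentials, and body; error handling omitted.
    public class BasicAuthSketch {
        public static void main(String[] args) throws Exception {
            String user = "user@example.com";
            String token = "api-token";

            // Pre-compute the Authorization header instead of relying on the
            // client's built-in basic-auth handling.
            String credentials = user + ":" + token;
            String authHeader = "Basic " + Base64.getEncoder()
                    .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));

            AsyncHttpClient client = new AsyncHttpClient();
            try {
                Response response = client.preparePost("https://metrics-api.librato.com/v1/metrics")
                        .addHeader("Authorization", authHeader)
                        .addHeader("Content-Type", "application/json")
                        .setBody("{\"gauges\":[]}")
                        .execute()
                        .get(25, TimeUnit.SECONDS);  // bounded wait on the response future
                System.out.println("HTTP " + response.getStatusCode());
            } finally {
                client.close();
            }
        }
    }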

@sclasen
Author

sclasen commented Oct 3, 2012

Thanks for the help!

sclasen closed this as completed on Oct 3, 2012.