
Reporter seems to get into a bad state #10

Closed
sclasen opened this issue Oct 1, 2012 · 29 comments

Comments

@sclasen

sclasen commented Oct 1, 2012

The reporter seems to get into a bad state from which it does not recover.

The below stack trace occurs on the reporting interval.

In this case there are 4 processes running and only one exhibits this behavior.

Without a restart it continues to produce this on every reporting interval, indefinitely. Upon restart it's fine.

2012-10-01T22:59:12+00:00 app[web.1]: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: No response received after 5
2012-10-01T22:59:12+00:00 app[web.1]: at com.ning.http.client.providers.netty.NettyResponseFuture.get(NettyResponseFuture.java:223) ~[async-http-client-1.7.4.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.postPortion(LibratoBatch.java:131) [librato-java-0.0.5.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.post(LibratoBatch.java:115) [librato-java-0.0.5.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at com.librato.metrics.LibratoReporter.run(LibratoReporter.java:75) [metrics-librato-2.1.2.3.jar:na]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [na:1.6.0_20]
2012-10-01T22:59:12+00:00 app[web.1]: at java.lang.Thread.run(Thread.java:636) [na:1.6.0_20]

@mihasya
Contributor

mihasya commented Oct 1, 2012

Bah worst kind of bug ever :) Is the data failing to show up in Metrics for this source also?

@sclasen
Author

sclasen commented Oct 1, 2012

Actually there don't appear to be any gaps in the data. Next time it happens I'll leave it in that state for a while and verify.

@sclasen
Author

sclasen commented Oct 3, 2012

So it is now in a bad state on 2 of the 4 processes, and unfortunately there are now gaps in the data.

@mihasya
Contributor

mihasya commented Oct 3, 2012

Damn it, I guess now I have to look :)

If you could take a thread dump of one of those processes and put it in a gist, I'd appreciate it.. Not sure it'll actually help, but there's always a chance..

Also, could you let me know what version of Netty is being pulled in?

@sclasen
Author

sclasen commented Oct 3, 2012

That will be a bit difficult since it's a Heroku app, unfortunately.

Is it possibly "stuck" because the Librato API doesn't like the metrics it's posting and times out, and then the reporter continually retries the "poison message"?

Maybe it should discard the current batch after X failures?

It's a Play Framework 2.0.2 app, and the dependencies that get pulled in are AHC 1.7.4 and Netty 3.4.1.Final.

@sclasen
Author

sclasen commented Oct 3, 2012

Or perhaps there are too many metrics being posted? Should the metrics post be chunked up into groups of 100 or so?

@sclasen
Author

sclasen commented Oct 3, 2012

Ahh, looks like it's doing that already...

@mihasya
Contributor

mihasya commented Oct 3, 2012

I'm pretty sure batches that error out get thrown away, but now that you mention it, maybe there's a bug in how that's handled? Let me summon @josephruscio and see if there are any error logs for your account. That would certainly explain why it degenerates over time.

Does it still log how many metrics it attempted to post at each interval if there's an error? (don't have time to look at the code right now)

It does batch already, as you found.
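
For illustration only, the chunked-posting-with-discard behavior being described might look roughly like the sketch below. This is not the actual librato-java code: the portion size, the Measurement type, and the postPortion helper are hypothetical stand-ins, and "drop a failed portion rather than retry it" is an assumption made for the example.

    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Sketch only: Measurement and postPortion(...) are hypothetical stand-ins,
    // not the real librato-java types.
    public class BatchPostSketch {
        private static final int PORTION_SIZE = 100;     // assumed chunk size
        private static final long TIMEOUT_SECONDS = 25;  // assumed per-request timeout

        public void post(List<Measurement> measurements) {
            // Split the batch into fixed-size portions and post each one.
            for (int start = 0; start < measurements.size(); start += PORTION_SIZE) {
                int end = Math.min(start + PORTION_SIZE, measurements.size());
                List<Measurement> portion = measurements.subList(start, end);
                try {
                    postPortion(portion, TIMEOUT_SECONDS, TimeUnit.SECONDS);
                } catch (TimeoutException e) {
                    // Discard the failed portion instead of retrying it forever,
                    // so one slow or bad request cannot wedge every later interval.
                } catch (Exception e) {
                    // Likewise drop the portion on any other error.
                }
            }
        }

        // Hypothetical helper that performs the HTTP POST for one portion,
        // waiting at most the given timeout for a response.
        private void postPortion(List<Measurement> portion, long timeout, TimeUnit unit)
                throws Exception {
            // ... build the JSON payload and POST it ...
        }

        // Hypothetical measurement type.
        static class Measurement {}
    }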

@sclasen
Author

sclasen commented Oct 3, 2012

Yeah, looks like it should, but unfortunately I don't have debug logging on at the moment... I can turn that on.

@sclasen
Author

sclasen commented Oct 3, 2012

OK, the app's restarted, so it's not in a bad state now, and the batches are only 145 metrics.

@josephruscio
Contributor

@sclasen what's the librato account email for this application? Per @mihasya's request, I can check in the logs on our side for anything funny.

@sclasen
Author

sclasen commented Oct 3, 2012

scott@heroku.com, thanks. The gap times were ~9:30 to 10:45am this morning.

@josephruscio
Contributor

@sclasen @mihasya there are no 4xx/5xx errors showing up in the logs from this morning. I just flagged this account on to log all 2xx responses as well, so if/when it happens again, we should at least be able to confirm that the requests stop hitting our API when it locks up. /cc @mheffner

@sclasen
Author

sclasen commented Oct 3, 2012

Awesome, thanks. I'll watch for it on my side.


@sclasen
Author

sclasen commented Oct 3, 2012

@josephruscio one process in a bad state, since 2:27pm are you seeing anything?

@josephruscio
Contributor

@sclasen @mihasya just checked. Looks like there are maybe 4 dynos pushing in stats? Appears to be 4 IP addresses repeating in our logs. 3 of them are still reporting in, but one of them stopped making requests after 14:26:42 ... which would correspond with 2:27pm. So I think this is definitely wedged client-side somewhere. There are no errors preceding the cessation of requests. /cc @mheffner

@sclasen
Author

sclasen commented Oct 3, 2012

Thanks! This probably has nothing to do with it but...the process that is failing is the only one attempting to post more than 200 metrics, at 216. The rest are 199 or less.

@sclasen
Author

sclasen commented Oct 3, 2012

Hmm, aha? I got another process over 200 metrics and it's barfing now too!

@sclasen
Author

sclasen commented Oct 3, 2012

Aha, pushed another over 200, now it's hosed.

@sclasen
Author

sclasen commented Oct 3, 2012

Aaaand pushed the last one over 200... now all hosed.

@mihasya
Contributor

mihasya commented Oct 3, 2012

Excellent info @sclasen

Questions.

  1. What's the interval at which you're posting? If it's substantially larger than 5sec, could you try instantiating the reporter with a longer timeout (see the configuration sketch after this list)? See https://github.com/librato/metrics-librato/blob/master/src/main/java/com/librato/metrics/LibratoReporter.java#L190
  2. When you say "pushed the last one over 200", do you mean your app is instantiating more metrics to report on, or the Reporter just keeps logging more and more metrics?
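
(For reference, a rough configuration sketch for point 1: the builder and enable calls below follow the linked source, but the exact method names, setTimeout in particular, are assumptions and may differ from the version you're running.)

    import java.util.concurrent.TimeUnit;
    import com.librato.metrics.LibratoReporter;

    // Sketch under assumptions: check LibratoReporter.java for the real
    // builder/enable signatures in your metrics-librato version.
    public class ReporterSetupSketch {
        public static void main(String[] args) {
            LibratoReporter.Builder builder =
                    LibratoReporter.builder("user@example.com", "api-token", "my-source")
                            .setTimeout(25, TimeUnit.SECONDS);  // assumed timeout setter

            // Enable reporting on a fixed interval (30 seconds here as an example).
            LibratoReporter.enable(builder, 30, TimeUnit.SECONDS);
        }
    }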

@sclasen
Author

sclasen commented Oct 3, 2012

  1. 30sec is the interval.
  2. I hit some other URLs in the app which are instrumented, and caused the total number of metrics being reported to go over 200.


@mihasya
Contributor

mihasya commented Oct 3, 2012

OK, could you try setting the timeout to 25 seconds then? It's possible that I'm not handling the timeout error correctly, causing the client to get into a crappy state, and 200 might just be the magic number that causes your post to Librato to take longer than 5s.

EDIT: if this fixes your particular problem, the bug changes to "Timeout errors eventually send the client to a shitty terrible death".
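
To make the hypothesis concrete: the reporter waits on the HTTP client's response future for a bounded time, and the suspicion is that the timeout path leaves things in a bad state. A defensive pattern, sketched below with a plain java.util.concurrent.Future, is to cancel the request outright when the timed get expires; whether librato-java actually did this at the time is not something this sketch claims.

    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Generic sketch: responseFuture stands in for whatever the HTTP client
    // returns; cancelling on timeout is an assumption, not the library's
    // documented behavior.
    public class TimedPostSketch {
        public static <T> T getWithTimeout(Future<T> responseFuture, long timeout, TimeUnit unit) {
            try {
                // Wait at most `timeout` for the POST to complete.
                return responseFuture.get(timeout, unit);
            } catch (TimeoutException e) {
                // Give up on this request entirely so the next interval starts clean.
                responseFuture.cancel(true);
                return null;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                responseFuture.cancel(true);
                return null;
            } catch (ExecutionException e) {
                // The request itself failed; nothing left to wait for, move on.
                return null;
            }
        }
    }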

@sclasen
Author

sclasen commented Oct 3, 2012

👍

Trying it out now.

@sclasen
Author

sclasen commented Oct 3, 2012

Gah, some are working over 200 but this just happened

2012-10-03T22:30:46+00:00 app[web.1]: [metrics-librato-reporter-thread-1] ERROR com.librato.metrics.LibratoBatch - Unable to post to Librato API
2012-10-03T22:30:46+00:00 app[web.1]: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: No response received after 25
2012-10-03T22:30:46+00:00 app[web.1]: at com.ning.http.client.providers.netty.NettyResponseFuture.get(NettyResponseFuture.java:223) ~[async-http-client-1.7.4.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.postPortion(LibratoBatch.java:131) [librato-java-0.0.5.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: at com.librato.metrics.LibratoBatch.post(LibratoBatch.java:115) [librato-java-0.0.5.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: at com.librato.metrics.LibratoReporter.run(LibratoReporter.java:75) [metrics-librato-2.1.2.3.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: at java.lang.Thread.run(Thread.java:636) [na:1.6.0_20]
2012-10-03T22:30:46+00:00 app[web.1]: Caused by: java.util.concurrent.TimeoutException: No response received after 25
2012-10-03T22:30:46+00:00 app[web.1]: at com.ning.http.client.providers.netty.NettyResponseFuture.get(NettyResponseFuture.java:215) ~[async-http-client-1.7.4.jar:na]
2012-10-03T22:30:46+00:00 app[web.1]: ... 11 common frames omitted
2012-10-03T22:30:46+00:00 app[web.1]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 223 measurements

@mihasya
Contributor

mihasya commented Oct 3, 2012

Interesting... So I just noticed one other interesting detail: you're running metrics-librato-2.1.2.3 and librato-java-0.0.5. Both libs have since been revved due to a very similar-looking bug involving authentication :D Try bumping to metrics-librato-2.1.2.4, which will pull in the correct version of librato-java as well.

@sclasen
Author

sclasen commented Oct 3, 2012

⚡ that did it, so far so good anyhow

2012-10-03T23:02:33+00:00 app[web.4]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 274 measurements
2012-10-03T23:02:35+00:00 app[web.3]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 274 measurements
2012-10-03T23:02:35+00:00 app[web.2]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 291 measurements
2012-10-03T23:02:46+00:00 app[web.1]: [metrics-librato-reporter-thread-1] DEBUG com.librato.metrics.LibratoBatch - Posted 264 measurements

@mihasya
Copy link
Contributor

mihasya commented Oct 3, 2012

I should add timing data to those logs so we can have an idea of how long a normal post actually takes. Joe doesn't seem to be seeing anything above ~3sec for the POST requests in his logs, so it'll be nice to see how long the Reporter thinks it's taking as opposed to what's being seen server side.

Let me know if it comes back. That was a very strange bug that I still don't fully understand; I think the way the HTTP client handles basic auth is just buggy. The fix was to just manually generate the Base64 auth header and set it instead of passing the creds to the client. Computers, how do they...
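
For context, manually building a Basic auth header rather than handing the credentials to the client looks roughly like the sketch below. It's a generic illustration using async-http-client 1.7-era calls and java.util.Base64 (Java 8+); it's not claimed to be the exact change that went into librato-java, and the URL, credentials, and body are placeholders.

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.concurrent.TimeUnit;
    import com.ning.http.client.AsyncHttpClient;
    import com.ning.http.client.Response;

    // Sketch: placeholder URL, credentials, and body; error handling omitted.
    public class BasicAuthSketch {
        public static void main(String[] args) throws Exception {
            String user = "user@example.com";
            String token = "api-token";

            // Pre-compute the Authorization header instead of relying on the
            // client's built-in basic-auth handling.
            String credentials = user + ":" + token;
            String authHeader = "Basic " + Base64.getEncoder()
                    .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));

            AsyncHttpClient client = new AsyncHttpClient();
            try {
                Response response = client.preparePost("https://metrics-api.librato.com/v1/metrics")
                        .addHeader("Authorization", authHeader)
                        .addHeader("Content-Type", "application/json")
                        .setBody("{\"gauges\":[]}")
                        .execute()
                        .get(25, TimeUnit.SECONDS);  // bounded wait on the response future
                System.out.println("HTTP " + response.getStatusCode());
            } finally {
                client.close();
            }
        }
    }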

@sclasen
Author

sclasen commented Oct 3, 2012

Thanks for the help!

sclasen closed this as completed on Oct 3, 2012.