Could not write results to Graphite with new GraphiteWriterOutputFactory #525
I have restarted my app that embeds jmxtrans-core.
And after some time, NullPointerExceptions again (same stack trace as above). Is there some kind of issue with the pooling of outbound resources? Should I tune GraphiteWriterFactory differently?
Ok, after some time it looks like jmxtrans recovered and started pushing data to Graphite again.
Can you try upgrading to jmxtrans 260? Some related issues have been solved there. More than that, you need to have a look at your network and/or Graphite server to understand why packets are being dropped. If you understand the cause, there might be something that can be done on the jmxtrans side...
Yes, I will try upgrading to 260 to see if those issues go away.
Alright, this is what I see after trying out 260:
Still see them, just different stack trace:
What actually happens to the messages that haven't been sent to the destinations due to a broken pipe? Are they accumulated and held until the connection is restored? Or are they discarded and never reach the destination?
The stack trace does not seem to be complete... Messages that can't be sent are discarded. Having a retry queue would mean increasing memory usage in times of trouble, which is not something that appeals to me all that much. From your analysis, it looks like there is a connectivity issue between your jmxtrans instance and Hosted Graphite. Do you have an idea of how many metrics per second you are sending? Could you check whether there is a DNS resolution change at the same time as the error? Does jmxtrans recover quickly (do you see just one error every now and then, or bursts of errors at the same time)?
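For context on the retry-queue trade-off mentioned above: unbounded retrying would indeed grow memory while the connection is down, but a size-capped queue bounds it at the cost of dropping metrics anyway once full. A minimal hypothetical sketch (jmxtrans does not implement this; names and values are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical sketch of a size-bounded retry queue. Memory is capped because
// offer() refuses new entries once the queue is full, so during a long outage
// metrics are still dropped, just after a bounded amount of buffering.
public class BoundedRetryQueue {
    public static void main(String[] args) {
        ArrayBlockingQueue<String> retry = new ArrayBlockingQueue<>(2);
        String[] failedSends = {"metric.a 1 100", "metric.b 2 100", "metric.c 3 100"};
        int dropped = 0;
        for (String line : failedSends) {
            if (!retry.offer(line)) {
                dropped++; // queue full: drop rather than grow without bound
            }
        }
        System.out.println("queued=" + retry.size() + " dropped=" + dropped);
        // prints: queued=2 dropped=1
    }
}
```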
I have contacted Hosted Graphite support; according to their developers, who have reviewed the logs, there are no indications of issues on their end. Also, I have noticed that there are issues with both sending data to Hosted Graphite and querying it. At least, that's what I read from the thread name:
Then there are these:
It looks like logs from multiple threads are interleaved. As far as I can see from the stack traces, all errors come from writing to Graphite, not reading over JMX. It might make sense to tune the logging config to ensure log messages are not mixed together (check the log4j config, and feel free to submit a patch). This still looks like a communication error between jmxtrans and Graphite. It could be the network, a firewall, a change of IP somewhere... a lot of different things. Do you have an idea of the frequency and distribution of those errors?
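One way to make interleaved output attributable is to include the thread name in the log pattern. A standard logback configuration along these lines (appender name and pattern are conventional examples, not from this project):

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- %thread lets you tell interleaved writer threads apart -->
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="ERROR">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```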
Yeah, you were right regarding the logs. The problem was a misconfigured logback and the fact that Heroku log drains do not preserve the order of messages dumped to stdout. I have now configured logback to output only error-level messages, just to check the frequency and to better capture stack traces. Here is the example:
I have 3 remote JMX servers from which I collect metrics. You mentioned that network latency could be an issue. Can it be related to a very aggressive timeout on acquiring the socket connection? Maybe it can be increased. With regards to how often the broken-pipe issue appears in the logs: I had the app running for 15 minutes and saw 31 occurrences of the stack trace, which is quite high.
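For reference on the timeout question: jmxtrans manages its sockets internally, but the two timeouts that matter on a plain `java.net.Socket` are the connect timeout and the read timeout (`SO_TIMEOUT`). A standalone illustration with placeholder values (the host and port are not real endpoints):

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class SocketTimeouts {
    public static void main(String[] args) throws Exception {
        int connectTimeoutMs = 5_000; // max time to establish the TCP connection
        int readTimeoutMs = 10_000;   // max time to block on a read before SocketTimeoutException

        try (Socket socket = new Socket()) {
            socket.setSoTimeout(readTimeoutMs); // applies to reads once connected
            System.out.println("SO_TIMEOUT=" + socket.getSoTimeout());
            // Connecting is commented out because the endpoint is a placeholder:
            // socket.connect(new InetSocketAddress("graphite.example.com", 2003), connectTimeoutMs);
        }
    }
}
```

Note that neither timeout explains a broken pipe, which happens on write after the connection is established; timeouts mostly affect how fast a dead connection is detected.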
I have also spotted a few of those:
I have also noticed that even though I see a lot of messages sent off to the GraphiteWriter2 and a much lower number of errors, only a few of those messages make it to the destination (Hosted Graphite); it looks like they are being lost in the process. Might it be related to GraphiteWriterOutputFactory? Should I try switching back to GraphiteWriter? At this point I am questioning everything.
Looking at the stack traces above, it seems that in all cases the issue occurs during flush (those stack traces still look /wrong/). So the connection has already been opened, but when we actually try to send a packet, things get problematic. There can be lots of reasons for that... It is expected that you lose more messages than the number of errors you see in the logs. Basically, it works like this:
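The explanation that followed the colon appears to have been lost in extraction. The general idea, that with buffered writes one logged flush error can correspond to many lost metrics, can be sketched as follows (hypothetical names, not the actual jmxtrans code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a writer buffers metric lines and sends them on flush.
// If the TCP connection broke in the meantime, the single failed flush
// discards the whole batch, so one "Broken pipe" in the log can mean
// dozens of lost metrics.
public class BatchLossSketch {
    public static void main(String[] args) {
        List<String> buffer = new ArrayList<>();
        for (int i = 0; i < 30; i++) {
            buffer.add("servers.app.metric" + i + " 1.0 1480085145");
        }
        boolean connectionBroken = true; // simulate a broken pipe at flush time
        int lost = 0;
        if (connectionBroken) {
            lost = buffer.size(); // the entire batch is dropped on one error
            buffer.clear();
        }
        System.out.println("errors logged: 1, metrics lost: " + lost);
        // prints: errors logged: 1, metrics lost: 30
    }
}
```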
I suspect there is an issue with GraphiteWriterFactory. I have switched back to using GraphiteWriter as the output writer; however, I've run into #519 (comment)
It is very similar to the GraphiteWriter implementation, except that it doesn't rely on Guice injection (which is broken in my case for whatever reason) and it has explicit flushing.
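For anyone reading along, the core of such a writer is small. A minimal hypothetical sketch (not the actual jmxtrans class) that emits the Graphite plaintext protocol, one `<metric.path> <value> <epoch-seconds>` line per metric, and flushes explicitly after each write:

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical minimal Graphite plaintext writer with explicit flushing.
// It is constructed over any java.io.Writer so socket handling stays outside.
public class SimpleGraphiteWriter {
    private final Writer out;

    public SimpleGraphiteWriter(Writer out) {
        this.out = out;
    }

    public void write(String metricPath, double value, long epochSeconds) throws IOException {
        out.write(metricPath + " " + value + " " + epochSeconds + "\n");
        out.flush(); // explicit flush: no batching, so a failure loses at most one metric
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter(); // stand-in for a socket's OutputStreamWriter
        SimpleGraphiteWriter writer = new SimpleGraphiteWriter(sink);
        writer.write("servers.app.ThroughputReads.OneMinuteRate", 2.96, 1480085145L);
        System.out.print(sink);
        // prints: servers.app.ThroughputReads.OneMinuteRate 2.96 1480085145
    }
}
```

Flushing per metric trades throughput for predictable loss behavior, which is the opposite trade-off to the pooled, batching writers discussed above.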
Using jmxtrans 259
Seeing the following exception in the logs:
10:45:46.344 [jmxtrans-result-8] WARN com.googlecode.jmxtrans.jmx.ResultProcessor - Could not write results [Result(attributeName=OneMinuteRate, className=com.yammer.metrics.reporting.JmxReporter$Timer, objDomain=org.apache.cassandra.metrics, typeName=type=ClientRequest,scope=Read,name=Latency, values={OneMinuteRate=2.964393875E-314}, epoch=1480085145326, keyAlias=ThroughputReads)] of query Query(objectName=org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency, keys=[], attr=[OneMinuteRate], typeNames=[], resultAlias=ThroughputReads, useObjDomainAsKey=false, allowDottedKeys=false, useAllTypeNames=false, outputWriterInstances=[]) to output writer com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter@74b940ab
java.lang.NullPointerException: null
at com.googlecode.jmxtrans.model.output.support.WriterPoolOutputWriter.doWrite(WriterPoolOutputWriter.java:59)
at com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter.doWrite(ResultTransformerOutputWriter.java:50)
at com.googlecode.jmxtrans.jmx.ResultProcessor$1.run(ResultProcessor.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
What's the deal with it, and how do I fix or work around it?
Here is the config file that I generate: