
Could not write results to Graphite with new GraphiteWriterOutputFactory #525

Open
imochurad opened this issue Nov 25, 2016 · 14 comments

imochurad commented Nov 25, 2016

Using jmxtrans 259.
Seeing the following exception in the logs:

10:45:46.344 [jmxtrans-result-8] WARN com.googlecode.jmxtrans.jmx.ResultProcessor - Could not write results [Result(attributeName=OneMinuteRate, className=com.yammer.metrics.reporting.JmxReporter$Timer, objDomain=org.apache.cassandra.metrics, typeName=type=ClientRequest,scope=Read,name=Latency, values={OneMinuteRate=2.964393875E-314}, epoch=1480085145326, keyAlias=ThroughputReads)] of query Query(objectName=org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency, keys=[], attr=[OneMinuteRate], typeNames=[], resultAlias=ThroughputReads, useObjDomainAsKey=false, allowDottedKeys=false, useAllTypeNames=false, outputWriterInstances=[]) to output writer com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter@74b940ab
java.lang.NullPointerException: null
at com.googlecode.jmxtrans.model.output.support.WriterPoolOutputWriter.doWrite(WriterPoolOutputWriter.java:59)
at com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter.doWrite(ResultTransformerOutputWriter.java:50)
at com.googlecode.jmxtrans.jmx.ResultProcessor$1.run(ResultProcessor.java:58)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

What is causing this, and how can I fix or work around it?

Here is the config file that I generate:

{
  "servers" : [ {
    "host" : "localhost",
    "port" : 7199,
    "numQueryThreads" : 5,
    "outputWriters" : [ {
      "@class" : "com.googlecode.jmxtrans.model.output.GraphiteWriterFactory",
      "rootPrefix" : "hosted_graphite_api_key",
      "host" : "carbon.hostedgraphite.com",
      "port" : 2003,
      "flushStrategy" : "always",
      "poolSize" : 10
    } ],
    "queries" : [ {
      "obj" : "org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency",
      "attr" : [ "OneMinuteRate" ],
      "resultAlias" : "ThroughputWrites",
      "allowDottedKeys" : false,
      "useAllTypeNames" : false,
      "useObjDomainAsKey" : false
    }, {
      "obj" : "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency",
      "attr" : [ "OneMinuteRate" ],
      "resultAlias" : "ThroughputReads",
      "allowDottedKeys" : false,
      "useAllTypeNames" : false,
      "useObjDomainAsKey" : false
    }, {
      "obj" : "org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=TotalLatency",
      "attr" : [ "Count" ],
      "resultAlias" : "LatencyWrites",
      "allowDottedKeys" : false,
      "useAllTypeNames" : false,
      "useObjDomainAsKey" : false
    }, {
      "obj" : "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=TotalLatency",
      "attr" : [ "Count" ],
      "resultAlias" : "LatencyReads",
      "allowDottedKeys" : false,
      "useAllTypeNames" : false,
      "useObjDomainAsKey" : false
    }, {
      "obj" : "org.apache.cassandra.metrics:type=Storage,name=Load",
      "attr" : [ "Count" ],
      "resultAlias" : "Load",
      "allowDottedKeys" : false,
      "useAllTypeNames" : false,
      "useObjDomainAsKey" : false
    } ]
  } ]
}

imochurad commented Nov 25, 2016

I have restarted my app, which embeds jmxtrans-core.
Upon restart it works fine at first, sending data over.
Then I see the following exception:

java.net.SocketException: Broken pipe
	at java.net.SocketOutputStream.socketWrite0(Native Method)
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
	at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
	at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
	at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
	at java.io.BufferedWriter.flush(BufferedWriter.java:254)
	at com.googlecode.jmxtrans.model.output.support.pool.AlwaysFlush.flush(AlwaysFlush.java:32)
	at com.googlecode.jmxtrans.model.output.support.pool.WriterPoolable.release(WriterPoolable.java:54)
	at com.googlecode.jmxtrans.model.output.support.WriterPoolOutputWriter.doWrite(WriterPoolOutputWriter.java:59)
	at com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter.doWrite(ResultTransformerOutputWriter.java:50)
	at com.googlecode.jmxtrans.jmx.ResultProcessor$1.run(ResultProcessor.java:58)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

And after some time, NullPointerExceptions again (same stack trace as above). Is there some kind of issue with the pooling of outbound resources? Should I tune GraphiteWriterFactory differently?


imochurad commented Nov 25, 2016

OK, after some time it looks like jmxtrans recovered itself and started pushing data to Graphite again.
But then the Broken pipe happened again, and it hung there for a few minutes.
That is rather disappointing, since I'd like to avoid these interruptions.
Please advise how the broken pipes and NPEs could be avoided.


gehel commented Nov 25, 2016

Can you try upgrading to jmxtrans 260? Some related issues have been solved there.

The Broken pipe itself is most probably a communication issue between jmxtrans and Graphite. You could probably use a timeBased flush strategy with a reasonable flushDelayInSeconds; this will allow some buffering to occur and will send larger packets instead of a single metric per packet. You might also want to try reducing the TCP pool size and see if that helps...
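A sketch of that suggestion, reusing the field names from the config posted above (the delay and pool size values are just examples, not recommendations):

```json
"outputWriters" : [ {
  "@class" : "com.googlecode.jmxtrans.model.output.GraphiteWriterFactory",
  "rootPrefix" : "hosted_graphite_api_key",
  "host" : "carbon.hostedgraphite.com",
  "port" : 2003,
  "flushStrategy" : "timeBased",
  "flushDelayInSeconds" : 10,
  "poolSize" : 2
} ]
```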

More than that, you need to have a look at your network and/or Graphite server to understand why packets are being dropped. If you understand the cause, there might be something that can be done on the jmxtrans side...

imochurad commented:

Yes, I will try upgrading to 260 to see if those issues go away.
Regarding FlushStrategy: ideally, I would like to poll some metrics every second. Will that work fine with timeBased flush strategy and flushDelayInSeconds? Let's say I'll have flushDelayInSeconds set for 10 seconds. Will it delay on sending all the metrics for 10 seconds and then dump all the data points in one shot? I am trying to collect very sensitive metrics that is being updated very frequently and I need to detect and act on unusual ups and downs.

imochurad commented:

Alright, this is what I see after trying out 260:
the NPE is gone, it is not happening anymore.
I have tested my app with the following scenarios:

  1. app launched in local env with local JMX server (1 instance) and local Graphite. Verdict: works well.
  2. app launched in local env with local JMX server (1 instance) and Hosted Graphite. Verdict: works well.
  3. app launched in Heroku with remote JMX servers (3 instances) and Hosted Graphite. Verdict: Broken pipes seen quite frequently.
    I plan on changing the FlushStrategy to timeBased to see if it helps.


imochurad commented Nov 25, 2016

I still see them, just with a different stack trace:

2016-11-25T17:33:31.883682+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.TimeBasedFlush.flush(TimeBasedFlush.java:51)
2016-11-25T17:33:31.883682+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.WriterPoolable.release(WriterPoolable.java:54)
2016-11-25T17:33:31.883683+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.WriterPoolOutputWriter.doWrite(WriterPoolOutputWriter.java:62)
2016-11-25T17:33:31.883684+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter.doWrite(ResultTransformerOutputWriter.java:52)
2016-11-25T17:33:31.883684+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.jmx.ResultProcessor$1.run(ResultProcessor.java:58)
2016-11-25T17:33:31.883685+00:00 app[worker.1]: 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
2016-11-25T17:33:31.883685+00:00 app[worker.1]: 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2016-11-25T17:33:31.883686+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2016-11-25T17:33:31.883686+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2016-11-25T17:33:31.883687+00:00 app[worker.1]: 	at java.lang.Thread.run(Thread.java:745)

What actually happens to the messages that haven't been sent to the destination due to a broken pipe? Are they accumulated until the connection is restored, or are they discarded and never reach the destination?


gehel commented Nov 26, 2016

The stack trace does not seem to be complete...

Messages that can't be sent are discarded. Having a retry queue would mean increasing memory usage in times of trouble, which is not something that appeals to me all that much.

From your analysis, it looks like there is a connectivity issue between your jmxtrans instance and Hosted Graphite. Do you have an idea of how many metrics per second you are sending? Could you check if there is a DNS resolution change at the same time as the error? Does jmxtrans recover quickly (do you see just one error every now and then, or bursts of errors at the same time)?


imochurad commented Nov 28, 2016

I have contacted Hosted Graphite support; according to their developers, who have viewed the logs, there are no indications of issues on their end.

Also, I have noticed that there seem to be issues not only with sending data to Hosted Graphite but also with querying, at least judging from the thread name: jmxtrans-query-9

2016-11-25T17:13:25.338098+00:00 app[worker.1]: 17:13:25.338 [jmxtrans-query-9] DEBUG com.googlecode.jmxtrans.model.Query - Executing queryName [org.apache.cassandra.metrics:name=Load,type=Storage] from query [Query(objectName=org.apache.cassandra.metrics:type=Storage,name=Load, keys=[], attr=[Count], typeNames=[], resultAlias=Load, useObjDomainAsKey=false, allowDottedKeys=false, useAllTypeNames=false, outputWriterInstances=[])]
2016-11-25T17:13:25.338205+00:00 app[worker.1]: java.net.SocketException: Broken pipe
2016-11-25T17:13:25.338206+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite0(Native Method)
2016-11-25T17:13:25.338207+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
2016-11-25T17:13:25.338207+00:00 app[worker.1]: 	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
2016-11-25T17:13:25.338208+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
2016-11-25T17:13:25.338208+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
2016-11-25T17:13:25.338208+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
2016-11-25T17:13:25.338209+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
2016-11-25T17:13:25.338209+00:00 app[worker.1]: 	at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
2016-11-25T17:13:25.338210+00:00 app[worker.1]: 	at java.io.BufferedWriter.flush(BufferedWriter.java:254)
2016-11-25T17:13:25.338211+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.AlwaysFlush.flush(AlwaysFlush.java:32)
2016-11-25T17:13:25.338211+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.WriterPoolable.release(WriterPoolable.java:54)
2016-11-25T17:13:25.338212+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.WriterPoolOutputWriter.doWrite(WriterPoolOutputWriter.java:62)
2016-11-25T17:13:25.338212+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter.doWrite(ResultTransformerOutputWriter.java:52)
2016-11-25T17:13:25.338213+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.jmx.ResultProcessor$1.run(ResultProcessor.java:58)
2016-11-25T17:13:25.338213+00:00 app[worker.1]: 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
2016-11-25T17:13:25.338214+00:00 app[worker.1]: 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2016-11-25T17:13:25.338214+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2016-11-25T17:13:25.338215+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2016-11-25T17:13:25.338215+00:00 app[worker.1]: 	at java.lang.Thread.run(Thread.java:745)

Then there are these:

2016-11-25T17:13:01.188622+00:00 app[worker.1]: 17:13:01.188 [jmxtrans-result-4] DEBUG com.googlecode.jmxtrans.model.output.GraphiteWriter2 - Query result: Result(attributeName=OneMinuteRate, className=com.yammer.metrics.reporting.JmxReporter$Timer, objDomain=org.apache.cassandra.metrics, typeName=type=ClientRequest,scope=Write,name=Latency, values={OneMinuteRate=7.506478122583131E-34}, epoch=1480093981188, keyAlias=ThroughputWrites)
2016-11-25T17:13:01.188711+00:00 app[worker.1]:
2016-11-25T17:13:01.188793+00:00 app[worker.1]: 17:13:01.188 [jmxtrans-result-4] DEBUG com.googlecode.jmxtrans.model.output.GraphiteWriter2 - Graphite Message: hosteg.ThroughputWrites.OneMinuteRate 7.506478122583131E-34 1480093981
2016-11-25T17:13:01.189005+00:00 app[worker.1]: java.net.SocketException: Broken pipe
2016-11-25T17:13:01.189005+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite0(Native Method)
2016-11-25T17:13:01.189006+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
2016-11-25T17:13:01.189007+00:00 app[worker.1]: 	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
2016-11-25T17:13:01.189007+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
2016-11-25T17:13:01.189008+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
2016-11-25T17:13:01.189008+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
2016-11-25T17:13:01.189009+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
2016-11-25T17:13:01.189009+00:00 app[worker.1]: 	at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
2016-11-25T17:13:01.189010+00:00 app[worker.1]: 	at java.io.BufferedWriter.flush(BufferedWriter.java:254)
2016-11-25T17:13:01.189010+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.AlwaysFlush.flush(AlwaysFlush.java:32)
2016-11-25T17:13:01.189011+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.WriterPoolable.release(WriterPoolable.java:54)
2016-11-25T17:13:01.189011+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.WriterPoolOutputWriter.doWrite(WriterPoolOutputWriter.java:62)
2016-11-25T17:13:01.179107+00:00 app[worker.1]: 17:13:01.178 [ServerScheduler_Worker-7] DEBUG org.quartz.core.JobRunShell - Calling execute on job ServerJob.[node]:[port]-1480091252529-6293319566
2016-11-25T17:13:01.189012+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter.doWrite(ResultTransformerOutputWriter.java:52)
2016-11-25T17:13:01.189012+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.jmx.ResultProcessor$1.run(ResultProcessor.java:58)
2016-11-25T17:13:01.189013+00:00 app[worker.1]: 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
2016-11-25T17:13:01.189014+00:00 app[worker.1]: 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2016-11-25T17:13:01.189014+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2016-11-25T17:13:01.189014+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2016-11-25T17:13:01.189015+00:00 app[worker.1]: 	at java.lang.Thread.run(Thread.java:745)


gehel commented Nov 28, 2016

It looks like logs from multiple threads are interleaved. As far as I can see from the stack traces, all errors come from writing to Graphite, not from reading over JMX. It might make sense to tune the logging config to ensure log messages are not mixed together (check the log4j config, and feel free to submit a patch).

This still looks like a communication error between jmxtrans and Graphite. It could be the network, a firewall, a change of IP somewhere... a lot of different things.

Do you have an idea of the frequency and distribution of those errors?


imochurad commented Nov 30, 2016

Yeah, you were right regarding the logs: the problem was a misconfigured Logback setup, plus the fact that Heroku log drains do not preserve the order of messages dumped to stdout. I have now configured Logback to output only error-level messages, just to check on the frequency and to better capture stack traces. Here is an example:

2016-11-30T15:08:00.728731+00:00 app[worker.1]: 15:08:00.728 [jmxtrans-result-0] ERROR com.googlecode.jmxtrans.model.output.support.pool.WriterPoolable - Could not flush writer :: java.net.SocketException: Broken pipe
2016-11-30T15:08:00.728734+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite0(Native Method)
2016-11-30T15:08:00.728735+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
2016-11-30T15:08:00.728736+00:00 app[worker.1]: 	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
2016-11-30T15:08:00.728736+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
2016-11-30T15:08:00.728737+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
2016-11-30T15:08:00.728737+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
2016-11-30T15:08:00.728738+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
2016-11-30T15:08:00.728738+00:00 app[worker.1]: 	at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
2016-11-30T15:08:00.728738+00:00 app[worker.1]: 	at java.io.BufferedWriter.flush(BufferedWriter.java:254)
2016-11-30T15:08:00.728739+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.TimeBasedFlush.flush(TimeBasedFlush.java:51)
2016-11-30T15:08:00.728740+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.WriterPoolable.release(WriterPoolable.java:54)
2016-11-30T15:08:00.728740+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.WriterPoolOutputWriter.doWrite(WriterPoolOutputWriter.java:62)
2016-11-30T15:08:00.728741+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter.doWrite(ResultTransformerOutputWriter.java:52)
2016-11-30T15:08:00.728741+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.jmx.ResultProcessor$1.run(ResultProcessor.java:58)
2016-11-30T15:08:00.728741+00:00 app[worker.1]: 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
2016-11-30T15:08:00.728742+00:00 app[worker.1]: 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2016-11-30T15:08:00.728742+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2016-11-30T15:08:00.728743+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2016-11-30T15:08:00.728743+00:00 app[worker.1]: 	at java.lang.Thread.run(Thread.java:745)
2016-11-30T15:08:10.731008+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite0(Native Method)
2016-11-30T15:08:10.731009+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
2016-11-30T15:08:10.731010+00:00 app[worker.1]: 	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
2016-11-30T15:08:10.731011+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
2016-11-30T15:08:10.731011+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
2016-11-30T15:08:10.731012+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
    private static final String GRAPHITE_WRITER_FACTORY_FLUSH_STRATEGY = "timeBased";
    private static final Integer GRAPHITE_WRITER_FACTORY_FLUSH_DELAY_IN_SECONDS = 10;
    private static final Integer GRAPHITE_WRITER_FACTORY_POOL_SIZE = 10;

I do have 3 remote JMX servers from which I collect metrics.
At the moment I collect only 3 metric types from those servers, at different frequencies: 5 seconds, 60 seconds and 60 seconds. To accommodate the different frequencies, I double the number of jmxtrans server definitions.
That is my setup as of now.

You mentioned that network latency could be an issue. Could it be related to a very aggressive timeout set on acquiring the socket connection? Maybe it could be increased.

With regards to how often the Broken pipe issue appears in the logs: I had the app running for 15 minutes and saw 31 occurrences of the stack trace, which is quite high.
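The error-only Logback setup mentioned above can be sketched roughly like this (the appender name and pattern are illustrative, not the exact config used):

```xml
<configuration>
  <!-- Only ERROR-level messages, so stack traces stand out in the Heroku drain -->
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="ERROR">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>
```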


imochurad commented Nov 30, 2016

I have also spotted a few of these:

2016-11-30T15:24:45.905067+00:00 app[worker.1]: 15:24:45.904 [jmxtrans-result-6] ERROR com.googlecode.jmxtrans.model.output.support.pool.WriterPoolable - Could not flush writer :: java.net.SocketException: Connection timed out
2016-11-30T15:24:45.905071+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite0(Native Method)
2016-11-30T15:24:45.905072+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
2016-11-30T15:24:45.905072+00:00 app[worker.1]: 	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
2016-11-30T15:24:45.905073+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
2016-11-30T15:24:45.905073+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)
2016-11-30T15:24:45.905074+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:295)
2016-11-30T15:24:45.905074+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:141)
2016-11-30T15:24:45.905075+00:00 app[worker.1]: 	at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:229)
2016-11-30T15:24:45.905075+00:00 app[worker.1]: 	at java.io.BufferedWriter.flush(BufferedWriter.java:254)
2016-11-30T15:24:45.905076+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.TimeBasedFlush.flush(TimeBasedFlush.java:51)
2016-11-30T15:24:45.905076+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.pool.WriterPoolable.release(WriterPoolable.java:54)
2016-11-30T15:24:45.905076+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.WriterPoolOutputWriter.doWrite(WriterPoolOutputWriter.java:62)
2016-11-30T15:24:45.905077+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.model.output.support.ResultTransformerOutputWriter.doWrite(ResultTransformerOutputWriter.java:52)
2016-11-30T15:24:45.905077+00:00 app[worker.1]: 	at com.googlecode.jmxtrans.jmx.ResultProcessor$1.run(ResultProcessor.java:58)
2016-11-30T15:24:45.905078+00:00 app[worker.1]: 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
2016-11-30T15:24:45.905078+00:00 app[worker.1]: 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2016-11-30T15:24:45.905080+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2016-11-30T15:24:45.905080+00:00 app[worker.1]: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2016-11-30T15:24:45.905081+00:00 app[worker.1]: 	at java.lang.Thread.run(Thread.java:745)
2016-11-30T15:25:16.225892+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite0(Native Method)
2016-11-30T15:25:16.225895+00:00 app[worker.1]: 	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
2016-11-30T15:25:16.225896+00:00 app[worker.1]: 	at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
2016-11-30T15:25:16.225897+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
2016-11-30T15:25:16.225897+00:00 app[worker.1]: 	at sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:291)

imochurad commented:

I have also noticed that even though I see a lot of messages handed off to GraphiteWriter2, and a much lower number of errors, only a few of those messages make it to the destination (Hosted Graphite); it looks like they are being lost in the process. Might it be related to GraphiteWriterOutputFactory? Should I try switching back to GraphiteWriter? At this point I am questioning everything.


gehel commented Nov 30, 2016

Looking at the stack traces above, it seems that in all cases the issue occurs during flush (those stack traces still look /wrong/). So the connection has already been opened, but things become problematic when we actually try to send a packet. There can be lots of reasons for that...

It is expected that you lose more messages than the number of errors you see in the logs. Basically, things work like this:

  • The writer writes metrics one by one to the socket; the metrics are not actually sent over the network, but are stored in an internal buffer.
  • At some point (according to the flush strategy) a flush occurs, which actually pushes the metrics over the network. If an error occurs at this point, all the metrics that were in the buffer and not yet flushed are lost. Only one error is reported.
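The buffering behaviour described above can be sketched like this (this is not jmxtrans code, just a self-contained illustration of why one flush error can cost several metrics):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;

public class FlushLossDemo {

    // Stand-in for a socket stream whose peer has gone away: it accepts
    // bytes into a buffer but fails as soon as we try to flush them.
    static class BrokenOnFlush extends OutputStream {
        @Override public void write(int b) { /* buffered, never sent */ }
        @Override public void flush() throws IOException {
            throw new IOException("Broken pipe");
        }
    }

    // Returns how many metrics were buffered (and then lost) when the
    // single flush at the end failed.
    static int simulate() {
        BufferedWriter writer =
                new BufferedWriter(new OutputStreamWriter(new BrokenOnFlush()));
        int written = 0;
        try {
            for (int i = 0; i < 3; i++) {
                writer.write("servers.metric" + i + " 1.0 1480085145\n"); // no error here
                written++;
            }
            writer.flush(); // error surfaces only here; everything above is lost
            return 0;
        } catch (IOException e) {
            return written; // one exception reported, `written` metrics gone
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate() + " metrics lost on a single flush error");
    }
}
```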


imochurad commented Nov 30, 2016

I suspect there is an issue with GraphiteWriterFactory. I have switched back to using GraphiteWriter as an output writer; however, I ran into #519 (comment)
As a workaround, for now, I've created my own implementation of the OutputWriter:

public class SimpleGraphiteWriter extends BaseOutputWriter {

    private static final Logger LOG = LoggerFactory.getLogger(SimpleGraphiteWriter.class);

    private static final String DEFAULT_ROOT_PREFIX = "servers";

    private GenericKeyedObjectPool<InetSocketAddress, Socket> pool;

    private final String rootPrefix;
    private final InetSocketAddress address;


    @JsonCreator
    public SimpleGraphiteWriter(@JsonProperty("typeNames") ImmutableList<String> typeNames,
            @JsonProperty("booleanAsNumber") boolean booleanAsNumber,
            @JsonProperty("debug") Boolean debugEnabled,
            @JsonProperty("rootPrefix") String rootPrefix, @JsonProperty("host") String host,
            @JsonProperty("port") Integer port,
            @JsonProperty("settings") Map<String, Object> settings) {
        super(typeNames, booleanAsNumber, debugEnabled, settings);
        this.rootPrefix = firstNonNull(rootPrefix, (String) getSettings().get("rootPrefix"),
                DEFAULT_ROOT_PREFIX);
        if (host == null) {
            host = (String) getSettings().get(HOST);
        }
        if (host == null) {
            throw new NullPointerException("Host cannot be null.");
        }
        if (port == null) {
            port = Settings.getIntegerSetting(getSettings(), PORT, null);
        }
        if (port == null) {
            throw new NullPointerException("Port cannot be null.");
        }
        this.address = new InetSocketAddress(host, port);
        this.pool = createPool(new SocketFactory(), SocketFactory.class.getSimpleName());
    }

    private <K, V> GenericKeyedObjectPool<K, V> createPool(
            KeyedPoolableObjectFactory<K, V> factory, String poolName) {
        GenericKeyedObjectPool<K, V> pool = new GenericKeyedObjectPool<>(factory);
        pool.setTestOnBorrow(true);
        pool.setMaxActive(-1);
        pool.setMaxIdle(-1);
        pool.setTimeBetweenEvictionRunsMillis(MILLISECONDS.convert(5, MINUTES));
        pool.setMinEvictableIdleTimeMillis(MILLISECONDS.convert(5, MINUTES));
        return pool;
    }

    @Override
    public void validateSetup(Server server, Query query) throws ValidationException {
    }

    @Override
    protected void internalWrite(Server server, Query query, ImmutableList<Result> results)
            throws Exception {

        Socket socket = null;
        PrintWriter writer = null;

        try {
            socket = pool.borrowObject(address);
            // socket = new Socket(address.getHostString(), address.getPort());
            writer = new PrintWriter(new OutputStreamWriter(socket.getOutputStream(), UTF_8),
                    false);

            List<String> typeNames = this.getTypeNames();

            for (Result result : results) {
                LOG.debug("Query result: {}", result);
                Map<String, Object> resultValues = result.getValues();
                for (Entry<String, Object> values : resultValues.entrySet()) {
                    Object value = values.getValue();
                    if (isNumeric(value)) {

                        String line = KeyUtils
                                .getKeyString(server, query, result, values, typeNames, rootPrefix)
                                .replaceAll("[()]", "_") + " " + value.toString() + " "
                                + result.getEpoch() / 1000 + "\n";
                        LOG.debug("Graphite Message: {}", line);
                        writer.write(line);
                    } else {
                        LOG.debug(
                                "Unable to submit non-numeric value to Graphite: [{}] from result [{}]",
                                value, result);
                    }
                }
            }
            writer.flush();
        } catch (Exception e) {
            LOG.error("Unable to write data to Graphite", e);
        } finally {
            if (socket == null) {
                // borrowObject failed; there is nothing to return or invalidate
            } else if (writer != null && writer.checkError()) {
                LOG.error("Error writing to Graphite, clearing Graphite socket pool");
                pool.invalidateObject(address, socket);
            } else {
                pool.returnObject(address, socket);
            }
        }
    }


    public static final class Builder {
        private final ImmutableList.Builder<String> typeNames = ImmutableList.builder();
        private boolean booleanAsNumber;
        private Boolean debugEnabled;
        private String rootPrefix;
        private String host;
        private Integer port;

        private Builder() {}

        public Builder addTypeNames(List<String> typeNames) {
            this.typeNames.addAll(typeNames);
            return this;
        }

        public Builder addTypeName(String typeName) {
            typeNames.add(typeName);
            return this;
        }

        public Builder setBooleanAsNumber(boolean booleanAsNumber) {
            this.booleanAsNumber = booleanAsNumber;
            return this;
        }

        public Builder setDebugEnabled(boolean debugEnabled) {
            this.debugEnabled = debugEnabled;
            return this;
        }

        public Builder setRootPrefix(String rootPrefix) {
            this.rootPrefix = rootPrefix;
            return this;
        }

        public Builder setHost(String host) {
            this.host = host;
            return this;
        }

        public Builder setPort(int port) {
            this.port = port;
            return this;
        }

        public SimpleGraphiteWriter build() {
            return new SimpleGraphiteWriter(typeNames.build(), booleanAsNumber, debugEnabled,
                    rootPrefix, host, port, null);
        }
    }
}

It is very similar to the GraphiteWriter implementation, except that it doesn't rely on Guice injection (which is broken in my case for whatever reason) and it flushes explicitly.
Works like a charm: no more broken pipes or connection timeouts.
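For anyone who wants to try this, the writer would presumably be wired in through the same @class mechanism as the built-in writers in the JSON config; the package name below is hypothetical:

```json
"outputWriters" : [ {
  "@class" : "com.example.metrics.SimpleGraphiteWriter",
  "rootPrefix" : "hosted_graphite_api_key",
  "host" : "carbon.hostedgraphite.com",
  "port" : 2003
} ]
```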
