Missing data in received metrics from multiple JMX instances. #329

Closed
spynode opened this issue Sep 13, 2015 · 14 comments

spynode commented Sep 13, 2015

I am trying to gather metrics from 8 applications, each of which holds ~2000-4000 mbeans with a varying number of attributes each, for an approximate total of 200K metrics. I send them to a Graphite host. When configuring jmxtrans to gather metrics from just 2-4 hosts, the information is complete, without any data missing, but after enabling all 8 JMX hosts in the jmxtrans configuration there are holes in every metric received in Graphite. Initially I tried to configure jmxtrans with a wildcard mbean definition; later I generated a JSON file listing all mbeans explicitly, but with the same results. I don't use attribute definitions, letting jmxtrans gather them all.
Here is part of the configuration:

{
    "servers": [
        {            
            "host": "xx.xx.xx.xx",
            "port": "8082",
            "username": "xx",
            "password": "xx",
            "alias": "appserv1",
            "numQueryThreads" : "32",
            "cronExpression": "0 0/1 * * * ?",
            "queries": [
               {
                   "obj": "app:name=a.async.request_latency_ms",
                   "attr": [],
                   "keys": [],
                   "allowDottedKeys": "true",
                   "resultAlias": "app",
                   "outputWriters": [
                        {
                            "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
                            "settings": {
                                "port": 2003,
                                "host": "127.0.0.1",
                                "typeNames": [
                                    "name"
                                ],
                                "rootPrefix": "STAGE"
                            }
                        }
                    ]
            },
               {
                   "obj": "app:name=b.async.request_latency_ms",
                   "attr": [],
                   "keys": [],
                   "allowDottedKeys": "true",
                   "resultAlias": "app",
                   "outputWriters": [
                        {
                            "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
                            "settings": {
                                "port": 2003,
                                "host": "127.0.0.1",
                                "typeNames": [
                                    "name"
                                ],
                                "rootPrefix": "STAGE"
                            }
                        }
                    ]
            }, ...

Is there any reason why jmxtrans would skip some data under higher load or with larger incoming data sets? When I search the jmxtrans logs for a particular metric name, I see that it is not queried every minute; it appears in the logs at the same irregular intervals at which the metric shows up in Graphite. I cannot identify anything in the jmxtrans logs that would suggest a problem, even with DEBUG log level.
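
For reference, the wildcard variant mentioned above was a query of roughly this shape; the exact ObjectName pattern is not preserved in this report, so the "app:*" pattern below is only illustrative (writer settings copied from the excerpt above):

               {
                   "obj": "app:*",
                   "attr": [],
                   "keys": [],
                   "allowDottedKeys": "true",
                   "resultAlias": "app",
                   "outputWriters": [
                        {
                            "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
                            "settings": {
                                "port": 2003,
                                "host": "127.0.0.1",
                                "typeNames": [
                                    "name"
                                ],
                                "rootPrefix": "STAGE"
                            }
                        }
                    ]
               }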


gehel commented Sep 13, 2015

Could you tell us which version of JmxTrans you are using? There are known issues with the way TCP connections are managed in older versions (I could check exactly which ones), but they should be fixed in recent versions.


spynode commented Sep 14, 2015

It is the latest build, 251.


gehel commented Sep 14, 2015

Any chance you could send a thread dump taken while you are seeing the issue? Do you see memory or CPU starvation on the jmxtrans machine? Do you see errors on the Graphite side?
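
(For what it's worth, a thread dump can be captured with the JDK's jstack tool while the problem is occurring; <jmxtrans-pid> below is a placeholder for the jmxtrans JVM's process ID, which can be found with jps or ps.)

jstack <jmxtrans-pid> > /tmp/jmxtrans-threaddump.txt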


spynode commented Sep 14, 2015

On a 4 CPU VM, load average: 1.98, 1.89, 1.82

total used free shared buffers cached
Mem: 3830 3715 114 0 673 1284
-/+ buffers/cache: 1758 2072
Swap: 2047 100 1947

There are no errors on the Graphite side apart from "invalid line received from client 127.0.0.1:49933, ignoring", which comes from malformed metrics derived from objects with non-numeric attributes.
The thing is, the metrics that are missing from Graphite are never sent to it at all. Searching the jmxtrans log file for a specific metric name returns "DEBUG c.g.j.model.output.GraphiteWriter - Graphite Message: <metric name>" records at irregular intervals, mostly longer than 1 minute.


gehel commented Sep 14, 2015

Any chance to get a thread dump?


gehel commented Sep 14, 2015

It seems that you are running with logs at DEBUG level and that this causes contention on the log file. Could you send another thread dump with logs at INFO or WARN level, so that we can see whether there is contention elsewhere?

If you want to keep logs at DEBUG, you could try configuring an async logger with an overflow policy; that should ensure that logging does not block the main functions.
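
A minimal sketch of such an async logger in logback; the appender names "ASYNC" and "FILE" and the queue sizes are placeholders, not taken from the jmxtrans packaging. With a non-zero discardingThreshold, logback drops TRACE/DEBUG/INFO events instead of blocking once the queue runs low on space:

<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <!-- buffer log events and hand them to the file appender on a separate thread -->
    <queueSize>5000</queueSize>
    <!-- when fewer than 500 slots remain, discard TRACE/DEBUG/INFO events rather than block -->
    <discardingThreshold>500</discardingThreshold>
    <appender-ref ref="FILE"/>
</appender>

<root level="debug">
    <appender-ref ref="ASYNC"/>
</root>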


spynode commented Sep 14, 2015

The thread dumps I sent before were taken with log level INFO. I've sent you a new thread dump taken with log level WARN.


gehel commented Sep 14, 2015

It seems that logs are still at DEBUG level. The logging configuration has been utterly broken in JmxTrans for quite some time ... :-( Do you still see DEBUG level messages in the logs?

PR #288 should fix the logging configuration issue, but it needs to be rebased before it can be merged. I have not had time to do that yet... If you have some time and energy, could you try to rebase it and see if it helps? I'd be happy to review and merge it, but I'm not sure I have the time to do it myself right now...


spynode commented Sep 15, 2015

With log levels INFO or WARN, nothing is actually logged to the file. With log level NONE, there is still a large number of logging-related threads, but again nothing is actually logged. If you think #288 can fix the issue, I will consider rebasing it.


gehel commented Sep 15, 2015

That's strange ... Your thread dumps show that most threads are blocked in logback's OutputStreamAppender, and this seems to come from DEBUG level messages. #288 will probably not fix the issue, but it might help in diagnosing the problem.

Are you installing from the RPM, the DEB, or just running the jar directly?


spynode commented Sep 15, 2015

I am installing release 251 from the RPM.


gehel commented Nov 26, 2015

Version 252 has been released, with minor improvements to logging. Any chance you could try again? Or has your issue been magically solved?


spynode commented Nov 18, 2016

Sorry for not commenting on this for so long. I returned to using Jmxtrans after a period of trying other tools. Occasional holes in metrics from random endpoints can still be observed, but it is unclear whether they have the same root cause; they correlate with other JMX host(s) being unavailable for connection for longer periods of time. I am closing this issue and will report a new one when I am confident about the facts gathered.

spynode closed this as completed Nov 18, 2016

gehel commented Nov 18, 2016

@spynode thanks for the house cleaning!
