Missing data in received metrics from multiple JMX instances. #329

Closed
spynode opened this issue Sep 13, 2015 · 14 comments

spynode commented Sep 13, 2015

I am trying to gather metrics from 8 applications, each of which holds ~2000-4000 mbeans with a varying number of attributes each, for an approximate total of 200K metrics. I send them to a Graphite host. When configuring jmxtrans to gather metrics from just 2-4 hosts, the information is complete, without any data missing, but after enabling all 8 JMX hosts in the jmxtrans configuration there are holes in every metric received in Graphite. Initially I tried to configure jmxtrans with a wildcard mbean definition; later I generated a JSON file listing all mbeans explicitly, but with the same results. I don't use attribute definitions, letting jmxtrans gather them all.
Here is part of the configuration:

{
    "servers": [
        {            
            "host": "xx.xx.xx.xx",
            "port": "8082",
            "username": "xx",
            "password": "xx",
            "alias": "appserv1",
            "numQueryThreads" : "32",
            "cronExpression": "0 0/1 * * * ?",
            "queries": [
               {
                   "obj": "app:name=a.async.request_latency_ms",
                   "attr": [],
                   "keys": [],
                   "allowDottedKeys": "true",
                   "resultAlias": "app",
                   "outputWriters": [
                        {
                            "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
                            "settings": {
                                "port": 2003,
                                "host": "127.0.0.1",
                                "typeNames": [
                                    "name"
                                ],
                                "rootPrefix": "STAGE"
                            }
                        }
                    ]
            },
               {
                   "obj": "app:name=b.async.request_latency_ms",
                   "attr": [],
                   "keys": [],
                   "allowDottedKeys": "true",
                   "resultAlias": "app",
                   "outputWriters": [
                        {
                            "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
                            "settings": {
                                "port": 2003,
                                "host": "127.0.0.1",
                                "typeNames": [
                                    "name"
                                ],
                                "rootPrefix": "STAGE"
                            }
                        }
                    ]
            }, ...

Is there any reason why jmxtrans would skip some data under higher load or with larger incoming data sets? When I search the jmxtrans logs for a particular metric name, I see that it is not queried every minute; it appears in the logs at the same irregular intervals at which the metric shows up in Graphite. I cannot identify anything in the jmxtrans logs that would suggest a problem, even with DEBUG log level.
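
For reference, the wildcard variant mentioned above was a query of roughly this shape; the exact ObjectName pattern is not preserved in this report, so the "app:*" pattern below is only illustrative (writer settings copied from the excerpt above):

               {
                   "obj": "app:*",
                   "attr": [],
                   "keys": [],
                   "allowDottedKeys": "true",
                   "resultAlias": "app",
                   "outputWriters": [
                        {
                            "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
                            "settings": {
                                "port": 2003,
                                "host": "127.0.0.1",
                                "typeNames": [
                                    "name"
                                ],
                                "rootPrefix": "STAGE"
                            }
                        }
                    ]
               }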


gehel commented Sep 13, 2015

Could you tell us which version of JmxTrans you are using? There are known issues with the way TCP connections are managed in older versions (I could check exactly which ones), but they should be fixed in recent versions.


spynode commented Sep 14, 2015

It is the latest build, 251.


gehel commented Sep 14, 2015

Any chance you could send a thread dump taken while you are seeing the issue? Do you see memory or CPU starvation on the jmxtrans machine? Do you see errors on the Graphite side?
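
(For what it's worth, a thread dump can be captured with the JDK's jstack tool while the problem is occurring; <jmxtrans-pid> below is a placeholder for the jmxtrans JVM's process ID, which can be found with jps or ps.)

jstack <jmxtrans-pid> > /tmp/jmxtrans-threaddump.txt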


spynode commented Sep 14, 2015

On a 4 CPU VM, load average: 1.98, 1.89, 1.82

total used free shared buffers cached
Mem: 3830 3715 114 0 673 1284
-/+ buffers/cache: 1758 2072
Swap: 2047 100 1947

There are no errors on the Graphite side apart from "invalid line received from client 127.0.0.1:49933, ignoring", which comes from malformed metrics derived from objects with non-numeric attributes.
The thing is, the metrics that are missing from Graphite are never sent to it at all. Searching the jmxtrans log file for a specific metric name returns "DEBUG c.g.j.model.output.GraphiteWriter - Graphite Message: <metric name>" records at irregular intervals, mostly longer than 1 minute.


gehel commented Sep 14, 2015

Any chance to get a thread dump?


gehel commented Sep 14, 2015

It seems that you are running with logs at DEBUG level and that this causes contention on the log file. Could you send another thread dump with logs at INFO or WARN level, so that we can see whether there is contention elsewhere?

If you want to keep logs at DEBUG, you could try configuring an async logger with an overflow policy; that should ensure that logging does not block the main functions.
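
A minimal sketch of such an async logger in logback; the appender names "ASYNC" and "FILE" and the queue sizes are placeholders, not taken from the jmxtrans packaging. With a non-zero discardingThreshold, logback drops TRACE/DEBUG/INFO events instead of blocking once the queue runs low on space:

<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <!-- buffer log events and hand them to the file appender on a separate thread -->
    <queueSize>5000</queueSize>
    <!-- when fewer than 500 slots remain, discard TRACE/DEBUG/INFO events rather than block -->
    <discardingThreshold>500</discardingThreshold>
    <appender-ref ref="FILE"/>
</appender>

<root level="debug">
    <appender-ref ref="ASYNC"/>
</root>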


spynode commented Sep 14, 2015

The thread dumps I sent before were taken with log level INFO. I've sent you a new thread dump taken with log level WARN.


gehel commented Sep 14, 2015

It seems that logs are still at DEBUG level. The logging configuration has been utterly broken in JmxTrans for quite some time ... :-( Do you still see DEBUG level messages in the logs?

PR #288 should fix the logging configuration issue, but it needs to be rebased before it can be merged. I have not had time to do that yet... If you have some time and energy, could you try to rebase it and see if it helps? I'd be happy to review and merge it, but I'm not sure I have the time to do it myself right now...


spynode commented Sep 15, 2015

With log levels INFO or WARN, nothing is actually logged to the file. With log level NONE, there is still a large number of logging-related threads, but again nothing is actually logged. If you think #288 can fix the issue, I will consider rebasing it.


gehel commented Sep 15, 2015

That's strange ... Your thread dumps show that most threads are blocked in logback's OutputStreamAppender, and this seems to come from DEBUG level messages. #288 will probably not fix the issue, but it might help in diagnosing the problem.

Are you installing from the RPM, the DEB, or just running the jar directly?


spynode commented Sep 15, 2015

I am installing release 251 from the RPM.


gehel commented Nov 26, 2015

Version 252 has been released, with minor improvements to logging. Any chance you could try again? Or has your issue been magically solved?


spynode commented Nov 18, 2016

Sorry for not commenting on this for so long. I returned to using Jmxtrans after a period of trying other tools. Occasional holes in metrics from random endpoints can still be observed, but it is unclear whether they have the same root cause; they correlate with other JMX host(s) being unavailable for connection for longer periods of time. I am closing this issue and will report a new one when I am confident about the facts gathered.

spynode closed this as completed Nov 18, 2016

gehel commented Nov 18, 2016

@spynode thanks for the house cleaning!
