Problem with MongoDB plug-in #5326

Closed
SteveH-US opened this issue Jan 22, 2019 · 13 comments · Fixed by #6307

@SteveH-US

SteveH-US commented Jan 22, 2019

Relevant telegraf.conf: 1.9.1

System info:

Ubuntu 14.04


Steps to reproduce:

  1. deploy telegraf on MongoDB mongos server
  2. observe results

Expected behavior:

"mongodb_shard_stats" measurement populated

Actual behavior:

The following error is produced

2019-01-22T17:26:40Z E! Error getting first oplog entry (Can't use 'local' database through mongos)

and no metrics are posted

Additional info:

I think I found a problem while reviewing the code. The same "gatherMetrics" method attempts to get oplog details from the "local.oplog.rs" collection and chunk details from the "chunks" collection. The only part of a sharded MongoDB cluster that has both replica set details and chunk details is the config replica set. However, the metrics produced by the config server are not representative of the load being put on the sharded cluster, which goes through the mongos and the shards.

Please advise. What am I missing?


@danielnelson
Contributor

Can you point the plugin at the mongod servers only?

@SteveH-US
Author

I believe you mean pointing telegraf at the shards (replica sets) themselves. Yes, I can. However, not only does the "mongodb_shard_stats" measurement not get populated, but even if it were, the JSON documents returned from the 'mongod' are empty and not interesting. For the "shardConnPoolStats" results to be useful, one would have to run the command from the 'mongos'. However, doing so produces an error when telegraf erroneously tries to get 'oplog' details from that 'mongos', where they do not exist.

@danielnelson
Contributor

Would it be possible to comment out this line and see if any errors remain?

oplogStats := s.gatherOplogStats()

@SteveH-US
Author

Once I get a deployment running again with the updated MongoDB plug-in, I'll let you know if there are other errors. Since the "shardConnPoolStats" admin command is running against the shard members, I would have expected metrics to be sent to the "mongodb_shard_stats" measurement, but that isn't happening. It could be because the data returned from that command on the shards would be empty anyway.

I have no doubt that the call to the "shardConnPoolStats" admin command will work with a mongos. As previously described, the code is simply not correct.

@danielnelson
Contributor

@SteveH-US I opened a pull request which essentially just skips over this error and continues.

@SteveH-US
Author

Hi Daniel,

Thanks for taking this on. However, it appears that you may still get an error at line 71 of "mongodb_server.go" when retrieving the "Timestamp". Even if the "op_first_time.Timestamp" property is initialized, the "stats" would be invalid. When reporting on these metrics, one would have to explicitly remove the mongos oplog metrics; otherwise, the stats from them would throw off the calculations.

IMHO, this plug-in ought not be reporting opLog metrics for mongos at all.

@danielnelson
Contributor

Thanks for taking a look, I see what you mean.

We already were doing a check to see if we are in a replica set, so I've updated the code to skip the oplog completely if we are not in a replica set. I also made it so the oplog field is not added if the oplog collection cannot be queried. Can you take another look?

Also, following up on your original comment about chunks, do you think we should do the same for these: only look them up if we are connected to a replica set member?
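
For readers following along, here is a minimal sketch of the gating described above. It is not the Telegraf source: `nodeType`, `oplogQuery`, `gatherFields`, and `errNoOplog` are hypothetical names, and the field name is only illustrative. It assumes the oplog is queried only for replica set members, and that a failed oplog query leaves the field out rather than failing the whole gather.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeType is a hypothetical classification of the server being polled.
type nodeType int

const (
	standalone nodeType = iota
	replicaSetMember
	mongos
)

// errNoOplog is a hypothetical error standing in for a failed oplog query.
var errNoOplog = errors.New("cannot query local.oplog.rs")

// oplogQuery is a hypothetical stand-in for the real oplog lookup.
// It returns the oplog window in seconds.
type oplogQuery func() (int64, error)

// gatherFields builds the field map for one poll. The oplog is only
// consulted for replica set members, and the field is omitted entirely
// when the query fails, so no misleading zero value is reported.
func gatherFields(nt nodeType, query oplogQuery) map[string]interface{} {
	fields := map[string]interface{}{}
	if nt != replicaSetMember {
		return fields // mongos and standalone nodes: skip the oplog
	}
	window, err := query()
	if err != nil {
		return fields // oplog not readable: leave the field out
	}
	fields["repl_oplog_window_sec"] = window // illustrative field name
	return fields
}

func main() {
	ok := func() (int64, error) { return 3600, nil }
	failing := func() (int64, error) { return 0, errNoOplog }

	fmt.Println(gatherFields(replicaSetMember, ok))
	fmt.Println(gatherFields(replicaSetMember, failing))
	fmt.Println(gatherFields(mongos, ok))
}
```

Omitting the field, rather than writing a zero, keeps mongos hosts out of any oplog calculation done downstream.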

@SteveH-US
Author

Actually, you can only look for the config.chunks if the cluster member you're running on is a mongos.

@danielnelson
Contributor

Okay, right now we are still reporting jumbo_chunks=0i when connected to a mongos. I think I will leave it as is for now, it seems technically correct if not useful.

If someone reading this has a system with jumbo chunks, I would love to see the output of this on a mongos and on a shardsvr mongod:

> db.getSiblingDB("config").getCollection("chunks").find({"jumbo": true})

@SteveH-US
Author

SteveH-US commented Aug 28, 2019 via email

@danielnelson
Contributor

Yeah that makes a lot of sense based on my limited understanding. Do you think you would be able to research the queries we would need for this and create a new issue for this?

The mongodb plugin is near the point where we need to consider a redesign, as the library we are using is no longer under active development and there are some issues supporting newer MongoDB versions. We may want to make a clean break and do a v2 of this plugin, and it would be really helpful to have a good list of what is important to bring over (as well as what we should skip).

@SteveH-US
Author

Sure, I could.

As far as metrics, there is a "changelog" collection in the config database that has details about what the balancer is doing. Monitoring the balancer is probably the most interesting thing to monitor from the sharded cluster, other than changes to the shard cluster configuration itself.

Here's an example of the type of changes recorded in one of the config DB collections.

$ db.changelog.aggregate([{$match:{"time":{ "$gte" : new Date(ISODate().getTime() - 1000 * 3600 * 24 * 1) }}}, {$group:{_id:{ns:"$ns",what:"$what"},count:{$sum:1}}},{$sort:{_id:1}}])

{ "_id" : { "ns" : "MyDB.BankTransactions", "what" : "multi-split" }, "count" : 3 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.commit" }, "count" : 24 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.error" }, "count" : 405 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.from" }, "count" : 429 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.start" }, "count" : 429 }
{ "_id" : { "ns" : "MyDB.JournalEntries", "what" : "moveChunk.to" }, "count" : 24 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.commit" }, "count" : 320 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.error" }, "count" : 78 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.from" }, "count" : 398 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.start" }, "count" : 398 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "moveChunk.to" }, "count" : 320 }
{ "_id" : { "ns" : "MyDB.Pays", "what" : "multi-split" }, "count" : 108 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.commit" }, "count" : 6 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.error" }, "count" : 423 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.from" }, "count" : 429 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.start" }, "count" : 429 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "moveChunk.to" }, "count" : 6 }
{ "_id" : { "ns" : "MyDB.RawEvents", "what" : "multi-split" }, "count" : 38 }
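
If a plug-in did collect this, the grouping could also be done client-side. The following is a rough sketch only, assuming the changelog documents have already been fetched and decoded; `changelogEntry`, `groupKey`, `groupCount`, and `countByNsAndWhat` are hypothetical names, not part of Telegraf or any MongoDB driver. It mirrors the three stages of the aggregation above: filter by time, count per (ns, what), and sort by key.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// changelogEntry is a hypothetical, already-decoded config.changelog document.
type changelogEntry struct {
	Time time.Time
	NS   string
	What string
}

// groupKey mirrors the {ns, what} _id used in the $group stage.
type groupKey struct {
	NS   string
	What string
}

// groupCount is one row of the result.
type groupCount struct {
	Key   groupKey
	Count int
}

// countByNsAndWhat counts entries at or after since, grouped by namespace
// and event type, and returns the groups sorted by namespace, then event.
func countByNsAndWhat(entries []changelogEntry, since time.Time) []groupCount {
	counts := make(map[groupKey]int)
	for _, e := range entries {
		if e.Time.Before(since) {
			continue // equivalent of $match: time >= since
		}
		counts[groupKey{NS: e.NS, What: e.What}]++
	}

	out := make([]groupCount, 0, len(counts))
	for k, c := range counts {
		out = append(out, groupCount{Key: k, Count: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Key.NS != out[j].Key.NS {
			return out[i].Key.NS < out[j].Key.NS
		}
		return out[i].Key.What < out[j].Key.What
	})
	return out
}

func main() {
	now := time.Now()
	entries := []changelogEntry{
		{Time: now.Add(-1 * time.Hour), NS: "MyDB.Pays", What: "moveChunk.start"},
		{Time: now.Add(-2 * time.Hour), NS: "MyDB.Pays", What: "moveChunk.error"},
		{Time: now.Add(-48 * time.Hour), NS: "MyDB.Pays", What: "multi-split"}, // outside the window
	}
	for _, g := range countByNsAndWhat(entries, now.Add(-24*time.Hour)) {
		fmt.Printf("%s %s %d\n", g.Key.NS, g.Key.What, g.Count)
	}
}
```

Each (ns, what) pair maps naturally onto tags with the count as a field, though that mapping is a design assumption rather than anything decided in this thread.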

However, MongoDB only lists one of the several collections in that database, this "changelog", but states we ought not depend on it.

Following their advice, plug-ins like this one ought not be querying the "config" database. As such, I'm not sure it makes sense for a plug-in recording shard level metrics; unless your team wants to be on the hook for reacting to changes MDB makes in this database.

What do you think?

@danielnelson
Contributor

Yeah that's a tricky one to answer. If there is no public API and the information is important to monitor, we may have no choice but to implement internal queries and deal with the fallout. It may make sense going forward to do a better job of making the distinction between public and internal APIs and segregating the code.

One thing that could reduce our need to do queries against the internal databases is if we had a plugin that allowed ad-hoc queries against MongoDB (#4252). However, this really just pushes the problem over to the users of the plugin.
