Request: shard (group) count metrics for influxdb #1221

Closed
francisdb opened this issue May 18, 2016 · 19 comments · Fixed by #1222

Comments

@francisdb

We had problems with InfluxDB when large numbers of new shards were being created. (influxdata/influxdb#6635)
With this metric we would have been able to correlate our crashes with the shard creation.

@sparrc
Contributor

sparrc commented May 18, 2016

good idea, should be simple to implement as well

@francisdb
Author

@sparrc that was fast! Thanks

@francisdb
Author

I can confirm this is working correctly with telegraf 0.13.1

@francisdb
Author

@sparrc this is not what I expected. What we need is a gauge, so we can see when shards are cleaned up as well as when they are created. Repeated creation followed by deletion indicates that old data is being inserted.

@sparrc
Contributor

sparrc commented May 24, 2016

I don't understand what you mean

@francisdb
Author

This is the output in grafana:

image

But in fact the number of shards is not growing all the time; we currently have about 100 shards. This will fall back to ~70 once retention kicks in.

So I expect the chart to go like
70 -> 100 -> 70 -> 100
instead of
70 -> 100 -> 130 -> 160

The first pattern indicates that something is writing old data that is being cleaned up all the time.
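
The difference between the two patterns can be illustrated with a small simulation. This is only a sketch: the function name `simulate` is hypothetical, and the baseline of 70 shards and bursts of 30 are taken from the approximate figures above, not from measurements.

```python
def simulate(cycles, base=70, burst=30):
    """Return (cumulative, gauge) shard counts sampled after each phase.

    cumulative: counts every shard ever created and never decreases.
    gauge:      counts only the shards that currently exist.
    """
    cumulative, gauge = [base], [base]
    total = base
    for _ in range(cycles):
        # Old data is written, so `burst` new shards are created.
        total += burst
        cumulative.append(total)
        gauge.append(base + burst)
        # Retention deletes them again: only the gauge drops back.
        cumulative.append(total)
        gauge.append(base)
    return cumulative, gauge


cumulative, gauge = simulate(2)
print(cumulative)  # [70, 100, 100, 130, 130]
print(gauge)       # [70, 100, 70, 100, 70]
```

Only the gauge series makes the repeated create-and-delete cycle visible.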

@sparrc
Contributor

sparrc commented May 24, 2016

can you provide the output from http://<host>:8086/debug/vars in a gist or attachment?

@francisdb
Author

I can't send you the output but here are some observations:

SHOW SHARDS in influx shows ~100 rows
sudo find /var/lib/influxdb/data/ -type d | wc -l returns ~100
the shard: section in debug/vars indeed shows a lot more shards, ~350
filtering out the shard: entries with "values": {} leaves about 270 rows
looking at tsm1_filestore: and filtering out the entries with "values": {} leaves ~100 rows

So I guess the reported shards are all the shards ever created and the filtered tsm1_filestore results are actually what SHOW SHARDS reports.

@sparrc
Contributor

sparrc commented May 24, 2016

I see, that's a pretty easy fix then, I thought that all shard entries were supposed to be counted.

sparrc reopened this May 24, 2016
@francisdb
Author

I expected to see the same output as SHOW SHARDS, maybe both are interesting?

@francisdb
Author

This is what the current results are if influx is restarted (influxdb 0.12.2)

image

sparrc added a commit that referenced this issue May 25, 2016
sparrc added a commit that referenced this issue May 25, 2016
sparrc closed this as completed in 6351aa5 May 25, 2016
@francisdb
Author

@sparrc any idea when the nightlies are built so I can test this before release?

@sparrc
Contributor

sparrc commented May 25, 2016

I believe it's 2am or 3am US/PST

@francisdb
Author

hmm, I'll try about the same time tomorrow then

@francisdb
Author

Just downloaded the nightly

 Package: telegraf
 Version: 0.14.0~n201605260840-0

After installation I see a very small drop, but the number in n_shards is still far too high

image

I also just saw that my previous observations might not be correct

#skipping headers with the grep
influx -username xxx -password xxx -execute "SHOW SHARDS" | grep "T00:00:00Z" | wc -l
61

curl -s http://localhost:8086/debug/vars | grep shard | wc -l
2408

curl -s http://localhost:8086/debug/vars | grep shard | grep -v "\"values\": {}" | wc -l
2373

curl -s http://localhost:8086/debug/vars | grep tsm1_filestore | wc -l
2408

curl -s http://localhost:8086/debug/vars | grep tsm1_filestore | grep -v "\"values\": {}" | wc -l
42

I guess contacting somebody on the InfluxDB team and asking what to filter on, or looking into the source, might be a better idea?
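
Because grep counts matching lines rather than entries, parsing the JSON may give a more reliable count than the pipelines above. A minimal sketch, assuming each top-level entry in /debug/vars is an object with "name" and "values" keys (an assumption based on the "values": {} fragments above); `count_entries` and `fetch_debug_vars` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def count_entries(debug_vars, name, require_values=False):
    """Count top-level entries whose "name" field equals `name`.

    With require_values=True, entries whose "values" object is empty
    are skipped (the counterpart of grep -v '"values": {}').
    """
    count = 0
    for entry in debug_vars.values():
        # Skip anything that is not a named stats object,
        # e.g. the standard expvar "cmdline" list.
        if not isinstance(entry, dict) or entry.get("name") != name:
            continue
        if require_values and not entry.get("values"):
            continue
        count += 1
    return count


def fetch_debug_vars(url="http://localhost:8086/debug/vars"):
    """Download and parse the /debug/vars document."""
    with urlopen(url) as resp:
        return json.load(resp)
```

For example, `count_entries(fetch_debug_vars(), "tsm1_filestore", require_values=True)` would be the parsed counterpart of the last command above. Whether that number should match SHOW SHARDS is still the open question.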

@francisdb
Author

https://github.com/influxdata/influxdb/blob/cebe256773387b65a1e35658039cfd4df540f402/coordinator/statement_executor.go#L666

// Shards associated with deleted shard groups are effectively deleted.
// Don't list them.
if sgi.Deleted() {
    continue
}

maybe the /debug/vars call should also apply that filter?
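
The effect of such a filter on a shard count can be sketched outside InfluxDB. The `ShardGroup` class and its fields below are hypothetical stand-ins for illustration, not InfluxDB's actual types; the sketch only assumes that a group counts as deleted once it carries a deletion timestamp.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ShardGroup:
    """Hypothetical stand-in for a shard group and the IDs of its shards."""
    shard_ids: List[int] = field(default_factory=list)
    deleted_at: Optional[datetime] = None

    def deleted(self) -> bool:
        # Assumption: a group is deleted once a deletion time is recorded.
        return self.deleted_at is not None


def live_shard_count(groups: List[ShardGroup]) -> int:
    """Count shards, skipping those that belong to deleted groups."""
    return sum(len(g.shard_ids) for g in groups if not g.deleted())
```

A count computed this way goes down again when retention removes a group, which is the gauge behaviour asked for earlier in the thread.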

@sparrc
Contributor

sparrc commented May 26, 2016

doesn't look like that's going to be possible, so we'll probably need to start running queries on the db for some of these metrics. Unfortunately that will also require a considerably larger change, because we need a user we can authenticate as.
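
As a rough idea of what querying the database could involve, the sketch below counts the rows returned by SHOW SHARDS over the HTTP /query endpoint. It assumes the usual JSON response layout (results, then series, then values) and credentials passed as the u and p query parameters; `count_rows` and `show_shards` are hypothetical helper names, and this is not how the plugin is implemented.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_rows(response):
    """Sum the rows of every series in a parsed /query JSON response."""
    total = 0
    for result in response.get("results", []):
        for series in result.get("series", []):
            total += len(series.get("values", []))
    return total


def show_shards(host="http://localhost:8086", username="xxx", password="xxx"):
    """Run SHOW SHARDS as an authenticated user and return the parsed JSON."""
    params = urlencode({"q": "SHOW SHARDS", "u": username, "p": password})
    with urlopen(f"{host}/query?{params}") as resp:
        return json.load(resp)
```

`count_rows(show_shards())` would then yield a single number that could be reported as a gauge, at the cost of needing stored credentials.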

sparrc reopened this May 26, 2016
@francisdb
Author

I would prefer to create a ticket for influxdb that adds this deleted information to /debug/vars

@sjwang90
Contributor

Closing due to lack of interest and discussion. Feel free to reopen if desired.
