Filter out series within shards that do not have data for that series #7496

Merged: 1 commit, Oct 21, 2016

@jsternberg jsternberg commented Oct 20, 2016

Previously, we returned a full tag set for every shard, and that tag
set included every series that existed in the database index,
including series that didn't physically exist within that shard. As a
result, the returned tag sets were extremely large when the data had
high cardinality but was sparse. Because the data was sparse, most
people did not expect it to put such a large strain on the system.

Now we filter out the series IDs that are not assigned to the current
shard when computing a tag set for that shard. This drastically lowers
memory usage for sparse, high-cardinality data and allows queries on
that data to complete successfully.

This does not resolve issues for data that has high cardinality in
every shard and is also spread out over a long period of time. That
situation isn't nearly as common as the one above, though.
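
The filtering described above can be illustrated with a minimal sketch. This is not the code from this pull request: `Series`, `Shard`, `NewShard`, and `TagSets` are hypothetical, simplified names chosen for illustration, and the sketch assumes a database-wide index held as a plain slice plus a per-shard set of assigned series IDs.

```go
package main

import "fmt"

// Series is a hypothetical, simplified entry in the database-wide index.
type Series struct {
	ID   uint64
	Tags map[string]string
}

// Shard is a hypothetical, simplified shard. It records only the IDs of
// the series that actually have data stored in this shard.
type Shard struct {
	assigned map[uint64]struct{}
}

// NewShard builds a shard that holds data for the given series IDs.
func NewShard(ids ...uint64) *Shard {
	s := &Shard{assigned: make(map[uint64]struct{}, len(ids))}
	for _, id := range ids {
		s.assigned[id] = struct{}{}
	}
	return s
}

// TagSets groups series IDs by the value of one tag key. Series that
// exist in the index but are not assigned to this shard are skipped, so
// the result grows with the shard's own series count rather than with
// the cardinality of the whole database.
func (s *Shard) TagSets(index []Series, key string) map[string][]uint64 {
	sets := make(map[string][]uint64)
	for _, sr := range index {
		if _, ok := s.assigned[sr.ID]; !ok {
			continue // series has no data in this shard
		}
		v := sr.Tags[key]
		sets[v] = append(sets[v], sr.ID)
	}
	return sets
}

func main() {
	index := []Series{
		{ID: 1, Tags: map[string]string{"host": "a"}},
		{ID: 2, Tags: map[string]string{"host": "b"}},
		{ID: 3, Tags: map[string]string{"host": "c"}},
	}
	// Only series 1 and 3 have data in this shard.
	shard := NewShard(1, 3)
	fmt.Println(shard.TagSets(index, "host"))
}
```

In this sketch the tag set for the shard contains entries only for hosts `a` and `c`. Without the membership check it would contain all three hosts, which is the behavior that became expensive with sparse, high-cardinality data.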

@jsternberg jsternberg force-pushed the js-filter-shards-without-series-key branch from cb98365 to 67f5041 Compare October 20, 2016 18:14
@jsternberg jsternberg changed the title from "Filter out series within shards that do not have data for that shard" to "Filter out series within shards that do not have data for that series" Oct 20, 2016
@jsternberg jsternberg force-pushed the js-filter-shards-without-series-key branch from 67f5041 to f9f6dcd Compare October 20, 2016 18:15
@jsternberg jsternberg added this to the 1.1.0 milestone Oct 20, 2016
@jsternberg jsternberg force-pushed the js-filter-shards-without-series-key branch from f9f6dcd to 3681bc8 Compare October 20, 2016 19:18

jwilder commented Oct 20, 2016

I think this change may not be needed with TSI, but it is needed for the current index. @e-dard @benbjohnson can you take a look?

@benbjohnson

Correct, it's not necessary for TSI, since it'll only know about its own series internally.

@jwilder jwilder left a comment

Nice! This dramatically improves my sparse data (ratings) dataset tests.

It goes from allocating 50+ GB, running for minutes, and ultimately getting killed by the OS to staying under 3 GB and completing in 30s.

@jsternberg jsternberg merged commit 332de12 into master Oct 21, 2016
@jsternberg jsternberg deleted the js-filter-shards-without-series-key branch October 21, 2016 14:49