Filter out series within shards that do not have data for that series #7496

jsternberg · 2016-10-20T18:13:25Z

Previously, we returned a full tag set for every shard, and that tag
set included every series in the database index, even series that
didn't physically exist within that shard. With high cardinality but
sparse data, this made the returned tag sets extremely large. Because
the data was sparse, most people did not expect it to put such a
large strain on the system.

Now we filter out the series ids that are not assigned to the current
shard when computing a tag set for that shard. This drastically lowers
memory usage for high cardinality sparse data and allows queries on
that data to complete successfully.
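
As an illustration of the filtering step described above, here is a minimal sketch in Go. It is not the code from this PR: the `shard` type, `newShard`, and `filterSeriesIDs` are hypothetical names invented for the example, and the real index and tag set structures are assumed to be more involved.

```go
package main

import "fmt"

// shard is a hypothetical stand-in for a storage shard. It only tracks
// which series ids have been assigned to it (i.e. have data in it).
type shard struct {
	id       uint64
	assigned map[uint64]struct{}
}

// newShard builds a shard that owns the given series ids.
func newShard(id uint64, seriesIDs ...uint64) *shard {
	s := &shard{id: id, assigned: make(map[uint64]struct{}, len(seriesIDs))}
	for _, sid := range seriesIDs {
		s.assigned[sid] = struct{}{}
	}
	return s
}

// filterSeriesIDs keeps only the candidate ids that are assigned to this
// shard, preserving their order. Candidates would come from the
// database-wide index, which knows about series in every shard.
func (s *shard) filterSeriesIDs(candidates []uint64) []uint64 {
	filtered := make([]uint64, 0, len(candidates))
	for _, sid := range candidates {
		if _, ok := s.assigned[sid]; ok {
			filtered = append(filtered, sid)
		}
	}
	return filtered
}

func main() {
	// The database index knows about five series in total...
	all := []uint64{1, 2, 3, 4, 5}

	// ...but only two of them were ever written to this shard.
	sh := newShard(7, 2, 5)

	// Only the filtered ids would go on to be grouped into tag sets.
	fmt.Println(sh.filterSeriesIDs(all)) // prints [2 5]
}
```

In this sketch, the tag set for a shard is built from two series instead of five. With sparse data the gap between the two counts is what grows large, since most series in the index have no data in any given shard.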

This does not resolve issues for high cardinality data that is present
in every shard and is also spread out over a long span of time. That
situation isn't nearly as common as the one above, though.

jwilder · 2016-10-20T20:23:50Z

I think this change may not be needed w/ TSI, but it is needed for the current index. @e-dard @benbjohnson can you take a look?

benbjohnson · 2016-10-20T20:48:25Z

Correct, it's not necessary for TSI since it'll only know about its series internally.

jwilder

Nice! This dramatically improves my sparse data (ratings) dataset tests.

It goes from allocating 50+GB, running for minutes and ultimately getting killed by the OS to staying under 3GB and completing in 30s.

jsternberg force-pushed the js-filter-shards-without-series-key branch from cb98365 to 67f5041 October 20, 2016 18:14

jsternberg changed the title ~~Filter out series within shards that do not have data for that shard~~ Filter out series within shards that do not have data for that series Oct 20, 2016

jsternberg force-pushed the js-filter-shards-without-series-key branch from 67f5041 to f9f6dcd October 20, 2016 18:15

jsternberg added this to the 1.1.0 milestone Oct 20, 2016

jsternberg force-pushed the js-filter-shards-without-series-key branch from f9f6dcd to 3681bc8 October 20, 2016 19:18

jwilder added the area/performance label Oct 20, 2016

jwilder approved these changes Oct 21, 2016

jsternberg merged commit 332de12 into master Oct 21, 2016

jsternberg deleted the js-filter-shards-without-series-key branch October 21, 2016 14:49
