
Querying large datasets takes up a large amount of memory #341

Merged: 2 commits merged into master from fix-341-query-memory-consumption on Apr 1, 2014

Conversation

jvshahid (Contributor)

`select count(value) from some_series` on a series with many points takes up a huge amount of memory. This shouldn't be the case...

@pauldix pauldix added this to the 0.5.0 milestone Mar 14, 2014
@jvshahid jvshahid modified the milestones: 0.5.1, 0.5.0 Mar 24, 2014
jvshahid (Contributor)

This turned out to be due to the multiple levels of caching that we have. The sample data set had 12 million data points spread across ~170 shards. Each shard's Passthrough engine sent responses of 1100 points each, and at the coordinator we buffer responses in a channel that holds `query-shard-buffer-size` responses. Memory usage explodes as more responses are buffered. This PR reduces the default `query-shard-buffer-size` to 10.
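The arithmetic behind the blow-up can be sketched in Go. This is purely illustrative: `maxBuffered` is a made-up helper, and the pre-fix buffer size of 1000 is an assumed value for comparison, not the actual old default.

```go
package main

import "fmt"

// maxBuffered returns the worst-case number of points held in a single
// shard's response buffer: buffer capacity times points per response.
func maxBuffered(bufferSize, pointsPerResponse int) int {
	return bufferSize * pointsPerResponse
}

func main() {
	// Hypothetical large buffer: each of the ~170 shards could queue
	// this many 1100-point responses at the coordinator.
	fmt.Println("points buffered per shard at size 1000:", maxBuffered(1000, 1100))
	// New default from this PR:
	fmt.Println("points buffered per shard at size 10:", maxBuffered(10, 1100))
}
```

Multiplied across ~170 shards, even a modest per-shard buffer adds up, which is why the follow-up discussion moves toward computing the buffer size from the query instead of a fixed setting.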

pauldix (Member, Author) commented Mar 27, 2014

Ok, after brainstorming, here's what we came up with. There should be a setting for the number of shards that can be queried in parallel, and we should remove the option for per-shard buffer size; the buffer size should instead be computed from the query. For example, if the shard duration is 7d and the group-by interval is 1m, we know we could buffer up to 1440 * 7 = 10080 points per shard. If responses always bring back 200 points, then the buffer needs to be 1440 * 7 / 200 + 1 responses in size.

Further, if there is no group-by-time interval, then the number of shards queried in parallel should be 1. That guarantees we never have responses buffering from newer shards while an older shard is still running slowly.

pauldix and others added 2 commits April 1, 2014 19:11
Remove the setting for shard query buffer size and add logic for max
number of shards to query in parallel.
jvshahid added a commit that referenced this pull request Apr 1, 2014
Querying large datasets takes up a large amount of memory
@jvshahid jvshahid merged commit ee0277a into master Apr 1, 2014
@jvshahid jvshahid deleted the fix-341-query-memory-consumption branch April 1, 2014 23:13
jvshahid added a commit that referenced this pull request Apr 2, 2014
We shouldn't be dropping responses anymore, since out-of-order response reception is no longer possible. Also fix the logic that decides whether the shards should be queried sequentially or not: the only safe case for parallel querying is a single time series with aggregation over time only. Any other case is currently not safe to run in parallel.
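The "only safe case" rule from the commit message amounts to a simple predicate. `QuerySpec` and its fields are hypothetical, simplified stand-ins for the real parsed-query representation:

```go
package main

import "fmt"

// QuerySpec is a hypothetical, simplified view of a parsed query.
type QuerySpec struct {
	SeriesCount     int  // number of time series the query touches
	HasAggregation  bool // e.g. count(), mean()
	GroupByTimeOnly bool // grouped by time and nothing else
}

// canQueryShardsInParallel mirrors the rule in the commit message:
// parallel shard querying is only safe for a single time series
// aggregated over time alone; everything else runs sequentially.
func canQueryShardsInParallel(q QuerySpec) bool {
	return q.SeriesCount == 1 && q.HasAggregation && q.GroupByTimeOnly
}

func main() {
	fmt.Println(canQueryShardsInParallel(QuerySpec{1, true, true})) // safe
	fmt.Println(canQueryShardsInParallel(QuerySpec{2, true, true})) // not safe
}
```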
jvshahid added a commit that referenced this pull request Apr 3, 2014
This patch uses a channel of response channels instead of a slice of response channels, creating a pipeline instead of batches. In other words, before this patch we processed shardConcurrentLimit shards, then processed the next shardConcurrentLimit. With this patch we constantly keep shardConcurrentLimit shards in the pipeline: as soon as we're done with one shard we start querying a new one, and so on. This provides more parallelism and a cleaner design.
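The channel-of-channels pipeline described in the commit message can be sketched like this. The shard and response types are simplified (plain ints), and the function name is illustrative; the point is the structure: a buffered channel of per-shard channels bounds concurrency while the consumer drains results in submission order, so out-of-order reception cannot happen.

```go
package main

import "fmt"

// queryShardsPipelined keeps roughly `limit` shards in flight at once.
// As soon as the consumer finishes one shard, the producer's blocked
// send unblocks and a new shard starts: a pipeline, not batches.
func queryShardsPipelined(shards []int, limit int) []int {
	// Channel of per-shard response channels; its capacity bounds
	// how many shards can be queried concurrently.
	pending := make(chan chan int, limit)

	go func() {
		for _, s := range shards {
			ch := make(chan int, 1)
			pending <- ch // blocks once `limit` shards are in flight
			go func(shard int, out chan int) {
				out <- shard * 10 // stand-in for querying the shard
			}(s, ch)
		}
		close(pending)
	}()

	// Drain results in the order shards were submitted, which is why
	// responses can no longer arrive out of order.
	var results []int
	for ch := range pending {
		results = append(results, <-ch)
	}
	return results
}

func main() {
	fmt.Println(queryShardsPipelined([]int{1, 2, 3, 4, 5}, 2)) // [10 20 30 40 50]
}
```

The batch approach would wait for all `limit` shards before starting the next group; here a slow shard only delays its own slot while later shards keep streaming into the buffer behind it.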