
Limit the amount of data scanned per query in Higo (海狗) due to memory constraints #17

Closed
muyannian opened this issue Mar 25, 2013 · 1 comment

Comments

@muyannian
Owner

Previously, Higo sent concurrent requests for all partitions to every shard. With limited machine resources, a large number of partitions generates a large number of HTTP requests and puts excessive pressure on the merger server.
For this reason, a single Higo scan in the adhoc project has so far been capped at 1 billion rows, which clearly does not satisfy some use cases, so the approach needs to be improved.

The current approach is to submit the scan in multiple batches, each batch covering only a fixed number of partitions (for example, 4 partitions). After each shard finishes computing its batch, it dumps its partial results to HDFS.
Finally, a single merge job (whose concurrency is determined by the number of hash buckets) is submitted to merge all the data dumped to HDFS. A sketch of this flow is shown below.
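The following is a minimal Java sketch of the batched submission described above. All names here (`ShardClient`, `MergeJob`, the HDFS path, the batch size of 4) are hypothetical placeholders to illustrate the flow, not Higo's actual API.

```java
import java.util.List;

/**
 * Sketch of batched scan submission: partitions are processed in fixed-size
 * batches, each shard dumps its partial result to HDFS, and a single merge
 * job combines the dumps at the end.
 */
public class BatchedScan {

    // Fixed number of partitions submitted per batch (e.g. 4).
    private static final int PARTITIONS_PER_BATCH = 4;

    public static void run(List<String> partitions, ShardClient shards, MergeJob merger) {
        // Submit partitions a few at a time instead of all at once,
        // so the number of concurrent HTTP requests stays bounded.
        for (int start = 0; start < partitions.size(); start += PARTITIONS_PER_BATCH) {
            int end = Math.min(start + PARTITIONS_PER_BATCH, partitions.size());
            List<String> batch = partitions.subList(start, end);

            // Each shard computes its partial result for this batch and
            // dumps it to HDFS instead of streaming it to the merger server.
            shards.scanAndDumpToHdfs(batch, "/tmp/higo/dump");
        }

        // One final merge pass over all dumped files; its concurrency is
        // determined by the number of hash buckets, not the partition count.
        merger.mergeHdfsDumps("/tmp/higo/dump");
    }

    /** Hypothetical interface for submitting a batch of partitions to the shards. */
    interface ShardClient {
        void scanAndDumpToHdfs(List<String> partitions, String hdfsDir);
    }

    /** Hypothetical interface for the final merge job over the HDFS dumps. */
    interface MergeJob {
        void mergeHdfsDumps(String hdfsDir);
    }
}
```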

@muyannian
Owner Author

This has been implemented.
