Import task for Mongoid models should use the cursor, rather than skip() #724
Conversation
Hi, this is surprising, we actually modeled this pseudo-pagination on purpose.

@nickhoffman @michaelklishin, any suggestions here?
```diff
-items = klass.limit(options[:per_page]).skip(offset)
-index.import items.to_a, options, &block
+items = []
+klass.all.each do |item|
```
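For context, a minimal sketch of how the new cursor-based loop plausibly continues (the names `klass`, `index`, `options[:per_page]`, and the block come from the import code shown above; the final flush is an assumption):

```ruby
items = []
klass.all.each do |item|                    # walks the MongoDB cursor lazily
  items << item
  if items.size == options[:per_page]       # a full batch is ready
    index.import items, options, &block     # ship it to Elasticsearch
    items = []
  end
end
index.import items, options, &block unless items.empty?  # flush the remainder
```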
Could we use `in_groups_of`, as suggested in http://stackoverflow.com/a/9489691/95696?
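For illustration, a sketch of what that suggestion would look like, assuming ActiveSupport's `Array#in_groups_of` (`Article` and `Article.tire.index.import` stand in for the model and Tire's bulk import). Note the `to_a`, which is what pulls the whole result set out of MongoDB:

```ruby
Article.all.to_a.in_groups_of(1000, false) do |batch|   # to_a loads everything
  Article.tire.index.import batch                       # then imports per batch
end
```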
`#in_groups_of` will cause the entire result set to be retrieved from MongoDB. That's fine for small collections, but not for large ones.

Tire's README suggests using the lower-level approach when you need to index large collections.
The MongoDB documentation that @halfbrick linked to says that a good alternative to `skip()` is to use range queries. These two suggestions go hand-in-hand. For example, to import articles in batches of one month:

```ruby
Article.count
# => 10000000

time_delta     = 1.month
start_criteria = Article.asc(:created_at).only(:created_at)
start_date     = start_criteria.first.created_at
end_date       = start_date + time_delta

while start_date && start_date <= Time.now
  Article.where(:created_at.gte => start_date, :created_at.lt => end_date).import

  # Find the first article after end_date; stop when none remain
  next_article = start_criteria.where(:created_at.gte => end_date).first
  start_date   = next_article && next_article.created_at
  end_date     = start_date + time_delta if start_date
end
```
@nickhoffman The README actually suggests both options (maybe confusingly, see #726). The real question here is what is the most efficient approach when importing large sets from Mongoid collections; see the code in https://github.com/karmi/tire/blob/master/lib/tire/model/import.rb#L62-L71. The code uses `limit` and `skip` to page through the collection.
I'm no Mongoid expert either, but from what I understand, Mongoid queries (`Criteria`) are lazily evaluated and wrap the MongoDB cursor. By default the cursor returns 100 documents, or 1 MB of data, per batch. So calling `klass.all.each` streams the collection in those batches instead of loading it all at once. We didn't see any memory issues using this approach.
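A small illustration of that laziness (`Article` is a placeholder model; the batch figures are the driver defaults mentioned above):

```ruby
criteria = Article.all      # builds a Mongoid::Criteria; no query sent yet

criteria.each do |article|  # first iteration opens the cursor; the driver
  puts article.id           # fetches ~100 docs (or ~1 MB) per round trip,
end                         # never the whole collection at once
```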
@Halfbrick Thanks for the update. This is surprising, since I remember we specifically added the `skip`-based pagination here. @nickhoffman Should we then just use `klass.all.each`?
BTW, 100 documents per batch feels suboptimal to me; Elasticsearch usually works best with batches in the thousands. The current default is 1000.
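For example, a hedged sketch of bumping the batch size; `:per_page` is the option visible in the diff above, and I'm assuming `Article.tire.import` passes it through:

```ruby
# Import in batches of 5000 documents instead of the default 1000
Article.tire.import :per_page => 5000
```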
OK, cool. My commit still uses the same batch limit as before (1000 by default); the 100-document limit is just what happens under the hood in Mongoid.
In the Mongoid import strategy, use modulo division for clearer semantics when breaking the collection into batches. Fixes #724
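Concretely, the batching test switches from tracking the buffer size to an index-based modulo check (a sketch reusing the names from the diff above):

```ruby
klass.all.each_with_index do |item, i|
  items << item
  if (i + 1) % options[:per_page] == 0    # true exactly at each batch boundary
    index.import items, options, &block
    items = []
  end
end
```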
Hi, finally got to evaluate and integrate the patch. Please have a look at 7117550; there was an error in the batching code. I evaluated the code on a small test Mongoid app and didn't notice any significant performance difference (100,000 docs), but the behaviour of the importing code should now be the same as before. Thanks!
Ah, true. Good spotting!
Iterating through Mongoid collections using `skip()` can get really slow for large collections, as MongoDB has to traverse from the beginning of the collection on every query.

http://docs.mongodb.org/manual/reference/method/cursor.skip/

We were seeing our import task freeze up with high I/O after importing around 2.5 million records in a 6 million record collection.

This commit alters the Mongoid import task so that it batches manually and uses Mongoid's cursor to fetch only a small batch of results at a time.
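To make the difference concrete, a hedged sketch; `Article` is a placeholder and the offset echoes the collection described above:

```ruby
# skip-based paging: the server walks past `offset` documents on every
# page, so later pages get progressively slower
Article.limit(1000).skip(2_500_000).to_a

# cursor iteration: one server-side cursor streams the collection in
# driver-sized batches, so each step costs about the same
Article.all.each { |article| puts article.id }  # stand-in per-doc work
```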