This repository has been archived by the owner on Jun 30, 2018. It is now read-only.

Import task for Mongoid models should use the cursor, rather than skip() #724

Closed
wants to merge 1 commit

Conversation


@tobycox tobycox commented May 8, 2013

Iterating through Mongoid collections using skip() can get really slow for large collections, as it has to traverse from the beginning of the collection.
http://docs.mongodb.org/manual/reference/method/cursor.skip/

We were seeing our import task freeze up with high I/O after importing around 2.5 million records in a 6 million record collection.

This commit alters the Mongoid import task so that it batches manually and makes use of Mongoid's cursor, which only returns a small batch of results at a time.
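Roughly, the change swaps the skip-based pagination for a single cursor walk with manual batching. A minimal sketch of that idea (not the literal diff; klass, index, options[:per_page], and the default of 1,000 follow the names already used in the import task):

# Walk the driver cursor once and flush a batch to Elasticsearch every
# :per_page documents, instead of issuing a new limit/skip query per page.
items = []
klass.all.each do |item|
  items << item
  if items.length == (options[:per_page] || 1000)
    index.import items, options, &block
    items = []
  end
end
# Import whatever is left in the final, partial batch.
index.import items, options, &block unless items.empty?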


karmi commented May 11, 2013

Hi, this is surprising, we actually modeled this pseudo-find_in_batches for Mongoid after suggestions in mongoid/mongoid#1334, if memory serves correctly. I don't have much experience with tuning Mongoid queries, but I think klass.all would be quite inefficient and would load everything into memory?


karmi commented May 11, 2013

@nickhoffman @michaelklishin, any suggestions here?

The diff hunk under review:

items = klass.limit(options[:per_page]).skip(offset)
index.import items.to_a, options, &block
items = []
klass.all.each do |item|
karmi (Owner) commented on this hunk:

Could we use in_groups_of as suggested in http://stackoverflow.com/a/9489691/95696 ?


#in_groups_of will cause the entire result set to be retrieved from MongoDB. That's fine for small collections, but not for large collections.
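The difference is where the batching happens. A rough illustration (Article, the batch size of 1,000, and the Tire.index("articles").import call are only illustrative; the latter mirrors the README example quoted in the next comment):

# in_groups_of is an Array method, so the whole criteria has to be
# materialized in memory before any batch can be built:
Article.all.to_a.in_groups_of(1_000, false) do |batch|
  Tire.index("articles").import batch
end

# each_slice consumes the lazy cursor and only ever holds the current
# 1,000 documents in memory:
Article.all.each_slice(1_000) do |batch|
  Tire.index("articles").import batch
end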

@nickhoffman

Tire's README suggests using the lower-level approach when needing to index large collections:

Tire.index("articles").import a_batch_of_articles

The MongoDB documentation that @halfbrick linked to says that a good alternative to cursor.skip() is for the application to limit the size of the result set using queries that are applicable to the application.

These two suggestions go hand-in-hand. For example, import articles in batches of 1 month:

Article.count
=> 10000000

time_delta      = 1.month
start_criteria  = Article.asc(:created_at).only :created_at
start_date      = start_criteria.first.created_at
end_date        = start_date + time_delta

while start_date <= Time.now
  Article.where(:created_at.gte => start_date, :created_at.lt => end_date).import

  # Jump to the next document on or after end_date; stop when none are left,
  # otherwise calling .created_at on nil would raise.
  next_article = start_criteria.where(:created_at.gte => end_date).first
  break unless next_article

  start_date = next_article.created_at
  end_date   = start_date + time_delta
end


karmi commented May 12, 2013

@nickhoffman The README actually suggests both options (maybe confusingly, see #726). The real question here is what the most efficient approach is when importing large sets from Mongoid collections -- see the code in https://github.com/karmi/tire/blob/master/lib/tire/model/import.rb#L62-L71. That code uses step-limit-skip, as suggested in mongoid/mongoid#1334, whereas @Halfbrick's changes simply use each and leave the mechanics of efficient iteration to Mongoid. I really lack the relevant knowledge here as to which approach is best.
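For context, the linked strategy paginates with limit/skip, roughly along these lines (a paraphrase of the idea, not the verbatim file; :per_page defaults to 1,000 in the import task):

# Page through the collection, importing one :per_page-sized page at a time.
# Each page forces the server to traverse and discard `offset` documents
# before returning anything, so later pages get progressively slower.
0.step(klass.count, options[:per_page]) do |offset|
  items = klass.limit(options[:per_page]).skip(offset)
  index.import items.to_a, options, &block
end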


tobycox commented May 12, 2013

I'm no Mongoid expert either, but from what I understand, Mongoid queries (Criteria) are lazily evaluated and wrap the MongoDB cursor. The cursor by default returns 100 documents, or 1 MB of data, per batch.

So calling klass.all won't load the entire collection into memory; it'll just load each object as it is iterated over - in this case only keeping one batch in memory at a time.

We didn't see any memory issues using klass.all.each on our 5GB collection.
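A quick way to sanity-check that kind of claim on any collection is to watch the process's resident set size while iterating (plain Ruby, nothing Tire-specific; the Article model and the 10,000-document interval are just examples):

Article.all.each_with_index do |article, i|
  next unless (i % 10_000).zero?
  # RSS in KB, as reported by ps for the current process (Unix-like systems).
  rss_kb = `ps -o rss= -p #{Process.pid}`.to_i
  puts "#{i} documents iterated, RSS ~ #{rss_kb / 1024} MB"
end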


karmi commented May 12, 2013

@Halfbrick Thanks for the update. This is surprising, since I remember we specifically added the step implementation to prevent inefficient iteration. Your information seems to be correct, though, judging from the resources you posted and others.

@nickhoffman Should we then just use klass.all.each in the Mongoid importing strategy?


karmi commented May 12, 2013

BTW, 100 documents per batch feels suboptimal to me; Elasticsearch usually works best with batches in the thousands. The current default is 1,000.


tobycox commented May 12, 2013

OK, cool. My commit still uses the same batch limit as before (1,000 by default); the 100-document limit is just what happens under the hood in Mongoid.

karmi closed this in 45eb4f9 on Jun 5, 2013
karmi added a commit that referenced this pull request Jun 5, 2013
* For collections smaller than :per_page, no data would be imported.
* For a collection of 1,200 items, the last 200 documents won't be imported with the default batch size of 1,000.

Fixes #724

Cc: @tobycox
karmi added a commit that referenced this pull request Jun 5, 2013
In the Mongoid import strategy, use modulo division for better semantics
when breaking up the collection into batches.

Fixes: #724

karmi commented Jun 5, 2013

Hi, I finally got around to evaluating and integrating the patch. Please have a look at 7117550; there was an error in the batching code.

I evaluated the code on a small test Mongoid app and didn't notice any significant performance difference (100,000 docs), but the behaviour of the importing code should now be the same as before. Thanks!
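To spell out the failure mode from the commit messages above: flushing only when a full batch has accumulated loses documents. A standalone sketch of the arithmetic (nothing here is from the patch; per_page and the document counts are just the numbers mentioned above):

per_page = 1_000

[150, 1_200].each do |total_docs|
  imported = 0
  buffer   = 0
  total_docs.times do
    buffer += 1
    if buffer == per_page        # flush only when a full batch has accumulated
      imported += buffer
      buffer = 0
    end
  end
  # Without a final flush after the loop, `buffer` documents are silently dropped.
  puts "#{total_docs} docs: imported #{imported}, dropped #{buffer}"
end
# => 150 docs: imported 0, dropped 150
# => 1200 docs: imported 1000, dropped 200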


tobycox commented Jun 5, 2013

Ah, true. Good spotting!
