This repository has been archived by the owner on Jun 30, 2018. It is now read-only.

Import task for Mongoid models should use the cursor, rather than skip() #724

Closed
wants to merge 1 commit

Conversation


@tobycox tobycox commented May 8, 2013

Iterating through Mongoid collections using skip() can get really slow for large collections, as it has to traverse from the beginning of the collection.
http://docs.mongodb.org/manual/reference/method/cursor.skip/

We were seeing our import task freeze up with high I/O after importing around 2.5 million records in a 6 million record collection.

This commit alters the Mongoid import task so that it batches manually and makes use of Mongoid's cursor, which only returns a small batch of results at a time.
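Roughly, the change swaps the skip-based pagination for a single cursor walk with manual batching. A minimal sketch of that idea (not the literal diff; klass, index, options[:per_page], and the default of 1,000 follow the names already used in the import task):

# Walk the driver cursor once and flush a batch to Elasticsearch every
# :per_page documents, instead of issuing a new limit/skip query per page.
items = []
klass.all.each do |item|
  items << item
  if items.length == (options[:per_page] || 1000)
    index.import items, options, &block
    items = []
  end
end
# Import whatever is left in the final, partial batch.
index.import items, options, &block unless items.empty?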


karmi commented May 11, 2013

Hi, this is surprising, we actually modeled this pseudo-find_in_batches for Mongoid after suggestions in mongoid/mongoid#1334, if memory serves correctly. I don't have much experience with tuning Mongoid queries, but I think klass.all would be quite inefficient and would load everything into memory?


karmi commented May 11, 2013

@nickhoffman @michaelklishin, any suggestions here?

The diff hunk under review:

items = klass.limit(options[:per_page]).skip(offset)
index.import items.to_a, options, &block
items = []
klass.all.each do |item|
karmi (Owner) commented on this hunk:

Could we use in_groups_of as suggested in http://stackoverflow.com/a/9489691/95696 ?


#in_groups_of will cause the entire result set to be retrieved from MongoDB. That's fine for small collections, but not for large collections.
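The difference is where the batching happens. A rough illustration (Article, the batch size of 1,000, and the Tire.index("articles").import call are only illustrative; the latter mirrors the README example quoted in the next comment):

# in_groups_of is an Array method, so the whole criteria has to be
# materialized in memory before any batch can be built:
Article.all.to_a.in_groups_of(1_000, false) do |batch|
  Tire.index("articles").import batch
end

# each_slice consumes the lazy cursor and only ever holds the current
# 1,000 documents in memory:
Article.all.each_slice(1_000) do |batch|
  Tire.index("articles").import batch
end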

@nickhoffman

Tire's README suggests using the lower-level approach when needing to index large collections:

Tire.index("articles").import a_batch_of_articles

The MongoDB documentation that @halfbrick linked to says that a good alternative to cursor.skip() is for the application to limit the size of the result set using queries that are applicable to the application.

These two suggestions go hand-in-hand. For example, import articles in batches of 1 month:

Article.count
=> 10000000

time_delta      = 1.month
start_criteria  = Article.asc(:created_at).only :created_at
start_date      = start_criteria.first.created_at
end_date        = start_date + time_delta

while start_date <= Time.now
  Article.where(:created_at.gte => start_date, :created_at.lt => end_date).import

  # Jump to the next document on or after end_date; stop when none are left,
  # otherwise calling .created_at on nil would raise.
  next_article = start_criteria.where(:created_at.gte => end_date).first
  break unless next_article

  start_date = next_article.created_at
  end_date   = start_date + time_delta
end


karmi commented May 12, 2013

@nickhoffman The README actually suggests both options (maybe confusingly, see #726). The real question here is what the most efficient approach is when importing large sets from Mongoid collections -- see the code in https://github.com/karmi/tire/blob/master/lib/tire/model/import.rb#L62-L71. That code uses step-limit-skip, as suggested in mongoid/mongoid#1334, whereas @Halfbrick's changes simply use each and leave the mechanics of efficient iteration to Mongoid. I really lack the relevant knowledge here as to which approach is best.
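For context, the linked strategy paginates with limit/skip, roughly along these lines (a paraphrase of the idea, not the verbatim file; :per_page defaults to 1,000 in the import task):

# Page through the collection, importing one :per_page-sized page at a time.
# Each page forces the server to traverse and discard `offset` documents
# before returning anything, so later pages get progressively slower.
0.step(klass.count, options[:per_page]) do |offset|
  items = klass.limit(options[:per_page]).skip(offset)
  index.import items.to_a, options, &block
end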


tobycox commented May 12, 2013

I'm no Mongoid expert either, but from what I understand, Mongoid queries (Criteria) are lazily evaluated and wrap the MongoDB cursor. The cursor by default returns 100 documents, or 1 MB of data, per batch.

So calling klass.all won't load the entire collection into memory; it'll just load each object as it is iterated over - in this case only keeping one batch in memory at a time.

We didn't see any memory issues using klass.all.each on our 5GB collection.
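A quick way to sanity-check that kind of claim on any collection is to watch the process's resident set size while iterating (plain Ruby, nothing Tire-specific; the Article model and the 10,000-document interval are just examples):

Article.all.each_with_index do |article, i|
  next unless (i % 10_000).zero?
  # RSS in KB, as reported by ps for the current process (Unix-like systems).
  rss_kb = `ps -o rss= -p #{Process.pid}`.to_i
  puts "#{i} documents iterated, RSS ~ #{rss_kb / 1024} MB"
end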


karmi commented May 12, 2013

@Halfbrick Thanks for the update. This is surprising, since I remember we specifically added the step implementation to prevent inefficient iteration. Your information seems to be correct, though, judging from the resources you posted and others.

@nickhoffman Should we then just use klass.all.each in the Mongoid importing strategy?


karmi commented May 12, 2013

BTW, 100 documents per batch feels suboptimal to me; Elasticsearch usually works best with batches in the thousands. The current default is 1,000.


tobycox commented May 12, 2013

OK, cool. My commit still uses the same batch limit as before (1,000 by default); the 100-document limit is just what happens under the hood in Mongoid.

karmi closed this in 45eb4f9 on Jun 5, 2013
karmi added a commit that referenced this pull request Jun 5, 2013
* For collections smaller than :per_page, no data would be imported.
* For a collection of 1,200 items, the last 200 documents won't be imported with the default batch size of 1,000.

Fixes #724

Cc: @tobycox
karmi added a commit that referenced this pull request Jun 5, 2013
In the Mongoid import strategy, use modulo division for better semantics
when breaking up the collection into batches.

Fixes: #724

karmi commented Jun 5, 2013

Hi, I finally got around to evaluating and integrating the patch. Please have a look at 7117550; there was an error in the batching code.

I evaluated the code on a small test Mongoid app and didn't notice any significant performance difference (100,000 docs), but the behaviour of the importing code should now be the same as before. Thanks!
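To spell out the failure mode from the commit messages above: flushing only when a full batch has accumulated loses documents. A standalone sketch of the arithmetic (nothing here is from the patch; per_page and the document counts are just the numbers mentioned above):

per_page = 1_000

[150, 1_200].each do |total_docs|
  imported = 0
  buffer   = 0
  total_docs.times do
    buffer += 1
    if buffer == per_page        # flush only when a full batch has accumulated
      imported += buffer
      buffer = 0
    end
  end
  # Without a final flush after the loop, `buffer` documents are silently dropped.
  puts "#{total_docs} docs: imported #{imported}, dropped #{buffer}"
end
# => 150 docs: imported 0, dropped 150
# => 1200 docs: imported 1000, dropped 200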


tobycox commented Jun 5, 2013

Ah, true. Good spotting!
