
Simplifying to data frame is slow when querying a large number of documents #37

Open
renkun-ken opened this issue May 5, 2016 · 13 comments

Comments

@renkun-ken
Contributor

I have a collection of stock market data with billions of documents, each a snapshot of the market depth of a stock taken every 3 seconds.

When I query the data of a stock at the scale of years, the data transmission is fast but most of the time is spent simplifying to a data frame.

> system.time(data <- stocks$find('{ "stock_id": "SH600000", "date": { "$gte": 20140101, "$lte": 20151231 }, "time": { "$gte": 93000, "$lte": 145700 }, "price": { "$gt": 0 }, "volume": { "$gt": 0 } }'))
 Imported 1337671 records. Simplifying into dataframe...
   user  system elapsed 
 26.172   0.188  29.395 

But if I use the iterator and rbindlist() the results chunk by chunk, it is much faster.

> library(data.table)
> system.time({
+   iter <- stocks$iterate('{ "stock_id": "SH600000", "date": { "$gte": 20140101, "$lte": 20151231 }, "time": { "$gte": 93000, "$lte": 145700 }, "price": { "$gt": 0 }, "volume": { "$gt": 0 } }')
+   while(!is.null(res <- iter$batch(size = 10000))) {
+     chunk <- rbindlist(res)
+   }
+ })
   user  system elapsed 
  6.596   0.120   9.577 

I wonder if the simplifying mechanism can be made faster in some way?

@jeroen
Owner

jeroen commented May 11, 2016

Thanks for this report. I'm going to look into this.

@fred777

fred777 commented Jun 16, 2016

The magic comes from data.table which is way faster than plain old data.frame ;-)

I'm also iterating and rbindlist'ing most of the time because of this...

@SymbolixAU

This is why I created mongolitedt - which does the rbindlist-ing for you.

It would be good if this could be incorporated into mongolite.
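
For illustration, a rough sketch of the kind of wrapper such a package provides (the helper name find_dt is made up here, not mongolitedt's actual API): page through the cursor and bind each batch with data.table::rbindlist().

library(data.table)
library(mongolite)

# Hypothetical helper: query via the iterator and bind batches with rbindlist()
find_dt <- function(con, query = "{}", batch_size = 10000) {
  iter <- con$iterate(query)
  chunks <- list()
  i <- 1L
  # bind each batch as it arrives, then bind the batches once at the end
  while (!is.null(res <- iter$batch(size = batch_size))) {
    chunks[[i]] <- rbindlist(res, fill = TRUE)
    i <- i + 1L
  }
  rbindlist(chunks, fill = TRUE)
}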

@renkun-ken
Contributor Author

When a large result set is returned, most of the time is spent simplifying the result into a data frame, which can take much longer than the query itself.

@SymbolixAU

@renkun-ken - that's what mongolitedt is meant to address

@jeroen
Owner

jeroen commented Jun 26, 2016

Is there any way to speed up the mongolite implementation without relying on a data.table dependency? I'm a bit reluctant to drag in a heavy dependency such as data.table if we only use a single function.

@SymbolixAU

That's a fair statement.
I'm playing about with manipulating the data in C directly from the pointer/cursor, on the assumption that the query returns a 'tabular' data set. That way, whole columns of data can be converted in one go rather than each value one at a time.

But I'm not a C programmer and I'm not making much progress.

@ericwatt

@jeroenooms what about having data.table as a Suggests dependency and using it conditionally?
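
For reference, a minimal sketch of what that could look like (hypothetical, not mongolite's code): use data.table when it is installed and fall back to base R otherwise.

# Hypothetical helper, not part of mongolite: bind a list of flat
# documents, using data.table only if the user has it installed.
bind_docs <- function(docs) {
  if (requireNamespace("data.table", quietly = TRUE)) {
    # fast path, taken only when data.table is available
    data.table::rbindlist(docs, fill = TRUE)
  } else {
    # dependency-free fallback; assumes all documents share the same fields
    do.call(rbind, lapply(docs, as.data.frame, stringsAsFactors = FALSE))
  }
}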

@fred777

fred777 commented Sep 27, 2016

@ericwatt: good point!

@MarkusLang1987

Just found this interesting topic...

Isn't the code wrong? When I have a dataset with 92000 observations and use batches of size 10000, my data.frame ends up only 2000 observations large, because each iteration overwrites the previous chunk.

I've written this code:

library(data.table)

# collect each chunk instead of overwriting it, then bind once at the end
endlist <- list()
i <- 1

iter <- x$iterate()
while(!is.null(res <- iter$batch(size = 100))) {
    chunk <- rbindlist(res)
    endlist[[i]] <- chunk
    i <- i + 1
}

ergebnis <- rbindlist(endlist)

This works for me.

@atheriel
Contributor

Hi, I've done something along the lines of @SymbolixAU's comment and written a new C-level parser for rectangular data, taking advantage of its regularity for a much more efficient allocation strategy. The side effect of this is that the result can simply be rbind()-ed together, avoiding a great deal of the slowness caused by simplify() and discussed in this issue.

You can see the results in this branch of my fork. In my testing it can generally be between 2 and 15 times faster to query data, depending on the number and composition of the columns and the responsiveness of the mongo server itself.

The current implementation is opt-in and not wholly feature complete: you must pass flat = TRUE to the find() method. Anyone sufficiently interested can try it out in the usual way via install_github(), but make sure to choose the correct branch (add-flat-cursor).
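
Based on that description, usage looks roughly like this (the fork path atheriel/mongolite is inferred from the username; the connection details are just an example, and flat = TRUE comes from the comment above):

# install the branch and opt in per query with flat = TRUE
# remotes::install_github("atheriel/mongolite", ref = "add-flat-cursor")
library(mongolite)

stocks <- mongo(collection = "stocks", db = "market")  # example connection
flat <- stocks$find('{ "stock_id": "SH600000" }', flat = TRUE)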

@jeroen Are you interested in trying to incorporate a change of this kind into the package? If so, I'm happy to open a PR for review.

@jeroen
Owner

jeroen commented Aug 8, 2018

@atheriel I would be interested if we can make it in a way that doesn't introduce a hard dependency on data.table and also doesn't introduce too much new complexity in the code.

I would also prefer using dplyr over data.table if possible, because I hope we can add a dplyr back-end to mongolite at some point.
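
As a hedged illustration (not a mongolite feature), the batching pattern from earlier in this thread can be written without data.table by using dplyr::bind_rows(), assuming the documents are flat:

# Sketch only: bind batches with dplyr instead of data.table
library(dplyr)
library(mongolite)

iter <- stocks$iterate('{ "stock_id": "SH600000" }')
chunks <- list()
i <- 1L
while (!is.null(res <- iter$batch(size = 10000))) {
  chunks[[i]] <- bind_rows(res)   # each flat document becomes one row
  i <- i + 1L
}
result <- bind_rows(chunks)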

@LuShuYangMing

@jeroen I also have a similar problem. I have a large vector (length: 60000000) that was split across 1300 documents, and I want to retrieve it back into a single vector as fast as possible. Could you give me some suggestions?
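
One hedged way to do that with the iterator pattern from this thread, assuming con is the mongolite connection and each document stores its chunk of numbers in a field called values (the field name here is made up):

# Sketch only: reassemble a long numeric vector split across documents,
# assuming chunks live in a "values" field and come back in the right order.
iter <- con$iterate('{}', fields = '{ "values": true, "_id": false }')
pieces <- list()
i <- 1L
while (!is.null(doc <- iter$one())) {
  pieces[[i]] <- unlist(doc$values, use.names = FALSE)
  i <- i + 1L
}
big_vector <- unlist(pieces, use.names = FALSE)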
