
Simplifying to data frame is slow when querying a large number of documents #37

Open
renkun-ken opened this issue May 5, 2016 · 13 comments

Comments

@renkun-ken
Contributor

I have a collection of stock market data with billions of documents, each a snapshot of the market depth of a stock taken every 3 seconds.

When I query the data of a stock at the scale of years, the data transmission is fast but most of the time is spent simplifying to a data frame.

> system.time(data <- stocks$find('{ "stock_id": "SH600000", "date": { "$gte": 20140101, "$lte": 20151231 }, "time": { "$gte": 93000, "$lte": 145700 }, "price": { "$gt": 0 }, "volume": { "$gt": 0 } }'))
 Imported 1337671 records. Simplifying into dataframe...
   user  system elapsed 
 26.172   0.188  29.395 

But if I use the iterator and rbindlist() the results chunk by chunk, it is much faster.

> library(data.table)
> system.time({
+   iter <- stocks$iterate('{ "stock_id": "SH600000", "date": { "$gte": 20140101, "$lte": 20151231 }, "time": { "$gte": 93000, "$lte": 145700 }, "price": { "$gt": 0 }, "volume": { "$gt": 0 } }')
+   while(!is.null(res <- iter$batch(size = 10000))) {
+     chunk <- rbindlist(res)
+   }
+ })
   user  system elapsed 
  6.596   0.120   9.577 

I wonder if the simplifying mechanism can be made faster in some way?

@jeroen
Owner

jeroen commented May 11, 2016

Thanks for this report. I'm going to look into this.

@fred777

fred777 commented Jun 16, 2016

The magic comes from data.table which is way faster than plain old data.frame ;-)

I'm also iterating and rbindlist'ing most of the time because of this...

@SymbolixAU

This is why I created mongolitedt - which does the rbindlist-ing for you.

It would be good if this could be incorporated into mongolite.
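
For illustration, a rough sketch of the kind of wrapper such a package provides (the helper name find_dt is made up here, not mongolitedt's actual API): page through the cursor and bind each batch with data.table::rbindlist().

library(data.table)
library(mongolite)

# Hypothetical helper: query via the iterator and bind batches with rbindlist()
find_dt <- function(con, query = "{}", batch_size = 10000) {
  iter <- con$iterate(query)
  chunks <- list()
  i <- 1L
  # bind each batch as it arrives, then bind the batches once at the end
  while (!is.null(res <- iter$batch(size = batch_size))) {
    chunks[[i]] <- rbindlist(res, fill = TRUE)
    i <- i + 1L
  }
  rbindlist(chunks, fill = TRUE)
}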

@renkun-ken
Contributor Author

When a large result set is returned, most of the time is spent simplifying the result into a data frame, which can take much longer than the query itself.

@SymbolixAU

@renkun-ken - that's what mongolitedt is meant to address

@jeroen
Owner

jeroen commented Jun 26, 2016

Is there any way to speed up the mongolite implementation without relying on a data.table dependency? I'm a bit reluctant to drag in a heavy dependency such as data.table if we only use a single function.

@SymbolixAU

That's a fair statement.
I'm playing about with manipulating the data in C directly from the pointer/cursor, on the assumption that the query returns a 'tabular' data set. That way, whole columns of data can be converted in one go rather than each value one at a time.

But I'm not a C programmer and I'm not making much progress.

@ericwatt

@jeroenooms what about having data.table as a Suggests dependency and using it conditionally?
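
For reference, a minimal sketch of what that could look like (hypothetical, not mongolite's code): use data.table when it is installed and fall back to base R otherwise.

# Hypothetical helper, not part of mongolite: bind a list of flat
# documents, using data.table only if the user has it installed.
bind_docs <- function(docs) {
  if (requireNamespace("data.table", quietly = TRUE)) {
    # fast path, taken only when data.table is available
    data.table::rbindlist(docs, fill = TRUE)
  } else {
    # dependency-free fallback; assumes all documents share the same fields
    do.call(rbind, lapply(docs, as.data.frame, stringsAsFactors = FALSE))
  }
}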

@fred777

fred777 commented Sep 27, 2016

@ericwatt: good point!

@MarkusLang1987

Just found this interesting topic...

Isn't the code wrong? When I have a dataset with 92000 observations and use batches of size 10000, my data.frame ends up only 2000 observations large, because each iteration overwrites the previous chunk.

I've written this code:

library(data.table)

# collect each chunk instead of overwriting it, then bind once at the end
endlist <- list()
i <- 1

iter <- x$iterate()
while(!is.null(res <- iter$batch(size = 100))) {
    chunk <- rbindlist(res)
    endlist[[i]] <- chunk
    i <- i + 1
}

ergebnis <- rbindlist(endlist)

This works for me.

@atheriel
Contributor

Hi, I've done something along the lines of @SymbolixAU's comment and written a new C-level parser for rectangular data, taking advantage of its regularity for a much more efficient allocation strategy. The side effect of this is that the result can simply be rbind()-ed together, avoiding a great deal of the slowness caused by simplify() and discussed in this issue.

You can see the results in this branch of my fork. In my testing it can generally be between 2 and 15 times faster to query data, depending on the number and composition of the columns and the responsiveness of the mongo server itself.

The current implementation is opt-in and not wholly feature complete: you must pass flat = TRUE to the find() method. Anyone sufficiently interested can try it out in the usual way via install_github(), but make sure to choose the correct branch (add-flat-cursor).
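
Based on that description, usage looks roughly like this (the fork path atheriel/mongolite is inferred from the username; the connection details are just an example, and flat = TRUE comes from the comment above):

# install the branch and opt in per query with flat = TRUE
# remotes::install_github("atheriel/mongolite", ref = "add-flat-cursor")
library(mongolite)

stocks <- mongo(collection = "stocks", db = "market")  # example connection
flat <- stocks$find('{ "stock_id": "SH600000" }', flat = TRUE)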

@jeroen Are you interested in trying to incorporate a change of this kind into the package? If so, I'm happy to open a PR for review.

@jeroen
Owner

jeroen commented Aug 8, 2018

@atheriel I would be interested if we can make it in a way that doesn't introduce a hard dependency on data.table and also doesn't introduce too much new complexity in the code.

I would also prefer using dplyr over data.table if possible, because I hope we can add a dplyr back-end to mongolite at some point.
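
As a hedged illustration (not a mongolite feature), the batching pattern from earlier in this thread can be written without data.table by using dplyr::bind_rows(), assuming the documents are flat:

# Sketch only: bind batches with dplyr instead of data.table
library(dplyr)
library(mongolite)

iter <- stocks$iterate('{ "stock_id": "SH600000" }')
chunks <- list()
i <- 1L
while (!is.null(res <- iter$batch(size = 10000))) {
  chunks[[i]] <- bind_rows(res)   # each flat document becomes one row
  i <- i + 1L
}
result <- bind_rows(chunks)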

@LuShuYangMing

@jeroen I also have a similar problem. I have a large vector (length: 60000000) that was split across 1300 documents, and I want to retrieve it back into a single vector as fast as possible. Could you give me some suggestions?
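
One hedged way to do that with the iterator pattern from this thread, assuming con is the mongolite connection and each document stores its chunk of numbers in a field called values (the field name here is made up):

# Sketch only: reassemble a long numeric vector split across documents,
# assuming chunks live in a "values" field and come back in the right order.
iter <- con$iterate('{}', fields = '{ "values": true, "_id": false }')
pieces <- list()
i <- 1L
while (!is.null(doc <- iter$one())) {
  pieces[[i]] <- unlist(doc$values, use.names = FALSE)
  i <- i + 1L
}
big_vector <- unlist(pieces, use.names = FALSE)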
