Simplifying to data frame is slow when querying large number of documents #37
Comments
Thanks for this report. I'm going to look into this.
The magic comes from data.table, which is way faster than plain old data.frame ;-) I'm also iterating and rbindlist'ing most of the time because of this...
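A rough illustration of that speed difference, using made-up toy data, might look like this (binding a list of chunks with data.table::rbindlist() versus repeated base rbind()):

```r
# Toy data only: 1000 small chunks of 100 rows each.
library(data.table)

chunks <- replicate(1000, data.frame(x = rnorm(100), y = rnorm(100)),
                    simplify = FALSE)

system.time(df_base <- do.call(rbind, chunks))  # base R: repeated copying
system.time(df_dt   <- rbindlist(chunks))       # data.table: single allocation
```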
This is why I created [my own package]. Would be good if this could be incorporated into mongolite.
When a large result set is returned, most time will be spent on simplifying the result into a data frame, which can be much slower than querying the data itself.
@renkun-ken - that's what [the package mentioned above] does.
Is there any way to speed up the mongolite implementation without relying on data.table?
That's a fair statement. But I'm not a C programmer and I'm not making much progress.
@jeroenooms what about having data.table as a Suggests and using it conditionally?
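A minimal sketch of what that conditional use could look like (bind_chunks is a hypothetical helper, not part of mongolite):

```r
# Hypothetical helper: use data.table::rbindlist() when data.table is
# installed (listed under Suggests), otherwise fall back to base rbind().
bind_chunks <- function(chunks) {
  if (requireNamespace("data.table", quietly = TRUE)) {
    out <- data.table::rbindlist(chunks, fill = TRUE)
    data.table::setDF(out)  # hand back a plain data.frame either way
    out
  } else {
    do.call(rbind, lapply(chunks, as.data.frame))
  }
}
```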
@ericwatt: good point!
Just found this interesting topic... Isn't the code wrong? When I have a dataset with 92000 observations and I use batches of size 10000, my data.frame ends up with only 2000 observations. I've written this code:
This works for me.
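A minimal sketch of a batch loop along those lines, which keeps every chunk rather than only the last partial one (the connection `con`, the query, and the batch size are illustrative assumptions, not the original snippet):

```r
# `con` is assumed to be an existing mongolite::mongo() connection.
library(data.table)

it <- con$iterate('{}')                         # cursor over the whole collection
pages <- list()
repeat {
  page <- it$page(10000)                        # next 10000 records as a data frame
  if (is.null(page) || nrow(page) == 0) break   # cursor exhausted
  pages[[length(pages) + 1]] <- page            # keep every chunk, not just the last
}
df <- rbindlist(pages, fill = TRUE)             # e.g. 92000 rows, not only the final 2000
```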
Hi, I've done something along the lines of @SymbolixAU's comment and written a new C-level parser for rectangular data, taking advantage of its regularity for a much more efficient allocation strategy. The side effect of this is that the result can simply be [...]. You can see the results in this branch of my fork. In my testing it can generally be between 2 and 15 times faster to query data, depending on the number and composition of the columns and the responsiveness of the mongo server itself. The current implementation is opt-in and not wholly feature complete: you must pass [...] to enable it. @jeroen Are you interested in trying to incorporate a change of this kind into the package? If so, I'm happy to open a PR for review.
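For intuition only, the allocation strategy amounts to sizing each column once and filling it by index rather than growing the result row by row; a plain-R sketch (this is not the C parser, and the field names are made up):

```r
# `docs` stands in for a list of already-parsed, flat records.
docs <- replicate(10000, list(price = runif(1), volume = sample(1000, 1)),
                  simplify = FALSE)

n <- length(docs)
price  <- numeric(n)                 # allocate each column once, at full length
volume <- numeric(n)
for (i in seq_len(n)) {              # fill by index; no copying or growing
  price[i]  <- docs[[i]]$price
  volume[i] <- docs[[i]]$volume
}
result <- data.frame(price = price, volume = volume)
```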
@atheriel I would be interested if we can make it in a way that doesn't introduce a hard dependency on data.table and also doesn't introduce too much new complexity in the code. I would also prefer using dplyr over data.table if possible, because I hope we can add a dplyr back-end to mongolite at some point.
@jeroen I also have a similar problem. I have a large vector (length: 60000000) which was cut into 1300 documents, and I want to retrieve it back into a vector as fast as possible. Could you give me some suggestions?
I have a collection of stock market data with billions of documents; each is a snapshot of the market depth of a stock every 3 seconds.
When I query the data of a stock at the scale of years, the data transmission is fast but most of the time is spent simplifying the result to a data frame.
But if I use the iterator and rbindlist chunk by chunk, the speed is much faster. I wonder if the simplifying mechanism can be made faster in some way?
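As a rough sketch of the comparison being described (the collection, database, and query below are made-up placeholders):

```r
library(mongolite)
library(data.table)

con <- mongo(collection = "depth", db = "market")   # placeholder collection
q   <- '{"code": "000001"}'                         # placeholder query

# 1) One-shot query: transfer is fast, but simplifying to a data frame is slow.
system.time(df1 <- con$find(q))

# 2) Iterate and rbindlist chunk by chunk: reportedly much faster.
system.time({
  it <- con$iterate(q)
  chunks <- list()
  while (length(b <- it$batch(10000)) > 0) {
    chunks[[length(chunks) + 1]] <- rbindlist(b, fill = TRUE)
  }
  df2 <- rbindlist(chunks, fill = TRUE)
})
```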