`$lookup` should reference the join collection instead of cloning data #59
Comments
Hi @Redsandro, thanks for the question. Without a doubt, there is a performance hit with cloning. It has occurred to me to add an option that lets a user specify whether cloning should happen, for performance reasons. This has not been prioritized though, because it had not been raised as a concern until now, probably because the library is most suited to small datasets and that is how most people use it. I also have yet to do extensive benchmarks with the library.
Thanks for the detailed reply @kofrasa. Before I comment, I want to emphasize that I don't know exactly what is cloned and how it's transformed, so some or perhaps all of my thoughts might not make sense. To be honest, I haven't tried it yet.

Now, even if we were to ignore the performance hit: if I have 1 GB of datasets in memory, cloning it all would be a huge waste of memory. Note that I understand your approach, and for small datasets memory is 'cheap' enough to be a good payoff for exactly the reasons you stated. Lastly, cloning larger datasets on a regular basis is wasteful. I would say:
This. But then we can't keep writing the data in the background to it.
Would you be willing to explain what data is being cloned/transformed, when, and how? I'm especially interested in lookups. If there is no deep cloning involved with basic queries, I might get away with using references. E.g.:
This creates a new collection where the values are references to existing data.
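The original snippet did not survive this transcript; a hedged sketch of the kind of derived collection described here (hypothetical `data` and field names) could look like:

```javascript
// Source documents; the derived collection should NOT copy these.
const data = [
  { _id: 1, meta: { tags: ["a"] } },
  { _id: 2, meta: { tags: ["b"] } },
];

// Build a new collection whose values point at the original objects.
// No deep clone happens: each `value` is a reference into `data`.
const refCollection = data.map((doc) => ({ key: doc._id, value: doc }));

console.log(refCollection[0].value === data[0]); // true — a reference, not a copy
```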
Hi @Redsandro. Some pipeline operators clone so they can transform documents without touching the input; for a dataset of say 1 GB, I agree cloning it wholesale is not reasonable.

I will consider adding an option to optimize queries by mutating the input collection. This would reduce cloning, but it would also make some operators unsafe to rerun on the same input. Finally, please note that the MongoDB behaviour is what this library aims to match. I hope this helps.
Thanks @kofrasa. Does the clone function check whether data was previously cloned? Understanding the need to guarantee that the underlying collection is not changed, I believe that cloning should happen at most once in the entire pipeline, and it could/should be postponed until absolutely necessary, making it possible to construct pipelines with minimal cloning. If every step that absolutely must clone first checks whether the collection was already cloned, and clones only if not, we would end up with at most one deep clone for the entire pipeline.
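The "clone at most once" idea could be sketched like this (not mingo's actual implementation — just a minimal illustration, using a JSON round-trip as a stand-in deep clone):

```javascript
// Each stage that must clone calls ensureCloned() first; the flag
// guarantees at most one deep clone per pipeline, however many
// stages would otherwise clone.
function ensureCloned(state) {
  if (!state.cloned) {
    // Deep clone via JSON round-trip, for illustration only.
    state.docs = JSON.parse(JSON.stringify(state.docs));
    state.cloned = true;
  }
  return state.docs;
}

const state = { docs: [{ _id: 1, v: 10 }], cloned: false };
const first = ensureCloned(state);  // the single deep clone happens here
const second = ensureCloned(state); // flag is set, so the same array is reused
console.log(first === second);      // true
```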
Perhaps I'm glossing over some versatility because I'm thinking of my specific use case and a workaround for using references, but it's most certainly a join of sorts, where the to-be-joined part is indeed a filter, because how else would I look up the value? A `$lookup` stage is basically a filter over the join collection plus an embed of the matches.

Note that I believe if I could mark the collection generated this way as "do not clone, ever" and use it as the join collection, the cloning overhead would mostly go away.
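The filter-plus-embed equivalence can be sketched in plain JavaScript (a hedged sketch with hypothetical `orders`/`products` collections, not mingo's actual code):

```javascript
// For each input document, filter the join collection on the foreign
// key and embed the matches under the "as" field — the essence of a
// MongoDB-style $lookup.
function lookup(coll, { from, localField, foreignField, as }) {
  return coll.map((doc) => ({
    ...doc,
    // The join step is "indeed a filter": scan `from` for matching keys.
    [as]: from.filter((j) => j[foreignField] === doc[localField]),
  }));
}

const orders = [{ _id: 1, productId: "a" }];
const products = [
  { _id: "a", name: "Widget" },
  { _id: "b", name: "Gadget" },
];

const joined = lookup(orders, {
  from: products,
  localField: "productId",
  foreignField: "_id",
  as: "product",
});
console.log(joined[0].product); // [{ _id: "a", name: "Widget" }]
```

Note that this sketch embeds matched documents by reference (which is what this issue asks for), whereas cloning them would copy each match.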
To answer your question: the clone function does not maintain state. Cloning currently happens on demand, but for each stage. I considered the approach of cloning once and then keeping a history for subsequent stages, but opted out due to the complexity and significant code size it introduced. Even with that in place, it won't address the `$lookup` behaviour this issue is about.
While looking into this, I found a bug in the `$lookup` cloning. I am marking this issue as a bug. Thanks for reporting and taking the time to discuss.
Thanks for your elaborate response @kofrasa. Fixing this clone step adds a huge performance gain foor Am I correct to assume the following?
I think it is better to keep things simple. The fix for #60 will make the extra option unnecessary for most operators.
@kofrasa Thank you for the commit. It is appreciated.
I didn't mean to propose a change, but I was wondering if I have the correct assumptions in the above list.
Spot on @Redsandro. Your assumptions are correct and I very much like the breakdown. It describes how things should work.
@kofrasa, what is the key field used for?

Or rephrased (for relevance to this topic): (how) do queries use it, and is it indexed?
It is used in a few operators. No indexing is currently done with that field. Unless an index is required across different stages, building and using one just for matching will not improve performance.
Ok, gotcha. @kofrasa said:

> building and using one just for matching will not improve performance
For complex matches or when selecting many documents, you are right. But for simple equality lookups, an index helps enormously.

Let me illustrate. Say you have only 100 documents in the pipeline, and you want to do a lookup for each against a join collection of 10,000 documents.

For every lookup, you have to iterate through the 10,000 documents. The matching document can be at the beginning, middle, or end, so the average would (eventually) be 5,000 iterations. For 100 lookups, that is 500,000 iterations.

Now, to create an index on the fly, you iterate the 10,000 documents once and then do 100 lookups. That's 10,100 iterations: already a ~4,950% performance boost. Needless to say, if the index were pre-created, you'd only need 100 iterations for the lookups. A 500,000% performance boost. 😎

TL;DR: it might be interesting at some point to create an on-the-fly index for equality matching.
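The arithmetic above can be demonstrated with hypothetical data (a sketch, not mingo code): a linear scan per lookup versus a `Map` index built once over the join collection.

```javascript
// 10,000 join documents and 100 pipeline documents, as in the example.
const joinColl = Array.from({ length: 10000 }, (_, i) => ({ _id: i, name: `item ${i}` }));
const pipeline = Array.from({ length: 100 }, (_, i) => ({ _id: i, itemId: i * 97 }));

// Linear scan: one pass over joinColl per pipeline document
// (~5,000 iterations on average per lookup).
function lookupScan(doc) {
  return joinColl.find((j) => j._id === doc.itemId);
}

// On-the-fly index: one pass to build the Map, then O(1) per lookup.
const index = new Map(joinColl.map((j) => [j._id, j]));
function lookupIndexed(doc) {
  return index.get(doc.itemId);
}

const scanned = pipeline.map(lookupScan);
const indexed = pipeline.map(lookupIndexed);
console.log(scanned.length === indexed.length); // true — same results, far fewer iterations
```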
I think I understand what you mean. Please see the `$lookup` implementation: the join collection is already hashed, so for a collection of 100 documents and a join collection of 10,000, each collection is traversed once per stage rather than 100 × 10,000 times. Generally, given a pipeline collection and a join collection, each is iterated once per `$lookup` stage.

In your specific use case, you have multiple stages doing lookups, so an index would indeed help: an index saves having to iterate and hash the pipeline collection lookup field for each stage.
Ah, fantastic. Please forgive my ignorance; your implementation already hashes the join collection. I think you thought things through thoroughly, and it's a good bet to try and use this (once the `$lookup` clone bug is fixed).
I'm not sure what's happening in `lookup.js`:
I'm confused by this specifically:
```javascript
obj[asField] = map(indexes, (i) => clone(joinColl[i]))
```
Is the `$lookup` data being cloned, or added by reference? Cloning, of course, is trivial for small datasets but huge (unnecessary) overhead on big datasets.
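The distinction in question, illustrated with a minimal example (not mingo code; the deep clone here is a JSON round-trip stand-in):

```javascript
const joinColl = [{ _id: 1, name: "Widget" }];

// Embedding by reference shares the joined object; cloning copies it.
const byReference = { item: joinColl[0] };
const byClone = { item: JSON.parse(JSON.stringify(joinColl[0])) };

// Mutating the source shows the difference:
joinColl[0].name = "Renamed";
console.log(byReference.item.name); // "Renamed" — sees the mutation
console.log(byClone.item.name);     // "Widget"  — isolated copy
```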