
$lookup should reference join collection instead of clone data. #59

Closed
Redsandro opened this issue Sep 3, 2017 · 15 comments

@Redsandro

I'm not sure what's happening in lookup.js:

// Index the join collection by a hash of the foreign field value.
each(joinColl, (obj, i) => {
  let k = hashCode(obj[foreignField])
  hash[k] = hash[k] || []
  hash[k].push(i)
})

// Attach the matching join documents (cloned) to each input document.
each(collection, (obj) => {
  let k = hashCode(obj[localField])
  let indexes = hash[k] || []
  obj[asField] = map(indexes, (i) => clone(joinColl[i]))
  result.push(obj)
})

I'm confused by this specifically:

obj[asField] = map(indexes, (i) => clone(joinColl[i]))

Is the $lookup data being cloned, or added by reference?
Cloning is, of course, trivial for small datasets, but it adds huge (and unnecessary) overhead on big datasets.
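
To make the question concrete, here is a minimal sketch (illustration only, not mingo code) of the difference I'm asking about:

// Sample join collection (made-up data).
const joinColl = [{ _id: 1, name: 'Alice' }]

// By reference: the joined value is the same object as the source document.
const byRef = { as: joinColl[0] }
byRef.as.name = 'Bob'
console.log(joinColl[0].name) // 'Bob' -- mutating the result mutated the join collection

// By clone: the joined value is an independent copy, at the cost of copying everything.
const byClone = { as: JSON.parse(JSON.stringify(joinColl[0])) }
byClone.as.name = 'Carol'
console.log(joinColl[0].name) // still 'Bob' -- the join collection is untouched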


kofrasa commented Sep 4, 2017

Hi @Redsandro

Thanks for the question.

Without a doubt, there is a performance hit with cloning. Although mingo only implements querying, it does perform some transformations as well. I decided to use cloning in order to avoid surprising users by changing their original data structure unexpectedly. This is not unique to $lookup, which is why it does not reference the join data: all operators that transform data do not do so in place and must clone first. So cloning happens on an as-needed basis.
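
As a rough sketch of that contract (assuming the top-level aggregate helper; the exact call may differ between versions), the input collection comes out of a pipeline unchanged:

const mingo = require('mingo')

const input = [{ _id: 1, a: { b: 2 } }]

// Project a nested value into a new field; the result is a new set of objects.
const result = mingo.aggregate(input, [
  { $project: { value: '$a.b' } }
])

console.log(result[0].value) // 2
console.log(input[0])        // { _id: 1, a: { b: 2 } } -- the original document is not modified in place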

It has occurred to me to add an option to allow a user to specify whether cloning should happen or not for performance reasons. This has not been prioritized, though, because it had not been raised as a concern until now, probably because the library is best suited to small datasets and that is how most people use it.

Considering I have yet to do extensive benchmarks with mingo, I will take this opportunity to gather some data:

  1. How large is your dataset?
  2. What are the performance characteristics you are observing?


Redsandro commented Sep 4, 2017

Thanks for the detailed reply @kofrasa.

Before I comment, I want to emphasize that I don't know what exactly is cloned and how it's transformed. So some or perhaps all of my thoughts might not make sense.

To be honest, I haven't tried mingo yet. I am researching ways to improve my application by not having to wait for MongoDB writes before I can query the data. Let's say that in a heavy scenario I would need to $lookup 6,000 records. But that's not all: I need to do this with (again, heavy scenario) 30 to 100 concurrent datasets. All together there will be multiple queries per second on some dataset or another.

Now even if we ignore the performance hit: if I have 1 GB of datasets in memory, cloning it all would be a huge waste of memory. Note that I understand your approach; for small datasets, memory is 'cheap' enough that the trade-off is worth it, for exactly the reasons you stated.

Lastly, cloning larger datasets on a regular basis blocks Node.js from handling further requests while it is busy, so the performance gained by not waiting on MongoDB is lost to the blocking.

I would say:

It has occurred to me to add an option to allow a user to specify whether cloning should happen or not for performance reasons.

This. But then we can't keep writing the data to MongoDB in the background if the data is transformed before the write completes, which negates the reason to use mingo in the first place. However (and either this is a good idea or you already do it), if small objects were created that reference the data from the set without cloning everything, this problem would not exist and the data would remain the same.


Redsandro commented Sep 4, 2017

Would you be willing to explain in the readme.md what data is cloned/transformed, when, and how?

I'm especially interested in doing $count and $redact. Many, many counts. Do these clone/transform data?

If there is no deep cloning involved in basic queries, I might get away with doing $lookups manually and using the result with mingo.

E.g.:

let collection = collection0.map(doc => ({
    data: doc,
    lookup1: lazy(collection1).where({id: doc.lookup1.id}).value(),
    lookup2: lazy(collection2).where({id: doc.lookup2.id}).value(),
    lookup3: lazy(collection3).where({id: doc.lookup3.id}).value()
}))

This creates a new collection where the values are references to existing data. I'm using lazy.js to do lookups. It's similar-ish to lodash and underscore, but outperforms them in many of the scenarios I'm encountering.


kofrasa commented Sep 4, 2017

Hi @Redsandro

The pipeline operators $unwind, $lookup, $redact, $addFields, and $project reshape objects. Cloning is used to guarantee that the underlying collection is not changed. This is a hard requirement for any query engine, even an in-memory one. A query should not mutate the input dataset.

For a dataset of, say, 1GB, I don't think mingo is the right solution. Although the library is usable in a NodeJS environment, it was initially written for use in the browser, where large datasets are typically impractical.

I will consider adding an option to optimize queries by mutating the input collection. This would reduce cloning but also make some operators (e.g. $out) produce nonsensical results, assuming no extra work is done to correct it. That can be addressed, and it is a reasonable tradeoff for more performance. It will require clear documentation that objects in the input collection can be modified and should not be considered unchanged after having been used in a query.

In your case $count is ok but $redact still needs to clone because the object gets transformed.

Finally, please note that the MongoDB $lookup operator and your example above are not the same. The former actually performs a join operation while the latter is a filter. You can use filters via mingo.Query and have access to the full set of MongoDB query operators ($and, $or, $elemMatch, etc.). This requires no cloning and is more powerful than many alternatives.
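
For example, a plain filter over an in-memory array looks roughly like this (a sketch only; see the README for the exact API, and the sample data is made up):

const mingo = require('mingo')

const people = [
  { name: 'Anna', age: 30, roles: ['admin'] },
  { name: 'Ben', age: 17, roles: ['guest'] }
]

// Standard MongoDB query operators, evaluated against the array without cloning it.
const query = new mingo.Query({
  $and: [{ age: { $gte: 18 } }, { roles: 'admin' }]
})

console.log(query.find(people).all()) // [{ name: 'Anna', age: 30, roles: ['admin'] }]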

I hope this helps.


Redsandro commented Sep 5, 2017

Thanks @kofrasa.

Does the clone function check whether data was previously cloned? Understanding the need to guarantee that the underlying collection is not changed, I believe that cloning should happen at most once in the entire pipeline, and it could/should be postponed until absolutely necessary, making it possible to construct pipelines with minimal cloning.

Let's take $project, for example. It will be used often in pipelines, but not all pipelines require deep cloning. I'm hypothesizing that creating new objects that point to (reference) the data in the objects from the previous step in the pipeline (shallow cloning) is cheaper. Note that 'referencing' single-value properties is essentially the same as cloning them, but if the data consists (perhaps even exclusively) of subdocuments (objects), those would be referenced rather than deep-cloned. If the original collection data is not used further, it gets garbage-collected. If it is used further (e.g. a MongoDB write in the background), it is still the same unchanged data.

If $unwind were next, that would (probably) require deep cloning. It could look for mingo metadata to see whether the collection was previously cloned. If so, mutate; if not, clone first, mark as cloned, and then mutate.

If every step that absolutely must clone first checks whether the collection was previously cloned, and clones only if it wasn't, we would have at most one deep clone (e.g. $unwind) and one shallow clone (e.g. $project) per pipeline.
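
Something like this, purely as a sketch of the idea (not mingo internals; the helper names are made up):

const CLONED = Symbol('cloned')
const deepClone = (obj) => JSON.parse(JSON.stringify(obj)) // stand-in deep-clone helper

function ensureOwnCopy(collection) {
  if (collection[CLONED]) return collection // a previous stage already cloned it: safe to mutate
  const copy = collection.map(deepClone)
  copy[CLONED] = true                       // mark so later stages skip the deep clone
  return copy
}

// A stage that must mutate (e.g. $unwind) would call ensureOwnCopy(coll) first;
// reference-only stages would pass the collection straight through.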

Finally, please note that the MongoDB $lookup operator and your example above are not the same. The former actually performs a join operation while the latter is a filter.

Perhaps I'm giving up some versatility because I'm thinking of my specific use case and a workaround for using references, but it's most certainly a join of sorts, where the to-be-joined part is indeed a filter (how else would I look up the value?). The use of lazy can be compared to a single findOne(). Let me make one small edit below:

This:

let collection = collection0.map(doc => ({
    data: doc,
    lookup1: lazy(collection1).where({id: doc.lookup1.id}).first(),
    lookup2: lazy(collection2).where({id: doc.lookup2.id}).first(),
    lookup3: lazy(collection3).where({id: doc.lookup3.id}).first()
}))

Is basically the following pipeline:

db.collection0.aggregate([{
    $match: {}
}, {
    $project: {
        data: '$$ROOT'
    }
}, {
    $lookup: {
        from: 'collection1',
        localField: 'id',
        foreignField: 'id',
        as: 'lookup1'
    }
}, {
    $lookup: {
        from: 'collection2',
        localField: 'id',
        foreignField: 'id',
        as: 'lookup2'
    }
}, {
    $lookup: {
        from: 'collection3',
        localField: 'id',
        foreignField: 'id',
        as: 'lookup3'
    }
}]).toArray()

Note that all of collection0's data is now projected into the data key, which is necessary to keep the original data untainted and to construct the new collection using only references (zero cloning).

I believe that if I could mark the collection the above example generates as "do not clone ever" and use it with mingo, the original data would remain untainted because it is merely referenced (except for $unwind).


kofrasa commented Sep 5, 2017

To answer your question: the clone function does not maintain state.

Cloning currently happens on demand, but per stage. I considered the approach of cloning once and then keeping a history for subsequent stages, but opted out due to the complexity and the significant code size it introduced. Even with that in place, it would not address the $lookup case, since the collection being cloned is the secondary data source. Too many decision points open up with this approach, and it is not always clear what the right thing to do is.

mingo does reference values of the input data, but it will return a new object that references the unchanged parts of the original if an operation modifies, adds, or removes a value from the original object. So what you are hypothesizing is what already happens in some cases. Cloning the whole object is therefore not necessary, but it is the simplest and safest approach to take in some cases.

For example, given the object {a: {b: {c: {d: {e: 1}}}}}, if the value for key e is changed, the entire object must be cloned. Given another object, say {x: 3, a: {b: {c: {d: {e: 1}}}}}, if we change the value of x to 4, then we must create a new object with the updated value but reference the value of a, yielding {x: 4, a: <ref-a-val>}.
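
In plain JavaScript terms, that second case looks like this:

const original = { x: 3, a: { b: { c: { d: { e: 1 } } } } }

// Copy only the top level and overwrite x; the nested value under a stays shared.
const updated = Object.assign({}, original, { x: 4 })

console.log(updated.x)                // 4
console.log(updated.a === original.a) // true -- referenced, not cloned
console.log(original.x)               // 3 -- the original object is unchanged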

I found a bug in $lookup (see #60): it does not do what I describe above but instead modifies the original. The fix should also remove the need to clone the join collection, which will address your use case.

I am marking this issue as a bug.

Thanks for reporting and taking the time to discuss.

kofrasa added the bug label Sep 5, 2017
kofrasa changed the title from "Does $lookup reference or clone data?" to "$lookup should reference join collection instead of clone data." Sep 5, 2017
@Redsandro

Thanks for your elaborate response @kofrasa.

Fixing this clone step adds a huge performance gain for $lookup and will indeed address my use case. Thank you.

Am I correct to assume the following?


kofrasa commented Sep 5, 2017

I think it is better to keep things simple.

The fix for #60 will make the extra option unnecessary for most operators.

kofrasa closed this as completed in 2fb3650 Sep 6, 2017

Redsandro commented Sep 6, 2017

@kofrasa Thank you for the commit. It is appreciated.

I think it is better to keep things simple.

I didn't mean to propose a change, but I was wondering whether my assumptions in the list above are correct.


kofrasa commented Sep 6, 2017

Spot on @Redsandro.

Your assumptions are correct and I very much like the breakdown. It describes how things should work.

@Redsandro

@kofrasa what is the idKey used for?

mingo.setup({
    key: '_id' // default
});

Or rephrased (for relevance to this topic):

(How) do find(), $match and $lookup use this? Are these functions optimized using this index in any way, e.g. finding by _id being faster than a lazy/lodash/underscore where filter?


kofrasa commented Sep 6, 2017

idKey refers to the field that uniquely identifies the objects in the collection.

It is used in a few operators (e.g. $project) that require it. The default name _id was used to be consistent with MongoDB.

No indexing is currently done with that field. Unless an index is required for different stages, building and using one just for matching will not improve performance.


Redsandro commented Sep 6, 2017

Ok gotcha.

@kofrasa said:

No indexing is currently done with that field. Unless the index is required for several different operators (which I don't think it is), the performance for building and using one just for matching will not be greater.

For complex matches or selecting a lot of documents, you are right. But for findOne() or matching a unique value like _id, creating an index is more performant when you do more than 2 lookups (3+).

Let me illustrate using $lookup as an example, as it is less dependent on use-case.

Say you have only 100 documents in the pipeline, and you want to do an _id based lookup from a collection of 10,000 documents (e.g. all the students in a university or members of a marketplace).

For every lookup, you have to iterate through 10,000 documents. The matching document can be in the beginning, middle, or end, so the average would (eventually) be 5,000 iterations. For 100 lookups, that is 500,000 iterations.

Now to create an index on the fly, you have to iterate 10,000 documents once, and then do 100 lookups. That's 10,100 iterations. That's already a ~4950% performance boost.

Needless to say, when the index was pre-created, you'd only need 100 iterations for the lookup. 500,000% performance boost. 😎

TL;DR: It might be interesting at some point to build an on-the-fly _id index for the as field in lookups.
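
For illustration, such a throwaway index could be as simple as a Map keyed on the foreign field (sketch only; the collection and field names are made up):

// Imagine 10,000 of these (e.g. all students) and 100 pipeline documents.
const students = [{ _id: 1, name: 'Ada' }, { _id: 2, name: 'Bob' }]
const pipelineDocs = [{ studentId: 2 }]

// One pass to build the index...
const byId = new Map(students.map(doc => [doc._id, doc]))

// ...after which every lookup is a constant-time Map.get instead of a linear scan.
for (const doc of pipelineDocs) {
  doc.student = byId.get(doc.studentId)
}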


kofrasa commented Sep 6, 2017

I think I understand what you mean. Please see the $lookup implementation here. There are a few implementation details that can be improved but the general idea remains unchanged.

For a collection of 100 documents and a join collection of 10,000, mingo will perform 10,100 iterations, not 500,000.

Generally, given a pipeline collection xs and join collection ys, we do size(xs) + size(ys) iterations. $lookup uses a hash-join algorithm. What I was alluding to is that indexes aren't particularly useful for mingo.

In your specific use case, you have multiple $lookups in the pipeline, all joining on the same field id, so you want to index the pipeline collection. For that to happen, though, the different lookup stages cannot be independent. Also, if any of the stages joined on a separate foreign field, building an index only to throw it away would add no value.

An index saves having to iterate over and hash the pipeline collection's lookup field for each stage. You may want to implement a custom operator for your special case, say $multiLookup, which would first index the input and then use that index for all the stages involved.
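
A rough sketch of that idea as a plain helper (not a real registered mingo operator; the names and shapes are made up):

// Hash the pipeline collection once on the shared local field, then reuse that
// single index to join several foreign collections by reference (no cloning).
function multiLookup(collection, localField, lookups) {
  const index = new Map()
  for (const doc of collection) {
    const k = doc[localField]
    if (!index.has(k)) index.set(k, [])
    index.get(k).push(doc)
    for (const { as } of lookups) doc[as] = [] // note: this mutates the pipeline documents
  }

  for (const { from, foreignField, as } of lookups) {
    for (const foreign of from) {
      for (const doc of index.get(foreign[foreignField]) || []) {
        doc[as].push(foreign)
      }
    }
  }
  return collection
}

// Usage, roughly matching the pipeline earlier in the thread:
// multiLookup(collection0, 'id', [
//   { from: collection1, foreignField: 'id', as: 'lookup1' },
//   { from: collection2, foreignField: 'id', as: 'lookup2' },
//   { from: collection3, foreignField: 'id', as: 'lookup3' }
// ])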


Redsandro commented Sep 6, 2017

Ah, fantastic. Please forgive my ignorance.

Your hashCode lookup is basically what I meant by an on-the-fly, throwaway index, except I would have written an index rather than a hash table. For no particular reason, just a first thought: array lookups are fast, whereas hashing adds some overhead.

I think you have thought things through thoroughly, and it's a good bet to try using this inside Node.js (once you fix the clone(joinColl[i]) for lookups; see #63). Since collections don't need to be 'inserted' as in LokiJS, I can try more trickery and manually split collections up into bite-size chunks, imitating Loki live collections.
