Added .group() function #153

Open · wants to merge 28 commits
@adityamukho

Ref: http://docs.mongodb.org/manual/reference/method/db.collection.group/

Added a function to the Cursor prototype that lets users reduce the result set returned by find(), enabling operations like aggregation (sum, avg, count, etc.) or any other reducer. It works in conjunction with limit(), skip() and sort(). The order of the execution pipeline is: find -> group -> sort -> slice (skip + limit).

The function is invoked on the Cursor object returned by find(), with a spec object that looks like:

{
  key: {
    'field1': 1,
    'nested.field.2': 1
  },
  reduce: function (curr, result) {
    result.count++;
    result.size += curr.size;
  },
  initial: {
    count: 0,
    size: 0
  },
  finalize: function (result) {
    result.avg = Math.round(result.size / result.count);
  }
}

Unlike MongoDB's group(), this does not need a cond parameter, since the find() operation takes care of that filtering beforehand.

The function returns the Cursor on which it was invoked, allowing for chaining sort, skip/limit ops later.
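
For illustration, here is a minimal end-to-end sketch based on the description above. It is not taken from the PR itself: the db variable, the find() filter and the document shape are hypothetical, while the group spec is the one shown above.

// Hypothetical documents: { field1: 'A', nested: { field: { 2: 'b' } }, size: 2 }
db.find({ size: { $gt: 0 } })
  .group({
    key: { 'field1': 1, 'nested.field.2': 1 },
    reduce: function (curr, result) {
      result.count++;
      result.size += curr.size;
    },
    initial: { count: 0, size: 0 },
    finalize: function (result) {
      result.avg = Math.round(result.size / result.count);
    }
  })
  .sort({ field1: 1, nested_field_2: 1 }) // sort on the flattened key names (explained below)
  .skip(0)
  .limit(10)
  .exec(function (err, results) {
    if (err) { return console.error(err); }
    console.log(results);
  });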

The reduction outputs a result set that looks like:

[
  {
    field1: 'A',
    nested_field_2: 'b',
    count: 2,
    size: 4,
    avg: 2
  },
  {
    field1: 'A',
    nested_field_2: 'b1',
    count: 3,
    size: 9,
    avg: 3
  },
  {
    field1: 'B',
    nested_field_2: 'b',
    count: 1,
    size: 4,
    avg: 4
  },
  {
    field1: 'B',
    nested_field_2: 'b2',
    count: 5,
    size: 1,
    avg: 0
  },
...
...
]

The dots in the groupBy key field names are replaced by underscores (e.g. nested.field.2 becomes nested_field_2) so that a subsequent sort, if present, works properly.

I have tried my best to ensure the function is optimized and robust. The test suite passes all tests, and I have tested this function on a DB with ~380,000 records using the same reducer as shown above; it works fine. I haven't added test cases directly to the project yet, though - I will do so if this looks likely to be merged :).

PS. The browser versions were auto-generated using the build script. I haven't modified them manually. The test page shows no errors there either.

@Ivshti

Ivshti commented Mar 9, 2014

I haven't tested this much, but first I tried to use it like:
col.group({ key: { firstName: 1 } }).exec(function(err, res) { console.log(res) }) and no result was ever returned

My collection has two objects, both of which have a firstName field. What am I doing wrong? Can this be used without the reduce, finalize, etc.?

@adityamukho

adityamukho commented Mar 9, 2014

You need to invoke col.find(...).group(...) rather than col.group(). It is a slight departure from the Mongo invocation, but I felt this made more sense in the nedb context, keeping it in line with the other result-set post-processors like sort, skip, etc.

Also, this way one doesn't need to support the 'cond' attribute, since find() takes care of filtering already. In this implementation, grouping is only possible on the result of the find() call, and so must be invoked on the cursor (hence you should omit passing a callback to find()).

finalize() is optional, but a reduce() is necessary - in a way it is the whole point of the group function :-).
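
For example, a minimal grouping that only counts documents per key might look like the sketch below (col and firstName are taken from the snippet quoted above; the output shown is hypothetical):

col.find({})
  .group({
    key: { firstName: 1 },
    reduce: function (curr, result) { result.count++; },
    initial: { count: 0 }
  })
  .exec(function (err, res) {
    if (err) { return console.error(err); }
    console.log(res); // e.g. [{ firstName: 'Alice', count: 2 }, { firstName: 'Bob', count: 1 }]
  });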


@Ivshti

Ivshti commented Mar 9, 2014

Sorry, I did invoke .find() - I just forgot to write it here. Otherwise it would have thrown an error (.group would be undefined on the collection, since it's defined on the cursor).

However, in my case there was no error - the callback was simply never invoked. That behavior is not ideal: if reduce is required, an error should be thrown when it's not passed.

@Ivshti

Ivshti commented Mar 9, 2014

It seems to work when I add both initial and reduce.

I'm sorry - the error is actually being passed to the callback. My mistake, please ignore my initial comment.

@adityamukho

adityamukho commented Mar 10, 2014

Hmmm, it would have been really strange if it hadn't thrown an error for the missing input - the validation code that checks for a valid key, reduce and initial is present.

@louischatriot

louischatriot (Owner) commented Mar 12, 2014

Thanks for the PR, this seems very interesting. Will try to review and merge in the coming days.


@adityamukho

adityamukho commented Mar 13, 2014

Added a commit that allows the caller to omit specifying the key. This will result in the reducer being applied to the entire find() result, followed by finalize(), if defined.
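
A sketch of such a key-less aggregation (the db variable and the size field are hypothetical):

db.find({})
  .group({
    reduce: function (curr, result) { result.total += curr.size; },
    initial: { total: 0 }
  })
  .exec(function (err, res) {
    if (err) { return console.error(err); }
    // With no key, the reducer runs over the entire result set, so a single
    // aggregate is expected here rather than one entry per group.
    console.log(res);
  });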


@Azema

Azema commented Apr 2, 2014

Hi,

It's a very useful piece of functionality - thank you, @adityamukho. I've merged it on my fork (along with the new sort function from #159) and it works correctly so far.

The thing that tripped me up is that array values are not accepted, even when manipulating the data in the reduce function. Is this expected?

@adityamukho

adityamukho commented Apr 4, 2014

Thanks for the feedback @Azema
I'm guessing you're talking about the case where the path indicated by the key points to an array rather than a scalar. Yes, in that case it wouldn't work as expected, since the grouping algorithm doesn't perform a 'deep compare'.

In any case, can you share the dataset and operators you used in your query? It may be useful to take a closer look.


@Azema

Azema commented Apr 4, 2014

Hi @adityamukho,

Here is the information you asked for.

I tried removing the check that rejects array values for the key field, and in my case that worked, because the arrays were concatenated. But I don't think it would work in all cases.

An example of the data used:

{ "title": "A", "genre": ["a","b"] }
{ "title": "B", "genre": ["c"] }
{ "title": "C", "genre": ["d","e"] }
{ "title": "D", "genre": [] }

My query:

  var sort = {genre: 1};
  db.find({})
    .group({
      key: {'genre': 1},
      reduce: function(curr, result) {
        if (curr.genre instanceof Array && curr.genre.length > 0) {
          result.genre = curr.genre[0];
        }
        return result;
      },
      initial: {}
    }).sort({genre: 1}).exec(function(err, genres) {
      res.json({results: genres});
    });

The results of the query:

[
  { "genre": "a" }
]

Here are the expected results:

[
  { "genre": "a" },
  { "genre": "c" },
  { "genre": "d" }
]

@adityamukho

adityamukho commented Apr 6, 2014

I'm looking into enabling array values in key fields, but in the meantime I think the expected result should be:

[
  { "genre": "a" },
  { "genre": "c" },
  { "genre": "d" },
  { "genre": [] }
]

...since the actual keys - genre: ["a","b"], genre: ["c"], genre: ["d","e"], genre: [] - are all being overwritten by the reducer, except the last one. I'm not sure where the sort ought to put that last element, but it should be there somewhere.


@adityamukho

adityamukho commented Apr 11, 2014

Hi @Azema
I've added support for non-primitive keys in a separate branch - 7030704

I'm converting non-primitive keys to a fixed-length string hash during the intermediate collection stage. This is less than ideal, since the number of possible non-primitive keys is greater than the number of possible fixed-length hashes. However, in situations where no hash collision occurs, it does cover the use case you have given.

I would refrain from merging this into the master branch until I or someone else comes up with a better way to uniquely and efficiently identify (possibly large) non-primitive objects (perhaps using a variable-length hash).


@Azema

Azema commented Apr 11, 2014

Hi @adityamukho,

Thank you for your efforts; I will test your changes and come back to you if I find a problem.

I think hashing the keys is a good idea, but it may increase processing time. That said, I don't have anything better to suggest for now.

@adityamukho

adityamukho commented Apr 11, 2014

I have updated the code to use a variable-length hash. It uses the djb2 algorithm, which may not be quite as fast as the previous xxhash, but it ensures uniqueness and is still much faster than any crypto hash or two-way hash. The included implementation, provided by the es-hash module, has the added advantage of being browser-compatible.

The default behaviour is to hash all keys during the collection stage, ensuring correct output in all cases at the cost of a slight performance penalty. Hashing can be disabled by passing noHashKeys: true in the group object, for a performance boost. This should be done ONLY if you are sure that all groupBy keys in the result set are either primitives or small arrays of primitives. (For large arrays, the hashed representations may be faster than their stringified versions.)

https://github.com/adityamukho/nedb/tree/feature-array-keys
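
For reference, a plain 32-bit djb2 string hash looks roughly like the sketch below. This is only an illustration of the algorithm family; the branch above uses a variable-length variant via the es-hash module, so the actual implementation differs, and a non-primitive key would first be serialized before hashing.

// Classic djb2 over a string (illustrative only, not the branch's implementation).
function djb2(str) {
  var hash = 5381;
  for (var i = 0; i < str.length; i++) {
    hash = ((hash << 5) + hash) + str.charCodeAt(i); // hash * 33 + charCode
    hash = hash >>> 0;                               // keep it in unsigned 32-bit range
  }
  return hash;
}

// A non-primitive group key could be reduced to a string first, e.g.:
var keyHash = djb2(JSON.stringify({ genre: ['a', 'b'] }));
console.log(keyHash);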


@adityamukho

adityamukho commented Apr 14, 2014

Another small validation - I just finished writing an NeDB ORM adapter for the Sails framework (https://github.com/adityamukho/sails-nedb), based on this fork. The Waterline (Sails' persistence layer) tests are fairly rigorous and extensive, and so far all tests have passed, including the aggregation tests.


@Pheo-Player

Pheo-Player commented Aug 15, 2014

This would be a great feature to have. Is it still planned to merge this?


@adityamukho

adityamukho commented Aug 16, 2014

I haven't looked at this PR in a while. It looks like some files have changed in the master repo in the meantime, which would now require manual conflict resolution. Also, I still have to push myself to write some test cases. Other than these two points, I think this PR is still ripe for a merge: I have used this function extensively in several applications and have not seen any errors or incorrect behavior so far.


@iclems

iclems commented Oct 21, 2014

@adityamukho congrats, it's really useful -- I hope it'll be merged some day!


@FaynePerera

FaynePerera commented Dec 4, 2014

Hi, @adityamukho!
Just a simple request: it would be nice to have this function fully documented. I know it has a lot in common with Mongo's group function, but I'm sure there are some differences worth mentioning. It turns out to be very useful, at least for the count case, which is the only one documented well enough in this PR. The sum case, however, is not so clear - could you please give an example with a sum?
Nevertheless, congrats - I'm sure this could be a great feature.
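
For what it's worth, a sum along the lines of the spec in the PR description could be sketched as follows (the db variable and the field names are hypothetical):

db.find({})
  .group({
    key: { field1: 1 },
    reduce: function (curr, result) { result.sum += curr.size; },
    initial: { sum: 0 }
  })
  .exec(function (err, res) {
    if (err) { return console.error(err); }
    console.log(res); // e.g. [{ field1: 'A', sum: 13 }, { field1: 'B', sum: 5 }]
  });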


itayw referenced this pull request in joola/nedb on Feb 18, 2015

Merged

Add .sort() and .group() func #2

@Glavin001

Glavin001 commented May 28, 2015

Any progress on this? It is rather worrying that a pull request like this does not get merged while merge conflicts pile up; it makes it tough for contributors to consider contributing. It looks like other users are forking nedb because the original repo does not contain the functionality they want.


@luislobo

luislobo commented Nov 11, 2015

+1 This should be merged...


@luislobo

luislobo commented Jan 6, 2016

Any news on merging this?


@JamesMGreene

JamesMGreene (Contributor) commented Jan 8, 2016

👍


@tcurdt

tcurdt commented Feb 21, 2016

Urgh - open for 9 months now :-/
Any chance for a merge?


@louischatriot

louischatriot (Owner) commented Feb 21, 2016

Unfortunately not - I don't expect to have time for it in the coming months ...


@coderofsalvation

coderofsalvation commented Apr 17, 2016

@louischatriot will you consider adding a contributor to this project? (PRs which take > 9 months seem quite extreme.)

@fjeddy

fjeddy commented Oct 24, 2016

Should this project (NeDB) be considered dead?


@louischatriot

louischatriot (Owner) commented Oct 25, 2016

No, it still works well. Features are not being added regularly, but that is pretty different from being a dead project.


pi0 added a commit to nedbhq/nedb-core that referenced this pull request Nov 30, 2016

Added .group() function #153 by @adityamukho
This commit is a cleaned-up version of the original PR.
@pi0


pi0 commented Nov 30, 2016

JamesMGreene referenced this pull request in JamesMGreene/nestdb on Aug 15, 2017

Open

Consider adding `.group(...)` function a la MongoDB #12
