Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cant count result #195

Closed
metalwolf opened this issue Jan 5, 2016 · 14 comments
Closed

Cant count result #195

metalwolf opened this issue Jan 5, 2016 · 14 comments

Comments

@metalwolf
Copy link

ON previous mongo driver you had all the result functions ->count()
It seems the new driver lacks the counting function

$coleccion->find()->count() is missing, it is impossible to count result on phongo

@jmikola
Copy link
Member

jmikola commented Jan 6, 2016

This was previously asked in #178. In the new driver, a Cursor object now represents the actual results of an executed command or query, whereas cursors in the legacy driver had a dual-nature (pre- or post-executed). The legacy MongoCursor::count() method actually invoked a count command with the same criteria; however, this meant it was entirely possible for the count command's calculation to differ from the actual query results, since they were not necessarily operating on the same result data.

The userland library implements a count() method on the collection, which also takes filter criteria, and we'd rather encourage that so users understand that the operations are separate. If we do end up making Cursor countable in the userland library, we would use iterator_count() to provide an exact count on the results; however, this will also require us to cache the results so that users can rewind the cursor and iterate again (this was most recently discussed in PHPLIB-81). While the legacy driver would re-execute the query on a rewind, that's also not an option given the nature of a Cursor in the new driver.

I realize this may come across as an inconvenience, but I believe the explicit API is a ultimately a step forward from the legacy driver, where many users might have been oblivious to what was actually happening under the hood. Feel free to follow up if you have additional questions.

@jmikola
Copy link
Member

jmikola commented Feb 23, 2016

Closing this out as I believe it was answered, but feel free to follow up if you have additional questions.

@Zelfapp
Copy link

Zelfapp commented Mar 16, 2016

The reasoning for taking out count(), sort(), and limit() I get. However, those are a ton of breaking changes for moving to the new driver here on a large application. Lots of rewriting unfortunately.

My question, is where's an example of the new best practice for easily counting filtered results on http://mongodb.github.io/mongo-php-library/classes/collection/?

It states: "MongoDB\Collection is perhaps the most useful class in this library. It provides methods for common operations on a collection, such as inserting documents, querying, updating, counting, etc.". However, there is no example of counting.

Will you please provide the updated example using the code below of using MongoDB\Collection for counting?

$cursor = $collection->find($filter);

if ($cursor->count() > 0) {

}

It would be great on http://mongodb.github.io/mongo-php-library/ to see a complete list of breaking changes from the legacy mongo driver and new best practices code examples of their equivalents using the new mongodb driver and library.

@derickr
Copy link
Contributor

derickr commented Mar 16, 2016

There is documentation at http://mongodb.github.io/mongo-php-library/api/class-MongoDB.Collection.html#_count, although that's not particularly useful or extensive.

The original driver's count() method, does in fact, not do something with a cursor. It constructs the original query, and then runs that with a count command. In fact, this is what happens under the hood (old driver);

$filter = [ 'name' => 'Derick' ];
$cursor = $collect->find( $filter ); // nothing happens right now
foreach ( $cursor as $record ) { }; // upon foreach, the query gets executed
echo $cursor->count();
/* original query is wrapped in command, which then is executed and its result returned. Under the hood, it would run: */
$collection->db->command( [ 'count' => $collection->db->getName(), 'query' => $filter ] );

In the new driver, this behaviour is different:


$filter = [ 'name' => 'Derick' ];
$q = new \MongoDB\Driver\Query( $filter );
$r = $manager->executeQuery( "{$dbname}.{$collectionname}", $q ); // query is already executed here

/* To count, you'd simply construct the command yourself: */

$cmd = new \MongoDB\Driver\Command( [ 'count' => $collectionname, 'query' => $filter ] );
$r = $manager->executeCommand( $dbname, $cmd );

I also wrote about this in an article: https://derickrethans.nl/new-drivers-part3-cursor.html

@Zelfapp
Copy link

Zelfapp commented Mar 16, 2016

Ah, I didn't see https://derickrethans.nl/new-drivers-part3-cursor.html. That would have saved a lot of time. Having a breaking changes section on mongo-php-library would be helpful. My two cents there.

Would $count = $collection->count($filter); be the one liner of your example above? Seems like it works. The part I'm still confused about is $cursor->count() was removed because it could possibly be inaccurate, but we're still running the query twice, so what have we gained by not simply being able to call $cursor->count() other than pointing out the fact that it is not accurate?

@jmikola
Copy link
Member

jmikola commented Mar 17, 2016

The part I'm still confused about is $cursor->count() was removed because it could possibly be inaccurate, but we're still running the query twice, so what have we gained by not simply being able to call $cursor->count() other than pointing out the fact that it is not accurate?

By removing the method from the Cursor class, we're forcing users to acknowledge that it is not a true count of the cursor's results. This goes hand in hand with eliminating the dual nature of the cursor (as MongoCursor had in the legacy extension) and its pre- and post-executed states -- along with the fact that rewind() would re-execute the query. I'm convinced that avoiding misunderstandings of how the cursor operators is worth the extra line of code or two to invoke the count command through Collection.

The only accurate solution would be to use iterator_count(); however, that requires traversing the iterator to compute the count. While you could do that, it would leave you unable to rewind and traverse the actual result data, which I doubt is very useful. This is because the Cursor is merely a handle on results from an executed query and can only move forward. Pending PHPLIB-81, we should have a mechanism in place to compose a Cursor object and allow its results to be cached so that users can rewind, count, etc. (and accurately on the very same data!).

@hackel
Copy link

hackel commented Dec 28, 2016

Does this mean the only way to iterate over a cursor, short of using toArray, is to use a foreach loop? I've got some code that was using a while ($result->count() > 0) loop, which I cannot find a way to replicate. I thought iterator_count was the answer, but it throws
MongoDB\Driver\Exception\LogicException with message 'Cursors cannot yield multiple iterators'.

@lindelius
Copy link

lindelius commented Dec 28, 2016

@hackel There are two ways to retrieve the count. Either run the same filter through a count query

$count = $collection->count($filter);

or increment a counter while iterating over the cursor

$cursor = $collection->find($filter);
$count  = 0;

foreach ($cursor as $document) {
    $count++;
}

I can't recall exactly how the old count() method worked, but did that while loop really work? I'm not so sure that it actually returned the number of documents remaining, which is how it must have worked for that loop to work as it should. But to answer your question, there are more ways of iterating over a cursor than to use a foreach loop, but it's by far the simplest. Can I ask why you would want to do it in another way?

@jmikola
Copy link
Member

jmikola commented Dec 28, 2016

I can't recall exactly how the old count() method worked, but did that while loop really work? I'm not so sure that it actually returned the number of documents remaining, which is how it must have worked for that loop to work as it should.

In the legacy driver, MongoCursor::count() executed a count command with each invocation. It had no direct relation to the cursor's result set. I also don't understand how such a while loop would have worked in practice, but I'd have to reserve judgement until seeing a full paste of code for it.

I cannot find a way to replicate. I thought iterator_count was the answer, but it throws MongoDB\Driver\Exception\LogicException with message 'Cursors cannot yield multiple iterators'

iterator_count() traverses an iterator and and counts its elements, which means that you cannot use it with a MongoDB\Driver\Cursor since those can only be iterated once and never rewound. The "iterator" referred to in the exception message is an internal PHP iterator (not to be confused with PHP's Iterator class), which Traversable objects expose for iterating via language constructs and internal functions.

We do have PHPLIB-81 open to track the idea of creating an Iterator class that could decorate a MongoDB\Driver\Cursor and cache its results in memory. That would allow rewinding, counting (of the actual elements to be iterated), and movement in either direction across the result set.

@hackel
Copy link

hackel commented Dec 28, 2016

@lindelius It's from Laravel's chunk method used to repeat a query over a large result set by paging through it. I was trying to replicate this outside of the jenssegers/laravel-mongodb library because it uses a ton of memory. It actually calls iterator_to_array on the cursor first, which is why the loop works, but is not particularly efficient.

I ended up with this hacky solution, based on a stackoverflow post:

$cursor = $collection->find(...);
$it = new \IteratorIterator($cursor);
$it->rewind();

while ($it->valid()) {
  if (call_user_func($callback, $it) === false) {
    return false;
  }

  $page++;
  $cursor = $collection->find(next $page);
  $it = new \IteratorIterator($cursor);
  $it->rewind();
}

I guess this is very similar to PHPLIB-81, but using the built-in IteratorIterator class to decorate the cursor. I have no idea how the valid method works, but so far it's getting the job done. It seems like CachingIterator::hasNext would be more appropriate.

The only other way I can think of to do this would be to count the iterations in a foreach loop as you described, then exit the outer loop if the resulting count is less than the page size after finishing the inner loop. Either that or running one extra query at the start to get the document count.

I thought it might be possible to accomplish this with Mongo's batchSize option, but as far as I can tell, this wouldn't actually work as it is internal to the driver, is that correct?

@lindelius
Copy link

lindelius commented Dec 28, 2016

Well, the first problem is that Laravel's Eloquent ORM is designed for SQL databases and the jenssegers/laravel-mongodb library is just a MongoDB wrapper for it. I refuse to use it for my projects (and I use Laravel and Lumen quite a lot) as it doesn't reflect how developers are supposed to work with MongoDB at all. I've created my own library (which wraps the official MongoDB PHP library) to give me the added functionality that I need for my projects. I'm actively using and updating it as I find more things that I need to add or change, so feel free to check it out and use it if it satisfies your needs (sorry for the plug).

When it comes to the chunking, I'm not sure what exactly you are using it for, but the results from the database are already chunked up into batches, which means that you aren't loading the entire result set in from start. Also, if you are intending to chunk the results by requesting, say 20 documents at a time, using new calls to find() you can do that with a limit and skip instead of what you are doing now (not entirely sure what you are doing now, tbh). If you are expecting to work with big result sets you should also consider writing a custom chunk method for that collection and use something that can be found via an index, e.g. using ObjectID ranges for chunking the result set, since the skip + limit method isn't ideal for huge collections.

@jmikola
Copy link
Member

jmikola commented Dec 29, 2016

I guess this is very similar to PHPLIB-81, but using the built-in IteratorIterator class to decorate the cursor. I have no idea how the valid method works, but so far it's getting the job done. It seems like CachingIterator::hasNext would be more appropriate.

We already use IteratorIterator in the library to decorate a Cursor object (see: CollectionInfoCommandIterator). IteratorIterator is useful for decorating any Traversable and exposing methods to the underlying handlers (e.g. get current element/key, rewind, move forward). In this case, IteratorIterator::valid() will directly call php_phongo_cursor_iterator_valid() and return whether there is a current element. For a new Cursor, this requires us to attempt populating the first element, which is exactly what you achieve by calling IteratorIterator::rewind() (i.e. php_phongo_cursor_iterator_rewind()).

The downside of your approach is that you're executing one query for each page in sequence, which means more roundtrips and cursors/queries on the server side.

I thought it might be possible to accomplish this with Mongo's batchSize option, but as far as I can tell, this wouldn't actually work as it is internal to the driver, is that correct?

batchSize is used to control how many documents are provided to the driver at a time over the wire protocol (see: find command docs). For instance, a small batch size would entail more roundtrips to the database to iterate across a result set. This is largely transparent to the application.

If you are expecting to work with big result sets you should also consider writing a custom chunk method for that collection and use something that can be found via an index, e.g. using ObjectID ranges for chunking the result set, since the skip + limit method isn't ideal for huge collections.

Pagination via index ranges is absolutely more performant than using skip to advance to later "pages" in the result set. You should certainly look into this approach.

@hackel
Copy link

hackel commented Jan 4, 2017

Thanks for the feedback, @lindelius and @jmikola. Sorry for hijacking this issue into a support request, that was not my intention!

@lindelius I completely agree, I had to learn the hard way about laravel-mongodb, and now it's too ingrained in this application to remove. Thanks for the plug, though! Your library looks intriguing. I'll definitely keep an eye on that for future projects.

What I'm doing now is using limit/skip, actually. The collection is around 120k now, so not exactly "huge," but may grow a bit. To query by index ranges, would I have to add an extra indexed, auto-incrementing id field with predictable page boundaries, or can I just use _id with limit? Sort by _id and query each page with a limit and {_id: {$gt: $last_id}}`? I'm currently not experiencing any performance issues from MongoDB with skip/limit—it's all on the PHP side, but I'd definitely like to know the best way to do it.

@jmikola
Copy link
Member

jmikola commented Jan 5, 2017

To query by index ranges, would I have to add an extra indexed, auto-incrementing id field with predictable page boundaries, or can I just use _id with limit?

Since the first four bytes of the 12-byte ObjectID is a timestamp, values are generated in increasing order and can be used for sorting. There should be no need for an extra field or index. See this Stack Overflow thread for more info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants