[8.x] Add cursor pagination (aka keyset pagination) #37216

Merged
merged 15 commits into from
May 6, 2021


paras-malhotra
Contributor

@paras-malhotra paras-malhotra commented May 2, 2021

Background

This PR implements cursor pagination in Laravel. The cursor is a base64-encoded string that contains the comparison parameter values (see below).

Laravel Current Pagination (Offset Pagination)

Laravel's current implementation of pagination is offset based. It generates queries like this (for the 2nd page):

select * from users order by id asc limit 10 offset 10;

or, for a table ordered by multiple columns:

select * from users order by id, name limit 10 offset 10;

Cursor Pagination (aka Keyset Pagination)

Cursor pagination, on the other hand, uses comparison operations instead of an offset. It generates queries like this (for the 2nd page):

select * from users where id > 10 order by id asc limit 10;

or, for a table ordered by multiple columns:

select * from users where (id, name) > (10, 'Paras') order by id, name limit 10;

Usage

Usage is exactly the same as simplePaginate:

use App\Models\User;
use Illuminate\Support\Facades\DB;

$users = User::orderBy('id')->cursorPaginate(10);
$users = DB::table('users')->orderBy('id')->cursorPaginate(10);

Advantages Of Cursor Pagination

  1. Solves the "duplicate" issue in offset pagination: offset pagination skips records or returns duplicates if records are deleted or added between requests. This is especially noticeable with infinite scrolling and APIs.
  2. Handles big data sets efficiently: if the order by columns are indexed, cursor pagination is more efficient than offset pagination, because an offset query scans through all the preceding rows whereas a comparison query can seek straight to the cursor position. Up to 400x better performance.
  3. Easy pagination across shards.
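
To make the first advantage concrete, here is a minimal, framework-free PHP sketch (PHP 7.4+). It is only an illustration: the helper names offsetPage and keysetPage are made up for this example, and a plain array of ids stands in for the users table.

```php
<?php

// Hypothetical helpers: a sorted in-memory array stands in for the users table.
function offsetPage(array $ids, int $offset, int $limit): array
{
    sort($ids); // order by id asc

    return array_slice($ids, $offset, $limit); // limit ? offset ?
}

function keysetPage(array $ids, int $afterId, int $limit): array
{
    sort($ids); // order by id asc
    $remaining = array_values(array_filter($ids, fn ($id) => $id > $afterId)); // where id > ?

    return array_slice($remaining, 0, $limit);
}

$ids = range(1, 30);
$page1 = offsetPage($ids, 0, 10); // ids 1..10

// Row 5 is deleted between the first and the second request.
$ids = array_values(array_diff($ids, [5]));

// Offset pagination now starts page 2 at id 12, so id 11 is never shown.
$offsetPage2 = offsetPage($ids, 10, 10);

// Keyset pagination continues after the last id the client saw (10).
$keysetPage2 = keysetPage($ids, end($page1), 10);
```

With the deletion in place, the offset-based second page silently skips id 11, while the keyset-based second page still begins at id 11.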

Limitations Of Cursor Pagination

  1. Requires that the ordering is based on at least one unique, sequential column (or on a combination of sequential columns that is unique).
  2. Only enhances performance if the order by columns are indexed.

References

Implementations In Other Frameworks

Note

I've implemented this as separate classes (for the interface, abstract, etc.) rather than extending Paginator because several methods in the contract/abstract classes did not make sense for cursor pagination (e.g. currentPage() returning an int, or url expecting an int page).

UPDATE: I've written a blog post on this if anyone's interested to read the pros and cons of offset and cursor pagination.

@GrahamCampbell GrahamCampbell changed the title Add cursor pagination (aka keyset pagination) [8.x] Add cursor pagination (aka keyset pagination) May 2, 2021
@mfn
Contributor

mfn commented May 2, 2021

I'm very happy to see this! I do remember I added whereRowValues a long time ago in the hopes some day we would see this 🎉


Can you elaborate what's the deal with the post-processing of the base64 encoded values? I'm not sure it's obvious, I guess a doc comment would be nice.

@paras-malhotra
Contributor Author

paras-malhotra commented May 2, 2021

I added whereRowValues a long time ago in the hopes some day we would see this

Awesome! I was glad to see it already existed when I started working on this.

Can you elaborate what's the deal with the post-processing of the base64 encoded values?

If you mean why str_replace is used after base64, that's just meant to safe url encode the values. If you mean, why base64 encoding is used, there are multiple reasons (safe encoding of parameters, ability to encode additional info without breaking APIs).

@mfn
Contributor

mfn commented May 3, 2021

safe url encode the values

Makes sense!

If you mean, why base64 encoding is used

Thanks, that part was clear to understand.

But now that you mention it 😅: I'm always a sucker for validating the data. Using encryption might be overkill and cause problems, because it generates different outputs for the same inputs, but I'm always a fan of making sure the payload is safe enough before I decode it. Could this be susceptible to some attack vectors, I wonder?

At least using hmac for integrity checks might be something to consider, but I checked https://github.com/basecamp/geared_pagination/blob/master/lib/geared_pagination/cursor.rb and there seems no additional protection either. I guess "it's fine 🔥" then.
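
For reference, the kind of HMAC integrity check mentioned here could look roughly like the sketch below. This is not part of the PR; signCursor and verifyCursor are hypothetical names, and the secret would have to come from the application's own configuration. It relies on the cursor being URL-safe base64, which never contains a dot, so a dot can serve as the separator.

```php
<?php

// Hypothetical helpers, not part of this PR: append an HMAC to an encoded cursor.
function signCursor(string $encodedCursor, string $secret): string
{
    return $encodedCursor . '.' . hash_hmac('sha256', $encodedCursor, $secret);
}

// Returns the encoded cursor if the signature matches, or null otherwise.
function verifyCursor(string $signedCursor, string $secret): ?string
{
    $parts = explode('.', $signedCursor, 2);

    if (count($parts) !== 2) {
        return null;
    }

    [$encodedCursor, $mac] = $parts;
    $expected = hash_hmac('sha256', $encodedCursor, $secret);

    // hash_equals compares in constant time to avoid timing leaks.
    return hash_equals($expected, $mac) ? $encodedCursor : null;
}

$secret = 'example-secret'; // assumption: a real app would load this from config
$signed = signCursor('eyJpZCI6MjAsIl9pc05leHQiOnRydWV9', $secret);
```

Unlike encryption, the same input always produces the same signed cursor, so this would not cause the "different outputs for the same inputs" problem.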

@paras-malhotra
Contributor Author

paras-malhotra commented May 3, 2021

@mfn yep it's secure. I checked other implementations as well. We're just base64 decoding and then json decoding the cursor string (no unserialization, etc.), so it's safe. If it's unable to decode, a null cursor is returned.

@GrahamCampbell
Member

@paras-malhotra I support this idea, and if it is not wanted in the framework core, I'd be happy to help you turn this into a package for the community to use. :)

@paras-malhotra
Contributor Author

That would be awesome @GrahamCampbell! I'd love to have this in the core, but if that's not possible, would very much appreciate your help to turn this into a package :)

@driesvints
Member

Ping @spawnia: this will be useful for Lighthouse as well I assume. Is there anything you can think of that can be taken in on this PR?

@gdebrauwer
Contributor

Would this implementation also make it possible to generate/determine the cursor of a specific item? For example, if a chat app allows you to search in the messages of a chat and you want to navigate the user to a selected message somewhere in the history of the chat. And from the location of the message, the user can infinitely scroll up and down in the chat (similar to the search-function in the Slack app)

@spawnia
Contributor

spawnia commented May 3, 2021

@paras-malhotra I am confused by the description, should the sections about advantages and limitations be about keyset pagination?

@driesvints thanks for the ping. We have had this issue in Lighthouse for quite some time, see nuwave/lighthouse#311. Nice to see that this might make it into core, definitely interesting.

The API seems suitable for our purposes. I would have to do a proof-of-concept to be sure, but CursorPaginator::nextCursor(), Cursor::encode() and Cursor::decode() seem like the essential bits that will actually simplify our half-baked cursor implementation.

@paras-malhotra
Contributor Author

I am confused by the description, should the sections about advantages and limitations be about keyset pagination?

Uhh, ok I'm stupid lol. Sorry about that 🤦‍♂️ Corrected now.

@paras-malhotra
Contributor Author

paras-malhotra commented May 3, 2021

Would this implementation also make it possible to generate/determine the cursor of a specific item?

@gdebrauwer, yes. This PR takes care of that. You can construct a cursor for any item and direction like so:

use Illuminate\Pagination\Cursor;

// For single column orders, e.g. order by id.
$cursor = new Cursor(['id' => 2], true); // generate cursor for id > 2
$cursor = new Cursor(['id' => 3], false); // generate cursor for id < 3

// For multiple column orders, e.g. order by id, name
$cursor = new Cursor(['id' => 2, 'name' => 'Paras'], true); // generate cursor for (id, name) > (2, 'Paras')
$cursor = new Cursor(['id' => 3, 'name' => 'Paras'], false); // generate cursor for (id, name) < (3, 'Paras')

@paras-malhotra
Contributor Author

Thanks @spawnia, I've incorporated your suggested changes.

@taylorotwell
Member

taylorotwell commented May 3, 2021

@paras-malhotra Please provide a general overview of how the feature works internally so I have something to go by when reviewing it.

@GrahamCampbell
Member

@paras-malhotra Please provide a general overview of how the feature works internally so I have something to go by when reviewing it.

I think some example URLs for the first page and second page would be very helpful. :)

@paras-malhotra
Contributor Author

paras-malhotra commented May 3, 2021

@taylorotwell and @GrahamCampbell, sure thing. Here's how it works internally:

Step 1: Resolve the cursor

$cursor = $cursor ?: CursorPaginator::resolveCurrentCursor($cursorName);

The cursor is akin to the "page number". A cursor object contains the parameter values along with the direction as mentioned in this comment.

So, first the cursor is resolved from the request. An example of a URL would be http://laravel.test/example?cursor=eyJpZCI6MjAsIl9pc05leHQiOnRydWV9. The cursor is first JSON encoded, then URL-safe base64 encoded.

So, to decode we'd first base64 decode to get {"id":20,"_isNext":true} and then JSON decode. Here, id is a parameter (multiple params are also supported) and _isNext is the direction (true represents forwards, false backwards). For the first page, the URL has no cursor parameter (e.g. http://laravel.test/example), so a null cursor is resolved.

All the encoding and decoding logic is encapsulated in the Illuminate\Pagination\Cursor class.
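
As a rough, standalone illustration of that round trip, here is a sketch in plain PHP. The function names encodeCursor and decodeCursor are hypothetical stand-ins rather than the methods of Illuminate\Pagination\Cursor; only the steps (JSON, base64, character replacement, null on failure) follow the description above.

```php
<?php

// Hypothetical helper: JSON encode, base64 encode, then make the result URL safe.
function encodeCursor(array $parameters): string
{
    $json = json_encode($parameters);

    return str_replace(['+', '/', '='], ['-', '_', ''], base64_encode($json));
}

// Hypothetical helper: reverse the steps; anything undecodable yields null.
function decodeCursor(?string $encoded): ?array
{
    if ($encoded === null || $encoded === '') {
        return null; // first page: no cursor in the URL
    }

    $json = base64_decode(str_replace(['-', '_'], ['+', '/'], $encoded));
    $parameters = json_decode((string) $json, true);

    return is_array($parameters) ? $parameters : null;
}

$encoded = encodeCursor(['id' => 20, '_isNext' => true]);
$decoded = decodeCursor($encoded);
```

Here $encoded is the same string as in the example URL above, and $decoded is the original array again.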

Step 2: Ensure order by is set properly

$orderDirections = collect($this->query->orders)->pluck('direction')->unique();
if ($orderDirections->count() > 1) {
    throw new CursorPaginationException('Only a single order by direction is supported in cursor pagination.');
}
if ($orderDirections->count() === 0) {
    $this->enforceOrderBy();
}
if ($shouldReverse) {
    $this->query->orders = collect($this->query->orders)->map(function ($order) {
        $order['direction'] = ($order['direction'] === 'asc' ? 'desc' : 'asc');
        return $order;
    })->toArray();
}

Here, we first ensure:

  1. At least one order by condition is specified.
  2. All order by clauses are in the same direction. This is because the tuple SQL comparison e.g. (id, name) > (2, 'Paras') only supports one direction.

If the cursor points backwards (_isNext is false), then we also reverse the order of all order by clauses.

So, a forward query (from page 2 to page 3, 10 items per page) would look like:

select * from users where id > 20 order by id asc limit 11;

And a backwards query (from page 3 to page 2, 10 items per page) would look like (note direction is reversed from asc to desc):

select * from users where id < 21 order by id desc limit 11;

Also, note limit is (items per page + 1) to determine whether there is a "next" or "previous" item. This is similar to simplePaginate in a way.
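
The "one extra row" trick can be sketched without a database. In this minimal example, fetchForward is a hypothetical helper and an in-memory array of ids stands in for the query; the extra row is only used to set hasMore and is never returned.

```php
<?php

// Hypothetical helper: emulates "where id > ? order by id asc limit (perPage + 1)".
function fetchForward(array $ids, ?int $afterId, int $perPage): array
{
    sort($ids);
    $matching = array_values(array_filter(
        $ids,
        fn ($id) => $afterId === null || $id > $afterId
    ));

    // Ask for one row more than the page size.
    $rows = array_slice($matching, 0, $perPage + 1);

    return [
        'items' => array_slice($rows, 0, $perPage), // drop the extra row
        'hasMore' => count($rows) > $perPage,       // the extra row only signals "more"
    ];
}

$ids = range(1, 25);
$page2 = fetchForward($ids, 10, 10); // ids 11..20, and more rows exist
$page3 = fetchForward($ids, 20, 10); // ids 21..25, nothing after that
```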

Step 3: Create the CursorPaginator instance

After applying the correct order and where clauses, we fire the query and create the cursor paginator. The order by columns are passed as parameters in the options.

return $this->cursorPaginator($this->get($columns), $perPage, $cursor, [
    'path' => Paginator::resolveCurrentPath(),
    'cursorName' => $cursorName,
    'parameters' => $parameters,
]);

Once the cursor paginator is created, we save the collection and reverse the order of the collection if the cursor is pointing backwards as here:

if (! is_null($this->cursor) && $this->cursor->isPrev()) {
    $this->items = $this->items->reverse()->values();
}

This reversal is to preserve the order in next and previous. For example, the items returned from the backwards query (page 3 to page 2) in step 2 would be 20, 19, and so on whereas the items returned from a forwards query (page 1 to page 2) would be 11, 12 and so on. By reversing the order for the backwards query, we guarantee that "page 2" always retains the same order (whether accessed from a previous or next cursor).
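
A small sketch of why the reversal matters, again with an in-memory array instead of a query. The helper names pageAfter and pageBefore are made up for this illustration.

```php
<?php

// Hypothetical helper: forward query, "where id > ? order by id asc limit ?".
function pageAfter(array $ids, int $afterId, int $perPage): array
{
    sort($ids);
    $matching = array_values(array_filter($ids, fn ($id) => $id > $afterId));

    return array_slice($matching, 0, $perPage);
}

// Hypothetical helper: backward query, "where id < ? order by id desc limit ?",
// followed by a reversal so the page reads in ascending order again.
function pageBefore(array $ids, int $beforeId, int $perPage): array
{
    rsort($ids);
    $matching = array_values(array_filter($ids, fn ($id) => $id < $beforeId));
    $rows = array_slice($matching, 0, $perPage); // 20, 19, ..., 11

    return array_reverse($rows);                 // 11, 12, ..., 20
}

$ids = range(1, 30);
$fromPage1 = pageAfter($ids, 10, 10);  // page 2 reached via "next"
$fromPage3 = pageBefore($ids, 21, 10); // page 2 reached via "previous"
```

Both calls produce the same page 2 in the same order, which is the guarantee described above.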

Step 4: Compute the URLs and cursors for next and previous pages

public function nextCursor()
{
    if ((is_null($this->cursor) && ! $this->hasMore) ||
        (! is_null($this->cursor) && $this->cursor->isNext() && ! $this->hasMore)) {
        return null;
    }
    return $this->getCursorForItem($this->items->last(), true);
}

Finally, we compute the next and previous cursor (and corresponding URLs). The next cursor would contain the parameters of the last paginated item in a forward direction, and the previous cursor would contain the parameters of the first paginated item in a backwards direction.

So, for example, say we're on page 2 (10 items per page). The next cursor should be {"id":20,"_isNext":true} (id of last paginated item) and the previous cursor should be {"id":11,"_isNext":false} (id of first paginated item) so that id > 20 and id < 11 queries are fired for next and previous respectively.
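
The page 2 example can be reproduced with a short sketch. cursorPayloadsForPage is a hypothetical helper that only builds the two payloads from a non-empty page; it deliberately leaves out the hasMore checks shown in nextCursor().

```php
<?php

// Hypothetical helper: builds the next/previous cursor payloads for a non-empty page.
function cursorPayloadsForPage(array $items, array $columns): array
{
    $first = $items[0];
    $last = $items[count($items) - 1];

    // Keep only the order by columns of a row.
    $pick = function (array $row) use ($columns): array {
        $values = [];
        foreach ($columns as $column) {
            $values[$column] = $row[$column];
        }

        return $values;
    };

    return [
        'next' => $pick($last) + ['_isNext' => true],       // last item, forwards
        'previous' => $pick($first) + ['_isNext' => false], // first item, backwards
    ];
}

// Page 2 with 10 items per page: ids 11..20.
$page2 = array_map(fn ($id) => ['id' => $id, 'name' => "user{$id}"], range(11, 20));
$cursors = cursorPayloadsForPage($page2, ['id']);

$next = json_encode($cursors['next']);
$previous = json_encode($cursors['previous']);
```

$next and $previous are the two JSON payloads quoted above, before base64 encoding.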

That's it in a nutshell. Hope the explanation was helpful! Let me know if you have any further questions.

@paras-malhotra paras-malhotra deleted the cursor_pagination branch May 6, 2021 15:01
@taylorotwell
Member

@paras-malhotra if you could send over your docs draft that would be good.

@ahmedatef00

ahmedatef00 commented May 7, 2021

@paras-malhotra
Contributor Author

paras-malhotra commented May 7, 2021

@ahmedatef00, if you're asking for differences in implementation, I haven't used that package. But at first glance, it doesn't seem to support multiple order by clauses, rendering of links or URL encode values (which can be an issue if sorting is done by strings). I could be wrong here since I haven't really taken it for a spin. I'd advise you to use your own judgement.

@ahmedatef00

@ahmedatef00, I haven't used that package but at first glance, it doesn't seem to support multiple order by clauses or base 64 encode values (which can be an issue if sorting is done by strings). However, it won't be fair for me to comment on it since I haven't taken it for a spin. I'd advise you to use your own judgement.

I just pasted it for the sake of knowledge ... and yes, maybe you are right. I used it before with Lumen and it works fine with a single order by clause, but I can't say it is good for all corner cases ... I think yours will be better and I can't wait to use it.

chu121su12 pushed a commit to chu121su12/framework that referenced this pull request May 7, 2021
* Add cursor pagination without tests

* Fix styleci

* Add cursor paginator tests

* Add support for query builder

* Fix tests

* Complete all tests for database and Eloquent builders

* Incorporate suggestions

* Fix styleci

* Fix docblocks

* move method

* Fix docblock

* Formatting

* Various formatting - method renaming.

* Add more tests

Co-authored-by: Taylor Otwell <taylorotwell@gmail.com>
@ejunker
Contributor

ejunker commented May 7, 2021

This may be a dumb question, but does this work with API Resources? For example, will it show the cursor data in the meta object in the JSON response?

@paras-malhotra
Contributor Author

@ejunker, I forgot to do that. 😅 Thanks for bringing it up. I've submitted PR #37315 to add cursor pagination support to API resources.

@fatalmind

@paras-malhotra

Here's an image that describes how the duplication happens:

No big deal, but next time please give credit. https://use-the-index-luke.com/no-offset

@paras-malhotra
Contributor Author

@fatalmind, my bad. Just saw the attribution license on your website. I've updated my post to include the source/link.

@mpyw
Contributor

mpyw commented Jun 18, 2021

According to SQL Feature Comparison, SQL Server does not support the tuple comparison syntax. So

(a, b, c) > (1, 2, 3)

should be rewritten to

a=1 and b=2 and c>3
or
a=1 and b>2
or
a>1
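
For any number of columns, this expansion can be generated mechanically. The following is only a sketch: expandTupleGreaterThan is a hypothetical helper, not something Laravel or lampager provides, and it assumes the column names are trusted (they are not escaped).

```php
<?php

/**
 * Hypothetical helper: expands "(c1, c2, ...) > (?, ?, ...)" into OR-ed AND groups.
 *
 * @param array $cursor column => value, in order by order
 * @return array [SQL fragment with placeholders, bindings]
 */
function expandTupleGreaterThan(array $cursor): array
{
    $columns = array_keys($cursor);
    $branches = [];
    $bindings = [];

    // Longest branch first: all leading columns equal, last column greater.
    for ($i = count($columns) - 1; $i >= 0; $i--) {
        $parts = [];

        for ($j = 0; $j < $i; $j++) {
            $parts[] = "{$columns[$j]} = ?";
            $bindings[] = $cursor[$columns[$j]];
        }

        $parts[] = "{$columns[$i]} > ?";
        $bindings[] = $cursor[$columns[$i]];

        $branches[] = '(' . implode(' AND ', $parts) . ')';
    }

    return [implode(' OR ', $branches), $bindings];
}

[$sql, $bindings] = expandTupleGreaterThan(['a' => 1, 'b' => 2, 'c' => 3]);
// $sql: (a = ? AND b = ? AND c > ?) OR (a = ? AND b > ?) OR (a > ?)
```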

If you use SQL Server, lampager/lampager-laravel: Rapid pagination for Laravel may still help with implementing cursor pagination.

@tpetry
Contributor

tpetry commented Jun 18, 2021

Keyset pagination really does not work on SQL Server: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=dbf29cc91f08c2185bde4a521c3508e5

But I am not sure how efficiently SQL Server will execute this DNF condition; maybe it's best to state in the manual that it is not compatible with SQL Server.

@fatalmind

Emulation of Row Value syntax is a hassle, but still possible with most of the benefits of Keyset pagination (also performance).

The crucial point is to make the leftmost condition "indexable". E.g.

(a, b) < (?, ?)

becomes

a <= ? AND NOT (a = ? AND b >= ?)

If there is an index on a, it can be used for the a <= ? part. If you combine the top-most conditions with or, this is usually not done by SQL engines.

See this for more details: https://use-the-index-luke.com/sql/partial-results/fetch-next-page#sb-row-values
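
The rewrite is logically equivalent to the row-value comparison, which a brute-force check over small integers can confirm. This is an illustrative sketch only: rowLessThanIndexable is a made-up name, and PHP's built-in array comparison (element by element for equal-length lists) serves as the reference for (a, b) < (x, y).

```php
<?php

// Hypothetical helper: the "indexable" form of (a, b) < (x, y).
function rowLessThanIndexable(int $a, int $b, int $x, int $y): bool
{
    return $a <= $x && !($a === $x && $b >= $y);
}

$mismatches = 0;

foreach (range(0, 3) as $a) {
    foreach (range(0, 3) as $b) {
        foreach (range(0, 3) as $x) {
            foreach (range(0, 3) as $y) {
                // Reference: lexicographic comparison of the two pairs.
                $expected = [$a, $b] < [$x, $y];

                if (rowLessThanIndexable($a, $b, $x, $y) !== $expected) {
                    $mismatches++;
                }
            }
        }
    }
}
```

The check only covers the logic; whether a given engine actually uses the index for the leading a <= ? condition still has to be confirmed on that engine.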

@mpyw
Contributor

mpyw commented Jun 25, 2021

@fatalmind

this is usually not done by SQL engines.

I've actually verified that MySQL 5.6 and 5.7 report Extra: Using index condition in EXPLAIN for the SQL above. Is the information up to date?


Index does not work (expected behavior):

a<1 and b=2 and c>3
or
a<1 and b>2
or
a>1

Index works well with Extra: Using index condition:

a=1 and b=2 and c>3
or
a=1 and b>2
or
a>1

@fatalmind

Unfortunately, the MySQL EXPLAIN output can be pretty confusing.

"Using index condition" just means that it does the filtering directly with the information found in the index, which is better than first fetching the full row from the table and then doing the filtering (Extra: Using Where).

However, filtering in this context means checking whether or not a row matches after reading it. For performance it is actually desired to not even read the rows that we don't need. This is what is sometimes referred to as access (not filter).

Unfortunately, MySQL cannot access indexes with row-value predicates. But if we phrase the WHERE condition as described in my article (which is, to the very best of my knowledge, still current, even for MySQL 8.0), then the first column can be used to access the index, while the remaining ones can only be used for filtering. Of course it is better to do that filtering before accessing the table ("Using index condition") than after that ("Using where"). But using at least one column for access is still desirable.

If you want to check this out, compare your second example with one you build as described in the mentioned article. The crucial information to watch for in the EXPLAIN output is in the "ref" (and maybe "key_len") columns. They indicate what part of the index was used for access. We want as much as possible from an index on (a, b, c) to be used as access predicates. In the "ref" column this is visible by the number of items it lists (comma-separated); in the key_len column a higher value is better.

See also: https://use-the-index-luke.com/sql/explain-plan/mysql/access-filter-predicates

@mpyw
Contributor

mpyw commented Jun 25, 2021

@fatalmind I've tried this using actual production table data. How do we evaluate the results? The first query appears to have a larger key_len value.


explain
select * from posts use index(posts_community_id_commented_at_index)
where
    community_id=17 and commented_at="2015-12-12 08:41:05" and id>547
    or community_id=17 and commented_at>"2015-12-12 08:41:05"
    or community_id>17
limit 5;

Field          Value
select_type    SIMPLE
table          posts
partitions     NULL
type           range
possible_keys  posts_community_id_commented_at_index
key            posts_community_id_commented_at_index
key_len        21
ref            NULL
rows           191472
filtered       100.00
Extra          Using index condition

explain
select * from posts use index(posts_community_id_commented_at_index)
where
    community_id >=17
    and not (community_id=17 and commented_at<"2015-12-12 08:41:05")
    and not (community_id=17 and commented_at="2015-12-12 08:41:05" and id<=547)
limit 5;

Field          Value
select_type    SIMPLE
table          posts
partitions     NULL
type           range
possible_keys  posts_community_id_commented_at_index
key            posts_community_id_commented_at_index
key_len        8
ref            NULL
rows           191077
filtered       100.00
Extra          Using index condition

@fatalmind

Hi!

Your queries have one important mistake: they miss the ORDER BY clause ;)

But that doesn't change a lot in the execution plan.

Another small mistake in the second query is that commented_at<"2015-12-12 08:41:05" should actually read commented_at<="2015-12-12 08:41:05" (unless I'm confused, which happens quite often with these things).

Regarding performance:
The short story is: the second approach is still better, in particular also for other databases (I've compared it on SQL Server, which also lacks decent row-values support, results below).

The long story:
I've now emulated everything in my lab. Scripts for reference are attached.

In MySQL, also in old versions, the performance difference between the two approaches can be seen by looking at how many IO operations the database does to run the query. (I gave up making sense of the EXPLAIN output for this purpose.) After adding the ORDER BY clause, the first query needs 6 read operations while the second only needs 5. Not a really big difference, both are basically fine.

The same experiment on SQL Server gives 6 read operations for the first, but only 3 for the second query. They are still both VERY fast compared to offset, but the second is even faster.

The same experiment on Oracle, which also doesn't have decent row-value support, gives 102 IOs for the first query but only 4 for the second.

What I want to say is that there is a pattern: the one approach is always better than the other one. While it might be marginally better in some cases, it's never slower. That's why I recommend going for this approach.

For reference & your enjoyment I'm adding five files: the three test scripts for MySQL, SQL Server and Oracle, and also the output for SQL Server & Oracle, as I don't know whether you have access to these systems at the moment.

I hope this helps. And thanks for following up and coming back with reasonable questions!

keyset_demo.oracle.out.txt
keyset_demo.oracle.txt
keyset_demo.sql-server.out.txt
keyset_demo.mysql.txt
keyset_demo.sql-server.txt

@mpyw
Contributor

mpyw commented Jun 25, 2021

@fatalmind Thank you for estimating the problem for us!

@mpyw
Contributor

mpyw commented Jun 29, 2021

@fatalmind Interesting comment: #37762 (comment). In MySQL, due to the bug, OR-AND conditions sometimes perform faster than Tuple-Comparison ones.
