Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix cursor pagination over ordered queries #412

Merged
merged 9 commits into from
Jun 23, 2023

Conversation

adzialocha
Copy link
Member

@adzialocha adzialocha commented Jun 20, 2023

We paginate with the help of a unique cursor, there is one per document field.

Document {
  username <- Cursor #31c
  timestamp <- Cursor #21a
  message <- Cursor #d32
}

Document {
  username <- Cursor #a51
  ... and so on
}

When the client choses to order the resulting documents by some field (for example "timestamp") it is important that the comparison value for pagination comes from this field.

Query: Order "Chat" documents by "timestamp"

Document {
  username
  timestamp <- Compare
  message
}

Document {
  username
  timestamp <- Compare
  message
}

Document {
  username
  timestamp <- Compare
  message
}

At the same time we need to always return the last cursor for each document to the client, since this is the point where we can safely paginate to the next document.

Document {
  username
  timestamp
  message <- Cursor
}

Document {
  username
  timestamp
  message <- Cursor
}

Document {
  username
  timestamp
  message <- Cursor
}

The bug resulted from us using the cursor to determine the field, which was only accidentally sometimes the same as chosen by the ordering.

Also, this fix uncovered a second bug which was that not always we returned the last cursor per document since the SQL sometimes orders them slightly different, indeterministicly.

This PR fixes the problem by:

  1. Making sure that always the last field is selected as a cursor in convert_rows (before it's been a bit random, as the SQL query returns fields in different order sometimes)
  2. Extending the pagination query to select the correct operation field based on the cursor and ordering setting

Closes #381

📋 Checklist

  • Add tests that cover your changes
  • Add this PR to the Unreleased section in CHANGELOG.md
  • Link this PR to any issues it closes
  • New files contain a SPDX license header

@adzialocha
Copy link
Member Author

adzialocha commented Jun 20, 2023

I'm not really happy with this large query in the pagination logic. This abomination is only for the pagination logic (when sorting the result):

AND EXISTS (
  SELECT 
    operation_fields_v1.value 
  FROM 
    operation_fields_v1 
  WHERE 
    operation_fields_v1.name = 'message' 
    AND operation_fields_v1.operation_id = document_view_fields.operation_id 
    AND (
      LOWER(operation_fields_v1.value) > (
        SELECT 
          LOWER(operation_fields_v1.value) 
        FROM 
          operation_fields_v1 
          LEFT JOIN document_view_fields ON document_view_fields.operation_id = operation_fields_v1.operation_id 
        WHERE 
          document_view_fields.document_view_id = (
            SELECT 
              document_view_fields.document_view_id 
            FROM 
              operation_fields_v1 
              LEFT JOIN document_view_fields ON document_view_fields.operation_id = operation_fields_v1.operation_id 
            WHERE 
              operation_fields_v1.cursor = 'fef4b411cd17ce1018d9720dc8d6135c5fd2f65e9e2f74fed5469b28051eab45' 
            LIMIT 
              1
          ) AND operation_fields_v1.name = 'timestamp' 
        LIMIT 
          1
      ) OR (
        LOWER(operation_fields_v1.value) = (
          SELECT 
            LOWER(operation_fields_v1.value) 
          FROM 
            operation_fields_v1 
            LEFT JOIN document_view_fields ON document_view_fields.operation_id = operation_fields_v1.operation_id 
          WHERE 
            document_view_fields.document_view_id = (
              SELECT 
                document_view_fields.document_view_id 
              FROM 
                operation_fields_v1 
                LEFT JOIN document_view_fields ON document_view_fields.operation_id = operation_fields_v1.operation_id 
              WHERE 
                operation_fields_v1.cursor = 'fef4b411cd17ce1018d9720dc8d6135c5fd2f65e9e2f74fed5469b28051eab45' 
              LIMIT 
                1
            ) AND operation_fields_v1.name = 'timestamp' 
          LIMIT 
            1
        ) AND operation_fields_v1.cursor > 'fef4b411cd17ce1018d9720dc8d6135c5fd2f65e9e2f74fed5469b28051eab45'
      )
    )
)

🤯

These sub-queries return always the same result. This is thanks to the cursor logic giving us a very precise way to point at data, independent of the context.

We can simplify this immensively by doing one "pre" SQL query before the "main" SQL query to retrieve the value of the cursors. This is not only better for our brain cells but also for performance, otherwise we're running these expensive sub-queries repetitively for nothing.

The bug is fixed, I'll add this "pre" query and then it should be ready for merging.

@codecov
Copy link

codecov bot commented Jun 20, 2023

Codecov Report

Patch coverage: 93.93% and project coverage change: +0.06 🎉

Comparison is base (9acfea6) 92.64% compared to head (3715eb2) 92.70%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #412      +/-   ##
==========================================
+ Coverage   92.64%   92.70%   +0.06%     
==========================================
  Files          70       70              
  Lines        6660     6786     +126     
==========================================
+ Hits         6170     6291     +121     
- Misses        490      495       +5     
Impacted Files Coverage Δ
aquadoggo/src/db/stores/query.rs 96.43% <93.86%> (-0.28%) ⬇️
aquadoggo/src/graphql/resolvers.rs 90.90% <100.00%> (+0.13%) ⬆️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@adzialocha adzialocha marked this pull request as ready for review June 21, 2023 10:02
@adzialocha adzialocha requested a review from sandreae June 21, 2023 10:02
@adzialocha
Copy link
Member Author

adzialocha commented Jun 21, 2023

I've added more docstrings to explain this whole cursor logic better and moved all sub-SELECT's related to pagination into separate SQL "pre" queries. Now the "main" query for pagination over ordered application fields looks like that:

AND EXISTS (
  SELECT
    operation_fields_v1.value
  FROM
    operation_fields_v1
  WHERE
    operation_fields_v1.name = 'message'
    AND operation_fields_v1.operation_id = document_view_fields.operation_id
    AND (
      LOWER(operation_fields_v1.value) > LOWER($1)
      OR (
        LOWER(operation_fields_v1.value) = LOWER($1)
        AND operation_fields_v1_list.cursor > 'c7679c82e8db45bf294c5a239484b108d98a2d3943f172582a7389e09b2327c4'
    )
)

@adzialocha
Copy link
Member Author

adzialocha commented Jun 21, 2023

The PR is ready for review and merging now! I think I'll make a separate PR for moving the query.rs file into separate modules, it became VERY large (ticket here: #416)

Copy link
Member

@sandreae sandreae left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real tricky area, good work unwrapping it and thanks for presenting the solution in as easy to understand as possible a way 👍

Comment on lines +490 to +517
/// Generate SQL for cursor-based pagination.
///
/// Read more about cursor-based pagination here:
/// https://brunoscheufler.com/blog/2022-01-01-paginating-large-ordered-datasets-with-cursor-based-pagination
///
/// ## Cursors
///
/// Our implementation follows the principle mentioned in that article, with a couple of
/// specialities due to our SQL table layout:
///
/// * We don't have auto incrementing `id` but `cursor` fields
/// * We can have duplicate, multiple document id and view id values since one row represents only
/// one document field. A document can consist of many fields. So pagination over document id's or
/// view id's is non-trivial and needs extra aid from our `cursor`
/// * Cursors _always_ need to point at the _last_ field of each document. This is assured by the
/// `convert_rows` method which returns that cursor to the client via the GraphQL response
///
/// ## Ordering
///
/// Pagination is strictly connected to the chosen ordering by the client of the results. We need
/// to take the ordering into account to understand which "next page" to show.
///
/// ## Pre-Queries
///
/// This method is async as it does some smaller "pre" SQL queries before the "main" query. This is
/// an optimization over the fact that cursors sometimes point at values which stay the same for
/// each SQL sub-SELECT, so we just do this query once and pass the values over into the "main"
/// query.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for sure not an easy topic to wrap ones head around, but this description really helps, at least to me this makes sense and I can understand the issue we faced and the solution.

Comment on lines +1762 to +1768
"Hello, Panda!".into(),
"Oh, howdy, Pengi!".into(),
"How are you?".into(),
"I miss Pengolina. How about you?".into(),
"I am cute and very hungry".into(),
"(°◇°) !!".into(),
],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@sandreae sandreae merged commit fcbd648 into main Jun 23, 2023
9 checks passed
@adzialocha adzialocha deleted the fix-ignored-cursor-pagination branch July 7, 2023 12:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cursor ignored during pagination
2 participants