
Pagination implementation for real-life database #94

Closed · mattecapu opened this issue Jun 19, 2016 · 26 comments
@mattecapu commented Jun 19, 2016

In all the examples I can find, paginated queries are made against a mockup database that is just a JS array, which is simply passed through connectionFromArray to return the correct paginated result (like the Star Wars example mentioned in the README).
For a real-life database, querying all records and then passing them to connectionFromPromisedArray doesn't seem like a good solution, because it will easily break your performance or crash your server as soon as you're doing anything at even modest scale.

So what solutions should you use to avoid this kind of wasteful database fetching?

(I'm using a SQL database, but I think a good solution to this problem applies to pretty much any DBMS that isn't a JS array.)

@OlegIlyenko commented Jun 19, 2016

Recently @chandu0101 created a nice and relatively big example of a Relay application which uses MongoDB and sangria (a Scala GraphQL implementation).

If you don't mind looking at some Scala code, I would suggest checking out this function:

https://github.com/chandu0101/sri-sangria-example/blob/master/server/src/main/scala/sri/sangria/mongoserver/services/BaseService.scala#L52-L99

It creates a Relay Connection based on a MongoDB query. It also properly handles Relay-style pagination, and the result set that comes from the DB always contains only the relevant documents. This particular example uses Mongo's skip and limit to achieve this, but I can also imagine other strategies - for instance, ID-based cursors, which would be even more efficient.

Even though this example uses sangria-relay, conceptually it is very similar to the reference implementation.
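[Editor's note: a minimal JavaScript sketch of the same skip/limit strategy, using the official MongoDB Node.js driver and offset-based cursors. All names here are illustrative, not taken from the linked example.]

// Offset-based Relay pagination over a MongoDB collection.
// `collection` is a mongodb driver Collection; `first`/`after` follow the Relay spec.
// Cursors are just base64-encoded offsets, as in connectionFromArray.
async function connectionFromMongo(collection, { first = 10, after }) {
    // Decode the opaque cursor back into a 0-based offset.
    const offset = after
        ? parseInt(Buffer.from(after, 'base64').toString('ascii'), 10) + 1
        : 0

    // Fetch one extra document to know whether a next page exists.
    const docs = await collection.find({})
        .skip(offset)
        .limit(first + 1)
        .toArray()

    const edges = docs.slice(0, first).map((node, i) => ({
        node,
        cursor: Buffer.from(String(offset + i)).toString('base64'),
    }))

    return {
        edges,
        pageInfo: {
            hasNextPage: docs.length > first,
            hasPreviousPage: offset > 0,
            startCursor: edges.length ? edges[0].cursor : null,
            endCursor: edges.length ? edges[edges.length - 1].cursor : null,
        },
    }
}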

@GuillaumeLeclerc commented Jul 7, 2016

None of the solutions I've found on the internet (including this one) seem to work if data is inserted into the database during the lifetime of a cursor.

Am I wrong?

Is there a generic solution?

@GuillaumeLeclerc commented Jul 8, 2016

I thought about a solution to handle data insertion (a sketch follows this comment):

  • If there is neither a before nor an after cursor, I execute the database query but select only the IDs, not the content. I then generate a new queryId (uuidv4) and save the IDs to a Redis hash table where the key is the queryId.
  • If there is a cursor, I extract the queryId from it and fetch the list of IDs from Redis.
  • I use the query args to compute which IDs the client is interested in.
  • I make a database query for the corresponding IDs.
  • I return the result.

I have a working implementation for Mongoose I can share if anyone is interested.

My only concern is how long I should keep the queries in my Redis server.
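[Editor's note: a rough sketch of this queryId approach. An in-memory Map stands in for the Redis hash table, and db.selectIds/db.selectByIds are hypothetical query helpers; in production you would use Redis with a TTL, which is exactly the open question above.]

import { randomUUID } from 'crypto'

// Stand-in for the Redis hash table keyed by queryId.
const snapshots = new Map()

// First page: snapshot the matching IDs so later pages ignore concurrent inserts.
async function firstPage(db, first) {
    const ids = await db.selectIds()            // hypothetical: SELECT id FROM ... (IDs only)
    const queryId = randomUUID()
    snapshots.set(queryId, ids)
    return pageFromSnapshot(db, queryId, 0, first)
}

// Subsequent pages: the cursor encodes the queryId plus an offset into the snapshot.
async function nextPage(db, cursor, first) {
    const [queryId, offset] = Buffer.from(cursor, 'base64').toString('ascii').split(':')
    return pageFromSnapshot(db, queryId, parseInt(offset, 10) + 1, first)
}

async function pageFromSnapshot(db, queryId, offset, first) {
    const ids = snapshots.get(queryId)          // may be undefined once the snapshot expires
    const slice = ids.slice(offset, offset + first)
    const nodes = await db.selectByIds(slice)   // hypothetical: SELECT ... WHERE id IN (...)
    return {
        edges: nodes.map((node, i) => ({
            node,
            cursor: Buffer.from(`${queryId}:${offset + i}`).toString('base64'),
        })),
        pageInfo: {
            hasNextPage: offset + first < ids.length,
            hasPreviousPage: offset > 0,
        },
    }
}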

@sibelius commented Jul 12, 2016

@mattecapu the relay-mongodb-connection package creates Relay connections from MongoDB cursors; it also uses skip and limit.

@mattecapu commented Jul 12, 2016

Thank you @sibelius and @OlegIlyenko, your links were a good starting point for understanding what goes on in the creation of a ConnectionType response.

My solution was to not use the connectionFrom* utilities but to create the ConnectionType response myself.

Essentially, I encode information about the ordering field (i.e. the ID) in the cursor and then use it to query for records with an ID >= or <= the cursor's. The first and last parameters are handled with some clever LIMITing and order reversing.

My actual use case was a little more complicated because I'm fetching from multiple tables, so after retrieving the minimum required records I do an additional round of sorting and cutting.

I'd be happy to provide some code examples if someone needs them.

@sibelius commented Jul 12, 2016

@mattecapu I think you should provide some code examples to help people better understand how to implement a cursor.

@mattecapu commented Jul 13, 2016

Well, the point is, if you implement a ConnectionType for an endpoint, you just need to

  • accept the connectionArgs parameters
  • return a suitable response in the shape { edges: [{ cursor, node }], pageInfo: { hasNextPage, hasPreviousPage, startCursor, endCursor } }

Cursors can be any string you want. Normally Relay defaults to arrayconnection:<id> converted to base64 to make it opaque (you shouldn't actually rely on the implementation details of the cursors).

I used <type_name>:<id> converted to base64 as cursors.
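[Editor's note: in JavaScript that encoding is a one-liner each way. A sketch; toCursor/fromCursor are made-up names, and the split assumes IDs contain no ':'.]

// Opaque cursor helpers: '<type_name>:<id>' <-> base64
const toCursor = (typeName, id) =>
    Buffer.from(`${typeName}:${id}`).toString('base64')

const fromCursor = (cursor) => {
    const [typeName, id] = Buffer.from(cursor, 'base64').toString('ascii').split(':')
    return { typeName, id }
}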

Then a request to my endpoint looks like this:

endpoint(after: "sdflkjsdlfkjslkdf", first: 10) {
   # stuff
}

Basically, any request described by the specification is supported.

Thus when I get the request I process it in the following way:

  1. Start from the greedy query: SELECT * FROM table
  2. If the after argument is provided, add id > parsed_cursor to the WHERE clause
  3. If the before argument is provided, add id < parsed_cursor to the WHERE clause
  4. If the first argument is provided, add ORDER BY id DESC LIMIT first+1 to the query
  5. If the last argument is provided, add ORDER BY id ASC LIMIT last+1 to the query
  6. If the last argument is provided, I reverse the order of the results
  7. If the first argument is provided then I set hasPreviousPage: false (see spec for a description of this behavior).
  8. If at least first+1 results are returned, I set hasNextPage: true, otherwise I set it to false.
  9. If the last argument is provided then I set hasNextPage: false (see spec for a description of this behavior).
  10. If at least last+1 results are returned, I set hasPreviousPage: true, otherwise I set it to false.

Using this "algorithm", only the needed data is fetched. While after and before can both be set, I make sure the first and last args are treated as mutually exclusive to avoid making a mess. The spec itself discourages making requests with both arguments set.

Notably, I return an object with the shape described above (and in the linked spec) and I don't use the connectionFromArray helper, which expects a raw collection to slice according to the args. (A sketch of the whole algorithm follows.)
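[Editor's note: a minimal JavaScript sketch of the ten steps above, assuming exactly one of first/last is provided (they are treated as mutually exclusive) and a sequential integer id. db.query is a hypothetical helper that runs a parameterized query and returns rows; toCursor/fromCursor are the helpers sketched earlier.]

async function resolveConnection(db, { first, last, after, before }) {
    const where = []
    const params = []

    if (after) {                                       // step 2
        where.push('id > ?')
        params.push(fromCursor(after).id)
    }
    if (before) {                                      // step 3
        where.push('id < ?')
        params.push(fromCursor(before).id)
    }

    let sql = 'SELECT * FROM table'                    // step 1 (table name is a placeholder)
    if (where.length) sql += ' WHERE ' + where.join(' AND ')

    // steps 4-5: over-fetch by one row to peek at the next/previous page
    const pageSize = first != null ? first : last
    sql += first != null
        ? ` ORDER BY id DESC LIMIT ${pageSize + 1}`
        : ` ORDER BY id ASC LIMIT ${pageSize + 1}`

    let rows = await db.query(sql, params)
    const hasExtraRow = rows.length > pageSize         // feeds steps 8 and 10
    rows = rows.slice(0, pageSize)
    if (last != null) rows.reverse()                   // step 6

    const edges = rows.map(node => ({ node, cursor: toCursor('table', node.id) }))
    return {
        edges,
        pageInfo: {
            hasNextPage: first != null ? hasExtraRow : false,    // steps 8 and 9
            hasPreviousPage: last != null ? hasExtraRow : false, // steps 7 and 10
            startCursor: edges.length ? edges[0].cursor : null,
            endCursor: edges.length ? edges[edges.length - 1].cursor : null,
        },
    }
}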

@wincent commented Aug 30, 2016

That's a really great write-up, @mattecapu. I'm going to close this out now, but that is exactly the kind of thing that would work well in the documentation (it could be a code comment, something in the README, or we could start an examples folder). I'd be happy to look at a PR adding something like that.

@wincent closed this Aug 30, 2016

@GuillaumeLeclerc commented Aug 30, 2016

@mattecapu It seems the algorithm does not work if there was a deletion in the database between two paginated queries, and/or if we want to sort by anything other than IDs, right?

@oexza commented Sep 23, 2016

@mattecapu what happens when the IDs are not ordered? i.e. if they are UUIDs, id > parsed_cursor will be useless.

@mattecapu commented Sep 24, 2016

Sorry @GuillaumeLeclerc, I lost your question in a notification mix-up with GitHub.
I'm going to address it together with @oexza's new comment.

As laid out in the comment above, my algorithm has both of the limitations you noticed, but we can easily generalize it to overcome them.
My idea is to use creation timestamps as "order providers" instead of abusing the id field, which isn't guaranteed to be sequential anyway (in my specific use case, it was).

To support dynamic data getting deleted or added in between queries, paginating on a creation timestamp works like a charm, especially with Relay's caching mechanism (see the sketch after this comment).

We could generalize further by not fixing the order-provider field but letting it be dynamic, effectively allowing many different orderings of the data, which can come in handy. This is pretty simple to implement too, but it gets complicated once you have to support dynamic data.
I suspect there's no silver bullet for this use case, but I might be wrong.

@wincent I'll see what I can do! Thanks for the appreciation.
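[Editor's note: a sketch of the timestamp-as-order-provider idea. The only change from the id-based algorithm is the WHERE clause; created_at is an assumed column, and the cursor payload is assumed to be the row's creation timestamp.]

// Build the keyset predicate from Relay's after/before args.
// The payload may itself contain ':' (ISO timestamps do), so split on the first one only.
function timestampPredicate({ after, before }) {
    const decode = (cursor) => {
        const raw = Buffer.from(cursor, 'base64').toString('ascii')
        return raw.slice(raw.indexOf(':') + 1)
    }
    const where = []
    const params = []
    if (after)  { where.push('created_at > ?'); params.push(decode(after)) }
    if (before) { where.push('created_at < ?'); params.push(decode(before)) }
    return { where, params }
}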

@joonhocho commented Sep 30, 2016

I came across this issue and wanted to share a Node.js library that I created a while ago:
https://github.com/joonhocho/graphql-relay-connection
It helps you easily create custom Relay connections.
You can create connections for MongoDB, simple number arrays, etc. by providing a few functions.
I hope this helps.

@Naoto-Ida commented Nov 2, 2016

@mattecapu Great post! Just wanted to say it helped out a lot. We had some tweaking to do, since we accept arguments other than the Relay args spec.

@Naoto-Ida commented Nov 8, 2016

Everything was working great with @mattecapu's solution, but I ran into one rare situation and am in quite a pickle... wondering if any of you have run into this issue.

We have one query for events that returns data in a specific order.
This specific order, which is passed as part of an SQL statement, takes into account things like how many times the event's page has been viewed, whether it has a promotional image, etc. Basically about five *** DESCs in the ORDER BY.

We have an events query like this:

{
  viewer {
    events(first: 1, inCountry: Japan) {
      edges {
        node {
          id
          name
        }
      }
    }
  }
}

first and last can be dealt with as addressed above: flip all of the multiple ORDER BY ... DESC clauses to ASC.

But when you introduce after & before, it is very difficult to serve the correct data corresponding to the given node ID: due to the nature of our query, it is not as simple as saying event.id > 1 or event.createdAt > xxxx when after & before are passed.

@mattecapu commented Nov 8, 2016

@Naoto-Ida if I understood correctly and your order is deterministic, the createdAt attribute should do the trick.
How do you compute your cursors?

@Naoto-Ida commented Nov 9, 2016

@mattecapu

Our ORDER BY clause in our SQL would include something like:

  1. closest to user location
  2. amount of views
  3. whether it's free, etc.

Or something like that. It differs drastically depending on the user and on "when" you query it, so it doesn't work off createdAt. Sorry, maybe my example of using createdAt was misleading.
What I was trying to get at was that your method works great when the ordering is based on a single factor, but not when there are multiple factors in play.

We compute the cursor by base64-decoding it and splitting it into object:ID. From there we get that one record and fetch the ones before/after it.

So in the end, since there weren't that many records and we had time constraints, we cached all the event records in Redis. We fetch them when a query with the same arguments comes in, then, based on the supplied cursor, slice the total records and serve the ones before/after it.

@mattecapu commented Nov 9, 2016

Now I see the problem.
When I was writing my algorithm, my implicit assumption was that between two different queries, the only change that could happen to the database was the addition or removal of records.
While this may be a good-enough assumption for most cases, it fails if, as in your case, there's a third way to mutate the state of your database: updating a record.
That messes with my algorithm because I cannot tell whether a given record has changed since the last time I queried the DB.
Yet this is not a problem specific to my algorithm. In general, there's no way to do this without storing information about updates.
Likewise, without storing a createdAt attribute there's no way to tell whether a given record was there when I first ran the query (apart from using other time-related information stored in the record, such as a sequential ID).
So the only way to support such a constraint is to change the architecture of your DB, adopting some kind of time-traveling structure.
In the end, this is not as difficult as it sounds.
For the general case, you'll have to store every update you make to your DB.
So your records will now look like this:

object_id | updated_attribute | new_value | timestamp

Your data will be shattered into several of these atomic operations, which aggregated together give you the most recent version of it. So, for example, if I want to retrieve attribute name of object 34, I'll ask my DB:

SELECT new_value FROM mytable WHERE object_id='34' ORDER BY timestamp DESC LIMIT 1

But now, if I want to know which state my DB was in at a given time, I'll simply exclude all updates done after a specific timestamp:

SELECT new_value FROM mytable WHERE object_id='34' AND timestamp < GIVEN_TIMESTAMP
 ORDER BY timestamp DESC LIMIT 1

Voilà, I can now run queries against any version of my DB.
Obviously this is a simplification; you'll have a lot of issues to address (or better, operations to translate) to support this kind of architecture, but it's the only/best way to support these problem constraints.

Speaking of your specific case, @Naoto-Ida, I think you could get away with a far less disruptive change: create a table views(post_id, timestamp) and calculate amountOfViews at query time by aggregating on the foreign key of your main table. That way you can filter by timestamp and retrieve the state (and thus the order) you had at that time. (A sketch follows.)
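[Editor's note: a sketch of that query-time aggregation. The table and column names (events, views, post_id, timestamp) and db.query are assumptions, and the cursor is assumed to carry the timestamp of the first page's query.]

// Recompute each event's view count as of the time carried by the cursor,
// so page 2 sees the same ordering page 1 did.
async function eventsAsOf(db, cursorTimestamp) {
    const sql = `
        SELECT e.*, COUNT(v.post_id) AS amount_of_views
        FROM events e
        LEFT JOIN views v ON v.post_id = e.id AND v.timestamp < ?
        GROUP BY e.id
        ORDER BY amount_of_views DESC`
    return db.query(sql, [cursorTimestamp])
}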

@luckydrq commented Oct 31, 2017

  4. If the first argument is provided, add ORDER BY id DESC LIMIT first+1 to the query

@mattecapu why +1 to the limit? I think it should just equal the first argument.

@Naoto-Ida commented Oct 31, 2017

@luckydrq You add +1 to get the boolean for pageInfo.hasNextPage

@mattecapu commented Nov 5, 2017

@luckydrq yeah, it basically peeks at the next page to see whether there is one.

Re-reading my last post, it occurs to me that there's a less invasive way to support updates: just use an updatedAt attribute. Use it the same way you use the createdAt attribute, and refresh it on every update.

@pcattori commented Jan 3, 2018

I implemented some helper functions (namely paginate) so that SQL support would be easy (ordering and filtering supported). I followed @mattecapu's suggested approach.

usage:

import {
    GraphQLObjectType,
} from 'graphql'
import {
    connectionArgs,
    connectionDefinitions, // needed by the helper below
} from 'graphql-relay'

import * as helpers from 'the-gist-linked-above' // not an actual npm lib yet :P

// helper: build a connection type for a given node type
const connectionType = nodeType => connectionDefinitions({nodeType}).connectionType

const Query = new GraphQLObjectType({
    name: 'Query',
    fields: () => ({
        // ... other queries here
        things: {
            type: connectionType(Thing), // Thing is your GraphQL node type
            args: connectionArgs,
            resolve: (_, paginationArgs) => {
                // you could get `orderBy` from args, but it's hard-coded here for simplicity;
                // models.Thing is the corresponding Sequelize model
                return helpers.paginate(models.Thing, paginationArgs, {orderBy: [['name', 'ASC']]})
            }
        },
    })
})

It's currently coupled with Sequelize, but could pretty easily be decoupled.

https://gist.github.com/pcattori/2bb645d587e45c9fdbcabf5cef7a7106

@sibelius commented Jan 3, 2018

We use this in production:

https://github.com/entria/graphql-mongoose-loader

It solves pagination and DataLoader batching for Mongo, using Mongoose.

We have the same concept for other data sources, such as REST APIs and SQL (Oracle and Postgres); it's very easy to extend to any data source.

@crisu83 commented Apr 26, 2018

@mattecapu Thanks for posting your solution here, it was very helpful.

I implemented the same logic in our project on top of our GraphQL and Relay implementations in PHP.

@enumag

This comment has been minimized.

Copy link

@enumag enumag commented Oct 2, 2018

@mattecapu I read your guide on how to handle GraphQL connections with SQL. It's pretty much what I came up with when I analyzed this myself (plus some details like how to handle hasNextPage).

The problem is that the SQL query gets really complicated really fast when I add an optional sorting argument to the GraphQL connection, especially if the sort can be a combination of fields.

Do you have any tips on how to handle that, and how to do it efficiently?

@mattecapu commented Oct 2, 2018

Hi @enumag, what exactly do you mean by 'SQL complexity'? The length in chars? Execution time? Other metrics?
I will assume you mean 'the query it writes is an ugly monster made up of a lot of chunks'. Unfortunately, my algorithm (feels weird to call it that...) is very straightforward, and I really can't think of any way to make it shorter or to improve its 'efficiency'. It really just translates between GraphQL and SQL, so I don't see room for improvement without pruning features from your endpoint.
That said, using an SQL builder means you never deal with the raw, horrific string, and using an ORM gives an even nicer interface (I use Sequelize). I'd suggest that as good practice, because it abstracts away much of the trouble.

@enumag commented Oct 3, 2018

Hi @enumag, what exactly do you mean by 'SQL complexity'? The length in chars? Execution time? Other metrics?

Primarily execution time and efficient usage of indexes on that table.

Secondarily, the SQL query length and the number of conditions in the WHERE clause, but I'm already using an SQL builder, so I can deal with that myself.
