Batch big IN/NOT IN selections #575
Conversation
I'm not familiar enough with the implementation of the filters yet to say whether there are high-level problems with the design, but it looks good to me 💯
One super nitpicky thing: the PR description is helpful; it would be nice to have it in the commit message for future reference (sometimes I do use that :p)
I was thinking a bit more about this: if the list here is large (which it is if we are batching), we iterate over it element by element. If we instead computed the indexes for the batches, we could iterate over those and push the batches one at a time (instead of an element at a time). As written, we pay per-element costs like the conditional and the `last_mut` lookup (with its bounds check). But that's an optimization we can do later, once we have decided we want to do it this way and the API settles down.
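A minimal sketch of the index-based idea (the `batches_by_chunks` helper below is invented for illustration, not the PR's actual code): `slice::chunks` computes the batch boundaries up front, so each iteration pushes a whole batch and there is no per-element conditional or `last_mut` bounds check.

```rust
/// Hypothetical helper, not the PR's code: split a value list into
/// batches by pushing whole chunks instead of one element at a time.
fn batches_by_chunks<T: Clone>(values: &[T], batch_size: usize) -> Vec<Vec<T>> {
    // `chunks` yields non-overlapping slices of at most `batch_size`
    // elements; the last chunk may be shorter.
    values
        .chunks(batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```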
Yeah, this isn't the final version either; it's the first version that somehow works. Now we need some conversation about this. The first limit I'd like to lift is that this doesn't work if we have more than one `IN` statement in the filters.
query-engine/connectors/sql-query-connector/src/database/operations/read.rs
This will fix the issues when joining and fetching with a big selection of records, when the database no longer accepts queries that big. What it does is check the filter: if it can find ONE scalar filter that is `IN`/`NOT IN` with more than 5000 items, it splits the filters into batches of at most 5000 items and runs the queries in parallel. So this is going to be batched:

```graphql
query {
  findManyUser(where: { id_lt: 10000 }) {
    posts(where: { id_lt: 10000 }) {
      id
    }
  }
}
```

The batch size can be controlled via the `QUERY_BATCH_SIZE` environment variable for test purposes.
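As a hedged illustration of the check described above (the enum and helper below are invented for this sketch; the engine's real filter tree is much richer):

```rust
/// Invented, simplified filter shape for illustration only.
enum ScalarListFilter {
    In(Vec<i64>),
    NotIn(Vec<i64>),
    Other,
}

/// Returns true when a filter carries an `IN`/`NOT IN` value list
/// longer than the batch threshold (5000 by default in the PR) and
/// therefore needs to be split into batches.
fn needs_batching(filter: &ScalarListFilter, threshold: usize) -> bool {
    match filter {
        ScalarListFilter::In(values) | ScalarListFilter::NotIn(values) => {
            values.len() > threshold
        }
        ScalarListFilter::Other => false,
    }
}
```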
Apart from the discussed ordering issues, this looks alright.
Addresses #579
My only doubt is whether this will interact badly with other features and we'll miss it, because the default is 5000 and most tests don't have that many records in the database.
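One way tests could exercise the batching path without seeding thousands of records is to lower the threshold through the environment variable the PR mentions. A sketch of reading it with a fallback (the helper name and constant are assumptions, not the PR's actual code):

```rust
use std::env;

/// Default maximum number of items per batched `IN`/`NOT IN` query
/// (the 5000 mentioned in the PR description).
const DEFAULT_BATCH_SIZE: usize = 5000;

/// Hypothetical helper: read `QUERY_BATCH_SIZE` from the environment,
/// falling back to the default when unset or unparseable. Tests can
/// set it to a tiny value to trigger batching with few records.
fn batch_size() -> usize {
    env::var("QUERY_BATCH_SIZE")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_BATCH_SIZE)
}
```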
```rust
if let Some(ref order_by) = order {
    records.order_by(order_by)
}
```
Hmm, so here we actually sort ourselves, right?
yep
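A sketch of what "sorting ourselves" could look like once the batched queries return (the `Record` type and its sort key are hypothetical; the real code orders by the query's `order_by`): each batch comes back ordered on its own, so the combined result set has to be re-sorted in memory.

```rust
#[derive(Debug, PartialEq, Eq)]
struct Record {
    id: i64,
}

/// Hypothetical sketch: database ordering only holds within a single
/// batch, so merge the per-batch results and re-sort in memory.
fn merge_and_sort(batches: Vec<Vec<Record>>) -> Vec<Record> {
    let mut all: Vec<Record> = batches.into_iter().flatten().collect();
    all.sort_by_key(|r| r.id);
    all
}
```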
We'll keep an eye on this. For now, with lots of manual testing and with the integration tests, I was able to find two problems:
Hopefully on Friday I can solve the last query in a way that works, and I can also write tests for this somehow.