Adding "priority ordering" feature to allow users specifying the order of precedence in a queue #183
…r of precedence in a queue. No matter the `execution_time`, given execution_time>CURRENT_TIMESTAMP, the highest priority will always be executed first, e.g.

```
scheduler.schedule(onetimeTask.instanceBuilder("1").setPriority(100), Instant.now());
scheduler.schedule(onetimeTask.instanceBuilder("2").setPriority(200), Instant.now());
```
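To make the intended ordering concrete, here is a minimal in-memory sketch. It assumes that a larger number means a higher priority and that ties are broken by the older `execution_time`; the `Execution` class and `pickDue` method are hypothetical stand-ins for illustration, not db-scheduler classes.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PriorityOrderingSketch {

    /** Hypothetical stand-in for a scheduled execution row; not a db-scheduler class. */
    static final class Execution {
        final String id;
        final int priority;
        final Instant executionTime;

        Execution(String id, int priority, Instant executionTime) {
            this.id = id;
            this.priority = priority;
            this.executionTime = executionTime;
        }
    }

    /**
     * Returns at most {@code limit} due executions (executionTime <= now),
     * highest priority first, then oldest execution time first.
     * Assumption: a larger number means a higher priority.
     */
    static List<Execution> pickDue(List<Execution> all, Instant now, int limit) {
        List<Execution> due = new ArrayList<>();
        for (Execution e : all) {
            if (!e.executionTime.isAfter(now)) {
                due.add(e);
            }
        }
        due.sort(Comparator
                .comparingInt((Execution e) -> e.priority).reversed()
                .thenComparing(e -> e.executionTime));
        return new ArrayList<>(due.subList(0, Math.min(limit, due.size())));
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2020-01-01T00:00:00Z");
        List<Execution> all = new ArrayList<>();
        all.add(new Execution("1", 100, now.minusSeconds(60)));
        all.add(new Execution("2", 200, now.minusSeconds(10)));
        all.add(new Execution("3", 300, now.plusSeconds(60))); // not due yet, so skipped
        for (Execution e : pickDue(all, now, 10)) {
            System.out.println(e.id + " priority=" + e.priority);
        }
    }
}
```

With this data, execution "2" is picked before "1" even though "1" has been due for longer, and "3" is ignored until its execution time has passed.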
That was fast! I don't think I have fully wrapped my head around this feature yet. One thing I still have not found a good solution for is how to index this design effectively. Do you have any thoughts there? Fetch-due query:
For high-volume use-cases I have created an index like
However, with priority added to the mix, I don't think this index will work (since the query now contains two "ranges"):
I suspect the current solution will perform poorly for high-throughput use-cases, since they will not be able to index effectively. Hypothetically, let's say we have an extreme case with 1M executions due... 🤔 I am trying to think of a good solution here. One variant is to allow users to disable priority-sorting for high-volume cases; another is a very basic priority feature (LOW, NORMAL, HIGH) that executes three different fetches, in priority order HIGH->LOW.
Just to explore some other thoughts.
I've seen that you have a pretty nice framework for testing changes in #175; we might want to use that to evaluate
Yeah, that could be used to evaluate this. Though I think we could simply create a table, populate it with a couple of million rows, create the indices and run the
If cardinality were as low as 3 (high, normal, low), then we might just issue 3 queries
In this case, this index would work well, since priority is locked to a single value for each query:
The downside is that we need to issue 3 selects each time we poll, but if we poll for 50+ executions each time, the overhead will still be pretty small, and all queries will be fast thanks to a "perfect" index.
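The three-query idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Priority` enum, the `DueFetcher` interface and the `poll` method are hypothetical names, and `fetchDue` stands in for a single per-priority select (priority fixed to one value, due rows ordered by execution time) whose exact SQL is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class TieredFetchSketch {

    /** Declared from highest to lowest so values() iterates HIGH -> LOW. */
    enum Priority { HIGH, NORMAL, LOW }

    /** Hypothetical stand-in for one per-priority select against the database. */
    interface DueFetcher {
        List<String> fetchDue(Priority priority, int limit);
    }

    /** Fills one poll batch by querying each tier in turn, stopping once the batch is full. */
    static List<String> poll(DueFetcher fetcher, int batchSize) {
        List<String> picked = new ArrayList<>();
        for (Priority priority : Priority.values()) {
            int remaining = batchSize - picked.size();
            if (remaining <= 0) {
                break; // lower tiers are not queried when higher tiers fill the batch
            }
            picked.addAll(fetcher.fetchDue(priority, remaining));
        }
        return picked;
    }

    /** In-memory fetcher for demonstration; each list is assumed to be sorted oldest first. */
    static DueFetcher inMemory(Map<Priority, List<String>> dueByPriority) {
        return (priority, limit) -> {
            List<String> due = dueByPriority.get(priority);
            if (due == null) {
                return new ArrayList<>();
            }
            return new ArrayList<>(due.subList(0, Math.min(limit, due.size())));
        };
    }

    public static void main(String[] args) {
        Map<Priority, List<String>> due = new EnumMap<>(Priority.class);
        due.put(Priority.HIGH, Arrays.asList("h1", "h2"));
        due.put(Priority.NORMAL, Arrays.asList("n1", "n2"));
        due.put(Priority.LOW, Arrays.asList("l1"));
        System.out.println(poll(inMemory(due), 3)); // prints [h1, h2, n1]
    }
}
```

Because the loop stops as soon as the batch is full, a backlog of HIGH executions costs only one select per poll; all three selects are issued only when the higher tiers cannot fill the batch.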
I'm pretty sure that the index on two fields should work pretty well. Queries on two ranges shouldn't be a problem. Multiple queries are for sure less performant than one query with an index, and it would also be a strong limitation for users. An index with
For your use-case, what volumes are you expecting? Also, I think we should do a local performance test by inserting, say, 5M records with slightly randomized execution-time and priority (max cardinality 10 for now), and see what query-times we get when fetching, say, 100 executions. Then we can try to optimize them by creating ideal indices.
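As a starting point for such a test, the data set could be generated along these lines. This is a rough sketch under assumptions: the `Row` class, the `generate` method and its parameters are hypothetical, and loading the rows into the database (for example with a JDBC batch insert) and timing the fetch query are left out.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SyntheticExecutions {

    /** Hypothetical row shape for the test table; not the actual db-scheduler schema. */
    static final class Row {
        final String taskInstance;
        final Instant executionTime;
        final int priority;

        Row(String taskInstance, Instant executionTime, int priority) {
            this.taskInstance = taskInstance;
            this.executionTime = executionTime;
            this.priority = priority;
        }
    }

    /**
     * Generates {@code count} rows whose execution time lies within
     * {@code jitterSeconds} before {@code base} (so all rows are due when base is "now")
     * and whose priority is uniform in [0, priorityCardinality).
     * A fixed seed keeps runs reproducible, so different indices see the same data.
     */
    static List<Row> generate(int count, int priorityCardinality, Instant base,
                              long jitterSeconds, long seed) {
        Random random = new Random(seed);
        List<Row> rows = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            long offsetSeconds = (long) (random.nextDouble() * jitterSeconds);
            int priority = random.nextInt(priorityCardinality);
            rows.add(new Row("instance-" + i, base.minusSeconds(offsetSeconds), priority));
        }
        return rows;
    }

    public static void main(String[] args) {
        Instant base = Instant.parse("2020-01-01T00:00:00Z");
        List<Row> rows = generate(1000, 10, base, 3600, 42L);
        System.out.println("generated " + rows.size() + " rows");
    }
}
```

For the 5M-row case it would probably be better to generate and insert in chunks rather than hold every row in memory at once.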
Sure. In theory the highest-cardinality column should always be first in the index, which in this case is execution_time. If you imagine an index on execution_time, priority, where priority has cardinality 1, the index performance should exactly match that of an index on just execution_time. At the moment I don't have the capacity to create a full-fledged test system. Would you have time to experiment? Or could we optimize later if needed?
I may get some time to run a couple of tests. I want to be sure that this feature will not make performance worse for those using it for high-throughput cases, or at least that there is an escape-hatch should they experience lower throughput.
I think for perfect results, the index should match the The nice thing about the current (master) select and index is that the database only has to read
Should the results show that this feature may affect performance, then I think we could add it as an opt-in feature, where you explicitly enable it on the scheduler (e.g.
Ok, so you would be comparing the current version with the new branch
Yeah I suppose so. master vs this branch where all priorities = null/0
Sorry, I haven't had time to give this PR attention yet. I really need to finish PR #175 and time is limited unfortunately
Hi, I wanted to check whether this feature is still going forward? It's something that my team would also find really useful 🙂
I think it is an interesting feature, but there are a couple of things higher up on the list. Could you describe your use-case?
Sure. The scenario is that I am using the dbscheduler to publish to a few Kafka topics. Some of these are just auditing topics and others are actual feature code, so I would like to prioritise feature code over auditing.
Are you anticipating high volumes (how high?) such that executions will queue up?
At my company we run https://github.com/instructure/inst-jobs for Rails applications, which has a similar priority feature. Its pop query is (simplified) "SELECT * FROM jobs WHERE run_at < now() ORDER BY priority, run_at, id", which uses an index on
so at least on Postgres such an index is usable without additional filtering. This queue has scaled to millions of jobs that are queued and pop-able without performance issues on the pop query, so I'm reasonably confident that such an index works right.
Hi Jacob, thanks for the input! Would you please post the index-definition and the full query-plan? I am very sceptical of Postgres being able to use an index on
Here's the actual pop query we run in prod. Note that we also have concepts of multiple queues in one table and of stranded jobs, both of which aren't really relevant here:
And here's the corresponding full plan (the innermost index scan is the real select from the jobs queue; as evidenced by the rows=160 on the other steps, everything else is just operating on the subset of jobs plucked by the subquery):
The index definition of the
@kagkarlsson - is it possible to plan a merge of this feature soon and release it? We need this capability in our application.
I might be able to pick this up soon, but cannot give an ETA for it
#181
No matter the `execution_time`, given execution_time>CURRENT_TIMESTAMP, the highest priority will always be executed first, e.g.
@kagkarlsson I wasn't sure how to bump a major version.
Please let me know if you have any feedback!