Feature description
The current query, with 161 job queues and several thousand jobs on one queue:
# explain analyze
select jobs.queue_name, jobs.id
from "graphile_worker".jobs
where (jobs.locked_at is null or jobs.locked_at < (now() - '4 hours'::interval))
  and (
    jobs.queue_name is null
    or exists (
      select 1
      from "graphile_worker".job_queues
      where job_queues.queue_name = jobs.queue_name
        and (job_queues.locked_at is null or job_queues.locked_at < (now() - '4 hours'::interval))
      for update
      skip locked
    )
  )
  and run_at <= now()
  and attempts < max_attempts
  and (null is null or jobs.task_identifier = any(null))
  and (null is null or (flags ?| null) is not true)
order by priority asc, run_at asc, id asc
limit 1
for update
skip locked;
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..227.88 rows=1 width=38) (actual time=242.627..242.629 rows=0 loops=1)
   ->  LockRows  (cost=0.41..2317910.12 rows=10190 width=38) (actual time=242.625..242.626 rows=0 loops=1)
         ->  Index Scan using jobs_priority_run_at_id_locked_at_without_failures_idx on jobs  (cost=0.41..2317808.22 rows=10190 width=38) (actual time=242.624..242.626 rows=0 loops=1)
               Index Cond: (run_at <= now())
               Filter: ((attempts < max_attempts) AND ((locked_at IS NULL) OR (locked_at < (now() - '04:00:00'::interval))) AND ((queue_name IS NULL) OR (SubPlan 1)))
               Rows Removed by Filter: 8859
               SubPlan 1
                 ->  LockRows  (cost=0.52..194.36 rows=1 width=10) (actual time=0.026..0.026 rows=0 loops=8857)
                       ->  Index Scan using job_queues_pkey on job_queues  (cost=0.52..194.35 rows=1 width=10) (actual time=0.026..0.026 rows=0 loops=8857)
                             Index Cond: (queue_name = jobs.queue_name)
                             Filter: ((locked_at IS NULL) OR (locked_at < (now() - '04:00:00'::interval)))
                             Rows Removed by Filter: 1
 Planning Time: 0.250 ms
 Execution Time: 242.679 ms
(14 rows)
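The dominant cost is visible in the plan itself: SubPlan 1 is correlated (it references jobs.queue_name), so it re-executes once per candidate job row rather than once per query. A quick back-of-the-envelope check, using only the numbers from the plan above:

```python
# Rough check that the correlated SubPlan accounts for nearly all of the runtime.
# Both numbers are taken directly from the EXPLAIN ANALYZE output above.
subplan_loops = 8857         # SubPlan 1 runs once per candidate job row (loops=8857)
subplan_ms_per_loop = 0.026  # actual time per execution of the inner index scan

total_subplan_ms = subplan_loops * subplan_ms_per_loop
print(round(total_subplan_ms, 1))  # ~230 ms of the 242.7 ms total execution time
```

So almost the entire 242 ms is spent repeatedly probing (and attempting to lock) the same handful of job_queues rows, once for every job whose queue turns out to be locked.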
An alternative query, on the same table:
# explain analyze
select jobs.queue_name, jobs.id
from "graphile_worker".jobs
where (jobs.locked_at is null or jobs.locked_at < (now() - '4 hours'::interval))
  and (
    jobs.queue_name is null
    or jobs.queue_name in (
      select queue_name
      from "graphile_worker".job_queues
      where true
        and (job_queues.locked_at is null or job_queues.locked_at < (now() - '4 hours'::interval))
      for update
      skip locked
    )
  )
  and run_at <= now()
  and attempts < max_attempts
  and (null is null or jobs.task_identifier = any(null))
  and (null is null or (flags ?| null) is not true)
order by priority asc, run_at asc, id asc
limit 1
for update
skip locked;
                                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3554.48..3558.87 rows=1 width=38) (actual time=13.327..13.330 rows=0 loops=1)
   ->  LockRows  (cost=3554.48..48298.47 rows=10190 width=38) (actual time=13.326..13.328 rows=0 loops=1)
         ->  Index Scan using jobs_priority_run_at_id_locked_at_without_failures_idx on jobs  (cost=3554.48..48196.57 rows=10190 width=38) (actual time=13.326..13.327 rows=0 loops=1)
               Index Cond: (run_at <= now())
               Filter: (((queue_name IS NULL) OR (hashed SubPlan 1)) AND (attempts < max_attempts) AND ((locked_at IS NULL) OR (locked_at < (now() - '04:00:00'::interval))))
               Rows Removed by Filter: 8859
               SubPlan 1
                 ->  LockRows  (cost=0.00..3553.65 rows=166 width=42) (actual time=0.068..6.381 rows=158 loops=1)
                       ->  Seq Scan on job_queues  (cost=0.00..3551.99 rows=166 width=42) (actual time=0.012..5.852 rows=158 loops=1)
                             Filter: ((locked_at IS NULL) OR (locked_at < (now() - '04:00:00'::interval)))
                             Rows Removed by Filter: 2
 Planning Time: 0.486 ms
 Execution Time: 13.396 ms
(13 rows)
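The key difference is the hashed SubPlan: because the IN-subquery no longer references jobs, Postgres can execute it once (loops=1), lock and hash the ~158 unlocked queue rows, and then answer each job row with a cheap hash probe instead of a fresh index scan plus lock attempt. A toy cost model of the two plan shapes (illustrative Python; the per-probe constant is invented, the other numbers come from the two plans above):

```python
# Toy cost model comparing the correlated-EXISTS plan to the hashed-IN plan.
# SUBPLAN_COST and HASH_BUILD come from the EXPLAIN output; HASH_PROBE is an
# invented "effectively free" constant for an in-memory hash lookup.
N_JOBS = 8857          # candidate job rows scanned in both plans
SUBPLAN_COST = 0.026   # ms per correlated SubPlan execution (plan 1, loops=8857)
HASH_BUILD = 6.4       # ms to lock + hash all unlocked queue rows once (plan 2)
HASH_PROBE = 0.0001    # ms per hash probe (assumed, not measured)

correlated_ms = N_JOBS * SUBPLAN_COST          # one inner index scan per job row
hashed_ms = HASH_BUILD + N_JOBS * HASH_PROBE   # build once, then probe per row
print(round(correlated_ms), round(hashed_ms))  # ~230 vs ~7
```

This matches the observed drop from ~242 ms to ~13 ms: the work proportional to the number of filtered-out jobs shrinks from an index scan per row to a hash probe per row.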
Motivating example
We're seeing 70-100% CPU utilization from get_job on pretty big machines when there are more than a few thousand jobs that need to be skipped because their queue is locked.
Breaking changes
There shouldn't be any, although I'm not 100% sure what the behavior of for update skip locked is inside a subquery.
Supporting development
I [tick all that apply]: