Skip to content

Improve performance of queue locking #240

@ben-pr-p

Description

@ben-pr-p

Feature description

The current query, with 161 job queues and several thousand jobs on one queue:

# explain analyze 
select jobs.queue_name, jobs.id
from "graphile_worker".jobs
where (jobs.locked_at is null or jobs.locked_at < (now() - '4 hours'::interval))
and (
  jobs.queue_name is null
or
  exists (
    select 1
    from "graphile_worker".job_queues
    where job_queues.queue_name = jobs.queue_name
    and (job_queues.locked_at is null or job_queues.locked_at < (now() - '4 hours'::interval))
    for update
    skip locked
  )
)
and run_at <= now()
and attempts < max_attempts
and (null is null or jobs.task_identifier = any(null))
and (null is null or (flags ?| null) is not true)
order by priority asc, run_at asc, id asc
limit 1
for update
skip locked;
                                                                                       QUERY PLAN                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..227.88 rows=1 width=38) (actual time=242.627..242.629 rows=0 loops=1)
   ->  LockRows  (cost=0.41..2317910.12 rows=10190 width=38) (actual time=242.625..242.626 rows=0 loops=1)
         ->  Index Scan using jobs_priority_run_at_id_locked_at_without_failures_idx on jobs  (cost=0.41..2317808.22 rows=10190 width=38) (actual time=242.624..242.626 rows=0 loops=1)
               Index Cond: (run_at <= now())
               Filter: ((attempts < max_attempts) AND ((locked_at IS NULL) OR (locked_at < (now() - '04:00:00'::interval))) AND ((queue_name IS NULL) OR (SubPlan 1)))
               Rows Removed by Filter: 8859
               SubPlan 1
                 ->  LockRows  (cost=0.52..194.36 rows=1 width=10) (actual time=0.026..0.026 rows=0 loops=8857)
                       ->  Index Scan using job_queues_pkey on job_queues  (cost=0.52..194.35 rows=1 width=10) (actual time=0.026..0.026 rows=0 loops=8857)
                             Index Cond: (queue_name = jobs.queue_name)
                             Filter: ((locked_at IS NULL) OR (locked_at < (now() - '04:00:00'::interval)))
                             Rows Removed by Filter: 1
 Planning Time: 0.250 ms
 Execution Time: 242.679 ms
(14 rows)

An alternative query, on the same table:

# explain analyze 
select jobs.queue_name, jobs.id
from "graphile_worker".jobs
where (jobs.locked_at is null or jobs.locked_at < (now() - '4 hours'::interval))
and (
  jobs.queue_name is null
or
  jobs.queue_name in (
    select queue_name
    from "graphile_worker".job_queues
    where true                                   
    and (job_queues.locked_at is null or job_queues.locked_at < (now() - '4 hours'::interval))
    for update
    skip locked
  )
)
and run_at <= now()
and attempts < max_attempts
and (null is null or jobs.task_identifier = any(null))
and (null is null or (flags ?| null) is not true)
order by priority asc, run_at asc, id asc
limit 1
for update
skip locked;
                                                                                      QUERY PLAN                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3554.48..3558.87 rows=1 width=38) (actual time=13.327..13.330 rows=0 loops=1)
   ->  LockRows  (cost=3554.48..48298.47 rows=10190 width=38) (actual time=13.326..13.328 rows=0 loops=1)
         ->  Index Scan using jobs_priority_run_at_id_locked_at_without_failures_idx on jobs  (cost=3554.48..48196.57 rows=10190 width=38) (actual time=13.326..13.327 rows=0 loops=1)
               Index Cond: (run_at <= now())
               Filter: (((queue_name IS NULL) OR (hashed SubPlan 1)) AND (attempts < max_attempts) AND ((locked_at IS NULL) OR (locked_at < (now() - '04:00:00'::interval))))
               Rows Removed by Filter: 8859
               SubPlan 1
                 ->  LockRows  (cost=0.00..3553.65 rows=166 width=42) (actual time=0.068..6.381 rows=158 loops=1)
                       ->  Seq Scan on job_queues  (cost=0.00..3551.99 rows=166 width=42) (actual time=0.012..5.852 rows=158 loops=1)
                             Filter: ((locked_at IS NULL) OR (locked_at < (now() - '04:00:00'::interval)))
                             Rows Removed by Filter: 2
 Planning Time: 0.486 ms
 Execution Time: 13.396 ms
(13 rows)

Motivating example

We're seeing 70-100% CPU utilization from get_job on pretty big machines when there are more than a few thousand jobs that need to be skipped because their queue is locked.

Breaking changes

Shouldn't be any, although I'm not 100% sure what the behavior of for update skip locked in the subquery is

Supporting development

I [tick all that apply]:

  • am interested in building this feature myself
  • am interested in collaborating on building this feature
  • am willing to help testing this feature before it's released
  • am willing to write a test-driven test suite for this feature (before it exists)
  • am a Graphile sponsor ❤️
  • have an active support or consultancy contract with Graphile

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions