Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repair may stuck when many failed jobs #122

Closed
andrii-suse opened this issue Jan 20, 2022 · 6 comments
Closed

Repair may stuck when many failed jobs #122

andrii-suse opened this issue Jan 20, 2022 · 6 comments

Comments

@andrii-suse
Copy link

  • Minion version: all
  • Perl version: all
  • Operating system: all

Steps to reproduce the behavior

Minion::Backend::Pg::repair doesn't scale well

Expected behavior

Should scale well with number of rows to delete and number of inactive / failed tasks

Actual behavior

Performance decreases exponentially - Backend must do X times array lookups over Y rows, where
X - number of finished jobs to be deleted
Y - number of inactive / failed jobs

Analysis:

The query used for removing old jobs is similar to

select id FROM minion_jobs AS j
     WHERE (finished <= NOW() - INTERVAL '2 days' AND NOT EXISTS (
       SELECT 1 FROM minion_jobs WHERE parents @> ARRAY[j.id] AND state != 'finished'
     ) AND state = 'finished') OR (expires <= NOW() AND state = 'inactive');

Explain analyze in live environment shows

Seq Scan on minion_jobs j  (cost=0.00..526712612.52 rows=6242 width=8) (actual time=216.507..1707.577 rows=9366 loops=1)
Execution Time: 1712.016 ms

but the cost grows further with number of rows to remove and rows in failed + inactive jobs. It may stuck for hours with large enough number of old and failed jobs.

Proposed solution:

Rewrite query to use left join, so it should scale well with increased number of rows to process

select j.id FROM minion_jobs AS j 
    LEFT JOIN minion_jobs children ON children.state != 'finished' and array_length(children.parents,1)>0 and j.id = any(children.parents)
     WHERE (j.finished <= NOW() - INTERVAL '2 day' AND children.id IS NULL
     AND j.state = 'finished') OR (j.expires <= NOW() AND j.state = 'inactive');

Explain analyze in live environment shows:

 Nested Loop Left Join  (cost=2000.00..100802.42 rows=1 width=8) (actual time=171.405..252.822 rows=9458 loops=1)
 Execution Time: 257.431 ms
@kraih
Copy link
Member

kraih commented Jan 20, 2022

Interesting. I'll have to double check, but it looks like a good optimization.

@kraih
Copy link
Member

kraih commented Jan 20, 2022

Full explain analyze results for one of my larger Minion setups:

# explain analyze select id FROM minion_jobs AS j WHERE (finished <= NOW() - INTERVAL '2 days' AND NOT EXISTS ( SELECT 1 FROM minion_jobs WHERE parents @> ARRAY[j.id] AND state != 'finished' ) AND state = 'finished') OR (expires <= NOW() AND state = 'inactive');
                                                                                      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on minion_jobs j  (cost=0.00..15210224159.25 rows=517279 width=8) (actual time=0.280..89508.393 rows=1026181 loops=1)
   Filter: (((finished <= (now() - '2 days'::interval)) AND (NOT (SubPlan 1)) AND (state = 'finished'::minion_state)) OR ((expires <= now()) AND (state = 'inactive'::minion_state)))
   Rows Removed by Filter: 10818
   SubPlan 1
     ->  Bitmap Heap Scan on minion_jobs  (cost=86.99..14629.99 rows=1 width=0) (actual time=0.086..0.086 rows=0 loops=1026182)
           Recheck Cond: (parents @> ARRAY[j.id])
           Filter: (state <> 'finished'::minion_state)
           Rows Removed by Filter: 1
           Heap Blocks: exact=873921
           ->  Bitmap Index Scan on minion_jobs_parents_idx  (cost=0.00..86.99 rows=5198 width=0) (actual time=0.084..0.084 rows=1 loops=1026182)
                 Index Cond: (parents @> ARRAY[j.id])
 Planning Time: 1.512 ms
 Execution Time: 89563.174 ms
(13 rows)
# explain analyze select j.id FROM minion_jobs AS j LEFT JOIN minion_jobs children ON children.state != 'finished' and array_length(children.parents,1)>0 and j.id = any(children.parents) WHERE (j.finished <= NOW() - INTERVAL '2 day' AND children.id IS NULL AND j.state = 'finished') OR (j.expires <= NOW() AND j.state = 'inactive');
                                                                                            QUERY PLAN                                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=1000.00..144554.87 rows=1 width=8) (actual time=140.286..13444.584 rows=1026181 loops=1)
   Join Filter: (j.id = ANY (children.parents))
   Rows Removed by Join Filter: 8295507
   Filter: (((j.finished <= (now() - '2 days'::interval)) AND (children.id IS NULL) AND (j.state = 'finished'::minion_state)) OR ((j.expires <= now()) AND (j.state = 'inactive'::minion_state)))
   Rows Removed by Filter: 10831
   ->  Seq Scan on minion_jobs j  (cost=0.00..51092.56 rows=1039656 width=28) (actual time=0.028..418.200 rows=1037011 loops=1)
   ->  Materialize  (cost=1000.00..49276.93 rows=1 width=35) (actual time=0.000..0.001 rows=8 loops=1037011)
         ->  Gather  (cost=1000.00..49276.93 rows=1 width=35) (actual time=19.233..140.259 rows=8 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Parallel Seq Scan on minion_jobs children  (cost=0.00..48276.83 rows=1 width=35) (actual time=17.599..135.238 rows=3 loops=3)
                     Filter: ((state <> 'finished'::minion_state) AND (array_length(parents, 1) > 0))
                     Rows Removed by Filter: 345668
 Planning Time: 0.569 ms
 Execution Time: 13491.001 ms
(15 rows)

And for completeness, this is what the table looks like currently:

# \d minion_jobs
                                      Table "public.minion_jobs"
  Column  |           Type           | Collation | Nullable |                 Default                 
----------+--------------------------+-----------+----------+-----------------------------------------
 id       | bigint                   |           | not null | nextval('minion_jobs_id_seq'::regclass)
 args     | jsonb                    |           | not null | 
 created  | timestamp with time zone |           | not null | now()
 delayed  | timestamp with time zone |           | not null | 
 finished | timestamp with time zone |           |          | 
 priority | integer                  |           | not null | 
 result   | jsonb                    |           |          | 
 retried  | timestamp with time zone |           |          | 
 retries  | integer                  |           | not null | 0
 started  | timestamp with time zone |           |          | 
 state    | minion_state             |           | not null | 'inactive'::minion_state
 task     | text                     |           | not null | 
 worker   | bigint                   |           |          | 
 queue    | text                     |           | not null | 'default'::text
 attempts | integer                  |           | not null | 1
 parents  | bigint[]                 |           |          | '{}'::bigint[]
 notes    | jsonb                    |           | not null | '{}'::jsonb
 expires  | timestamp with time zone |           |          | 
 lax      | boolean                  |           | not null | false
Indexes:
    "minion_jobs_pkey" PRIMARY KEY, btree (id)
    "minion_jobs_expires_idx" btree (expires)
    "minion_jobs_notes_idx" gin (notes)
    "minion_jobs_parents_idx" gin (parents)
    "minion_jobs_state_priority_id_idx" btree (state, priority DESC, id)
Check constraints:
    "args" CHECK (jsonb_typeof(args) = 'array'::text)
    "minion_jobs_notes_check" CHECK (jsonb_typeof(notes) = 'object'::text)
Triggers:
    minion_jobs_notify_workers_trigger AFTER INSERT OR UPDATE OF retries ON minion_jobs FOR EACH ROW EXECUTE FUNCTION minion_jobs_notify_workers()

@ilmari
Copy link

ilmari commented Jan 20, 2022

Visualisations for the original plan and the updated one.

@ilmari
Copy link

ilmari commented Jan 20, 2022

It would be interesting to see what the result is of using UNION ALL instead of OR:

SELECT j.id
  FROM minion_jobs AS j 
  LEFT JOIN minion_jobs children
    ON children.state != 'finished' AND array_length(children.parents,1)>0 AND j.id = ANY(children.parents)
 WHERE j.state = 'finished' AND j.finished <= NOW() - INTERVAL '2 day' AND children.id IS NULL
UNION ALL
SELECT j.id FROM minion_jobs AS j
 WHERE j.state = 'inactive' AND j.expires <= NOW();

This might also benefit from indexes on (expires) WHERE state = 'inactive' and (finished) WHERE state = 'finished'.

@kraih
Copy link
Member

kraih commented Jan 20, 2022

Results for the UNION ALL version:

# explain analyze SELECT j.id FROM minion_jobs AS j LEFT JOIN minion_jobs children ON children.state != 'finished' AND array_length(children.parents,1)>0 AND j.id = ANY(children.parents) WHERE j.state = 'finished' AND j.finished <= NOW() - INTERVAL '2 day' AND children.id IS NULL UNION ALL SELECT j.id FROM minion_jobs AS j WHERE j.state = 'inactive' AND j.expires <= NOW();
                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=1000.00..136994.01 rows=2 width=8) (actual time=129.375..11762.892 rows=1026222 loops=1)
   ->  Nested Loop Left Join  (cost=1000.00..136987.78 rows=1 width=8) (actual time=129.374..11676.382 rows=1026222 loops=1)
         Join Filter: (j.id = ANY (children.parents))
         Rows Removed by Join Filter: 4104888
         Filter: (children.id IS NULL)
         ->  Seq Scan on minion_jobs j  (cost=0.00..61625.54 rows=1041423 width=8) (actual time=0.056..575.302 rows=1026222 loops=1)
               Filter: ((state = 'finished'::minion_state) AND (finished <= (now() - '2 days'::interval)))
               Rows Removed by Filter: 10833
         ->  Materialize  (cost=1000.00..49326.67 rows=1 width=35) (actual time=0.000..0.000 rows=4 loops=1026222)
               ->  Gather  (cost=1000.00..49326.66 rows=1 width=35) (actual time=22.857..129.393 rows=4 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Parallel Seq Scan on minion_jobs children  (cost=0.00..48326.56 rows=1 width=35) (actual time=16.749..124.557 rows=1 loops=3)
                           Filter: ((state <> 'finished'::minion_state) AND (array_length(parents, 1) > 0))
                           Rows Removed by Filter: 345684
   ->  Index Scan using minion_jobs_state_priority_id_idx on minion_jobs j_1  (cost=0.42..6.20 rows=1 width=8) (actual time=0.077..0.078 rows=0 loops=1)
         Index Cond: (state = 'inactive'::minion_state)
         Filter: (expires <= now())
         Rows Removed by Filter: 1
 Planning Time: 0.674 ms
 Execution Time: 11806.823 ms

@andrii-suse
Copy link
Author

andrii-suse commented Jan 20, 2022

It would be interesting to see what the result is of using UNION ALL instead of OR

It didn't do significant difference in my case, but I am glad that you are looking at it as well.
But since query takes too long in extreme cases - it can split it into two and delete expired and completed jobs in separate queries.

@kraih kraih closed this as completed in 1230261 Jan 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants