
[5.6] The new blocking pop feature for redis queues can lose jobs in case of worker failures #22939

Closed
sisve opened this issue Jan 27, 2018 · 17 comments

@sisve
Contributor

sisve commented Jan 27, 2018

  • Laravel Version: 5.6
  • PHP Version: irrelevant
  • Database Driver & Version: irrelevant

Description:

The new Redis blocking pop functionality introduced in #22284 is vulnerable to failures between the BLPOP and the ZADD: any worker failure at that point loses the job that was just popped from the queue. The job will never be retried by another worker, since it was never pushed to the reserved queue.

The problematic code is in RedisQueue::blockingPop.

protected function blockingPop($queue)
{
    $rawBody = $this->getConnection()->blpop($queue, $this->blockFor);

    if (! is_null($rawBody)) {
        $payload = json_decode($rawBody[1], true);

        $payload['attempts']++;

        $reserved = json_encode($payload);

        // If the worker dies anywhere between the BLPOP above and the ZADD
        // below, the job exists only in this process and is lost for good.
        $this->getConnection()->zadd($queue.':reserved', [
            $reserved => $this->availableAt($this->retryAfter),
        ]);

        return [$rawBody[1], $reserved];
    }

    return [null, null];
}

@taylorotwell
Member

So what is the solution?

@themsaid added the bug label Jan 29, 2018
@taylorotwell
Member

Just throwing ideas out there, can we use MULTI and EXEC?

@taylorotwell
Member

Apparently not:

[screenshot]
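
For context on why the transaction route falls short: inside MULTI/EXEC commands are only queued until EXEC, so the BLPOP reply is never available to build the ZADD, and the Redis docs note that BLPOP inside a transaction does not block at all (it returns nil immediately when the list is empty). A rough illustration, assuming a phpredis client; all key names are examples only:

// Illustration only: why MULTI/EXEC cannot cover the BLPOP + ZADD pair.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$redis->multi();                                // start queuing commands
$redis->blpop(['queues:default'], 5);           // queued; no payload is returned here
// There is nothing to pass to zadd(), because the popped payload only becomes
// available after exec(). And per the Redis docs, BLPOP will not block inside
// the transaction anyway.
$redis->zadd('queues:default:reserved', time() + 90, '...payload...');
$results = $redis->exec();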

@taylorotwell
Member

Seems like our only option would be to go full Lua script with it.
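
A Lua script does make the pop-and-reserve pair atomic, since the whole script runs as a single unit on the server; the catch is that a script cannot block waiting for a job the way BLPOP does, so Lua alone only covers the non-blocking path. A rough sketch of the atomic pop-and-reserve idea (not the actual Laravel script), assuming a phpredis-style eval() and example key names:

// Sketch only: atomically pop a job and move it onto the reserved set.
$script = <<<'LUA'
local job = redis.call('lpop', KEYS[1])
if job ~= false then
    -- the job is reserved in the same atomic step, so a dying worker
    -- can no longer lose it between the pop and the reserve
    redis.call('zadd', KEYS[2], ARGV[1], job)
end
return job
LUA;

$job = $redis->eval($script, ['queues:default', 'queues:default:reserved', time() + 90], 2);

The real driver would also need to bump the attempts counter inside the reserved payload, which can be done in the same script with cjson.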

@halaei
Contributor

halaei commented Jan 29, 2018

Besides documenting this, as I mentioned here, the fix will need to use the BRPOPLPUSH command. It will add another Redis list key for each queue, it will add one or two extra requests per pop, and it will reverse the order of jobs in the Redis list.

@halaei
Contributor

halaei commented Jan 29, 2018

Please take a look at https://redis.io/commands/rpoplpush#pattern-reliable-queue
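
For reference, the reliable-queue pattern on that page keeps the job on some Redis list at all times: BRPOPLPUSH atomically moves it from the queue to a processing list, and it is only removed from the processing list once it has been handled. A rough sketch of the idea, assuming a connected phpredis client in $redis and example key names:

// Sketch of the reliable-queue pattern: the job always lives on a list.
$job = $redis->brpoplpush('queues:default', 'queues:default:processing', 5);

if ($job !== false) {
    // ... process the job ...

    // Only now is the job taken off the processing list. If the worker dies
    // before this point, the job is still on queues:default:processing and a
    // monitor can push it back onto the queue.
    // (phpredis takes value before count; the raw LREM command is key, count, value.)
    $redis->lrem('queues:default:processing', $job, 1);
}

Note that BRPOPLPUSH pops from the tail of the list, which is why switching to it reverses the job order relative to the current LPOP-based driver.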

@taylorotwell
Member

taylorotwell commented Jan 29, 2018

Ah, hmm, we may need to back this out then if we can't make it reliable. Any thoughts? If we need to use brpoplpush that is fine with me. But, it sounds like reversing the order of jobs in the Redis list is a major breaking change? Or we would document that changing from non-blocking to blocking requires you to drain your queues first so the order is correct.

@taylorotwell
Member

I personally feel like data loss on a queue is going to be critical in basically 99% of applications.

@sisve
Contributor Author

sisve commented Jan 30, 2018

If we change the ordering of the queues we should probably change both the blocking and the non-blocking order. I don't think that we should require people to drain their queues just to try out the new features.

I don't see a huge problem with changing the order of the lists between 5.5 and 5.6. Sure, I can make up some scenarios where it will cause problems, but those involve a mix of Laravel versions pushing and popping to the queue, and not enough workers to ever drain it. And that is a totally different problem anyway.

I think it's enough to document that we change the order of the redis queue in the upgrade guide. It's an implementation detail and only affects people if they have jobs in the queue when upgrading.

@halaei
Contributor

halaei commented Jan 30, 2018

Helper scripts can be provided to do the migration and reverse the orders.
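
Something along these lines could serve as such a helper: pop from one end of the list and push onto the same end of a temporary list, which reverses the order, then swap the keys. A rough sketch, assuming a connected phpredis client in $redis, example key names, and that producers and workers are stopped while it runs:

// Sketch of a one-off migration that reverses the order of an existing queue.
$source = 'queues:default';
$temp   = 'queues:default:migrating';

// RPOP + RPUSH reverses the list: the old tail becomes the new head.
while (($job = $redis->rpop($source)) !== false) {
    $redis->rpush($temp, $job);
}

// Swap the reversed copy back in (skip if the queue was empty).
if ($redis->exists($temp)) {
    $redis->rename($temp, $source);
}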

@taylorotwell
Member

Looks like I'll just need to document this for the 5.6 release, I guess.

@taylorotwell
Member

Added this warning for now:

[screenshot of the added documentation warning]

@halaei do you intend to work on fixing the potential data loss issue or should others try to take a look at it?

@halaei
Contributor

halaei commented Feb 5, 2018

I'll work on it and send a PR in two days.

@sisve
Contributor Author

sisve commented Feb 10, 2018

An update:

  1. The original PR did a check for a configuration setting of blocking_for > 0 to activate this new feature. That meant it was an opt-in feature.
  2. A later commit changed this behavior to check for non-null values, allowing a blocking_for = 0.
  3. Horizon defaults to blocking_for=0 in code if the entry is missing from the configuration. This affects all upgraded installations; I have no numbers on how many there are, and it depends on whether people updated their config/queue.php files to add the blocking_for entry. There is no explicit mention of this in the upgrade guide.

Together, these three factors add up to a growing number of reports of lost or paused jobs in Horizon.

On an unrelated note, if we're using blocking_for=0 (or any high value), how does that work with the SIGTERM handling used to cleanly shut down a worker? I imagine that the signal is sent, but it isn't processed until the BLPOP returns. However, if this takes too long, Supervisor will have issued a SIGKILL to forcefully shut down the worker. (A rough sketch of the signal handling follows the list below.)

  1. If it happens during the BLPOP: does Redis know that the call was interrupted, or would it still pop a job, send it away, and perhaps realize the connection is dead too late? Do we know what happens here?
  2. If this happens between the BLPOP and the ZADD, the job is lost. This is basically a form of the reported problem.
  3. If it happens after ZADD the job is retried later.
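
For reference, the worker registers its handlers roughly like this (a simplified sketch, not the exact Laravel code). With async signals the handler only runs once control is back in PHP, so a worker sitting in a long blocking call may well not react until that call returns, which matches the supposition above:

// Simplified sketch of worker signal handling (not the exact Laravel implementation).
$shouldQuit = false;

pcntl_async_signals(true);

pcntl_signal(SIGTERM, function () use (&$shouldQuit) {
    // The handler only flips a flag; the worker checks it after the current
    // operation. A BLPOP that blocks for a long time therefore delays the
    // graceful shutdown, and Supervisor may follow up with SIGKILL.
    $shouldQuit = true;
});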

There are several questions about the current implementation, but there is a new PR at #23057 that uses BRPOPLPUSH and may fix this. I'm not knowledgeable enough about BRPOPLPUSH to evaluate that approach; do we have any takers who can review that PR and add some input?

@halaei
Contributor

halaei commented Oct 8, 2018

I wrote a Redis module to fix this issue and add some other improvements. Please check lqrm and its PHP driver. Here is an intro digest:

This is a Redis module that implements a Laravel-compatible queue driver for Redis, with the following improvements over the original Laravel 5.7 PHP-Lua driver:
1. Blocking pop is now more reliable than before.
2. Blocking pop now works on delayed and reserved jobs as well.
3. The timer for delayed and reserved jobs is server side, with millisecond precision. This means you don't need to worry about syncing your PHP and Redis servers in case your project is distributed across different servers. Moreover, this makes the retry_after and block_for configurations independent of each other.
4. The Laravel queue can now be made available to other programming languages and frameworks as well. Feel free to port it to your favorite ones.

It would be great if you could give it a try and/or send feedback; it might still have some bugs though.

@matt-allan
Contributor

There is a pretty simple solution that I think will work. It's explained in the redis docs here.

Basically when we push a job we also push a notification key:

MULTI
RPUSH queues:default '{"id":12345,"attempts":1,"job":"say_hello","data":{"name":"Matt"}}'
RPUSH queues:default:notify x
EXEC

The worker does a blocking pop on the notification key. Once the notification key returns an element we know a job is ready. Once a job is ready we use the existing Lua script and get all of the same atomicity guarantees.

LOOP forever
    WHILE EVAL(pop.lua, queues:default, queues:default:reserved, expiration) returns elements
        ... process elements ...
    END
    BRPOP queues:default:notify
END

In this example queues:default:notify is a new list used only for notifying workers a new job was pushed. There would be a new list for each queue using the pattern queues:{{name}}:notify.

There is a chance the worker might crash between popping the notification element and moving the job onto the reserved queue. That is OK: once the worker restarts it will attempt to pop the job again, and nothing important is lost.

The only downside I can think of with this approach is that it will take some care when upgrading. If you update the workers before the producers, they will block waiting for a notification key that is never set. The worker will eventually work the job once block_for times out, so job processing won't stop completely. I think in most cases you would upgrade all of your servers at the same time, so this would be a rare edge case.
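
For completeness, the producer side of that pattern might look roughly like this in PHP (a sketch only, not a final implementation; it assumes a connected phpredis client in $redis and reuses the key names from the example above):

// Sketch of the producer side of the notify-key pattern described above.
// The worker loop stays as in the pseudocode: drain jobs with the atomic Lua
// pop script, then BLPOP on queues:default:notify until a new token arrives.
$payload = '{"id":12345,"attempts":1,"job":"say_hello","data":{"name":"Matt"}}';

$redis->multi();
$redis->rpush('queues:default', $payload);    // the job payload itself
$redis->rpush('queues:default:notify', 'x');  // one notification token per job
$redis->exec();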

@driesvints
Member

This will be fixed in 5.8: #27336

Thanks for everyone's help with this!
