
[5.6] The new blocking pop feature for redis queues can lose jobs in case of worker failures #22939

Closed
sisve opened this issue Jan 27, 2018 · 17 comments

@sisve
Contributor

sisve commented Jan 27, 2018

  • Laravel Version: 5.6
  • PHP Version: irrelevant
  • Database Driver & Version: irrelevant

Description:

The new Redis blocking pop functionality introduced in #22284 is vulnerable to failures between the BLPOP and the ZADD: any worker failure at that point loses the job that was just popped from the queue. The job will never be retried by another worker, since it was never pushed to the reserved queue.

The problematic code is in RedisQueue::blockingPop.

protected function blockingPop($queue)
{
    $rawBody = $this->getConnection()->blpop($queue, $this->blockFor);

    if (! is_null($rawBody)) {
        $payload = json_decode($rawBody[1], true);

        $payload['attempts']++;

        $reserved = json_encode($payload);

        // If the worker dies anywhere between the BLPOP above and the ZADD
        // below, the job exists only in this process and is lost for good.
        $this->getConnection()->zadd($queue.':reserved', [
            $reserved => $this->availableAt($this->retryAfter),
        ]);

        return [$rawBody[1], $reserved];
    }

    return [null, null];
}

@taylorotwell
Member

So what is the solution?

@themsaid added the bug label Jan 29, 2018
@taylorotwell
Member

Just throwing ideas out there, can we use MULTI and EXEC?

@taylorotwell
Member

Apparently not:

[screenshot]
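
For context on why the transaction route falls short: inside MULTI/EXEC commands are only queued until EXEC, so the BLPOP reply is never available to build the ZADD, and the Redis docs note that BLPOP inside a transaction does not block at all (it returns nil immediately when the list is empty). A rough illustration, assuming a phpredis client; all key names are examples only:

// Illustration only: why MULTI/EXEC cannot cover the BLPOP + ZADD pair.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$redis->multi();                                // start queuing commands
$redis->blpop(['queues:default'], 5);           // queued; no payload is returned here
// There is nothing to pass to zadd(), because the popped payload only becomes
// available after exec(). And per the Redis docs, BLPOP will not block inside
// the transaction anyway.
$redis->zadd('queues:default:reserved', time() + 90, '...payload...');
$results = $redis->exec();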

@taylorotwell
Member

Seems like our only option would be to go full Lua script with it.
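
A Lua script does make the pop-and-reserve pair atomic, since the whole script runs as a single unit on the server; the catch is that a script cannot block waiting for a job the way BLPOP does, so Lua alone only covers the non-blocking path. A rough sketch of the atomic pop-and-reserve idea (not the actual Laravel script), assuming a phpredis-style eval() and example key names:

// Sketch only: atomically pop a job and move it onto the reserved set.
$script = <<<'LUA'
local job = redis.call('lpop', KEYS[1])
if job ~= false then
    -- the job is reserved in the same atomic step, so a dying worker
    -- can no longer lose it between the pop and the reserve
    redis.call('zadd', KEYS[2], ARGV[1], job)
end
return job
LUA;

$job = $redis->eval($script, ['queues:default', 'queues:default:reserved', time() + 90], 2);

The real driver would also need to bump the attempts counter inside the reserved payload, which can be done in the same script with cjson.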

@halaei
Contributor

halaei commented Jan 29, 2018

Besides documenting this, as I mentioned here, the fix will need to use the BRPOPLPUSH command. It will add another Redis list key for each queue, it will add one or two extra requests per pop, and it will reverse the order of jobs in the Redis list.

@halaei
Contributor

halaei commented Jan 29, 2018

Please take a look at https://redis.io/commands/rpoplpush#pattern-reliable-queue
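
For reference, the reliable-queue pattern on that page keeps the job on some Redis list at all times: BRPOPLPUSH atomically moves it from the queue to a processing list, and it is only removed from the processing list once it has been handled. A rough sketch of the idea, assuming a connected phpredis client in $redis and example key names:

// Sketch of the reliable-queue pattern: the job always lives on a list.
$job = $redis->brpoplpush('queues:default', 'queues:default:processing', 5);

if ($job !== false) {
    // ... process the job ...

    // Only now is the job taken off the processing list. If the worker dies
    // before this point, the job is still on queues:default:processing and a
    // monitor can push it back onto the queue.
    // (phpredis takes value before count; the raw LREM command is key, count, value.)
    $redis->lrem('queues:default:processing', $job, 1);
}

Note that BRPOPLPUSH pops from the tail of the list, which is why switching to it reverses the job order relative to the current LPOP-based driver.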

@taylorotwell
Member

taylorotwell commented Jan 29, 2018

Ah, hmm, we may need to back this out then if we can't make it reliable. Any thoughts? If we need to use brpoplpush that is fine with me. But, it sounds like reversing the order of jobs in the Redis list is a major breaking change? Or we would document that changing from non-blocking to blocking requires you to drain your queues first so the order is correct.

@taylorotwell
Member

I personally feel like data loss on a queue is going to be critical in basically 99% of applications.

@sisve
Contributor Author

sisve commented Jan 30, 2018

If we change the ordering of the queues we should probably change both the blocking and the non-blocking order. I don't think that we should require people to drain their queues just to try out the new features.

I don't see a huge problem with changing the order of the lists between 5.5 and 5.6. Sure, I can make up some scenarios where it will cause problems, but those involve a mix of Laravel versions pushing and popping to the queue, and not enough workers to ever drain it. And that is a totally different problem anyway.

I think it's enough to document that we change the order of the redis queue in the upgrade guide. It's an implementation detail and only affects people if they have jobs in the queue when upgrading.

@halaei
Contributor

halaei commented Jan 30, 2018

Helper scripts can be provided to do the migration and reverse the orders.
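
Something along these lines could serve as such a helper: pop from one end of the list and push onto the same end of a temporary list, which reverses the order, then swap the keys. A rough sketch, assuming a connected phpredis client in $redis, example key names, and that producers and workers are stopped while it runs:

// Sketch of a one-off migration that reverses the order of an existing queue.
$source = 'queues:default';
$temp   = 'queues:default:migrating';

// RPOP + RPUSH reverses the list: the old tail becomes the new head.
while (($job = $redis->rpop($source)) !== false) {
    $redis->rpush($temp, $job);
}

// Swap the reversed copy back in (skip if the queue was empty).
if ($redis->exists($temp)) {
    $redis->rename($temp, $source);
}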

@taylorotwell
Member

Looks like I'll just need to document this for the 5.6 release, I guess.

@taylorotwell
Member

Added this warning for now:

[screenshot of the added documentation warning]

@halaei do you intend to work on fixing the potential data loss issue or should others try to take a look at it?

@halaei
Contributor

halaei commented Feb 5, 2018

I'll work on it and send a PR in two days.

@sisve
Contributor Author

sisve commented Feb 10, 2018

An update:

  1. The original PR did a check for a configuration setting of blocking_for > 0 to activate this new feature. That meant it was an opt-in feature.
  2. A later commit changed this behavior to check for non-null values, allowing a blocking_for = 0.
  3. Horizon defaults to blocking_for=0 in code if the entry is missing from the configuration. This affects all upgraded installations; I have no numbers on how many there are, and it depends on whether people updated their config/queue.php files to add the blocking_for entry. There is no explicit mention of this in the upgrade guide.

Together, these three factors add up to a growing number of reports of lost or paused jobs in Horizon.

On an unrelated note, if we're using blocking_for=0 (or any high value), how does that work with the SIGTERM handling used to cleanly shut down a worker? I imagine that the signal is sent, but it isn't processed until the BLPOP returns. However, if this takes too long, Supervisor will have issued a SIGKILL to forcefully shut down the worker. (A rough sketch of the signal handling follows the list below.)

  1. If it happens during the BLPOP: does Redis know that the call was interrupted, or would it still pop a job, send it away, and perhaps realize the connection is dead too late? Do we know what happens here?
  2. If this happens between the BLPOP and the ZADD, the job is lost. This is basically a form of the reported problem.
  3. If it happens after ZADD the job is retried later.
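
For reference, the worker registers its handlers roughly like this (a simplified sketch, not the exact Laravel code). With async signals the handler only runs once control is back in PHP, so a worker sitting in a long blocking call may well not react until that call returns, which matches the supposition above:

// Simplified sketch of worker signal handling (not the exact Laravel implementation).
$shouldQuit = false;

pcntl_async_signals(true);

pcntl_signal(SIGTERM, function () use (&$shouldQuit) {
    // The handler only flips a flag; the worker checks it after the current
    // operation. A BLPOP that blocks for a long time therefore delays the
    // graceful shutdown, and Supervisor may follow up with SIGKILL.
    $shouldQuit = true;
});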

There are several questions about the current implementation, but there is a new PR at #23057 that uses BRPOPLPUSH and may fix this. I'm not knowledgeable enough about BRPOPLPUSH to evaluate that approach; do we have any takers who can review that PR and add some input?

@halaei
Contributor

halaei commented Oct 8, 2018

I wrote a Redis module to fix this issue and add some other improvements. Please check lqrm and its PHP driver. Here is an intro digest:

This is a Redis module that implements a Laravel-compatible queue driver for Redis, with the following improvements over the original Laravel 5.7 PHP-Lua driver:
1. Blocking pop is now more reliable than before.
2. Blocking pop now works on delayed and reserved jobs as well.
3. The timer for delayed and reserved jobs is server side, with millisecond precision. This means you don't need to worry about syncing your PHP and Redis servers in case your project is distributed across different servers. Moreover, this makes the retry_after and block_for configurations independent of each other.
4. The Laravel queue can now be made available to other programming languages and frameworks as well. Feel free to port it to your favorite ones.

It would be great if you could give it a try and/or send feedback; it might still have some bugs though.

@matt-allan
Contributor

There is a pretty simple solution that I think will work. It's explained in the redis docs here.

Basically when we push a job we also push a notification key:

MULTI
RPUSH queues:default '{"id":12345,"attempts":1,"job":"say_hello","data":{"name":"Matt"}}'
RPUSH queues:default:notify x
EXEC

The worker does a blocking pop on the notification key. Once the notification key returns an element we know a job is ready. Once a job is ready we use the existing Lua script and get all of the same atomicity guarantees.

LOOP forever
    WHILE EVAL(pop.lua, queues:default, queues:default:reserved, expiration) returns elements
        ... process elements ...
    END
    BRPOP queues:default:notify
END

In this example queues:default:notify is a new list used only for notifying workers a new job was pushed. There would be a new list for each queue using the pattern queues:{{name}}:notify.

There is a chance the worker might crash between popping the notification element and moving the job onto the reserved queue. That is OK: once the worker restarts it will attempt to pop the job again, and nothing important is lost.

The only downside I can think of with this approach is that it will take some care when upgrading. If you update the workers before the producers, they will block waiting for a notification key that is never set. The worker will eventually work the job once block_for times out, so job processing won't stop completely. I think in most cases you would upgrade all of your servers at the same time, so this would be a rare edge case.
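
For completeness, the producer side of that pattern might look roughly like this in PHP (a sketch only, not a final implementation; it assumes a connected phpredis client in $redis and reuses the key names from the example above):

// Sketch of the producer side of the notify-key pattern described above.
// The worker loop stays as in the pseudocode: drain jobs with the atomic Lua
// pop script, then BLPOP on queues:default:notify until a new token arrives.
$payload = '{"id":12345,"attempts":1,"job":"say_hello","data":{"name":"Matt"}}';

$redis->multi();
$redis->rpush('queues:default', $payload);    // the job payload itself
$redis->rpush('queues:default:notify', 'x');  // one notification token per job
$redis->exec();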

@driesvints
Member

This will be fixed in 5.8: #27336

Thanks for everyone's help with this!
