Job never requeued after raising unhandled error with until_and_while_executing? #322

Closed
aharpervc opened this issue Aug 9, 2018 · 5 comments


aharpervc commented Aug 9, 2018

This is with sidekiq 5.1.3 & sidekiq-unique-jobs 6.0.6

Describe the bug

It appears that if a worker with lock: :until_and_while_executing throws an exception, the job is never able to run again.

Expected behavior

I'm not 100% sure of my expectations, but I think the lock should be released if the job fails, so the job can be re-run later.

Current behavior

The first TestWorker.perform_async is queued, runs, and throws an exception, but subsequent calls do not run the job.

Worker class

class TestWorker
  include Sidekiq::Worker

  sidekiq_options retry: false,
                  lock: :until_and_while_executing,
                  log_duplicate_payload: true

  def perform(args)
    warn "executed #{jid} with #{args}"
    raise "oops: #{jid}"
  end
end

Additional context

In a rails console with a clean Redis instance and the above worker, run TestWorker twice (precise timing doesn't matter):

[10] pry(main)> TestWorker.perform_async 1
=> "9a8ce18dc559b505e96bb337"
[11] pry(main)> TestWorker.perform_async 1
=> "d26847f256eed30ff551d8d4"

sidekiq log:

2018-08-09T22:48:13.173Z 3053 TID-oxywnrq6x TestWorker JID-9a8ce18dc559b505e96bb337 INFO: start
executed 9a8ce18dc559b505e96bb337 with 1
2018-08-09T22:48:13.460Z 3053 TID-oxywnrq6x TestWorker JID-9a8ce18dc559b505e96bb337 INFO: fail: 0.287 sec
2018-08-09T22:48:13.461Z 3053 TID-oxywnrq6x WARN: {"context":"Job raised exception","job":{"class":"TestWorker","args":[1],"retry":false,"queue":"default","lock":"until_and_while_executing","log_duplicate_payload":true,"jid":"9a8ce18dc559b505e96bb337","created_at":1533854893.150855,"lock_timeout":0,"lock_expiration":null,"unique_prefix":"uniquejobs","unique_args":[1],"unique_digest":"uniquejobs:ec6e24f1b3e8d7b96486e4a4ef276ee0:RUN","enqueued_at":1533854893.172593},"jobstr":"{\"class\":\"TestWorker\",\"args\":[1],\"retry\":false,\"queue\":\"default\",\"lock\":\"until_and_while_executing\",\"log_duplicate_payload\":true,\"jid\":\"9a8ce18dc559b505e96bb337\",\"created_at\":1533854893.150855,\"lock_timeout\":0,\"lock_expiration\":null,\"unique_prefix\":\"uniquejobs\",\"unique_args\":[1],\"unique_digest\":\"uniquejobs:ec6e24f1b3e8d7b96486e4a4ef276ee0\",\"enqueued_at\":1533854893.172593}"}
2018-08-09T22:48:13.461Z 3053 TID-oxywnrq6x WARN: RuntimeError: oops: 9a8ce18dc559b505e96bb337

(...snipped the stack trace...)



2018-08-09T22:48:51.925Z 3053 TID-oxywnrnkp TestWorker JID-d26847f256eed30ff551d8d4 INFO: start
2018-08-09T22:48:51.929Z 3053 TID-oxywnrnkp TestWorker JID-d26847f256eed30ff551d8d4 INFO: done: 0.004 sec

Note how the first attempt (correctly) throws the exception, but the second attempt never gets that far.

@mhenrixon (Owner)

Unless you specify retry: false, Sidekiq will retry your jobs 25 times. There is no way of really knowing where a job is without searching for it. If I removed the lock and the job was later retried, you would end up with an infinite number of duplicate jobs, because they would all be sitting on the retry queue.

I believe that is a situation you would want to avoid, so the locks need to stay in place in the case of an error. Most people do not disable retries (since it has to be done explicitly). If you want to clear the lock when the job dies, please see: https://github.com/mhenrixon/sidekiq-unique-jobs#cleanup-dead-locks

aharpervc (Author) commented Aug 10, 2018

As you can see, I did specify retry: false, so I am not expecting the job to be automatically re-run.

The case I'm concerned about is when a job throws an exception (for external reasons), the external problem is later fixed, and the job is resubmitted but rejected (due to a lock?). I don't think the lock ought to be held if the job throws an exception in this case, because the implication of an :until_and_while_executing lock is that if the job is neither pending nor executing, there is no lock. I think an errored job should qualify for that case.

I understand the concern that the lock should be kept while Sidekiq automatically retries the job; however, with retry: false, or once Sidekiq has exhausted its retries, the lock may still be held forever. That seems bad.


Also, this is how v5 worked, so I consider this change a regression. Here's the behavior with 5.0.10:

class TestWorker
  include Sidekiq::Worker

  sidekiq_options retry: false,
                  unique: :until_and_while_executing, # <-- switched back to `unique`
                  log_duplicate_payload: true

  def perform(args)
    warn "executed #{jid} with #{args}"
    raise "oops: #{jid}"
  end
end
[1] pry(main)> TestWorker.perform_async 1
=> "87a9f468cd6d1ffe6bb08ab0"
[2] pry(main)> TestWorker.perform_async 1
=> "b9e9225e4f50d93a1af3c86f"
2018-08-10T13:34:01.187Z 689 TID-oxaoftool TestWorker JID-87a9f468cd6d1ffe6bb08ab0 INFO: start
executed 87a9f468cd6d1ffe6bb08ab0 with 1
2018-08-10T13:34:01.457Z 689 TID-oxaoftool TestWorker JID-87a9f468cd6d1ffe6bb08ab0 INFO: fail: 0.269 sec
2018-08-10T13:34:01.457Z 689 TID-oxaoftool WARN: {"context":"Job raised exception","job":{"class":"TestWorker","args":[1],"retry":false,"queue":"default","unique":"until_and_while_executing","log_duplicate_payload":true,"jid":"87a9f468cd6d1ffe6bb08ab0","created_at":1533908040.592081,"unique_prefix":"uniquejobs","unique_args":[1],"unique_digest":"uniquejobs:ec6e24f1b3e8d7b96486e4a4ef276ee0","enqueued_at":1533908041.18694},"jobstr":"{\"class\":\"TestWorker\",\"args\":[1],\"retry\":false,\"queue\":\"default\",\"unique\":\"until_and_while_executing\",\"log_duplicate_payload\":true,\"jid\":\"87a9f468cd6d1ffe6bb08ab0\",\"created_at\":1533908040.592081,\"unique_prefix\":\"uniquejobs\",\"unique_args\":[1],\"unique_digest\":\"uniquejobs:ec6e24f1b3e8d7b96486e4a4ef276ee0\",\"enqueued_at\":1533908041.18694}"}
2018-08-10T13:34:01.460Z 689 TID-oxaoftool WARN: RuntimeError: oops: 87a9f468cd6d1ffe6bb08ab0

(...snip...)

2018-08-10T13:34:07.028Z 689 TID-oxaoft79t TestWorker JID-b9e9225e4f50d93a1af3c86f INFO: start
executed b9e9225e4f50d93a1af3c86f with 1
2018-08-10T13:34:07.034Z 689 TID-oxaoft79t TestWorker JID-b9e9225e4f50d93a1af3c86f INFO: fail: 0.006 sec
2018-08-10T13:34:07.035Z 689 TID-oxaoft79t WARN: {"context":"Job raised exception","job":{"class":"TestWorker","args":[1],"retry":false,"queue":"default","unique":"until_and_while_executing","log_duplicate_payload":true,"jid":"b9e9225e4f50d93a1af3c86f","created_at":1533908047.026419,"unique_prefix":"uniquejobs","unique_args":[1],"unique_digest":"uniquejobs:ec6e24f1b3e8d7b96486e4a4ef276ee0","enqueued_at":1533908047.02765},"jobstr":"{\"class\":\"TestWorker\",\"args\":[1],\"retry\":false,\"queue\":\"default\",\"unique\":\"until_and_while_executing\",\"log_duplicate_payload\":true,\"jid\":\"b9e9225e4f50d93a1af3c86f\",\"created_at\":1533908047.026419,\"unique_prefix\":\"uniquejobs\",\"unique_args\":[1],\"unique_digest\":\"uniquejobs:ec6e24f1b3e8d7b96486e4a4ef276ee0\",\"enqueued_at\":1533908047.02765}"}
2018-08-10T13:34:07.038Z 689 TID-oxaoft79t WARN: RuntimeError: oops: b9e9225e4f50d93a1af3c86f

aharpervc (Author) commented Aug 10, 2018

edit 2:

retry: 0 (not false) plus a death handler that deletes the lock is a functional workaround. However, I still think the described issue is a concerning regression that should be investigated.
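
For reference, a minimal sketch of that working combination (only the retry value changes compared to the worker above; the death handler itself is the initializer shown further down):

class TestWorker
  include Sidekiq::Worker

  # retry: 0 still routes the failed job through Sidekiq's death handlers,
  # whereas retry: false bypasses them entirely.
  sidekiq_options retry: 0,
                  lock: :until_and_while_executing,
                  log_duplicate_payload: true

  def perform(args)
    warn "executed #{jid} with #{args}"
    raise "oops: #{jid}"
  end
end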

Back on 6.0.6, I tried adding a death handler to manually clean up the lock per your recommendation; however, it does not seem to be called when the job dies. (Should this be a separate GH issue?)

Here's my initializer, which I have confirmed is otherwise working. I expect to see something in the logs or my breakpoint hit, but nothing seems to happen.

Sidekiq.configure_server do |config|
  config.death_handlers << -> (job, _ex) do
    puts '*' * 100
    puts job.inspect
    puts _ex.inspect
    byebug
    SidekiqUniqueJobs::Digests.del(digest: job['unique_digest']) if job['unique_digest']
  end
end

@mhenrixon (Owner)

@aharpervc it is not a regression; it is a bug that got fixed. If you read my initial reply, you will see that the only reason your scenario worked the way you wanted in v5 was that any number of duplicates was allowed in those same scenarios. The lock was ALWAYS removed, even when the job was retried, meaning an unlimited number of copies of the same job could potentially execute simultaneously.

Surely you can't mean fixing that bug is a regression?

aharpervc (Author) commented Aug 10, 2018

Surely you can't mean fixing that bug is a regression?

Ha, I'm never going to be mad about fixed bugs. I meant that in v6, until_and_while_executing keeps its lock(s) even when the job is neither queued nor executing, which is different from v5, and that is very surprising. It is particularly concerning with retry: false, since Sidekiq will not fire a death handler in that case.

I have a proposal: for retry: false and lock: :until_and_while_executing, interpret an exception as a signal to release the lock. That seems pretty safe, right?
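
To illustrate what I mean (purely hypothetical, not an existing API of the gem; it only reuses the SidekiqUniqueJobs::Digests.del call from my initializer above), something like a server middleware could do it:

# Hypothetical sketch of the proposal: drop the unique digest when a
# non-retryable job raises, then let the exception propagate as usual.
class ReleaseLockOnError
  def call(_worker, job, _queue)
    yield
  rescue StandardError
    # Only for jobs that will never be retried; otherwise the lock must stay.
    if job['retry'] == false && job['unique_digest']
      SidekiqUniqueJobs::Digests.del(digest: job['unique_digest'])
    end
    raise
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add ReleaseLockOnError
  end
end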

Edit: lock_expiration does apply here, so that's another option; as long as it is set, the lock(s) will eventually expire.
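
A minimal sketch of that alternative (the 10-minute value is only an example and would need tuning per job):

  sidekiq_options retry: false,
                  lock: :until_and_while_executing,
                  lock_expiration: 10 * 60 # seconds; an orphaned lock expires on its own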
