
Failed jobs waiting to be retried are not considered when fetching uniqueness #394

Closed
axos88 opened this issue May 15, 2019 · 16 comments · Fixed by #709

@axos88

axos88 commented May 15, 2019

Describe the bug
If a job fails, and is waiting to be retried, adding a new job with similar arguments should trigger the conflict resolution mechanism.

I have a job which, based on its arguments, performs an expensive calculation and saves the result to the database. I would not like to enqueue it multiple times (since its output depends only on its input, and I want to run the calculation as few times as possible), but if it is already being executed, I want to allow it to be enqueued again.
Currently I am using this config:

    sidekiq_options lock: :until_executing, on_conflict: :replace

    # only use the id for uniqueness checking, and ignore the actual data to be aggregated
    def self.unique_args(args)
      [args[0]]
    end

until_executed is not okay, because the job needs to be re-executed if it is enqueued again while it is already running.

Now the issue here is that the lock is removed when the job is scheduled to run, but it is not re-added if the job fails.

Expected behavior
The system should re-add the lock if the job fails, and trigger the deduplication mechanism if a similar job has been enqueued in the meantime. If you think about it, this is not very different from adding a job to the end of the enqueued jobs list, except that it carries some meta-information about the failure and how many times it was tried.
As it stands, not only does the job get duplicated, but the execution order is nondeterministic: sometimes the job that was scheduled later is executed first, and when the job from the retry queue is rescheduled, it is executed as well, overwriting the fresh data with stale data.
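The expected behaviour can be illustrated with a plain-Ruby simulation: a queue and a retry set standing in for Redis, where the uniqueness check considers both. All class and method names here are hypothetical stand-ins, not the gem's actual API:

```ruby
# Simulation of the expected behaviour: uniqueness must consider jobs
# sitting in the retry set, not just the queue.
class UniqueQueue
  attr_reader :queue, :retry_set

  def initialize
    @queue = []      # jobs waiting to run
    @retry_set = []  # failed jobs waiting to be retried
  end

  def push(job, on_conflict: :replace)
    digest = job[:digest]
    if (@queue + @retry_set).any? { |j| j[:digest] == digest }
      case on_conflict
      when :replace
        # drop the stale copy, even if it lives in the retry set,
        # and keep only the freshly pushed payload
        @queue.reject! { |j| j[:digest] == digest }
        @retry_set.reject! { |j| j[:digest] == digest }
        @queue << job
      when :reject
        return false
      end
    else
      @queue << job
    end
    true
  end

  def fail(job)
    @retry_set << job  # a failed job should keep its lock while awaiting retry
  end
end
```

With this behaviour, pushing a duplicate while the original sits in the retry set drops the stale copy and keeps only the fresh payload, so at most one job per digest exists across both structures.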

Current behavior
Jobs get duplicated

@mhenrixon
Owner

That is a really good point! I missed this; of course we need to catch any error and re-lock the job when any type of error is raised. There is a race condition, though: it is possible that another job has been scheduled in the meantime, so it would be up to you to handle that with a conflict resolution.

@mhenrixon mhenrixon self-assigned this May 15, 2019
@mhenrixon mhenrixon added the bug label May 15, 2019
@axos88
Author

axos88 commented May 15, 2019

Yeah. I see two possible ways to solve that:
a) handle the job being added to the retry queue the same way as a new job being added (with the drawback outlined above)
b) reschedule the failed job (by which I mean give the lock to the failed job), and simulate what would happen if the new job were added afterwards (basically the same as removing and re-adding the new job). This could cause issues if somebody holds onto the job id of the new job, which, depending on the implementation, could change or simply go away. The same applies to holding onto the old job: if the new job replaces it, the job with that id goes away while somebody might still be referencing it.
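The job-id caveat in option (b) can be made concrete with a small pure function over the two competing jobs; the JIDs and field names here are made up for illustration:

```ruby
# Option (b) sketch: the failed job reclaims the lock, then the newer job is
# re-applied as if it had been pushed afterwards. Whichever side loses the
# conflict resolution, that JID simply disappears, which is the caveat above.
def reapply(failed, newer, on_conflict: :replace)
  case on_conflict
  when :replace
    # the newer payload wins, so the failed job's JID is gone
    { jid: newer[:jid], args: newer[:args] }
  when :reject
    # the failed job keeps the lock, so the newer JID is gone
    { jid: failed[:jid], args: failed[:args] }
  end
end
```

Either way, any code that stored the losing JID (for status polling, for example) would find that the job no longer exists.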

@mhenrixon
Owner

There is already functionality to both replace and reschedule jobs, but whether a job sits in the retry queue or not doesn't matter except in this particular case. All other lock types keep the job locked until done.

@axos88
Author

axos88 commented Jan 9, 2020

Hi. Is there any progress on this?

@mhenrixon
Owner

Not really @axos88, I've been out most of December due to back pain. Then before Christmas my wife had surgery and I've been taking care of the kids since then. This year I started a couple of new gigs so I haven't had a chance to dig into sidekiq unique jobs for a while.

I'll get there shortly though.

@axos88
Author

axos88 commented Jan 9, 2020

Well I guess it's only upwards from there... Hang tight :)

@axos88
Author

axos88 commented Aug 29, 2020

Hi there. Has there been any progress on this?

@mhenrixon
Owner

Hi @axos88, I did so many bugfixes in 2020 for v7. Which version are you on at the moment?

@axos88-da

Not sure; I'll have to try to upgrade and check if the issue is still present.

@mhenrixon
Owner

Closing due to inactivity! 😁

See #571 for an upgrade to v7.0.1 and report back if you are still experiencing the same problem. I'll get to them one at a time but I need some participation 🤪.

@axos88
Author

axos88 commented Apr 30, 2022

Reviving this old one. This is still an issue on 7.0.4.

Any jobs that are in the retry queue are not considered by the uniqueness check.

Steps to reproduce:

class FailingJob
  include Sidekiq::Worker

  sidekiq_options lock: :until_executing, on_conflict: :replace

  def perform(id)
    raise NotImplementedError
  end
end

Execute the following, for example in a rails console, while the sidekiq worker is running (note that perform takes an id, so one must be passed):

  FailingJob.perform_async(1)
  sleep 1
  FailingJob.perform_async(1)

Expected: the job should not be duplicated.
Actual: it is.

@mhenrixon
Owner

I guess I should at the very least try to put the lock back. I'll look into it.

This type of lock will never be a hundred percent perfect, though, since the lock is already released when the server starts processing.

The only thing I can think of is to maintain my own sidekiq fetcher that handles the queues differently, so that a job like yours would be preferred over a job that came in later when both hold the same lock.

Maybe some ordered queues or something. Interesting problem to solve.

Thank you for the additional information, it was very helpful.

@axos88-da

Yes, putting the lock back in would already help a lot in my case.

@axos88
Author

axos88 commented May 20, 2022

@mhenrixon, I think the exceptions are now swallowed. lib/sidekiq_unique_jobs/lock/until_executing.rb:35 should rethrow the exception, should it not?
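For context, the pattern being asked for is the usual rescue-then-re-raise: re-acquire the lock in the rescue branch, but let the exception propagate so Sidekiq's retry machinery still sees the failure. A minimal self-contained sketch, where the class and in-memory lock storage are simplified stand-ins for the gem's internals:

```ruby
# Sketch of an :until_executing-style lock that re-locks on failure and
# rethrows. The `locks` hash stands in for the Redis-backed lock set.
class UntilExecutingLock
  attr_reader :locks

  def initialize
    @locks = {}
  end

  def lock(digest)
    @locks[digest] = true
  end

  def unlock(digest)
    @locks.delete(digest)
  end

  def execute(digest)
    unlock(digest)  # :until_executing releases the lock as processing starts
    yield
  rescue StandardError
    lock(digest)    # re-acquire so the retry is still covered by uniqueness
    raise           # rethrow so Sidekiq still counts the failure and retries
  end
end
```

Swallowing the exception instead of re-raising would make the job look successful to Sidekiq, so it would never enter the retry set at all.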

@mhenrixon
Owner

@axos88 indeed it should, PR coming up!
