until_and_while_executing and lock_ttl: jobs silently dropped #788

Closed
fernandomm opened this issue Jun 6, 2023 · 3 comments
Comments

@fernandomm

Describe the bug
When using lock_ttl with the until_and_while_executing lock, jobs are silently dropped if the lock expires before the job can be processed.

Expected behavior
It should allow jobs to be processed, even if their uniqueness can no longer be ensured.

Current behavior
Jobs are dropped here.

Worker class

SidekiqUniqueJobs.reflect do |on|
  on.unlock_failed do |job|
    puts "unlock failed #{job.inspect}"
  end
end

class SampleJob
  include Sidekiq::Worker
  
  # Low ttl to make it faster to reproduce the problem
  sidekiq_options lock_ttl: 5.seconds, lock: :until_and_while_executing, on_conflict: { client: :log, server: :reschedule }

  def perform
    puts "Performing"
    sleep 10.seconds
    puts "Done"
  end
end

Run:

SampleJob.perform_async
=> "8f8f9d6c06907277113bbfc0"

Wait 5 seconds before starting a Sidekiq worker. When it starts processing, you will get this:

2023-06-06T09:46:38.605Z pid=14404 tid=x9ave8dnk INFO: Starting processing, hit Ctrl-C to stop
2023-06-06T09:46:38.607Z pid=14404 tid=x9awjx00s class=SampleJob jid=8f8f9d6c06907277113bbfc0 INFO: start
unlock failed {"retry"=>3, "queue"=>"default", "lock_ttl"=>5, "lock"=>"until_and_while_executing", "on_conflict"=>{"client"=>"log", "server"=>"reschedule"}, "args"=>[], "class"=>"SampleJob", "jid"=>"8f8f9d6c06907277113bbfc0", "created_at"=>1686044788.863775, "lock_timeout"=>0, "lock_prefix"=>"uniquejobs", "lock_args"=>[], "lock_digest"=>"uniquejobs:36ac16731f732f365a70a02cbb3e7b70", "enqueued_at"=>1686044788.882391}
2023-06-06T09:46:38.695Z pid=14404 tid=x9awjx00s class=SampleJob jid=8f8f9d6c06907277113bbfc0 elapsed=0.088 INFO: done

If the job is processed before the lock expires, it works as expected:

SampleJob.perform_async
=> "f7f7b5477b5cb9a8457e8d2c"
2023-06-06T09:46:43.703Z pid=14404 tid=x9awjx00s class=SampleJob jid=f7f7b5477b5cb9a8457e8d2c INFO: start
Performing
Done
2023-06-06T09:46:53.893Z pid=14404 tid=x9awjx00s class=SampleJob jid=f7f7b5477b5cb9a8457e8d2c elapsed=10.191 INFO: done

At the moment, the only workaround I found was:

SidekiqUniqueJobs.reflect do |on|
  on.unlock_failed do |job|
    # Re-enqueue the job so it isn't silently lost when its lock has already expired
    job['class'].constantize.set(queue: job['queue']).perform_async(*job['args'])
  end
end
@mhenrixon
Owner

I get your problem, but if I allowed that, there would be no point in having the gem. I mean, this gem is about preventing duplicate jobs.

Given a TTL, there is simply no way to sort this out: when the TTL expires, the lock does too.

If you want to make sure no job is lost, I would recommend removing the TTL, delegating to the reaper for cleanup, and using a conflict resolution strategy to ensure the job isn't ignored.
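
For concreteness, a minimal sketch of what that could look like with the worker from above (only an assumption of the intended setup; the on_conflict strategies are the ones already used in this issue):

class SampleJob
  include Sidekiq::Worker

  # No lock_ttl: the lock is released when the job finishes,
  # and orphaned locks are left to the reaper to clean up
  sidekiq_options lock: :until_and_while_executing,
                  on_conflict: { client: :log, server: :reschedule }

  def perform
    puts "Performing"
  end
end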

@fernandomm
Author

But for the until_and_while_executing lock, isn't it better to allow the job to run and only drop it (or apply the defined conflict strategy) if there is another job running?

I mean, the lock for this job expired during the "until_executing" phase. I guess that trying to run it won't cause problems because the "while_executing" part will ensure that it's unique.

If you want to make sure no job is lost, I would recommend removing the TTL, delegating to the reaper for cleanup, and using a conflict resolution strategy to ensure the job isn't ignored.

We previously tried that, but at least for our use case, with thousands of jobs enqueued/scheduled, the reaper is too slow to be effective. When we force it to run by adjusting timeouts/count, it takes 12+ hours to check all keys.
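
(The "timeouts/count" mentioned above refers to the reaper settings. A sketch of how they can be tuned, assuming the configuration keys documented in the gem's README; the values are only illustrative:)

SidekiqUniqueJobs.configure do |config|
  config.reaper          = :ruby  # orphan reaper implementation (:ruby or :lua)
  config.reaper_count    = 1000   # stop reaping after this many keys per run
  config.reaper_interval = 600    # seconds between reaper runs
  config.reaper_timeout  = 10     # seconds before a reaper run is aborted
end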

@mhenrixon
Owner

Reaper performance will be improved by only checking digests/locks known to have a chance of expiring.

It was brought to my attention that Sidekiq processes only check in every 10 seconds; during that window, jobs might appear to be missing, and I was cleaning up locks that were still active.

This should be fixed in the main branch, but I need help testing my assumptions. #830, in particular, was affecting you @fernandomm.
