
while_executing + raise lets non-unique jobs be executed #534

Closed
x-yuri opened this issue Sep 23, 2020 · 6 comments
x-yuri commented Sep 23, 2020

Describe the bug
The bug is supposedly fixed here (as of 7.0.0.beta1). I'm documenting it for those still struggling with sidekiq-unique-jobs-6.x; try sidekiq-unique-jobs-7.0.0.beta* instead.

Let's say you queue 3 non-unique jobs simultaneously.

Expected behavior
Only the first job starts; the rest fail. (They are retried later.)

Current behavior
The first and the third jobs start.

Worker class

class MyWorker
  include Sidekiq::Worker
  sidekiq_options lock: :while_executing, on_conflict: :raise

  def perform
    sleep 5 # simulate a long-running job
  end
end

Additional context
I wonder if there's a way for it to not pollute the log with exceptions...

UPD There is:

sidekiq_options lock: :while_executing, lock_timeout: 10, lock_ttl: 1100, on_conflict: :raise

It will wait up to 10 seconds for the other non-unique jobs to finish, i.e. your job might be delayed by up to 10 seconds. If it can't acquire the lock in that time, it raises an exception, the job fails, and Sidekiq retries it later.

From what I can see, timeout defines how long it waits for a lock, and ttl how long it waits for a promise to be resolved. As such, ttl should be greater than timeout, and ttl is somehow specified in... centiseconds?.. O.o So I did:

ttl = (timeout + 1) * 100  # add 1 second just in case
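That rule of thumb can be wrapped in a small helper (`lock_ttl_for` is a hypothetical name; the centisecond interpretation is my observation, not documented API):

```ruby
# Hypothetical helper: derive lock_ttl from lock_timeout under the
# observed centisecond behavior, adding 1 second of slack just in case.
def lock_ttl_for(lock_timeout)
  (lock_timeout + 1) * 100
end

lock_ttl_for(10) # => 1100, the value used in the sidekiq_options above
```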
x-yuri closed this as completed Sep 23, 2020

x-yuri commented Sep 23, 2020

@mhenrixon Can you possibly confirm? Or explain what timeout and ttl affect, and how to choose the values?


mhenrixon commented Sep 27, 2020

Please read https://github.com/mhenrixon/sidekiq-unique-jobs#while-executing

If you configure a ttl, then all duplicate jobs will wait for the configured ttl before they are dropped. TTL is in seconds, not centiseconds. I chose to use pttl under the hood for better precision, while hiding that from the configuration to avoid very large, hard-to-read numbers.


x-yuri commented Sep 27, 2020

@mhenrixon I might be missing something, but let's give it a try (sidekiq-unique-jobs-7.0.0.beta23):

class TransferDataJob
  include Sidekiq::Worker
  sidekiq_options lock: :while_executing, lock_timeout: 10

  def perform(params)
    sleep 5 # simulate a long-running job
  end
end

In pry:

[3] pry(main)> TransferDataJob.perform_async({}); TransferDataJob.perform_async({})
=> "f370374241fb6fdbe01e9759"

sidekiq:

2020-09-27T18:14:55.943Z pid=1723 tid=grjdcev7j class=TransferDataJob
  jid=13e18692fca4a644876feaf7 INFO: start
2020-09-27T18:14:55.945Z pid=1723 tid=grjdcf077 class=TransferDataJob
  jid=f370374241fb6fdbe01e9759 INFO: start
2020-09-27T18:14:57.961Z pid=1723 tid=grjdcev7j class=TransferDataJob
  jid=13e18692fca4a644876feaf7 uniquejobs=server
  while_executing=uniquejobs:44f411c7d971922eec2a3518bbbc2c45:RUN:RUN WARN:
  Timed out after 10s while waiting for primed token
  (digest: uniquejobs:44f411c7d971922eec2a3518bbbc2c45:RUN:RUN, job_id: 13e18692fca4a644876feaf7)
2020-09-27T18:15:00.962Z pid=1723 tid=grjdcf077 class=TransferDataJob
  jid=f370374241fb6fdbe01e9759 elapsed=5.017 INFO: done
2020-09-27T18:15:00.963Z pid=1723 tid=grjdcev7j class=TransferDataJob
  jid=13e18692fca4a644876feaf7 elapsed=5.02 INFO: done

It says it timed out after 10 seconds, but the timestamps show it happened after 2 seconds. By putting a debugging statement into the code, we can see that the value passed to the promise's value() method is 2. config.ttl is taken from the lock_ttl option, which defaults to nil. Now, drift(val) is (val.to_i * CLOCK_DRIFT_FACTOR).to_i + 2, where CLOCK_DRIFT_FACTOR is 0.01; in our case that is (0 * 0.01).to_i + 2 == 2. Since value()'s timeout parameter is in seconds, concurrent-ruby waits 2 seconds.

lock_timeout is used here. Our job takes 5 seconds to complete, but concurrent-ruby waits only 2 seconds and then times out. So the warning is misleading: it timed out after 2 seconds, not 10.
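The drift computation described above can be sketched standalone to make the timeout behavior concrete (this is my reimplementation based on reading sidekiq-unique-jobs 7.0.0.beta23, not the gem's actual source):

```ruby
CLOCK_DRIFT_FACTOR = 0.01

# Mirrors drift(val) as described above: the integer part of
# val * 0.01, plus 2 seconds.
def drift(val)
  (val.to_i * CLOCK_DRIFT_FACTOR).to_i + 2
end

drift(nil) # => 2  (lock_ttl unset: the promise waits only 2 seconds)
drift(10)  # => 2  ((10 * 0.01).to_i == 0)
drift(800) # => 10 ((800 * 0.01).to_i == 8, plus 2)
```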

So we need to increase value()'s timeout to at least 5 seconds. That can be achieved by specifying the lock_ttl option. Let's try sidekiq_options lock_timeout: 10, lock_ttl: 10. The result is the same as before, because (10 * 0.01).to_i + 2 == 2. We want (x * 0.01) + 2 == 10, which gives x = 800. And with sidekiq_options lock_timeout: 10, lock_ttl: 800 we see:

2020-09-27T19:02:56.540Z pid=1824 tid=gq1pjep2w class=TransferDataJob
  jid=10feb5a2b5a73cdc57c38057 INFO: start
2020-09-27T19:02:56.540Z pid=1824 tid=gq1pjencc class=TransferDataJob
  jid=adc77ab191b2a4a8b2b35f57 INFO: start
2020-09-27T19:03:01.552Z pid=1824 tid=gq1pjep2w class=TransferDataJob
  jid=10feb5a2b5a73cdc57c38057 elapsed=5.012 INFO: done
2020-09-27T19:03:06.557Z pid=1824 tid=gq1pjencc class=TransferDataJob
  jid=adc77ab191b2a4a8b2b35f57 elapsed=10.017 INFO: done

Bingo. So 800 is 800 centiseconds. 800 * 0.01 = 8 seconds, (800 * 0.01) + 2 = 10 seconds. Do you still think I'm wrong?
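Inverting that formula gives a quick way to pick lock_ttl for a desired total wait (`ttl_for_wait` is a hypothetical helper based purely on the arithmetic above):

```ruby
# Solve (ttl * 0.01) + 2 == seconds for ttl, per the observed
# centisecond behavior (not documented API).
def ttl_for_wait(seconds)
  (seconds - 2) * 100
end

ttl_for_wait(10) # => 800, matching the lock_ttl that worked above
```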

Speaking of the README:

Unless this job is configured with a lock_timeout: nil or lock_timeout: > 0, ...

Which translates to:

if lock_timeout && lock_timeout <= 0

"Unless" works only with simple conditions. I'd rather say:

If lock_timeout is not specified, or less than or equal to 0, ...

or:

If lock_timeout is not specified, or not positive, ...

or just:

Unless lock_timeout is greater than 0, ...


x-yuri commented Sep 27, 2020

Meanwhile let me reopen this issue...

x-yuri reopened this Sep 27, 2020
mhenrixon commented

There have been a number of fixes in the betas since then. I'd be interested to know whether this is still a problem.

I haven't looked into this particular situation, but I did fix a number of things with the runtime locks (while_executing) since then.

mhenrixon commented

#571 v7.0.1 is highly recommended. I'm happy to reopen the issue and keep it v7-specific if you are still having trouble with v7.0.1.
