
while_executing + raise lets non-unique jobs be executed #534

Closed
x-yuri opened this issue Sep 23, 2020 · 6 comments
x-yuri commented Sep 23, 2020

Describe the bug
The bug is supposedly fixed here (as of 7.0.0.beta1). I'm documenting it for those still struggling with sidekiq-unique-jobs-6.x; try sidekiq-unique-jobs-7.0.0.beta* instead.

Let's say you queue 3 non-unique jobs simultaneously.

Expected behavior
Only the first job starts; the rest fail. (They are retried later.)

Current behavior
The first and the third jobs start.

Worker class

class MyWorker
  include Sidekiq::Worker
  sidekiq_options lock: :while_executing, on_conflict: :raise

  def perform
    sleep 5 # simulate a long-running job
  end
end

Additional context
I wonder if there's a way for it to not pollute the log with exceptions...

UPD There is:

sidekiq_options lock: :while_executing, lock_timeout: 10, lock_ttl: 1100, on_conflict: :raise

It will wait up to 10 seconds for the other non-unique jobs to finish, i.e. your job might be delayed by up to 10 seconds. If it can't acquire the lock in that time, it raises an exception, the job fails, and Sidekiq retries it later.

From what I can see, timeout defines how long it waits for a lock, and ttl how long it waits for a promise to be resolved. As such, ttl should be greater than timeout, and ttl is somehow specified in... centiseconds?.. O.o So I did:

ttl = (timeout + 1) * 100  # add 1 second just in case
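That rule of thumb can be wrapped in a small helper (`lock_ttl_for` is a hypothetical name; the centisecond interpretation is my observation, not documented API):

```ruby
# Hypothetical helper: derive lock_ttl from lock_timeout under the
# observed centisecond behavior, adding 1 second of slack just in case.
def lock_ttl_for(lock_timeout)
  (lock_timeout + 1) * 100
end

lock_ttl_for(10) # => 1100, the value used in the sidekiq_options above
```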
x-yuri closed this as completed Sep 23, 2020

x-yuri commented Sep 23, 2020

@mhenrixon Can you possibly confirm? Or explain what timeout and ttl affect, and how to choose the values?


mhenrixon commented Sep 27, 2020

Please read https://github.com/mhenrixon/sidekiq-unique-jobs#while-executing

If you configure a ttl, then all duplicate jobs will wait for the configured ttl before they are dropped. TTL is in seconds, not centiseconds. I chose to use pttl under the hood for better precision, while hiding that from the configuration to avoid very large, hard-to-read numbers.


x-yuri commented Sep 27, 2020

@mhenrixon I might be missing something, but let's give it a try (sidekiq-unique-jobs-7.0.0.beta23):

class TransferDataJob
  include Sidekiq::Worker
  sidekiq_options lock: :while_executing, lock_timeout: 10

  def perform(params)
    sleep 5 # simulate a long-running job
  end
end

In pry:

[3] pry(main)> TransferDataJob.perform_async({}); TransferDataJob.perform_async({})
=> "f370374241fb6fdbe01e9759"

sidekiq:

2020-09-27T18:14:55.943Z pid=1723 tid=grjdcev7j class=TransferDataJob
  jid=13e18692fca4a644876feaf7 INFO: start
2020-09-27T18:14:55.945Z pid=1723 tid=grjdcf077 class=TransferDataJob
  jid=f370374241fb6fdbe01e9759 INFO: start
2020-09-27T18:14:57.961Z pid=1723 tid=grjdcev7j class=TransferDataJob
  jid=13e18692fca4a644876feaf7 uniquejobs=server
  while_executing=uniquejobs:44f411c7d971922eec2a3518bbbc2c45:RUN:RUN WARN:
  Timed out after 10s while waiting for primed token
  (digest: uniquejobs:44f411c7d971922eec2a3518bbbc2c45:RUN:RUN, job_id: 13e18692fca4a644876feaf7)
2020-09-27T18:15:00.962Z pid=1723 tid=grjdcf077 class=TransferDataJob
  jid=f370374241fb6fdbe01e9759 elapsed=5.017 INFO: done
2020-09-27T18:15:00.963Z pid=1723 tid=grjdcev7j class=TransferDataJob
  jid=13e18692fca4a644876feaf7 elapsed=5.02 INFO: done

It says it timed out after 10 seconds, but the timestamps show it happened after 2 seconds. By putting a debugging statement into the code, we can see that the value passed to the promise's value() method is 2. config.ttl is taken from the lock_ttl option, which defaults to nil. Now, drift(val) is (val.to_i * CLOCK_DRIFT_FACTOR).to_i + 2, where CLOCK_DRIFT_FACTOR is 0.01; in our case that is (0 * 0.01).to_i + 2 == 2. Since value()'s timeout parameter is in seconds, concurrent-ruby waits 2 seconds.

lock_timeout is used here. Our job takes 5 seconds to complete, but concurrent-ruby waits only 2 seconds and then times out. So the warning is misleading: it timed out after 2 seconds, not 10.
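The drift computation described above can be sketched standalone to make the timeout behavior concrete (this is my reimplementation based on reading sidekiq-unique-jobs 7.0.0.beta23, not the gem's actual source):

```ruby
CLOCK_DRIFT_FACTOR = 0.01

# Mirrors drift(val) as described above: the integer part of
# val * 0.01, plus 2 seconds.
def drift(val)
  (val.to_i * CLOCK_DRIFT_FACTOR).to_i + 2
end

drift(nil) # => 2  (lock_ttl unset: the promise waits only 2 seconds)
drift(10)  # => 2  ((10 * 0.01).to_i == 0)
drift(800) # => 10 ((800 * 0.01).to_i == 8, plus 2)
```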

So we need to increase value()'s timeout to at least 5 seconds. That can be achieved by specifying the lock_ttl option. Let's try sidekiq_options lock_timeout: 10, lock_ttl: 10. The result is the same as before, because (10 * 0.01).to_i + 2 == 2. We want (x * 0.01) + 2 == 10, which gives x = 800. And with sidekiq_options lock_timeout: 10, lock_ttl: 800 we see:

2020-09-27T19:02:56.540Z pid=1824 tid=gq1pjep2w class=TransferDataJob
  jid=10feb5a2b5a73cdc57c38057 INFO: start
2020-09-27T19:02:56.540Z pid=1824 tid=gq1pjencc class=TransferDataJob
  jid=adc77ab191b2a4a8b2b35f57 INFO: start
2020-09-27T19:03:01.552Z pid=1824 tid=gq1pjep2w class=TransferDataJob
  jid=10feb5a2b5a73cdc57c38057 elapsed=5.012 INFO: done
2020-09-27T19:03:06.557Z pid=1824 tid=gq1pjencc class=TransferDataJob
  jid=adc77ab191b2a4a8b2b35f57 elapsed=10.017 INFO: done

Bingo. So 800 is 800 centiseconds. 800 * 0.01 = 8 seconds, (800 * 0.01) + 2 = 10 seconds. Do you still think I'm wrong?
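Inverting that formula gives a quick way to pick lock_ttl for a desired total wait (`ttl_for_wait` is a hypothetical helper based purely on the arithmetic above):

```ruby
# Solve (ttl * 0.01) + 2 == seconds for ttl, per the observed
# centisecond behavior (not documented API).
def ttl_for_wait(seconds)
  (seconds - 2) * 100
end

ttl_for_wait(10) # => 800, matching the lock_ttl that worked above
```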

Speaking of the README:

Unless this job is configured with a lock_timeout: nil or lock_timeout: > 0, ...

Which translates to:

if lock_timeout && lock_timeout <= 0

"Unless" works only with simple conditions. I'd rather say:

If lock_timeout is not specified, or less than or equal to 0, ...

or:

If lock_timeout is not specified, or not positive, ...

or just:

Unless lock_timeout is greater than 0, ...


x-yuri commented Sep 27, 2020

Meanwhile let me reopen this issue...

x-yuri reopened this Sep 27, 2020
mhenrixon commented

There have been a number of fixes in the betas since then. I'd be interested to know whether this is still a problem.

I haven't looked into this particular situation, but I did fix a number of things with the runtime locks (while_executing) since then.

mhenrixon commented

#571 v7.0.1 is highly recommended. I'm happy to reopen the issue and keep it v7-specific if you are still having trouble with v7.0.1.
