Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lock until_and_while_executing not working as expected #613

Closed
carolinesalib opened this issue Jun 19, 2021 · 9 comments
Closed

Lock until_and_while_executing not working as expected #613

carolinesalib opened this issue Jun 19, 2021 · 9 comments

Comments

@carolinesalib
Copy link

Describe the bug
We are upgrading sidekiq-unique-jobs gem from version 5.0.10 to 7.0.12 and I'm having a hard time trying to figure out if the behaviour change on lock until_and_while_executing is expected.

We have a mutation job that relies on database changes, so we want it to be unique but also not lose any data. Ideally, it would enqueue the first job, the second job (waiting for the first one to finish) and skip all other duplicated jobs.

Expected behavior
Representation of the job on old version + logs:

class DumbJob < ApplicationSidekiqWorker
  sidekiq_options unique: :until_and_while_executing,

  def perform(id)
    Clogger.log("Sleep start")
    sleep(20.seconds)
    Clogger.log("Sleep stop")
  end
end

image

Current behavior
When we try to enqueue multiple duplicated jobs, only one is executed, and we end up losing data. See example:

class DumbJob < ApplicationSidekiqWorker
  sidekiq_options lock: :until_and_while_executing,
                  debug_lua: true,
                  lock_info: true,
                  log_duplicate: true,
                  on_conflict: :log

  def perform(id)
    Clogger.log("Sleep start")
    sleep(20.seconds)
    Clogger.log("Sleep stop")
  end
end

image

I tested different combinations of sidekiq_options + until_and_while_executing lock and all of them had weird side effects. Sometimes it worked fine on the first attempt but on the second attempt the job didn't get enqueued when there was one in progress. Like on this example:

class DumbJob < ApplicationSidekiqWorker
  sidekiq_options lock: :until_and_while_executing
                  # # queue: 'default',
                  # debug_lua: true,
                  # # lock_timeout: 0,
                  # lock_info: false,
                  # # lock_limit: 1,
                  # log_duplicate: true,
                  # on_conflict: :log

  def perform(id)
    Clogger.log("Sleep start")
    sleep(20.seconds)
    Clogger.log("Sleep stop")
  end
end

image

Additional context
Looks like I can get the desired behaviour if I config my job as the following:

class DumbJob < ApplicationSidekiqWorker

  sidekiq_options lock: :until_executed,
                  lock_timeout: 10,
                  lock_ttl: nil,
                  lock_limit: 1

  def perform
    Clogger.log("Sleep start")
    sleep(10.seconds)
    Clogger.log("Sleep stop")
  end
end

Anyway, I'm still curious to understand if there is a bug on until_and_while_executing or maybe I configured something bad on my end. Let me know if you need more context or to test different scenarios.

I appreciate all the work you've been doing on this gem. Thank you so much!

@carolinesalib
Copy link
Author

Update: I couldn't get the expected behaviour with the config below. I don't lose any data but I can't actually schedule jobs and the console doesn't respond until the first one is done. I can record a video I just don't know if is any relevant.

class DumbJob < ApplicationSidekiqWorker

  sidekiq_options lock: :until_executed,
                  lock_timeout: 10,
                  lock_ttl: nil,
                  lock_limit: 1

  def perform
    Clogger.log("Sleep start")
    sleep(10.seconds)
    Clogger.log("Sleep stop")
  end
end

We are still testing different options and getting different outcomes every time, even on different projects (aka myapp on sidekiq-uniq-jobs fork).

@mhenrixon
Copy link
Owner

Sounds like until_executing could be something for you.

It isn't as fancy but admittedly until_and_while_executing is not fully there yet in terms of stability.

It seems to vary greatly based on factors I haven't been able to isolate yet.

For some projects it works great but for projects there are issues.

Your examples above is helpful. I will use them to try and replicate something in tests.

The problem is that my tests tell me everything is working fine and unless I can find a way to replicate your problem in a test I am at a loss for what to do.

The joy and fallacies with multi threading 🙄

@mhenrixon
Copy link
Owner

I found a few persistent issues @carolinesalib, could you give #616 a try and see if it helps you?

@mhenrixon
Copy link
Owner

mhenrixon commented Jun 29, 2021

It is likely, that your problems will at least to some degree go away if you upgrade to https://github.com/mhenrixon/sidekiq-unique-jobs/releases/tag/v7.1.0. I found some very nasty and difficult to find bugs thanks to your detailed reports @carolinesalib

@carolinesalib
Copy link
Author

@mhenrixon Oh, that's great news! I'll give it a try today and I'll let you know 🎉 Thank you!!

@carolinesalib
Copy link
Author

Just to give you an update, I still have problems with until_and_while_executing. As a summary, in case you want to take a second look:

# Job (using myapp on sidekiq-unique-jobs code)
class UntilAndWhileExecuting10Job
  include Sidekiq::Worker

  sidekiq_options lock: :until_and_while_executing, on_conflict: :log

  def perform
    SidekiqUniqueJobs.logger.info("jesus")
    sleep 20
    SidekiqUniqueJobs.logger.info("christ")
  end
end

First time I run the job (and sometimes for some reason I have to rename the job class) it works as expected. Job fb0b85acfd395cd3da8b4d1e starts while 97682aadc0adb38e4e1f4e7c is running, and it gets executed right after 97682aadc0adb38e4e1f4e7c is done.

2021-06-30T15:15:00.360Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=97682aadc0adb38e4e1f4e7c INFO: start
2021-06-30T15:15:00.435Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=97682aadc0adb38e4e1f4e7c uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b DEBUG: Executed unlock.lua in 0ms
2021-06-30T15:15:00.435Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=97682aadc0adb38e4e1f4e7c uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN DEBUG: Executed queue.lua in 0ms
2021-06-30T15:15:00.436Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=97682aadc0adb38e4e1f4e7c uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN DEBUG: Executed lock.lua in 1ms
2021-06-30T15:15:00.436Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=97682aadc0adb38e4e1f4e7c uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN INFO: jesus
2021-06-30T15:15:03.360Z pid=92171 tid=1nln class=UntilAndWhileExecuting10Job jid=fb0b85acfd395cd3da8b4d1e INFO: start
2021-06-30T15:15:08.609Z pid=92171 tid=1nhr uniquejobs=reaper INFO: Nothing to delete; exiting.
2021-06-30T15:15:18.618Z pid=92171 tid=1nhr uniquejobs=reaper INFO: Nothing to delete; exiting.
2021-06-30T15:15:20.441Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=97682aadc0adb38e4e1f4e7c uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN INFO: christ
2021-06-30T15:15:20.442Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=97682aadc0adb38e4e1f4e7c uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN DEBUG: Executed unlock.lua in 0ms
2021-06-30T15:15:20.443Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=97682aadc0adb38e4e1f4e7c elapsed=20.083 INFO: done
2021-06-30T15:15:20.493Z pid=92171 tid=1nln class=UntilAndWhileExecuting10Job jid=fb0b85acfd395cd3da8b4d1e uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b DEBUG: Executed unlock.lua in 1ms
2021-06-30T15:15:20.493Z pid=92171 tid=1nln class=UntilAndWhileExecuting10Job jid=fb0b85acfd395cd3da8b4d1e uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN DEBUG: Executed queue.lua in 0ms
2021-06-30T15:15:20.494Z pid=92171 tid=1nln class=UntilAndWhileExecuting10Job jid=fb0b85acfd395cd3da8b4d1e uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN DEBUG: Executed lock.lua in 1ms
2021-06-30T15:15:20.494Z pid=92171 tid=1nln class=UntilAndWhileExecuting10Job jid=fb0b85acfd395cd3da8b4d1e uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN INFO: jesus

On the second attempt, job 6eb0531db6cd39ffd6b2ad04 starts while 85b50180cc9fbfbd14824f43 is running, but instead of wait the first one to finish, it starts to execute right away and get skipped. So we end up truly executing only the first one.

2021-06-30T15:16:14.608Z pid=92171 tid=1nsz class=UntilAndWhileExecuting10Job jid=85b50180cc9fbfbd14824f43 INFO: start
2021-06-30T15:16:14.612Z pid=92171 tid=1nsz class=UntilAndWhileExecuting10Job jid=85b50180cc9fbfbd14824f43 uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b DEBUG: Executed unlock.lua in 0ms
2021-06-30T15:16:14.612Z pid=92171 tid=1nsz class=UntilAndWhileExecuting10Job jid=85b50180cc9fbfbd14824f43 uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN DEBUG: Executed queue.lua in 0ms
2021-06-30T15:16:14.613Z pid=92171 tid=1nsz class=UntilAndWhileExecuting10Job jid=85b50180cc9fbfbd14824f43 uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN DEBUG: Executed lock.lua in 1ms
2021-06-30T15:16:14.613Z pid=92171 tid=1nsz class=UntilAndWhileExecuting10Job jid=85b50180cc9fbfbd14824f43 uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN INFO: jesus
2021-06-30T15:16:16.115Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=6eb0531db6cd39ffd6b2ad04 INFO: start
2021-06-30T15:16:16.119Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=6eb0531db6cd39ffd6b2ad04 uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b DEBUG: Executed unlock.lua in 1ms
2021-06-30T15:16:16.121Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=6eb0531db6cd39ffd6b2ad04 uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN DEBUG: Executed queue.lua in 1ms
2021-06-30T15:16:16.121Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=6eb0531db6cd39ffd6b2ad04 uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN INFO: Skipping job with id (6eb0531db6cd39ffd6b2ad04) because lock_digest: (uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN) already exists
2021-06-30T15:16:16.121Z pid=92171 tid=1nhv class=UntilAndWhileExecuting10Job jid=6eb0531db6cd39ffd6b2ad04 elapsed=0.007 INFO: done
2021-06-30T15:16:18.645Z pid=92171 tid=1nhr uniquejobs=reaper INFO: Deleting batch with 1 digests
2021-06-30T15:16:18.646Z pid=92171 tid=1nnv INFO: (2021-06-30 09:16:18 -0600) Execution successfully returned 1
2021-06-30T15:16:28.647Z pid=92171 tid=1nhr uniquejobs=reaper INFO: Nothing to delete; exiting.
2021-06-30T15:16:34.614Z pid=92171 tid=1nsz class=UntilAndWhileExecuting10Job jid=85b50180cc9fbfbd14824f43 uniquejobs=server until_and_while_executing=uniquejobs:50c35433fd43577ffb1c3491a446334b:RUN INFO: christ
2021-06-30T15:16:34.615Z pid=92171 tid=1nsz class=UntilAndWhileExecuting10Job jid=85b50180cc9fbfbd14824f43 elapsed=20.007 INFO: done

Maybe I need to try different locks but people at the office are not a fan of changing locks on many jobs. Turns out we use this lock a lot. So what I've been doing is trying to understand better what's going on and maybe write a test that reproduces that.

@mhenrixon
Copy link
Owner

As pointed out in #617, ArturT found a bug. I accidentally inverted the timeout so that 0 means indefinitely instead of not at all like it should be. The bug was fixed in v7.1.1 which might be highly relevant to your scenario.

The second job 6eb0531db6cd39ffd6b2ad04 should behave exactly that way. That is intentional, if you want to retry that job you'd have to pick another conflict strategy. For example raise which will allow your job to be retried gracefully.

@mhenrixon
Copy link
Owner

To achieve what you are after @carolinesalib you'd have to use something like this:

# Job (using myapp on sidekiq-unique-jobs code)
class UntilAndWhileExecuting10Job
  include Sidekiq::Worker

  sidekiq_options lock: :until_and_while_executing, on_conflict: { 
    client: :log, server: :raise
  } 

  def perform
    SidekiqUniqueJobs.logger.info("jesus")
    sleep 20
    SidekiqUniqueJobs.logger.info("christ")
  end
end

@carolinesalib
Copy link
Author

@mhenrixon thanks!! I did some testing and I think that would do ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants