until_and_while_executing not entering perform method on initial run #824

tangopium · 2024-01-02T13:09:39Z

If I have a worker configured to be locked until_and_while_executing and I schedule a worker while sidekiq is not running and then start sidekiq, the worker seems to start and finish (in the console), but never enters the "perform" method. I noticed that if I turn of the reaper, this issue doesn't happen. I'm starting sidekiq immediately after I scheduled the job (so it's not eliminated because if the TTL)

class TestWorker
    include Sidekiq::Worker

    sidekiq_options lock: :until_and_while_executing, on_conflict: :log

    def initialize
        puts "Initialize test worker"
        super
    end

    def perform
        puts "Test worker"
    end
end

The logs look like the following:

2024-01-02T13:05:41.150Z pid=1 tid=64l INFO: Booted Rails 7.1.2 application in development environment
2024-01-02T13:05:41.150Z pid=1 tid=64l INFO: Running in ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux]
2024-01-02T13:05:41.150Z pid=1 tid=64l INFO: See LICENSE and the LGPL-3.0 for licensing details.
2024-01-02T13:05:41.150Z pid=1 tid=64l INFO: Upgrade to Sidekiq Pro for more features and support: https://sidekiq.org
2024-01-02T13:05:41.151Z pid=1 tid=64l INFO: Sidekiq 7.2.0 connecting to Redis with options {:size=>40, :pool_name=>"default", :url=>"redis://redis:6379/2"}
2024-01-02T13:05:41.169Z pid=1 tid=64l uniquejobs=reaper INFO: Starting Reaper
2024-01-02T13:05:41.170Z pid=1 tid=64l INFO: Starting processing, hit Ctrl-C to stop
2024-01-02T13:05:41.175Z pid=1 tid=11bd class=TestWorker jid=a5e3aac198d488050bb5fb95 INFO: start
2024-01-02T13:05:41.182Z pid=1 tid=10mx uniquejobs=reaper INFO: Nothing to delete; exiting.
2024-01-02T13:05:41.185Z pid=1 tid=10mx uniquejobs=reaper INFO: Deleting batch with 1 digests
2024-01-02T13:05:41.186Z pid=1 tid=10mx INFO: (2024-01-02 14:05:41 +0100) Execution successfully returned 1
Initialize test worker
2024-01-02T13:05:41.228Z pid=1 tid=11bd class=TestWorker jid=a5e3aac198d488050bb5fb95 elapsed=0.053 INFO: done

This does not happen with for example until_executed

In some rare cases the worker is running through:

2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: Booted Rails 7.1.2 application in development environment
2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: Running in ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux]
2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: See LICENSE and the LGPL-3.0 for licensing details.
2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: Upgrade to Sidekiq Pro for more features and support: https://sidekiq.org
2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: Sidekiq 7.2.0 connecting to Redis with options {:size=>40, :pool_name=>"default", :url=>"redis://redis:6379/2"}
2024-01-02T13:20:59.496Z pid=1 tid=64l uniquejobs=reaper INFO: Starting Reaper
2024-01-02T13:20:59.496Z pid=1 tid=64l INFO: Starting processing, hit Ctrl-C to stop
2024-01-02T13:20:59.502Z pid=1 tid=11s1 class=TestWorker jid=37ca7144dc92870517af7293 INFO: start
2024-01-02T13:20:59.510Z pid=1 tid=10mx uniquejobs=reaper INFO: Nothing to delete; exiting.
2024-01-02T13:20:59.511Z pid=1 tid=10mx uniquejobs=reaper INFO: Nothing to delete; exiting.
Initialize test worker
Test worker
2024-01-02T13:20:59.561Z pid=1 tid=11s1 class=TestWorker jid=37ca7144dc92870517af7293 elapsed=0.059 INFO: done

I looks to me like a raise condition.

Sidenote: I went down to Sidekiq 6.5.12 and Sidekiq Unique Jobs 7.1.12 and was able to reproduce this bug.

The text was updated successfully, but these errors were encountered:

mhenrixon · 2024-01-02T16:57:35Z

This is the weirdest issue I have come across so far.

I had this happen once in a blank project (example app in this repository).

Fixed it and haven't been able yo replicate it since.

Other people did have this and then I didnt hear from them again.

tangopium · 2024-01-03T08:14:50Z

I was digging deeper into this issue. The following occurs:

Sidekiq Worker starts
Worker picks up a job and removes it from the queue
Job starts initializing
Reaper gets started
Reaper finds lock and check if it is an orphan or expired
Reaper doesn't find the corresponding job, considers it an orphan and removes it
Worker doesn't enter perform method because the lock is missing and finishes (I couldn't verify this step in the code, but thats what I can observ)
About 10 seconds later, Sidekiq Worker updates status of job and puts it to the "Processing queue"

The main problem here, is that the Sidekiq worker pushes the Job status with a big delay. It takes up to 10 seconds until a job appears as "processing" (see the Sidekiq source here & here). This leads to a situation where job "disappears" for that period.

By the way, this can affect even jobs during normal runtime. If a worker picks them up and the reaper runs in the 10 seconds delay time, we are in the same situation.

I already thought about if this is a design error on Sidekiqs side, but I guess they are using the heartbeat method to reduce pressure on the DB and allow very high throughputs.

A possible solution I see here: the heartbeat timing is hardcoded in Sidekiq, so what we could do, is to maintain (supposedly) orphaned locks for at least 10 seconds in memory and run a re-check after that time. If they are still orphans, they can be removed. What do you think?

tangopium · 2024-01-03T10:35:56Z

I prepared a simple dockerized repository to reproduce the error: https://github.com/tangopium/sidekiq-delay-issue

DmitryRibalka · 2024-01-05T19:49:46Z

@tangopium i can confirm the issue. I have a task with the same options as yours example.
Mine is run by cron every 10 min and duration can be 10 sec or even 1 hr.
Instead of run it every time i see this log:

19:00:03 sidekiq.1 \| 2024-01-05T19:00:03.213Z pid=39 tid=9oj uniquejobs=client until_and_while_executing=uniquejobs:fa970d55f6eb3db2764e6a05573900fe INFO: Skipping job with id (bece4b439233f82bc547866d) because lock_digest: (uniquejobs:fa970d55f6eb3db2764e6a05573900fe) already exists

also i see in the web ui at 'sidekiq/locks' the lock which is broken i think. it started hours ago. why it is not released have no idea

sidekiq 7.2, ruby 3.2

DmitryRibalka · 2024-01-05T20:54:35Z

I double checked the config and found these line missing:

Sidekiq.configure_server do |config|

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

and looks like the lock exipres as it should

tangopium · 2024-01-05T22:02:23Z

@DmitryRibalka thanks for sharing your experience. The error you faced doesn't look like the one I described and is probably caused by the missing configuration you've mentioned

@mhenrixon Any update regarding my question? Did you have time to look into it?

mhenrixon · 2024-01-06T11:02:37Z

Thank you, @tangopium! This makes it a little more challenging.

So we need a queue for these situations with a timestamp to compare with.

Fantastic research, and it makes sense.

I will have a look at optimizing a few places as well. I am primarily using Lua already for that reason.

I haven't looked yet because I feel beat about rechecking this issue, but you gave me hope.

tangopium mentioned this issue Jan 3, 2024

Jobs disappears for 10 seconds when Worker is picking it up sidekiq/sidekiq#6153

Closed

mhenrixon added the bug label Feb 7, 2024

mhenrixon mentioned this issue Feb 7, 2024

fix: ensure a new lock isn't conflicting with itself #830

Merged

mhenrixon closed this as completed in #830 Feb 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

until_and_while_executing not entering perform method on initial run #824

until_and_while_executing not entering perform method on initial run #824

tangopium commented Jan 2, 2024 •

edited

mhenrixon commented Jan 2, 2024

tangopium commented Jan 3, 2024 •

edited

tangopium commented Jan 3, 2024

DmitryRibalka commented Jan 5, 2024 •

edited

DmitryRibalka commented Jan 5, 2024 •

edited

tangopium commented Jan 5, 2024

mhenrixon commented Jan 6, 2024

until_and_while_executing not entering perform method on initial run #824

until_and_while_executing not entering perform method on initial run #824

Comments

tangopium commented Jan 2, 2024 • edited

mhenrixon commented Jan 2, 2024

tangopium commented Jan 3, 2024 • edited

tangopium commented Jan 3, 2024

DmitryRibalka commented Jan 5, 2024 • edited

DmitryRibalka commented Jan 5, 2024 • edited

tangopium commented Jan 5, 2024

mhenrixon commented Jan 6, 2024

tangopium commented Jan 2, 2024 •

edited

tangopium commented Jan 3, 2024 •

edited

DmitryRibalka commented Jan 5, 2024 •

edited

DmitryRibalka commented Jan 5, 2024 •

edited