Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

until_and_while_executing not entering perform method on initial run #824

Closed
tangopium opened this issue Jan 2, 2024 · 7 comments · Fixed by #830
Closed

until_and_while_executing not entering perform method on initial run #824

tangopium opened this issue Jan 2, 2024 · 7 comments · Fixed by #830
Labels

Comments

@tangopium
Copy link

tangopium commented Jan 2, 2024

If I have a worker configured to be locked until_and_while_executing and I schedule a worker while sidekiq is not running and then start sidekiq, the worker seems to start and finish (in the console), but never enters the "perform" method. I noticed that if I turn of the reaper, this issue doesn't happen. I'm starting sidekiq immediately after I scheduled the job (so it's not eliminated because if the TTL)

class TestWorker
    include Sidekiq::Worker

    sidekiq_options lock: :until_and_while_executing, on_conflict: :log

    def initialize
        puts "Initialize test worker"
        super
    end

    def perform
        puts "Test worker"
    end
end

The logs look like the following:

2024-01-02T13:05:41.150Z pid=1 tid=64l INFO: Booted Rails 7.1.2 application in development environment
2024-01-02T13:05:41.150Z pid=1 tid=64l INFO: Running in ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux]
2024-01-02T13:05:41.150Z pid=1 tid=64l INFO: See LICENSE and the LGPL-3.0 for licensing details.
2024-01-02T13:05:41.150Z pid=1 tid=64l INFO: Upgrade to Sidekiq Pro for more features and support: https://sidekiq.org
2024-01-02T13:05:41.151Z pid=1 tid=64l INFO: Sidekiq 7.2.0 connecting to Redis with options {:size=>40, :pool_name=>"default", :url=>"redis://redis:6379/2"}
2024-01-02T13:05:41.169Z pid=1 tid=64l uniquejobs=reaper INFO: Starting Reaper
2024-01-02T13:05:41.170Z pid=1 tid=64l INFO: Starting processing, hit Ctrl-C to stop
2024-01-02T13:05:41.175Z pid=1 tid=11bd class=TestWorker jid=a5e3aac198d488050bb5fb95 INFO: start
2024-01-02T13:05:41.182Z pid=1 tid=10mx uniquejobs=reaper INFO: Nothing to delete; exiting.
2024-01-02T13:05:41.185Z pid=1 tid=10mx uniquejobs=reaper INFO: Deleting batch with 1 digests
2024-01-02T13:05:41.186Z pid=1 tid=10mx INFO: (2024-01-02 14:05:41 +0100) Execution successfully returned 1
Initialize test worker
2024-01-02T13:05:41.228Z pid=1 tid=11bd class=TestWorker jid=a5e3aac198d488050bb5fb95 elapsed=0.053 INFO: done

This does not happen with for example until_executed

In some rare cases the worker is running through:

2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: Booted Rails 7.1.2 application in development environment
2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: Running in ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux]
2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: See LICENSE and the LGPL-3.0 for licensing details.
2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: Upgrade to Sidekiq Pro for more features and support: https://sidekiq.org
2024-01-02T13:20:59.477Z pid=1 tid=64l INFO: Sidekiq 7.2.0 connecting to Redis with options {:size=>40, :pool_name=>"default", :url=>"redis://redis:6379/2"}
2024-01-02T13:20:59.496Z pid=1 tid=64l uniquejobs=reaper INFO: Starting Reaper
2024-01-02T13:20:59.496Z pid=1 tid=64l INFO: Starting processing, hit Ctrl-C to stop
2024-01-02T13:20:59.502Z pid=1 tid=11s1 class=TestWorker jid=37ca7144dc92870517af7293 INFO: start
2024-01-02T13:20:59.510Z pid=1 tid=10mx uniquejobs=reaper INFO: Nothing to delete; exiting.
2024-01-02T13:20:59.511Z pid=1 tid=10mx uniquejobs=reaper INFO: Nothing to delete; exiting.
Initialize test worker
Test worker
2024-01-02T13:20:59.561Z pid=1 tid=11s1 class=TestWorker jid=37ca7144dc92870517af7293 elapsed=0.059 INFO: done

I looks to me like a raise condition.

Sidenote: I went down to Sidekiq 6.5.12 and Sidekiq Unique Jobs 7.1.12 and was able to reproduce this bug.

@mhenrixon
Copy link
Owner

This is the weirdest issue I have come across so far.

I had this happen once in a blank project (example app in this repository).

Fixed it and haven't been able yo replicate it since.

Other people did have this and then I didnt hear from them again.

@tangopium
Copy link
Author

tangopium commented Jan 3, 2024

I was digging deeper into this issue. The following occurs:

  1. Sidekiq Worker starts
  2. Worker picks up a job and removes it from the queue
  3. Job starts initializing
  4. Reaper gets started
  5. Reaper finds lock and check if it is an orphan or expired
  6. Reaper doesn't find the corresponding job, considers it an orphan and removes it
  7. Worker doesn't enter perform method because the lock is missing and finishes (I couldn't verify this step in the code, but thats what I can observ)
  8. About 10 seconds later, Sidekiq Worker updates status of job and puts it to the "Processing queue"

The main problem here, is that the Sidekiq worker pushes the Job status with a big delay. It takes up to 10 seconds until a job appears as "processing" (see the Sidekiq source here & here). This leads to a situation where job "disappears" for that period.

By the way, this can affect even jobs during normal runtime. If a worker picks them up and the reaper runs in the 10 seconds delay time, we are in the same situation.

I already thought about if this is a design error on Sidekiqs side, but I guess they are using the heartbeat method to reduce pressure on the DB and allow very high throughputs.

A possible solution I see here: the heartbeat timing is hardcoded in Sidekiq, so what we could do, is to maintain (supposedly) orphaned locks for at least 10 seconds in memory and run a re-check after that time. If they are still orphans, they can be removed. What do you think?

@tangopium
Copy link
Author

I prepared a simple dockerized repository to reproduce the error: https://github.com/tangopium/sidekiq-delay-issue

@DmitryRibalka
Copy link

DmitryRibalka commented Jan 5, 2024

@tangopium i can confirm the issue. I have a task with the same options as yours example.
Mine is run by cron every 10 min and duration can be 10 sec or even 1 hr.
Instead of run it every time i see this log:

19:00:03 sidekiq.1 \| 2024-01-05T19:00:03.213Z pid=39 tid=9oj uniquejobs=client until_and_while_executing=uniquejobs:fa970d55f6eb3db2764e6a05573900fe INFO: Skipping job with id (bece4b439233f82bc547866d) because lock_digest: (uniquejobs:fa970d55f6eb3db2764e6a05573900fe) already exists
 
 also i see in the web ui at 'sidekiq/locks' the lock which is broken i think. it started hours ago. why it is not released have no idea
 
 sidekiq 7.2, ruby 3.2
 

@DmitryRibalka
Copy link

DmitryRibalka commented Jan 5, 2024

I double checked the config and found these line missing:

Sidekiq.configure_server do |config|

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

and looks like the lock exipres as it should

@tangopium
Copy link
Author

@DmitryRibalka thanks for sharing your experience. The error you faced doesn't look like the one I described and is probably caused by the missing configuration you've mentioned

@mhenrixon Any update regarding my question? Did you have time to look into it?

@mhenrixon
Copy link
Owner

Thank you, @tangopium! This makes it a little more challenging.

So we need a queue for these situations with a timestamp to compare with.

Fantastic research, and it makes sense.

I will have a look at optimizing a few places as well. I am primarily using Lua already for that reason.

I haven't looked yet because I feel beat about rechecking this issue, but you gave me hope.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants