
Add option to allow ActiveJob to manage retries on app-level first #103

Draft
wants to merge 1 commit into master
Conversation

@lagr commented Dec 20, 2023

First off: a big thank you to all contributors who made this gem possible! 🚀

Motivation

Our company ran into some unfortunate circumstances regarding retries and failing jobs with an app that relies on Rails' ActiveJob and, depending on the tenant, either Sidekiq/Redis or Cloudtasker/Cloud Tasks as the backend:
due to the behavior already described in

some jobs were retried endlessly in the Cloud Tasks-based stacks.

Proposed solution

My suggestion would be to let a config option retry_mechanism control whether ActiveJob's executions counter is overridden with the value of Cloud Tasks' retry count header.
As implemented, ActiveJob handles retries via retry_on first, and only after those attempts are exhausted does the backend take over further retries.
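
For illustration, here is a minimal sketch of how the proposed option could be set, assuming it lives on the Cloudtasker configuration block and takes a symbol such as :active_job (the value name is illustrative, not final):

```ruby
# config/initializers/cloudtasker.rb
Cloudtasker.configure do |config|
  # Proposed option: have ActiveJob's retry_on handle retries first and only
  # bubble errors up to the backend once those attempts are exhausted.
  # The value :active_job is illustrative; the accepted values may differ.
  config.retry_mechanism = :active_job
end
```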

Drawbacks of the proposed solution

  • config.max_retries will have no effect in the implementation at hand if ActiveJob is chosen, as executions is not "handed back" to Cloudtasker to update once ActiveJob is done retrying and bubbles errors up for the backend to handle.
    If it is acceptable to introduce a retry counter that does not clash with ActiveJob's executions, e.g. something like Sidekiq's retry_count, that could be a way out (see the sketch after this list).

  • The behavior of first handling retries on the application level and then on the backend/adapter level is probably quite unintuitive and must be documented to prevent people from accidentally shooting themselves in the foot.

  • A Rails-specific config option is introduced that has no relevance for projects that do not use the CloudtaskerAdapter at all.
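
To make the first drawback more concrete, here is a purely hypothetical sketch of such a separate counter, built on ActiveJob's documented serialize/deserialize extension points. The backend_retries name is invented for illustration and is not part of Cloudtasker's or ActiveJob's API:

```ruby
class ExampleJob < ApplicationJob
  retry_on NameError, wait: 3.seconds, attempts: 5

  # Hypothetical counter kept separate from ActiveJob's `executions`, which the
  # adapter could increment from the Cloud Tasks retry count and compare
  # against config.max_retries.
  attr_accessor :backend_retries

  def serialize
    super.merge("backend_retries" => backend_retries.to_i)
  end

  def deserialize(job_data)
    super
    self.backend_retries = job_data["backend_retries"]
  end
end
```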

Alternative

A more elaborate mechanism could provide the means to configure both app-level and backend-level retries at the same time. This would include either a "hand-off" of the execution/retry counters when the responsibility changes or separate means of tracking and applying executions/retries.


It would be great if you could consider including this change and/or sharing your thoughts on how this contribution can be improved.

@alachaum (Member) commented Jan 8, 2024

@lagr thanks a lot for this PR and the detailed description 😉

I quickly scanned through the code and it looks good. I need to spend some time playing with it locally to evaluate the drawbacks as well as try to find potential edge cases (and refresh my memory on the internals of ActiveJob retry mechanisms)

I'll try to spend some time on it this week.

@alachaum (Member) commented Mar 31, 2024

@lagr So I've finally found some time to play with this. I think there is still a bit of tweaking to do to make it work.

When testing locally, I observed the following behavior:

  • I used the examples/rails project, which I plugged into Cloud Tasks using ngrok
  • I configured the project's ExampleJob to retry_on NameError and put an undefined variable inside the perform method (sketched after this list)
  • Without options, retry_on will make 5 attempts and wait for 3s between each
  • The first 5 attempts led to the job being rescheduled by ActiveJob, with no exception trace in the logs
  • After that, the job keeps retrying with exception traces in the logs. This loops forever (until the execution max is hit on the Cloud Tasks side)
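
For reference, a minimal reconstruction of the test job described above; the class name comes from the comment, while the perform body is an assumption rather than a copy of the examples/rails code:

```ruby
class ExampleJob < ApplicationJob
  queue_as :default

  # With no extra options, retry_on defaults to 5 attempts with a 3 second wait.
  retry_on NameError

  def perform
    some_undefined_variable # raises NameError on every attempt
  end
end
```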

See below:
[Screenshot: 2024-03-31 at 18:02:33]

Config-wise, my view is to be more aggressive and always use ActiveJob's retry mechanism for ActiveJob jobs (i.e. get rid of the retry_mechanism config option). People using ActiveJob should expect it to work as per the docs, after all. Keeping a mixed behavior will lead to confusion.

@lagr (Author) commented Apr 15, 2024

@alachaum Thank you for looking into this! 🙏 I may need some time, then, to figure out how to properly detect whether to use the ActiveJob mechanism or not.
