
Add option to allow ActiveJob to manage retries on app-level first #103

Draft
wants to merge 1 commit into master
Conversation

@lagr commented Dec 20, 2023

First off: a big thank you to all contributors who made this gem possible! 🚀

Motivation

Our company ran into some unfortunate circumstances regarding retries and failing jobs with an app that relies on Rails' ActiveJob and, depending on the tenant, either Sidekiq/Redis or Cloudtasker/Cloud Tasks as the backend:
due to the behavior already described in

some jobs were retried endlessly in the Cloud Tasks-based stacks.

Proposed solution

My suggestion would be to let a config option retry_mechanism control whether ActiveJob's executions counter is overridden with the value of Cloud Tasks' retry count header.
As implemented, ActiveJob handles retries via retry_on first, and only after those attempts are exhausted does the backend take over further retries.
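
For illustration, here is a minimal sketch of how the proposed option could be set, assuming it lives on the Cloudtasker configuration block and takes a symbol such as :active_job (the value name is illustrative, not final):

```ruby
# config/initializers/cloudtasker.rb
Cloudtasker.configure do |config|
  # Proposed option: have ActiveJob's retry_on handle retries first and only
  # bubble errors up to the backend once those attempts are exhausted.
  # The value :active_job is illustrative; the accepted values may differ.
  config.retry_mechanism = :active_job
end
```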

Drawbacks of the proposed solution

  • config.max_retries will have no effect in the implementation at hand if ActiveJob is chosen, as executions is not "handed back" to Cloudtasker to update once ActiveJob is done retrying and bubbles errors up for the backend to handle.
    If it is acceptable to introduce a retry counter that does not clash with ActiveJob's executions, e.g. something like Sidekiq's retry_count, that could be a way out (see the sketch after this list).

  • The behavior of first handling retries on the application level and then on the backend/adapter level is probably quite unintuitive and must be documented to prevent people from accidentally shooting themselves in the foot.

  • A Rails-specific config option is introduced that has no relevance for projects that do not use the CloudtaskerAdapter at all.
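
To make the first drawback more concrete, here is a purely hypothetical sketch of such a separate counter, built on ActiveJob's documented serialize/deserialize extension points. The backend_retries name is invented for illustration and is not part of Cloudtasker's or ActiveJob's API:

```ruby
class ExampleJob < ApplicationJob
  retry_on NameError, wait: 3.seconds, attempts: 5

  # Hypothetical counter kept separate from ActiveJob's `executions`, which the
  # adapter could increment from the Cloud Tasks retry count and compare
  # against config.max_retries.
  attr_accessor :backend_retries

  def serialize
    super.merge("backend_retries" => backend_retries.to_i)
  end

  def deserialize(job_data)
    super
    self.backend_retries = job_data["backend_retries"]
  end
end
```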

Alternative

A more elaborate mechanism could provide the means to configure both app-level and backend-level retries at the same time. This would include either a "hand-off" of the execution/retry counters when the responsibility changes or separate means of tracking and applying executions/retries.


It would be great if you could consider including this change and/or sharing your thoughts on how this contribution can be improved.

@alachaum (Member) commented Jan 8, 2024

@lagr thanks a lot for this PR and the detailed description 😉

I quickly scanned through the code and it looks good. I need to spend some time playing with it locally to evaluate the drawbacks as well as try to find potential edge cases (and refresh my memory on the internals of ActiveJob retry mechanisms)

I'll try to spend some time on it this week.

@alachaum (Member) commented Mar 31, 2024

@lagr So I've finally found some time to play with this. I think there is still a bit of tweaking to do to make it work.

When testing locally, I observed the following behavior:

  • I used the examples/rails project, which I plugged into Cloud Tasks using ngrok
  • I configured the project's ExampleJob to retry_on NameError and put an undefined variable inside the perform method (sketched after this list)
  • Without options, retry_on will make 5 attempts and wait for 3s between each
  • The first 5 attempts led to the job being rescheduled by ActiveJob, with no exception trace in the logs
  • After that, the job keeps retrying with exception traces in the logs. This loops forever (until the execution max is hit on the Cloud Tasks side)
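
For reference, a minimal reconstruction of the test job described above; the class name comes from the comment, while the perform body is an assumption rather than a copy of the examples/rails code:

```ruby
class ExampleJob < ApplicationJob
  queue_as :default

  # With no extra options, retry_on defaults to 5 attempts with a 3 second wait.
  retry_on NameError

  def perform
    some_undefined_variable # raises NameError on every attempt
  end
end
```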

See below:
[Screenshot: 2024-03-31 at 18:02:33]

Config-wise, my view is to be more aggressive and always use ActiveJob's retry mechanism for ActiveJob jobs (i.e. get rid of the retry_mechanism config option). People using ActiveJob should expect it to work as per the docs, after all. Keeping a mixed behavior will lead to confusion.

@lagr (Author) commented Apr 15, 2024

@alachaum Thank you for looking into this! 🙏 I may need some time, then, to figure out how to properly detect whether to use the ActiveJob mechanism or not.
