Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Retryable Tasks #183

Closed
maxfierke opened this issue Nov 5, 2017 · 8 comments
Closed

RFC: Retryable Tasks #183

maxfierke opened this issue Nov 5, 2017 · 8 comments

Comments

@maxfierke
Copy link
Collaborator

@maxfierke maxfierke commented Nov 5, 2017

  • Start Date: 2017-11-04
  • RFC PR: (leave this empty)
  • ember-concurrency issue: (leave this empty)

Summary

Sometimes networks are bad. Sometimes computers do bad things. Sometimes solar flares happen. Failure happens. EC already has some great derived state around failure and allows task authors to handle failure as they see fit. This is great for fatal errors where we just want to log the error, maybe show a message to user, and maybe even give them an option to retry the operation. But what about intermittent failure? What about routine, boring failures? The failures we might be able to recover from without user intervention if we just wait a bit and then retry? This RFC proposes adding additional functionality to allow for automatic retries of tasks.

Motivation

As long as web apps need the network, network errors will exist. Sometimes they're blips. In the case of mobile internet, intermittent network errors are just part of the reality of such distributed connectivity. There are use cases where user intervention should not be needed and the user may not even need to know that something has gone wrong underneath.

Some use-cases for automatic retrying of tasks:

  • Long-polling tasks
    • Background auto-saves
  • Telemetry/analytics
  • Integrations w/ 3rd party APIs that don't quite have five 9s (or three)
  • Applications that deal with periodic resource contention
    • e.g. app talks to a cloud storage API in files might be locked for a few seconds while some other system is working on it.
  • Anything async that is prone to transient errors
    • e.g. transient Geolocation API failures while user is in transit

Consider one approach to solving this now:

import Ember from 'ember';
import { task } from 'ember-concurrency';
import { getWidgets } from '../utils/widgets';

const { Component, run: { later } } = Ember;
const RETRY_DELAYS = [500, 3000, 30000, 50000];

export default Component.extend({
    _retryAttempt: 0,

    fetchLatest: task(function* () {
        try {
            // Call to unreliable 3rd party API
            const data = yield getWidgets();
            this.set('_retryAttempt', 0);
            return data;
        } catch (e) {
            // Ewwww...
            const retryAttempt = this.get('_retryAttempt');
            
            if (retryAttempt < RETRY_DELAYS.length) {
                later(this, () => {
                    this.get('fetchLatest').perform();
                }, RETRY_DELAYS[retryAttempt]);
                this.incrementProperty('_retryAttempt');
            }

            throw e; 
        } 
    }).enqueue()
});

We're adding more state that we have to remember to reset and such once a retry is successful. We're also adding additional boilerplate around the error handling.

Detailed design

Scheduler additions

The scheduler includes a concept of buffer policies, which underpin the various property modifiers like .drop(), .enqueue(), etc. and dictate how the scheduler should control the scheduling of additional task instances.

Based on this pattern, this RFC proposes adding a concept of retry policies allowing for additional property modifiers for adding configurable retry behavior a la the resque-retry gem.

Proposed public API additions

e.g. use from an app developer perspective:

import Ember from 'ember';
import { task } from 'ember-concurrency';
import DS from 'ember-data'; 

const { AdapterError } = DS; 

// ... other imports

export default Component.extend({
    backgroundTask: task(function* () {
        // If this fails, it will be retired once after 5s.
        yield someUnreliableAsyncThing(); 
    }).drop().retryable({ delay: 5000 }),
    
    backgroundTaskWithRetryReasons: task(function* () {
        // If this fails and the reason is a
        // SomeError or AdapterError, it will be retried.
        yield someUnreliableAsyncThing(); 
    }).drop().retryable({ reasons: [SomeError, AdapterError] }),
    
    backgroundTaskWithExponentialBackoff: task(function* () {
        // If this fails, it will be retried after
        // 2.5s, 10s, 50s, and 3m.
        yield someUnreliableAsyncThing();
    }).drop().retryable({ delay: [2500, 10000, 500000, 180000] })
}); 

As shown in the example, the proposed public API addition would be a .retryable task property modifier that would attach a retry policy to the task.

An interface definition for a retry policy:

interface RetryPolicy {
    delay: number | Array<number>;
    reasons: Array<string> | Array<Error>; // maybe just Array<any>?
}

Perhaps, without arguments .retryable() might have a default delay of [500, 1500, 3000, 6000] or something considered sensible.

Additionally, it might be nice to add some derived state. A useful piece of derived state might be isRetrying, which would complement isRunning, but explicitly indicate whether the task was retrying, rather than simply running normally. The user may want to know something is being retried and such a property would make it easier to for application developers to present separate messaging, for example. In addition, a retryCount read-only property could be added to indicate the number of times a task has been retried.

Interactions with buffer policies (& other property modifiers)

Retry policies should work alongside existing property modifiers such as .drop(), .keepLatest(), etc. and work alongside them in expected ways. Tasks retried automatically via .retryable() should be scheduled according to any attached buffer policy modifiers, so that Tasks continue to work as users expect.

When the retry timer is up:

  • Tasks with .drop() should drop the attempt to retry the task if maxConcurrency has been reached.
  • Tasks with .keepLatest() should continue the latest currently running task (if any) and enqueue the retry.
  • Tasks with .restartable() should cancel any running tasks and continue with the retry.
  • Tasks with .enqueue() should add the retry to the queue
  • Tasks without other modifiers should begin executing the retry.

Like .drop() and others, .retryable() should not be able to be used with .group().

How We Teach This

We could document it similar to how other property modifiers are documented. i.e. an animated, interactive example showing a visual representation of the retry behavior. Everyone loves animation.

Drawbacks

It adds additional complexity to the scheduler and adds another property modifier that needs to be documented, tested, and supported.

Alternatives

The primary alternative would be continue to allow application developers to build such behavior themselves (possibly using EC tasks!).

Retryable tasks could also be implemented as a task wrapper by wrapping the task computed property or by providing some reusable pattern via encapsulated tasks.

In some cases, application authors can also use Service Workers to handle network failures and return cached responses or enqueue failures for later retry, effectively masking the failure from the application. However, Service Workers come with their own set of complexities & limitations, and do not address non-network use-cases.

Unresolved questions

  • Does this belong in EC or an add-on?
    • If add-on, probably not be possible via public API if going for the wrapper approach. For example, the wrapper around task may require access to taskFn to wrap the generator function.
@xtagon

This comment has been minimized.

Copy link

@xtagon xtagon commented Nov 8, 2017

In one of my applications I accomplish this by wrapping each of the tasks inner actions (each yield basically) in a promise that auto-retries on failure using this addon: https://github.com/GavinJoyce/ember-backoff

There may be advantages to retries on the task level instead. My use case is a batch upload queue which runs each individual upload task concurrently with ember-concurrency, and each upload task consists of multiple sub-actions (that are not their own tasks, but could be) that are each wrapped and automatically retried on failure.

I will be keeping an eye on this RFC 👍

@maxfierke

This comment has been minimized.

Copy link
Collaborator Author

@maxfierke maxfierke commented Nov 18, 2017

After I started working on a proof-of-concept for this RFC using the scheduler-based approach, I ran into some troubles with handling and retrying failed TaskInstances gracefully within the scheduler (was trying to hook into onFinalize, which seemed like it was a bit too late to handle gracefully at the at point). I started going in slightly different direction and handling errors raised in the TaskInstance's generator itself. I think it's a bit cleaner of an approach but the way I have it written right now means that the retries are handled transparently within the same TaskInstance.

Here's the branch for the PoC: https://github.com/maxfierke/ember-concurrency/tree/mf-retry_policies

There's still a lot untested (how cancellation works, interaction with buffer policies), but a there's a quick smoke test of a retryable task.

Changes from RFC

Interface for a retry policy

interface RetryPolicy {
   constructor(options: any);
   shouldRetry(taskInstance: TaskInstance, reason: any): boolean;
   retry(taskInstance: TaskInstance): void;
   reset(taskInstance: TaskInstance): void;
   cancel(taskInstance: TaskInstance): void;
}

Interactions with buffer policies (& other property modifiers)

Since the PoC handles failures and retries them transparently, there is no effect on buffer policies or how TaskInstances are otherwise scheduled. TaskInstances being retried or awaiting retry will appear as running TaskInstances to the scheduler.

Questions

  • Does the approach taken in the above PoC make more sense?
    • Is a transparent retry better than creating a new TaskInstance?

Based on the PoC, I think there is a way forward on matching the approach of the original RFC, as well. Given that the task is accessible via the TaskInstance, could probably go with an approach in which the failed TaskInstance is cancelled and then a new one is scheduled to be performed via Task#perform.

@sukima

This comment has been minimized.

Copy link

@sukima sukima commented Nov 19, 2017

You could very easily do this in an addon in the same manor that ember-concurrency-test-waiter does. For example a basic implementation could be done like this:

// addon/retryable.js
import { assert } from '@ember/debug';
import { later } from '@ember/runloop';

const RETRY_DELAYS = [500, 3000, 30000, 50000];

export default function retryable(taskProperty) {
  assert('retryable() will only work with ember-concurrency >=0.7.19 -- please upgrade', taskProperty.taskFn);

  let originalTaskFn = taskProperty.taskFn;
  let retryAttempt = 0;

  taskProperty.taskFn = function *(...args) {
    try {
      return yield * originalTaskFn.apply(this, args);
      retryAttempt = 0;
    } catch {
      if (retryAttempt < RETRY_DELAYS.length) {
        later(this, () => {
          taskProperty.perform(...args);
        }, RETRY_DELAYS[retryAttempt]);
        retryAttempt++;
      }
      throw e; 
    }
  };

  return taskProperty;
}

Obviously this could be expanded to be smarter and offer more options. I'm not seeing why this needs to be in ember-concurrency core. I think doing this in an addon would prove to be more useful and having a working addon would make it easier if ember-concurrency were to incorporate these ideas into core.

@sukima

This comment has been minimized.

Copy link

@sukima sukima commented Nov 19, 2017

And to make the above a property like you described in the same way emebr-concurrency-test-waiter does:

// addon/define-modifier.js
import retryable from './retryable';
import { TaskProperty } from 'ember-concurrency/-task-property';

export default function defineModifier() {
  TaskProperty.prototype.retryable = function () {
    return retryable(this);
  };
}
@maxfierke

This comment has been minimized.

Copy link
Collaborator Author

@maxfierke maxfierke commented Nov 19, 2017

@sukima could you expand on your thoughts behind keeping it out of core e-c?

I'm not opposed to the add-on approach. It has some definite benefits, e.g. allowing some more experimentation outside of e-c itself. However, I'm wondering if there's a hybrid approach here that would allow for different retry strategies to live outside of e-c in an add-on. I'm thinking adding hooks in e-c, rather than wrapping tasks and extending the prototype in the add-on.

One revision to the API might be to have the .retryable() property modifier just take an instance of an object corresponding to the RetryPolicy interface. TaskInstance would use the the logic from the PoC branch to hook into the retry policy, but e-c would only ship with the NoOpRetryPolicy. This would allow for all the real retry strategies to live elsewhere (and also allow people to write their own).

e.g.

import Component from '@ember/component';
import { ExponentialBackOffPolicy } from 'ember-concurrency-retry-strategies';

const policy = new ExponentialBackOffPolicy({
  multiplier: 1.2,
  minInterval: 2,
  maxInterval: 30
});

export default Component.extend({
  // [...]

  doStuff: task(function* () {
    yield flakyThing();
  }).drop().retryable(policy),

  // [...]
});
@sukima

This comment has been minimized.

Copy link

@sukima sukima commented Nov 20, 2017

My point was that you can accomplish all of this via an addon. I demonstrated a PoC on how to go about doing so and offered a well known example.

The intent was to say that the RFC could be accomplished in an addon, refined as an addon, and only after it has been battle tested then the discussion with the core members of E-C could discuss if they wanted to assimilate the addon into E-C ot leave it as an addon. That is all.

Are there blockers that require core support that I missed?

@maxfierke

This comment has been minimized.

Copy link
Collaborator Author

@maxfierke maxfierke commented Nov 20, 2017

Nope, I don't think so. I was mostly curious about the why rather than the how. It sounds like your concern is with scope creep of E-C, which is totally valid 👍 (though, I will let @machty make the determination of what is/isn't within scope).

Not so much a blocker with your approach to the add-on, but the reliance on a private module import to add the property modifier to the TaskProperty prototype gives me pause, which is why I suggested a more modest addition to core (relative to the original RFC) of a hook into TaskInstance. (but, I may be getting ahead of myself even more, suggesting hooks into E-C; topic for a different RFC, maybe)

I appreciate the feedback! I'll play around with the add-on approach a bit 😎

@maxfierke

This comment has been minimized.

Copy link
Collaborator Author

@maxfierke maxfierke commented Mar 19, 2018

Finally got around to working this into an addon called ember-concurrency-retryable based on the approach suggested.

Haven't worked in any derivable state (not sure if it's possible yet via an addon), but the important behavior is there, and it's built on e-c's timeout and generators, so it should play nicely with cancellation.

Going to close out the RFC for now

@maxfierke maxfierke closed this Mar 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.