
Only identify specific exit codes as retryable error #518

Merged 12 commits from 0olwzo0:master into kubeflow:master on Apr 20, 2018

Conversation

@wenzhel101 (Contributor) commented Mar 30, 2018

The criteria used to decide between permanent and retryable errors are not correct. The current exit code range [128, 255] for retryable errors is too broad and will misclassify some permanent errors as retryable. For example:

  1. User code exits with a negative code (e.g. sys.exit(-1)). The container exit code will be 255 (exit_code % 256).
  2. User code is killed by SIGSEGV, SIGABRT, etc. Those cases are mostly caused by memory allocation problems in the user code.

The proposal is to only allow retries for specific exit codes that are more likely to be caused by transient issues (e.g. the VM was rescheduled or the VM was deleted by mistake):
130 = (128+2) Container terminated by Control-C
137 = (128+9) Container received a SIGKILL
143 = (128+15) Container received a SIGTERM

The list can be extended if we see other retryable cases.

The safe approach would be to classify all non-zero exit codes as permanent errors.
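
As a quick illustration of the first example, the modulo-256 wraparound can be reproduced with a toy Go snippet (illustrative only, not code from this PR):

package main

import "fmt"

func main() {
	// A process that exits with a negative status (e.g. Python's sys.exit(-1))
	// has its status reported modulo 256 by the container runtime.
	status := -1
	fmt.Println(uint8(status)) // prints 255, which the current [128, 255] rule would retry
}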



@coveralls commented Mar 30, 2018


Coverage increased (+1.1%) to 49.643% when pulling 96d933e on 0olwzo0:master into ff2958a on kubeflow:master.

@wenzhel101 (Contributor, Author):

/assign @jlewi

@jlewi (Contributor) commented Mar 30, 2018

If we treat all errors as permanent then how do we deal with retryable errors like the gRPC server going down?

Or the process being killed by SIGTERM because a node became unhealthy?

@wenzhel101 (Contributor, Author):

It depends on how we define a retryable error. The gRPC server going down can be related to user code, so it makes some sense to call that a user error (though it's not permanent). For the SIGTERM case, it's also hard to tell whether the node was shut down manually by the user (user error) or by a service like GCE (retryable?).
I agree that we may not be able to handle this perfectly; it depends on how frequently the edge cases happen. IMO, if none of the edge cases are likely to happen, it's fine to treat them all as permanent errors.

@jlewi (Contributor) commented Mar 30, 2018

I can say from experience that retryable errors are an issue, and treating all errors as permanent is a no-go.

In the cloud, VMs can die, and this will cause workers to die. Treating that as a permanent error and failing the job is not the right thing to do.

Do you agree? Can we close this PR?

@k8s-ci-robot k8s-ci-robot added size/M and removed size/L labels Apr 2, 2018
@wenzhel101 wenzhel101 changed the title Should return ReplicaStateFailed if container exit code is not 0 Only identify specific exit codes as retryable error Apr 2, 2018
@wenzhel101 (Contributor, Author):

I updated the PR to only allow retries for some known cases from CloudML Engine. I think the list can be extended if we see other data points.
What do you think, @jlewi?

@jose5918 (Contributor) commented Apr 3, 2018

Doesn't it make more sense to treat all errors as retryable and classify certain ones we know are permanent? Basically the reverse of what is being proposed here. Kubernetes already has CrashLoopBackOff for things that keep failing, and it's possible to put a limit on how long the operator lets a job stay in a failed state.

@jlewi (Contributor) commented Apr 3, 2018

The criteria to decide permanent error vs retryable error is not correct. Basically, the current exit code range[128, 255] for retryable errors is too broad and it will misclassify some permanent errors as retryable.

What's an example of an exit code that would be misclassified using the current schema?

I don't think this list is sufficient:
130 = (128+2) Container terminated by Control-C
137 = (128+9) Container received a SIGKILL
143 = (128+15) Container received a SIGTERM

TensorFlow is a distributed system, so when one process goes down (e.g. SIGTERM because a VM goes down), the error can propagate to other TF workers, e.g. because gRPC now encounters errors that cause exceptions to be thrown.

These exceptions need to be caught and turned into exit codes that indicate the proper behavior, i.e. permanent or retryable.

So I think users need a schema that gives them the ability to indicate whether an exit is retryable or not.

The original schema was chosen for simplicity, for symmetry, and to give users the ability to define an error as retryable or permanent:

1-128 - permanent
129-255 - retryable

This automatically classifies most Unix-triggered failures as retryable, which I believe is the correct behavior. Does anyone have a counter example?
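
For reference, the original range-based rule amounts to a check like this (an illustrative sketch, not the controller's actual code):

// isRetryableOriginal sketches the original schema: exit codes 1-128 are
// treated as permanent, 129-255 as retryable.
func isRetryableOriginal(exitCode int32) bool {
	return exitCode >= 129 && exitCode <= 255
}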

We treat OOMs as permanent but we detect those based on K8s/Docker signals not exit codes.

/cc @gaocegege @ScorpioCPH

@gaocegege (Member):

I am not sure if this is the convention from Google, so I did not leave comments here.

Personally, I agree with @jlewi. If we need to support different policies for different clouds, there are two options.

FYI, there is a doc about sys.exit in Python:

Most systems require it to be in the range 0–127, and produce undefined results otherwise. Some systems have a convention for assigning specific meanings to specific exit codes, but these are generally underdeveloped

@wenzhel101 (Contributor, Author):

This automatically classifies most Unix-triggered failures as retryable, which I believe is the correct behavior. Does anyone have a counter example?

It's not true that all Unix-triggered failures are retryable. An example we have seen on CloudML Engine is SIGSEGV, which is related to user code and won't recover after a retry. In that case, keeping the job running is the wrong behavior, and the issue may not be noticed by users (the TFJob stays RUNNING).
We have also seen users call sys.exit(-1) to exit their programs. They meant to stop execution, but with tf-operator the program would be retried.

From the cloud service perspective, keeping resources running after misclassifying a permanent error as retryable is not a good idea (though that may not hold for non-cloud environments). I agree with @gaocegege that supporting a custom restart policy is a better option. But I still don't think [128, 255] is the right default range of retryable exit codes.

@jlewi (Contributor) commented Apr 3, 2018

Thanks, that's a good example.

Does anyone know offhand what exit code Python uses for an unhandled exception? I think the default behavior for that should be retryable.

If we want to be explicit about exit codes, then I think we should do the following:

  • Define exit codes corresponding to user defined retryable and permanent errors

  • Define behavior for exit codes corresponding to relevant unix signals

  • Explicitly define all other exit codes as undefined and not make any guarantees.

@jlewi (Contributor) commented Apr 10, 2018

Any thoughts?

@wenzhel101 (Contributor, Author):

Define exit codes corresponding to user defined retryable and permanent errors

Does this mean that users provide a predefined list of exit codes to retry?

Define behavior for exit codes corresponding to relevant unix signals

The question is how to get the full list of retryable signals. Do you think SIGTERM and SIGKILL are good enough as a starting point?

Explicitly define all other exit codes as undefined and not make any guarantees.

What do you mean by 'not make any guarantees'? Does that mean identifying them as permanent errors?

@gaocegege (Member):

@ddysher I think so. We should have a plan to support a customized restart policy in v1alpha2.

@jlewi (Contributor) commented Apr 11, 2018

What do you mean by 'not make any guarantees'? Does that mean identifying them as permanent errors?

I mean the behavior is undefined; i.e., we don't specify whether an exit code will be treated as retryable or permanent. It's left to the implementation to decide.

I think that, in addition to SIGTERM and SIGKILL, we should figure out what exit code Python uses by default for unhandled exceptions and map that to retryable errors.

We should also pick two exit codes: one to correspond to user-defined retryable errors and one to user-defined permanent errors.

@ddysher Yes, I think the behavior should be the same for v1alpha2. We should define a function IsRetryableExitCode so that we can use the same code in both implementations.

@wenzhel101 (Contributor, Author):

We should also pick two exit codes: one to correspond to user-defined retryable errors and one to user-defined permanent errors.

How about SIGUSR1 & SIGUSR2?

@jlewi (Contributor) commented Apr 12, 2018

SGTM

@wenzhel101 (Contributor, Author):

FYI, I found this link helpful.
Here is my proposal:

  • Permanent errors:

    • 1: General errors
    • 2: Misuse of shell builtins
    • 126: Command invoked cannot execute
    • 127: Command not found
    • 128: Invalid argument to exit
    • 139: Terminated by SIGSEGV (invalid memory reference)
  • Retryable errors:

    • 130: Terminated by SIGINT (interrupt from keyboard, Ctrl-C)
    • 137: Terminated by SIGKILL
    • 143: Terminated by SIGTERM
    • 138: Corresponds to SIGUSR1; reserved in tf-operator for user-specified retryable errors.
  • All other exit codes are undefined, with no guarantee about the behavior (currently they are handled as permanent errors).

I think it's hard to define the behavior for every system signal, so it's better to start with the known ones. What do you guys think?
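
For concreteness, a minimal Go sketch of this classification (hypothetical code following the IsRetryableExitCode suggestion above; the PR's actual implementation may differ in details):

// IsRetryableExitCode reports whether a container exit code should be
// treated as a retryable error under the proposal above.
func IsRetryableExitCode(exitCode int32) bool {
	switch exitCode {
	case 1, 2, 126, 127, 128, 139:
		// Permanent: general errors, misuse of shell builtins, command cannot
		// execute, command not found, invalid exit argument, SIGSEGV.
		return false
	case 130, 137, 143:
		// Retryable: terminated by SIGINT, SIGKILL, or SIGTERM.
		return true
	case 138:
		// SIGUSR1, reserved for user-specified retryable errors.
		return true
	default:
		// Undefined: no guarantee; currently handled as permanent.
		return false
	}
}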

@jlewi (Contributor) commented Apr 17, 2018

This looks good to me.

@gaocegege @ScorpioCPH thoughts?

@gaocegege (Member):

SGTM

@wenzhel101 (Contributor, Author):

Updated the PR to address the discussion. PTAL!

// We don't want to retry for both cases.
// More info about exit status can be found in:
// https://www.gnu.org/software/bash/manual/html_node/Exit-Status.html
if s.ExitCode == 1 || s.ExitCode == 2 || s.ExitCode == 126 ||
A reviewer (Contributor) commented on this line:

Can we make this a utility function IsRetryableExitCode? I'd like to be able to use the same function in the v1alpha1 and v1alpha2 controllers.

Contributor Author replied:

Done. Moved it to pkg/util/train/train_util.go.
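
For illustration, a table-driven test over the sketch above could pin down the agreed classification (hypothetical; assuming the sketch lives in the pkg/util/train package just mentioned, and that the PR's actual tests may differ):

package train

import "testing"

// TestIsRetryableExitCode checks the exit codes discussed in this thread.
func TestIsRetryableExitCode(t *testing.T) {
	cases := map[int32]bool{
		1:   false, // general error: permanent
		139: false, // SIGSEGV: permanent
		255: false, // e.g. sys.exit(-1): undefined, handled as permanent
		130: true,  // SIGINT
		137: true,  // SIGKILL
		143: true,  // SIGTERM
		138: true,  // SIGUSR1: user-specified retryable
	}
	for code, want := range cases {
		if got := IsRetryableExitCode(code); got != want {
			t.Errorf("IsRetryableExitCode(%d) = %v, want %v", code, got, want)
		}
	}
}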

@gaocegege (Member) left a comment:

LGTM, and I agree with jlewi to add a function in utility package.

@gaocegege (Member):

/ok-to-test

@wenzhel101 (Contributor, Author):

Addressed the comment, PTAL!

@gaocegege (Member):

@jlewi LGTY?

@jlewi (Contributor) commented Apr 20, 2018

Thank you so much, this is a great change.
Apologies that it took so long. We are trying to ramp up as a community and need to grow the pool of reviewers.

/lgtm
/approver

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit c4ad789 into kubeflow:master Apr 20, 2018
yph152 pushed a commit to yph152/tf-operator that referenced this pull request Jun 18, 2018
* Should return ReplicaStateFailed if container exit code is not 0

* Update the criteria for retryable errors.

* Reformat

* Reformat

* Reformat

* Fix lint error.

* Handle the exit code more explicitly.

* Reformat.

* Create a util func for IsRetryableExitCode.
jetmuffin pushed a commit to jetmuffin/tf-operator that referenced this pull request Jul 9, 2018