@hao1939 hao1939 commented Jul 18, 2019

Features: if a node is lost, mark the job as "Unknown" or "NotFound" (depending on the pod status), then resubmit it.

Known issues:

  1. We may need to reduce the timeout. It takes about 7 minutes before a pod turns "Unknown", and we may not need to wait that long. (Note that with a shorter timeout we would be more likely to kill a job by mistake if there is a problem with the kubelet.)
  2. If the node resumes before we resubmit the job (which takes about 10 minutes or more), the job will end up with status 'failed'.
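
A minimal sketch of the status mapping described above, for illustration only. The diff is not shown in this conversation, so the function names and the use of `None` for "the pod could not be found" are assumptions, not the code merged in this PR. The string "Unknown" matches the Kubernetes pod phase of the same name.

```python
from typing import Optional

# Job statuses that trigger a resubmit, as described in the PR summary.
RESUBMIT_STATUSES = ("Unknown", "NotFound")


def job_status_for_lost_node(pod_phase: Optional[str]) -> Optional[str]:
    """Hypothetical helper: map a pod's phase to a job status when the node may be lost.

    pod_phase is None when the pod could not be found at all (an assumption
    about how a missing pod is represented).
    Returns None when the pod phase gives no sign that the node is lost.
    """
    if pod_phase is None:
        return "NotFound"
    if pod_phase == "Unknown":
        return "Unknown"
    return None


def should_resubmit(job_status: Optional[str]) -> bool:
    """Hypothetical helper: resubmit only jobs marked "Unknown" or "NotFound"."""
    return job_status in RESUBMIT_STATUSES
```

Keeping the mapping separate from the resubmit decision makes it easier to tune the timeout from known issue 1 without touching the status logic.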

@hao1939 hao1939 changed the title from "[DONOT MERGE] Refine job status: Unknown pod" to "Refine job status: Unknown pod" Jul 23, 2019
@Anbang-Hu Anbang-Hu merged commit aeb2744 into microsoft:jobmanager Jul 24, 2019