[Runtimes] Wait for pods deletion when removing runtime resources #910
We had a bug in which a job that was aborted, and had already moved to the aborted state, eventually reached the completed state.
After investigation we found a race condition. When we abort a job we delete its runtime resources, but apparently the k8s package doesn't wait until the resources are actually removed. So what happened here is: we sent the abort request, k8s got the request to delete the pod and started the process, the API changed the job state to aborted, and the job finished before the pod actually went down, sending a request to the API to change its state to completed. I added logic to wait for the pod to actually be gone, and only then update the job state.
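The waiting step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `wait_for_pods_deletion`, `list_pod_names`, and the timeout and interval defaults are hypothetical names and values chosen for the example. In real code, `list_pod_names` would wrap whatever call lists the pods currently known to the cluster.

```python
import time
from typing import Callable, List


def wait_for_pods_deletion(
    list_pod_names: Callable[[], List[str]],  # hypothetical: returns names of pods that still exist
    deleted_pod_names: List[str],
    timeout_seconds: float = 60.0,  # assumed default, not taken from the PR
    interval_seconds: float = 1.0,  # assumed default, not taken from the PR
) -> None:
    """Block until none of deleted_pod_names is listed anymore, or raise on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        # Pods we asked to delete that the cluster still reports.
        remaining = set(deleted_pod_names) & set(list_pod_names())
        if not remaining:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Pods still exist after {timeout_seconds}s: {sorted(remaining)}"
            )
        time.sleep(interval_seconds)
```

With a helper like this, the abort flow would issue the delete, call the helper, and only then mark the job as aborted, so a late "completed" update from a still-running pod can no longer arrive after the state change.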
This change required changing the response payload of the "runtime resources grouped by job" response. Instead of a dict with the job uid as the key and the runtime resources as the value (`{job-uid -> runtime-resources}`), there is now another level for the project: the first-level keys are project names, and the values are dicts whose keys are job uids and whose values are the runtime resources (`{project-name -> {job-uid -> runtime-resources}}`).

Fixes https://jira.iguazeng.com/browse/ML-397
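To make the payload change concrete, here is a small sketch of the two shapes. The project name, job uid, and the contents of the runtime resources dict (`"pods"`) are made-up placeholders, and `get_job_resources` is a hypothetical helper, not part of the API.

```python
from typing import Any, Dict, Optional

RuntimeResources = Dict[str, Any]

# Old shape: {job-uid -> runtime-resources}
old_payload: Dict[str, RuntimeResources] = {
    "uid-1": {"pods": ["pod-a"]},  # placeholder resources content
}

# New shape: {project-name -> {job-uid -> runtime-resources}}
new_payload: Dict[str, Dict[str, RuntimeResources]] = {
    "project-x": {
        "uid-1": {"pods": ["pod-a"]},  # placeholder resources content
    },
}


def get_job_resources(
    payload: Dict[str, Dict[str, RuntimeResources]], project: str, uid: str
) -> Optional[RuntimeResources]:
    """Look up a job's runtime resources in the new, project-keyed shape."""
    return payload.get(project, {}).get(uid)
```

Clients that previously indexed the response directly by job uid would now need to index by project name first.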