[Runtimes] Wait for pods deletion when removing runtime resources #910

Merged
merged 16 commits into from May 4, 2021


Hedingber
Contributor

@Hedingber Hedingber commented May 2, 2021

We had a bug in which a job that was aborted, and had moved to the aborted state, eventually reached the completed state.

After investigating, we found a race condition. When we abort a job we delete its runtime resources, but apparently the k8s package doesn't wait until the resources are actually removed. So the sequence was: we sent the abort request, k8s received the request to delete the pod and started the process, and the API changed the job state to aborted. The job then finished before the pod was actually gone and sent a request to the API to change its state to completed. I added logic to wait for the pod to actually be gone, and only then update the job state.
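To illustrate the ordering described above, here is a minimal sketch of "delete, wait until gone, then update state". It is not the code from this PR: `wait_for_pods_deletion`, `abort_job`, and the injected callables (`delete_pods`, `list_pod_names`, `set_job_state`) are hypothetical names, and the timeout and interval values are assumptions.

```python
import time
from typing import Callable, List


def wait_for_pods_deletion(
    list_pod_names: Callable[[], List[str]],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> None:
    """Poll until list_pod_names() returns no pods; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = list_pod_names()
        if not remaining:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"pods still exist after {timeout}s: {remaining}")
        time.sleep(interval)


def abort_job(
    delete_pods: Callable[[], None],
    list_pod_names: Callable[[], List[str]],
    set_job_state: Callable[[str], None],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> None:
    """Hypothetical abort flow: the state is updated only after the pods are gone."""
    # Requesting deletion only starts the process; the pods may keep running briefly.
    delete_pods()
    # Block here so a still-running pod cannot report "completed" after the abort.
    wait_for_pods_deletion(list_pod_names, timeout=timeout, interval=interval)
    set_job_state("aborted")
```

Injecting the callables keeps the sketch independent of any specific k8s client; in a real setup `list_pod_names` would presumably list pods by the job's label selector.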

This change required changing the payload of the runtime resources grouped-by-job response. Previously it was a dict with the job uid as the key and the runtime resources as the value ({job-uid -> runtime-resources}). Now there is an additional project level: the first-level keys are project names, and each value is a dict whose keys are job uids and whose values are the runtime resources ({project-name -> {job-uid -> runtime-resources}}).
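As a rough illustration of the new payload shape, the sketch below groups a flat list of resource records into {project-name -> {job-uid -> runtime-resources}}. The record fields (`project`, `uid`, `name`) and the function name `group_by_project_and_job` are assumptions made for the example, not the actual schema or code from this PR.

```python
from collections import defaultdict
from typing import Any, Dict, List

Resource = Dict[str, Any]


def group_by_project_and_job(
    resources: List[Resource],
) -> Dict[str, Dict[str, List[Resource]]]:
    """Group resources first by project name, then by job uid."""
    grouped: Dict[str, Dict[str, List[Resource]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for resource in resources:
        # "project" and "uid" are hypothetical field names for this sketch.
        grouped[resource["project"]][resource["uid"]].append(resource)
    # Convert the nested defaultdicts to plain dicts for a clean response body.
    return {project: dict(jobs) for project, jobs in grouped.items()}


example = group_by_project_and_job(
    [
        {"project": "proj-a", "uid": "uid-1", "name": "pod-1"},
        {"project": "proj-a", "uid": "uid-1", "name": "pod-2"},
        {"project": "proj-b", "uid": "uid-2", "name": "pod-3"},
    ]
)
```

Having the project at the first level means a caller handling one job has both the project name and the uid available, which the old {job-uid -> runtime-resources} shape did not carry.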
Fixes https://jira.iguazeng.com/browse/ML-397
