[Runtimes] Wait for pods deletion when removing runtime resources #910

Hedingber · 2021-05-02T22:27:43Z

We had a bug in which a job that was aborted and moved to the aborted state eventually reached the completed state.

After investigating, we found a race condition. When we abort a job we delete its runtime resources, but the k8s package apparently doesn't wait until the resources are actually removed. The sequence was: we sent the abort request, k8s received the request to delete the pod and started the process, the API changed the job state to aborted, and then the job finished before the pod actually went down and sent a request to the API to change its state to completed. I added logic that waits for the pod to actually be gone and only then updates the job state.
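To illustrate the idea of waiting for the pod to be gone before updating the job state, here is a minimal polling sketch. It is not the code from this PR: the function name `wait_for_pods_deletion`, the `list_pod_names` callable, and the default timeout and interval are all assumptions chosen for the example, and the pod lister is injected so the sketch does not depend on any particular Kubernetes client call.

```python
import time
from typing import Callable, Iterable


def wait_for_pods_deletion(
    list_pod_names: Callable[[], Iterable[str]],  # hypothetical: returns names of pods that currently exist
    deleted_pod_names: Iterable[str],
    timeout: float = 60.0,  # assumed default, not taken from the PR
    interval: float = 2.0,  # assumed default, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until none of the deleted pods is listed anymore, or raise TimeoutError."""
    pending = set(deleted_pod_names)
    deadline = time.monotonic() + timeout
    while True:
        # Keep only the pods that the cluster still reports as existing.
        pending &= set(list_pod_names())
        if not pending:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"pods still exist after {timeout}s: {sorted(pending)}"
            )
        sleep(interval)
```

With a helper like this, the abort flow would issue the delete request, call the wait function, and only then set the job state to aborted, so a late "completed" update from a still-running pod cannot overwrite it.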

This change required changing the payload of the runtime resources grouped-by-job response. Instead of a dict with the job uid as the key and the runtime resources as the value ({job-uid -> runtime-resources}), there is now another level for the project: the first-level keys are project names, and the values are dicts whose keys are the job uids and whose values are the runtime resources ({project-name -> {job-uid -> runtime-resources}}).
Fixes https://jira.iguazeng.com/browse/ML-397
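The payload change described above can be sketched as follows. This is only an illustration of the two shapes: the record fields (`project`, `uid`, `name`) and the function name `group_by_project_and_job` are hypothetical and do not reflect the actual schema or code in this PR.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_project_and_job(
    resources: List[Dict[str, Any]],
) -> Dict[str, Dict[str, List[str]]]:
    """Group flat resource records as {project-name -> {job-uid -> runtime-resources}}."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for resource in resources:
        # Hypothetical record fields, used only for this illustration.
        grouped[resource["project"]][resource["uid"]].append(resource["name"])
    # Convert nested defaultdicts to plain dicts for a clean response payload.
    return {project: dict(jobs) for project, jobs in grouped.items()}


resources = [
    {"project": "proj-a", "uid": "uid-1", "name": "pod-1"},
    {"project": "proj-a", "uid": "uid-1", "name": "pod-2"},
    {"project": "proj-b", "uid": "uid-2", "name": "pod-3"},
]

# Old shape, keyed by job uid only: {"uid-1": [...], "uid-2": [...]}
# New shape, with the extra project level:
new_shape = group_by_project_and_job(resources)
```

The extra level lets a caller who knows both the project and the job uid look up exactly the resources it needs to wait on.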

This also required some name changes, support for multi-project group-by when listing runtime resources, and support for group-by in the Dask list runtime resources enrichment.


Hedingber added 16 commits May 2, 2021 21:55

Add local to runtime kinds enum and add abortable runtimes list

3b596d0

abort only abortable

24f6066

lint

50ea6ba

Add to frontend spec

2095c73

Add test

fa86f08

linting

b92bba6

Fixing tests

5aa6b98

Adding logic to wait for resources deletion

363ed71

required some name changes + supporting multi project grouped by list runtime resources + supporting group by in dask list runtime resources enrichment

formatting

7b36bb5

Merge branch 'development' of github.com:mlrun/mlrun into wait_for_pods_deletion

6926bc6

Fix tests and code

735c4ca

linting

e5fe992

fix typos

2ee9c33

Add logs

7ed0bea

Merge branch 'development' of github.com:mlrun/mlrun into wait_for_pods_deletion

9aaf8e6

Change interval

2349c66

eran-nussbaum mentioned this pull request May 3, 2021

Impl [Jobs] Pods: Always show pods, not only when pending mlrun/ui#537

Merged

Hedingber merged commit d5d3b1d into mlrun:development May 4, 2021
