[Runtimes] Don't resolve completion of runs per execution
#2888
Conversation
…lete_run_status_fix
Looks great mate! Just a couple of really minor things.
Generally this is a very sensitive change. I feel like some of the comments you added do not reflect exactly what you implemented, such as stating that only the API will update the state, while you have update_run_state on the client side as well, and runs can also be executed locally.
Please add tests that check the update state of a local run, and try to make your comments as clear as possible to avoid future misunderstandings.
mlrun/runtimes/base.py
Outdated
# TODO: add a 'workers' section in run objects state, each worker will update its state while
# the run state will be resolved by the server.
if kind != "mpijob":
    logger.debug(
        "Updating run state to completed", kind=kind, last_state=last_state
    )
    updates = {
        "status.last_update": now_date().isoformat(),
        "status.state": "completed",
    }
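The `updates` dict in the quoted snippet addresses nested fields of the run record with dot-separated keys such as `status.state`. As a rough illustration of how such a payload could be applied to a nested run dict (the helper name `apply_updates` is hypothetical, not mlrun's actual implementation):

```python
from datetime import datetime, timezone


def apply_updates(run: dict, updates: dict) -> dict:
    """Apply dot-path updates (e.g. "status.state") to a nested run dict.

    Hypothetical helper for illustration; not mlrun's real code.
    """
    for path, value in updates.items():
        *parents, leaf = path.split(".")
        node = run
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return run


run = {"metadata": {"name": "train"}, "status": {"state": "running"}}
updates = {
    "status.last_update": datetime.now(timezone.utc).isoformat(),
    "status.state": "completed",
}
apply_updates(run, updates)
print(run["status"]["state"])  # completed
```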
Not a fan of the base class knowing about a special case of a class that inherits from it. I would suggest using some helper function for that (where most of the classes will have the base implementation and MPI a different one).
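The suggestion above can be sketched with hypothetical names (the classes and method below are illustrative only, not mlrun's actual runtime classes): the base runtime carries the default behavior, the MPI runtime overrides it, and the caller never branches on `kind`.

```python
class BaseRuntime:
    """Illustrative base runtime: a single execution may resolve the run itself."""

    def resolves_completion_locally(self) -> bool:
        # Default: one execution per run, so it can mark the run completed.
        return True


class MpiRuntime(BaseRuntime):
    """Illustrative MPI runtime: many workers run, so only the server resolves state."""

    def resolves_completion_locally(self) -> bool:
        return False


def final_updates(runtime: BaseRuntime) -> dict:
    # No special-casing of kinds here; polymorphism decides.
    if runtime.resolves_completion_locally():
        return {"status.state": "completed"}
    return {}


print(final_updates(BaseRuntime()))  # {'status.state': 'completed'}
print(final_updates(MpiRuntime()))   # {}
```

Note the counter-argument raised later in the thread: on an MPI worker the effective kind is local, so the worker cannot pick the MPI override by itself, which limits this pattern.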
I agree with you, but we can't do this: keep in mind that when you are running on the MPI worker, you don't know your kind is mpi (your actual kind is local).
But you check that your kind isn't mpijob, so I am not sure I follow why you can't resolve that from the kind.
Launcher instead of Worker execution
Looks good to me. I have small comments/questions, but in general the change seems solid.
mlrun/runtimes/base.py
Outdated
# TODO: add a 'workers' section in run objects state, each worker will update its state while
# the run state will be resolved by the server.
if kind != "mpijob":
A question on this - is the situation we saw in MPI only relevant to MPI, or is there some other runtime that may see a similar issue? I assume Spark is off the hook since we use the CRD status for its status; what about a job with iterations? Or Dask?
Dask - mlrun is not running on the workers, so we're fine.
Spark - I don't know; if we run mlrun on the executor / driver it may be a problem.
Iterations - each iteration has its own RunObject, so we're good.
There is also the task_generator / hyper run, which I'm not familiar with and which should be considered.
We covered Spark offline - only 1 execution in the driver, so we're good.
With respect to hyper runs, there is an execution for each RunObject (all the child results are collected in a RunList).
Looks good, minor comments.
mlrun/execution.py
Outdated
"""
Modify and store the run state or mark an error
This method allows to set the run state to completed in the DB which is discouraged.
Completion of runs should be decided in the server.

- :param state: set run state
+ :param state: set execution state
This is confusing - you use both run_state and execution_state. Let's be aligned if we changed the state param to be execution state.
https://jira.iguazeng.com/browse/ML-3119

With mpijobs we update the run state by monitoring the run (mlrun/api/main.py:_monitor_runs). If the client is watching the logs, we may get a faster response from the API, since the endpoint checks the Launcher status if the run object is not in a terminal state.

This change affects all runtimes, since the execution no longer decides on run completion and only the API will.

Next phase: add a 'workers' section in the run object status that will track the state of each worker separately.
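The planned 'workers' section could be aggregated server-side roughly as follows. This is only a sketch of one plausible resolution rule, with assumed names and semantics (nothing here is from the merged implementation): the run completes only when every worker completed, fails if any worker errored, and otherwise stays running.

```python
def resolve_run_state(worker_states: dict) -> str:
    """Resolve an overall run state from per-worker states.

    Hypothetical rule for illustration: any error fails the run,
    all-completed completes it, otherwise the run is still running.
    """
    states = set(worker_states.values())
    if "error" in states:
        return "error"
    if states == {"completed"}:
        return "completed"
    return "running"


# Example run status with an assumed 'workers' section:
status = {"workers": {"worker-0": "completed", "worker-1": "running"}}
print(resolve_run_state(status["workers"]))  # running
```

Under this rule, the server (not any single execution) owns completion, which is exactly the invariant this PR establishes for mpijobs.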