Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Runtimes] Don't resolve completion of runs per execution #2888

Merged
merged 37 commits into from Jan 21, 2023

Conversation

alonmr
Copy link
Member

@alonmr alonmr commented Jan 9, 2023

https://jira.iguazeng.com/browse/ML-3119
With mpijobs we update the run state by monitoring the run (mlrun/api/main.py:_monitor_runs)
If the client is watching the logs, we may get a faster response from the API since the endpoint
checks the Launcher status if the run object is not in a terminal state.
This change affects all runtimes since the execution no longer decides on run completion and only the API will.
Next phase:
Add a workers section in the run object status that will track the state of each worker separately.

Copy link
Member

@quaark quaark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great mate! Just a couple really minor things

mlrun/execution.py Outdated Show resolved Hide resolved
mlrun/execution.py Outdated Show resolved Hide resolved
tests/api/runtimes/test_mpijob.py Outdated Show resolved Hide resolved
Copy link
Contributor

@Tankilevitch Tankilevitch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this is very sensitive change, I feel like some of the comments you added are not reflecting exactly what you implemented. such as stating the only the API will update the state while you have update_run_state on the client side as well as runs which can be executed locally.

Please add tests that checks the update state of a local run as well as try to be as much explainable in your comments to avoid future misunderstandings.

mlrun/execution.py Outdated Show resolved Hide resolved
mlrun/execution.py Show resolved Hide resolved
mlrun/execution.py Outdated Show resolved Hide resolved
mlrun/execution.py Outdated Show resolved Hide resolved
mlrun/execution.py Outdated Show resolved Hide resolved
mlrun/execution.py Outdated Show resolved Hide resolved
mlrun/runtimes/base.py Outdated Show resolved Hide resolved
Comment on lines 961 to 970
# TODO: add a 'workers' section in run objects state, each worker will update its state while
# the run state will be resolved by the server.
if kind != "mpijob":
logger.debug(
"Updating run state to completed", kind=kind, last_state=last_state
)
updates = {
"status.last_update": now_date().isoformat(),
"status.state": "completed",
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a fan of the base knowing about a special case of a class that inherits from it. I would suggest using some helper function for that ( where most of the classes will have the base implementation and the mpi a different one )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you but we can't do this
keep in mind when you are running on the mpi worker you don't know your kind is mpi (your actual kind is local)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you check that your kind isn't mpijob so I am not sure I follow why you can't resolve that from the kind

tests/system/runtimes/test_mpijob.py Show resolved Hide resolved
tests/system/runtimes/test_mpijob.py Outdated Show resolved Hide resolved
@alonmr alonmr changed the title [MPIJob] Resolve run state from Launcher instead of Worker [Runtimes] Don't resolve completion of runs per executions Jan 17, 2023
@alonmr alonmr changed the title [Runtimes] Don't resolve completion of runs per executions [Runtimes] Don't resolve completion of runs per execution Jan 17, 2023
@alonmr alonmr changed the title [Runtimes] Don't resolve completion of runs per execution [Runtimes] Don't resolve completion of runs per execution Jan 17, 2023
Copy link
Member

@theSaarco theSaarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me - I have small comments/questions, but in general the change seems solid.

mlrun/runtimes/base.py Outdated Show resolved Hide resolved

# TODO: add a 'workers' section in run objects state, each worker will update its state while
# the run state will be resolved by the server.
if kind != "mpijob":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question on this - is the situation we saw in mpi only relevant to mpi, or is there some other runtime that may see a similar issue? I assume spark is off the hook since we use the CRD status for its status, what about a job with iterations? Or dask?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dask - mlrun is not running on the workers so we're fine
Spark - I don't know if we run mlrun on the executer / driver it may be a problem
Iterations - each iteration has its own RunObject so we're good
there is also the task_genrator / hyper run which I'm not familiar with and should be considerd

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We covered spark offline - only 1 execution in the driver so we're good
with respect to hyper run there is an execution to each RunObject (all the child results are collected in RunList)

Copy link
Contributor

@Tankilevitch Tankilevitch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good minor comments

mlrun/execution.py Outdated Show resolved Hide resolved
mlrun/execution.py Show resolved Hide resolved
Comment on lines 879 to 884
"""
Modify and store the run state or mark an error
This method allows to set the run state to completed in the DB which is discouraged.
Completion of runs should be decided in the server.

:param state: set run state
:param state: set execution state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing, you use run_state and execution_state, lets be aligned if we changed the state param to be execution state

Comment on lines 961 to 970
# TODO: add a 'workers' section in run objects state, each worker will update its state while
# the run state will be resolved by the server.
if kind != "mpijob":
logger.debug(
"Updating run state to completed", kind=kind, last_state=last_state
)
updates = {
"status.last_update": now_date().isoformat(),
"status.state": "completed",
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you check that your kind isn't mpijob so I am not sure I follow why you can't resolve that from the kind

@Tankilevitch Tankilevitch merged commit 1229312 into mlrun:development Jan 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants