[Runtimes] Don't resolve completion of runs per execution
#2888
Conversation
…lete_run_status_fix
Looks great mate! Just a couple of really minor things.
Generally this is a very sensitive change. I feel like some of the comments you added do not reflect exactly what you implemented, such as stating that only the API will update the state, while you have update_run_state on the client side as well, and runs can also be executed locally.
Please add tests that check the update state of a local run, and try to make your comments as clear as possible to avoid future misunderstandings.
mlrun/runtimes/base.py
Outdated
# TODO: add a 'workers' section in run objects state, each worker will update its state while
# the run state will be resolved by the server.
if kind != "mpijob":
    logger.debug(
        "Updating run state to completed", kind=kind, last_state=last_state
    )
    updates = {
        "status.last_update": now_date().isoformat(),
        "status.state": "completed",
    }
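The `updates` dict in the quoted snippet addresses nested fields of the run record with dot-separated keys such as `status.state`. As a rough illustration of how such a payload could be applied to a nested run dict (the helper name `apply_updates` is hypothetical, not mlrun's actual implementation):

```python
from datetime import datetime, timezone


def apply_updates(run: dict, updates: dict) -> dict:
    """Apply dot-path updates (e.g. "status.state") to a nested run dict.

    Hypothetical helper for illustration; not mlrun's real code.
    """
    for path, value in updates.items():
        *parents, leaf = path.split(".")
        node = run
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return run


run = {"metadata": {"name": "train"}, "status": {"state": "running"}}
updates = {
    "status.last_update": datetime.now(timezone.utc).isoformat(),
    "status.state": "completed",
}
apply_updates(run, updates)
print(run["status"]["state"])  # completed
```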
Not a fan of the base class knowing about a special case of a class that inherits from it. I would suggest using some helper function for that (where most of the classes will have the base implementation and MPI a different one).
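The suggestion above can be sketched with hypothetical names (the classes and method below are illustrative only, not mlrun's actual runtime classes): the base runtime carries the default behavior, the MPI runtime overrides it, and the caller never branches on `kind`.

```python
class BaseRuntime:
    """Illustrative base runtime: a single execution may resolve the run itself."""

    def resolves_completion_locally(self) -> bool:
        # Default: one execution per run, so it can mark the run completed.
        return True


class MpiRuntime(BaseRuntime):
    """Illustrative MPI runtime: many workers run, so only the server resolves state."""

    def resolves_completion_locally(self) -> bool:
        return False


def final_updates(runtime: BaseRuntime) -> dict:
    # No special-casing of kinds here; polymorphism decides.
    if runtime.resolves_completion_locally():
        return {"status.state": "completed"}
    return {}


print(final_updates(BaseRuntime()))  # {'status.state': 'completed'}
print(final_updates(MpiRuntime()))   # {}
```

Note the counter-argument raised later in the thread: on an MPI worker the effective kind is local, so the worker cannot pick the MPI override by itself, which limits this pattern.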
I agree with you, but we can't do this: keep in mind that when you are running on the MPI worker, you don't know your kind is mpi (your actual kind is local).
But you check that your kind isn't mpijob, so I am not sure I follow why you can't resolve that from the kind.
Launcher instead of Worker execution
Looks good to me. I have small comments/questions, but in general the change seems solid.
mlrun/runtimes/base.py
Outdated
# TODO: add a 'workers' section in run objects state, each worker will update its state while
# the run state will be resolved by the server.
if kind != "mpijob":
A question on this - is the situation we saw in MPI only relevant to MPI, or is there some other runtime that may see a similar issue? I assume Spark is off the hook since we use the CRD status for its status; what about a job with iterations? Or Dask?
Dask - mlrun is not running on the workers, so we're fine.
Spark - I don't know; if we run mlrun on the executor / driver it may be a problem.
Iterations - each iteration has its own RunObject, so we're good.
There is also the task_generator / hyper run, which I'm not familiar with and which should be considered.
We covered Spark offline - only 1 execution in the driver, so we're good.
With respect to hyper runs, there is an execution for each RunObject (all the child results are collected in a RunList).
Looks good, minor comments.
mlrun/execution.py
Outdated
"""
Modify and store the run state or mark an error
This method allows to set the run state to completed in the DB which is discouraged.
Completion of runs should be decided in the server.

- :param state: set run state
+ :param state: set execution state
This is confusing - you use both run_state and execution_state. Let's be aligned if we changed the state param to be execution state.
https://jira.iguazeng.com/browse/ML-3119

With mpijobs we update the run state by monitoring the run (mlrun/api/main.py:_monitor_runs). If the client is watching the logs, we may get a faster response from the API, since the endpoint checks the Launcher status if the run object is not in a terminal state.

This change affects all runtimes, since the execution no longer decides on run completion and only the API will.

Next phase: add a 'workers' section in the run object status that will track the state of each worker separately.
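The planned 'workers' section could be aggregated server-side roughly as follows. This is only a sketch of one plausible resolution rule, with assumed names and semantics (nothing here is from the merged implementation): the run completes only when every worker completed, fails if any worker errored, and otherwise stays running.

```python
def resolve_run_state(worker_states: dict) -> str:
    """Resolve an overall run state from per-worker states.

    Hypothetical rule for illustration: any error fails the run,
    all-completed completes it, otherwise the run is still running.
    """
    states = set(worker_states.values())
    if "error" in states:
        return "error"
    if states == {"completed"}:
        return "completed"
    return "running"


# Example run status with an assumed 'workers' section:
status = {"workers": {"worker-0": "completed", "worker-1": "running"}}
print(resolve_run_state(status["workers"]))  # running
```

Under this rule, the server (not any single execution) owns completion, which is exactly the invariant this PR establishes for mpijobs.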