[Runtimes][Dask] Ensure that MLRUN_DBPATH is available for workers #4172
Conversation
@@ -457,7 +457,7 @@ def _run(self, runobj: RunObject, execution):
     handler = runobj.spec.handler
     self._force_handler(handler)

-    extra_env = self._generate_runtime_env(runobj)
+    extra_env = self.generate_runtime_env(runobj)
     environ.update(extra_env)
Not related, but this looks like a big hazard: it overrides the environ with runtime-specific env vars? If this code runs on the server side (and it might), it is exposed to a race condition of overridden env vars. I'm not sure where it is used afterwards, but let's add a TODO here to make sure we understand that it does not run on the server side.
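To illustrate the concern, a minimal, hypothetical sketch (not code from this repo) of how mutating the process-wide os.environ can race between concurrent runs in the same server process:

    import os
    from concurrent.futures import ThreadPoolExecutor

    def run_with_env(run_id: str) -> str:
        # each "run" installs its own dbpath into the shared process environment
        os.environ["MLRUN_DBPATH"] = f"http://db-for-{run_id}"
        # anything scheduled between the write above and the read below can
        # overwrite the value, so the read may belong to another run
        return os.environ["MLRUN_DBPATH"]

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_with_env, ["a", "b"]))
        # results may pair a run with the other run's dbpath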
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I considered removing this code, because it seems that its original intention was to mirror the responsibility split used when creating functions of type Job: in that case, it falls to _run to acquire the runtime variables (see here and here).
During my testing, however, while _run executes in the MLRun API pod for functions of type Job, it does not for functions of type Dask. In that case, _run seems to run only in the Dask scheduler pod, so it's already too late to pick up the traditional runtime variables (and also too late to incorrectly overwrite the environment on the MLRun API pod). That's why MLRUN_DBPATH wasn't originally available for Dask functions.
I added a TODO comment here so we can reconsider this code later. Should I create a Jira task as well?
🚀 TIL
Yes please!
For future reference, this discussion will be followed up in ML-4515.
env.extend(
    [{"name": k, "value": v} for k, v in function.generate_runtime_env().items()]
)
Code dup with mlrun/runtimes/mpijob/v1alpha1.py:

    extra_env = self.generate_runtime_env(runobj)
    extra_env = [{"name": k, "value": v} for k, v in extra_env.items()]

Perhaps you can expose a "generate_runtime_k8s_env" which uses _generate_runtime_env. wdyt?
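A minimal sketch of what the suggested wrapper could look like (the method name comes from this thread; its exact signature and placement inside the runtime base class, where RunObject is available, are assumptions):

    def generate_runtime_k8s_env(self, runobj: RunObject = None) -> list:
        # render the runtime env vars in the k8s container `env` list format,
        # so the Dask and MPIJob runtimes can share one implementation
        return [
            {"name": k, "value": v}
            for k, v in self._generate_runtime_env(runobj).items()
        ]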
Good thinking. I did what you suggested. See if it looks better now.
mlrun/runtimes/base.py (Outdated)
@@ -374,15 +374,16 @@ def _get_db_run(self, task: RunObject = None):
     if task:
         return task.to_dict()

-    def _generate_runtime_env(self, runobj: RunObject):
+    def generate_runtime_env(self, runobj: RunObject = None):
Add some docs, and move the function up: we (mostly try to) keep public functions at the top and private functions below.
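For illustration, a hedged sketch of the kind of docstring being asked for (the wording is mine, not the repo's):

    def generate_runtime_env(self, runobj: RunObject = None) -> dict:
        """Generate the runtime environment variables (e.g. MLRUN_DBPATH)
        that a runtime process needs in order to reach the MLRun API/DB.

        :param runobj: optional run object to derive run-specific variables from
        :returns: mapping of environment variable name to value
        """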
Good point. I created the public function you suggested in another comment and moved it up.
LGTM!
ML-3711
The traditional process for populating Job pods with MLRun-related environment variables doesn't work for Dask pods, since their scheduler is itself separate from the MLRun API pod and, because of that, is not aware of such variables.
This PR ensures that the MLRun API pod adds the relevant variables to the pod templates used to create both the scheduler and worker pods on the Dask runtime.
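As an illustration of the approach, a hedged sketch (not the PR's actual code: the helper name is hypothetical, and it assumes the kubernetes Python client's models):

    from kubernetes import client

    def enrich_dask_pod_template(
        pod_spec: client.V1PodSpec, runtime_env: dict
    ) -> client.V1PodSpec:
        # append the MLRun runtime env vars (e.g. MLRUN_DBPATH) to every
        # container, so both Dask scheduler and worker pods receive them
        extra = [client.V1EnvVar(name=k, value=v) for k, v in runtime_env.items()]
        for container in pod_spec.containers:
            container.env = (container.env or []) + extra
        return pod_spec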