Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Runtimes][Dask] Ensure that MLRUN_DBPATH is available for workers #4172

Merged
merged 3 commits into from
Aug 30, 2023

Conversation

laurybueno
Copy link
Contributor

ML-3711

The traditional process for populating Job pods with MLRun related environment variables doesn't work for Dask pods, since the their scheduler is itself separated from the MLRun API pod and, because of that, it's not aware of such variables.

This PR ensures that the MLRun API pod add the relevant variables to the pod templates used to create both scheduler and worker pods on the Dask runtime.

@laurybueno laurybueno marked this pull request as ready for review August 29, 2023 18:22
@@ -457,7 +457,7 @@ def _run(self, runobj: RunObject, execution):
handler = runobj.spec.handler
self._force_handler(handler)

extra_env = self._generate_runtime_env(runobj)
extra_env = self.generate_runtime_env(runobj)
environ.update(extra_env)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related, but this looks like a big hazard. is overrides the environ with runtime-specific envvars?
if code runs (and it might is) on server side, it is exposed to race condition of overridden envvars. im not sure where it is being used afterwards, but lets add a TODO here to make sure we understand it does not run on server-side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I considered removing this code, because it seems to me that its original intention was to mirror the responsibility split employed on the creation of functions with the type Job. In such case, it falls to _run the role of acquiring the runtime variables (see here and here).

During my testing, however, while _run executes in the MLRun API Pod for functions of type Job, it does not on functions of the type Dask. In this case, _run seem to only run in the Dask Scheduler Pod, so it's already too late to get the traditional runtime variables (and also too late to incorrectly overwrite the environment on MLRun API Pods). That's why MLRUN_DBPATH wasn't available originally for Dask functions.

I added a comment with a TODO here so we can reconsider this code later. Should I create a Jira task as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 TIL
Yes please!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, this discussion will be followed up on ML-4515.

Comment on lines 310 to 312
env.extend(
[{"name": k, "value": v} for k, v in function.generate_runtime_env().items()]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code dup with mlrun/runtimes/mpijob/v1alpha1.py

extra_env = self.generate_runtime_env(runobj)
extra_env = [{"name": k, "value": v} for k, v in extra_env.items()]

Perhaps you can have expose "generate_runtime_k8s_env" which uses _ generate_runtime_env. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thinking. I did what you suggested. See if it's looks better now.

@@ -374,15 +374,16 @@ def _get_db_run(self, task: RunObject = None):
if task:
return task.to_dict()

def _generate_runtime_env(self, runobj: RunObject):
def generate_runtime_env(self, runobj: RunObject = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some docs and move function above as we (mostly try) keep public functions up, private functions down

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I created the public function you suggested on another comment and moved it up.

Copy link
Member

@quaark quaark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -457,7 +457,7 @@ def _run(self, runobj: RunObject, execution):
handler = runobj.spec.handler
self._force_handler(handler)

extra_env = self._generate_runtime_env(runobj)
extra_env = self.generate_runtime_env(runobj)
environ.update(extra_env)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 TIL
Yes please!

@liranbg liranbg merged commit 163260a into mlrun:development Aug 30, 2023
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants