Support disjoint DAGs in MLP #6743

liangz1 · 2022-09-09T09:13:36Z

Related Issues/PRs

What changes are proposed in this pull request?

A pipeline can have multiple disjoint DAGs. Each DAG is composed of one or more steps.

Only reload self._steps from config files at the beginning of __init__(), run(), and inspect(), and assume it contains up-to-date info elsewhere. Previously we also reloaded in self._get_step().
Add _get_subgraph_for_target_step(target_step) to _BasePipeline. Pass the subgraph DAG to run_pipeline_step() instead of the full self._steps.
Add _get_default_step() to _BasePipeline, which defines the default step to run when no step is specified. Previously we ran self._steps[-1].
Implement the above methods in mlflow/pipelines/regression/v1/pipeline.py.
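
A minimal sketch of how the two new methods might fit together. This is not the MLflow implementation: the class `SketchPipeline`, the step names, and the `_dag_bounds` mapping are hypothetical, and steps are plain strings rather than step objects. Only the method names `_get_subgraph_for_target_step` and `_get_default_step` come from the description above.

```python
from typing import Dict, List, Optional, Tuple


class SketchPipeline:
    """Hypothetical pipeline holding two disjoint DAGs in one ordered step list."""

    # Ordered steps: the first six form a training DAG, the last two a scoring DAG.
    _steps: List[str] = [
        "ingest", "split", "transform", "train", "evaluate", "register",
        "ingest_scoring", "predict",
    ]
    # Half-open [start, end) bounds of each DAG inside _steps (assumed layout).
    _dag_bounds: Dict[str, Tuple[int, int]] = {"train": (0, 6), "scoring": (6, 8)}

    def _get_subgraph_for_target_step(self, target_step: str) -> List[str]:
        """Return only the steps of the DAG that contains target_step."""
        index = self._steps.index(target_step)
        for start, end in self._dag_bounds.values():
            if start <= index < end:
                return self._steps[start:end]
        raise ValueError(f"Step {target_step!r} is not part of any DAG")

    def _get_default_step(self) -> str:
        """Step to run when none is named: here, the last training step."""
        _, end = self._dag_bounds["train"]
        return self._steps[end - 1]

    def run(self, step: Optional[str] = None) -> List[str]:
        """Return the steps that would execute, in order, up to the target."""
        target = step or self._get_default_step()
        subgraph = self._get_subgraph_for_target_step(target)
        return subgraph[: subgraph.index(target) + 1]
```

In this sketch, `SketchPipeline().run()` stays within the training DAG, while `run("predict")` only touches the scoring DAG, so a step never pulls in steps from a DAG it does not belong to.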

How is this patch tested?

Existing tests
AND

I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change the documentation?

No. You can skip the rest of this section.
Yes. Make sure the changed pages / sections render correctly by following the steps below.

Click the Details link on the Preview docs check.
Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

Interface

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>

liangz1 · 2022-09-09T23:50:15Z

tests/pipelines/test_pipeline.py

-_STEP_NAMES = ["ingest", "split", "train", "transform", "evaluate"]
+# _STEP_NAMES must contain all step names that are expected to be executed when
+# `pipeline.run(step=None)` is called
+_STEP_NAMES = ["ingest", "split", "train", "transform", "evaluate", "register"]


An existing test already covers self._get_default_step(): test_pipelines_execution_directory_is_managed_as_expected() calls p.run() and asserts that outputs are found for all steps in _STEP_NAMES.

liangz1 · 2022-09-09T23:58:35Z

I don't understand why ci/circleci: build_doc is failing...
<unknown>:1: WARNING: py:class reference target not found: mlflow.pipelines.steps.predict.PredictStep
However, I don't see any reference to this class, and this PR should not be changing docs.

jinzhang21

LGTM with minor comments! Thanks, @liangz1 !

jinzhang21 · 2022-09-12T14:42:24Z

mlflow/pipelines/regression/v1/pipeline.py

+    _SUBGRAPH_INDICES_MAP = {
+        _TRAIN_DAG_NAME: (
+            _PIPELINE_STEPS.index(_TRAIN_DAG_STEPS[0]),
+            _PIPELINE_STEPS.index(_TRAIN_DAG_STEPS[-1]) + 1,


I'd prefer not to "+1" here, to respect what index returns for the last step of each DAG, and then on L222 return self._steps[s:e+1] to include the last step of each graph.
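
The two conventions discussed here select the same steps; they differ only in where the "+ 1" lives. A small illustration, with hypothetical step names standing in for the step classes in the diff:

```python
# Hypothetical flat step list and the steps that make up one DAG.
steps = ["ingest", "split", "transform", "train", "predict"]
train_dag = ["ingest", "split", "transform", "train"]

# Convention in the diff: store a half-open range, so the "+ 1" is in the map.
half_open = (steps.index(train_dag[0]), steps.index(train_dag[-1]) + 1)
start, end = half_open
subgraph_a = steps[start:end]

# Convention suggested in the review: store exactly what index() returns,
# and add 1 only at the point where the list is sliced.
inclusive = (steps.index(train_dag[0]), steps.index(train_dag[-1]))
s, e = inclusive
subgraph_b = steps[s : e + 1]
```

Either way the slice covers the whole DAG; the reviewer's version keeps the stored indices equal to the positions of real steps.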

jinzhang21 · 2022-09-12T14:44:20Z

mlflow/pipelines/pipeline.py

@@ -140,7 +143,7 @@ def clean(self, step: str = None) -> None:

    def _get_step(self, step_name) -> BaseStep:
        """Returns a step class object from the pipeline."""
-        steps = self._steps or self._resolve_pipeline_steps()


What's the reason to remove self._resolve_pipeline_steps()? Does it cause any harm?

(This is related to the first item in my PR description) self._resolve_pipeline_steps() will return new BaseStep instances. This line means self._steps may contain stale information so _get_step() also tries to get the step info from the config file. This PR changes the logic to make self._steps always up-to-date (by reloading before each pipeline action) so we don't reload elsewhere (here).

Got it. Leaving the code as is should not cause any harm either, since self._steps is always populated at this point. But it could serve as a stopgap in the future if someone forgets to update the steps.
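
The reload policy described in this exchange can be sketched as follows. The class `ReloadingPipeline` and its `load_config` callable are hypothetical stand-ins for reading the config files; only the method names `run`, `inspect`, `_get_step`, and `_resolve_pipeline_steps` mirror the ones mentioned in the thread.

```python
from typing import Callable, List, Optional


class ReloadingPipeline:
    """Hypothetical pipeline that refreshes its steps only at entry points."""

    def __init__(self, load_config: Callable[[], List[str]]):
        self._load_config = load_config  # stand-in for reading config files
        self._steps = self._resolve_pipeline_steps()

    def _resolve_pipeline_steps(self) -> List[str]:
        # Builds a fresh list every time, like creating new step instances.
        return list(self._load_config())

    def run(self, step: Optional[str] = None) -> str:
        self._steps = self._resolve_pipeline_steps()  # reload at entry point
        return self._get_step(step or self._steps[-1])

    def inspect(self) -> List[str]:
        self._steps = self._resolve_pipeline_steps()  # reload at entry point
        return list(self._steps)

    def _get_step(self, step_name: str) -> str:
        # No reload here: self._steps is assumed fresh from the calling action.
        return self._steps[self._steps.index(step_name)]
```

With this layout, a config change made between two actions is picked up by the next `run()` or `inspect()`, and internal helpers never see a mix of old and new step objects within one action.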

harupy · 2022-09-12T15:11:46Z

mlflow/pipelines/regression/v1/pipeline.py

+    for step_class in _SCORING_DAG_STEPS:
+        _STEPS_SUBGRAPH_MAP[step_class] = _SCORING_DAG_NAME


@liangz1 step_class here is referencing PredictStep (the last element in _SCORING_DAG_STEPS), which caused <unknown>:1: WARNING: py:class reference target not found: mlflow.pipelines.steps.predict.PredictStep

Suggested change

-    for step_class in _SCORING_DAG_STEPS:
-        _STEPS_SUBGRAPH_MAP[step_class] = _SCORING_DAG_NAME
+    for _step_class in _TRAIN_DAG_STEPS:

should fix the error.

Thank you for explaining! This looks tricky to me :D
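
One plausible reading of why the rename helps, offered as an illustration rather than a confirmed diagnosis: a loop variable used in a class body stays behind as a class attribute once the loop ends, and documentation tools such as Sphinx autodoc skip underscore-prefixed members by default. The classes `Leaky` and `Tidy` below are hypothetical, with `int` and `str` standing in for the step classes.

```python
class Leaky:
    _MAP = {}
    # The loop variable is bound in the class namespace and survives the loop,
    # so Leaky.step_class becomes a public attribute holding the last element.
    for step_class in (int, str):
        _MAP[step_class] = "dag"


class Tidy:
    _MAP = {}
    # Same leak, but the leading underscore marks the leftover name as private.
    for _step_class in (int, str):
        _MAP[_step_class] = "dag"


# Names in each class namespace that a doc tool would treat as public.
public_leaky = [name for name in vars(Leaky) if not name.startswith("_")]
public_tidy = [name for name in vars(Tidy) if not name.startswith("_")]
```

Here `Leaky` exposes a public `step_class` attribute pointing at the last class in the tuple, which matches the reviewer's observation that `step_class` ended up referencing the final element of `_SCORING_DAG_STEPS`.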


liangz1 added 2 commits September 9, 2022 00:00

clarify when we should call self._resolve_pipeline_steps()

bdc1d2a

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>

implement

350ad4c

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>

liangz1 added the rn/none and area/recipes labels Sep 9, 2022

liangz1 requested review from dbczumar and jinzhang21 September 9, 2022 09:13

liangz1 added 3 commits September 9, 2022 11:09

fix index

8682c15

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>

fix default step

2a46655

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>

add tests

694222a

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>

jinzhang21 approved these changes Sep 12, 2022

harupy reviewed Sep 12, 2022

liangz1 added 2 commits September 12, 2022 08:49

fix doc issue

57ee361

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>

address comment

6c78cae

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>

liangz1 merged commit b2f1569 into mlflow:master Sep 12, 2022

nnethery pushed a commit to nnethery/mlflow that referenced this pull request Feb 1, 2024

Support disjoint DAGs in MLP (mlflow#6743)

96fd37c
