Support disjoint DAGs in MLP #6743

Merged
merged 7 commits into from
Sep 12, 2022

Collaborator

@liangz1 liangz1 commented Sep 9, 2022

Related Issues/PRs

What changes are proposed in this pull request?

A pipeline can have multiple disjoint DAGs. Each DAG is composed of one or more steps.

  1. Reload self._steps from the config files only at the beginning of __init__(), run(), and inspect(), and assume it is up to date everywhere else. Previously, we also reloaded it in self._get_step().
  2. Add _get_subgraph_for_target_step(target_step) to _BasePipeline. Pass the subgraph DAG to run_pipeline_step() instead of the full self._steps.
  3. Add _get_default_step() to _BasePipeline, which defines the default step to run when no step is specified. Previously, we ran self._steps[-1].
  4. Implement the above methods in mlflow/pipelines/regression/v1/pipeline.py.

How is this patch tested?

Existing tests
AND

  • I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Click the Details link on the Preview docs check.
  2. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Liang Zhang <liang.zhang@databricks.com>
@liangz1 liangz1 added the rn/none (List under Small Changes in Changelogs.) and area/recipes (MLflow Recipes, Recipes APIs, Recipes configs, Recipe Templates) labels Sep 9, 2022
Signed-off-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Liang Zhang <liang.zhang@databricks.com>
- _STEP_NAMES = ["ingest", "split", "train", "transform", "evaluate"]
+ # _STEP_NAMES must contain all step names that are expected to be executed when
+ # `pipeline.run(step=None)` is called
+ _STEP_NAMES = ["ingest", "split", "train", "transform", "evaluate", "register"]
Collaborator Author

Existing test already covers testing self._get_default_step(): test_pipelines_execution_directory_is_managed_as_expected() calls p.run() and asserts outputs are found for all steps in _STEP_NAMES.

Collaborator Author

liangz1 commented Sep 9, 2022

I don't understand why ci/circleci: build_doc is failing...
<unknown>:1: WARNING: py:class reference target not found: mlflow.pipelines.steps.predict.PredictStep
However, I don't see any reference to this class, and this PR should not be changing docs.

Collaborator

@jinzhang21 jinzhang21 left a comment

LGTM with minor comments! Thanks, @liangz1 !

_SUBGRAPH_INDICES_MAP = {
    _TRAIN_DAG_NAME: (
        _PIPELINE_STEPS.index(_TRAIN_DAG_STEPS[0]),
        _PIPELINE_STEPS.index(_TRAIN_DAG_STEPS[-1]) + 1,
Collaborator

I'd prefer not to add 1 here, so that the map stores exactly what index() returns for the last step of each DAG. Then, on L222, return self._steps[s:e+1] to include the last step of each graph.
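
For reference, the two conventions discussed here select the same steps; they differ only in where the +1 lives. A small sketch with made-up step names (not the PR's step classes):

```python
# Illustrative step lists; names are assumptions for the example.
steps = ["ingest", "split", "train", "predict"]
train_dag = ["ingest", "split", "train"]

# Convention in the diff: store a half-open range (end index already +1).
start, end = steps.index(train_dag[0]), steps.index(train_dag[-1]) + 1
half_open = steps[start:end]

# Convention suggested in this comment: store the inclusive last index
# and add 1 only when slicing.
s, e = steps.index(train_dag[0]), steps.index(train_dag[-1])
inclusive = steps[s : e + 1]

print(half_open == inclusive)
```

The half-open form matches Python's slice semantics directly, while the inclusive form keeps the stored value equal to what index() returns.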

@@ -140,7 +143,7 @@ def clean(self, step: str = None) -> None:

    def _get_step(self, step_name) -> BaseStep:
        """Returns a step class object from the pipeline."""
        steps = self._steps or self._resolve_pipeline_steps()
Collaborator

What's the reason to remove self._resolve_pipeline_steps()? Does it cause any harm?

Collaborator Author

(This is related to the first item in my PR description.) self._resolve_pipeline_steps() returns new BaseStep instances. This line implies that self._steps may contain stale information, so _get_step() also tries to get the step info from the config file. This PR changes the logic so that self._steps is always up to date (by reloading it before each pipeline action), so we no longer reload it elsewhere, including here.

Collaborator

Got it. Leaving the code as is should not cause any harm either, since self._steps is always populated at this point. But it could serve as a safeguard in the future if someone forgets to update the steps.
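
The reload policy discussed in this thread can be sketched as follows. This is a simplified stand-in, not MLflow code: the Step and Pipeline classes and the config_source callable are hypothetical, and only the method names _resolve_pipeline_steps and _get_step come from the PR.

```python
from typing import Callable, Dict, List


class Step:
    """Hypothetical stand-in for BaseStep: snapshots its config on creation."""

    def __init__(self, name: str, config: dict):
        self.name = name
        self.config = dict(config)


class Pipeline:
    def __init__(self, config_source: Callable[[], Dict[str, dict]]):
        # config_source is an assumed callable that re-reads the config files.
        self._config_source = config_source
        self._steps = self._resolve_pipeline_steps()

    def _resolve_pipeline_steps(self) -> List[Step]:
        # Builds new Step instances from the current config every time.
        return [Step(name, cfg) for name, cfg in self._config_source().items()]

    def _get_step(self, step_name: str) -> Step:
        # No reload here: trusts that self._steps was refreshed by the caller.
        for step in self._steps:
            if step.name == step_name:
                return step
        raise KeyError(step_name)

    def run(self, step_name: str) -> dict:
        # Reload once at the start of the pipeline action.
        self._steps = self._resolve_pipeline_steps()
        return self._get_step(step_name).config


if __name__ == "__main__":
    config = {"train": {"lr": 0.1}}
    pipeline = Pipeline(lambda: config)
    config["train"] = {"lr": 0.5}  # simulate an edited config file
    print(pipeline.run("train"))
```

Reloading in one place per action keeps every lookup within that action consistent, at the cost of relying on each new action to remember the reload.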

Comment on lines 198 to 199
for step_class in _SCORING_DAG_STEPS:
    _STEPS_SUBGRAPH_MAP[step_class] = _SCORING_DAG_NAME
Member

@liangz1 step_class here is referencing PredictStep (the last element in _SCORING_DAG_STEPS), which caused <unknown>:1: WARNING: py:class reference target not found: mlflow.pipelines.steps.predict.PredictStep

Suggested change
for step_class in _SCORING_DAG_STEPS:
    _STEPS_SUBGRAPH_MAP[step_class] = _SCORING_DAG_NAME
for _step_class in _TRAIN_DAG_STEPS:

should fix the error.

Collaborator Author

Thank you for explaining! This looks tricky to me :D
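
The underlying Python behavior is that a for loop at module level leaves its loop variable behind as a module attribute, bound to the last item. The sketch below demonstrates this with a throwaway module built from a source string; the module name and step names are made up, and the idea that the docs build skips underscore-prefixed names is an assumption consistent with the suggested fix rather than something verified here.

```python
import types

# Source of a hypothetical module with two module-level loops.
source = """
for step_class in ["IngestStep", "PredictStep"]:
    pass
for _step_class in ["IngestStep", "PredictStep"]:
    pass
"""

module = types.ModuleType("demo_pipeline")
exec(source, module.__dict__)

# Both loop variables survive the loop and stay bound to the last element.
print(module.step_class, module._step_class)

# Only the name without a leading underscore looks like public API.
public_names = [name for name in vars(module) if not name.startswith("_")]
print(public_names)
```

Renaming the loop variable with a leading underscore keeps it out of the set of public-looking names without changing the loop's behavior.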

Signed-off-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Liang Zhang <liang.zhang@databricks.com>
@liangz1 liangz1 merged commit b2f1569 into mlflow:master Sep 12, 2022
nnethery pushed a commit to nnethery/mlflow that referenced this pull request Feb 1, 2024
A pipeline can have multiple disjoint DAGs. Each DAG is composed of one or more steps.
Labels
area/recipes MLflow Recipes, Recipes APIs, Recipes configs, Recipe Templates rn/none List under Small Changes in Changelogs.
3 participants