Add docs for artifact proxy #5320

Merged 4 commits on Jan 28, 2022

Conversation

BenWilson2
Member

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

What changes are proposed in this pull request?

Adding documentation for proxy artifact handling mode for the MLflow server.

How is this patch tested?

Local Sphinx doc build.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Check the status of the ci/circleci: build_doc check. If it's successful, proceed to the
    next step, otherwise fix it.
  2. Click Details on the right to open the job page of CircleCI.
  3. Click the Artifacts tab.
  4. Click docs/build/html/index.html.
  5. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Documentation added for using the MLflow server in proxy artifact mode and in exclusive artifact handling mode.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 added the area/docs label (Documentation issues) on Jan 26, 2022
@github-actions bot added the rn/documentation label (Mention under Documentation Changes in Changelogs.) on Jan 26, 2022
@@ -83,6 +83,9 @@ and a variety of remote file storage solutions. For storing runs and artifacts,
MLflow entities (runs, parameters, metrics, tags, notes, metadata, etc), the artifact store persists artifacts
(files, models, images, in-memory objects, or model summary, etc).

The MLflow client can be configured with an HTTP proxy, passing artifact requests through the tracking server to store and retrieve artifacts without having to specify the fully qualified path to the artifacts.
Collaborator

This is an MLflow server configuration, not a client configuration.

Member Author

yikes yeah. fixed.

Comment on lines 186 to 187
model artifacts, images, documents, and files. This eliminates the need to allow end users to have direct path access to a remote file store for artifact handling and eliminates the
need for an end-user to provide access credentials to interact with an underlying file store.
Collaborator

Nit: "File store" is a bit of an overloaded term in MLflow, generally referring to the FileStore implementation of the backend metadata store API. Can we use "object store" instead?

Member Author

fixed and added references to object stores (just in case)

--host 0.0.0.0 \
--port 8889 \
--serve-artifacts \
--default-artifact-root s3://my-mlflow-bucket/ \
Collaborator

Is this right? I think this should be --artifacts-destination.

mlflow server \
--host 0.0.0.0 \
--port 8885 \
--default-artifact-root hdfs://myhost:8887/mlprojects/models \
Collaborator

Suggested change
- --default-artifact-root hdfs://myhost:8887/mlprojects/models \
+ --artifacts-destination hdfs://myhost:8887/mlprojects/models \

Member Author

updated both of these references
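
For reference, the corrected invocation discussed in this thread can be sketched as a small Python helper that assembles the `mlflow server` argument list. The function name `build_server_command` is hypothetical and only for illustration; the flags are the ones quoted in this review, with `--artifacts-destination` used in place of `--default-artifact-root`.

```python
import shlex


def build_server_command(host, port, artifacts_destination, artifacts_only=False):
    """Assemble an `mlflow server` argument list for proxied artifact access.

    Hypothetical helper for illustration; it only builds the command and
    does not start a server.
    """
    argv = [
        "mlflow", "server",
        "--host", host,
        "--port", str(port),
        "--serve-artifacts",
        # Storage location the server proxies to, per the review suggestion.
        "--artifacts-destination", artifacts_destination,
    ]
    if artifacts_only:
        # Restrict the instance to artifact-related requests only.
        argv.append("--artifacts-only")
    return argv


if __name__ == "__main__":
    # Print a shell-ready command line for the S3 example from the diff.
    print(shlex.join(build_server_command("0.0.0.0", 8889, "s3://my-mlflow-bucket/")))
```

The returned list could be passed to `subprocess.run` on a machine where MLflow is installed; that step is left out here so the sketch has no external dependencies.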

Comment on lines 1052 to 1053
Using this tracking server for any tasks aside from the aforementioned ``mlflow.<flavor>.log_model()``,
``mlflow.<flavor>.load_model()`` and ``client.list_artifacts(<run_id>)`` will throw an ``MLflowException``.
Collaborator

@dbczumar Jan 26, 2022

There are other methods used for logging artifacts, e.g. mlflow.log_artifact(). Perhaps we should update the phrasing to indicate that this isn't an exhaustive list of artifact-based APIs. These examples are also Python-specific.

Member Author

rephrased the entire paragraph.

@@ -750,6 +833,18 @@ location to server's artifact store. This will be used as artifact location for
experiments that do not specify one. Once you create an experiment, ``--default-artifact-root``
is no longer relevant to that experiment.

Starting a server with the ``--serve-artifacts`` option defined will enable proxy access for artifacts.
Collaborator

Nit: Instead of using future-tense "will" throughout the docs here, can we use present-tense phrasing (not an English expert, these are probably inaccurate tense names):

Suggested change
- Starting a server with the ``--serve-artifacts`` option defined will enable proxy access for artifacts.
+ Starting a server with the ``--serve-artifacts`` option defined enables proxy access for artifacts.

One more small phrasing suggestion:

Suggested change
- Starting a server with the ``--serve-artifacts`` option defined will enable proxy access for artifacts.
+ Starting a server with the ``--serve-artifacts`` option defined enables proxied access for artifacts.

Member Author

great call-out. Modified it here and a few other places.

Comment on lines 837 to 838
The uri ``mlflow-artifacts:/`` will serve in place of the ``--default-artifact-root`` path configured during
server start to refer to artifact storage locations.
Collaborator

Suggested change
- The uri ``mlflow-artifacts:/`` will serve in place of the ``--default-artifact-root`` path configured during
- server start to refer to artifact storage locations.
+ The URI ``mlflow-artifacts:/`` is the default value for ``--default-artifact-root`` when ``--serve-artifacts`` is specified. This indicates to clients that artifacts can be accessed via HTTP requests to the MLflow Tracking Server.

Collaborator

Can we also enumerate and explain the use cases for some other proxied artifact --default-artifact-root values? E.g. mlflow-artifacts://<tracking_server_host_name>/sub/path or https://<host_name>/sub/path?

Member Author

Added a few listed examples - please check to make sure I'm capturing what you're looking for here or if I completely missed the plot.

Comment on lines 841 to 843
When operating an MLflow tracking server with ``--serve-artifacts`` option enabled, the parameter
``--default-artifact-root`` will be set to ``mlflow-artifacts:/`` as a proxy root to the file store location
for artifact storage. Artifact handling can be accomplished through the use of this proxy uri. Access credentials
Collaborator

These first two sentences are redundant given the information above, right?

Member Author

definitely. removed them.

@@ -934,6 +1029,45 @@ You can then pass authentication headers to MLflow using these :ref:`environment
Additionally, you should ensure that the ``--backend-store-uri`` (which defaults to the
``./mlruns`` directory) points to a persistent (non-ephemeral) disk or database connection.

.. _artifact_only_mode:

Using the Tracking Server exclusively for proxy artifact access
Collaborator

@dbczumar Jan 26, 2022

Can we use "proxied artifact access", rather than "proxy artifact access", throughout the docs?

Suggested change
- Using the Tracking Server exclusively for proxy artifact access
+ Using the Tracking Server exclusively for proxied artifact access

Member Author

changed 17 references.

Comment on lines 205 to 206
* Logging events for artifacts are made to the ``mlflow-artifacts:/`` uri for saving
* The Tracking server, serving as proxy, interfaces with the configured backend store, using its configured authorization to interact with the file store for writing artifacts
Collaborator

Specifically, the client uses HttpArtifactRepository (`class HttpArtifactRepository(ArtifactRepository):`) to upload and download files to / from the remote host where the MLflow Tracking server is running.

Member Author

fixed the wording here.
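
As a rough illustration of the proxied flow described in this thread, the sketch below builds the HTTP URL on the Tracking server that a client-side upload or download for a given artifact path might target. The REST prefix `/api/2.0/mlflow-artifacts/artifacts` and the helper name `proxied_artifact_url` are assumptions made for this example; this is not the `HttpArtifactRepository` implementation.

```python
from urllib.parse import quote, urlparse, urlunparse

# Assumed REST prefix under which the Tracking server proxies artifact requests.
ARTIFACTS_ENDPOINT = "/api/2.0/mlflow-artifacts/artifacts"


def proxied_artifact_url(tracking_uri, artifact_path):
    """Build the Tracking server URL for one artifact path.

    Hypothetical helper: the client talks HTTP to the Tracking server,
    which in turn reads from or writes to the configured object store.
    """
    parsed = urlparse(tracking_uri)
    # Keep any path prefix of the tracking URI, then append the endpoint
    # and the percent-encoded artifact path.
    path = "{}{}/{}".format(
        parsed.path.rstrip("/"),
        ARTIFACTS_ENDPOINT,
        quote(artifact_path.lstrip("/")),
    )
    return urlunparse((parsed.scheme, parsed.netloc, path, "", "", ""))


if __name__ == "__main__":
    print(proxied_artifact_url("http://localhost:8889", "model/MLmodel"))
```

The point of the sketch is that the client only needs the Tracking server's address; object store credentials stay on the server side.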

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
…21-mlflow-artifacts-docs

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Collaborator

@dbczumar left a comment

@BenWilson2 Left a few more comments. LGTM once they're addressed - feel free to merge after that. Thanks so much!

@@ -83,6 +83,9 @@ and a variety of remote file storage solutions. For storing runs and artifacts,
MLflow entities (runs, parameters, metrics, tags, notes, metadata, etc), the artifact store persists artifacts
(files, models, images, in-memory objects, or model summary, etc).

The MLflow server can be configured with an HTTP proxy, passing artifact requests through the tracking server to store and retrieve artifacts without having to specify the fully qualified path to the artifacts.
Collaborator

Suggested change
- The MLflow server can be configured with an HTTP proxy, passing artifact requests through the tracking server to store and retrieve artifacts without having to specify the fully qualified path to the artifacts.
+ The MLflow server can be configured with an artifacts HTTP proxy, passing artifact requests through the tracking server to store and retrieve artifacts without having to interact with underlying object storage services.

Comment on lines 216 to 217
Administrators who are enabling this feature should ensure that the access level granted to the Tracking Server for artifact operations meets with
any security requirements prior to enabling the Tracking Server to operate in a proxied file handling role.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Administrators who are enabling this feature should ensure that the access level granted to the Tracking Server for artifact operations meets with
- any security requirements prior to enabling the Tracking Server to operate in a proxied file handling role.
+ Administrators who are enabling this feature should ensure that the access level granted to the Tracking Server for artifact operations meets all
+ security requirements prior to enabling the Tracking Server to operate in a proxied file handling role.

Comment on lines 224 to 226
MLflow's Tracking Server can be used in an exclusive artifact proxied access file handling role. The configuration at server start by adding the flag
``--artifacts-only`` will restrict a Tracking Server instance to be used as a bastion host to an underlying file store without allowing for Tracking Server
functionality apart from artifact handling. See: :ref:`artifact_only_mode` for more details.
Collaborator

"bastion host" is somewhat bespoke terminology. Can we simplify the phrasing here?

Suggested change
- MLflow's Tracking Server can be used in an exclusive artifact proxied access file handling role. The configuration at server start by adding the flag
- ``--artifacts-only`` will restrict a Tracking Server instance to be used as a bastion host to an underlying file store without allowing for Tracking Server
- functionality apart from artifact handling. See: :ref:`artifact_only_mode` for more details.
+ MLflow's Tracking Server can be used in an exclusive proxied artifact handling role. Specifying the
+ ``--artifacts-only`` flag restricts an MLflow server instance to only serve artifact-related API requests by
+ proxying to an underlying object store.


.. figure:: _static/images/scenario_6.png

Enabling the Tracking Server in ``--artifacts-only`` mode when enabling the instance to proxied artifact operations via ``--serve-artifacts`` disables other Tracking API functionality:
Collaborator

I think this is already stated in the note immediately above & can be removed.

@@ -685,6 +755,20 @@ You run an MLflow tracking server using ``mlflow server``. An example configura
--default-artifact-root s3://my-mlflow-bucket/ \
--host 0.0.0.0

An MLflow tracking server can also be run as a proxied artifact handler. An example configuration for the ``mlflow server`` in this mode is:
Collaborator

Suggested change
- An MLflow tracking server can also be run as a proxied artifact handler. An example configuration for the ``mlflow server`` in this mode is:
+ An MLflow Tracking server can also be run as a proxied artifact handler. An example configuration for the ``mlflow server`` in this mode is:

* ``http://<host>/mlartifacts``
* ``mlflow-artifacts://<host>/mlartifacts``
* ``mlflow-artifacts://<host>:<port>/mlartifacts``
* ``mlflow-artifacts:/mlartifacts``
Collaborator

Can we clarify that, when the host is absent from the mlflow-artifacts URI, the client assumes that the host is the same as the host component of the tracking URI?
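
The fallback described in this comment can be sketched as follows: if the `mlflow-artifacts` URI carries no host, the host component of the tracking URI is used instead. The helper name `resolve_artifact_host` and the example host names are hypothetical, and this is a simplified illustration rather than the client's actual resolution code.

```python
from urllib.parse import urlparse


def resolve_artifact_host(artifact_uri, tracking_uri):
    """Return the host[:port] a client would contact for `artifact_uri`.

    Hypothetical helper: an explicit host in the artifact URI wins;
    otherwise the host of the tracking URI is assumed.
    """
    parsed = urlparse(artifact_uri)
    if parsed.scheme != "mlflow-artifacts":
        raise ValueError("expected an mlflow-artifacts URI, got: " + artifact_uri)
    # `netloc` is empty for URIs of the form mlflow-artifacts:/some/path.
    return parsed.netloc or urlparse(tracking_uri).netloc


if __name__ == "__main__":
    tracking = "http://tracking.example.com:5000"
    print(resolve_artifact_host("mlflow-artifacts:/mlartifacts", tracking))
    print(resolve_artifact_host("mlflow-artifacts://artifacts.example.com:8889/mlartifacts", tracking))
```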

Using the Tracking Server exclusively for proxied artifact access
-----------------------------------------------------------------

To use an instance of the MLflow tracking server *exclusively* for artifact operations ( :ref:`scenario_6` ),
Collaborator

Suggested change
- To use an instance of the MLflow tracking server *exclusively* for artifact operations ( :ref:`scenario_6` ),
+ To use an instance of the MLflow Tracking server *exclusively* for artifact operations ( :ref:`scenario_6` ),


To use an instance of the MLflow tracking server *exclusively* for artifact operations ( :ref:`scenario_6` ),
start a server with the optional parameters ``--serve-artifacts`` to enable proxied artifact access and ``--artifacts-only``
for disabling all other functionality of the tracking server.
Collaborator

Suggested change
- for disabling all other functionality of the tracking server.
+ for disabling all other functionality of the Tracking server.

--serve-artifacts \
--artifacts-only

Using a tracking server configured in ``--artifacts-only`` mode for any tasks aside from those concerned with artifact
Collaborator

Suggested change
- Using a tracking server configured in ``--artifacts-only`` mode for any tasks aside from those concerned with artifact
+ Using an MLflow server configured in ``--artifacts-only`` mode for any tasks aside from those concerned with artifact

Comment on lines 1078 to 1080
Using this mode to control access to artifacts without exposing the server endpoint's ability to create experiments,
manage runs, or perform any action apart from artifact handling can be useful in some scenarios (continuous deployment
by an external team, for instance).
Collaborator

I think the biggest motivator is decoupling the compute / infrastructure used for serving these different types of requests, since each type of request (artifacts vs metadata) has different performance requirements and usage characteristics.

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 merged commit a1faae7 into master on Jan 28, 2022
@BenWilson2 deleted the ML-18421-mlflow-artifacts-docs branch on January 28, 2022 20:10
BenWilson2 added a commit that referenced this pull request on Feb 14, 2022
* Add docs for artifact proxy

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* PR fixes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>

* Fixes

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>