[air][doc] Update docs to reflect head node syncing deprecation (ray-project#37475)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
justinvyu authored and harborn committed Aug 17, 2023
1 parent 63f80f7 commit a6016b6
Showing 4 changed files with 116 additions and 96 deletions.
8 changes: 1 addition & 7 deletions doc/source/tune/doc_code/faq.py
@@ -363,13 +363,7 @@ def wait(self):

tuner = tune.Tuner(
train_fn,
run_config=air.RunConfig(
storage_path="/path/to/shared/storage",
),
sync_config=tune.SyncConfig(
# Do not sync because we are on shared storage
syncer=None
),
run_config=air.RunConfig(storage_path="/path/to/shared/storage"),
)
tuner.fit()
# __sync_config_end__
8 changes: 5 additions & 3 deletions doc/source/tune/tutorials/tune-output.rst
@@ -269,9 +269,11 @@ should be configured to log to the Trainable's *working directory.* By default,
the current working directory of both functional and class trainables is set to the
corresponding trial directory once it's been launched as a remote Ray actor.
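
For instance, here is a minimal sketch of a function trainable that logs an artifact into its
working directory (the file name, metric name, and experiment name are illustrative placeholders;
``session.report`` follows the Ray AIR session API):

.. code-block:: python

    import os

    from ray import air, tune
    from ray.air import session


    def train_fn(config):
        for step in range(3):
            # The working directory is the trial directory, so this file
            # becomes a per-trial artifact.
            with open(os.path.join(os.getcwd(), "artifact.txt"), "a") as f:
                f.write(f"step {step}\n")
            session.report({"score": step})


    tuner = tune.Tuner(train_fn, run_config=air.RunConfig(name="artifact_logging_example"))
    tuner.fit()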

When running with multiple nodes using the :ref:`default syncing method <tune-default-syncing>`,
trial artifacts are synchronized to the driver node under the specified path.
This will allow you to visualize and analyze logs of all distributed training workers on a single machine.
.. warning::

When running in a multi-node cluster using the *deprecated* :ref:`head node storage option <tune-default-syncing>`,
trial artifacts are synchronized to the driver node under the specified path.
This will allow you to visualize and analyze logs of all distributed training workers on a single machine.

When :ref:`specifying a cloud upload directory <tune-cloud-checkpointing>`, trial artifacts are uploaded to that cloud bucket
for later analysis. Note that the driver node does not necessarily contain
157 changes: 90 additions & 67 deletions doc/source/tune/tutorials/tune-storage.rst
@@ -1,7 +1,7 @@
.. _tune-storage-options:

How to Configure Storage Options for a Distributed Tune Experiment
==================================================================
How to Configure Persistent Storage in Ray Tune
===============================================

.. seealso::

@@ -26,17 +26,14 @@ Storage Options in Tune

Tune provides support for the following scenarios:

1. When running Tune on a distributed cluster without any external persistent storage.
1. When using cloud storage (e.g. AWS S3 or Google Cloud Storage) accessible by all machines in the cluster.
2. When using a network filesystem (NFS) mounted to all machines in the cluster.
3. When using cloud storage (e.g. AWS S3 or Google Cloud Storage) accessible by all machines in the cluster.

Situation (1) is the default scenario if a network filesystem or cloud storage are not provided.
In this scenario, we assume that we only have the local filesystems of each machine in the Ray cluster for storing experiment outputs.
3. When running Tune on a single node and using the local filesystem as the persistent storage location.
4. **(Deprecated)** When running Tune on multiple nodes and using the local filesystem of the head node as the persistent storage location.

.. note::

Although we are considering distributed Tune experiments in this guide,
a network filesystem or cloud storage can also be configured for single-node
A network filesystem or cloud storage can be configured for single-node
experiments. This can be useful to persist your experiment results in external storage
if, for example, the instance you run your experiment on clears its local storage
after termination.
@@ -46,21 +43,17 @@ In this scenario, we assume that we only have the local filesystems of each mach
See :class:`~ray.tune.syncer.SyncConfig` for the full set of configuration options as well as more details.


.. _tune-default-syncing:
.. _tune-cloud-checkpointing:

Configure Tune without external persistent storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Configuring Tune with cloud storage (AWS S3, Google Cloud Storage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you're using neither a shared filesystem nor cloud storage, Ray Tune will resort to the
default mechanism of periodically synchronizing data saved on worker nodes to the head node.
**This treats the head node's local filesystem as the main storage location of the distributed Tune experiment.**
If all nodes in a Ray cluster have access to cloud storage, e.g. AWS S3 or Google Cloud Storage (GCS),
then all experiment outputs can be saved in a shared cloud bucket.

By default, workers will sync to the head node whenever a trial running on that worker
has finished saving a checkpoint. This can be configured by ``sync_on_checkpoint`` and
``sync_period`` in :class:`SyncConfig <ray.tune.syncer.SyncConfig>`:
We can configure cloud storage by telling Ray Tune to **upload to a remote** ``storage_path``:

.. code-block:: python
:emphasize-lines: 9, 10, 11, 12, 13, 14
from ray import tune
from ray.air.config import RunConfig
@@ -69,33 +62,28 @@ has finished saving a checkpoint. This can be configured by ``sync_on_checkpoint
trainable,
run_config=RunConfig(
name="experiment_name",
storage_path="~/ray_results",
sync_config=tune.SyncConfig(
syncer="auto",
# Sync approximately every minute rather than on every checkpoint
sync_on_checkpoint=False,
sync_period=60,
)
storage_path="s3://bucket-name/sub-path/",
)
)
tuner.fit()
In the snippet above, we disabled forceful syncing on trial checkpoints and adjusted the sync period to 60 seconds.
Setting the sync period to a lower value (in seconds) will sync from remote nodes more often.
This will lead to more robust trial recovery, but it will also lead to more synchronization overhead.
Ray AIR defaults to using pyarrow to perform syncing with the specified cloud ``storage_path``.
You can also pass a custom :class:`Syncer <ray.tune.syncer.Syncer>` object
to a :class:`tune.SyncConfig <ray.tune.SyncConfig>` within the :class:`air.RunConfig <ray.air.RunConfig>`
if you want to implement custom logic for uploading/downloading from the cloud.
See :ref:`tune-cloud-syncing` and :ref:`tune-cloud-syncing-command-line-example`
for more details and examples of custom syncing.
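
As a rough illustration of the custom syncer hook mentioned above, here is a sketch that shells out
to the AWS CLI (the ``AwsCliSyncer`` class name and the use of ``aws s3 sync`` are assumptions for
this example and not part of Ray; ``trainable`` is a placeholder, as in the other snippets):

.. code-block:: python

    import subprocess

    from ray import air, tune
    from ray.tune.syncer import Syncer


    class AwsCliSyncer(Syncer):
        """Sketch of a custom syncer that delegates to the AWS CLI."""

        def sync_up(self, local_dir: str, remote_dir: str, exclude=None) -> bool:
            # Upload the local directory to cloud storage.
            subprocess.check_call(["aws", "s3", "sync", local_dir, remote_dir])
            return True

        def sync_down(self, remote_dir: str, local_dir: str, exclude=None) -> bool:
            # Download the remote directory from cloud storage.
            subprocess.check_call(["aws", "s3", "sync", remote_dir, local_dir])
            return True

        def delete(self, remote_dir: str) -> bool:
            # Remove the remote directory.
            subprocess.check_call(["aws", "s3", "rm", "--recursive", remote_dir])
            return True


    tuner = tune.Tuner(
        trainable,
        run_config=air.RunConfig(
            name="experiment_name",
            storage_path="s3://bucket-name/sub-path/",
            sync_config=tune.SyncConfig(syncer=AwsCliSyncer()),
        ),
    )
    tuner.fit()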

In this example, all experiment results can be found on the head node at ``~/ray_results/experiment_name`` for further processing.
In this example, all experiment results can be found in the shared storage at ``s3://bucket-name/sub-path/experiment_name`` for further processing.

.. note::

If you don't provide a :class:`~ray.tune.syncer.SyncConfig` at all, this is the default configuration.
The head node will not have access to all experiment results locally. If you want to further
process, e.g., the best checkpoint, you will first have to fetch it from the cloud storage.

Experiment restoration should also be done using the experiment directory at the cloud storage
URI, rather than the local experiment directory on the head node. See :ref:`here for an example <tune-syncing-restore-from-uri>`.
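
For example, a restore call against the cloud URI (rather than a local path on the head node)
might look like the following sketch, with the bucket path and ``trainable`` as placeholders:

.. code-block:: python

    from ray import tune

    # Point restore at the experiment directory in cloud storage,
    # not at a directory on the head node's local filesystem.
    tuner = tune.Tuner.restore(
        "s3://bucket-name/sub-path/experiment_name",
        trainable=trainable,
    )
    tuner.fit()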

.. tip::
Please note that this approach is likely the least efficient one - you should always try to use
shared or cloud storage if possible when training on a multi-node cluster.
Using a network filesystem or cloud storage is recommended when training a large number of distributed trials,
since the default scenario with many worker nodes can introduce significant overhead.


Configuring Tune with a network filesystem (NFS)
@@ -104,37 +92,36 @@ Configuring Tune with a network filesystem (NFS)
If all Ray nodes have access to a network filesystem, e.g. AWS EFS or Google Cloud Filestore,
they can all write experiment outputs to this directory.

All we need to do is **set the shared network filesystem as the path to save results** and
**disable Ray Tune's default syncing behavior**.
All we need to do is **set the shared network filesystem as the path to save results**.

.. code-block:: python
:emphasize-lines: 7, 8, 9, 10
from ray import air, tune
tuner = tune.Tuner(
trainable,
run_config=air.RunConfig(
name="experiment_name",
storage_path="/path/to/shared/storage/",
sync_config=tune.SyncConfig(
syncer=None # Disable syncing
)
storage_path="/mnt/path/to/shared/storage/",
)
)
tuner.fit()
In this example, all experiment results can be found in the shared storage at ``/mnt/path/to/shared/storage/experiment_name`` for further processing.

.. _tune-cloud-checkpointing:

Configuring Tune with cloud storage (AWS S3, Google Cloud Storage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _tune-default-syncing:

If all nodes in a Ray cluster have access to cloud storage, e.g. AWS S3 or Google Cloud Storage (GCS),
then all experiment outputs can be saved in a shared cloud bucket.
Configure Tune without external persistent storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can configure cloud storage by telling Ray Tune to **upload to a remote** ``storage_path``:
On a single-node cluster
************************

If you're just running an experiment on a single node (e.g., on a laptop), Tune will use the
local filesystem as the default storage location for checkpoints and other artifacts.
Results are saved to ``~/ray_results`` in a sub-directory with a unique auto-generated name by default,
unless you customize this with ``storage_path`` and ``name`` in :class:`~ray.air.RunConfig`.

.. code-block:: python
@@ -144,42 +131,76 @@ We can configure cloud storage by telling Ray Tune to **upload to a remote** ``s
tuner = tune.Tuner(
trainable,
run_config=RunConfig(
storage_path="/tmp/custom/storage/path",
name="experiment_name",
storage_path="s3://bucket-name/sub-path/",
)
)
tuner.fit()
Ray AIR automatically configures a default syncer that uses pyarrow to
perform syncing with the specified cloud ``storage_path``.
You can also pass a custom :class:`Syncer <ray.tune.syncer.Syncer>` object
to a :class:`tune.SyncConfig <ray.tune.SyncConfig>` within the :class:`air.RunConfig <ray.air.RunConfig>`
if you want to implement custom logic for uploading/downloading from the cloud.
See :ref:`tune-cloud-syncing` and :ref:`tune-cloud-syncing-command-line-example`
for more details and examples of custom syncing.
In this example, all experiment results can be found locally at ``/tmp/custom/storage/path/experiment_name`` for further processing.

In this example, all experiment results can be found in the shared storage at ``s3://bucket-name/sub-path/experiment_name`` for further processing.

.. note::
On a multi-node cluster (Deprecated)
************************************

The head node will not have access to all experiment results locally. If you want to process
e.g. the best checkpoint further, you will first have to fetch it from the cloud storage.
.. warning::

Experiment restoration should also be done using the experiment directory at the cloud storage
URI, rather than the local experiment directory on the head node. See :ref:`here for an example <tune-syncing-restore-from-uri>`.
When running on multiple nodes, using the local filesystem of the head node as the persistent storage location is *deprecated*.
If you save trial checkpoints and run on a multi-node cluster, Tune will raise an error by default if NFS or cloud storage is not set up.
See `this issue <https://github.com/ray-project/ray/issues/37177>`_ for more information, including temporary workarounds
as well as the deprecation and removal schedule.


If you're using neither a shared filesystem nor cloud storage, Ray Tune will resort to the
default mechanism of periodically synchronizing data saved on worker nodes to the head node.
This treats the head node's local filesystem as the main storage location of the distributed Tune experiment.

By default, workers will sync the entire trial directory to the head node whenever that trial saves a checkpoint.
This can be configured by ``sync_on_checkpoint`` and ``sync_period`` in :class:`SyncConfig <ray.tune.syncer.SyncConfig>`:

.. code-block:: python
from ray import tune
from ray.air.config import RunConfig
tuner = tune.Tuner(
trainable,
run_config=RunConfig(
name="experiment_name",
storage_path="~/ray_results",
sync_config=tune.SyncConfig(
syncer="auto",
# Sync approximately every minute rather than on every checkpoint
sync_on_checkpoint=False,
sync_period=60,
)
)
)
tuner.fit()
In the snippet above, we disabled forceful syncing on trial checkpoints and adjusted the sync period to 60 seconds.
Setting the sync period to a lower value (in seconds) will sync from remote nodes more often.
This will lead to more robust trial recovery, but it will also lead to more synchronization overhead.

In this example, all experiment results can be found on the head node at ``~/ray_results/experiment_name`` for further processing.

.. tip::
Please note that this approach is likely the least efficient one; you should always try to use
shared or cloud storage if possible when training on a multi-node cluster.


Examples
--------

Let's show some examples of configuring storage location and synchronization options.
We'll also show how to resume the experiment for each of the examples in case your experiment gets interrupted.
See :ref:`tune-stopping-guide` for more information on resuming experiments.
See :ref:`tune-fault-tolerance-ref` for more information on resuming experiments.

In each example, we'll give a practical explanation of how *trial checkpoints* are saved
across the cluster and the external storage location (if one is provided).
See :ref:`tune-persisted-experiment-data` for an overview of other experiment data that Tune needs to persist.


Example: Running Tune with cloud storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -254,12 +275,13 @@ There are a few options for restoring an experiment:
Please see the documentation of
:meth:`Tuner.restore() <ray.tune.tuner.Tuner.restore>` for more details.


.. _tune-default-syncing-example:

Example: Running Tune without external persistent storage (default scenario)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Example: Running Tune in a multi-node cluster without external persistent storage (Deprecated)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, let's take a look at an example using default syncing behavior described above.
Now, let's take a look at an example using the deprecated head node syncing behavior described above.
Again, we're running this example script from the Ray cluster's head node.

.. code-block:: python
@@ -297,6 +319,7 @@ This experiment can be resumed from the head node:
.. code-block:: python
from ray import tune
tuner = tune.Tuner.restore(
"/tmp/mypath/my-tune-exp",
trainable=my_trainable,
39 changes: 20 additions & 19 deletions python/ray/tune/syncer.py
@@ -112,36 +112,37 @@ class SyncConfig:
See :ref:`tune-persisted-experiment-data` for an overview of what data is
synchronized.
If an ``upload_dir`` is specified, both experiment and trial checkpoints
will be stored on remote (cloud) storage. Synchronization then only
happens via uploading/downloading from this remote storage -- no syncing will
happen between nodes.
If a remote ``RunConfig(storage_path)`` is specified, both experiment and trial
checkpoints will be stored on remote (cloud) storage. Synchronization then only
happens via uploading/downloading from this remote storage.
There are a few scenarios where syncing takes place:
(1) The Tune driver (on the head node) syncing the experiment directory to the cloud
(which includes experiment state such as searcher state, the list of trials
and their statuses, and trial metadata)
(2) Workers directly syncing trial checkpoints to the cloud
(3) Workers syncing their trial directories to the head node
(this is the default option when no cloud storage is used)
(3) Workers syncing their trial directories to the head node (Deprecated)
(4) Workers syncing artifacts (which include all files saved in the trial directory
*except* for checkpoints) directly to the cloud.
.. warning::
When running on multiple nodes, using the local filesystem of the head node as
the persistent storage location is *deprecated*.
If you save trial checkpoints and run on a multi-node cluster,
Tune will raise an error by default if NFS or cloud storage is not set up.
See `this issue <https://github.com/ray-project/ray/issues/37177>`_
for more information, including temporary workarounds
as well as the deprecation and removal schedule.
See :ref:`tune-storage-options` for more details and examples.
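A minimal usage sketch (the bucket path and ``trainable`` are placeholders, and only
options described below are set):

.. code-block:: python

    from ray import air, tune

    tuner = tune.Tuner(
        trainable,
        run_config=air.RunConfig(
            name="experiment_name",
            storage_path="s3://bucket-name/sub-path/",
            sync_config=tune.SyncConfig(
                syncer="auto",    # auto detect; assigns a pyarrow.fs-based syncer
                sync_period=300,  # minimum seconds between two sync operations
            ),
        ),
    )
    tuner.fit()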
Args:
upload_dir: Optional URI to sync training results and checkpoints
to (e.g. ``s3://bucket``, ``gs://bucket`` or ``hdfs://path``).
Specifying this will enable cloud-based checkpointing.
syncer: If ``upload_dir`` is specified, then this config accepts a custom
syncer subclassing :class:`~ray.tune.syncer.Syncer` which will be
upload_dir: This config is deprecated in favor of ``RunConfig(storage_path)``.
syncer: If a cloud ``storage_path`` is configured, then this config accepts a
custom syncer subclassing :class:`~ray.tune.syncer.Syncer` which will be
used to synchronize checkpoints to/from cloud storage.
If no ``upload_dir`` is specified, this config can be set to ``None``,
which disables the default worker-to-head-node syncing.
Defaults to ``"auto"`` (auto detect), which assigns a default syncer
that uses pyarrow to handle cloud storage syncing when ``upload_dir``
is provided.
Defaults to ``"auto"`` (auto detect), which defaults to use ``pyarrow.fs``.
sync_period: Minimum time in seconds to wait between two sync operations.
A smaller ``sync_period`` will have more up-to-date data at the sync
location but introduces more syncing overhead.
@@ -157,10 +158,10 @@ class SyncConfig:
trial directory (accessed via `session.get_trial_dir()`) to the cloud.
Artifact syncing happens at the same frequency as trial checkpoint syncing.
**Note**: This is scenario (4).
sync_on_checkpoint: If *True*, a sync from a worker's remote trial directory
sync_on_checkpoint: This config is deprecated.
If *True*, a sync from a worker's remote trial directory
to the head node will be forced on every trial checkpoint, regardless
of the ``sync_period``.
Defaults to True.
of the ``sync_period``. Defaults to True.
**Note**: This is ignored if ``upload_dir`` is specified, since this
only applies to worker-to-head-node syncing (3).
"""
