[air][doc] Update docs to reflect head node syncing deprecation (ray-project#37475)

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: harborn <gangsheng.wu@intel.com>
justinvyu authored and harborn committed Aug 17, 2023
1 parent 63f80f7 commit a6016b6
Showing 4 changed files with 116 additions and 96 deletions.
8 changes: 1 addition & 7 deletions doc/source/tune/doc_code/faq.py
@@ -363,13 +363,7 @@ def wait(self):

tuner = tune.Tuner(
train_fn,
run_config=air.RunConfig(
storage_path="/path/to/shared/storage",
),
sync_config=tune.SyncConfig(
# Do not sync because we are on shared storage
syncer=None
),
run_config=air.RunConfig(storage_path="/path/to/shared/storage"),
)
tuner.fit()
# __sync_config_end__
8 changes: 5 additions & 3 deletions doc/source/tune/tutorials/tune-output.rst
@@ -269,9 +269,11 @@ should be configured to log to the Trainable's *working directory.* By default,
the current working directory of both functional and class trainables is set to the
corresponding trial directory once it's been launched as a remote Ray actor.
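
For instance, here is a minimal sketch of a function trainable that logs an artifact into its
working directory (the file name, metric name, and experiment name are illustrative placeholders;
``session.report`` follows the Ray AIR session API):

.. code-block:: python

    import os

    from ray import air, tune
    from ray.air import session


    def train_fn(config):
        for step in range(3):
            # The working directory is the trial directory, so this file
            # becomes a per-trial artifact.
            with open(os.path.join(os.getcwd(), "artifact.txt"), "a") as f:
                f.write(f"step {step}\n")
            session.report({"score": step})


    tuner = tune.Tuner(train_fn, run_config=air.RunConfig(name="artifact_logging_example"))
    tuner.fit()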

When running with multiple nodes using the :ref:`default syncing method <tune-default-syncing>`,
trial artifacts are synchronized to the driver node under the specified path.
This will allow you to visualize and analyze logs of all distributed training workers on a single machine.
.. warning::

When running in a multi-node cluster using the *deprecated* :ref:`head node storage option <tune-default-syncing>`,
trial artifacts are synchronized to the driver node under the specified path.
This will allow you to visualize and analyze logs of all distributed training workers on a single machine.

When :ref:`specifying a cloud upload directory <tune-cloud-checkpointing>`, trial artifacts are uploaded to that cloud bucket
for later analysis. Note that the driver node does not necessarily contain
157 changes: 90 additions & 67 deletions doc/source/tune/tutorials/tune-storage.rst
@@ -1,7 +1,7 @@
.. _tune-storage-options:

How to Configure Storage Options for a Distributed Tune Experiment
==================================================================
How to Configure Persistent Storage in Ray Tune
===============================================

.. seealso::

@@ -26,17 +26,14 @@ Storage Options in Tune

Tune provides support for the following scenarios:

1. When running Tune on a distributed cluster without any external persistent storage.
1. When using cloud storage (e.g. AWS S3 or Google Cloud Storage) accessible by all machines in the cluster.
2. When using a network filesystem (NFS) mounted to all machines in the cluster.
3. When using cloud storage (e.g. AWS S3 or Google Cloud Storage) accessible by all machines in the cluster.

Situation (1) is the default scenario if a network filesystem or cloud storage are not provided.
In this scenario, we assume that we only have the local filesystems of each machine in the Ray cluster for storing experiment outputs.
3. When running Tune on a single node and using the local filesystem as the persistent storage location.
4. **(Deprecated)** When running Tune on multiple nodes and using the local filesystem of the head node as the persistent storage location.

.. note::

Although we are considering distributed Tune experiments in this guide,
a network filesystem or cloud storage can also be configured for single-node
A network filesystem or cloud storage can be configured for single-node
experiments. This can be useful to persist your experiment results in external storage
if, for example, the instance you run your experiment on clears its local storage
after termination.
@@ -46,21 +43,17 @@ In this scenario, we assume that we only have the local filesystems of each mach
See :class:`~ray.tune.syncer.SyncConfig` for the full set of configuration options as well as more details.


.. _tune-default-syncing:
.. _tune-cloud-checkpointing:

Configure Tune without external persistent storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Configuring Tune with cloud storage (AWS S3, Google Cloud Storage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you're using neither a shared filesystem nor cloud storage, Ray Tune will resort to the
default mechanism of periodically synchronizing data saved on worker nodes to the head node.
**This treats the head node's local filesystem as the main storage location of the distributed Tune experiment.**
If all nodes in a Ray cluster have access to cloud storage, e.g. AWS S3 or Google Cloud Storage (GCS),
then all experiment outputs can be saved in a shared cloud bucket.

By default, workers will sync to the head node whenever a trial running on that worker
has finished saving a checkpoint. This can be configured by ``sync_on_checkpoint`` and
``sync_period`` in :class:`SyncConfig <ray.tune.syncer.SyncConfig>`:
We can configure cloud storage by telling Ray Tune to **upload to a remote** ``storage_path``:

.. code-block:: python
:emphasize-lines: 9, 10, 11, 12, 13, 14
from ray import tune
from ray.air.config import RunConfig
@@ -69,33 +62,28 @@ has finished saving a checkpoint. This can be configured by ``sync_on_checkpoint
trainable,
run_config=RunConfig(
name="experiment_name",
storage_path="~/ray_results",
sync_config=tune.SyncConfig(
syncer="auto",
# Sync approximately every minute rather than on every checkpoint
sync_on_checkpoint=False,
sync_period=60,
)
storage_path="s3://bucket-name/sub-path/",
)
)
tuner.fit()
In the snippet above, we disabled forceful syncing on trial checkpoints and adjusted the sync period to 60 seconds.
Setting the sync period to a lower value (in seconds) will sync from remote nodes more often.
This will lead to more robust trial recovery, but it will also lead to more synchronization overhead.
Ray AIR defaults to using pyarrow to perform syncing with the specified cloud ``storage_path``.
You can also pass a custom :class:`Syncer <ray.tune.syncer.Syncer>` object
to a :class:`tune.SyncConfig <ray.tune.SyncConfig>` within the :class:`air.RunConfig <ray.air.RunConfig>`
if you want to implement custom logic for uploading/downloading from the cloud.
See :ref:`tune-cloud-syncing` and :ref:`tune-cloud-syncing-command-line-example`
for more details and examples of custom syncing.
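
As a rough illustration of the custom syncer hook mentioned above, here is a sketch that shells out
to the AWS CLI (the ``AwsCliSyncer`` class name and the use of ``aws s3 sync`` are assumptions for
this example and not part of Ray; ``trainable`` is a placeholder, as in the other snippets):

.. code-block:: python

    import subprocess

    from ray import air, tune
    from ray.tune.syncer import Syncer


    class AwsCliSyncer(Syncer):
        """Sketch of a custom syncer that delegates to the AWS CLI."""

        def sync_up(self, local_dir: str, remote_dir: str, exclude=None) -> bool:
            # Upload the local directory to cloud storage.
            subprocess.check_call(["aws", "s3", "sync", local_dir, remote_dir])
            return True

        def sync_down(self, remote_dir: str, local_dir: str, exclude=None) -> bool:
            # Download the remote directory from cloud storage.
            subprocess.check_call(["aws", "s3", "sync", remote_dir, local_dir])
            return True

        def delete(self, remote_dir: str) -> bool:
            # Remove the remote directory.
            subprocess.check_call(["aws", "s3", "rm", "--recursive", remote_dir])
            return True


    tuner = tune.Tuner(
        trainable,
        run_config=air.RunConfig(
            name="experiment_name",
            storage_path="s3://bucket-name/sub-path/",
            sync_config=tune.SyncConfig(syncer=AwsCliSyncer()),
        ),
    )
    tuner.fit()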

In this example, all experiment results can be found on the head node at ``~/ray_results/experiment_name`` for further processing.
In this example, all experiment results can be found in the shared storage at ``s3://bucket-name/sub-path/experiment_name`` for further processing.

.. note::

If you don't provide a :class:`~ray.tune.syncer.SyncConfig` at all, this is the default configuration.
The head node will not have access to all experiment results locally. If you want to further
process, e.g., the best checkpoint, you will first have to fetch it from the cloud storage.

Experiment restoration should also be done using the experiment directory at the cloud storage
URI, rather than the local experiment directory on the head node. See :ref:`here for an example <tune-syncing-restore-from-uri>`.
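
For example, a restore call against the cloud URI (rather than a local path on the head node)
might look like the following sketch, with the bucket path and ``trainable`` as placeholders:

.. code-block:: python

    from ray import tune

    # Point restore at the experiment directory in cloud storage,
    # not at a directory on the head node's local filesystem.
    tuner = tune.Tuner.restore(
        "s3://bucket-name/sub-path/experiment_name",
        trainable=trainable,
    )
    tuner.fit()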

.. tip::
Please note that this approach is likely the least efficient one - you should always try to use
shared or cloud storage if possible when training on a multi-node cluster.
Using a network filesystem or cloud storage is recommended when training a large number of distributed trials,
since the default scenario with many worker nodes can introduce significant overhead.


Configuring Tune with a network filesystem (NFS)
@@ -104,37 +92,36 @@ Configuring Tune with a network filesystem (NFS)
If all Ray nodes have access to a network filesystem, e.g. AWS EFS or Google Cloud Filestore,
they can all write experiment outputs to this directory.

All we need to do is **set the shared network filesystem as the path to save results** and
**disable Ray Tune's default syncing behavior**.
All we need to do is **set the shared network filesystem as the path to save results**.

.. code-block:: python
:emphasize-lines: 7, 8, 9, 10
from ray import air, tune
tuner = tune.Tuner(
trainable,
run_config=air.RunConfig(
name="experiment_name",
storage_path="/path/to/shared/storage/",
sync_config=tune.SyncConfig(
syncer=None # Disable syncing
)
storage_path="/mnt/path/to/shared/storage/",
)
)
tuner.fit()
In this example, all experiment results can be found in the shared storage at ``/mnt/path/to/shared/storage/experiment_name`` for further processing.

.. _tune-cloud-checkpointing:

Configuring Tune with cloud storage (AWS S3, Google Cloud Storage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _tune-default-syncing:

If all nodes in a Ray cluster have access to cloud storage, e.g. AWS S3 or Google Cloud Storage (GCS),
then all experiment outputs can be saved in a shared cloud bucket.
Configure Tune without external persistent storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can configure cloud storage by telling Ray Tune to **upload to a remote** ``storage_path``:
On a single-node cluster
************************

If you're just running an experiment on a single node (e.g., on a laptop), Tune will use the
local filesystem as the default storage location for checkpoints and other artifacts.
Results are saved to ``~/ray_results`` in a sub-directory with a unique auto-generated name by default,
unless you customize this with ``storage_path`` and ``name`` in :class:`~ray.air.RunConfig`.

.. code-block:: python
@@ -144,42 +131,76 @@ We can configure cloud storage by telling Ray Tune to **upload to a remote** ``s
tuner = tune.Tuner(
trainable,
run_config=RunConfig(
storage_path="/tmp/custom/storage/path",
name="experiment_name",
storage_path="s3://bucket-name/sub-path/",
)
)
tuner.fit()
Ray AIR automatically configures a default syncer that uses pyarrow to
perform syncing with the specified cloud ``storage_path``.
You can also pass a custom :class:`Syncer <ray.tune.syncer.Syncer>` object
to a :class:`tune.SyncConfig <ray.tune.SyncConfig>` within the :class:`air.RunConfig <ray.air.RunConfig>`
if you want to implement custom logic for uploading/downloading from the cloud.
See :ref:`tune-cloud-syncing` and :ref:`tune-cloud-syncing-command-line-example`
for more details and examples of custom syncing.
In this example, all experiment results can be found locally at ``/tmp/custom/storage/path/experiment_name`` for further processing.

In this example, all experiment results can be found in the shared storage at ``s3://bucket-name/sub-path/experiment_name`` for further processing.

.. note::
On a multi-node cluster (Deprecated)
************************************

The head node will not have access to all experiment results locally. If you want to process
e.g. the best checkpoint further, you will first have to fetch it from the cloud storage.
.. warning::

Experiment restoration should also be done using the experiment directory at the cloud storage
URI, rather than the local experiment directory on the head node. See :ref:`here for an example <tune-syncing-restore-from-uri>`.
When running on multiple nodes, using the local filesystem of the head node as the persistent storage location is *deprecated*.
If you save trial checkpoints and run on a multi-node cluster, Tune will raise an error by default if NFS or cloud storage is not set up.
See `this issue <https://github.com/ray-project/ray/issues/37177>`_ for more information, including temporary workarounds
as well as the deprecation and removal schedule.


If you're using neither a shared filesystem nor cloud storage, Ray Tune will resort to the
default mechanism of periodically synchronizing data saved on worker nodes to the head node.
This treats the head node's local filesystem as the main storage location of the distributed Tune experiment.

By default, workers will sync the entire trial directory to the head node whenever that trial saves a checkpoint.
This can be configured by ``sync_on_checkpoint`` and ``sync_period`` in :class:`SyncConfig <ray.tune.syncer.SyncConfig>`:

.. code-block:: python
from ray import tune
from ray.air.config import RunConfig
tuner = tune.Tuner(
trainable,
run_config=RunConfig(
name="experiment_name",
storage_path="~/ray_results",
sync_config=tune.SyncConfig(
syncer="auto",
# Sync approximately every minute rather than on every checkpoint
sync_on_checkpoint=False,
sync_period=60,
)
)
)
tuner.fit()
In the snippet above, we disabled forceful syncing on trial checkpoints and adjusted the sync period to 60 seconds.
Setting the sync period to a lower value (in seconds) will sync from remote nodes more often.
This will lead to more robust trial recovery, but it will also lead to more synchronization overhead.

In this example, all experiment results can be found on the head node at ``~/ray_results/experiment_name`` for further processing.

.. tip::
Please note that this approach is likely the least efficient one; you should always try to use
shared or cloud storage if possible when training on a multi-node cluster.


Examples
--------

Let's show some examples of configuring storage location and synchronization options.
We'll also show how to resume the experiment for each of the examples in case your experiment gets interrupted.
See :ref:`tune-stopping-guide` for more information on resuming experiments.
See :ref:`tune-fault-tolerance-ref` for more information on resuming experiments.

In each example, we'll give a practical explanation of how *trial checkpoints* are saved
across the cluster and the external storage location (if one is provided).
See :ref:`tune-persisted-experiment-data` for an overview of other experiment data that Tune needs to persist.


Example: Running Tune with cloud storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -254,12 +275,13 @@ There are a few options for restoring an experiment:
Please see the documentation of
:meth:`Tuner.restore() <ray.tune.tuner.Tuner.restore>` for more details.


.. _tune-default-syncing-example:

Example: Running Tune without external persistent storage (default scenario)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Example: Running Tune in a multi-node cluster without external persistent storage (Deprecated)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now, let's take a look at an example using default syncing behavior described above.
Now, let's take a look at an example using the deprecated head node syncing behavior described above.
Again, we're running this example script from the Ray cluster's head node.

.. code-block:: python
@@ -297,6 +319,7 @@ This experiment can be resumed from the head node:
.. code-block:: python
from ray import tune
tuner = tune.Tuner.restore(
"/tmp/mypath/my-tune-exp",
trainable=my_trainable,
39 changes: 20 additions & 19 deletions python/ray/tune/syncer.py
@@ -112,36 +112,37 @@ class SyncConfig:
See :ref:`tune-persisted-experiment-data` for an overview of what data is
synchronized.
If an ``upload_dir`` is specified, both experiment and trial checkpoints
will be stored on remote (cloud) storage. Synchronization then only
happens via uploading/downloading from this remote storage -- no syncing will
happen between nodes.
If a remote ``RunConfig(storage_path)`` is specified, both experiment and trial
checkpoints will be stored on remote (cloud) storage. Synchronization then only
happens via uploading/downloading from this remote storage.
There are a few scenarios where syncing takes place:
(1) The Tune driver (on the head node) syncing the experiment directory to the cloud
(which includes experiment state such as searcher state, the list of trials
and their statuses, and trial metadata)
(2) Workers directly syncing trial checkpoints to the cloud
(3) Workers syncing their trial directories to the head node
(this is the default option when no cloud storage is used)
(3) Workers syncing their trial directories to the head node (Deprecated)
(4) Workers syncing artifacts (which include all files saved in the trial directory
*except* for checkpoints) directly to the cloud.
.. warning::
When running on multiple nodes, using the local filesystem of the head node as
the persistent storage location is *deprecated*.
If you save trial checkpoints and run on a multi-node cluster,
Tune will raise an error by default if NFS or cloud storage is not set up.
See `this issue <https://github.com/ray-project/ray/issues/37177>`_
for more information, including temporary workarounds
as well as the deprecation and removal schedule.
See :ref:`tune-storage-options` for more details and examples.
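A minimal usage sketch (the bucket path and ``trainable`` are placeholders, and only
options described below are set):

.. code-block:: python

    from ray import air, tune

    tuner = tune.Tuner(
        trainable,
        run_config=air.RunConfig(
            name="experiment_name",
            storage_path="s3://bucket-name/sub-path/",
            sync_config=tune.SyncConfig(
                syncer="auto",    # auto detect; assigns a pyarrow.fs-based syncer
                sync_period=300,  # minimum seconds between two sync operations
            ),
        ),
    )
    tuner.fit()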
Args:
upload_dir: Optional URI to sync training results and checkpoints
to (e.g. ``s3://bucket``, ``gs://bucket`` or ``hdfs://path``).
Specifying this will enable cloud-based checkpointing.
syncer: If ``upload_dir`` is specified, then this config accepts a custom
syncer subclassing :class:`~ray.tune.syncer.Syncer` which will be
upload_dir: This config is deprecated in favor of ``RunConfig(storage_path)``.
syncer: If a cloud ``storage_path`` is configured, then this config accepts a
custom syncer subclassing :class:`~ray.tune.syncer.Syncer` which will be
used to synchronize checkpoints to/from cloud storage.
If no ``upload_dir`` is specified, this config can be set to ``None``,
which disables the default worker-to-head-node syncing.
Defaults to ``"auto"`` (auto detect), which assigns a default syncer
that uses pyarrow to handle cloud storage syncing when ``upload_dir``
is provided.
Defaults to ``"auto"`` (auto detect), which defaults to use ``pyarrow.fs``.
sync_period: Minimum time in seconds to wait between two sync operations.
A smaller ``sync_period`` will have more up-to-date data at the sync
location but introduces more syncing overhead.
@@ -157,10 +158,10 @@ class SyncConfig:
trial directory (accessed via `session.get_trial_dir()`) to the cloud.
Artifact syncing happens at the same frequency as trial checkpoint syncing.
**Note**: This is scenario (4).
sync_on_checkpoint: If *True*, a sync from a worker's remote trial directory
sync_on_checkpoint: This config is deprecated.
If *True*, a sync from a worker's remote trial directory
to the head node will be forced on every trial checkpoint, regardless
of the ``sync_period``.
Defaults to True.
of the ``sync_period``. Defaults to True.
**Note**: This is ignored if ``upload_dir`` is specified, since this
only applies to worker-to-head-node syncing (3).
"""
