Fixed autotuning with horovodrun by excluding unset parameters from the environment, and added docs for autotune (#1356)
tgaddair committed Aug 27, 2019
1 parent a639de5 commit 88692848c38b4c7e630856ce5be1fe9aa3e46d21
Showing with 198 additions and 44 deletions.
  1. +9 −0 README.rst
  2. +86 −0 docs/autotune.rst
  3. +3 −0 docs/autotune_include.rst
  4. +2 −0 docs/index.rst
  5. +6 −0 docs/mpirun.rst
  6. +10 −0 docs/summary.rst
  7. +3 −2 horovod/run/common/util/config_parser.py
  8. +44 −22 horovod/run/run.py
  9. +35 −20 test/test_run.py
@@ -306,6 +306,15 @@ Horovod has the ability to record the timeline of its activity, called Horovod T
See `here <docs/timeline.rst>`__ for full details and usage instructions.


Automated Performance Tuning
----------------------------
Selecting the right values to efficiently make use of Tensor Fusion and other advanced Horovod features can involve
a good amount of trial and error. We provide a system to automate this performance optimization process called
**autotuning**, which you can enable with a single command line argument to ``horovodrun``.

See `here <docs/autotune.rst>`__ for full details and usage instructions.


Guides
------
1. Run distributed training in Microsoft Azure using `Batch AI and Horovod <https://github.com/Azure/BatchAI/tree/master/recipes/Horovod>`_. Send us links to any user guides you want to publish on this site
@@ -0,0 +1,86 @@
.. inclusion-marker-start-do-not-remove

Autotune: Automated Performance Tuning
======================================

Horovod comes with several adjustable "knobs" that can affect runtime performance, including
``--fusion-threshold-mb`` and ``--cycle-time-ms`` (tensor fusion), ``--cache-capacity`` (response cache), and
hierarchical collective algorithms ``--hierarchical-allreduce`` and ``--hierarchical-allgather``.

Determining the best combination of these values to maximize performance (and minimize time to convergence) can be a
matter of trial and error, as many factors, including model complexity, network bandwidth, and GPU memory, can all
affect training throughput (inputs per second) during training.
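
For reference, each of these knobs can also be set explicitly on the ``horovodrun`` command line; the values below
are purely illustrative:

.. code-block:: bash

    $ horovodrun -np 4 --fusion-threshold-mb 32 --cycle-time-ms 3.5 --cache-capacity 2048 --hierarchical-allreduce python train.py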

Horovod provides a mechanism to automate the process of selecting the best values for these "knobs" called
**autotuning**. The Horovod autotuning system uses
`Bayesian optimization <https://en.wikipedia.org/wiki/Bayesian_optimization>`_ to intelligently search through the
space of parameter combinations during training. This feature can be enabled by setting the ``--autotune`` flag for
``horovodrun``:

.. code-block:: bash

    $ horovodrun -np 4 --autotune python train.py

When autotuning is enabled, Horovod will spend the first steps / epochs of training experimenting with different
parameter values and collecting metrics on performance (measured in bytes allreduced / allgathered per unit of time).
Once the experiment reaches convergence, or a set number of samples have been collected, the system will record the best
combination of parameters discovered and continue to use them for the duration of training.

A log of all parameter combinations explored (and the best values selected) can be recorded by providing
the ``--autotune-log-file`` option to ``horovodrun``:

.. code-block:: bash

    $ horovodrun -np 4 --autotune --autotune-log-file /tmp/autotune_log.csv python train.py

By logging the explored parameters to a file, you can set the best values discovered on the command line instead of
re-running autotuning if training is paused and later resumed.
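
For example, suppose the log shows that the best-performing combination was a 32 MB fusion threshold, a 3.5 ms cycle
time, and hierarchical allreduce enabled (hypothetical values shown purely for illustration). You could then resume
training with those values fixed:

.. code-block:: bash

    $ horovodrun -np 4 --fusion-threshold-mb 32 --cycle-time-ms 3.5 --hierarchical-allreduce python train.py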

Note that some configurable parameters, such as tensor compression, are not included in the autotuning process
because they can affect model convergence. The purpose of autotuning at this time is entirely to improve scaling
efficiency without trading off model performance.


Constant Parameters
-------------------

Sometimes you may wish to hold certain values constant and only tune the unspecified parameters. This can be
accomplished by explicitly setting those values on the command line or in the config file provided
by ``--config-file``:

.. code-block:: bash

    $ horovodrun -np 4 --autotune --cache-capacity 1024 --no-hierarchical-allgather python train.py

In the above example, parameters ``cache-capacity`` and ``hierarchical-allgather`` will not be adjusted by
autotuning.


Advanced Autotuning
-------------------

Enabling autotuning imposes a tradeoff: degraded performance during the early phases of training in exchange for
better performance later on. As such, autotuning is generally recommended in situations where training is expected
to take a long time (many epochs on a very large dataset) and where scaling efficiency with the default settings has
been found lacking.

You can tune the autotuning system itself to change the number of warmup samples (discarded samples at the beginning),
steps per sample, and maximum samples:

.. code-block:: bash

    $ horovodrun -np 4 --autotune \
        --autotune-warmup-samples 5 --autotune-steps-per-sample 20 --autotune-bayes-opt-max-samples 40 \
        python train.py

Increasing these values will generally improve the accuracy of the autotuning process, at the cost of more time
spent tuning with degraded performance.

Finally, for those familiar with the underlying theory of Bayesian optimization and Gaussian processes, you can tune
the noise regularization term (alpha) to account for variance in your network or other system resources:

.. code-block:: bash

    $ horovodrun -np 4 --autotune --autotune-gaussian-process-noise 0.75 python train.py

.. inclusion-marker-end-do-not-remove
@@ -0,0 +1,3 @@
.. include:: ./autotune.rst
   :start-after: inclusion-marker-start-do-not-remove
   :end-before: inclusion-marker-end-do-not-remove
@@ -121,6 +121,8 @@ Guides

   timeline_include

   autotune_include

   troubleshooting_include

   contributors_include
@@ -112,6 +112,12 @@ Timeline:
    $ mpirun -x HOROVOD_TIMELINE=/path/to/timeline.json -x HOROVOD_TIMELINE_MARK_CYCLES=1 ... python train.py

Autotuning:

.. code-block:: bash

    $ mpirun -x HOROVOD_AUTOTUNE=1 -x HOROVOD_AUTOTUNE_LOG=/tmp/autotune_log.csv ... python train.py

Note that when using ``horovodrun``, any command line arguments will override values set in the environment.
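
For example (illustrative values), a ``--cache-capacity`` argument passed to ``horovodrun`` takes precedence over a
``HOROVOD_CACHE_CAPACITY`` value already present in the environment:

.. code-block:: bash

    $ HOROVOD_CACHE_CAPACITY=2048 horovodrun -np 4 --cache-capacity 1024 python train.py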

Hangs due to non-routed network interfaces
@@ -323,6 +323,7 @@ to batch small *allreduce* operations, which results in improved performance. We

See `here <tensor-fusion.rst>`__ for full details and tweaking instructions.


Horovod Timeline
----------------
Horovod has the ability to record the timeline of its activity, called Horovod Timeline.
@@ -334,6 +335,15 @@ Use Horovod timeline to analyze Horovod performance.
See `here <timeline.rst>`__ for full details and usage instructions.


Automated Performance Tuning
----------------------------
Selecting the right values to efficiently make use of Tensor Fusion and other advanced Horovod features can involve
a good amount of trial and error. We provide a system to automate this performance optimization process called
**autotuning**, which you can enable with a single command line argument to ``horovodrun``.

See `here <autotune.rst>`__ for full details and usage instructions.


Guides
------
1. Run distributed training in Microsoft Azure using `Batch AI and Horovod <https://github.com/Azure/BatchAI/tree/master/recipes/Horovod>`_.
@@ -92,7 +92,7 @@ def set_args_from_config(args, config, override_args):

 def _validate_arg_nonnegative(args, arg_name):
     value = getattr(args, arg_name)
-    if value < 0:
+    if value is not None and value < 0:
         raise ValueError('{}={} must be >= 0'.format(arg_name, value))


@@ -104,7 +104,8 @@ def validate_config_args(args):
     _validate_arg_nonnegative(args, 'autotune_steps_per_sample')
     _validate_arg_nonnegative(args, 'autotune_bayes_opt_max_samples')

-    if args.autotune_gaussian_process_noise < 0 or args.autotune_gaussian_process_noise > 1:
+    noise = args.autotune_gaussian_process_noise
+    if noise is not None and (noise < 0 or noise > 1):
         raise ValueError('{}={} must be in [0, 1]'.format('autotune_gaussian_process_noise',
                                                           args.autotune_gaussian_process_noise))

@@ -314,7 +314,7 @@ class StoreOverrideAction(argparse.Action):
         def __init__(self,
                      option_strings,
                      dest,
-                     default=False,
+                     default=None,
                      type=None,
                      required=False,
                      help=None):
@@ -334,28 +334,35 @@ def __call__(self, parser, args, values, option_string=None):
     return StoreOverrideAction


-def make_override_true_action(override_args):
-    class StoreOverrideTrueAction(argparse.Action):
+def make_override_bool_action(override_args, bool_value):
+    class StoreOverrideBoolAction(argparse.Action):
         def __init__(self,
                      option_strings,
                      dest,
-                     default=False,
                      required=False,
                      help=None):
-            super(StoreOverrideTrueAction, self).__init__(
+            super(StoreOverrideBoolAction, self).__init__(
                 option_strings=option_strings,
                 dest=dest,
-                const=True,
+                const=bool_value,
                 nargs=0,
-                default=default,
+                default=None,
                 required=required,
                 help=help)

         def __call__(self, parser, args, values, option_string=None):
             override_args.add(self.dest)
             setattr(args, self.dest, self.const)

-    return StoreOverrideTrueAction
+    return StoreOverrideBoolAction


+def make_override_true_action(override_args):
+    return make_override_bool_action(override_args, True)
+
+
+def make_override_false_action(override_args):
+    return make_override_bool_action(override_args, False)
+
+
 def parse_args():
@@ -407,28 +414,43 @@ def parse_args():
                                    'this argument, and will be overridden by any arguments that come after it.')

     group_params = parser.add_argument_group('tuneable parameter arguments')
-    group_params.add_argument('--fusion-threshold-mb', action=make_override_action(override_args), type=int, default=64,
+    group_params.add_argument('--fusion-threshold-mb', action=make_override_action(override_args),type=int,
                               help='Fusion buffer threshold in MB. This is the maximum amount of '
                                    'tensor data that can be fused together into a single batch '
                                    'during allreduce / allgather. Setting 0 disables tensor fusion. '
-                                   '(default: %(default)s)')
-    group_params.add_argument('--cycle-time-ms', action=make_override_action(override_args), type=float, default=5,
+                                   '(default: 64)')
+    group_params.add_argument('--cycle-time-ms', action=make_override_action(override_args), type=float,
                               help='Cycle time in ms. This is the delay between each tensor fusion '
                                    'cycle. The larger the cycle time, the more batching, but the '
                                    'greater latency between each allreduce / allgather operations. '
-                                   '(default: %(default)s)')
-    group_params.add_argument('--cache-capacity', action=make_override_action(override_args), type=int, default=1024,
+                                   '(default: 5')
+    group_params.add_argument('--cache-capacity', action=make_override_action(override_args), type=int,
                               help='Maximum number of tensor names that will be cached to reduce amount '
                                    'of coordination required between workers before performing allreduce / '
-                                   'allgather. (default: %(default)s)')
-    group_params.add_argument('--hierarchical-allreduce', action=make_override_true_action(override_args),
-                              help='Perform hierarchical allreduce between workers instead of ring allreduce. '
-                                   'Hierarchical allreduce performs a local allreduce / gather within a host, then '
-                                   'a parallel cross allreduce between equal local ranks across workers, and '
-                                   'finally a local gather.')
-    group_params.add_argument('--hierarchical-allgather', action=make_override_true_action(override_args),
-                              help='Perform hierarchical allgather between workers instead of ring allgather. See '
-                                   'hierarchical allreduce for algorithm details.')
+                                   'allgather. (default: 1024')
+
+    group_hierarchical_allreduce = group_params.add_mutually_exclusive_group()
+    group_hierarchical_allreduce.add_argument('--hierarchical-allreduce',
+                                              action=make_override_true_action(override_args),
+                                              help='Perform hierarchical allreduce between workers instead of '
+                                                   'ring allreduce. Hierarchical allreduce performs a local '
+                                                   'allreduce / gather within a host, then a parallel cross allreduce '
+                                                   'between equal local ranks across workers, and finally a '
+                                                   'local gather.')
+    group_hierarchical_allreduce.add_argument('--no-hierarchical-allreduce', dest='hierarchical_allreduce',
+                                              action=make_override_false_action(override_args),
+                                              help='Explicitly disable hierarchical allreduce to prevent autotuning '
+                                                   'from adjusting it.')
+
+    group_hierarchical_allgather = group_params.add_mutually_exclusive_group()
+    group_hierarchical_allgather.add_argument('--hierarchical-allgather',
+                                              action=make_override_true_action(override_args),
+                                              help='Perform hierarchical allgather between workers instead of '
+                                                   'ring allgather. See hierarchical allreduce for algorithm details.')
+    group_hierarchical_allgather.add_argument('--no-hierarchical-allgather', dest='hierarchical_allgather',
+                                              action=make_override_false_action(override_args),
+                                              help='Explicitly disable hierarchical allgather to prevent autotuning '
+                                                   'from adjusting it.')

     group_autotune = parser.add_argument_group('autotune arguments')
     group_autotune.add_argument('--autotune', action=make_override_true_action(override_args),
