Releases: horovod/horovod
Hotfix for horovodrun network interface discovery process
Fixed
- Fixed issue with network interface discovery causing `horovodrun` to fail during startup (#1974).
Platform LSF support, Spark on Gloo, and Sync Batch Norm
Highlights
- Added Platform LSF and `jsrun` support to `horovodrun`. (#1805)
- Added support for running Horovod on Spark with Gloo in place of MPI. (#1807)
- Added synchronous batch normalization for `horovod.torch` API. (#1923)
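Synchronous batch normalization normalizes with statistics taken over the global batch instead of each worker's local batch. The sketch below is not Horovod's implementation. It only illustrates, in plain Python, how per-worker counts, means, and variances can be merged into global statistics. The helper name `combine_batch_stats` is hypothetical.

```python
def combine_batch_stats(stats):
    """Merge per-worker (count, mean, biased variance) tuples into a global (mean, variance)."""
    total = sum(n for n, _, _ in stats)
    mean = sum(n * m for n, m, _ in stats) / total
    # Each worker's E[x^2] equals var + mean^2; average those, then subtract the global mean^2.
    second_moment = sum(n * (v + m * m) for n, m, v in stats) / total
    return mean, second_moment - mean * mean


# Two workers holding the local batches [1, 2, 3] and [4, 5, 6, 7].
worker_stats = [(3, 2.0, 2.0 / 3.0), (4, 5.5, 1.25)]
global_mean, global_var = combine_batch_stats(worker_stats)
```

The merged values match what a single worker would compute over the concatenated batch, which is the property synchronous batch normalization relies on.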
Additional changes
- Added support for providing a set of inclusive NICs to `horovodrun`. (#1808)
- Added optional `initial_lr` parameter to `LearningRateScheduleCallback`, deprecated implicit initialization. (#1933)
- Changed Spark Estimators to use Petastorm `BatchDataLoader`. (#1879)
- Changed Spark Estimators to use Petastorm's `make_reader` API. (#1804)
- Improved latency of background thread loop. (#1880)
- Enabled setting Horovod background thread affinity with all frameworks. (#1881)
- Added `verbose` parameter to `SparkBackend`. (#1922)
- Use parameter names when scheduling broadcasts in MXNet `broadcast_parameters`. (#1894)
- Added metadata cache when calling `fit_on_parquet`. (#1826)
- Added optional local version to package version. (#1925)
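As a rough illustration of what passing `initial_lr` explicitly means, the sketch below assumes the scheduled rate is the initial rate multiplied by either a constant or a per-epoch multiplier. The function `scheduled_lr` and the step-decay schedule are hypothetical and written in plain Python. They are not part of the Horovod API.

```python
def scheduled_lr(initial_lr, multiplier, epoch):
    """Learning rate for `epoch`: initial_lr scaled by a constant or per-epoch multiplier."""
    factor = multiplier(epoch) if callable(multiplier) else multiplier
    return initial_lr * factor


def step_decay(epoch):
    """Hypothetical schedule: divide the rate by 10 every 30 epochs."""
    return 0.1 ** (epoch // 30)


lrs = [scheduled_lr(0.1, step_decay, epoch) for epoch in (0, 30, 60)]
```

Supplying the initial rate as an argument keeps the schedule independent of whatever value the optimizer happens to hold when the callback is created.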
Bugfixes
- Fixed module resolution for `tf.keras` optimizers when calling `hvd.load_model`. (#1935)
- Modified `safe_shell_exec` to use multiprocessing spawn instead of fork to prevent deadlocks. (#1915)
- Fixed multiprocessing to support Python 3.8. (#1904)
- Added extra preprocessor guard for FMA optimization. (#1835)
- Fixed exception in `KerasEstimator` when `num_proc` is larger than 4. (#1945)
- Fixed memory leaks. (#1845)
- Fixed a bug with sample weight in `TorchEstimator`. (#1790)
- Removed `torchvision` from `pytorch` extra. (#1899)
TensorFlow 2.2 compatibility, MPI args for horovodrun
TensorFlow
- Added `_aggregate_gradients` in `DistributedOptimizer` to support Keras in TensorFlow 2.2. (#1784)
PyTorch
- Removed uses of deprecated PyTorch C++ API to support PyTorch 1.6. (#1731)
Changes to horovodrun
- Added process binding arguments to horovodrun. (#1767)
- Added `--tcp` flag to horovodrun for TCP only communication. (#1744)
Changes to installer
- Added `HOROVOD_BUILD_ARCH_FLAGS` to specify architecture-specific compiler flags. (#1751)
- Added Python extras to enforce that Horovod is installed after other frameworks. (#1785)
API changes
- Added `mpi_args` to `horovod.run.run`. (#1787)
- Added support for data transformation before train and validation in `TorchEstimator`. (#1750)
Bugs
Horovod Spark Estimators, Spark 3, Join, Interactive Run
In version 0.19.0, Horovod adds tighter integration with Apache Spark, including a new high-level Horovod Spark Estimator framework and support for accelerator-aware task-level scheduling in the upcoming Spark 3.0 release. This release also contains experimental new features including a join operation for PyTorch and the ability to launch Horovod jobs programmatically from environments like notebooks using a new interactive run mode.
Horovod Spark Estimators (#1554)
To bridge the gap between large-scale data processing in Spark and large-scale deep learning training with Horovod, we’re introducing a new open source API called Horovod Spark Estimators.
With Horovod Spark Estimators, you can train your deep neural network directly on your existing Spark DataFrame, leveraging Horovod’s ability to scale to hundreds of workers in parallel without any specialized code for distributed training:
    from tensorflow import keras
    import tensorflow as tf
    import horovod.spark.keras as hvd
    from horovod.spark.common.store import HDFSStore

    # Sequential.add() returns None, so layers are added one call at a time.
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(8, input_dim=2))
    model.add(keras.layers.Activation('tanh'))
    model.add(keras.layers.Dense(1))
    model.add(keras.layers.Activation('sigmoid'))

    # NOTE: unscaled learning rate
    optimizer = keras.optimizers.SGD(lr=0.1)
    loss = 'binary_crossentropy'

    # Store for checkpoints, logs, and intermediate training data.
    store = HDFSStore('/user/username/experiments')

    keras_estimator = hvd.KerasEstimator(
        num_proc=4,
        store=store,
        model=model,
        optimizer=optimizer,
        loss=loss,
        feature_cols=['features'],
        label_cols=['y'],
        batch_size=32,
        epochs=10)

    # train_df and test_df are existing Spark DataFrames.
    keras_model = keras_estimator.fit(train_df) \
        .setOutputCols(['predict'])
    predict_df = keras_model.transform(test_df)
Horovod Spark Estimators provide a single abstraction — the Estimator — which hides the complexity of gluing Spark DataFrames to a deep learning training script, reading data into a format interpretable by the training framework, and distributing the training using Horovod. The user only needs to provide a model written in the deep learning framework of their choice, and the Estimator will do the work of fitting it to the DataFrame.
After training, the Estimator returns a Transformer representation of the trained model. The model transformer can be used like any Spark ML transformer to make predictions on an input DataFrame, writing them as new columns in the output DataFrame.
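The fit-then-transform contract can be shown with a toy example that runs without Spark or Horovod. Everything here is hypothetical: `MeanEstimator` and `MeanModel` are illustrative stand-ins, rows are plain dictionaries instead of DataFrame rows, and the "model" only learns a mean.

```python
class MeanModel:
    """Toy transformer: appends the learned mean as a new output column."""

    def __init__(self, mean, output_col="predict"):
        self.mean = mean
        self.output_col = output_col

    def setOutputCols(self, cols):
        # Mirrors the setter name used in the Estimator example; keeps the first column only.
        self.output_col = cols[0]
        return self

    def transform(self, rows):
        # Build new rows so the input rows stay unmodified.
        return [{**row, self.output_col: self.mean} for row in rows]


class MeanEstimator:
    """Toy estimator: 'training' computes the mean of the label column."""

    def __init__(self, label_col):
        self.label_col = label_col

    def fit(self, rows):
        values = [row[self.label_col] for row in rows]
        return MeanModel(sum(values) / len(values))


train_rows = [{"y": 1.0}, {"y": 3.0}]
test_rows = [{"features": [0.5, 0.2]}]
model = MeanEstimator(label_col="y").fit(train_rows).setOutputCols(["predict"])
predicted = model.transform(test_rows)
```

The point of the pattern is that training and inference share one data representation: `fit` consumes rows and returns an object whose `transform` adds prediction columns to rows of the same kind.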
Estimators can be used to track experiment history through model checkpointing, hot start retraining, and metric logging (for Tensorboard) using the Estimator Store abstraction. Stores persist all training artifacts including intermediate representations of the training data. Horovod natively supports stores for HDFS and local filesystems.
Horovod Spark Estimators are available for Keras (both tf.keras and standalone Keras) and PyTorch, with more frameworks (including pure TensorFlow) coming soon.
Spark 3.0 task-level GPU scheduling (#1584)
Apache Spark 3.0 introduces a new accelerator-aware scheduling capability, allowing a production ETL job running on CPUs to hand off data to Horovod running distributed deep learning training on GPUs within the same pipeline, breaking down the barriers between ETL and continuous model training.
Horovod users can now request GPU resources directly from their Spark application, without assuming which tasks should map to which GPUs:
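    import horovod.spark

    def train():
        # TensorFlow 1.x session API, as used in the rest of this example.
        import tensorflow as tf
        from tensorflow.keras import backend as K
        from horovod.spark.task import get_available_devices
        import horovod.tensorflow.keras as hvd

        hvd.init()
        # Pin this task to the first GPU that Spark assigned to it.
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        config.gpu_options.visible_device_list = get_available_devices()[0]
        K.set_session(tf.Session(config=config))
        ...

    horovod.spark.run(train)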
Check out the keras_spark3_rossmann.py script for a complete example.
Spark 3.0 is currently in preview release, with the full release forthcoming.
Join Operation for PyTorch (#1058)
The ability for different workers to train on a different number of batches in each epoch has been one of the most requested features for Horovod. This problem frequently arises when a dataset doesn’t evenly split among all workers, requiring the user to truncate any extra examples or risk deadlock during training.
With the new join operation, users no longer need to worry about how evenly their dataset divides when training. Just add a join step at the end of each epoch, and Horovod will train on any extra batches without causing the waiting workers to deadlock:
    for epoch in range(epochs):
        for batch in dataset:
            ...
        hvd.join(device=hvd.local_rank())
The join operation is currently supported in Horovod for PyTorch, with support for TensorFlow and Apache MXNet coming soon.
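To make the uneven-split problem concrete, here is a small Horovod-free sketch that counts how many batches each worker would see. The helper `batches_per_worker` is hypothetical, and the sharding rule (spread the remainder over the lowest ranks) is an assumption chosen for illustration.

```python
import math


def batches_per_worker(num_examples, num_workers, batch_size):
    """Number of batches each worker processes when examples are sharded as evenly as possible."""
    base, extra = divmod(num_examples, num_workers)
    # The first `extra` ranks each receive one additional example.
    shard_sizes = [base + (1 if rank < extra else 0) for rank in range(num_workers)]
    return [math.ceil(size / batch_size) for size in shard_sizes]


counts = batches_per_worker(num_examples=100, num_workers=3, batch_size=11)
# Steps each worker would spend waiting in the join while slower workers finish.
idle_steps = [max(counts) - c for c in counts]
```

With 100 examples, 3 workers, and a batch size of 11, rank 0 runs one more batch than the others. Without a join step, that extra batch would issue a collective operation that the finished workers never answer.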
Interactive Run Mode (#1307)
`horovod.spark.run` already lets you launch training jobs programmatically by defining Python functions that are executed on Spark executors. Interactive Run Mode provides a similar API that can launch training jobs on any visible hosts, much like the command-line `horovodrun` tool:
    from horovod.run import run as hvdrun

    def train():
        import horovod.tensorflow as hvd
        hvd.init()
        ...

    results = hvdrun(train, np=2)
Interactive mode supports most of the functionality provided by `horovodrun`. See the API for a complete reference.
Bug Fixes and Improvements
- Added NCCL implementation of `hvd.broadcast` when building with `HOROVOD_GPU_BROADCAST=NCCL` (#1579).
- Fixed `hvd.allgather` to work with CUDA tensors when building with `HOROVOD_GPU_ALLGATHER=MPI` (#1480).
- Fixed a crash bug in MXNet caused by early free of tensor object (#1639).
- Added experimental implementation for the Adasum gradient aggregation method from Microsoft (full support coming in v0.20.0) (#1485).
- Added support for Intel oneCCL to replace MLSL (#1566).
- Added FP16 support in IBM DDL (#1465).
- Improved support for running Horovod on Spark with YARN (#1525).
- Added support for TensorFlow 2.0 learning rate schedules with `tf.keras` (#1588).
- Added support for broadcasting Python objects with PyTorch (#1609).
- Added thread pool for CUDA finalizer threads (#1562).
- Fixed host file usage and parsing within `horovodrun` (#1607).