This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in PyTorch. Some of the code here will be included in upstream PyTorch eventually. The intention of Apex is to make up-to-date utilities available to users as quickly as possible.
1. Mixed Precision
amp: Automatic Mixed Precision
apex.amp is a tool designed for ease of use and maximum safety in FP16 training. All potentially unsafe ops are performed in FP32 under the hood, while safe ops are performed using faster, Tensor Core-friendly FP16 math.
amp also automatically implements dynamic loss scaling.
The intention of amp is to be the "on-ramp" to easy FP16 training: achieve all the numerical stability of full FP32 training, with most of the performance benefits of full FP16 training.
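For illustration, here is a minimal sketch of the handle-based amp workflow; `model`, `optimizer`, `criterion`, and `loader` are assumed to be defined elsewhere:

```python
import torch
from apex import amp

# Enable amp once, near the top of the script. Potentially unsafe ops are
# cast to FP32 under the hood; dynamic loss scaling is handled automatically.
amp_handle = amp.init(enabled=True)

for input, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(input), target)
    # Scale the loss so small FP16 gradients don't underflow; gradients
    # are unscaled automatically when the context exits.
    with amp_handle.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```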
apex.FP16_Optimizer wraps an existing PyTorch optimizer and automatically implements master parameters and static or dynamic loss scaling under the hood.
The intention of FP16_Optimizer is to be the "highway" for FP16 training: achieve most of the numerical stability of full FP32 training, and almost all the performance benefits of full FP16 training.
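A minimal sketch of the FP16_Optimizer workflow; `model`, `criterion`, `input`, and `target` are assumed to be defined elsewhere:

```python
import torch
from apex.fp16_utils import FP16_Optimizer

model = model.half().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Wrap the existing optimizer; dynamic_loss_scale=True enables dynamic
# loss scaling, or pass static_loss_scale=128.0 for a fixed scale.
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)

loss = criterion(model(input), target)
optimizer.backward(loss)  # replaces loss.backward(); applies loss scaling
optimizer.step()          # steps FP32 master weights, then copies them back to the FP16 model
```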
The Imagenet and word_language_model directories also contain examples that show manual management of master parameters and static loss scaling.
These manual examples illustrate the sort of operations FP16_Optimizer performs automatically, as sketched below.
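For concreteness, here is a hedged sketch of that manual pattern under static loss scaling; it is illustrative only, and the variable names are assumptions rather than code taken from the examples:

```python
import torch

# model, criterion, input, and target are assumed to be defined elsewhere.
model = model.half().cuda()

# FP32 "master" copies of the FP16 model parameters; the optimizer steps these.
master_params = [p.detach().clone().float() for p in model.parameters()]
for mp in master_params:
    mp.requires_grad = True
optimizer = torch.optim.SGD(master_params, lr=0.1)

loss_scale = 128.0  # static scale; the chosen value is an assumption
loss = criterion(model(input), target)
model.zero_grad()
(loss * loss_scale).backward()  # backward on the scaled loss yields scaled FP16 grads
for mp, p in zip(master_params, model.parameters()):
    mp.grad = p.grad.detach().float() / loss_scale  # unscale into FP32 master grads
optimizer.step()
for mp, p in zip(master_params, model.parameters()):
    p.data.copy_(mp.data)  # copy updated masters back into the FP16 model
```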
2. Distributed Training
apex.parallel.DistributedDataParallel is a module wrapper, similar to torch.nn.parallel.DistributedDataParallel. It enables convenient multiprocess distributed training, optimized for NVIDIA's NCCL communication library.
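A minimal sketch of the wrapper in use, assuming one process per GPU launched via torch.distributed.launch (which supplies --local_rank); `model` is assumed to be defined elsewhere:

```python
import argparse
import torch
from apex.parallel import DistributedDataParallel

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

model = model.cuda()
model = DistributedDataParallel(model)  # gradients are allreduced across processes during backward()
```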
The Imagenet with FP16_Optimizer mixed precision examples also demonstrate apex.parallel.DistributedDataParallel.
Synchronized Batch Normalization
apex.parallel.SyncBatchNorm extends torch.nn.modules.batchnorm._BatchNorm to support synchronized BN. It reduces stats across processes during multiprocess distributed data parallel training.
Synchronous Batch Normalization has been used in cases where only a very small local minibatch can fit on each GPU. All-reduced stats increase the effective batch size of the sync BN layer to the total minibatch size across all processes.
It has improved the converged accuracy in some of our research models.
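A short sketch, assuming a multiprocess setup as above and a model that already contains torch.nn.BatchNorm layers:

```python
import apex

# Replace every BatchNorm*D layer in the model with apex.parallel.SyncBatchNorm,
# which allreduces batch statistics across processes during training.
model = apex.parallel.convert_syncbn_model(model)  # model assumed defined elsewhere
```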
Requirements
Python 3
CUDA 9 or 10
PyTorch 0.4 or newer. We recommend using the latest stable release, obtainable from https://pytorch.org/. We also test against the latest master branch, obtainable from https://github.com/pytorch/pytorch.
If you have any problems building, please file an issue.
The C++ and CUDA extensions require PyTorch 1.0 or newer.
To build Apex, run
python setup.py install
in the root directory of the cloned repository.
To use Apex, simply import apex in your Python code.
Apex contains optional CUDA/C++ extensions, installable via
python setup.py install [--cuda_ext] [--cpp_ext]
These extensions provide:
- Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
- Fused kernels required to use apex.optimizers.FusedAdam.
- Fused kernels required to use apex.normalization.FusedLayerNorm (see the sketch after this list).
- C++-side flattening and unflattening utilities that reduce the CPU overhead of apex.parallel.DistributedDataParallel.
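As referenced in the list above, here is a hedged sketch of two of the fused modules in use, assuming Apex was built with --cuda_ext and a CUDA device is available:

```python
import torch
from apex.normalization import FusedLayerNorm
from apex.optimizers import FusedAdam

# Fused CUDA LayerNorm over the last dimension of size 512.
layer = FusedLayerNorm(512).cuda()
out = layer(torch.randn(8, 512, device='cuda'))

# Fused Adam kernel; a toy parameter list stands in for model.parameters().
params = [torch.nn.Parameter(torch.randn(10, device='cuda'))]
optimizer = FusedAdam(params, lr=1e-3)
```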
Windows support is experimental, and Linux is recommended. However, since Apex can be installed Python-only, there's a good chance the Python-only features "just work" the same way they do on Linux. If you installed PyTorch in a Conda environment, make sure to install Apex in that same environment.