Deep Learning Framework Examples

For more details check out our blog-post

Goal

Create a Rosetta Stone of deep-learning frameworks to allow data-scientists to easily leverage their expertise from one framework to another
Optimised GPU code using the most up-to-date highest-level APIs.
Common setup for comparisons across GPUs (potentially CUDA versions and precision)
Common setup for comparisons across languages (Python, Julia, R)
Possibility to verify expected performance of own installation
Collaboration between different open-source communities

The notebooks are executed on an Azure Deep Learning Virtual Machine.

Accuracies (and other metrics) are reported in notebooks

Results

1. Training Time(s): CNN (VGG-style, 32bit) on CIFAR-10 - Image Recognition

DL Library	K80/CUDA 8/CuDNN 6	P100/CUDA 8/CuDNN 6
Caffe2	148	54
Chainer	162	69
CNTK	163	53
MXNet(Gluon)	152	57
Keras(CNTK)	194	76
Keras(TF)	241	76
Keras(Theano)	269	93
Tensorflow	173	57
Lasagne(Theano)	253	65
MXNet(Module API)	145	52
PyTorch	169	51
Julia - Knet	159	??
R - Keras(TF)	205	72

Note: It is recommended to use higher level APIs where possible; see these notebooks for examples with Tensorflow, MXNet and CNTK. They are not linked in the table to keep the common-structure-for-all approach

Input for this model is the standard CIFAR-10 dataset containing 50k training images and 10k test images, uniformly split across 10 classes. Each 32 by 32 image is supplied as a tensor of shape (3, 32, 32) with pixel intensity re-scaled from 0-255 to 0-1.
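The input preparation described above can be sketched in NumPy. This is an illustration rather than the notebooks' own loader: it assumes the raw pixels arrive as a channels-last uint8 array, and `prepare_images` is a hypothetical helper name.

```python
import numpy as np

def prepare_images(raw_uint8):
    """Convert (N, 32, 32, 3) uint8 images to (N, 3, 32, 32) float32 in [0, 1]."""
    x = raw_uint8.astype(np.float32) / 255.0   # re-scale 0-255 to 0-1
    return np.transpose(x, (0, 3, 1, 2))       # channels-last -> channels-first

# Random stand-in for real CIFAR-10 pixels (assumption: channels-last input)
rng = np.random.default_rng(0)
fake_batch = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
x = prepare_images(fake_batch)
print(x.shape, x.dtype)
```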

2. Training Time: DenseNet-121 on ChestXRay - Image Recognition (Multi-GPU)

Train+Val w/ data-loader + data-augmentation on real-data on SSD

DL Library	1xV100/CUDA 9/CuDNN 7	4xV100/CUDA 9/CuDNN 7
Pytorch	27min	10min
Keras(TF)	38min	18min
Tensorflow	33min	22min
Chainer	29min	8min
MXNet(Gluon)	29min	10min

Train w/ synthetic-data in RAM

DL Library	1xV100/CUDA 9/CuDNN 7	4xV100/CUDA 9/CuDNN 7
Pytorch	25min	8min
Keras(TF)	36min	15min
Tensorflow	25min	14min
Chainer	27min	7min
MXNet(Gluon)	28min	8min

Notes:

Chainer suffered an AUC drop relative to all other frameworks when going from single to multi-GPU

Input for this model is 112,120 PNGs of chest X-rays resized to (264, 264). Note that for the notebook to automatically download the data you must install AzCopy and increase the size of your OS-Disk in Azure Portal so that you have at least 45GB of free space (the Chest X-ray data is large!). The notebooks may take more than 10 minutes to first download the data. These notebooks train DenseNet-121 and use native data-loaders to pre-process the data and perform some augmentations (random horizontal flip and random crop to 224px).
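The two augmentations can be sketched in a framework-neutral way. The notebooks use each framework's native data-loader, so the NumPy code below is only an illustration; `augment` is a hypothetical helper and the 50% flip probability is an assumption.

```python
import numpy as np

def augment(image, crop=224, rng=None):
    """Random horizontal flip followed by a random square crop.

    image: array of shape (H, W) or (H, W, C), e.g. a 264x264 X-ray.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:                      # assumed flip probability
        image = image[:, ::-1]                  # flip left-right
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop + 1))    # random crop offsets
    left = int(rng.integers(0, w - crop + 1))
    return image[top:top + crop, left:left + crop]

rng = np.random.default_rng(42)
xray = rng.random((264, 264, 3), dtype=np.float32)  # stand-in for a resized PNG
patch = augment(xray, rng=rng)
print(patch.shape)
```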

3. Avg Time(s) for 1000 images: ResNet-50 - Feature Extraction

DL Library	K80/CUDA 8/CuDNN 6	P100/CUDA 8/CuDNN 6
Caffe2	14.1	7.9
Chainer	9.3	2.7
CNTK	8.5	1.6
MXNet(Gluon)		1.7
Keras(CNTK)	21.7	5.9
Keras(TF)	10.2	2.9
Tensorflow	6.5	1.8
MXNet(Module API)	7.7	1.6
PyTorch	7.7	1.9
Julia - Knet	6.3	???
R - MXNet	???	???
R - Keras(TF)	17	7.4

A pre-trained ResNet50 model is loaded and chopped just after the avg_pooling at the end (7, 7), which outputs a 2048-dimensional vector. This can be plugged into a softmax layer or another classifier such as a boosted tree to perform transfer learning. Allowing for a warm start, this forward-only pass to the avg_pool layer is timed. Note: batch-size remains constant; however, filling the RAM on a GPU would produce further performance boosts (greater for GPUs with more RAM).
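As a rough illustration of what is measured, the sketch below averages each (7, 7) feature map into a 2048-dimensional vector and times the forward pass after one untimed warm-up batch. Reading "warm start" as a single untimed first batch is my assumption, the random arrays stand in for real ResNet-50 activations in channels-first layout, and both helper names are hypothetical.

```python
import time
import numpy as np

def global_avg_pool(feature_maps):
    """Average each (7, 7) map: (N, 2048, 7, 7) -> (N, 2048)."""
    return feature_maps.mean(axis=(2, 3))

def timed_with_warm_start(fn, batches):
    """Run the first batch untimed, then time the remaining batches."""
    fn(batches[0])                        # warm start: excluded from timing
    start = time.perf_counter()
    outputs = [fn(b) for b in batches[1:]]
    return outputs, time.perf_counter() - start

rng = np.random.default_rng(0)
batches = [rng.random((4, 2048, 7, 7), dtype=np.float32) for _ in range(3)]
features, seconds = timed_with_warm_start(global_avg_pool, batches)
print(features[0].shape)
```

The resulting (N, 2048) array is what would be passed to a softmax layer or a boosted tree.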

4. Training Time(s): RNN (GRU) on IMDB - Sentiment Analysis

DL Library	K80/CUDA 8/CuDNN 6	P100/CUDA 8/CuDNN 6	Using CuDNN?
CNTK	32	15	Yes
Keras(CNTK)	86	53	No
Keras(TF)	35	26	Yes
MXNet(Module API)	29	24	Yes
MXNet(Gluon API)	TBA	TBA	Yes
Pytorch	31	16	Yes
Tensorflow	30	22	Yes
Julia - Knet	29	??	Yes
R - MXNet	??	??	???
R - Keras(TF)	35	25	Yes

Input for this model is the standard IMDB movie review dataset containing 25k training reviews and 25k test reviews, uniformly split across 2 classes (positive/negative). Processing follows the Keras approach, where the start-character is set as 1, out-of-vocab (a vocab size of 30k is used) is represented as 2, and thus the word-index starts from 3. Reviews are zero-padded / truncated to a fixed axis of 150 words per review.
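A small pure-Python sketch of this encoding follows. The 0-based frequency ranks are a convention chosen for this example, and padding on the left while keeping the last 150 tokens is an assumption (it matches the defaults of Keras's pad_sequences, but the notebooks may differ). Both function names are hypothetical.

```python
START, OOV, INDEX_FROM = 1, 2, 3
VOCAB_SIZE, MAXLEN = 30000, 150

def encode(word_ranks, vocab_size=VOCAB_SIZE):
    """Map 0-based frequency ranks to ids: 1 = start, 2 = out-of-vocab, 3+ = words."""
    ids = [START]
    for rank in word_ranks:
        idx = rank + INDEX_FROM
        ids.append(idx if idx < vocab_size else OOV)
    return ids

def pad_or_truncate(ids, maxlen=MAXLEN):
    """Return exactly maxlen ids (assumed: left zero-padding, keep the tail)."""
    ids = ids[-maxlen:]                        # truncate from the front
    return [0] * (maxlen - len(ids)) + ids     # zero-pad on the left

review = encode([0, 5, 40000])                 # third word is out-of-vocab
fixed = pad_or_truncate(review)
print(review, len(fixed))
```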

Where possible we try to use the cuDNN-optimised RNN (noted by the CUDNN=True switch), since we have a vanilla RNN that can be easily reduced to the cuDNN level. For example with CNTK we use optimized_rnnstack instead of Recurrence(LSTM()). This is much faster but less flexible and, for example, with CNTK we can no longer use more complicated variants like Layer Normalisation, etc. It appears in PyTorch this is enabled by default. For MXNet I could not find this and instead use the slightly slower Fused RNN. Keras has just very recently received cuDNN support, however only for the Tensorflow backend (not CNTK). Tensorflow has many RNN variants (including their own custom kernel) and there is a nice benchmark here; I will try to update the example to use CudnnLSTM instead of the current method.

Note: CNTK supports dynamic axes which means we don't need to pad the input to 150 words and can consume as-is, however since I could not find a way to do this with other frameworks I have fallen back to padding - which is a bit unfair on CNTK and understates its capabilities

The classification model creates an embedding matrix of size (150x125) and then applies 100 gated recurrent units and takes as output the final output (not sequence of outputs and not hidden state). Any suggestions on alterations to this are welcome.
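To make the data flow concrete, here is a NumPy sketch of the forward pass: embedding lookup, 100 GRU units, and the output at the last time step. It reads (150x125) as the embedded review (150 time steps of 125-dimensional vectors), which is my interpretation. The gate equations are one common GRU formulation and frameworks differ in the details; the weights are random placeholders and the final classification layer is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_final_output(x_seq, params):
    """x_seq: (T, N, E) embedded tokens -> output at the last step, shape (N, H)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    h = np.zeros((x_seq.shape[1], Uz.shape[0]), dtype=x_seq.dtype)
    for x_t in x_seq:                                     # iterate over time steps
        z = sigmoid(x_t @ Wz + h @ Uz + bz)               # update gate
        r = sigmoid(x_t @ Wr + h @ Ur + br)               # reset gate
        h_tilde = np.tanh(x_t @ Wh + (r * h) @ Uh + bh)   # candidate state
        h = (1.0 - z) * h + z * h_tilde
    return h

rng = np.random.default_rng(0)
VOCAB, T, E, H, N = 30000, 150, 125, 100, 2

def init(rows, cols):
    return 0.1 * rng.standard_normal((rows, cols))        # placeholder weights

embedding = init(VOCAB, E)
tokens = rng.integers(0, VOCAB, size=(N, T))              # stand-in for padded reviews
x_seq = embedding[tokens].transpose(1, 0, 2)              # (N, T, E) -> (T, N, E)

params = (init(E, H), init(H, H), np.zeros(H),
          init(E, H), init(H, H), np.zeros(H),
          init(E, H), init(H, H), np.zeros(H))
final = gru_final_output(x_seq, params)
print(final.shape)
```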

Lessons Learned

CNN

The below offers some insights we gained after trying to match test-accuracy across frameworks and from all the GitHub issues/PRs raised.

The above examples (except for Keras), for ease of comparison, try to use the same level of API and so all use the same generator-function. For MXNet, Tensorflow, and CNTK I have experimented with a higher-level API, where I use the framework's training generator function. The speed improvement is negligible in this example because the whole dataset is loaded as a NumPy array in RAM and the only processing done each epoch is a shuffle. I suspect the framework's generators perform the shuffle asynchronously. Curiously, it seems that the frameworks shuffle on a batch-level, rather than on an observation level, which ever so slightly decreases the test-accuracy (at least after 10 epochs). For scenarios where we have IO activity and perhaps pre-processing and data-augmentation on the fly, custom generators would have a much bigger impact on performance.
Running on CuDNN we want to use [NCHW] instead of channels-last. Keras finally supports this for Tensorflow (previously it had NHWC hard-coded and would auto-reshape after every batch)
Enabling CuDNN's auto-tune/exhaustive search parameter (which selects the most efficient CNN algorithm for images of fixed-size) produced a huge performance boost on the K80 many months ago. However, now most frameworks have automatically integrated this.
Some frameworks required a boolean supplied to the dropout-layer indicating whether we were training or not (this had a huge impact on test-accuracy, 72 vs 77%). Dropout should not be applied to test in this case.
TF_ENABLE_WINOGRAD_NONFUSED no longer speeds up TensorFlow's CNN and actually makes it slower
Softmax is usually bundled into the cross-entropy loss function (e.g. cross_entropy_loss()) in most frameworks, so it's worth checking whether you need an activation on your final fully-connected layer, to avoid spending time applying it twice
Kernel initializer for different frameworks can vary (I've found this to have +/- 1% effect on accuracy) and I try to specify xavier/glorot uniform whenever possible/not too verbose
Type of momentum implemented for SGD-momentum; I had to turn off unit_gain (which was on by default in CNTK) to match other frameworks' implementations
Caffe2 has an extra optimisation for the first layer of a network (no_gradient_to_input=1) that produces a small speed-boost by not computing gradients for input. It's possible that Tensorflow and MXNet already enable this by default. Computing this gradient could be useful for research purposes and for networks like deep-dream
Applying the ReLU activation after max-pooling (instead of before) means you perform a calculation after dimensionality-reduction and thus shave off a few seconds. This helped reduce MXNet time by 3 seconds
Some further checks which may be useful:

specifying kernel as (3) becomes a symmetric tuple (3, 3) or 1D convolution (3, 1)?
strides (for max-pooling) are (1, 1) by default or equal to kernel (Keras does this)?
default padding is usually off (0, 0)/valid but useful to check it's not on/'same'
is the default activation on a convolutional layer 'None' or 'ReLu' (Lasagne)
the bias initializer may vary (sometimes no bias is included)
gradient clipping and treatment of infinity/NaNs may differ across frameworks
some frameworks support sparse labels instead of one-hot (which I use if available, e.g. Tensorflow has tf.nn.sparse_softmax_cross_entropy_with_logits)
data-type assumptions may be different - I try to use float32 and int32 for X and y but, for example, torch needs long (int64) for y (to be coerced via torch.LongTensor(y).cuda())
if the framework has a slightly lower-level API make sure during testing you don't compute the gradient by setting something like training=False
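A few of the points above can be checked numerically without any framework. The NumPy sketch below shows that ReLU and max-pooling commute (so the order only affects cost), that sparse labels and one-hot labels give the same cross-entropy, and how a unit-gain momentum update differs from the classic one. The unit-gain form used here, with the gradient scaled by (1 - momentum), is my understanding of CNTK's default; all function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """Non-overlapping 2x2 max-pool on an (N, C, H, W) array with even H and W."""
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def sparse_cross_entropy(logits, labels):
    """labels are integer class ids of shape (N,)."""
    return -log_softmax(logits)[np.arange(len(labels)), labels].mean()

def one_hot_cross_entropy(logits, one_hot):
    return -(one_hot * log_softmax(logits)).sum(axis=1).mean()

def momentum_velocity(grads, momentum=0.9, unit_gain=False):
    """Velocity after a sequence of gradients (assumed unit-gain form, see above)."""
    scale = (1.0 - momentum) if unit_gain else 1.0
    v = 0.0
    for g in grads:
        v = momentum * v + scale * g
    return v

rng = np.random.default_rng(0)

# ReLU after pooling gives the same values while touching 4x fewer elements
x = rng.standard_normal((2, 3, 8, 8))
same_pool = np.array_equal(relu(max_pool_2x2(x)), max_pool_2x2(relu(x)))

# Sparse integer labels versus one-hot labels
logits = rng.standard_normal((5, 10))
labels = rng.integers(0, 10, size=5)
loss_sparse = sparse_cross_entropy(logits, labels)
loss_dense = one_hot_cross_entropy(logits, np.eye(10)[labels])

# With a constant gradient the two momentum variants settle at different scales
v_classic = momentum_velocity([1.0] * 200)
v_unit = momentum_velocity([1.0] * 200, unit_gain=True)
print(same_pool, loss_sparse, loss_dense, v_classic, v_unit)
```

With momentum 0.9 the classic velocity settles at roughly ten times the unit-gain one, so the effective step size differs by that factor unless the learning rate is adjusted.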

Installing Caffe2 for python 3.5 proved a bit difficult so I wanted to share the process:

# build as root
sudo -s
cd /opt/caffe2
make clean
git pull
git checkout v0.8.1
git submodule update
export CPLUS_INCLUDE_PATH=/anaconda/envs/py35/include/python3.5m
mkdir build
cd build
echo $PATH
# CONFIRM that Anaconda is not in the path
cmake .. -DBLAS=MKL -DPYTHON_INCLUDE_DIR=/anaconda/envs/py35/include/python3.5m -DPYTHON_LIBRARY=/anaconda/envs/py35/lib/libpython3.5m.so -DPYTHON_EXECUTABLE=/anaconda/envs/py35/bin/python
make -j$(nproc)
make install

When using MXNet, you should avoid assigning outputs or data to numpy np.array in your training loop. This causes the data to be copied from the GPU to the CPU. You should use mx.nd.array instead, allocated in the right context at the beginning. This can dramatically increase performance.
When using MXNet, operations are allocated on the queue of the back-end engine and parallelized, so try to avoid any blocking operations in your training loop. You can add a nd.waitall(), which will force waiting for all operations to complete, at the end of each epoch to avoid filling up your memory.
With MXNet/Gluon, calling .hybridize() on your network will cache the computation graph and you will get performance gains. However, that means that you won't be able to step through every calculation anymore. Use it once you are done debugging your network.

RNN

There are multiple RNN implementations/kernels available for most frameworks (for example Tensorflow); once reduced down to the cudnnLSTM/GRU level the execution is the fastest, however this implementation is less flexible (e.g. maybe you want layer normalisation) and may become problematic if inference is run on the CPU at a later stage. At the cuDNN level most of the frameworks' runtimes are very similar. This Nvidia blog-post goes through several interesting cuDNN optimisations for recurrent neural nets e.g. fusing - "combining the computation of many small matrices into that of larger ones and streaming the computation whenever possible, the ratio of computation to memory I/O can be increased, which results in better performance on GPU".
It seems that the fastest data-shape for RNNs is TNC - implementing this in MXNet only gave an improvement of 0.5s so I have chosen to use the slightly slower shape to remain consistent with other frameworks and to keep the code less complicated
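The fusing idea and the TNC layout can both be illustrated in NumPy. This is a sketch of the general principle only, not of cuDNN's actual kernels: the three input projections of a GRU are computed with one concatenated weight matrix instead of three separate ones, and because they do not depend on the previous hidden state they can be computed for all time steps in a single matrix product. The sizes reuse those from the IMDB example; the weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, E, H = 4, 150, 125, 100     # batch, time steps, embedding dim, hidden units

# Batch-major (NTC) input converted to time-major (TNC)
x_ntc = rng.standard_normal((N, T, E))
x_tnc = np.ascontiguousarray(x_ntc.transpose(1, 0, 2))

# Three separate input-projection matrices (update, reset, candidate)
Wz, Wr, Wh = (rng.standard_normal((E, H)) for _ in range(3))

# Unfused: three small matrix products for one time step
x_t = x_tnc[0]
separate = [x_t @ W for W in (Wz, Wr, Wh)]

# Fused: one larger product, then split back into the three gates
W_fused = np.concatenate([Wz, Wr, Wh], axis=1)          # (E, 3H)
fused = np.split(x_t @ W_fused, 3, axis=1)

# Streaming over time: project every time step in one product
all_steps = (x_tnc.reshape(T * N, E) @ W_fused).reshape(T, N, 3 * H)
print(x_tnc.shape, all_steps.shape)
```

In the time-major layout each time step `x_tnc[t]` is a contiguous (N, C) block, which is one reason this shape suits step-by-step recurrent computation.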

