RNNs, and particularly LSTMs and GRUs, have made a significant contribution to deep learning applications.
They are the default go-to tool for natural language processing, are heavily explored in reinforcement learning, and are used in many combined vision+text tasks and in time-series prediction (though in competition with WaveNets).
The CuDNN implementation is already heavily optimized; the CPU implementation should be as fast as possible as well.
General overview
- GRU Paper
- CS231n 2017 - lecture 10
- Colah tutorial
- Towards Data Science
- Tensorflow vs PyTorch/CuDNN
PyTorch equations:

```
r  = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr})
z  = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n  = tanh(W_{in} x + b_{in} + W_{hn} (r * h) + b_{hn})
h' = (1 - z) * n + z * h
```

Note that in the paper the equations are:

```
r  = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr})
z  = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n  = tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn}))
h' = (1 - z) * n + z * h
```

And CuDNN:

```
r  = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr})
z  = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n  = tanh(W_{in} x + b_{in} + W_{hn} (r * h) + b_{hn})
h' = (1 - z) * h + z * n
```

In the CuDNN documentation's own notation:

```
i_t  = σ(W_i x_t + R_i h_{t-1} + b_Wi + b_Ri)
r_t  = σ(W_r x_t + R_r h_{t-1} + b_Wr + b_Rr)
h'_t = tanh(W_h x_t + r_t ◦ (R_h h_{t-1} + b_Rh) + b_Wh)
h_t  = (1 - i_t) ◦ h'_t + i_t ◦ h_{t-1}
```
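The main differences above are where the reset gate is applied (to h before the recurrent matrix multiplication, or to the result of that multiplication) and which of the two states the update gate weighs. A minimal NumPy sketch of a single GRU step covering both reset-gate placements, with illustrative names (Wr, Ur, etc.) and a single combined bias per equation rather than any particular library's layout:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, Wr, Ur, br, Wz, Uz, bz, Wn, Un, bn, reset_before_matmul=True):
    """One GRU step. x: (batch, input_dim), h: (batch, hidden_dim).

    reset_before_matmul=True  -> n = tanh(x Wn + (r * h) Un + bn)
    reset_before_matmul=False -> n = tanh(x Wn + r * (h Un) + bn)
    The reset/update gate equations are the same in both variants.
    """
    r = sigmoid(x @ Wr + h @ Ur + br)   # reset gate
    z = sigmoid(x @ Wz + h @ Uz + bz)   # update gate
    if reset_before_matmul:
        n = np.tanh(x @ Wn + (r * h) @ Un + bn)   # reset applied to h before the matmul
    else:
        n = np.tanh(x @ Wn + r * (h @ Un) + bn)   # reset applied after the matmul
    return (1.0 - z) * n + z * h        # interpolate candidate and previous state
```

The two placements give the same result only when r is all ones; otherwise weights trained with one convention cannot be used directly with the other, which is why the choice matters for CuDNN compatibility.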
Readable implementations
- WildML - GRU
- Pure Numpy GRU implementation used by Intel Nervana Neon for testing
- Neon GRU test suite
- Neon implementation
- PyTorch implementation
- CuTorch fused RNN implementation
- Torch implementation of LSTM by jcjohnson
- Official Torch RNN
- Theano RNN
- Lasagne/Theano official implementation
- Tensorflow tutorial for GRU
- MXNet high-level API
"Unreadable" C++ implementations (static graphs)
Benchmarks
Unfortunately only GPU benchmarks are available:
Optimized implementations
- GRU4Rec in Theano; apparently this was 170x faster than the Tensorflow code
- Nvidia on how to optimize RNNs, and the accompanying paper.
- Baidu Research:
  - in-depth part 1
    - Combine the multiplications by weights across timesteps (see the sketch after this list)
    - Gemm NN and Gemm TN do not have the same speed (including for CUBLAS)
  - in-depth part 2 on graph optimization
    - Concatenation across timesteps and gates
    - Moving the reset gate
    - Saving activations
    - Persistent RNNs for small batches, with weights kept in GPU registers
- Yandex (the Russian search engine) Faster-RNNLM
  - Focuses on the One Billion Word Benchmark and can process about 250k words per second with 8 threads at 3.3 GHz
- Paper with 3 GRU variants with fewer parameters, by Rahul Dey and Fathi M. Salem
  - See also Wikipedia
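To illustrate the first Baidu point above (combining the multiplications by weights across timesteps), here is a rough NumPy sketch. The input-side projections do not depend on the recurrence, so the per-timestep GEMMs can be fused into one large GEMM before the sequential loop; only the recurrent part has to stay inside it. Names (`xs`, `W_gates`, `b_gates`) are illustrative, not taken from any of the implementations above:

```python
import numpy as np

def gru_input_projections(xs, W_gates, b_gates):
    """Fuse the input-side GEMMs of a GRU over the whole sequence.

    xs      : (seq_len, batch, input_dim) inputs
    W_gates : (input_dim, 3 * hidden_dim) weights for r, z, n concatenated
    b_gates : (3 * hidden_dim,) input biases concatenated

    Instead of seq_len small GEMMs (one per timestep), a single
    (seq_len * batch, input_dim) x (input_dim, 3 * hidden_dim) GEMM is issued,
    which also demonstrates the "concatenation across timesteps and gates" point.
    """
    seq_len, batch, input_dim = xs.shape
    flat = xs.reshape(seq_len * batch, input_dim)
    proj = flat @ W_gates + b_gates               # one big GEMM for all timesteps and gates
    return proj.reshape(seq_len, batch, -1)       # sliced per timestep inside the recurrent loop
```

The recurrent projections (which multiply h_{t-1}) cannot be batched this way, because each depends on the previous timestep's output.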
Note on biases and equations
The various implementations do not agree on the number of biases or on the equations chosen.
- WildML uses 1 bias per equation; Keras and Neon do too.
- Chainer, Torch and CuDNN use 2 biases.
To allow loading weights on both CPU and GPU, it would be best to use the same equations as CuDNN.
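For the reset and update gates, the two-bias (CuDNN-style) parameterization only differs from the single-bias one by an additive constant, so those bias pairs can simply be summed when importing weights; the exception is the candidate-state bias when the reset gate is applied after the recurrent matmul, because there the recurrent bias is multiplied by r and cannot be folded. A small sketch (illustrative helper, not an existing API):

```python
import numpy as np

def merge_gate_biases(b_input, b_recurrent):
    """Collapse a CuDNN-style bias pair (b_W, b_R) into a single bias.

    Valid for the reset and update gates, where both biases enter as a plain sum:
        sigmoid(W x + R h + b_W + b_R) == sigmoid(W x + R h + (b_W + b_R))
    NOT valid for the candidate state in the CuDNN form
        tanh(W x + r * (R h + b_R) + b_W)
    because b_R is scaled by the reset gate there.
    """
    return np.asarray(b_input) + np.asarray(b_recurrent)
```

This is one more argument for adopting the CuDNN equations and bias layout directly, as proposed above.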
List of relevant issues:
- PyTorch forum: Redundant biases for LSTM
- Keras: weights trained on GPU cannot be reused on CPU, and workarounds (i.e. rebuilding a CPU layer):