
OpenCL counterpart of cuDNN #34

Open · dagamayank opened this issue May 25, 2016 · 53 comments

@dagamayank commented May 25, 2016

I came across your post on the TensorFlow thread saying that you are developing an OpenCL counterpart to cuDNN. I would like to help/contribute to that project. Let me know where and how I can help. I have extensive OpenCL programming experience and am currently focused on ML activities at AMD.

@naibaf7 (Owner) commented May 25, 2016

@dagamayank
Thank you, help is very welcome, especially from AMD :)
To start, you can have a look at how the kernels are generated and the public interface of the cuDNN replacement:
https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/libdnn.cpp
https://github.com/naibaf7/caffe/blob/master/include/caffe/greentea/libdnn.hpp

I can also provide you with example kernel strings if you don't want to look at that part of the code and are only interested in helping optimize the kernels for AMD GPUs, which would also be very welcome.

@bhack commented May 25, 2016

@naibaf7 Have you seen the latest updates on the TensorFlow thread?

@naibaf7 (Owner) commented May 25, 2016

@bhack
Yes, why? :)

@dagamayank (Author)

@naibaf7
Kernel strings would be great to have. It would also be great if you could provide some steps on how to get started.

@naibaf7 (Owner) commented May 26, 2016

@dagamayank
Ok, the easiest way to get started is to compile Caffe with USE_LIBDNN enabled in Makefile.config (https://github.com/naibaf7/caffe/blob/master/Makefile.config.example#L15).
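In Makefile.config this should just mean uncommenting/setting the flag (shown here following Caffe's usual Makefile.config syntax; check the linked example file for the exact line):

USE_LIBDNN := 1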
Then, if you want a kernel string to examine for optimization purposes, uncomment the std::cout line at the end of this block:

  ss << generate_bw_defs();
  ss << generate_bw_kernels("conv_backward");
  ss << generate_wg_defs();
  ss << generate_wg_kernels("conv_weights");

  // Write complete kernel string
  kernel_ = ss.str();

  // std::cout << kernel_ << std::endl;
}

(it's line https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/libdnn.cpp#L1588)

This will print the kernel string to std::cout so you can examine it, for example in AMD's CodeXL. Every kernel string consists of 3 main kernels: conv_forward, conv_backward and conv_weights. For conv_backward and conv_weights, there are 2 different algorithms each that can be selected:

typedef enum {
  // Stack the batch update into one GEMM block
  // (deterministic, 1 kernel call)
  // Serializes the batch and may therefore underuse
  // the GPU's compute units.
  LIBDNN_CONVOLUTION_WG_ALGO_DIRECT        = 0,
  // Use multiple GEMM blocks in parallel and update weights atomically
  // (non-deterministic, 1 kernel call, not supported on all devices)
  // Parallelizes the batch and therefore has higher GPU utilization.
  LIBDNN_CONVOLUTION_WG_ALGO_ATOMIC        = 1,
  // Use multiple GEMM blocks and an intermediate buffer
  // to reduce weight updates
  // (deterministic, >= 2 kernel calls)
  // Parallelizes the batch and therefore has higher GPU utilization.
  // NOT IMPLEMENTED YET
  LIBDNN_CONVOLUTION_WG_ALGO_REDUCTION     = 2
} libdnnConvolutionWeightAlgo_t;

typedef enum {
  // Transform data before GEMM (load, im2col, gemm, store)
  // This method is suitable for convolutions with similar
  // spatial input == output sizes, but can become inefficient
  // if input >> output (with large strides and kernels).
  LIBDNN_CONVOLUTION_BW_ALGO_IM2COL        = 0,
  // Transform data after GEMM (load, gemm, col2im, store)
  // Sometimes faster than im2col method, but uses
  // atomic operations and is not deterministic.
  LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC = 1
} libdnnConvolutionBackwardAlgo_t;

Which algorithm is used can be changed here:
https://github.com/naibaf7/caffe/blob/master/src/caffe/layers/libdnn_conv_layer.cpp#L63
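For illustration, selecting the atomic variants could look roughly like the sketch below. This is only a guess at the configuration code: the LibDNNConvConfig struct and its wgalgo/bwalgo field names are assumptions, so check the linked libdnn_conv_layer.cpp for the real code; only the enum values are taken from above.

  // Sketch only: struct and field names are assumed, not verified.
  LibDNNConvConfig config;
  config.wgalgo = LIBDNN_CONVOLUTION_WG_ALGO_ATOMIC;        // weight gradient algorithm
  config.bwalgo = LIBDNN_CONVOLUTION_BW_ALGO_COL2IM_ATOMIC; // backward (data) algorithm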

Finally, you need to run a network in order to instantiate the layers and get some kernel strings. The recommended starting point for that is using the following command:

./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5

Together with the instructions above, you can dump the kernel strings to a text file and look for optimization opportunities there. Note that every convolution layer gets its own set of kernels, so the above command will produce many different ones.
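For example, assuming the std::cout line above has been uncommented, the kernel strings can be captured with plain output redirection (the file name is just an example):

./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5 > kernel_dump.txt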

@dagamayank (Author)

@naibaf7
Thanks a lot for these instructions. I will give them a try and report back.

@dagamayank (Author)

I get test failures when running "make runtest" on the code in the master branch of your repo. Is this expected? Two of the failures are from LibDNN. My development environment is an AMD W9100 on Ubuntu 14.04.

[----------] Global test environment tear-down
[==========] 2028 tests from 274 test cases ran. (3614992 ms total)
[ PASSED ] 2013 tests.
[ FAILED ] 15 tests, listed below:
[ FAILED ] NetTest/0.TestSharedWeightsUpdate, where TypeParam = caffe::CPUDevice
[ FAILED ] LibDNNComparativeTest/0.TestBackward, where TypeParam = float
[ FAILED ] LibDNNComparativeTest/1.TestBackward, where TypeParam = double
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial11x11x1x2_caffenet_Conv1, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv4, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestGradient_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x1_caffenet_Conv3, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3x2_caffenet_Conv5, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5x1x2_caffenet_Conv2, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.Test1x1Convolution_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.Test1x1Gradient_Spatial, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial3x3xPad1, where TypeParam = caffe::GPUDevice
[ FAILED ] ConvolutionLayerTest_Spatial/1.TestSimpleConvolution_Spatial5x5, where TypeParam = caffe::GPUDevice

@naibaf7 (Owner) commented May 31, 2016

@dagamayank
TestSharedWeightsUpdate seems to fail by being off by a small margin. This is weird but can be ignored and is not relevant for this implementation.

The _Spatial failures are from Intel's convolution implementation. I think the fix here is to use the latest ViennaCL development branch: https://github.com/viennacl/viennacl-dev instead of what Ubuntu supplies.

As for LibDNN, this test should definitely not fail. Here it would be helpful to get the failure message from the runtest itself (i.e. where the LibDNN test aborted). You can run it in isolation with:
./build/test/test_all.testbin --gtest_filter=*LibDNN*Comparative*Backward* 0

@dagamayank (Author) commented Jun 1, 2016

@naibaf7
Well, I do not fully understand the output; there are a bunch of lines with values, but the last few lines are:
Error count: 134841/159600
Difference: 3.17333e+06 (value: 2.30564e+06 vs 2.2954e+06)
src/caffe/test/test_libdnn_conv.cpp:1064: Failure
Value of: false
Expected: failure
Which is: true
[ FAILED ] LibDNNComparativeTest/1.TestBackward, where TypeParam = double (11638 ms)
[----------] 1 test from LibDNNComparativeTest/1 (11638 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (37154 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 2 tests, listed below:
[ FAILED ] LibDNNComparativeTest/0.TestBackward, where TypeParam = float
[ FAILED ] LibDNNComparativeTest/1.TestBackward, where TypeParam = double

@naibaf7 (Owner) commented Jun 1, 2016

@dagamayank
I just verified on my W9100 that the backward pass is fine. What driver are you using? I'm using 15.302 (Crimson Edition 15.12 Linux 64 bit).
I had problems with the old FirePro driver, so I switched to the Radeon driver.

Do you have any other OpenCL device to check if the backward pass passes the test?

@dagamayank (Author)

@naibaf7
Yes, it is probably the old FirePro driver. If it works on your end with the newer driver, I think we can call it a non-issue for now.

I am going through the kernels right now. Can you explain the reasoning behind the seemingly random values in the #defines? It will take some time for me to understand what you are doing there.

@naibaf7 (Owner) commented Jun 1, 2016

@dagamayank
The defines declare constants for the kernel, such as padding (v_p), stride (v_s), dilation (v_d) and image sizes (v_imsi, v_imso) in each dimension. Other defines configure the GEMM core (such as TSK, TSM, TSN, WPTM, WPTN, ...).

I put these values into defines rather than directly into the kernel string for better readability of the kernel itself (i.e. it is easier to see where a constant is used and why).
As for documentation, all the values are explained in:
https://github.com/naibaf7/caffe/blob/master/src/caffe/greentea/libdnn.cpp
(look for add_def, which is the C++ method I use for declaring new kernel #defines).
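To give a feel for what this looks like, the top of a generated kernel string is a block of such defines. The following is purely illustrative (made-up values in the naming scheme described above, not an actual LibDNN dump):

// Illustrative only: invented values, not an actual LibDNN dump.
#define v_p_0 1      // padding in dimension 0
#define v_s_0 1      // stride in dimension 0
#define v_d_0 1      // dilation in dimension 0
#define v_imsi_0 27  // input feature map size in dimension 0
#define v_imso_0 27  // output feature map size in dimension 0
#define TSK 8        // GEMM tile size in the K dimension
#define TSM 64       // GEMM tile size in the M dimension
#define TSN 64       // GEMM tile size in the N dimension
#define WPTM 4       // work (outputs) per thread in M
#define WPTN 4       // work (outputs) per thread in N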

@dagamayank (Author)

@naibaf7

Are you using autotuning to generate the values of those constants? In other words, will the constants be the same for different kernels and for different networks?

@naibaf7 (Owner) commented Jun 2, 2016

@dagamayank
Some of the values can be autotuned (such as WPTM, WPTN), others are defined by the convolution settings (such as v_p, v_s, v_d). However, the autotuner can't store the tuning results yet, so that part is experimental.
That means values such as WPTM and WPTN will be the same for every kernel/network at the moment, while v_p, v_s, v_d depend on what kind of convolution you choose (3x3 unpadded, 11x11 with stride, etc.). The image input/output sizes (v_imsi, v_imso) obviously depend on how big the image/feature maps in the network are.

I hope that helps.

@naibaf7 (Owner) commented Jun 3, 2016

@dagamayank
Have you made any progress on this or is something too complicated?

@dagamayank (Author)

@naibaf7
I did not get a chance to work on it yet. I am fighting some internal fires right now, but I will get to it soon. Auto-generated kernels are not the simplest ones to understand :)

@naibaf7 (Owner) commented Jun 3, 2016

@dagamayank
I understand. I will work on the project this weekend and hopefully have some improvements by Monday.
One interesting thing I found is that I'm better off targeting TLP instead of ILP on the AMD W9100, i.e. taking care not to use too many VGPRs on the AMD card (to get >= 4 waves in flight). On the nVidia card (GTX 980) it was better to push for high ILP (use more #pragma unroll) and relax on occupancy/TLP.
I would be interested in your opinion on this, and whether I am right with these assumptions...

Using vectors of size 4 and 16x16 thread blocks (64x64xTSK shared memory tiling) seems to work best on both cards so far though.
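For concreteness, the shape being described (a 16x16 work-group, float4 vectorization, tiles staged in local memory) looks roughly like the illustrative OpenCL fragment below. It only shows the tile staging, not LibDNN's actual GEMM code, and all names and sizes are made up for the example:

// Illustrative only: a 16x16 work-group staging float4 tiles in local memory.
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void tile_stage_example(__global const float4* src,
                        __global float4* dst,
                        const int width4) {        // row pitch in float4 units
  __local float4 tile[16][16];                     // one float4 per work-item
  const int lx = get_local_id(0);
  const int ly = get_local_id(1);
  const int gx = get_global_id(0);
  const int gy = get_global_id(1);
  tile[ly][lx] = src[gy * width4 + gx];            // vectorized (float4) load
  barrier(CLK_LOCAL_MEM_FENCE);                    // tile now visible to the whole group
  // ... the register-level GEMM work on the tile would go here ...
  dst[gy * width4 + gx] = tile[ly][lx];
}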

@dagamayank (Author)

@naibaf7
In my experience, using fewer registers is generally the better choice on AMD GPUs. It allows better occupancy and also lets the compiler generate better code.

One question I had: do I have to run the entire AlexNet, or can I just run the first convolution layer using cifar10? What kind of performance are you seeing right now?

@naibaf7 (Owner) commented Jun 3, 2016

@dagamayank
You can remove the layers after the 1st convolution in the prototxt file, or start with any other convolution as long as you have the input data defined & connected correctly.
However, the first convolution is usually not the most interesting, as it has only a few input feature maps.
Performance-wise, on the AlexNet forward pass I see these numbers (batch size 64):
(These are all untuned in default configuration, so there should be plenty of headroom)

  • GTX 980 cuDNN forward: 34ms
  • GTX 980 libDNN forward (CUDA): 70ms
  • GTX 980 libDNN forward (OpenCL): 90ms
  • W9100 libDNN forward (OpenCL): 100ms (although here you may see 130ms on the code that you have, I improved the memory access pattern since then. I get this performance at 5 waves in flight.).
  • GTX 980 cuBLAS forward: 110ms
  • GTX 980 clBLAS forward: 184ms
  • W9100 clBLAS forward: 275ms

The clBLAS forward performance in particular is extremely poor, which was my main motivation to create LibDNN. At this stage, LibDNN beats the cuBLAS-based implementations. The goal is to get within 70-80% of cuDNN.

@naibaf7 (Owner) commented Jun 25, 2016

@dagamayank
LibDNN is now available as a standalone library:
https://github.com/naibaf7/libdnn

@zazd commented Jul 2, 2016

@naibaf7 I am very interested in LibDNN; it looks quite capable. Since I am not familiar with OpenCL, I have only glanced over LibDNN, and it seems it also uses matrix multiplication. If possible, could you tell me whether it works on the same principle as cuDNN, or point me to references such as a paper or documentation? Thank you.

@naibaf7 (Owner) commented Jul 2, 2016

@zazd Yes, it uses a local-memory and register-level GEMM.
It is similar to cuDNN; you can read up more here: https://arxiv.org/pdf/1410.0759.pdf
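For readers unfamiliar with the approach, the basic idea behind GEMM-based convolution (as in the cuDNN paper linked above) is to unfold input patches into a matrix (im2col) and multiply it by the reshaped filter matrix. A minimal single-channel CPU sketch of that idea, purely illustrative and not LibDNN code:

#include <vector>

// Illustrative only: naive single-channel im2col + GEMM convolution
// (stride 1, no padding) to show the principle; not LibDNN's kernels.
std::vector<float> conv_im2col_gemm(const std::vector<float>& image, int H, int W,
                                    const std::vector<float>& filter, int K) {
  const int OH = H - K + 1;
  const int OW = W - K + 1;
  // im2col: every output position becomes a column of K*K input values.
  std::vector<float> cols(static_cast<size_t>(K) * K * OH * OW);
  for (int oy = 0; oy < OH; ++oy)
    for (int ox = 0; ox < OW; ++ox)
      for (int ky = 0; ky < K; ++ky)
        for (int kx = 0; kx < K; ++kx)
          cols[(static_cast<size_t>(ky) * K + kx) * OH * OW + oy * OW + ox] =
              image[static_cast<size_t>(oy + ky) * W + (ox + kx)];
  // GEMM: the (1 x K*K) filter row times the (K*K x OH*OW) column matrix.
  std::vector<float> out(static_cast<size_t>(OH) * OW, 0.0f);
  for (int k = 0; k < K * K; ++k)
    for (int o = 0; o < OH * OW; ++o)
      out[o] += filter[k] * cols[static_cast<size_t>(k) * OH * OW + o];
  return out;
}

The GPU kernels do the equivalent transformation implicitly inside the kernel (the "load, im2col, gemm, store" pipeline mentioned earlier), tiled through local memory and registers, rather than materializing the full column matrix like this sketch does.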

@naibaf7 (Owner) commented Oct 19, 2016

@bhack @gstoner
Good news for the RX 480: the performance issues and thermal envelope crashes have been completely fixed since the Linux 4.8 AMDGPU drivers.
It is now possible to use the RX 480 for deep learning without limitations on any Linux :)

With LibDNN on both the GTX 1080 and the RX 480, the RX 480 performs exactly half as fast as the GTX 1080, just as expected.

@bhack commented Oct 19, 2016

Do you have v2 kernels?

@naibaf7 (Owner) commented Oct 19, 2016

@bhack
I have not ported them to the external library yet...
I'm quite busy with a new project at the moment regarding sparse RNNs. :)
Let me know if you need something, though. This was just a heads-up because the RX 480 did not work well at all for the past 3 months.

@bhack commented Oct 19, 2016

@naibaf7 It is hard to talk about this topic... We are actually the only ones that use libdnn as an upstream :wink:. It would be nice if Caffe could use libdnn as an upstream naturally, instead of keeping libdnn downstream. /cc @edgarriba

@naibaf7 (Owner) commented Oct 19, 2016

@bhack
Yeah, last week Codeplay's CEO contacted me regarding some OpenCL TensorFlow work. If he expresses interest as well, I will definitely re-focus on the standalone libdnn. But I haven't heard back (yet).

@bhack commented Oct 19, 2016

I think @hughperkins could also be interested in the standalone upstream.

@dagamayank (Author)

@naibaf7 do you have Winograd kernels in libDNN?

@naibaf7 (Owner) commented Oct 19, 2016

@dagamayank No, not yet...

@bhack commented Oct 19, 2016

It could be interesting if @dicecco1 would contribute upstream to the standalone libdnn.

@dicecco1

I'd be interested in being involved in this, though the way OpenCL is used with FPGAs has some differences/conflicts with the way Greentea is currently set up.

Currently, kernel compile times are on the order of hours for FPGA implementations, so they use offline compilation and program the FPGA with the binary (this still takes on the order of 300-400 ms), which means there has to be little or no reprogramming between kernels.

@bhack commented Oct 19, 2016

So it is practically impossible to have an autotuning approach like libdnn's, right?

@edgarriba commented Oct 19, 2016

Apart from that, I think it's quite straightforward to provide a couple of interfaces for offline building and for importing built binaries. Is that right, @naibaf7?

@dicecco1

Yeah, essentially for FPGA implementations you need to decide more on an architecture (since on FPGAs you're configuring circuits rather than processing instructions), and it is usually best to have something that is either general (e.g. can handle different sizes/strides) or very specific to a model (e.g. tuned to be very high performance for AlexNet). Autotuning for different layers would fit more into the model-specific approach to FPGA implementations, but this would still be offline.
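As background for the offline-compilation workflow described above, standard OpenCL already allows exporting a built program and re-importing it later via clGetProgramInfo and clCreateProgramWithBinary. A minimal sketch for a single device (ctx, dev and an already built prog are assumed to exist; error handling omitted):

#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

// Save the binary of an already built program (e.g. an FPGA bitstream container).
void save_binary(cl_program prog, const char* path) {
  size_t size = 0;
  clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);
  unsigned char* bin = (unsigned char*)malloc(size);
  unsigned char* bins[1] = { bin };
  clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
  FILE* f = fopen(path, "wb");
  fwrite(bin, 1, size, f);
  fclose(f);
  free(bin);
}

// Re-create a program from a previously saved binary instead of source.
cl_program load_binary(cl_context ctx, cl_device_id dev, const char* path) {
  FILE* f = fopen(path, "rb");
  fseek(f, 0, SEEK_END);
  size_t size = (size_t)ftell(f);
  fseek(f, 0, SEEK_SET);
  unsigned char* bin = (unsigned char*)malloc(size);
  size_t nread = fread(bin, 1, size, f);
  (void)nread;
  fclose(f);
  const unsigned char* bins[1] = { bin };
  cl_int status, err;
  cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size, bins,
                                              &status, &err);
  clBuildProgram(prog, 1, &dev, "", NULL, NULL);  // finalize for the device
  free(bin);
  return prog;
}

An autotuner could still fit into this setting by generating and building candidate kernels offline and caching their binaries, which matches the caching and surrogate-platform tuning idea discussed later in the thread.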

@bhack commented Oct 19, 2016

@dicecco1 I have not checked your paper in detail, but could your Winograd kernel also be ported to GPU/CPU, or would it need to be heavily re-engineered?

@dicecco1

The Winograd kernel would need to be heavily re-engineered for CPU/GPU implementations.

@bhack commented Oct 19, 2016

I don't know if @keryell is also interested in @dicecco1's kernels.

@bhack commented Oct 19, 2016

For everyone in the thread: I'm talking about https://github.com/dicecco1/fpga_caffe

@naibaf7 (Owner) commented Oct 19, 2016

There certainly are ways to either cache the kernels or tune them on a surrogate platform.
The key here would be to know the FPGA's details and make educated guesses about the performance instead of tuning directly on the FPGA.

@naibaf7 (Owner) commented Oct 19, 2016

@bhack @dicecco1
The issue of having to massively re-engineer Winograd kernels to fit new platforms has been noticed by the developers of NEON/Nervanasys as well as by @hughperkins. There are good reasons Nervanasys has built specific compilers for Maxwell/Pascal.
The architectural differences are even bigger when going to AMD: VGPR usage has to be kept in check, and the constant buffers/local memory have to be optimized differently. Local memory is bigger on Maxwell/Pascal than on Polaris/Hawaii, and the cache system works completely differently (AMD has 64 KB constant buffers, nVidia uses a read-through/write-through configurable caching system).

@bhack commented Oct 24, 2016

@naibaf7 Can you let us know if you get feedback from others interested in v3 kernels and in using the standalone libdnn as an upstream?

@naibaf7 (Owner) commented Oct 24, 2016

@bhack
Yes. Still waiting on feedback here :)

@hughperkins

Observation: I'm still waiting on an example of calling libdnn from C++ :-)

@bhack commented Oct 24, 2016

You can see an example, with the tuning part commented out, at https://github.com/tiny-dnn/tiny-dnn/blob/master/tiny_dnn/core/kernels/conv2d_op_libdnn.h

@bhack commented Oct 29, 2016

@naibaf7 OK, please give us an update when you can, because the standalone version is quite on hold.

@naibaf7 (Owner) commented Oct 30, 2016

@bhack
Yes, unfortunately, since I'm working hard on my semester project (sparse repeated-pattern recurrent neural networks); my university does not give me credit for my work on Caffe :)
The current timeline is as follows:

  • Non-atomic backward kernel for pooling by the beginning of December.
  • Updated standalone LibDNN by the end of December, which will include V2 (convolution kernels) and V3+V4 (pooling kernels).

@naibaf7 (Owner) commented Dec 5, 2016

Status update: the non-atomic backward kernels for pooling are finished, and the library is unit tested & verified with style-transfer and MNIST examples.
Next step: standalone LibDNN update by the end of December (at the latest).

@bhack commented Dec 5, 2016

Latest? Is it the end of the project?

@naibaf7 (Owner) commented Dec 5, 2016

No, that is just the latest point in time by which I expect to be done with this step :)
It could be anywhere from 2 to 4 weeks until LibDNN is on the newest kernel versions and gets pooling support.

@bhack commented Dec 5, 2016

Ok ;)

@naibaf7 (Owner) commented Dec 5, 2016

After that I don't know what the next optimization is going to be. Either mobile ARM chips with integrated GPUs or AMD's Vega and FP16, depending on what I can get my hands on first.
