OCL Contribution #91

LouisCastricato · 2016-05-02T17:58:47Z

@hunse Told me to open a discussion on this repo to discuss potential improvements.

Firstly, I'd like to say that Multi GPU may become extremely viable with NVIDIA Pascal's release of NVLink, and AMD Polaris's Coherent Interconnect Fabric (I believe that's what the name is) as it may be much more practical that it was on previous generations to split the simulation of different ensembles between GPUs. I developed a similar system about 18 months ago for doing multi-agent clustering, and essentially performed a "Pseudo Boosting" algorithm on the CPU in order to determine which clusters to group together on the GPUs; however, due to the topology of a Nengo networking being known beforehand it is most likely possible to implement a similar system with significantly less overhead.

Secondly, I noticed that it appears OCL doesn't properly reuse memory that it already allocated on previous kernel executions and doesn't avoid branching

While this is fine currently, @hunse had mentioned that he perhaps wants to implement dynamic parallelism. When dynamic parallelism is in use, you need to have a much tighter control on memory and branching in order to avoid major bottlenecking

An example of well implemented dynamic parallelism is available here. Note that this file has a few syntax errors, but the idea is there. Roughly, no branch has an uneven number of instructions, I do not create new stack frames, and I try to reuse as much memory as possible. Furthermore, I do not recommend trampolining of any sort when doing dynamic parallelism. It isn't designed for that and you will run into limitations very very quickly.

Last time I used OpenCL, you couldn't assign more than one thread to an individual GPU. It is possible that Vulkan has changed this. Said being, BEFORE attempting to tackle dynamic parallelism, we should look into optimizing our method of batching tasks to the GPU from the CPU. Since the order of instructions is known at all points of the simulation, if we rework how we're managing memory a bit its very possible that we could well exceed the performance of dynamic parallelism, although there will probably be some performance boost over its existing state. If we see that we still need dynamic parallelism though, we can always try that too.

There are some issues that need to be brought up. DP is only supported on NVIDIA cards that support compute model 3.5 and up. Namely, 750Ti and >= GTX 770. DP is supported on all AMD GPUs 6xxx and up I believe (With a few odd and random exceptions though). Which brings up the concern that we can't move all of Nengo to DP, since GPUs of that caliber may not be common among members of the community. Also, I don't believe DP can execute kernels on other GPUs. I believe it is only limited to running kernels internally. Note trampolining is possible between GPUs. Don't do this. Don't. Srsly. You will regret it.

If we were to add Multi GPU support with limited CPU communication, I believe OpenCL has a way to do P2P GPU-to-GPU communication without CPU intervention. This would probably be the way we'd want to go. Cuda Equivalent. If we use a UVA system, then we don't need to worry about this; however, UVA is only useful if you don't have an in depth understanding of how the GPUs are communicating. Luckily nengo has a very well defined set of rules for this, pertaining to the topology of the network :P

There was a feature with DirectCompute back in DX11.1 that I only recently saw exists in Vulkan. We can premake the list of operations that the GPU needs to execute, as well as specify memory transfers between peers, and to the host. We can then execute this premade list long after the fact. (Eg: Even between sessions). This may also be useful since it has some of the same performance benefits of DP and multithreading without the need for either (Note: It technically does both slightly worse, but it has much much lower requirements for the end user as well as less CPU overhead.)

Finally, something I noticed when going through OCL. OpenCL has APU support, and Skylake has Iris Pro APUs and AMD's Zen will have perhaps equally as strong APUs. None of OCL is optimized for APUs. Perhaps we should include a mode for this? APUs have the benefit that Host to Device and Device to Host transfers are practically free. I imagine an APU mode would try to offload as much as it could to AVX/SSE while leaving the rest to the APU itself. This would be particularly useful for laptops and/or ARM devices.

Thanks
Louis

arvoelke · 2016-05-02T18:13:03Z

(Note: this is a continuation of nengo/nengo#1050)

LouisCastricato · 2016-05-02T18:18:27Z

Also perhaps this is a better example of OCL not avoiding branching. The branching present in this implementation would most likely create a significant bottleneck with enough throughput

hunse · 2016-05-02T19:21:03Z

Those all sound like good ideas!

One thing to keep in mind is the types of models that we typically run, and the types of operations that are most prevalent. One good example is the circular convolution benchmark, and profile_circconv.py in the same folder runs just one simulation with profiling. In that network, the GEMV operations are far and away the most expensive, so I think anything that can speed them up will lead to significant overall improvements. The neuron step functions are also reasonably costly, so maybe removing some of the branching could help.

As you pointed out, it's also important to maintain support for many types of GPUs. For example, if we wanted to add dynamic parallelism, I'd want it done in such a way that a) GPUs that don't support it can just run using what we've got now, and b) the added complexity is factored out as much as possible, so that one doesn't have to understand dynamic parallelism to understand how things are running on basic GPUs.

LouisCastricato · 2016-05-02T19:37:49Z

Yeah and to add to that I think we can even drop the requirements a bit. Imagine running nengo ocl on an RPI or paralella. If we were to implement an APU mode and support precomputing the kernel call list then we would probably have no problem with supporting a bunch of odd and interesting devices.

I'd even be down for doing some research of how nengo performs on a huge cluster of RPIs or a beowulf cluster of arm processors.

LouisCastricato · 2016-05-02T19:39:38Z

Also I implemented most of the neuron step functions without branching in cuda. I'll look into porting it over this week

LouisCastricato · 2016-05-02T20:05:47Z

Do you mind explaining what GEMV does? Maybe its because I'm fairly novice in Comp Neuro but I tried reading through the file and got lost rather quickly.

Edit:
Decided it was in my best interest to audit and analyze the performance of your code and provide you with potentially very useful feedback.

I started looking through Gemv. I think the first challenge is figuring out exactly how many instructions each of these branches is doing. Branching is okay as long as the branches do the same instructions in the same order but perhaps on different objects. Eg:
if (x > y) return 0; return 1;

Is okay since both branches do the same instruction just with different values. Hence decoherence would essentially be equivalent to having no branching at all.

Just from a brief look through, you can probably expect a 20 - 30% reduction in the number of wasted cycles by properly optimizing gemv. I don't know how large of a performance increase this would be, but probably more than 3% which is significant in the long run.

TODO:

element_wise_inc in plan needs to be using shared memory. Also it needs to be slightly redesigned to avoid throttling when loading shared memory.

linearfilter needs to be scrapped and rewritten. I can't even begin to explain where its issues are. If you run Nsight on it, or some other GPU debugger program you will see that only 1/10th of the threads that were told to run it are active at any given point. Its running at 10% its maximum efficiency.

Probes looks okay besides the first branch. That needs to be fixed. I would recommend breaking it up into two passes and using DP here. It seems like a ripe usage.

direct looks perfectly fine

LIF needs some work. It wastes too much memory every time the kernel is initialized. Luckily I implemented a much more memory efficient one in CUDA already. I can port it over ASAP.

TODO: could precompute -expm1(-dtu / tau)

Its probably negligible. Almost all performance issues on GPUs is memory and/or decoherence related.

lif_rate looks ok

template is a bit interesting lol

rng looks fine

White noise needs some work. Nsight was giving an efficiency rating of about 40% That's ok if it isn't called every often but I don't think that's the case.

I have no idea what present input does but thats a LOT of memory loading for half a dozen math instructions. It may need some revision.

Why not use a professionally made convolution implementation? Your implementation doesn't seem to do anything special when compared to normal 2d convolution, so perhaps you should use a standard library? Not to discredit you obviously but its fairly likely that their implementation may perform significantly faster than yours.

Same goes for pool2d

Back to gemv

With block dot product you had the right idea but wrong tactic. I need to look over it a bit more, but you need to use 1) Shared Memory 2) Atomic block reductions (Eg the GPU shuffle ♫ ) 3) The branching can easily be removed here and I think it'll make a HUUGE performance difference. 4) We need to move to a 3D layout instead of 2D in order to get rid of that for loop. This is typically how a very performance oriented version of matrix multiplication is implemented in CUDA, and as such it will provide large benefits here too. Alternatively, there may be ways to unroll the for loop when building the network.

LouisCastricato · 2016-05-02T21:26:01Z

Also I want to move signals to F16 rather than F32. Lower memory bandwidth requirements and I don't think it'll make that much of a difference in end result since its so noisy to start with anyway.

hunse · 2016-05-03T15:50:32Z

Also I want to move signals to F16 rather than F32. Lower memory bandwidth requirements and I don't think it'll make that much of a difference in end result since its so noisy to start with anyway.

I would never hard-code this. But it would be great to have an option. There's actually an issue in Nengo to change signal dtypes.

Do you mind explaining what GEMV does?

It's a general matrix-vector multiply.

White noise needs some work. Nsight was giving an efficiency rating of about 40% That's ok if it isn't called every often but I don't think that's the case.

That definitely is the case. We don't use it very much right now.

GEMV and LIF steps are far and away the main culprits. Also, there are actually a lot of copies, and even though the kernel is not bad, because they're not getting grouped together it's just resulting in a lot of kernel calls and thus a lot of overhead.

LouisCastricato · 2016-05-03T16:04:29Z

Hmm ok so first thing I am going to do is look into cleaning up LIF and and managing the copy calls. I think solving the issues with copy calls is more of a CPU bounded issue than it is a GPU bounded one. Perhaps look into multithreading the creation of copy calls?

The kernel is fine, I agree. I can't really find any issues with it.

Fixing GEMV may take a very long time and it was only recently that NVIDIA and AMD began working on the architecture of their cards to better handle reductions like GEMV calls. This is going to take some serious work but it looks like a fun challenge :)

drasmuss · 2020-03-17T14:58:58Z

Closing this since it's several years old at this point, but still good ideas in this thread for anyone that wanted to take them on!

arvoelke changed the title ~~OCL Contribution [Discussion]~~ OCL Contribution May 2, 2016

arvoelke added the discussion label May 2, 2016

drasmuss closed this as completed Mar 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCL Contribution #91

OCL Contribution #91

LouisCastricato commented May 2, 2016

arvoelke commented May 2, 2016

LouisCastricato commented May 2, 2016

hunse commented May 2, 2016

LouisCastricato commented May 2, 2016

LouisCastricato commented May 2, 2016

LouisCastricato commented May 2, 2016 •

edited

Loading

LouisCastricato commented May 2, 2016

hunse commented May 3, 2016

LouisCastricato commented May 3, 2016

drasmuss commented Mar 17, 2020

OCL Contribution #91

OCL Contribution #91

Comments

LouisCastricato commented May 2, 2016

arvoelke commented May 2, 2016

LouisCastricato commented May 2, 2016

hunse commented May 2, 2016

LouisCastricato commented May 2, 2016

LouisCastricato commented May 2, 2016

LouisCastricato commented May 2, 2016 • edited Loading

TODO: could precompute -expm1(-dtu / tau)

LouisCastricato commented May 2, 2016

hunse commented May 3, 2016

LouisCastricato commented May 3, 2016

drasmuss commented Mar 17, 2020

LouisCastricato commented May 2, 2016 •

edited

Loading