Porting parts of Nengo to CUDA as a summer project #1050

Closed
LouisCastricato opened this issue Apr 30, 2016 · 9 comments

@LouisCastricato

(Not sure how to tag my topic as "discussion")

Hi, I'm currently in the process of porting portions of Nengo to CUDA, since I've written multiple Hebbian learners in CUDA before and this seemed like an interesting challenge. (I would do Vulkan or OpenCL, but I'm less comfortable with those SDKs.)

I have a few questions though, and I will continue to post questions here as they come up. Is there a way to have nengo report all of the operations it's running? Obviously I could do it myself with a few well-placed print statements, but I'm particularly interested in whether there is a proper debug output.

Would anyone be willing to test sections of my code on their NVIDIA GPU? I have a Titan in my workstation, but as you might imagine that wouldn't provide a reasonable estimate of performance on community members' computers. Anything with compute capability 3.5 or better would be best.

And finally, if I were to implement portions of Nengo in CUDA, would it be of any use? E.g., while spike seemed interesting, it's been quite dead for the last few years.

Thanks
Louis

@LouisCastricato LouisCastricato changed the title Porting parts of Nengo to CUDA as a summer project Porting parts of Nengo to CUDA as a summer project [Discussion] Apr 30, 2016
@arvoelke
Contributor

arvoelke commented Apr 30, 2016

Have you taken a look at nengo_ocl?

Tagging @hunse since he should be able to answer your questions!

@LouisCastricato
Author

Yes I have; hence this was more of a summer project for fun than something that is widely useful. Regardless, a CUDA port may provide immediate benefits over OpenCL due to the larger feature set available in CUDA (to my knowledge; this may have changed in recent years).

@LouisCastricato
Author

LouisCastricato commented Apr 30, 2016

I stand corrected. The feature I was going to use for greatly reducing almost all CPU/GPU communication is dynamic parallelism, which until recently had not been supported in OpenCL. Oddly enough, Nengo_ocl doesn't seem to make use of this, whereas I saw significant improvements the last time I implemented a Hebbian learner. (Further edit: probably because it would entirely break probes if they did, lol)

Edit: word

@hunse
Collaborator

hunse commented May 2, 2016

So I haven't played around with dynamic parallelism at all, in fact I hadn't even heard of it until you mentioned it. It seems like it might be able to speed up some of the stuff we do. For example, we often have batch operators that group together a bunch of operations with the same kernel, but operating on different sized data, resulting in some work groups that do a lot more work than others. Dynamic parallelism sounds like it could help by having one of those groups spawn off more threads to deal more quickly with their computation. And I'm sure there are other potential benefits that are obvious to those more familiar with dynamic parallelism.

When we wanted to make a GPU-friendly version of Nengo, we decided to go with OpenCL to support as many devices as we could, i.e. AMD/ATI in addition to NVIDIA. It seems that most other software for neural network research has just gone with CUDA, though. As a result, a lot of the people we're targeting with Nengo OCL are probably already using NVIDIA GPUs. Also, it seems that CUDA has more tools available, for example cuDNN (though I'm not convinced tools like that for ANNs will be equally useful/efficient for dynamic neural networks like in Nengo). The point is, there are arguments for having a CUDA version of Nengo.

However, if you want to go that route, I think it would mostly be a project for your own learning and enjoyment, and not something widely useful (as you said). If you could get a CUDA version of Nengo supporting most/all features, and if it was significantly faster than Nengo OCL, then it could be something that would be more widely useful. There would be other questions, like who is going to support it down the road, but it's an open community and you're welcome to develop whatever you want.

If you do want to make something that is more widely useful, though, I would suggest a project that expands on Nengo OCL in some way. I don't think OpenCL is inherently limited compared to CUDA; I think you can write equally fast kernels in either language, other than maybe some special cases like dynamic parallelism, and I think as those cases come up OpenCL will try to make sure it keeps up to speed with CUDA. I think the main place where OpenCL lacks is in terms of supporting libraries; for example, it has no equivalent of cuDNN that I know of. But what I've found so far is that because Nengo's requirements are considerably different from typical ANN stuff, those libraries might not be that useful anyway. So I don't think OpenCL is holding us back in any significant way.

In terms of things to do for Nengo OCL, some that are at the top of my list are: multi-GPU support, and better GEMV operations (possibly integrating clBLAS). I'm sure you can come up with other things, too, like taking advantage of dynamic parallelism. If you do want to talk about any of those things, open up an issue on nengo_ocl and we can talk about it there.

And if you do decide to develop a CUDA version, let us know! I'd be curious to see what you come up with, and it's always possible that if you're able to get some cool things working that aren't in Nengo OCL, between your GPU knowledge and my (somewhat limited) OpenCL knowledge, we could get something similar working in Nengo OCL too.

Thanks for your interest!

@hunse
Collaborator

hunse commented May 2, 2016

Is there a way to have nengo report all of the operations it's running? Obviously I could do it myself with a few well-placed print statements

I think you'd just need the one, actually, right here. We've tried to make sure all the step functions have helpful names, so hopefully just printing the function name in that loop will tell you which operation is being run. That's as much of a "proper" way as we've got.
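As a rough sketch of what that suggestion amounts to (the `Simulator` class and step-function names below are hypothetical stand-ins for illustration, not nengo's actual internals):

```python
# Toy sketch of instrumenting a simulator's step loop with one print.
# The class and step-function names are hypothetical placeholders,
# not nengo's actual API.

def step_sim_neurons():
    pass  # placeholder for a neuron-update operation

def step_copy_signals():
    pass  # placeholder for a signal-copy operation

class Simulator:
    def __init__(self, steps, debug=False):
        self.steps = steps  # one callable per compiled operator
        self.debug = debug

    def step(self):
        for fn in self.steps:
            if self.debug:
                # Step functions carry descriptive names, so printing
                # __name__ reports which operation is being run.
                print(fn.__name__)
            fn()

sim = Simulator([step_sim_neurons, step_copy_signals], debug=True)
sim.step()  # prints each operation's name in execution order
```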

Would anyone be willing to test sections of my code on their NVIDIA GPU? I have a Titan in my workstation but as you might imagine that wouldn't provide a reasonable estimate of performance on members of the community's computers.

I think I would go about this another way. If it's not too hard for you to set up OpenCL (which hopefully it isn't), then try out Nengo OCL and see how fast it is, either for whole models or specific operations (add your own timing code, or use our built-in profiling support). Compare with the same model/operation using your CUDA code. This should give you a good baseline. I've worked a bit to optimize the Nengo OCL code, so it might be hard to beat, but if you're significantly faster than it on a particular model or operation, then that'd be great to know, as it's an obvious place to improve Nengo OCL (especially if you're using your own custom kernels, not NVIDIA-specific libraries).

EDIT: Actually, this is a great way to contribute back while doing your own thing in CUDA. Just having something else to compare against would give me a good idea of how my kernels are doing, since I don't consider myself an expert at GPU programming by any means. If there are places where you can beat me, it would be great to know; CUDA and OpenCL are similar enough that it should hopefully be obvious how, and similar changes to Nengo OCL should yield similar improvements.
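For the timing side of such a comparison, a minimal sketch of per-backend wall-clock measurement (the `benchmark` helper and the two `run_*` functions are hypothetical placeholders standing in for whatever runs the same model under each backend; they are not part of nengo or nengo_ocl):

```python
import time

def benchmark(run_model, repeats=3):
    """Time a model run a few times and return the best wall-clock result.

    run_model is a placeholder for whatever executes the model under a
    given backend (e.g. a Nengo OCL run or a CUDA port's run). Taking the
    minimum over repeats reduces noise from other processes.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_model()
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical stand-ins for backend-specific runs of the same model:
def run_ocl_model():
    sum(i * i for i in range(100_000))

def run_cuda_model():
    sum(i * i for i in range(100_000))

ocl_t = benchmark(run_ocl_model)
cuda_t = benchmark(run_cuda_model)
print(f"OCL: {ocl_t:.4f}s  CUDA: {cuda_t:.4f}s  ratio: {ocl_t / cuda_t:.2f}x")
```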

@LouisCastricato
Author

On second thought, I may be more willing to help with Nengo OCL than write Nengo CUDA. After looking through your code base, there are some obvious ways to re-architect it to squeeze out the rest of the performance. I'll open an issue there :)

@LouisCastricato
Author

Posted

@arvoelke arvoelke changed the title Porting parts of Nengo to CUDA as a summer project [Discussion] Porting parts of Nengo to CUDA as a summer project May 2, 2016
@Seanny123
Contributor

Given that you've started a collaboration with Eric on OpenCL, can this issue be closed?

@LouisCastricato
Author

Oh yeah, of course. Sorry, I forgot.
