[Feature request] Realtime compiled kernels #13
Hmmm, it's lower-level than I imagined. For example, I'm not sure I envisaged the client needing to provide the grid and block sizes. I suppose I imagined something like apply/map/map2, just with bigger kernels. Can you perhaps give me some higher-level background on your use-case? At the moment, in my head I'm just imagining apply/map etc really. What kinds of things do you want to do that don't fit into apply/map?

Arguments as a table makes sense. I suppose it can initially take ClTensors and floats. Can add more types later, if needed.

As far as contiguity, I think that should be an option somehow possibly, but obviously if you're going to be the first/main client, you should probably configure the default option, and the only initial option, to be the one that you will use. |
With this level you'll basically be able to write nn modules directly in lua, so clnn could be just a lua library. You code in lua using ClTensors, and if there is a need to speed something up, you write a kernel. Then contiguity is not actually a problem: one could provide the strides as arguments in the table. |
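A minimal sketch of what this could look like. The `compileKernel`/`runKernel` names are placeholders (no such cltorch API is assumed to exist); the point is that strides travel through the argument table, so non-contiguous tensors need no copy:

```lua
require 'cltorch'

-- Hypothetical compile/run pair; names are illustrative only.
-- The kernel takes raw pointers plus explicit size/stride arguments.
local k = compileKernel([[
  kernel void scale(global float *out, global const float *in,
                    int n, int strideIn, int strideOut, float alpha) {
    int i = get_global_id(0);
    if (i < n) {
      out[i * strideOut] = alpha * in[i * strideIn];
    }
  }
]])

local x = torch.ClTensor(128, 2):select(2, 1)  -- non-contiguous view, stride 2
local y = torch.ClTensor(128)
-- strides passed explicitly in the argument table:
runKernel(k, {y, x, 128, x:stride(1), y:stride(1), 2.0})
```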
RTC kernels will be a nightmare for debugging and for general speed of iterative development (unless compiles are cached properly and always). Theano is a good example of why we should tread this path with care. |
@soumith due to the nature of OpenCL, kernels are already compiled at runtime, and compilations are supposed to be cached by the driver. It's just moving the place where they are compiled from C++ to lua. |
Ok, I see. So some middle ground perhaps between 314 lines for LogSoftMax in cunn and 48 lines for LogSoftMax in clnn. Sounds like a reasonable direction to try. I think it might be nice if the global workgroup size was set by default to the number of elements in the first output tensor, but could be overridden by simply providing it as a parameter.

@Sergey: Question: what are your thoughts on how to differentiate between input, output, and input/output tensors? Basically, we copy from host to device for input tensors, vice versa for output tensors, and both for inout tensors. But we need some way to tell the API which are in, out, etc. Or we could always copy in both directions, but that's a huge performance hit.

@Sergey: Another question: up till last night, there was some doubt in my mind about the ability to pass in sizes, strides and so on as a struct without impacting performance, but it turns out that the horrible Apply performance I was getting was because passing structs to other functions by value in Intel HD gives a huge performance hit. Even for inline functions. Changing to passing them by pointer gave a 20x speed boost :-P

I think it would be good to make the sizes and so on available in the kernel code, automatically, by just passing the tensor into the table. Since getting the user to write these parameters to the kernel by hand would be buggy, I'm thinking of getting cltorch to write them automatically. Specifically, it could be used something like this:
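Something along these lines, assuming a `torch.ClKernel` constructor that takes named input/output parameters and generates the `*_data` kernel arguments itself (the exact names here are assumptions, a sketch rather than the final API):

```lua
require 'cltorch'

-- A ClTensor goes straight into the argument table; cltorch writes the
-- pointer/size/stride kernel parameters itself, so the user only writes
-- the kernel body.
local k = torch.ClKernel({
  input={nElements='int', input='torch.ClTensor'},
  output={output='torch.ClTensor'},
  src=[[
    int linearId = get_global_id(0);
    if (linearId < nElements) {
      output_data[linearId] = input_data[linearId] + 3.0f;
    }
  ]]})

local a = torch.ClTensor({3, 5, 2})
local b = torch.ClTensor(3):zero()
k:run({nElements=3, input=a, output=b})
-- b should now hold a + 3
```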
Thoughts? |
Let me know if you're ok with this. I'm kind of interested to implement it. I don't have any great ideas for identifying in/out/inout tensors though.

One other nuance: we presumably want to be able to define additional functions in the kernel, not just one single kernel function.

Option 1: templated:
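For instance, with a hypothetical `{{helpers}}` template placeholder expanded into the kernel source before compilation (the placeholder syntax is an assumption for illustration):

```lua
-- Option 1 sketch: helper functions spliced in via a template placeholder.
local src = [[
{{helpers}}

kernel void doSomething(global float *out, global const float *in, int n) {
  int i = get_global_id(0);
  if (i < n) {
    out[i] = myHelper(in[i]);
  }
}
]]

local helpers = [[
static inline float myHelper(float x) {
  return x * x + 1.0f;
}
]]

-- the template is expanded before handing the source to the compiler:
local expanded = src:gsub('{{helpers}}', helpers)
```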
Option 2, pass in additional functions through an optional parameter:
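For instance, with a hypothetical optional `funcs` field whose contents cltorch would prepend to the generated kernel source (field name and structure are assumptions):

```lua
-- Option 2 sketch: helper functions passed through an optional parameter.
local k = torch.ClKernel({
  input={n='int', input='torch.ClTensor'},
  output={output='torch.ClTensor'},
  funcs=[[
    static inline float myHelper(float x) {
      return x * x + 1.0f;
    }
  ]],
  src=[[
    int i = get_global_id(0);
    if (i < n) {
      output_data[i] = myHelper(input_data[i]);
    }
  ]]})
```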
On the whole, passing in the other methods through an additional (optional) parameter looks cleaner somehow. |
Well, I kind of feel motivated to work on this now, so I might just write a first draft now, and then go from there. |
Could do with a snappy name. I suck at those. RTKernel is accurate in CUDA, but kind of a tautology in OpenCL. So far, I just came up with 'CustomKernel'. Maybe it will be Custom.cpp and Custom.h. Freestyle perhaps. UserKernel. (Edit: going with UserKernel.cpp/.h for now) Edit: updated the lua samples above:
Edit2:
Edit3:
Edit4:
Edit5:
Edit6:
(Edit: by the way, you can follow progress here: UserKernel.cpp, and the corresponding test script src/test/test_userkernel.lua) |
Working now :-) Please install cltorch from branch. Run this script:
Output:
Code here: src/UserKernel.cpp Edit: can pass options to
=> closed |
I don't see use-cases for this; most of the time your tensors have different sizes, and that's the biggest issue. Then what if you want to run on a funny subset of a tensor, or do a simple reduction? Or have functions outside of the kernel. |
Thanks for the clarification Hugh. Still, I don't see places where I could apply your current implementation in my projects. |
Looking at your original request, I see no mention of requirement to expose context, device, or buffer pointers. However, it's not hard to expose these. But... they'd just be an opaque black box I guess, like a userdata, so you'd probably need to define what you want to do with them. I guess the most obvious thing one might want to do is use multiple command queues? (Edit: or you'd somehow use these through FFI, so the raw pointer is sufficient?) |
So the original idea of this PR was: you have an OpenCL kernel, you compile it and run it. Now what you did is: you have to write something that looks like an OpenCL kernel, or edit an existing kernel, then compile it and run it. Kernels might come from different sources: you might load them from a different library, or store them in a separate file so you can compile and profile or analyze them with some utility program, so it's good to have the same language. That's what cutorch-rtc allows you; it's something new to CUDA, but natural in OpenCL. So if you were to expose pointers to the OpenCL structures, I could marry cltorch and lua-opencl, to have this low level plus the power of torch tensors. |
I think you need to think through how exactly such kernels would look, and how exactly you would use them. All the kernels you've provided me to date have been CUDA kernels, and you cannot typically pass a CUDA kernel to OpenCL, for a number of reasons, and not just the syntax:
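As illustration, here is the same trivial kernel written in both dialects; even this small example shows differences beyond renaming, such as launch qualifiers, the indexing model, and explicit address-space qualifiers:

```c
/* CUDA dialect: __global__ qualifier, block/thread indexing,
   no address-space qualifiers on pointer parameters. */
__global__ void addScalarCuda(float *out, const float *in, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + alpha;
}

/* OpenCL dialect: kernel qualifier, flat get_global_id indexing,
   explicit 'global' address space on buffer pointers. */
kernel void addScalarCl(global float *out, global const float *in,
                        float alpha, int n) {
    int i = get_global_id(0);
    if (i < n) out[i] = in[i] + alpha;
}
```

(The two functions belong to different languages and toolchains; they are shown side by side only for comparison.)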
So, I think you need to think about how exactly your kernels will look, where they will come from, and how you will use them. Maybe even try to implement what you want yourself, and see how it looks? |
By the way, I have no objection to your providing an alternative implementation of this. Either from scratch, or Frankenversion of what I have, both ok. Either in a fork of cltorch, or, you know what, one of the super-cool things about Torch, cf caffe, is how easy it is to make new modules, monkey-patch things in. So you could create eg cltorch-rtc, and port cutorch-rtc into that. Happy to answer any technical questions you might have on how to approach that by the way. |
Hi Hugh, I finally have an example for you here https://github.com/szagoruyko/cunn-rtc/blob/master/im2col.lua and here https://github.com/szagoruyko/cunn-rtc/blob/master/SpatialConvolutionRTC.lua: im2col and col2im are compiled at runtime, and it is a bit faster in some cases. cudnn is many times faster of course, but the point of this was to have a clean convolution in lua to experiment with. I think you could have the same lua wrapper and just write an im2col and col2im opencl wrapper (it doesn't matter whether it is on the C++ side or in lua like mine), and probably it would be faster too. |
Thanks! Yes, those should be baked in. But we don't need user custom kernels for this; it can be done directly in SpatialConvolutionMM, eg see how Gather and Scatter bake in the loop sizes: https://github.com/hughperkins/cltorch/blob/master/src/lib/THClGather.cl
yes sure that would work too |
Following #2.
The API has to be similar to cutorch-rtc (example here: https://github.com/szagoruyko/cutorch-rtc/blob/master/test.lua#L95), maybe with different names for the functions (without PTX, for example).
So there have to be two functions: one taking the kernel code as the argument and storing the compiled code somewhere inside cltorch, and one taking the arguments in the form of a table (in cutorch-rtc I did it with ffi, here: https://github.com/szagoruyko/cutorch-rtc/blob/master/init.lua#L8). In this form only textures are left out; I still haven't thought about how to add them to cutorch-rtc.
local_work_size should be taken from the device's maximum work-group size if not specified.
Kernels should have just raw pointers as arguments, I wouldn't bother with strides, just enforce contiguity of input and output tensors.
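A sketch of the proposed pair of functions; all names here (`cltorch.compile`, `cltorch.launch`) are placeholders, since the request deliberately leaves the naming open. Tensors are assumed contiguous, so the kernel sees only raw pointers:

```lua
require 'cltorch'

-- Hypothetical first function: compile the source and cache the program
-- inside cltorch, returning a handle to the named kernel.
local k = cltorch.compile([[
  kernel void axpy(global float *y, global const float *x, float a, int n) {
    int i = get_global_id(0);
    if (i < n) y[i] += a * x[i];
  }
]], 'axpy')

local x = torch.ClTensor(1024):fill(1)
local y = torch.ClTensor(1024):fill(2)

-- Hypothetical second function: launch with a global work size and an
-- argument table. local_work_size is nil, so it would default to the
-- device maximum, as suggested above.
cltorch.launch(k, {1024}, nil, {y, x, 2.0, 1024})
```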
Is there a way to compile with CPU support btw?