* Func::in() can also be used to compute pieces of a Func into a
* smaller scratch buffer (perhaps on the GPU) and then copy them
* into a larger output buffer one tile at a time. See
* apps/interpolate/interpolate.cpp for an example of this. [...]
While we are at .in() (again with the FAQ efforts in mind), I'd also like to hear about the technique of copying memory into an SM's shared memory for improved performance. There is a trick somewhere in the apps that uses .in().in() to achieve this. I think this needs extensive elaboration:
// vectorized/unrolled 2x2 tiles. Instead of having
// each unrolled iteration do its own mix of scalar
// and vector loads from shared memory in a 5x5
// window, many of which get deduped across the block,
// we load a 6x6 window of shared into registers using
// only aligned vector loads, and then the actual
// stencil pulls from those registers. We're adding
// another wrapper Func around the wrapper Func we
// created above, so we say .in().in()
prev.in()
.in()
.compute_at(s, xi)
.vectorize(prev.args()[0], 2)
.unroll(prev.args()[0])
.unroll(prev.args()[1]);
I'm slowly getting the hang of what .in() does, but this I don't get. It seems the first wrapper is meant to copy the data into the block's shared memory, and the second one (the one in the code here) is meant to load it into registers? Maybe I'm not familiar enough with how CUDA works, but how can a function be loaded into registers? Does every value go into its own register? How do you know that happens in this case? Doesn't there need to be a .store_in(MemoryType::Register) then? And likewise for the shared-memory staging: doesn't it need a .store_in(MemoryType::GPUShared)?
apps/interpolate no longer uses the .in() directive. A new app should be chosen to guide the reader to a useful example.

(The documentation comment quoted above is Halide/src/Func.h, lines 1313 to 1316 at c0192ff.)
(The schedule quoted above is Halide/apps/stencil_chain/stencil_chain_generator.cpp, lines 86 to 101 at c0192ff.)