Documentation for .in() refers to the interpolate app, which has since been rewritten. #6454

Open
mcourteaux opened this issue Nov 30, 2021 · 1 comment
Labels
documentation Missing, incorrect, or unclear. Spelling & grammar mistakes.

mcourteaux commented Nov 30, 2021

apps/interpolate no longer uses the .in() directive. The documentation should point to a different app that gives the reader a useful example.

Halide/src/Func.h

Lines 1313 to 1316 in c0192ff

* Func::in() can also be used to compute pieces of a Func into a
* smaller scratch buffer (perhaps on the GPU) and then copy them
* into a larger output buffer one tile at a time. See
* apps/interpolate/interpolate.cpp for an example of this. In
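For readers unfamiliar with the pattern the quoted comment describes, here is a minimal sketch in plain C++ of what "compute pieces into a smaller scratch buffer, then copy them into a larger output one tile at a time" could look like as a hand-written loop nest. It is not Halide code and not the output of any Halide schedule. The names `produce`, `kTile`, and `compute_tiled` are hypothetical, and the tile size is an assumption chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Func whose values are being computed.
inline int produce(int x, int y) { return x + 10 * y; }

constexpr int kTile = 4;  // assumed tile size, for illustration only

// Fill a width x height output one kTile x kTile tile at a time.
std::vector<int> compute_tiled(int width, int height) {
    assert(width % kTile == 0 && height % kTile == 0);
    std::vector<int> out(static_cast<std::size_t>(width) * height);
    for (int ty = 0; ty < height; ty += kTile) {
        for (int tx = 0; tx < width; tx += kTile) {
            // Small scratch buffer that only ever holds one tile.
            std::array<int, kTile * kTile> scratch{};
            for (int y = 0; y < kTile; ++y) {
                for (int x = 0; x < kTile; ++x) {
                    scratch[y * kTile + x] = produce(tx + x, ty + y);
                }
            }
            // Separate copy stage: move the finished tile into the
            // larger output buffer. This plays the role the quoted
            // comment assigns to the copy into the output.
            for (int y = 0; y < kTile; ++y) {
                for (int x = 0; x < kTile; ++x) {
                    out[static_cast<std::size_t>(ty + y) * width + (tx + x)] =
                        scratch[y * kTile + x];
                }
            }
        }
    }
    return out;
}
```

Calling `compute_tiled(8, 4)` gives the same values as evaluating `produce` directly at every pixel. The point of the split is only that the compute stage and the copy stage can live in different kinds of memory.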

While we are on the subject of .in() (again with the FAQ effort in mind), I'd also like to hear about the technique of copying memory into an SM's shared memory for improved performance. There is a trick somewhere in the apps that uses .in().in() to achieve this. I think it needs extensive elaboration:

// A similar benefit applies for the
// vectorized/unrolled 2x2 tiles. Instead of having
// each unrolled iteration do its own mix of scalar
// and vector loads from shared memory in a 5x5
// window, many of which get deduped across the block,
// we load a 6x6 window of shared into registers using
// only aligned vector loads, and then the actual
// stencil pulls from those registers. We're adding
// another wrapper Func around the wrapper Func we
// created above, so we say .in().in()
prev.in()
    .in()
    .compute_at(s, xi)
    .vectorize(prev.args()[0], 2)
    .unroll(prev.args()[0])
    .unroll(prev.args()[1]);

I'm slowly getting the hang of what .in() does, but I don't understand this one. It seems that the first block is meant to copy the data into the block's shared memory, and the second one (the one embedded in the code here) is meant to load it into registers? Maybe I'm not familiar enough with how CUDA works, but how can a function be loaded into registers? Does every value go into a register? How do you know that this happens in this case? Doesn't that require a .store_in(MemoryType::Register)? The same goes for the load into shared memory: doesn't it need a .store_in(MemoryType::GPUShared)?
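To make the two staging levels in the quoted comment concrete, here is a rough plain C++ sketch of the data movement it describes: a per-block copy of the input footprint (the role the question assigns to shared memory), then a 6x6 window loaded once per 2x2 output tile, from which the 5x5 stencil reads. This is a hand-written approximation, not Halide code, not CUDA, and not what the compiler emits. The block size `kBlock`, the clamped boundary in `input_at`, and the box-sum stencil are all assumptions made for illustration; only the 2x2 tile, 5x5 window, and 6x6 load come from the comment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kRadius = 2;                     // 5x5 stencil
constexpr int kBlock = 8;                      // outputs per block side (assumed)
constexpr int kSub = 2;                        // 2x2 outputs per inner tile
constexpr int kShared = kBlock + 2 * kRadius;  // 12x12 staged footprint
constexpr int kWindow = kSub + 2 * kRadius;    // 6x6 window per 2x2 tile

// Hypothetical input accessor with a clamped boundary.
inline int input_at(const std::vector<int>& in, int w, int h, int x, int y) {
    x = x < 0 ? 0 : (x >= w ? w - 1 : x);
    y = y < 0 ? 0 : (y >= h ? h - 1 : y);
    return in[static_cast<std::size_t>(y) * w + x];
}

// Reference: every output reads its 5x5 neighbourhood from the input.
std::vector<int> box5_direct(const std::vector<int>& in, int w, int h) {
    std::vector<int> out(static_cast<std::size_t>(w) * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int sum = 0;
            for (int dy = -kRadius; dy <= kRadius; ++dy)
                for (int dx = -kRadius; dx <= kRadius; ++dx)
                    sum += input_at(in, w, h, x + dx, y + dy);
            out[static_cast<std::size_t>(y) * w + x] = sum;
        }
    return out;
}

// Same result, with two explicit staging copies.
std::vector<int> box5_staged(const std::vector<int>& in, int w, int h) {
    assert(w % kBlock == 0 && h % kBlock == 0);
    std::vector<int> out(static_cast<std::size_t>(w) * h);
    for (int by = 0; by < h; by += kBlock)
        for (int bx = 0; bx < w; bx += kBlock) {
            // First copy: the whole footprint of this block, loaded once.
            std::array<int, kShared * kShared> shared{};
            for (int sy = 0; sy < kShared; ++sy)
                for (int sx = 0; sx < kShared; ++sx)
                    shared[sy * kShared + sx] =
                        input_at(in, w, h, bx + sx - kRadius, by + sy - kRadius);
            for (int ty = 0; ty < kBlock; ty += kSub)
                for (int tx = 0; tx < kBlock; tx += kSub) {
                    // Second copy: a small fixed-size 6x6 window. With
                    // constant bounds, these loops can be fully unrolled.
                    std::array<int, kWindow * kWindow> regs{};
                    for (int wy = 0; wy < kWindow; ++wy)
                        for (int wx = 0; wx < kWindow; ++wx)
                            regs[wy * kWindow + wx] =
                                shared[(ty + wy) * kShared + (tx + wx)];
                    // The stencil for all four outputs reads only regs.
                    for (int oy = 0; oy < kSub; ++oy)
                        for (int ox = 0; ox < kSub; ++ox) {
                            int sum = 0;
                            for (int dy = 0; dy < 2 * kRadius + 1; ++dy)
                                for (int dx = 0; dx < 2 * kRadius + 1; ++dx)
                                    sum += regs[(oy + dy) * kWindow + (ox + dx)];
                            out[static_cast<std::size_t>(by + ty + oy) * w +
                                (bx + tx + ox)] = sum;
                        }
                }
        }
    return out;
}
```

`box5_staged` and `box5_direct` return identical results for any input whose width and height are multiples of `kBlock`. A small array with compile-time size and fully unrolled, constant indexing is the kind of allocation a compiler can keep in registers. Whether Halide needs an explicit `.store_in(...)` to get that placement is exactly the open question here, and this sketch does not settle it.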

@mcourteaux
Contributor Author

Documentation of in() should definitely refer to the tutorial. Didn't know there was one by now.

abadams added the documentation label on Dec 8, 2021