* Func::in() can also be used to compute pieces of a Func into a
* smaller scratch buffer (perhaps on the GPU) and then copy them
* into a larger output buffer one tile at a time. See
* apps/interpolate/interpolate.cpp for an example of this. [...]
While we are at .in() (again with the FAQ efforts in mind), I'd also like to hear about the technique of copying memory into an SM's shared memory for improved performance. There is a trick somewhere in the apps that uses .in().in() to achieve this. I think this needs extensive elaboration:
// vectorized/unrolled 2x2 tiles. Instead of having
// each unrolled iteration do its own mix of scalar
// and vector loads from shared memory in a 5x5
// window, many of which get deduped across the block,
// we load a 6x6 window of shared into registers using
// only aligned vector loads, and then the actual
// stencil pulls from those registers. We're adding
// another wrapper Func around the wrapper Func we
// created above, so we say .in().in()
prev.in()
.in()
.compute_at(s, xi)
.vectorize(prev.args()[0], 2)
.unroll(prev.args()[0])
.unroll(prev.args()[1]);
I'm slowly getting the hang of what .in() does, but this I don't get. It seems the first wrapper is meant to copy the data into the block's shared memory, and the second one (the one in the code here) is meant to load it into registers? Maybe I'm not familiar enough with how CUDA works, but how can a function be loaded into registers? Does every value go into its own register? How do you know that happens in this case? Doesn't there need to be a .store_in(MemoryType::Register) then? And likewise for the shared-memory staging: doesn't it need a .store_in(MemoryType::GPUShared)?
apps/interpolate no longer uses the .in() directive. A new app should be chosen to guide the reader to a useful example.

(The documentation comment quoted above is Halide/src/Func.h, lines 1313 to 1316 at c0192ff.)
(The schedule quoted above is Halide/apps/stencil_chain/stencil_chain_generator.cpp, lines 86 to 101 at c0192ff.)