# <span style="color:green">Objective</span>
<div style="text-align:justify;font-size:20px">
<body>
To learn to write a better reduction kernel
<ul>
<li>Resource efficiency analysis</li>
<li>Improved thread-to-data mapping</li>
<li>Reduced control divergence</li>
</ul>
</body>
</div>
<hr style="height:2px">

# <span style="color:green">Some Observations on the Naïve Reduction Kernel</span>

<br></br>
<div style="text-align:justify;font-size:20px">
        <ul>
            <body>
            
            <li>In each iteration, every warp sequentially traverses two control flow paths
                <ul>
                    <li>Threads that perform addition and threads that do not</li>
                    <li>Threads that do not perform addition still consume execution resources</li>
                </ul>
            </li>
            <li>After the first step, at most half of the threads perform an addition
                <ul>
                    <li>All odd-index threads are disabled after the first step</li>
                    <li>After the 5th step, entire warps in each block fail the <code>if</code> test, which gives poor resource utilization but no divergence
                        <ul>
                            <li>This continues for up to 6 more steps (stride = 32, 64, 128, 256, 512, 1024), during which each active warp has only one productive thread, until all warps in the block retire</li>
                        </ul>
                    </li>
                </ul>
            </li>
            </body>
        </ul>
</div>
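
The observations above can be checked with a small host-side model. This is a minimal sketch, not the kernel itself: it assumes the naïve kernel guards its addition with `t % stride == 0` and doubles `stride` from 1 up to the block size, which matches the strides listed above but is not shown in this section. The names `WarpStats` and `naiveWarpStats` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Per-step warp activity for one thread block (host-side model, not a kernel).
struct WarpStats {
    unsigned stride;
    unsigned divergentWarps;  // some, but not all, threads pass the test
    unsigned idleWarps;       // no thread passes the test
};

// ASSUMPTION: the naive kernel guards its addition with (t % stride == 0)
// and doubles stride from 1 up to blockSize.
std::vector<WarpStats> naiveWarpStats(unsigned blockSize, unsigned warpSize = 32) {
    assert(warpSize > 0 && blockSize % warpSize == 0);
    std::vector<WarpStats> stats;
    for (unsigned stride = 1; stride <= blockSize; stride *= 2) {
        WarpStats s{stride, 0, 0};
        for (unsigned warpStart = 0; warpStart < blockSize; warpStart += warpSize) {
            // Count the threads of this warp that would perform an addition.
            unsigned active = 0;
            for (unsigned t = warpStart; t < warpStart + warpSize; ++t)
                if (t % stride == 0) ++active;
            if (active == 0) ++s.idleWarps;
            else if (active < warpSize) ++s.divergentWarps;
        }
        stats.push_back(s);
    }
    return stats;
}
```

Calling `naiveWarpStats(1024)` returns one entry per step. Under the stated assumption, the later entries show a growing number of fully idle warps, while every warp that is still active keeps a single productive thread and therefore stays divergent.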

<hr style="height:2px">

# <span style="color:green">Thread Index Usage Matters</span>

<br></br>
<div style="text-align:justify;font-size:20px">
        <ul>
            <body>
            
            <li>In some algorithms, one can shift the index usage to reduce divergence
                <ul>
                    <li>This works for commutative and associative operators</li>
                </ul>
            </li>
            <li>Always compact the partial sums into the front locations of the <code>partialSum[]</code> array</li>
            <li>Keep the active threads consecutive</li>
            </body>
        </ul>
</div>
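
To make "keep the active threads consecutive" concrete, the following sketch lists which thread indices perform an addition in a given step under two mappings. It is a host-side illustration only. The interleaved predicate `t % stride == 0` is an assumption about the naïve kernel, and the helper names `activeInterleaved` and `activeCompacted` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// ASSUMPTION: interleaved mapping, where thread t adds when t % stride == 0.
std::vector<unsigned> activeInterleaved(unsigned numThreads, unsigned stride) {
    assert(stride > 0);
    std::vector<unsigned> active;
    for (unsigned t = 0; t < numThreads; ++t)
        if (t % stride == 0) active.push_back(t);
    return active;
}

// Compacted mapping: thread t adds when t < stride, so the active
// threads are always the consecutive indices 0 .. stride - 1.
std::vector<unsigned> activeCompacted(unsigned numThreads, unsigned stride) {
    std::vector<unsigned> active;
    for (unsigned t = 0; t < numThreads; ++t)
        if (t < stride) active.push_back(t);
    return active;
}
```

For 8 threads and a step with two active threads, `activeInterleaved(8, 4)` yields threads 0 and 4, whereas `activeCompacted(8, 2)` yields threads 0 and 1. The amount of work is the same; only the compacted mapping packs the workers next to each other.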

<hr style="height:2px">

# <span style="color:green">An Example with 4 Threads</span>

<br></br>
<span style="float:center;clear:center">![alt tag](img/1.png)</span>





<hr style="height:2px">

# <span style="color:green">A Better Reduction Kernel</span>

```C++
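// Assumes t = threadIdx.x and that partialSum[] (shared memory) holds
// 2 * blockDim.x elements, so t + stride stays in bounds in the first step.
for (unsigned int stride = blockDim.x; stride > 0; stride /= 2)
{
    __syncthreads();  // partial sums from the previous step must be visible
    if (t < stride)   // active threads are the consecutive indices 0..stride-1
        partialSum[t] += partialSum[t + stride];
}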
```
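
The loop can be emulated on the host to see that the partial sums end up at the front of the array and the total lands in `partialSum[0]`. This is a minimal sketch under two assumptions that the kernel fragment does not state explicitly: `partialSum` holds twice as many elements as there are threads, and the thread count is a power of two. The function name `emulateBlockReduce` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of one block running the compacted reduction loop.
// ASSUMPTIONS: partialSum holds 2 * blockDim elements; blockDim is a power of two.
float emulateBlockReduce(std::vector<float> partialSum) {
    assert(!partialSum.empty() && partialSum.size() % 2 == 0);
    const std::size_t blockDim = partialSum.size() / 2;
    assert((blockDim & (blockDim - 1)) == 0);  // power-of-two check

    for (std::size_t stride = blockDim; stride > 0; stride /= 2) {
        // The inner loop plays the role of all threads between two barriers.
        // Running them one after another is safe because a step reads only
        // indices in [stride, 2 * stride) and writes only indices below stride.
        for (std::size_t t = 0; t < blockDim; ++t) {
            if (t < stride)
                partialSum[t] += partialSum[t + stride];
        }
    }
    return partialSum[0];
}
```

With the eight inputs 1 through 8 and four emulated threads, the strides are 4, 2, and 1, and the function returns 36.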





<hr style="height:2px">

# <span style="color:green">A Quick Analysis</span>

<br></br>
<div style="text-align:justify;font-size:20px">
        <ul>
            <body>
            
            
            <li>For a 1024 thread block
                <ul>
                    <li>No divergence in the first 6 steps
                        <ul>
                            <li>1024, 512, 256, 128, 64, and 32 consecutive threads are active in these steps, respectively</li>
                            <li>The threads in each warp are either all active or all inactive</li>
                        </ul>
                    </li>
                    <li>The final 5 steps (stride = 16, 8, 4, 2, 1) still have divergence, because fewer than 32 threads are active in the one remaining warp</li>
                </ul>
            </li>
            </body>
        </ul>
</div>
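
The step counts in this analysis can be reproduced with a short host-side model of the predicate `t < stride`. This is a sketch only: it assumes a warp size of 32 and a block size that is a multiple of the warp size, and the function name `compactedDivergentWarps` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Number of divergent warps at each step of the compacted loop
// (stride = blockSize, blockSize / 2, ..., 1; thread t is active when t < stride).
// ASSUMPTION: blockSize is a multiple of warpSize.
std::vector<unsigned> compactedDivergentWarps(unsigned blockSize, unsigned warpSize = 32) {
    assert(warpSize > 0 && blockSize % warpSize == 0);
    std::vector<unsigned> divergentPerStep;
    for (unsigned stride = blockSize; stride > 0; stride /= 2) {
        unsigned divergent = 0;
        for (unsigned warpStart = 0; warpStart < blockSize; warpStart += warpSize) {
            // Threads of this warp whose index is below stride.
            unsigned active = 0;
            if (stride > warpStart)
                active = (stride - warpStart < warpSize) ? stride - warpStart : warpSize;
            // A warp diverges only when it is partly active.
            if (active > 0 && active < warpSize) ++divergent;
        }
        divergentPerStep.push_back(divergent);
    }
    return divergentPerStep;
}
```

For a 1024-thread block the result has 11 entries: six zeros for stride = 1024 down to 32, followed by five ones for stride = 16 down to 1, where only part of the first warp is active.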

<hr style="height:2px">

<footer>
<cite> GPU NVIDIA Teaching Kit - University of Illinois </cite>
</footer>
<hr style="height:2px">