# <span style="color:green">Objective</span>
<div style="text-align:justify;font-size:20px">
<body>
To learn to write a basic reduction kernel
<ul>
<li>Thread to data mapping</li>
<li>Turning off threads</li>
<li>Control divergence</li>
</ul>
</body>
</div>
<hr style="height:2px">

# <span style="color:green">Parallel Sum Reduction</span>

<br></br>
<div style="text-align:justify;font-size:20px">
        <ul>
            <body>
            
            <li>Parallel implementation
                <ul>
                    <li>Recursively halve the number of threads; each active thread adds two values per step</li>
                    <li>Takes log<sub>2</sub>(n) steps for n elements and requires n/2 threads in the first step</li>
                </ul>
            </li>
            <li>Assume an in-place reduction using shared memory
                <ul>
                    <li>The original vector is in device global memory</li>
                    <li>The shared memory is used to hold a partial sum vector</li>
                    <li>Each step brings the partial sum vector closer to the sum</li>
                    <li>The final sum will be in element 0 of the partial sum vector</li>
                    <li>Keeping the partial sum values in shared memory reduces global memory traffic</li>
                    <li>Thread block size limits n to at most 2,048 (up to 1,024 threads per block, two elements per thread)</li>
                </ul>
            </li>
            </body>
        </ul>
</div>
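
<div style="text-align:justify;font-size:20px">
<p>The step and thread counts above can be checked with a small host-side sketch. This is plain C++ that runs on the CPU, not kernel code. The function name <code>threadsPerStep</code> is hypothetical, and the sketch assumes n is a power of two.</p>
</div>

```C++
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the number of active threads in each reduction step for n elements.
// Assumption for this sketch: n is a power of two and at least 2.
std::vector<std::size_t> threadsPerStep(std::size_t n) {
    assert(n >= 2 && (n & (n - 1)) == 0);

    std::vector<std::size_t> schedule;
    // Start with n/2 threads (each adds two values) and halve after every step.
    for (std::size_t threads = n / 2; threads >= 1; threads /= 2) {
        schedule.push_back(threads);
    }
    return schedule;
}
```

<div style="text-align:justify;font-size:20px">
<p>For n = 8 the schedule is 4, 2, 1: three steps, matching log<sub>2</sub>(8), and 4 + 2 + 1 = 7 additions in total, the same number a sequential sum performs.</p>
</div>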

<hr style="height:2px">

# <span style="color:green">A Parallel Sum Reduction Example</span>

<br></br>
<span style="float:center;clear:center">![alt tag](img/1.png)</span>





<hr style="height:2px">

# <span style="color:green">A Naive Thread to Data Mapping</span>

<br></br>
<div style="text-align:justify;font-size:20px">
        <body>
        <ul>
            <li>Each thread is responsible for an even-index location of the partial sum
vector (location of responsibility)</li>
            <li>After each step, half of the threads are no longer needed</li>
            <li>One of the inputs is always from the location of responsibility</li>
            <li>In each step, one of the inputs comes from an increasing distance away</li>
        </ul>
        </body>
</div>
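
<div style="text-align:justify;font-size:20px">
<p>The mapping described above can be sketched on the host as follows. The helper names <code>ownedIndex</code>, <code>partnerIndex</code>, and <code>activeThreads</code> are hypothetical, <code>blockSize</code> stands in for <code>blockDim.x</code>, and the activity test <code>t % stride == 0</code> is the one used in the reduction loop of this deck.</p>
</div>

```C++
#include <cassert>
#include <vector>

// Location of responsibility: thread t owns the even index 2*t.
unsigned int ownedIndex(unsigned int t) { return 2 * t; }

// The second input lies `stride` elements away from the owned location.
unsigned int partnerIndex(unsigned int t, unsigned int stride) { return 2 * t + stride; }

// Threads still needed at a given stride (stride = 1, 2, 4, ... up to blockSize).
std::vector<unsigned int> activeThreads(unsigned int blockSize, unsigned int stride) {
    assert(stride >= 1 && stride <= blockSize);
    std::vector<unsigned int> active;
    for (unsigned int t = 0; t < blockSize; ++t) {
        if (t % stride == 0) active.push_back(t);
    }
    return active;
}
```

<div style="text-align:justify;font-size:20px">
<p>With a block of 4 threads, the active threads are 0, 1, 2, 3 at stride 1, then 0 and 2 at stride 2, then only 0 at stride 4. The surviving threads stay spread across the block rather than packed together, which is where the control divergence listed in the objectives comes from.</p>
</div>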

<hr style="height:2px">

# <span style="color:green">A Simple Thread Block Design</span>


<br></br>
<div style="text-align:justify;font-size:20px">
        <body>
        <ul>
                    <li>Each thread block takes 2*blockDim.x input elements</li>
                    <li>Each thread loads 2 elements into shared memory</li>
        </ul>
        </body>
</div>
```C++
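// BLOCK_SIZE is assumed to be a compile-time constant equal to blockDim.x
__shared__ float partialSum[2*BLOCK_SIZE];

unsigned int t = threadIdx.x;
// Index of the first input element handled by this block
unsigned int start = 2*blockIdx.x*blockDim.x;
// Each thread loads two elements that lie blockDim.x apart
partialSum[t] = input[start + t];
partialSum[blockDim.x + t] = input[start + blockDim.x + t];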
```
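
<div style="text-align:justify;font-size:20px">
<p>The index arithmetic of the load phase can be tried on the host with the sketch below. It is a CPU stand-in, not kernel code. The function name <code>loadPartialSums</code> is hypothetical, and the zero padding for indices past the end of the input is an added assumption: the kernel snippet above performs no bounds check.</p>
</div>

```C++
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for the load phase of one thread block.
// Assumption (not in the kernel snippet): indices past the end of `input`
// are padded with 0.0f, the identity for addition.
std::vector<float> loadPartialSums(const std::vector<float>& input,
                                   std::size_t blockIndex,
                                   std::size_t blockSize) {
    assert(blockSize > 0);
    std::vector<float> partialSum(2 * blockSize, 0.0f);
    const std::size_t start = 2 * blockIndex * blockSize;

    for (std::size_t t = 0; t < blockSize; ++t) {  // one iteration per simulated thread
        const std::size_t first = start + t;
        const std::size_t second = start + blockSize + t;
        if (first < input.size()) partialSum[t] = input[first];
        if (second < input.size()) partialSum[blockSize + t] = input[second];
    }
    return partialSum;
}
```

<div style="text-align:justify;font-size:20px">
<p>For a 5-element input and a block size of 2, block 0 loads elements 0 to 3 and block 1 loads element 4 followed by three padding zeros.</p>
</div>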

<hr style="height:2px">

# <span style="color:green">The Reduction Steps</span>



```C++
for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2)
{
    __syncthreads();
    // Thread t owns location 2*t; only every stride-th thread stays active
    if (t % stride == 0)
        partialSum[2*t] += partialSum[2*t + stride];
}
```
<br></br>
<div style="text-align:center;font-size:20px">
        <body>
            <p><strong>Why do we need __syncthreads()?</strong></p>
        </body>
</div>



<hr style="height:2px">

# <span style="color:green">Barrier Synchronization</span>


<br></br>
<div style="text-align:justify;font-size:20px">
        <body>
        <ul>
                    <li>__syncthreads() is needed to ensure that every thread in the block has
finished writing its partial sum for the current step before any thread
proceeds to the next step</li>
        </ul>
        </body>
</div>
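
<div style="text-align:justify;font-size:20px">
<p>A sequential sketch makes the ordering requirement visible. In the CPU stand-in below, the inner loop plays the role of the threads of one block and the outer loop advances the stride only after every simulated thread has finished the current step, which is the ordering the barrier provides on the GPU. The function name <code>reduceBlock</code> is hypothetical, and the block size (half the vector length) is assumed to be a power of two.</p>
</div>

```C++
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential stand-in for the reduction loop of one thread block.
// `partialSum` must hold 2 * blockSize values; blockSize is assumed to be a power of two.
float reduceBlock(std::vector<float> partialSum) {
    const std::size_t blockSize = partialSum.size() / 2;
    assert(blockSize >= 1 && partialSum.size() == 2 * blockSize);
    assert((blockSize & (blockSize - 1)) == 0);

    for (std::size_t stride = 1; stride <= blockSize; stride *= 2) {
        // Every simulated thread finishes this step before the stride grows.
        // That ordering is what __syncthreads() enforces on the GPU.
        for (std::size_t t = 0; t < blockSize; ++t) {
            if (t % stride == 0) {
                partialSum[2 * t] += partialSum[2 * t + stride];
            }
        }
    }
    return partialSum[0];  // the final sum ends up in element 0
}
```

<div style="text-align:justify;font-size:20px">
<p>Within one step, the active threads write only to their own locations and read values that other threads produced in the previous step. Without the barrier, a fast thread could read such a value before its producer has written it.</p>
</div>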


<hr style="height:2px">

# <span style="color:green">Back to the Global Picture</span>

<br></br>
<div style="text-align:justify;font-size:20px">
        <body>
        <ul>
            <li>At the end of the kernel, thread 0 in each thread block
writes the block's sum, held in partialSum[0], into an output
vector at index blockIdx.x</li>
            <li>There can be a large number of such sums if the original
vector is very large
                <ul>
                    <li>The host code may iterate and launch another kernel</li>
                </ul>
            </li>
            <li>If there are only a small number of sums, the host can
simply transfer the data back and add them together</li>
        </ul>
        </body>
</div>
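
<div style="text-align:justify;font-size:20px">
<p>The host-side strategy above can be sketched as a loop over passes. This is a CPU stand-in in which <code>reducePass</code> replaces a kernel launch. Both function names and the <code>hostThreshold</code> parameter (how many sums count as "a small number") are hypothetical.</p>
</div>

```C++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Stand-in for one kernel launch: each simulated block sums its own
// 2 * blockSize input elements and writes the result at its block index.
std::vector<float> reducePass(const std::vector<float>& input, std::size_t blockSize) {
    assert(blockSize > 0);
    const std::size_t perBlock = 2 * blockSize;
    const std::size_t numBlocks = (input.size() + perBlock - 1) / perBlock;
    std::vector<float> blockSums(numBlocks, 0.0f);
    for (std::size_t block = 0; block < numBlocks; ++block) {
        const std::size_t start = block * perBlock;
        const std::size_t end = std::min(start + perBlock, input.size());
        for (std::size_t i = start; i < end; ++i) {
            blockSums[block] += input[i];
        }
    }
    return blockSums;
}

// Hypothetical host driver: relaunch while many sums remain,
// then add the few remaining sums directly on the host.
float reduceOnHost(std::vector<float> data, std::size_t blockSize, std::size_t hostThreshold) {
    assert(hostThreshold >= 1);
    while (data.size() > hostThreshold) {
        data = reducePass(data, blockSize);
    }
    return std::accumulate(data.begin(), data.end(), 0.0f);
}
```

<div style="text-align:justify;font-size:20px">
<p>With ten inputs 1 to 10, a block size of 2, and a threshold of 2, the first pass yields the three block sums 10, 26, and 19, a second pass reduces them to 55, and the host returns that value.</p>
</div>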

<hr style="height:2px">

<footer>
<cite> GPU NVIDIA Teaching Kit - University of Illinois </cite>
</footer>
<hr style="height:2px">