# <span style="color:green">Objective</span>
<h2>
To understand how CUDA threads execute on SIMD Hardware
<ul>
<li>Warp partitioning</li>
<li>SIMD Hardware</li>
<li>Control divergence</li>
</ul>
</h2>
<hr style="height:2px">

# <span style="color:green">Warps as Scheduling Units</span>


<br></br>
<div style="text-align:justify">

        <span style="float:center;clear:center">![alt tag](img/1.png)</span>
        
        <ul>
            <h2>
            <li>Each block is divided into 32-thread warps
                <ul>
                    <li>Warps are an implementation technique, not part of the CUDA programming model</li>
                    <li>Warps are the scheduling units in an SM</li>
                    <li>Threads in a warp execute in a Single Instruction Multiple Data (SIMD) manner</li>
                    <li>The number of threads in a warp may vary in future generations</li>
                </ul>
            </li>
            </h2>
        </ul>

</div>
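
The warp count of a block follows directly from the 32-thread warp size. The sketch below is plain host-side C++ rather than kernel code, and the names `kWarpSize` and `warps_per_block` are hypothetical helpers introduced only for illustration. It assumes a warp size of 32.

```cpp
// Assumption: 32 is the warp size on current NVIDIA GPUs. Device code should
// read the built-in warpSize variable rather than hard-coding this value.
constexpr int kWarpSize = 32;

// Number of warps needed to hold `threads_per_block` threads.
// The division rounds up, so a block whose size is not a multiple of the
// warp size still occupies a final, partially filled warp.
constexpr int warps_per_block(int threads_per_block, int warp_size = kWarpSize) {
    return (threads_per_block + warp_size - 1) / warp_size;
}
```

For example, `warps_per_block(256)` evaluates to 8, while a 100-thread block needs 4 warps, the last of which is only partly used.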

<hr style="height:2px">

# <span style="color:green">Warps in Multi-dimensional Thread Blocks</span>


<br></br>
<div style="text-align:justify">

        
        
        <ul>
            <h2>
            <li>Each thread block is first linearized into 1D in row-major order
                <ul>
                    <li>threadIdx.x varies fastest, then threadIdx.y, and threadIdx.z varies slowest</li>
                    
                </ul>
            </li>
            </h2>
        </ul>
        <br></br>
        <span style="float:center;clear:center">![alt tag](img/2.png)</span>

</div>
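
The row-major linearization can be written as a single index formula. This is a minimal host-side sketch: `Dim3` is a hypothetical stand-in for CUDA's `dim3`, and `linear_thread_id` is an illustrative name, not a CUDA API.

```cpp
// Hypothetical stand-in for CUDA's dim3 (used here for both threadIdx and blockDim).
struct Dim3 {
    int x, y, z;
};

// Linear index of a thread within its block.
// x varies fastest, then y, then z (row-major order).
constexpr int linear_thread_id(Dim3 tid, Dim3 block_dim) {
    return tid.x + tid.y * block_dim.x + tid.z * block_dim.x * block_dim.y;
}
```

In an 8 x 4 x 2 block, thread (7, 0, 0) has linear index 7 and thread (0, 1, 0) follows it with index 8. Inside a kernel the same expression would use `threadIdx` and `blockDim` directly.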

<hr style="height:2px">

# <span style="color:green">Blocks Are Partitioned After Linearization</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>Linearized thread blocks are partitioned
                <ul>
                    <li>Thread indices within a warp are consecutive and increasing.</li>
                    <li>Warp 0 starts with Thread 0</li>
                </ul>
            </li>
            <li>The partitioning scheme is consistent across devices
                <ul>
                    <li>However, the exact size of warps may change from generation to generation</li>
                </ul>
            </li>
            <li>DO NOT rely on any ordering within or between warps
                <ul>
                    <li>If there are any dependencies between threads in a block, you must call ```__syncthreads()``` to get correct results (more later).</li>
                </ul>
            </li>
            </h2>
        </ul>
</div>
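
Because warps are consecutive 32-thread slices of the linearized block, a thread's warp and its position inside that warp follow from one division. The sketch below is host-side C++. `WarpSlot`, `warp_slot`, and `kWarpSize` are hypothetical names, and the warp size of 32 is an assumption that may not hold on future hardware.

```cpp
// Assumption: warp size of 32; kernels should use the built-in warpSize.
constexpr int kWarpSize = 32;

// Position of a thread after partitioning: which warp, and which lane within it.
struct WarpSlot {
    int warp;
    int lane;
};

// Maps a linearized thread index to its warp and lane.
// Warp 0 holds threads 0..31, warp 1 holds threads 32..63, and so on.
constexpr WarpSlot warp_slot(int linear_tid, int warp_size = kWarpSize) {
    return WarpSlot{linear_tid / warp_size, linear_tid % warp_size};
}
```

For a 16 x 16 block, thread (0, 2) has linear index 32 and therefore starts warp 1, so each warp covers two rows of the block.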

<hr style="height:2px">

# <span style="color:green">SMs are SIMD Processors</span>


<br></br>
<div style="text-align:justify">

        
        
        <ul>
            <h2>
            <li>A single control unit handles instruction fetch, decode, and control for multiple processing units
                <ul>
                    <li>Control overhead is minimized (Module 1)</li>
                    
                </ul>
            </li>
            </h2>
        </ul>
        <br></br>
        <span style="float:center;clear:center">![alt tag](img/3.png)</span>

</div>

<hr style="height:2px">

# <span style="color:green">SIMD Execution Among Threads in a Warp</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>All threads in a warp must execute the same instruction at any point in time</li>
            <li>This works efficiently if all threads follow the same control flow path
                <ul>
                    <li>All threads make the same decision at every if-then-else statement</li>
                    <li>All threads execute every loop the same number of times</li>
                </ul>
            </li>
            </h2>
        </ul>
</div>

<hr style="height:2px">

# <span style="color:green">Control Divergence</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>Control divergence occurs when threads in a warp take different control flow paths by making different control decisions
                <ul>
                    <li>Some take the then-path and others take the else-path of an if-statement</li>
                    <li>Some threads execute a different number of loop iterations than others</li>
                </ul>
            </li>
            <li>The execution of threads taking different paths is serialized in current GPUs
                <ul>
                    <li>The control paths taken by the threads in a warp are traversed one at a time until none remain</li>
                    <li>During the execution of each path, all threads taking that path will be executed in parallel</li>
                    <li>The number of different paths can be large when considering nested control flow statements</li>
                </ul>
            </li>
            </h2>
        </ul>
</div>
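
A rough way to picture serialization is to count the distinct control paths taken inside one warp, since each path is traversed in its own pass. The host-side sketch below is a simplified model, not a description of the actual hardware mechanism. `serialized_passes` and the per-lane path labels are hypothetical, and the warp size of 32 is an assumption.

```cpp
#include <array>
#include <set>

// Assumption: warp size of 32.
constexpr int kWarpSize = 32;

// Number of serialized passes a warp needs in this simplified model:
// one pass per distinct control path taken by its lanes.
// path_of_lane[i] is a hypothetical label for the path that lane i follows
// (for example 0 = else-path, 1 = then-path).
int serialized_passes(const std::array<int, kWarpSize>& path_of_lane) {
    std::set<int> distinct(path_of_lane.begin(), path_of_lane.end());
    return static_cast<int>(distinct.size());
}
```

A warp whose lanes all carry the same label needs one pass, a warp split by a single if-statement needs two, and nested branches that yield four different labels need four.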

<hr style="height:2px">

# <span style="color:green">Control Divergence Examples</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>Divergence can arise when a branch or loop condition is a function of thread indices</li>
            <li>Example kernel statement with divergence:
                <ul>
                    <li>if (threadIdx.x > 2) { }</li>
                    <li>This creates two different control paths for threads in a block</li>
                    <li>Decision granularity &lt; warp size; threads 0, 1, and 2 follow a different path than the other threads in the first warp</li>
                </ul>
            </li>
            <li>Example without divergence:
                <ul>
                    <li>if (blockIdx.x > 2) { }</li>
                    <li>Decision granularity is a multiple of block size; all threads in any given warp follow the same path</li>
                </ul>
            </li>
            </h2>
        </ul>
</div>
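
The two conditions above can be checked by evaluating a branch predicate for every lane of every warp and counting the warps whose lanes disagree. This is a host-side sketch for a 1D grid of 1D blocks. `Branch` and `divergent_warps` are hypothetical names, and the warp size of 32 is an assumption.

```cpp
#include <functional>

// Assumption: warp size of 32.
constexpr int kWarpSize = 32;

// Branch condition as a function of (blockIdx.x, threadIdx.x).
using Branch = std::function<bool(int block_idx, int thread_idx)>;

// Counts the warps in a 1D grid whose lanes disagree on the branch condition.
int divergent_warps(int grid_dim, int block_dim, const Branch& cond) {
    int count = 0;
    for (int b = 0; b < grid_dim; ++b) {
        // Each warp starts at a multiple of the warp size within the block.
        for (int first = 0; first < block_dim; first += kWarpSize) {
            bool any_taken = false;
            bool any_skipped = false;
            for (int lane = 0; lane < kWarpSize && first + lane < block_dim; ++lane) {
                if (cond(b, first + lane)) {
                    any_taken = true;
                } else {
                    any_skipped = true;
                }
            }
            if (any_taken && any_skipped) {
                ++count;  // lanes took both paths, so this warp diverges
            }
        }
    }
    return count;
}
```

With 4 blocks of 256 threads, the condition `thread_idx > 2` splits the first warp of every block and gives 4 divergent warps, whereas `block_idx > 2` gives none.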

<hr style="height:2px">

# <span style="color:green">Example: Vector Addition Kernel</span>

<br></br>
<span style="float:left;clear:center">![alt tag](img/4.png)</span>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<hr style="height:2px">

# <span style="color:green">Analysis for a Vector Size of 1,000 Elements</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            
            <li>Assume that block size is 256 threads
                <ul>
                    <li>8 warps in each block</li>
                </ul>
            </li>
            <li>All threads in Blocks 0, 1, and 2 are within valid range
                <ul>
                    <li>i values from 0 to 767</li>
                    <li>There are 24 warps in these three blocks, none will have control divergence</li>
                </ul>
            </li>
            <li>Most warps in Block 3 will not have control divergence
                <ul>
                    <li>Threads in warps 0-6 are all within valid range, thus no control divergence</li>
                </ul>
            </li>
            <li>One warp in Block 3 will have control divergence
                <ul>
                    <li>Threads with i values 992-999 will all be within valid range</li>
                    <li>Threads with i values of 1000-1023 will be outside valid range</li>
                </ul>
            </li>
            <li>The effect of serialization due to control divergence will be small
                <ul>
                    <li>Only 1 of the 32 warps has control divergence</li>
                    <li>The impact on performance will likely be less than 3%</li>
                </ul>
            </li>
            </h2>
        </ul>
</div>
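
The warp counts in this analysis can be reproduced with a short host-side calculation. The sketch assumes the kernel guards its work with a boundary check of the form `i < n`, where `i = blockIdx.x * blockDim.x + threadIdx.x`. `DivergenceReport` and `analyze_boundary_check` are hypothetical names, and the default warp size of 32 is an assumption.

```cpp
#include <algorithm>

struct DivergenceReport {
    int total_warps;      // warps launched across the whole grid
    int divergent_warps;  // warps whose lanes disagree on i < n
};

// Counts warps that straddle the boundary n, assuming a 1D grid that
// launches just enough blocks of block_dim threads to cover n elements.
DivergenceReport analyze_boundary_check(int n, int block_dim, int warp_size = 32) {
    int grid_dim = (n + block_dim - 1) / block_dim;
    int warps_per_block = (block_dim + warp_size - 1) / warp_size;
    DivergenceReport report{grid_dim * warps_per_block, 0};
    for (int b = 0; b < grid_dim; ++b) {
        for (int w = 0; w < warps_per_block; ++w) {
            // Global index range [first, last] covered by this warp.
            int first = b * block_dim + w * warp_size;
            int last = std::min(first + warp_size, (b + 1) * block_dim) - 1;
            if (first < n && last >= n) {
                ++report.divergent_warps;  // some lanes pass the check, some fail
            }
        }
    }
    return report;
}
```

Calling `analyze_boundary_check(1000, 256)` reports 32 warps in total and 1 divergent warp, the one covering i values 992 to 1023.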

<hr style="height:2px">

<footer>
<cite> GPU NVIDIA Teaching Kit - University of Illinois </cite>
</footer>
<hr style="height:2px">