# <span style="color:green">Objective</span>
<h2>
To learn to analyze the performance impact of control divergence
<ul>
<li>Boundary condition checking</li>
<li>Control divergence is data-dependent</li>
</ul>
</h2>
<hr style="height:2px">

# <span style="color:green">Performance Impact of Control Divergence</span>


<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>Boundary condition checks are vital for complete functionality and robustness of parallel code
                <ul>
                    <li>The tiled matrix multiplication kernel has many boundary condition checks</li>
                    <li>The concern is that these checks may cause significant performance degradation</li>
                    <li>For example, see the tile loading code below:</li>
                </ul>
            </li>
            </h2>
        </ul>
        <br></br>
        <span style="float:center;clear:center">![alt tag](img/1.png)</span>

</div>
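
A minimal sketch, in Python rather than CUDA, of the kind of boundary check that guards an M tile load and of what it means for a warp to diverge on it. It assumes the check has the form `Row < Width && ph*TILE + tx < Width` and that a 16x16 thread block is linearized row-major into 32-thread warps. The names `m_load_in_range` and `warp_diverges` are illustrative and are not taken from the kernel in the figure.

```python
TILE = 16   # tile width and thread-block edge (assumed, as in the 16x16 example)
WARP = 32   # threads per warp

def m_load_in_range(row, ph, tx, width, tile=TILE):
    """Assumed boundary check guarding the load of one M element."""
    return row < width and ph * tile + tx < width

def warp_diverges(by, warp, ph, width, tile=TILE, warp_size=WARP):
    """True if the threads of one warp disagree on the boundary check."""
    outcomes = set()
    for lane in range(warp_size):
        tid = warp * warp_size + lane   # linear thread index inside the block
        ty, tx = divmod(tid, tile)      # row-major thread layout
        row = by * tile + ty            # matrix row handled by this thread
        outcomes.add(m_load_in_range(row, ph, tx, width, tile))
    return len(outcomes) == 2           # both True and False were seen
```

A warp only pays for divergence when `outcomes` holds both values. A warp whose threads all fail the check takes a single path, just like a warp whose threads all pass.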

<hr style="height:2px">

# <span style="color:green">Two types of blocks in loading M Tiles</span>


<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>Type 1: blocks whose tiles are all within the valid range until the last phase</li>
            <li>Type 2: blocks whose tiles are partially outside the valid range in every phase</li>
            </h2>
        </ul>
        <br></br>
        <span style="float:center;clear:center">![alt tag](img/2.png)</span>

</div>
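
The two block types can be told apart from the block row index alone, because for M loads only the row range of a block decides whether its tiles can fall outside the matrix in every phase. The sketch below is an illustration under that assumption; `count_block_types` is a hypothetical helper, not part of the kernel.

```python
import math

def count_block_types(width, tile=16):
    """Return (type1_blocks, type2_blocks) for loading M tiles."""
    grid = math.ceil(width / tile)        # blocks per grid dimension
    type2_rows = 0
    for by in range(grid):
        last_row = by * tile + tile - 1   # last matrix row this block row covers
        if last_row >= width:             # some threads sit below the matrix
            type2_rows += 1
    type1_rows = grid - type2_rows
    return type1_rows * grid, type2_rows * grid

print(count_block_types(100))  # (42, 7)
```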

<hr style="height:2px">

# <span style="color:green">Analysis of Control Divergence Impact</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            
            <li>Assume 16x16 tiles and thread blocks</li>
            <li>Each thread block has 8 warps (256/32)</li>
            <li>Assume square matrices of 100x100</li>
            <li>Each thread will go through 7 phases (ceiling of 100/16)</li>
            <li>There are 49 thread blocks (7 in each dimension)</li>
            </h2>
        </ul>
</div>
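
The setup numbers above follow from three small calculations, sketched here with a hypothetical helper `launch_shape`. The defaults mirror the stated assumptions (16x16 tiles, 32-thread warps, 100x100 matrices).

```python
import math

def launch_shape(width=100, tile=16, warp_size=32):
    """Return (warps_per_block, phases, thread_blocks) for a square matrix."""
    warps_per_block = (tile * tile) // warp_size   # 256 / 32
    phases = math.ceil(width / tile)               # tiles along the shared dimension
    blocks = phases * phases                       # same count in each grid dimension
    return warps_per_block, phases, blocks

print(launch_shape())  # (8, 7, 49)
```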

<hr style="height:2px">

# <span style="color:green">Control Divergence in Loading M Tiles (Type 1)</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            
            <li>Assume 16x16 tiles and thread blocks</li>
            <li>Each thread block has 8 warps (256/32)</li>
            <li>Assume square matrices of 100x100</li>
            <li>Each thread will go through 7 phases (ceiling of 100/16)</li>
            </h2>
        </ul>
        <br></br>
        <ul>
            <h2>
            <li>There are 42 (6*7) Type 1 blocks, with a total of 336 (8*42) warps</li>
            <li>They all have 7 phases, so there are 2,352 (336*7) warp-phases</li>
            <li>The warps have control divergence only in their last phase</li>
            <li>So 336 warp-phases, one per Type 1 warp, have control divergence</li>
            </h2>
        </ul>
        <br></br>
        <span style="float:center;clear:center">![alt tag](img/3.png)</span>
</div>
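
The Type 1 counts can be reproduced with a short sketch. It assumes each warp spans whole thread rows (warp size 32, tile width 16), so every warp sees all 16 columns of a tile and diverges exactly in the phases whose tile straddles the matrix edge. `type1_counts` is an illustrative name.

```python
import math

def type1_counts(width=100, tile=16, warp_size=32):
    """Return (blocks, warps, warp_phases, divergent_warp_phases) for Type 1."""
    phases = math.ceil(width / tile)
    grid = phases                                  # blocks per grid dimension
    warps_per_block = (tile * tile) // warp_size
    full_block_rows = width // tile                # block rows entirely inside M
    blocks = full_block_rows * grid
    warps = blocks * warps_per_block
    warp_phases = warps * phases
    # Phases whose 16 tile columns are partly inside and partly outside M.
    straddling = [ph for ph in range(phases)
                  if ph * tile < width < (ph + 1) * tile]
    divergent = warps * len(straddling)
    return blocks, warps, warp_phases, divergent

print(type1_counts())  # (42, 336, 2352, 336)
```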

<hr style="height:2px">

# <span style="color:green">Control Divergence in Loading M Tiles (Type 2)</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            
            <li>Type 2: the 7 blocks assigned to load the bottom tiles, with a total of 56 (8*7) warps</li>
            <li>They all have 7 phases, so there are 392 (56*7) warp-phases</li>
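            <li>The first 2 warps in each Type 2 block cover the last 4 valid rows (96 to 99) and stay within the valid range until the last phase</li>
            <li>The 6 remaining warps stay outside the valid range in every phase, so all of their threads take the same path and they never diverge</li>
            <li>So, only 14 (2*7) warp-phases have control divergence: 2 warps in each of the 7 blocks, in the last phase only</li>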
            </h2>
        </ul>
        <br></br>
        <span style="float:center;clear:center">![alt tag](img/4.png)</span>
</div>
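
To see why only two warps per Type 2 block matter, it helps to list the matrix rows each warp covers. The sketch below assumes a row-major thread layout, so each 32-thread warp covers two consecutive 16-thread rows. The labels `inside`, `outside`, and `mixed` are hypothetical names introduced for this illustration.

```python
def classify_type2_warps(width=100, tile=16, warp_size=32):
    """Label each warp of a bottom (Type 2) block by the rows it covers."""
    rows_per_warp = warp_size // tile              # 2 thread rows per warp
    warps_per_block = (tile * tile) // warp_size   # 8
    first_row = (width // tile) * tile             # first row of the bottom block row
    labels = []
    for warp in range(warps_per_block):
        rows = [first_row + warp * rows_per_warp + r for r in range(rows_per_warp)]
        valid = [row < width for row in rows]
        if all(valid):
            labels.append("inside")    # diverges only when the columns run out
        elif not any(valid):
            labels.append("outside")   # every thread fails the check: no divergence
        else:
            labels.append("mixed")     # would diverge in every phase
    return labels

labels = classify_type2_warps()
print(labels.count("inside"), labels.count("outside"))  # 2 6
```

With 100x100 matrices no warp is `mixed`, so the divergent warp-phases are the 2 `inside` warps of each of the 7 bottom blocks in the last phase.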

<hr style="height:2px">

# <span style="color:green">Overall Impact of Control Divergence</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            
            <li>Type 1 Blocks: 336 out of 2,352 warp-phases have control divergence</li>
            <li>Type 2 Blocks: 14 out of 392 warp-phases have control divergence</li>
            <li>The performance impact is expected to be less than 13% (350/2,744, or (336+14)/(2,352+392))</li>
            </h2>
        </ul>
        <br></br>
        <span style="float:center;clear:center">![alt tag](img/5.png)</span>
</div>
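
As a cross-check on the totals, the divergent warp-phases can be counted directly by simulating the boundary check for every thread. This is a brute-force sketch under the same assumptions as before (row-major thread layout, a check of the form `Row < Width && ph*TILE + tx < Width`). `count_m_divergence` is an illustrative name.

```python
def count_m_divergence(width=100, tile=16, warp_size=32):
    """Return (divergent_warp_phases, total_warp_phases) for loading M tiles."""
    grid = -(-width // tile)                       # ceiling division
    phases = grid
    warps_per_block = (tile * tile) // warp_size
    per_column = 0
    # The M check does not depend on the block column, so one column is enough.
    for by in range(grid):
        for warp in range(warps_per_block):
            for ph in range(phases):
                outcomes = set()
                for lane in range(warp_size):
                    ty, tx = divmod(warp * warp_size + lane, tile)
                    outcomes.add(by * tile + ty < width and ph * tile + tx < width)
                per_column += len(outcomes) == 2
    divergent = per_column * grid
    total = grid * grid * warps_per_block * phases
    return divergent, total

divergent, total = count_m_divergence()
print(divergent, total)  # 350 2744
```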

<hr style="height:2px">

# <span style="color:green">Additional Comments</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>The calculation of impact of control divergence in loading N tiles is somewhat different and is left as an exercise</li>
            <li>The estimated performance impact is data-dependent
                <ul>
                    <li>For larger matrices, the impact will be significantly smaller</li>
                </ul>
            </li>
            <li>In general, the impact of control divergence for boundary condition checking for large input data sets should be insignificant
                <ul>
                    <li>One should not hesitate to use boundary checks to ensure full functionality</li>
                </ul>
            </li>
            <li>The fact that a kernel is full of control flow constructs does not mean that there will be heavy occurrence of control divergence</li>
            <li>We will cover some algorithm patterns that naturally incur control divergence (such as parallel reduction) in the Parallel Algorithm Patterns modules</li>
            </h2>
        </ul>
</div>
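
The dependence on matrix size can be made concrete with a closed-form count for M loads. The formula below is a sketch derived under this module's assumptions (square matrices, a warp size that is a multiple of the tile width, row-major thread layout) and is not taken from the original slides. `divergent_fraction` is an illustrative name.

```python
def divergent_fraction(width, tile=16, warp_size=32):
    """Fraction of M-load warp-phases with control divergence (closed form)."""
    rows_per_warp = warp_size // tile              # thread rows covered by one warp
    warps_per_block = (tile * tile) // warp_size
    full, rem = divmod(width, tile)                # full block rows, leftover rows
    if rem == 0:
        return 0.0                                 # every tile is fully in range
    grid = full + 1                                # blocks per dimension = phases
    # Counts below are per block column; the ratio is the same for the whole grid.
    divergent = full * warps_per_block             # Type 1 warps: last phase only
    divergent += rem // rows_per_warp              # Type 2 warps with all rows valid
    if rem % rows_per_warp:
        divergent += grid                          # a straddling warp diverges in every phase
    total = grid * warps_per_block * grid
    return divergent / total

for width in (100, 1000, 1600):
    print(width, round(divergent_fraction(width), 4))
```

The fraction shrinks roughly in proportion to the number of phases, because each warp diverges in at most a small, fixed number of them while the total number of warp-phases keeps growing.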

<hr style="height:2px">

<footer>
<cite> GPU NVIDIA Teaching Kit - University of Illinois </cite>
</footer>
<hr style="height:2px">