# <span style="color:green">Objective</span>
<h2>
To learn to handle arbitrary matrix sizes in tiled matrix multiplication
<ul>
<li>Boundary condition checking</li>
<li>Regularizing tile contents</li>
<li>Rectangular matrices</li>
</ul>
</h2>
<hr style="height:2px">

# <span style="color:green">Handling Matrices of Arbitrary Size</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>The tiled matrix multiplication kernel we presented so far can
handle only square matrices whose dimensions (Width) are
multiples of the tile width (TILE_WIDTH)
                <ul>
                    <li>However, real applications need to handle arbitrarily sized matrices.</li>
                    <li>One could pad the rows and columns (add elements) up to multiples of the tile size, but this would incur significant space and data transfer overhead.</li>
                </ul>
            </li>
            <li>We will take a different approach.</li>
            </h2>
        </ul>
</div>
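
To put a number on the padding overhead mentioned above, here is a minimal host-side C++ sketch (it is not part of the kernel). It assumes square `Width` x `Width` matrices; the helper names `num_tiles`, `padded_width`, and `padding_elements` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Number of tiles needed to cover `width` elements (ceiling division).
int num_tiles(int width, int tile_width) {
    assert(width > 0 && tile_width > 0);
    return (width + tile_width - 1) / tile_width;
}

// Smallest multiple of tile_width that is >= width.
int padded_width(int width, int tile_width) {
    return num_tiles(width, tile_width) * tile_width;
}

// Extra elements that padding adds to ONE square matrix.
long long padding_elements(int width, int tile_width) {
    long long w = width;
    long long pw = padded_width(width, tile_width);
    return pw * pw - w * w;
}
```

For the 3x3 example with a tile width of 2, each matrix would grow from 9 to 16 elements, and the extra elements would have to be allocated and transferred for M, N, and P alike.
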

<hr style="height:2px">

# <span style="color:green">Phase 1 Loads for Block (0,0) for a 3x3 Example</span>

<br></br>
<span style="float:center;clear:center">![alt tag](img/1.png)</span>
<br></br>
<br></br>
<hr style="height:2px">

# <span style="color:green">Phase 1 Use for Block (0,0) (iteration 0)</span>

<br></br>
<span style="float:center;clear:center">![alt tag](img/2.png)</span>
<br></br>
<br></br>
<hr style="height:2px">

# <span style="color:green">Phase 1 Use for Block (0,0) (iteration 1)</span>

<br></br>
<span style="float:center;clear:center">![alt tag](img/3.png)</span>
<br></br>
<br></br>
<hr style="height:2px">

# <span style="color:green">Phase 0 Loads for Block (1,1) for a 3x3 Example</span>

<br></br>
<span style="float:center;clear:center">![alt tag](img/4.png)</span>
<br></br>
<br></br>
<hr style="height:2px">

# <span style="color:green">Handling Matrices of Arbitrary Size</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>Threads that do not calculate valid P elements still need to participate in loading the input tiles
                <ul>
                    <li>In Phase 0 of Block(1,1), Thread(1,0) is assigned to calculate the non-existent P[3,2] but still needs to load tile element N[1,2]</li>
                </ul>
            </li>
            <li>Threads that calculate valid P elements may attempt to load non-existing input elements when loading input tiles
                <ul>
                    <li>In Phase 1 of Block(0,0), Thread(1,0) is assigned to calculate the valid P[1,0] but attempts to load the non-existent N[3,0]</li>
                </ul>
            </li>
            </h2>
        </ul>
</div>
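
The two situations above can be checked mechanically. The following host-side C++ sketch assumes the 3x3 example with a `TILE_WIDTH` of 2, and assumes that Block and Thread indices are written in (y, x) order. The struct `ThreadAccess` and the function `classify` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>

constexpr int TILE_WIDTH = 2;  // tile size assumed for the 3x3 example
constexpr int WIDTH = 3;       // matrix dimension of the example

struct ThreadAccess {
    int row, col;      // P element assigned to the thread
    int m_col, n_row;  // column of M and row of N touched in this phase
    bool p_valid;      // does P[row][col] exist?
    bool m_valid;      // does M[row][m_col] exist?
    bool n_valid;      // does N[n_row][col] exist?
};

// Classify what thread (ty, tx) of block (by, bx) does in phase p.
ThreadAccess classify(int by, int bx, int ty, int tx, int p) {
    assert(ty >= 0 && ty < TILE_WIDTH && tx >= 0 && tx < TILE_WIDTH);
    ThreadAccess a{};
    a.row = by * TILE_WIDTH + ty;
    a.col = bx * TILE_WIDTH + tx;
    a.m_col = p * TILE_WIDTH + tx;
    a.n_row = p * TILE_WIDTH + ty;
    a.p_valid = a.row < WIDTH && a.col < WIDTH;
    a.m_valid = a.row < WIDTH && a.m_col < WIDTH;
    a.n_valid = a.n_row < WIDTH && a.col < WIDTH;
    return a;
}
```

Under these assumptions, `classify(1, 1, 1, 0, 0)` reports an invalid P[3,2] together with a valid load of N[1,2], while `classify(0, 0, 1, 0, 1)` reports a valid P[1,0] together with an out-of-range load of N[3,0].
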

<hr style="height:2px">

# <span style="color:green">A “Simple” Solution</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>When a thread is to load any input element, test if it is in the valid index range
                <ul>
                    <li>If valid, proceed to load</li>
                    <li>Else, do not load, just write a 0</li>
                </ul>
            </li>
            <li>Rationale: a 0 value ensures that the multiply-add step does not affect the final value of the output element</li>
            <li>The condition tested for loading input elements is different from the test for calculating the output P element
                <ul>
                    <li>A thread that does not calculate a valid P element can still participate in loading input tile elements</li>
                </ul>
            </li>
            </h2>
        </ul>
</div>
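
The rationale can be seen on a single dot product. In this minimal C++ sketch, the hypothetical helpers `load_or_zero` and `tiled_dot` process two vectors in tile-sized chunks; the last chunk runs past the end of the data, and the substituted zeros leave the sum unchanged. Plain `std::vector` inputs stand in for a row of M and a column of N.

```cpp
#include <cassert>
#include <vector>

// Read v[i] if i is a valid index, otherwise supply 0.
float load_or_zero(const std::vector<float>& v, int i) {
    return (i >= 0 && i < static_cast<int>(v.size())) ? v[i] : 0.0f;
}

// Dot product computed phase by phase, tile_width elements at a time.
float tiled_dot(const std::vector<float>& a, const std::vector<float>& b,
                int tile_width) {
    assert(a.size() == b.size() && tile_width > 0);
    int n = static_cast<int>(a.size());
    int phases = (n + tile_width - 1) / tile_width;  // ceiling division
    float sum = 0.0f;
    for (int p = 0; p < phases; ++p) {
        for (int k = 0; k < tile_width; ++k) {
            int i = p * tile_width + k;  // may be >= n in the last phase
            sum += load_or_zero(a, i) * load_or_zero(b, i);  // adds 0 there
        }
    }
    return sum;
}
```
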

<hr style="height:2px">

# <span style="color:green">Phase 1 Use for Block (0,0) (iteration 1)</span>

<br></br>
<span style="float:center;clear:center">![alt tag](img/5.png)</span>
<br></br>
<br></br>
<hr style="height:2px">

# <span style="color:green">Boundary Condition for Input M Tile</span>


<br></br>
<div style="text-align:justify">

        <span style="float:right;clear:center">![alt tag](img/6.png)</span>
        <ul>
            <h2>
            <li>Each thread loads
                <ul>
                    <li>M[Row][p*TILE_WIDTH+tx] (2D view)</li>
                    <li>M[Row*Width + p*TILE_WIDTH+tx] (equivalent linearized, row-major index)</li>
                </ul>
            </li>
          
            <li>Need to test
                <ul>
                    <li>(Row < Width) && (p*TILE_WIDTH+tx < Width)</li>
                    <li>If true, load M element</li>
                    <li>Else, load 0</li>
                </ul>
            </li>
            </h2>
        </ul>

</div>

<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
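
The M-tile test above can be sketched as a small host-side C++ function. It assumes a row-major `Width` x `Width` matrix stored in a `std::vector<float>` and a `TILE_WIDTH` of 2; the function name `load_m` is hypothetical, and in a real kernel the returned value would be written into the shared-memory tile.

```cpp
#include <cassert>
#include <vector>

constexpr int TILE_WIDTH = 2;  // assumed tile size for this sketch

// Value that the thread with x-index tx, working on output row Row,
// contributes to the M tile in phase p.
float load_m(const std::vector<float>& M, int Width, int Row, int p, int tx) {
    assert(static_cast<int>(M.size()) == Width * Width);
    int col = p * TILE_WIDTH + tx;          // column of M for this phase
    if (Row < Width && col < Width) {
        return M[Row * Width + col];        // in range: real element
    }
    return 0.0f;                            // out of range: neutral value
}
```
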
<hr style="height:2px">

# <span style="color:green">Boundary Condition for Input N Tile</span>


<br></br>
<div style="text-align:justify">

        <span style="float:right;clear:center">![alt tag](img/7.png)</span>
        <ul>
            <h2>
            <li>Each thread loads
                <ul>
                    <li>N[p*TILE_WIDTH+ty][Col] (2D view)</li>
                    <li>N[(p*TILE_WIDTH+ty)*Width + Col] (equivalent linearized, row-major index)</li>
                </ul>
            </li>
          
            <li>Need to test
                <ul>
                    <li>(p*TILE_WIDTH+ty < Width) && (Col < Width)</li>
                    <li>If true, load N element</li>
                    <li>Else, load 0</li>
                </ul>
            </li>
            </h2>
        </ul>

</div>

<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
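
The N-tile test differs from the M-tile test in that the phase index moves down the rows of N rather than across columns. A minimal host-side C++ sketch, again assuming a row-major `Width` x `Width` matrix and a `TILE_WIDTH` of 2 (the name `load_n` is hypothetical):

```cpp
#include <cassert>
#include <vector>

constexpr int TILE_WIDTH = 2;  // assumed tile size for this sketch

// Value that the thread with y-index ty, working on output column Col,
// contributes to the N tile in phase p.
float load_n(const std::vector<float>& N, int Width, int p, int ty, int Col) {
    assert(static_cast<int>(N.size()) == Width * Width);
    int row = p * TILE_WIDTH + ty;          // row of N for this phase
    if (row < Width && Col < Width) {
        return N[row * Width + Col];        // in range: real element
    }
    return 0.0f;                            // out of range: neutral value
}
```
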
<hr style="height:2px">

# <span style="color:green">Loading Elements – with boundary check</span>

<br></br>
<span style="float:left;clear:center">![alt tag](img/8.png)</span>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>

<hr style="height:2px">

# <span style="color:green">Inner Product – Before and After</span>

<br></br>
<span style="float:left;clear:center">![alt tag](img/9.png)</span>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
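
As a text companion to the figure, here is a hedged C++ sketch of what one thread does after the tiles are loaded: accumulate over the tile, then store only if its P element exists. It does not reproduce the code in the image. The type `Tile` and the names `ds_M`, `ds_N`, `accumulate_phase`, and `store_if_valid` are assumptions made for this illustration, and the tiles are ordinary arrays standing in for shared memory.

```cpp
#include <array>
#include <cassert>
#include <vector>

constexpr int TILE_WIDTH = 2;  // assumed tile size for this sketch
using Tile = std::array<std::array<float, TILE_WIDTH>, TILE_WIDTH>;

// One thread's inner product over the tiles of a single phase.
// Out-of-range tile entries are assumed to have been loaded as 0,
// so no extra test is needed inside the loop.
float accumulate_phase(const Tile& ds_M, const Tile& ds_N,
                       int ty, int tx, float Pvalue) {
    for (int i = 0; i < TILE_WIDTH; ++i) {
        Pvalue += ds_M[ty][i] * ds_N[i][tx];
    }
    return Pvalue;
}

// Write the result only when the thread owns a real P element.
void store_if_valid(std::vector<float>& P, int Width,
                    int Row, int Col, float Pvalue) {
    assert(static_cast<int>(P.size()) == Width * Width);
    if (Row < Width && Col < Width) {
        P[Row * Width + Col] = Pvalue;
    }
}
```
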

<hr style="height:2px">

# <span style="color:green">Some Important Points</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>For each thread the conditions are different for
                <ul>
                    <li>Loading M element</li>
                    <li>Loading N element</li>
                    <li>Calculating and storing output elements</li>
                </ul>
            </li>
            <li>The effect of control divergence should be small for large matrices, since only tiles that straddle a matrix boundary contain threads that take different paths</li>
            </h2>
        </ul>
</div>
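
A rough way to gauge the divergence claim is to count how many (block row, block column, phase) combinations contain at least one failing load test. The C++ sketch below is a coarse, tile-level proxy under the assumption of square matrices; real divergence is decided per warp, which this does not model. The names `StepCount` and `count_boundary_steps` are hypothetical.

```cpp
#include <cassert>

struct StepCount {
    long long total;     // all (block row, block column, phase) combinations
    long long affected;  // combinations where some load test fails
};

StepCount count_boundary_steps(int Width, int tile_width) {
    assert(Width > 0 && tile_width > 0);
    int tiles = (Width + tile_width - 1) / tile_width;
    StepCount c{0, 0};
    for (int by = 0; by < tiles; ++by) {
        for (int bx = 0; bx < tiles; ++bx) {
            for (int p = 0; p < tiles; ++p) {
                // A tile is fully in range when its last index is < Width.
                bool rows_ok = (by + 1) * tile_width <= Width;
                bool cols_ok = (bx + 1) * tile_width <= Width;
                bool k_ok = (p + 1) * tile_width <= Width;
                ++c.total;
                if (!(rows_ok && cols_ok && k_ok)) {
                    ++c.affected;
                }
            }
        }
    }
    return c;
}
```

For a `Width` of 1000 and a tile width of 16, this counts 11719 affected combinations out of 250047 (about 4.7%), compared with 7 out of 8 for the 3x3 example with a tile width of 2.
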

<hr style="height:2px">

# <span style="color:green">Handling General Rectangular Matrices</span>

<br></br>
<div style="text-align:justify">
        <ul>
            <h2>
            <li>In general, the matrix multiplication is defined in terms of rectangular matrices
                <ul>
                    <li>A j x k M matrix multiplied with a k x l N matrix results in a j x l P matrix</li>
                </ul>
            </li>
            <li>We have presented square matrix multiplication, a special case</li>
            <li>The kernel function needs to be generalized to handle general rectangular matrices
                <ul>
                    <li>The Width argument is replaced by three arguments: j, k, l</li>
                    <li>When Width is used to refer to the height of M or height of P, replace it with j</li>
                </ul>
            </li>
            </h2>
        </ul>
</div>
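
The generalization can be sketched end to end as a host-side C++ emulation of the kernel, in which loops over `ty` and `tx` stand in for the threads of a block and finishing a loop stands in for a barrier. This is an illustration under stated assumptions, not the kernel itself: matrices are row-major `std::vector<float>`, `TILE_WIDTH` is 2, and the names `tiled_matmul_rect`, `ds_M`, and `ds_N` are hypothetical. Following the dimensions given above, the width of M and the height of N become `k`, and the width of N and of P become `l`.

```cpp
#include <cassert>
#include <vector>

constexpr int TILE_WIDTH = 2;  // assumed tile size for this sketch

// M is j x k, N is k x l, and the result P is j x l (all row-major).
std::vector<float> tiled_matmul_rect(const std::vector<float>& M,
                                     const std::vector<float>& N,
                                     int j, int k, int l) {
    assert(j > 0 && k > 0 && l > 0);
    assert(static_cast<int>(M.size()) == j * k);
    assert(static_cast<int>(N.size()) == k * l);
    std::vector<float> P(j * l, 0.0f);

    int grid_y = (j + TILE_WIDTH - 1) / TILE_WIDTH;  // blocks over rows of P
    int grid_x = (l + TILE_WIDTH - 1) / TILE_WIDTH;  // blocks over columns of P
    int phases = (k + TILE_WIDTH - 1) / TILE_WIDTH;  // tiles along k

    for (int by = 0; by < grid_y; ++by) {
        for (int bx = 0; bx < grid_x; ++bx) {
            float Pvalue[TILE_WIDTH][TILE_WIDTH] = {};  // one per thread
            for (int p = 0; p < phases; ++p) {
                float ds_M[TILE_WIDTH][TILE_WIDTH];  // shared-memory stand-ins
                float ds_N[TILE_WIDTH][TILE_WIDTH];
                // Load step: every thread loads, writing 0 when out of range.
                for (int ty = 0; ty < TILE_WIDTH; ++ty) {
                    for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                        int Row = by * TILE_WIDTH + ty;
                        int Col = bx * TILE_WIDTH + tx;
                        int mCol = p * TILE_WIDTH + tx;
                        int nRow = p * TILE_WIDTH + ty;
                        ds_M[ty][tx] = (Row < j && mCol < k)
                                           ? M[Row * k + mCol] : 0.0f;
                        ds_N[ty][tx] = (nRow < k && Col < l)
                                           ? N[nRow * l + Col] : 0.0f;
                    }
                }
                // Use step: inner product over the loaded tiles.
                for (int ty = 0; ty < TILE_WIDTH; ++ty)
                    for (int tx = 0; tx < TILE_WIDTH; ++tx)
                        for (int i = 0; i < TILE_WIDTH; ++i)
                            Pvalue[ty][tx] += ds_M[ty][i] * ds_N[i][tx];
            }
            // Store step: only threads that own a real P element write.
            for (int ty = 0; ty < TILE_WIDTH; ++ty) {
                for (int tx = 0; tx < TILE_WIDTH; ++tx) {
                    int Row = by * TILE_WIDTH + ty;
                    int Col = bx * TILE_WIDTH + tx;
                    if (Row < j && Col < l) P[Row * l + Col] = Pvalue[ty][tx];
                }
            }
        }
    }
    return P;
}
```

Calling it with `j`, `k`, and `l` all equal to `Width` recovers the square case, so the same three tests (M load, N load, P store) cover both.
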

<hr style="height:2px">