# <span style="color:green">Objective</span>
<h2>
To learn to effectively use the CUDA memory types in a parallel
program.
<ul>
<li> Importance of memory access efficiency. </li>
<li> Registers, shared memory, global memory.</li>
<li> Scope and lifetime.</li>
</ul>
</h2>
<hr style="height:2px">

# <span style="color:green">Review: Image Blur Kernel.</span>




<span style="float:left;clear:center">![alt tag](img/1.png)</span>


<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<hr style="height:2px">

# <span style="color:green">How About Performance on a GPU?</span>
<br></br>
<div style="text-align:justify">
    <h2>
        <ul>
            <li>All threads access global memory for their input matrix elements.
                <ul>
                    <li>One memory access (4 bytes) per floating-point addition</li>
                    <li>4 bytes of memory traffic per floating-point operation (4 B/FLOP)</li>
                </ul>
            </li>
            <li>Assume a GPU with
                <ul>
                    <li>Peak floating-point rate of 1,500 GFLOPS and 200 GB/s of DRAM bandwidth</li>
                    <li>4 B/FLOP × 1,500 GFLOPS = 6,000 GB/s required to achieve the peak FLOPS rating</li>
                    <li>The 200 GB/s memory bandwidth limits execution to 200/4 = 50 GFLOPS</li>
                </ul>
            </li>
            <li>This limits the execution rate to 3.3% (50/1,500) of the peak floating-point execution rate of the device!</li>
            <li>Need to drastically cut down memory accesses to get close to the 1,500 GFLOPS peak</li>
            
        </ul>
    </h2>
</div>
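
The arithmetic on this slide can be checked with a few lines of host code. The sketch below is not part of the original slides: the function names `required_bandwidth_gbs`, `attainable_gflops`, and `fraction_of_peak` are hypothetical helpers introduced only for illustration, and the model assumes performance is limited purely by DRAM bandwidth (no caching, no data reuse).

```c++
#include <cassert>

// Hypothetical helper: DRAM bandwidth (GB/s) needed to sustain `gflops`
// when every floating-point operation moves `bytes_per_flop` bytes.
double required_bandwidth_gbs(double gflops, double bytes_per_flop) {
    assert(gflops >= 0.0 && bytes_per_flop > 0.0);
    return gflops * bytes_per_flop;
}

// Hypothetical helper: compute rate (GFLOPS) that `bandwidth_gbs` can feed
// under the same bytes-per-operation assumption.
double attainable_gflops(double bandwidth_gbs, double bytes_per_flop) {
    assert(bandwidth_gbs >= 0.0 && bytes_per_flop > 0.0);
    return bandwidth_gbs / bytes_per_flop;
}

// Hypothetical helper: fraction of the device's peak rate that is reachable.
double fraction_of_peak(double bandwidth_gbs, double bytes_per_flop,
                        double peak_gflops) {
    assert(peak_gflops > 0.0);
    double attainable = attainable_gflops(bandwidth_gbs, bytes_per_flop);
    // A kernel cannot exceed the peak rate, however much bandwidth it has.
    if (attainable > peak_gflops) attainable = peak_gflops;
    return attainable / peak_gflops;
}
```

With the slide's numbers, `required_bandwidth_gbs(1500.0, 4.0)` gives 6,000 GB/s, `attainable_gflops(200.0, 4.0)` gives 50 GFLOPS, and `fraction_of_peak(200.0, 4.0, 1500.0)` is about 0.033. Lowering the bytes moved per operation is the only term in this model that the programmer controls, which is why the rest of the deck focuses on reducing global memory accesses.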

<hr style="height:2px">

# <span style="color:green">Example – Matrix Multiplication</span>




<span style="float:left;clear:center">![alt tag](img/2.png)</span>


<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<hr style="height:2px">

# <span style="color:green">A Basic Matrix Multiplication</span>




<span style="float:left;clear:center">![alt tag](img/3.png)</span>


<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>

<hr style="height:2px">

# <span style="color:green">Example – Matrix Multiplication</span>




<span style="float:left;clear:center">![alt tag](img/4.png)</span>


<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>

<hr style="height:2px">

# <span style="color:green">A Toy Example: Thread to P Data Mapping</span>




<span style="float:left;clear:center">![alt tag](img/4.png)</span>


<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<br></br>

<hr style="height:2px">

# <span style="color:green">Calculation of $P_{0,0}$ and $P_{0,1}$</span>




<span style="float:center;clear:center">![alt tag](img/6.png)</span>


<br></br>
<br></br>




<hr style="height:2px">

# <span style="color:green">Memory and Registers in the Von-Neumann Model</span>




<span style="float:center;clear:center">![alt tag](img/7.png)</span>


<br></br>
<br></br>




<hr style="height:2px">

# <span style="color:green">Programmer View of CUDA Memories</span>




<span style="float:center;clear:center">![alt tag](img/8.png)</span>






<hr style="height:2px">

# <span style="color:green">Declaring CUDA Variables</span>




<span style="float:center;clear:center">![alt tag](img/9.png)</span>






<hr style="height:2px">

# <span style="color:green">Example: Shared Memory Variable Declaration</span>

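```c++
#define TILE_WIDTH 16  // assumed tile size; the slide does not give a value
// __global__ marks blurKernel as a kernel that the host launches on the device.
__global__ void blurKernel(unsigned char *in, unsigned char *out, int w, int h) {
    // One copy of ds_in exists per thread block, shared by all threads in that block.
    __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];
    ...
}
```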

<hr style="height:2px">

# <span style="color:green">Where to Declare Variables?</span>




<span style="float:center;clear:center">![alt tag](img/10.png)</span>






<hr style="height:2px">

# <span style="color:green">Shared Memory in CUDA</span>
<br></br>
<div style="text-align:justify">
    <h2>
        <ul>
            <li>Shared memory is a special type of memory whose contents are explicitly defined and used in the kernel source code.
                <ul>
                    <li>One in each SM</li>
                    <li>Accessed at much higher speed (in both latency and throughput) than global memory</li>
                    <li>Scope of access and sharing: a thread block</li>
                    <li>Lifetime: the thread block; contents disappear after the corresponding thread block finishes execution</li>
                    <li>Accessed by memory load/store instructions</li>
                    <li>A form of scratchpad memory in computer architecture</li>
                </ul>
            </li>
            
        </ul>
    </h2>
</div>
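
The scope and lifetime rules above can be seen in a small kernel. This is a minimal sketch rather than code from the slides: the kernel name `reverseChunks`, the constant `BLOCK_SIZE`, and the choice of reversing each chunk are all assumptions made for illustration. It assumes the array length is a multiple of `BLOCK_SIZE` and omits CUDA error checking for brevity.

```c++
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // hypothetical block size chosen for this sketch

// Reverses each BLOCK_SIZE-element chunk of `data` in place.
// Assumes n is a multiple of BLOCK_SIZE and BLOCK_SIZE threads per block.
__global__ void reverseChunks(float *data, int n) {
    // Shared memory: one copy per thread block, gone when the block finishes.
    __shared__ float tile[BLOCK_SIZE];

    int local = threadIdx.x;                        // register, private to the thread
    int global = blockIdx.x * blockDim.x + local;   // index into global memory

    if (global < n) tile[local] = data[global];     // global -> shared
    __syncthreads();  // every thread in the block has written its element

    // Read a value that a different thread of the same block stored.
    if (global < n) data[global] = tile[BLOCK_SIZE - 1 - local];
}

int main() {
    const int n = 4 * BLOCK_SIZE;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    reverseChunks<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // First element of chunk 0 and of chunk 1 after the reversal.
    printf("%.0f %.0f\n", host[0], host[BLOCK_SIZE]);  // prints "255 511"
    return 0;
}
```

Each block only ever sees its own `tile`, so block 1 cannot read what block 0 stored; data that must outlive a block or cross block boundaries has to go back to global memory, as the final write to `data` does here. The `__syncthreads()` call is what makes it safe for a thread to read an element written by another thread in the same block.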

<hr style="height:2px">

# <span style="color:green">Hardware View of CUDA Memories</span>




<span style="float:center;clear:center">![alt tag](img/11.png)</span>






<hr style="height:2px">

<footer>
<cite> GPU NVIDIA Teaching Kit - University of Illinois </cite>
</footer>
<hr style="height:2px">