# Convolution exploiting GPUs

GPUs offer a highly parallel substrate for accelerating data-parallel computations. Most modern GPU architectures are organized as clusters, each analogous to a **Compute Unit** in OpenCL. Each cluster contains several ALUs, called **Processing Elements** in OpenCL terminology, and a **Local Memory** that is shared for communication within the cluster. Multiple threads of work, or **work-items**, are mapped to a cluster according to the workgroup scheduling. Among existing OpenCL-compatible hardware platforms, the GPU matches the OpenCL hardware model most closely.

1. Local/Constant Memory Optimizations
Apart from SIMD, threading, and unrolling optimizations, the GPU substrate provides a unique optimization opportunity in the form of memory selection. Both **constant** and **local** memories can be used to accelerate an OpenCL task, but the kernels have to be written in a specific way to exploit them. This optimization typically brings little benefit on a CPU, where neither memory is explicitly available and both default to the processor cache.
- To use **constant** memory, we tag the relevant kernel arguments with the **__constant** address space qualifier. The host-side buffer should also be created with the **CL_MEM_READ_ONLY** flag, which marks it as read-only inside kernels, before the data is copied from the host to the OpenCL device.
- To use **local** memory, we can declare a fixed-size array in the OpenCL kernel body with the **__local** qualifier; its size must be a compile-time constant. (A **__local** pointer kernel argument, sized from the host with **clSetKernelArg** and a NULL value, is the alternative when the size is only known at run time.) Either way, we have to copy the data from **__global** memory into the **__local** array ourselves. Depending on the size of the workgroup and the **__local** array, we can divide the loading task across multiple work-items.
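The last point, dividing the load across work-items, can be sketched on the CPU in plain C. This is a minimal emulation under stated assumptions, not device code: the function name `load_tile` and all of its parameters are hypothetical, the two outer loops stand in for work-items that a GPU would run in parallel, and pixels outside the image are stored as zero.

```c
#include <assert.h>
#include <stddef.h>

/*
 * CPU emulation of one workgroup copying a region of the image from
 * "global" memory (in) into a "local" buffer (tile).
 *   group_w, group_h : workgroup size (hypothetical parameters)
 *   tile_w, tile_h   : size of the region to copy, may exceed the group size
 *   x0, y0           : top-left corner of the region in the W*H image
 * Assumption: pixels outside the image are stored as zero.
 */
void load_tile(const float *in, int W, int H,
               float *tile, int tile_w, int tile_h,
               int x0, int y0, int group_w, int group_h)
{
    assert(in != NULL && tile != NULL);
    assert(group_w > 0 && group_h > 0);

    /* These two loops stand in for the work-items a GPU runs in parallel. */
    for (int ly = 0; ly < group_h; ly++) {
        for (int lx = 0; lx < group_w; lx++) {
            /* Work-item (lx, ly) copies every group_w-th column and
               every group_h-th row, starting at its own position. */
            for (int j = ly; j < tile_h; j += group_h) {
                for (int i = lx; i < tile_w; i += group_w) {
                    int gx = x0 + i;
                    int gy = y0 + j;
                    int inside = gx >= 0 && gx < W && gy >= 0 && gy < H;
                    tile[j * tile_w + i] = inside ? in[gy * W + gx] : 0.0f;
                }
            }
        }
    }
}
```

As a rough illustration of the sizes involved, a 16×16 workgroup applying a 3×3 filter needs an 18×18 region, so 256 work-items share 324 loads: most copy one pixel, and those in the first two rows or columns of the group copy two or four. Those 324 loads then serve the 256 × 9 = 2304 pixel reads made by the filter loops.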

For 2D convolution, we can store the filter kernels in **__constant** memory and prefetch portions of the input image into **__local** memory. Alternatively, for 3D convolution tasks (multiple 2D convolutions), we can keep the output image in **__local** memory instead. The figure below shows the OpenCL memory hierarchy.

![](memory-hierarchy.png)

We now show the OpenCL kernel for 2D convolution optimized for using **constant** and **local** memories below.

In [None]:
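#define TILE  16                        // workgroup is TILE x TILE work-items (assumed)
#define MAX_K 7                         // largest filter size supported (assumed)
#define LW    (TILE + MAX_K - 1)        // row length of the local tile, including halo

__kernel __attribute__((reqd_work_group_size(TILE, TILE, 1)))
void convolve(
        const __global float *in,               // W*H input image
        __constant float *filt,                 // K*K filter kernel, K <= MAX_K
        __global float *out,                    // W*H output image
        const int K,                            // filter resolution
        const float pBias)                      // constant offset/bias
{
        // get image resolution
        const int W = get_global_size(0);
        const int H = get_global_size(1);

        // get pixel position
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        // position inside the workgroup, and top-left pixel of its tile
        const int lx = get_local_id(0);
        const int ly = get_local_id(1);
        const int x0 = x - lx;
        const int y0 = y - ly;

        // allocate local RAM for the tile plus a (K-1)-pixel halo
        __local float in_local[LW*LW];

        // load data into the local RAM; work-items share the load,
        // and pixels past the image edge are read as zero
        const int T = TILE + K - 1;
        for (int j = ly; j < T; j += TILE)
                for (int i = lx; i < T; i += TILE)
                        in_local[j*LW+i] = (x0+i < W && y0+j < H) ? in[(y0+j)*W+(x0+i)] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        float sum = 0.0f;

        // loop over rows
        for (int r = 0; r < K; r++)
        {
                // loop over columns
                for (int c = 0; c < K; c++)
                {
                        sum += filt[r*K+c]*in_local[(ly+r)*LW+(lx+c)];
                }
        }
        out[y*W+x] = sum + pBias;
}

In this block of code, we tag the **filt** argument as **__constant** and let OpenCL load the data into constant memory with the host code shown below. The buffer size is given in bytes, hence the **sizeof(float)** factor. The fixed sizes in the kernel are illustrative assumptions: it expects a **TILE**x**TILE** workgroup, a global size equal to the image size, and $K$ no larger than **MAX_K**.

In [None]:
        d_filter = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, K*K*sizeof(float), h_filter, &err);

The other addition is the allocation, loading, and use of the **in_local** array. Each workgroup preloads its tile of the image, plus the $K-1$ extra rows and columns that its filter windows reach into, into GPU local memory, so the pixel reads inside the filter loops are fast. This matters because each pixel is read up to $K \times K$ times during convolution. Caching can reduce the cost of repeated accesses, but some accesses may still miss the cache and waste cycles. We extract the newly added portion of code below:

In [None]:
        // allocate local RAM for the tile plus a (K-1)-pixel halo
        __local float in_local[LW*LW];

        // load data into the local RAM; work-items share the load,
        // and pixels past the image edge are read as zero
        const int T = TILE + K - 1;
        for (int j = ly; j < T; j += TILE)
                for (int i = lx; i < T; i += TILE)
                        in_local[j*LW+i] = (x0+i < W && y0+j < H) ? in[(y0+j)*W+(x0+i)] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);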

The **barrier** call forces all work-items in the workgroup to synchronize before proceeding to the computation. This is important because every pixel must be loaded before any work-item uses it for filtering. The call combines (1) a barrier and (2) a memory fence (**CLK_LOCAL_MEM_FENCE**), which together guarantee both synchronization and completion of the memory loads. Remember that this only synchronizes work-items inside a single workgroup on a single compute unit.
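To check the device result, it can help to compare it against a plain C reference that uses the same indexing and bias as the **convolve** kernel. The sketch below is one way to do that. The names `convolve_ref` and `buffers_match` are hypothetical, and treating pixels past the right and bottom image edges as zero is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Plain C reference for the convolve kernel: same indexing and bias,
 * computed one output pixel at a time on the CPU.
 * Assumption: input pixels past the right/bottom edge are read as zero.
 */
void convolve_ref(const float *in, const float *filt, float *out,
                  int W, int H, int K, float pBias)
{
    assert(in != NULL && filt != NULL && out != NULL);
    assert(W > 0 && H > 0 && K > 0);

    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            float sum = 0.0f;
            for (int r = 0; r < K; r++) {
                for (int c = 0; c < K; c++) {
                    int gx = x + c;
                    int gy = y + r;
                    if (gx < W && gy < H)
                        sum += filt[r * K + c] * in[gy * W + gx];
                }
            }
            out[y * W + x] = sum + pBias;
        }
    }
}

/* Returns 1 when two buffers agree to within tol, 0 otherwise. */
int buffers_match(const float *a, const float *b, int n, float tol)
{
    for (int i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;
        if (d > tol) return 0;
    }
    return 1;
}
```

The comparison uses a tolerance rather than exact equality because the device may round or order the floating-point sums differently from the CPU loop.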