# Convolution on Multi-core CPUs

When targeting CPUs, the set of available optimizations and opportunities for parallelization is limited. You can split your computation across multiple cores, and you can use the SIMD units in each core to boost performance further. If your CPU has an integrated GPU (for example, the Intel Iris graphics cores), it is sometimes possible to use that GPU through OpenCL. These tightly integrated GPUs share the host memory space and do not need the explicit copies that accelerators otherwise require. A side benefit of using OpenCL here is that the code can later be ported to another, newer OpenCL-compatible platform.

1. SIMD Vectorization:
While most OpenCL compilers auto-vectorize simple code, it is often necessary to use OpenCL vector types explicitly to get high performance. This avoids writing low-level intrinsics and lets the same OpenCL code run on hardware with different SIMD vector widths. The **float4** type is one example of an OpenCL vector type. Apart from changing the type of your operation, you have to rearrange your inputs to fit the vector access pattern. A 2D convolution rewritten for SIMD vectorization looks like the one below:


In [None]:
__kernel void convolve(
        const __global float *in,               // W*H input images
        __constant float *filt,                 // K*K filter kernel
        __global float *out,                    // W*H output images
        const int K,                            // filter resolution
        const float pBias)                      // constant offset/bias
{
        // get image resolution
        const int W = get_global_size(0);
        const int H = get_global_size(1);

        // get pixel position
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        float4 sum = (float4)(0.0f);

        // loop over rows
        for (int r = 0; r < K; r++) 
        {
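                // loop over columns, four at a time (K/4 full float4 groups)
                for (int c4 = 0; c4 < K/4; c4++)
                {
                        // vload4(offset, p) reads 4 floats starting at p + 4*offset
                        float4 filt4 = vload4(c4, filt + r*K);
                        float4 in4 = vload4(c4, in + (y+r)*W + x);
                        sum += filt4*in4;
                }
                // scalar tail for the K % 4 leftover columns
                for (int c = 4*(K/4); c < K; c++)
                {
                        sum.x += filt[r*K+c]*in[(y+r)*W+x+c];
                }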
        }
        out[y*W+x] = sum.x + sum.y + sum.z + sum.w + pBias;
}

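The code above works well for larger filters (K > 5). For smaller filters (K <= 5), each filter row fills at most one float4, so the leftover-element handling is likely to take a larger share of the work, and we may want to avoid vectorization.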
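
To make the chunk-plus-tail pattern in the float4 kernel above easier to check on the host, here is a minimal plain C sketch of one filter row. The function name `row_dot_chunked` and the four-element `lane` array (standing in for a float4 accumulator) are illustrative assumptions, not part of the kernel.

```c
#include <assert.h>

/* Hypothetical helper: dot product of one K-wide filter row with K input
 * pixels, processed the way a float4 kernel would do it. Full groups of
 * four lanes come first, then a scalar loop for the K % 4 leftovers. */
float row_dot_chunked(const float *filt_row, const float *in_row, int K)
{
    assert(K >= 0);

    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f}; /* stands in for a float4 */
    const int K4 = K / 4;                     /* number of full 4-wide groups */

    for (int c4 = 0; c4 < K4; c4++) {
        for (int l = 0; l < 4; l++) {
            lane[l] += filt_row[4 * c4 + l] * in_row[4 * c4 + l];
        }
    }

    /* horizontal add, like sum.x + sum.y + sum.z + sum.w */
    float sum = lane[0] + lane[1] + lane[2] + lane[3];

    /* scalar tail for the elements that do not fill a whole group */
    for (int c = 4 * K4; c < K; c++) {
        sum += filt_row[c] * in_row[c];
    }
    return sum;
}
```

With K = 7, for instance, one group of four is handled by the lane loop and the remaining three elements by the tail loop.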

2. Loop Unrolling:
We can also use loop unrolling to improve performance. Without extra information, a loop is executed as a sequence of iterations. If the programmer (or compiler) can reason about data independence across loop iterations, the iterations can be overlapped or run in parallel. A programmer can give the compiler hints about which loop to unroll and by how much. We first show a simple manually unrolled OpenCL kernel, followed by an equivalent version that uses a compiler hint.

In [None]:
__kernel void convolve(
        const __global float *in,               // W*H input images
        __constant float *filt,                 // K*K filter kernel
        __global float *out,                    // W*H output images
        const int K,                            // filter resolution
        const float pBias)                      // constant offset/bias
{
        // get image resolution
        const int W = get_global_size(0);
        const int H = get_global_size(1);

        // get pixel position
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        float sum = 0;
        int c = 0;

        // loop over rows
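        for (int r = 0; r < K; r++)
        {
                // loop over columns, two per iteration
                for (c = 0; c + 1 < K; c += 2)
                {
                        sum += filt[r*K+c]*in[((y+r)*W+x)+c];
                        sum += filt[r*K+c+1]*in[((y+r)*W+x)+c+1];
                }
                // leftover column when K is odd
                if (c < K)
                {
                        sum += filt[r*K+c]*in[((y+r)*W+x)+c];
                }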
        }
        out[y*W+x] = sum + pBias;
}

In [None]:
__kernel void convolve(
        const __global float *in,               // W*H input images
        __constant float *filt,                 // K*K filter kernel
        __global float *out,                    // W*H output images
        const int K,                            // filter resolution
        const float pBias)                      // constant offset/bias
{
        // get image resolution
        const int W = get_global_size(0);
        const int H = get_global_size(1);

        // get pixel position
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        float sum = 0;
        int c = 0;

        // loop over rows
        for (int r = 0; r < K; r++)
        {
                // loop over columns
                // unroll hint: requires OpenCL C 2.0 or later
                __attribute__((opencl_unroll_hint(2)))
                for (c = 0; c < K; c++)
                {
                        sum += filt[r*K+c]*in[((y+r)*W+x)+c];
                }
        }
        out[y*W+x] = sum + pBias;
}
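
Manual unrolling by two only covers every column when K is even, so an odd K needs one extra step. The plain C sketch below shows that remainder handling for a single filter row. The name `row_dot_unrolled2` is a hypothetical helper for illustration, not part of the kernels above.

```c
#include <assert.h>

/* Hypothetical helper: dot product of one K-wide filter row with K input
 * pixels, with the column loop manually unrolled by a factor of two. */
float row_dot_unrolled2(const float *filt_row, const float *in_row, int K)
{
    assert(K >= 0);

    float sum = 0.0f;
    int c = 0;

    /* two columns per iteration, stopping while a full pair remains */
    for (; c + 1 < K; c += 2) {
        sum += filt_row[c] * in_row[c];
        sum += filt_row[c + 1] * in_row[c + 1];
    }

    /* leftover column when K is odd */
    if (c < K) {
        sum += filt_row[c] * in_row[c];
    }
    return sum;
}
```

Because a single accumulator is updated in the same order as in the rolled loop, this version should give the same floating-point result as the unhinted loop.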

3. Thread-Level Parallelism:
The final aspect of performance tuning is parallelization across cores. OpenCL does not expose this directly, but it can be controlled by careful selection of the **global** work size and the **local** work-group size. When they are equal, there is a single work-group, and all work-items are scheduled onto a single core. The ratio between the two sizes is the number of work-groups, and so indicates the number of threads/cores you can target. The OpenCL runtime can also be configured through **Device Fission** settings to partition the OpenCL device and restrict OpenCL operations to a subset of the available threads/cores.
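
The relationship between the global and local sizes can be worked out on the host before launching a kernel. The sketch below is plain C arithmetic only. The helper names `round_up_global` and `num_work_groups` are assumptions made for illustration, and how work-groups are mapped to cores is left to the OpenCL runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round a problem dimension n up to the next multiple
 * of the local size, so the result can serve as a global work size that the
 * local size divides evenly. */
size_t round_up_global(size_t n, size_t local)
{
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}

/* Hypothetical helper: number of work-groups along one dimension. This is
 * the ratio of global to local size, i.e. how many independent groups the
 * runtime has available to spread over cores. */
size_t num_work_groups(size_t global, size_t local)
{
    assert(local > 0 && global % local == 0);
    return global / local;
}
```

For an image 1000 pixels wide and a local size of 64, the global size rounds up to 1024, giving 16 work-groups along that dimension. The kernel then needs a bounds check for the 24 padding work-items. With a global size of 64 and a local size of 64, there is exactly one work-group.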