# Convolution on Multi-core CPUs

When targeting CPUs, the opportunities for optimization and parallelization are limited. You can split your computation across multiple cores, and you can exploit the SIMD units in each core to boost performance further. If your CPU has an embedded GPU (for example, the Intel Iris graphics cores), it is sometimes possible to use that GPU through the OpenCL environment. These tightly integrated GPUs share the host memory space and do not require the explicit copying that discrete accelerators otherwise need. A side benefit of using OpenCL here is the possibility of porting the code later to another, newer OpenCL-compatible platform.

We start as before, with the necessary PyOpenCL imports.

In [1]:
# Modeled on github.com/BVLC/caffe.git

# __future__ imports must come before any other statement
from __future__ import absolute_import, print_function

# pull the numpy and matplotlib names into the notebook namespace
from pylab import *
from scipy import ndimage, misc, signal
import numpy as np
import matplotlib.pyplot as plt
import pyopencl as cl

# display plots in this notebook
%matplotlib inline
# set display defaults
plt.rcParams['figure.figsize'] = (5, 5)        # medium images
plt.rcParams['image.interpolation'] = 'nearest'  # don't interpolate: show square pixels
plt.rcParams['image.cmap'] = 'gray'  # use grayscale output rather than a (potentially misleading) color heatmap

# load the IPython extension for PyOpenCL (provides %%cl_kernel)
%load_ext pyopencl.ipython_ext

ctx = cl.create_some_context(interactive=True)
queue = cl.CommandQueue(ctx)

Choose platform:
[0] <pyopencl.Platform 'Apple' at 0x7fff0000>
Choice [0]:0
Choose device(s):
[0] <pyopencl.Device 'Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz' on 'Apple' at 0xffffffff>
[1] <pyopencl.Device 'AMD Radeon R9 M395 Compute Engine' on 'Apple' at 0x1021c00>
Choice, comma-separated [0]:0
Set the environment variable PYOPENCL_CTX='0:0' to avoid being asked again.
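
The prompt can be skipped on later runs by setting that variable from Python before the context is created. This is a minimal sketch, assuming that '0:0' (platform 0, device 0, as chosen in the session above) still selects the CPU on your machine; the ordering of platforms and devices may differ elsewhere.

```python
import os

# Assumption: platform 0 / device 0 is the CPU, as in the session above.
# Adjust the string if your platform or device ordering differs.
os.environ["PYOPENCL_CTX"] = "0:0"

# cl.create_some_context() consults PYOPENCL_CTX when it picks a device,
# so this assignment has to run before that call.
print(os.environ["PYOPENCL_CTX"])
```

Setting the variable in the shell that launches Jupyter has the same effect and keeps the choice out of the notebook itself.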


## 1. SIMD Vectorization
While most OpenCL compilers will auto-vectorize simple code, it is often necessary to use OpenCL vector types explicitly to get high performance. This avoids writing low-level intrinsics and lets the same OpenCL code run on hardware with different SIMD vector widths. The **float4** type is one example of an OpenCL vector type. Apart from changing the type of your operation, you have to rearrange your inputs to fit the vector access pattern. A 2D convolution rewritten for SIMD vectorization looks like the one below:


In [2]:
%%cl_kernel

__kernel void convolve2D_vector(
        const __global float *in,               // W*H input image
        const __global float *filt,             // K*K filter kernel
        __global float *out)                    // W*H output image
{
        // get image resolution
        const int W = get_global_size(0);
        const int H = get_global_size(1);
        const int K = 3;

        // get pixel position
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        // skip border pixels whose K*K window would read outside the image
        if (x + K > W || y + K > H)
        {
                out[y*W+x] = 0.0f;
                return;
        }

        float4 sum4 = (float4)(0.0f);
        float sum = 0.0f;

        // loop over rows
        for (int r = 0; r < K; r++)
        {
                // vload4 needs pointers that carry the __global address space
                const __global float *filt_ptr = filt + r*K;
                const __global float *in_ptr = in + (y+r)*W + x;

                // loop over columns, four at a time
                // (with K = 3 only the scalar tail runs; this loop takes over once K >= 4)
                int c = 0;
                for (; c + 4 <= K; c += 4)
                {
                        float4 filt4 = vload4(0, filt_ptr + c);
                        float4 in4 = vload4(0, in_ptr + c);
                        sum4 += filt4*in4;
                }
                // scalar tail for the last K % 4 columns
                for (; c < K; c++)
                        sum += filt_ptr[c]*in_ptr[c];
        }
        out[y*W+x] = sum + sum4.x + sum4.y + sum4.z + sum4.w;
}

Note that **vload4** takes a pointer as its second argument. A version of the kernel that passes array elements (filt[r*K] and in[(y+r)*W+x]) instead fails to build, as shown below. The compiler goes on to list every **vload4** overload as a non-viable candidate for each of the two calls; only the note for the matching overload is reproduced here.

RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE - 

Build on <pyopencl.Device 'Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz' on 'Apple' at 0xffffffff>:

<program source>:25:40: error: no matching function for call to 'vload4'
                        float4 filt4 = vload4(c4,filt[r*K]);
                                       ^~~~~~
/System/Library/Frameworks/OpenCL.framework/Versions/A/lib/clang/3.2/include/cl_kernel.h:2287:24: note: candidate function not viable: no known conversion from '__global float' to 'const __global float *' for 2nd argument; take the address of the argument with &
float4    __OVERLOAD__ vload4(size_t index, const __global float *p);
                       ^
<program source>:26:38: error: no matching function for call to 'vload4'
                        float4 in4 = vload4(c4,in[(y+r)*W+x]);
                                     ^~~~~~

(options: -I /usr/local/lib/python3.5/site-packages/pyopencl-2016.2-py3.5-macosx-10.11-x86_64.egg/pyopencl/cl)
(source saved as /var/folders/z6/mmcp9bn539j7zmr2cgb6y99m0000gn/T/tmp_wvhntw9.cl)

Explicit vectorization pays off for larger filters (K > 5). For small filters (K <= 5) we may want to avoid it, since few, if any, full four-wide loads fit in a filter row. Compared with the scalar kernel, we modified the loop over c to advance four columns at a time and replaced the body of the loop with a vector operation. Neighbouring work items still load overlapping input pixels, so there are redundant data loads, but we expect (hope) that caching reduces off-chip memory traffic. Vectorized loads are a more efficient use of memory bandwidth, as they permit coalesced access to contiguous values. We then run the code as before:

In [None]:
from scipy import ndimage, misc, signal

# newer SciPy releases provide this test image as scipy.datasets.ascent()
f = misc.ascent()
# the kernel reads and writes 32-bit floats, so convert the host arrays
in_np = f.astype(np.float32)

filt_np = np.array([[1,2,1],[2,4,2],[1,2,1]], dtype=np.float32)

mf = cl.mem_flags
in_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=in_np)
filt_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=filt_np)
out_g = cl.Buffer(ctx, mf.WRITE_ONLY, in_np.nbytes)

# Run OpenCL convolve2D function
print(in_np.shape)
# NumPy shapes are (H, W), while the kernel treats global size as (W, H)
convolve2D_vector(queue, in_np.shape[::-1], None, in_g, filt_g, out_g)

out_np = np.empty_like(in_np)
cl.enqueue_copy(queue, out_np, out_g)

fig, (a, b) = plt.subplots(1, 2)
a.imshow(in_np, cmap='gray')
b.imshow(out_np, cmap='gray')
plt.show() # apparently, plt.show() works better than fig.show() in Jupyter
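
To make the four-at-a-time accumulation concrete, the sketch below emulates it in plain Python and checks it against an ordinary scalar loop. It is an illustration rather than the kernel itself: the function names, the 5x5 filter, and the 8x8 image are all hypothetical, and a Python list of four partial sums stands in for the float4 accumulator.

```python
def conv_pixel_scalar(img, filt, x, y):
    """Plain reference: multiply-accumulate over the K*K window at (x, y)."""
    K = len(filt)
    total = 0.0
    for r in range(K):
        for c in range(K):
            total += filt[r][c] * img[y + r][x + c]
    return total


def conv_pixel_vec4(img, filt, x, y):
    """Same sum, accumulated four columns at a time as a float4 would."""
    K = len(filt)
    lanes = [0.0, 0.0, 0.0, 0.0]    # stands in for the float4 accumulator
    tail = 0.0
    for r in range(K):
        c = 0
        while c + 4 <= K:           # full four-wide chunks
            for lane in range(4):
                lanes[lane] += filt[r][c + lane] * img[y + r][x + c + lane]
            c += 4
        while c < K:                # leftover K % 4 columns
            tail += filt[r][c] * img[y + r][x + c]
            c += 1
    return sum(lanes) + tail        # horizontal add, like sum.x + ... + sum.w


# hypothetical 5x5 filter and 8x8 image with small integer values
K = 5
filt = [[float(r + c) for c in range(K)] for r in range(K)]
img = [[float((3 * y + x) % 7) for x in range(8)] for y in range(8)]

# only positions whose window fits inside the image
scalar = [[conv_pixel_scalar(img, filt, x, y) for x in range(4)] for y in range(4)]
vec4 = [[conv_pixel_vec4(img, filt, x, y) for x in range(4)] for y in range(4)]
print(scalar == vec4)
```

The same per-pixel reference can be compared against out_np away from the right and bottom borders, keeping in mind that the kernel accumulates in float32 and may differ in the last few bits.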

## 2. Loop Unrolling
We can also use loop unrolling to improve performance. Without extra information, the iterations of a loop execute sequentially. If the programmer (or compiler) can establish that the iterations are independent of one another, they can be overlapped or run in parallel. A programmer can give the compiler hints about which loop to unroll and by how much. The kernel below is unrolled by hand; the commented-out attribute marks where the equivalent compiler hint, available from OpenCL 2.0, would go.

In [1]:
%%cl_kernel

__kernel void convolve2D_unroll(
        const __global float *in,               // W*H input image
        __constant float *filt,                 // K*K filter kernel
        __global float *out)                    // W*H output image
{
        // get image resolution
        const int W = get_global_size(0);
        const int H = get_global_size(1);
        const int K = 3;

        // get pixel position
        const int x = get_global_id(0);
        const int y = get_global_id(1);

        // skip border pixels whose K*K window would read outside the image
        if (x + K > W || y + K > H)
        {
                out[y*W+x] = 0.0f;
                return;
        }

        float sum = 0.0f;

        // loop over rows
        for (int r = 0; r < K; r++)
        {
                int c = 0;
                // loop over columns, two per iteration
                // only in OpenCL 2.0: __attribute__((opencl_unroll_hint(2))) on a plain c++ loop
                for (c = 0; c + 1 < K; c += 2)
                {
                        // manually unrolled
                        sum += filt[r*K+c]*in[((y+r)*W+x)+c];
                        sum += filt[r*K+c+1]*in[((y+r)*W+x)+c+1];
                }
                // leftover column when K is odd
                if (c < K)
                        sum += filt[r*K+c]*in[((y+r)*W+x)+c];
        }
        out[y*W+x] = sum;
}

As the kernel above shows, we have a choice between (1) verbose, manual unrolling and (2) the automated, compiler-driven option. Which one would you choose? We run the code as before:

In [None]:
from scipy import ndimage, misc, signal

# newer SciPy releases provide this test image as scipy.datasets.ascent()
f = misc.ascent()
# the kernel reads and writes 32-bit floats, so convert the host arrays
in_np = f.astype(np.float32)

filt_np = np.array([[1,2,1],[2,4,2],[1,2,1]], dtype=np.float32)

mf = cl.mem_flags
in_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=in_np)
filt_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=filt_np)
out_g = cl.Buffer(ctx, mf.WRITE_ONLY, in_np.nbytes)

# Run OpenCL convolve2D function
print(in_np.shape)
# NumPy shapes are (H, W), while the kernel treats global size as (W, H)
convolve2D_unroll(queue, in_np.shape[::-1], None, in_g, filt_g, out_g)

out_np = np.empty_like(in_np)
cl.enqueue_copy(queue, out_np, out_g)

fig, (a, b) = plt.subplots(1, 2)
a.imshow(in_np, cmap='gray')
b.imshow(out_np, cmap='gray')
plt.show() # apparently, plt.show() works better than fig.show() in Jupyter
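
One detail of manual unrolling is easy to miss: when the unroll factor does not divide the trip count (a factor of 2 with K = 3, for instance), the unrolled body must stop early and a remainder step has to pick up the leftover iteration. The sketch below shows the pattern on a single filter row in plain Python; the function names and the pixel values are hypothetical.

```python
def dot_rolled(a, b):
    """One multiply-accumulate per iteration."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled2(a, b):
    """Two multiply-accumulates per iteration, plus a remainder step."""
    n = len(a)
    total = 0.0
    i = 0
    while i + 1 < n:                 # unrolled body covers whole pairs only
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        i += 2
    if i < n:                        # odd length: one element is left over
        total += a[i] * b[i]
    return total


filt_row = [1.0, 2.0, 1.0]           # one row of the 3x3 filter used above
pixels = [10.0, 20.0, 30.0]          # hypothetical input pixels
print(dot_rolled(filt_row, pixels), dot_unrolled2(filt_row, pixels))
```

Without the remainder step, an unrolled loop over an odd-length row would either skip the last column or read one element past the end of the row.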

## 3. Thread-Level Parallelism
The final aspect of performance tuning is parallelization across cores. OpenCL does not expose this directly, but it can be steered through a careful choice of the **global** work size and the **local** (work-group) size. When the two are equal, there is a single work-group, and all of its work items are scheduled onto a single core. More generally, the ratio of global to local size gives the number of work-groups, and hence the number of threads/cores you can target. The OpenCL runtime can also be configured through **Device Fission** settings, which partition the OpenCL device and restrict OpenCL operations to a subset of the available threads/cores.
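
The relationship between the two sizes is simple arithmetic, sketched below in plain Python. The helper name is hypothetical, and how work-groups are actually mapped onto cores is up to the OpenCL runtime, so the count is best read as an upper bound on the parallelism you are asking for. Local sizes are also capped by the device's maximum work-group size, which this sketch does not check.

```python
def num_work_groups(global_size, local_size):
    """Return (work-groups per dimension, total work-groups) for an NDRange."""
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes need the same number of dimensions")
    per_dim = []
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            # OpenCL 1.x requires the local size to divide the global size
            raise ValueError("local size %d does not divide global size %d" % (l, g))
        per_dim.append(g // l)
    total = 1
    for n in per_dim:
        total *= n
    return per_dim, total


global_size = (512, 512)                            # one work item per pixel
print(num_work_groups(global_size, (512, 512)))     # equal sizes: a single work-group
print(num_work_groups(global_size, (16, 16)))       # many small work-groups
```

In the cells above the local size is passed as None, which leaves the choice to the runtime; a tuple such as (16, 16) could be passed in its place to control the split explicitly.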