# Speeding up kernels with vectors

Most OpenCL devices, including CPUs and AMD GPUs, can apply a single floating point instruction to several values at once, which improves performance. We can take advantage of this by grouping data into vectors so that each math operation processes several elements.

## Vector types

Recall the vector floating point data types that we covered in the basic OpenCL application notes. They are repeated below.

|OpenCL 32-bit floating point data types|Explanation|
|:--|:--|
|cl_float|4 byte (32-bit) single-precision floating point|
|cl_float2|8 byte (64-bit) 2-component single-precision floating point vector|
|cl_float4|16 byte (128-bit) 4-component single-precision floating point vector|
|cl_float8|32 byte (256-bit) 8-component single-precision floating point vector|
|cl_float16|64 byte (512-bit) 16-component single-precision floating point vector|

|OpenCL 64-bit floating point data types|Explanation|
|:--|:--|
|cl_double|8 byte (64-bit) double-precision floating point|
|cl_double2|16 byte (128-bit) 2-component double-precision floating point vector|
|cl_double4|32 byte (256-bit) 4-component double-precision floating point vector|
|cl_double8|64 byte (512-bit) 8-component double-precision floating point vector|
|cl_double16|128 byte (1024-bit) 16-component double-precision floating point vector|

If the following optional pragma is included in the kernel code

```C++
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
```

then you also have access to 16-bit half-precision floating point data types, [see here for details](https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/cl_khr_fp16.html).

|OpenCL 16-bit floating point data types|Explanation|
|:--|:--|
|cl_half|2 byte (16-bit) half-precision floating point|
|cl_half2|4 byte (32-bit) 2-component half-precision floating point vector|
|cl_half4|8 byte (64-bit) 4-component half-precision floating point vector|
|cl_half8|16 byte (128-bit) 8-component half-precision floating point vector|
|cl_half16|32 byte (256-bit) 16-component half-precision floating point vector|

## Using vector data types in kernels

In the code [mat_mult_transpose_vector.cpp](code/mat_mult_transpose_vector.cpp) we have further modified the transpose routine so that it uses the **float8** data type in the kernel called **mat_mult_transp_vector**. The code for that kernel is below.

```C
// special matrix multiply kernel that uses a pre-transposed matrix A and float8 vectors
__kernel void mat_mult_transp_vector ( __global float8* A_transp, 
                                __global float8* B, 
                                __global float* C, 
                                int nrows_A_transp, 
                                int nrows_B, 
                                int nrows_C) { 
    // i0 and i1 represent the coordinates in C 
    // We assume Fortran ordering for the matrices 
    size_t i0=get_global_id(0); 
    size_t i1=get_global_id(1); 
    size_t offset_A=i0*nrows_A_transp; 
    size_t offset_B=i1*nrows_B; 
    float8 temp=(float8)(0.0f); 
    // For every coordinate in C, loop over the related rows of A_transp and B 
    for (int n=0; n<nrows_B; n++) { 
        // Every column of A_transp corresponds to a row of C 
        // Every column of B corresponds to a column of C 
        // C has the same number of rows as A_transp, and the same number of columns as B
        // i0 is the column index of A_transp 
        // i1 is the column index of B 
        temp+=A_transp[offset_A+n]*B[offset_B+n]; 
    } 
    // Access components of a vector 
    C[i1*nrows_C+i0]=temp.s0+temp.s1+temp.s2+temp.s3+temp.s4+temp.s5+temp.s6+temp.s7; 
} 
```

Notice that we have replaced **float** with **float8** when referring to matrices **A_transp** and **B**. The kernel then regards the memory passed in through **A_transp** and **B** as an array of **float8** vectors. To avoid out-of-bounds memory access, the memory for those two matrices must be allocated in multiples of 8 floats x 4 bytes per float = 32 bytes, and each column must contain a whole number of 8-float vectors. Both conditions already hold here because the matrices are of size (1024,1024). You can also avoid checking for this by allocating host memory using the **cl_float8** data type, which corresponds to **float8** in the kernel.
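To make the lane-wise accumulation concrete, here is a minimal host-side sketch in plain C that emulates what the kernel does for one element of **C**. The type `float8_emul` and the function `dot_vectorised` are hypothetical names introduced only for this illustration; they are not part of the course code or of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_LENGTH 8

/* Hypothetical host-side stand-in for the kernel's float8 type. */
typedef struct { float s[VECTOR_LENGTH]; } float8_emul;

/* Dot product of two columns of n floats, accumulated in 8 lanes the way
   the kernel accumulates into temp, then summed across lanes at the end. */
float dot_vectorised(const float *a, const float *b, size_t n) {
    /* Each column must hold a whole number of 8-float vectors. */
    assert(n % VECTOR_LENGTH == 0);
    float8_emul temp = {{0.0f}};
    size_t nvec = n / VECTOR_LENGTH; /* loop count in float8 units */
    for (size_t v = 0; v < nvec; v++) {
        for (size_t lane = 0; lane < VECTOR_LENGTH; lane++) {
            size_t i = v * VECTOR_LENGTH + lane;
            temp.s[lane] += a[i] * b[i];
        }
    }
    /* Horizontal sum, the equivalent of temp.s0 + temp.s1 + ... + temp.s7 */
    float sum = 0.0f;
    for (size_t lane = 0; lane < VECTOR_LENGTH; lane++) {
        sum += temp.s[lane];
    }
    return sum;
}
```

On a real device the inner loop over lanes is what the hardware may execute as one vector instruction; the sketch only shows the arithmetic, not the performance benefit.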

## Accessing elements of a vector

You may have noticed that the last line of the code has some strange looking **.s\*** syntax, such as **temp.s0**. This is just notation to access the components of the vector **temp**, which is of type float8. For all vector data types, individual elements at indices 0-9 may be accessed using **.s[0-9]** notation (e.g. temp.s7), and elements at indices 10-15 may be accessed using the hexadecimal digits **.s[a-f]**. For vector types with four or fewer elements you may also use the component names **.xyzw** (e.g. temp.x).
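The component suffix after **.s** is simply a hexadecimal digit. As an illustration, the following plain C sketch maps a suffix character to the index it selects. The function `component_index` is a hypothetical helper written for this note, not an OpenCL API.

```c
/* Hypothetical helper: map the character that follows ".s" to the
   component index it selects. Returns -1 if the character is not a
   valid selector for a vector of up to 16 components. */
int component_index(char c) {
    if (c >= '0' && c <= '9') return c - '0';        /* .s0 to .s9 */
    if (c >= 'a' && c <= 'f') return 10 + (c - 'a'); /* .sa to .sf */
    if (c >= 'A' && c <= 'F') return 10 + (c - 'A'); /* .sA to .sF */
    return -1;
}
```

For example, `component_index('7')` gives 7 (as in temp.s7), `component_index('f')` gives 15, the last component of a 16-component vector, and `component_index('g')` gives -1 because there is no component **.sg**.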

## Other considerations

When setting up the kernel arguments **nrows_A_transp** and **nrows_B**, we must remember that the kernel now counts rows in units of **float8**, so these values are 8 times smaller than they would be if we were just using the **float** data type in the kernel. To allow for this, the arguments to the kernel now look like this.

```C
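cl_int vector_length=8;
// Row counts measured in float8 vectors rather than floats
// (assumes the host variable nrows_A_transp holds the row count in floats)
cl_int vectorised_nrows_A_transp=nrows_A_transp/vector_length;
cl_int vectorised_nrows_B=nrows_B/vector_length;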

// Set arguments for the multiply kernel with transpose
errchk(clSetKernelArg(kernel_mat_mult_transp_vector, 0, sizeof(cl_mem), &buffer_A_transp ),"setting \
mat_mult_transp_vector argument 0");
errchk(clSetKernelArg(kernel_mat_mult_transp_vector, 1, sizeof(cl_mem), &buffer_B ),"setting kernel \
mat_mult_transp_vector argument 1");
errchk(clSetKernelArg(kernel_mat_mult_transp_vector, 2, sizeof(cl_mem), &buffer_C ),"setting kernel \
mat_mult_transp_vector argument 2");
errchk(clSetKernelArg(kernel_mat_mult_transp_vector, 3, sizeof(int), &vectorised_nrows_A_transp ),"setting \
mat_mult_transp_vector argument 3");
errchk(clSetKernelArg(kernel_mat_mult_transp_vector, 4, sizeof(int), &vectorised_nrows_B ),"setting \
mat_mult_transp_vector argument 4");
errchk(clSetKernelArg(kernel_mat_mult_transp_vector, 5, sizeof(int), &nrows_C ),"setting \
mat_mult_transp_vector argument 5");
```
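The integer division used to compute the vectorised row counts silently truncates if a row count is not a multiple of the vector length. A minimal sketch of a guard for this, in plain C, is shown below. The function `vectorised_length` is a hypothetical helper and does not appear in the course code.

```c
#include <assert.h>

/* Hypothetical helper: convert a length measured in floats to a length
   measured in vectors. Returns -1 when the length is negative or is not
   a whole number of vectors, so the caller can pad or fall back to a
   scalar kernel. */
int vectorised_length(int nfloats, int vector_length) {
    assert(vector_length > 0);
    if (nfloats < 0 || nfloats % vector_length != 0) {
        return -1;
    }
    return nfloats / vector_length;
}
```

With the (1024,1024) matrices used here, `vectorised_length(1024, 8)` returns 128, which is the value the kernel should receive for its row count arguments.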

If we run the code, we see that on CPUs it may provide a speedup over the standard matrix multiplication code.

!cd code; ./mat_mult_transpose_vector

Platform 0: Experimental OpenCL 2.1 CPU Only Platform, vendor: Intel(R) Corporation, version OpenCL 2.1 LINUX
Platform 1: NVIDIA CUDA, vendor: NVIDIA Corporation, version OpenCL 1.2 CUDA 9.1.83
Platform 0 has 1 devices
Platform 1 has 1 devices
Matrix transpose took 4.019764 ms
Standard matrix multiply took 109.911231 ms
Transposed matrix multiply took 57.124385 ms
Transposed and vectorised matrix multiply took 22.131199 ms
Transposed approach resulted in a speedup of 1.797576x
Transposed and vectorised approach resulted in a speedup of 4.202952x
RMS difference is 3.46211e-05
Elapsed time is 0.981697seconds


In my case with the CPU platform, the transposed and vectorised code attained a speedup of about 4.2x over the standard matrix multiplication algorithm, and the vectorised kernel itself ran about 2.6x faster than the transposed kernel (22.1 ms versus 57.1 ms). This is because the Intel OpenCL platform compiles the OpenCL vector code to vector CPU instructions.
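The speedup figures printed in the output above are consistent with counting the transpose time as part of each transposed method's cost. That is an inference from the printed timings, not something taken from the program source. Under that assumption, a short C sketch reproduces them; the function name `speedup_with_transpose` is hypothetical.

```c
/* Speedup of a method relative to the standard multiply, counting the
   transpose time as part of the method's cost. This accounting is an
   assumption inferred from the printed timings. All times are in ms. */
double speedup_with_transpose(double t_standard_ms,
                              double t_transpose_ms,
                              double t_method_ms) {
    return t_standard_ms / (t_transpose_ms + t_method_ms);
}
```

Calling `speedup_with_transpose(109.911231, 4.019764, 57.124385)` gives approximately 1.7976, and `speedup_with_transpose(109.911231, 4.019764, 22.131199)` gives approximately 4.2030, in agreement with the two speedup lines in the output.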

<address>
&copy; 2018 by Dr. Toby Potter<br>
email: <a href="mailto:tobympotter@gmail.com">tobympotter@gmail.com</a><br>
Visit us at: <a href="https://www.pelagos-consulting.com">www.pelagos-consulting.com</a><br>
</address>