# Exercise 3 - Templating
This notebook shows how to autogenerate C kernels that are specialized for a user-requested platform. With this approach, a Python script can take a user input and, depending on it, launch the C code on either a CPU or a GPU. This is useful because the kernel template has to be written only once, without knowing which platform a future user will want to run the software on. As you will see, this is much simpler than it sounds.

In the example in the cell below, the kernel is a simple function that doubles the elements of a 1D vector. Notice the extra comments added to the string:
- `/*begin_gpukern*/`
- `/*gpuglmem*/`
- `//begin_parallel i n`
- `//end_parallel`
- `/*end_gpukern*/`


These annotations are markers that we want to replace with extra lines specific to the context. The context can be the CPU, or a GPU driven by one of the parallel programming models CUDA or OpenCL. In a GPU context we want to automatically turn our serial C function into parallelized code and add the relevant kernel qualifiers.
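
As a minimal sketch of the replacement idea, the snippet below swaps the `/*begin_gpukern*/` marker for a context-specific qualifier, using the same mapping that appears in the specialization function later in this notebook. The one-line `template` string and the names `qualifiers` and `specialized` are made up for illustration only.

```python
# Illustrative sketch: replace one annotation with a context-specific qualifier.
# `template`, `qualifiers` and `specialized` are hypothetical names for this example.
template = "/*begin_gpukern*/ void elementwise(int n);"

# Same mapping as the one used for /*begin_gpukern*/ in this notebook
qualifiers = {
    "cpu": " ",
    "gpu_opencl": " __kernel ",
    "gpu_cuda": 'extern "C"{\n__global__',
}

# Build one specialized string per context with plain str.replace
specialized = {
    context: template.replace("/*begin_gpukern*/", qualifier)
    for context, qualifier in qualifiers.items()
}

for context, text in specialized.items():
    print(f"--- {context} ---")
    print(text)
```

Note that the CUDA qualifier opens an `extern "C"` block, so a later replacement has to account for that.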

In [None]:
# the function signature
source_sig = """void elementwise(int, const double*, double*);"""

# the function source
source_str = r"""
/*begin_gpukern*/
void elementwise(int n, 
    /*gpuglmem*/ const double* x, 
    /*gpuglmem*/ double* y)
{
  for(int i=0; i<n; i++){//begin_parallel i n
    y[i] = 2 * x[i];
  }//end_parallel
}
/*end_gpukern*/
"""

The `specialize_source()` function below fine-tunes the above kernel for the requested context by replacing the annotations with the appropriate syntax (kernel qualifier, bracket, etc.).

__TODO__: try to understand what the function does. Find the lines where each of the 5 annotations listed above is handled. Complete the missing parts of the function, marked by __### your code here ###__, so that they replace the annotations in the source string above with the context-specific syntax. You can use `new_lines.append("mystring")` in the first part of the function and a simple string in the second part. The replacement of most annotations is already written, so you can take inspiration from that.
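
To help with reading the function, here is a small sketch of how the `//begin_parallel i n` annotation is parsed. It isolates the `split` expression used in the function and applies it to the annotated loop line from the kernel source; the variable names `line` and `header` are chosen for this example only.

```python
# Illustrative sketch: how the loop variable and limit are read from the annotation.
line = "  for(int i=0; i<n; i++){//begin_parallel i n"

# Everything after the marker is " i n"; split() without arguments
# breaks it on whitespace into the two names.
varname, limname = line.split("//begin_parallel")[-1].split()
print(varname, limname)  # prints: i n

# The names are then interpolated into context-specific C code,
# here the CPU loop header built in the function below.
header = f"for (int {varname}=0; {varname}<{limname}; {varname}++)" + "{"
print(header)  # prints: for (int i=0; i<n; i++){
```

Because the whole annotated line is discarded and regenerated, any brace that was on that line has to be emitted again by the replacement if the context needs it.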

In [None]:
def specialize_source(source_str, specialize_for):
    new_lines = []  # here we collect the new kernel lines to be inserted
    
    # loop over the lines of the source string
    for ll in source_str.splitlines():
        
        # fine tune the for loop
        if "//begin_parallel" in ll:
            varname, limname = ll.split("//begin_parallel")[-1].split()
    
            if specialize_for == "cpu":
                new_lines.append(
                    f"for (int {varname}=0; {varname}<{limname}; {varname}++)"
                    + "{\n"
                )
                               
            elif specialize_for == "gpu_opencl":
                new_lines.append(f"int {varname};\n")
                new_lines.append(
                    f"{varname}=get_global_id(0);\n"
                )
                
            elif specialize_for == "gpu_cuda":
                new_lines.append(f"int {varname};\n")
                new_lines.append(
                    f"{varname}=blockDim.x * blockIdx.x + threadIdx.x;\n"
                    f"if ({varname}<{limname})" + "{"
                )
                
        elif "//end_parallel" in ll:
            if specialize_for == "cpu":
                ### your code here ###
            elif specialize_for == "gpu_opencl":
                ### your code here ###
            elif specialize_for == "gpu_cuda":
                ### your code here ###
                
        else:
            new_lines.append(ll)
            
    new_source_src = "\n".join(new_lines)
    
    # fine tune kernel qualifiers
    new_source_src = new_source_src.replace(
        "/*begin_gpukern*/",
        {
            "cpu": " ",
            "gpu_opencl": " __kernel ",
            "gpu_cuda": "extern \"C\"{\n__global__",
        }[specialize_for],
    )
    
    new_source_src = new_source_src.replace(
        "/*end_gpukern*/",
        {
            "cpu": ### your code here ###
            "gpu_opencl": ### your code here ###
            "gpu_cuda": ### your code here ###
        }[specialize_for],
    )
    
    new_source_src = new_source_src.replace(
        "/*gpuglmem*/",
        {
            "cpu": ### your code here ###
            "gpu_opencl": ### your code here ###
            "gpu_cuda": ### your code here ###
        }[specialize_for],
    )
    return new_source_src

Next, we create some input data.

In [None]:
import numpy as np
n = 10
x = np.random.randn(n)

Now we specify the desired context: `cpu`, `gpu_cuda` or `gpu_opencl`.

In [None]:
specialize_for = "gpu_cuda"

Use our function to create the context-specific kernel.

In [None]:
new_source_str = specialize_source(source_str, specialize_for)
print(new_source_str)

Test the kernel. If everything is fine, the string "passed" should be printed.
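
The CUDA branch of the test cell below launches `ceil(n / blocksize)` blocks of `blocksize` threads. The following sketch, written in plain Python with a hypothetical helper `launched_indices`, lists the global indices such a launch produces. It is meant to show why the generated CUDA code guards the loop body with `if (i<n)`.

```python
# Illustrative sketch: which global indices a CUDA-style launch produces.
def launched_indices(n, blocksize):
    """Return the global thread indices created for n elements (hypothetical helper)."""
    n_blocks = (n + blocksize - 1) // blocksize  # integer version of ceil(n / blocksize)
    return [
        block * blocksize + thread  # mirrors blockDim.x * blockIdx.x + threadIdx.x
        for block in range(n_blocks)
        for thread in range(blocksize)
    ]

n = 10
for blocksize in (1, 4):
    indices = launched_indices(n, blocksize)
    out_of_range = [i for i in indices if i >= n]
    print(f"blocksize={blocksize}: {len(indices)} threads, out of range: {out_of_range}")
```

With `blocksize = 1`, as in the test cell, exactly `n` threads are launched and the guard never triggers. With `blocksize = 4` and `n = 10`, three blocks give 12 threads, and indices 10 and 11 must be skipped.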

In [None]:
if specialize_for == "cpu":
    import cffi
    print("using cffi")
    
    ffi_interface = cffi.FFI()
    ffi_interface.cdef(source_sig)
    ffi_interface.set_source("_exercise_3", new_source_str)
    ffi_interface.compile(verbose=True)
    
    from _exercise_3 import ffi, lib
    
    x_cffi = ffi.cast( "double *", ffi.from_buffer(x))
    y_cffi = ffi.new("double[]", len(x))
    # use the actual array length instead of a hard-coded 10
    lib.elementwise(len(x), x_cffi, y_cffi)
    y = ffi.unpack(y_cffi, len(x))
    
    assert np.allclose(x*2, y)
    print("passed")

elif specialize_for == "gpu_cuda":
    import cupy as cp
    print("using cupy")
    
    module = cp.RawModule(code=new_source_str)
    elementwise_kernel = module.get_function("elementwise")
    
    x_gpu = cp.array(x)
    y_gpu = cp.zeros_like(x_gpu)
    
    blocksize = 1
    n_blocks = int(np.ceil(len(x_gpu) / blocksize))
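    # the kernel declares `n` as a 32-bit int; a plain Python int would be passed as a 64-bit integer
    elementwise_kernel(grid=(n_blocks,), block=(blocksize,), args=(np.int32(len(x)), x_gpu, y_gpu))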
    y_cpu = y_gpu.get()
    assert np.allclose(x*2, y_cpu)
    print("passed")
    
elif specialize_for == "gpu_opencl":
    import pyopencl as cl
    from pyopencl import array as cl_array
    print("using pyopencl")

    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    
    prg = cl.Program(ctx, new_source_str).build()
    
    x_gpu = cl_array.to_device(queue, x)
    y_gpu = cl_array.zeros_like(x_gpu)

    grid_size = len(x)
    workgroup_size = 1
    prg.elementwise(queue, (grid_size,), (workgroup_size,), np.int32(len(x)), x_gpu.data, y_gpu.data)
    y_cpu = y_gpu.get()
    assert np.allclose(x*2, y_cpu)
    print("passed")


Now imagine that we have a large software framework (e.g. a particle tracking framework) with the numerically heavy parts written in C and the rest written in Python. The framework will be used by many scientists on different computers, some with GPUs, some with only CPUs. Using CFFI, CuPy and PyOpenCL we can design the framework to be "multiplatform", meaning that we write the simulation code only once, as template code in C with annotations similar to those in this exercise. In Python we can then easily design wrapper classes or functions that specialize the C code for the platform the user selects, depending on their needs and available computing resources.
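
One possible shape for such a wrapper is sketched below. Everything in it is hypothetical: the class `KernelTemplate`, the marker `/*qualifier*/` and the stand-in `toy_specializer` are invented for illustration, and a real framework would plug in a full specializer like the one from this exercise and add the compile and launch steps for each backend.

```python
# Illustrative sketch of a platform-dispatching wrapper; all names are hypothetical.
SUPPORTED_CONTEXTS = ("cpu", "gpu_cuda", "gpu_opencl")


class KernelTemplate:
    """Holds annotated C source and caches one specialized version per context."""

    def __init__(self, source, specializer):
        self.source = source
        self.specializer = specializer  # callable taking (source, context)
        self._cache = {}

    def specialized_for(self, context):
        # Reject unknown platform names early with a clear message
        if context not in SUPPORTED_CONTEXTS:
            raise ValueError(
                f"unknown context {context!r}; expected one of {SUPPORTED_CONTEXTS}"
            )
        # Specialize lazily and reuse the result on later requests
        if context not in self._cache:
            self._cache[context] = self.specializer(self.source, context)
        return self._cache[context]


# Stand-in specializer so the sketch runs on its own: it handles a single made-up marker.
def toy_specializer(source, context):
    qualifier = {"cpu": "", "gpu_opencl": "__kernel ", "gpu_cuda": "__global__ "}[context]
    return source.replace("/*qualifier*/", qualifier)


template = KernelTemplate("/*qualifier*/void scale(int n);", toy_specializer)
print(template.specialized_for("gpu_opencl"))  # prints: __kernel void scale(int n);
```

Keeping the specializer as a plain callable means the same wrapper can be reused for any set of annotations, and the cache avoids regenerating the source each time a kernel is requested.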