# PyCUDA: CUDA kernels

Required imports.

In [29]:
import numpy as np
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.compiler import SourceModule

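Using PyCUDA's `SourceModule`, kernels can be written directly in CUDA C. The following kernel takes an input array and writes each element, multiplied by the given factor, to an output array. Note that it performs no bounds check, so it must be launched with exactly one thread per array element.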

In [69]:
kernel = SourceModule(
    '''
    __global__ void scale(float *out_vec, float *in_vec, const float factor) {
        int i = threadIdx.x + blockDim.x*blockIdx.x;
        out_vec[i] = factor*in_vec[i];
    }
    '''
)
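To see how the index expression `threadIdx.x + blockDim.x*blockIdx.x` assigns one array element to each thread, the mapping can be emulated on the CPU. This is only a sketch in plain NumPy, not PyCUDA code; the names `block_dim`, `grid_dim`, `block_idx`, `thread_idx` and `global_idx` are illustrative, and the sizes assume the launch configuration used later in this notebook (1024 threads per block, 1024 blocks).

```python
import numpy as np

# Assumed launch configuration (mirrors the launch later in the notebook).
block_dim = 1024            # threads per block (blockDim.x)
n = 1024**2                 # number of array elements
grid_dim = n // block_dim   # number of blocks (gridDim.x)

# Every (blockIdx.x, threadIdx.x) pair, laid out as a 2-D grid.
block_idx = np.arange(grid_dim).reshape(-1, 1)
thread_idx = np.arange(block_dim).reshape(1, -1)

# Same expression as in the kernel, evaluated for all threads at once.
global_idx = (thread_idx + block_dim * block_idx).ravel()

# Each element index 0..n-1 is produced exactly once.
print(np.array_equal(global_idx, np.arange(n)))
```

Since every index in `0..n-1` appears exactly once, each thread reads and writes its own element and no two threads touch the same location.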

In [70]:
factor = 3.14

Prepare the data on host and device.

In [71]:
host_vec = np.arange(0.0, 1024.0**2, 1.0, dtype=np.float32)
gpu_vec = gpuarray.to_gpu(host_vec)
gpu_out_vec = gpuarray.empty_like(gpu_vec)
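The kernel declares its arrays as `float *`, so the host data has to be single precision. If the data came from elsewhere (for example as `float64`), it would need converting before the transfer. A minimal sketch of such a check, using only NumPy; the variable `raw` is a hypothetical stand-in for data of the wrong type:

```python
import numpy as np

# Hypothetical input that is double precision by default.
raw = np.arange(0.0, 16.0, 1.0)
print(raw.dtype)

# Convert to a C-contiguous float32 array, matching the kernel's float* arguments.
host = np.ascontiguousarray(raw, dtype=np.float32)
print(host.dtype, host.flags['C_CONTIGUOUS'], host.nbytes)
```

The converted array takes 4 bytes per element, which is also the amount of device memory that a transfer of it would occupy.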

Retrieve a handle to the compiled kernel function (the compilation itself happened when the `SourceModule` was created).

In [72]:
scale_func = kernel.get_function('scale')

Compute and verify the result.

In [76]:
scale_func(gpu_out_vec, gpu_vec, np.float32(factor),
           block=(1024, 1, 1),
           grid=(host_vec.shape[0]//1024, 1, 1))
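The grid size above uses floor division, which only covers the whole array because its length is a multiple of 1024. For other lengths, a common approach is ceiling division, sketched below in plain Python; `grid_size` is a hypothetical helper, not part of PyCUDA. With a rounded-up grid the kernel would also need the array length as an extra argument and a guard such as `if (i < n)`, which the kernel in this notebook does not have.

```python
def grid_size(n, block_size):
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block_size - 1) // block_size


# Multiple of the block size: identical to floor division.
print(grid_size(1024**2, 1024), 1024**2 // 1024)

# Not a multiple: floor division would drop the trailing elements.
print(grid_size(1025, 1024), 1025 // 1024)
```

In the second case floor division yields one block and leaves the last element unprocessed, whereas ceiling division yields two blocks, the second of which is only partially used.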

In [77]:
np.sum(gpu_out_vec.get())

1726231900000.0

In [78]:
factor*host_vec.sum()

1726232638259.2

The CPU and GPU results agree to within a relative difference of about 4e-7, which is consistent with single-precision round-off.
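Comparing sums can hide errors in individual elements, so an element-wise check with a relative tolerance is often more informative. The sketch below runs on the CPU only: `result` is a float32 stand-in for what `gpu_out_vec.get()` would return, and the tolerance `rtol=1e-6` is an assumed value chosen to sit just above float32 precision.

```python
import numpy as np

factor = 3.14
host_vec = np.arange(0.0, 1024.0**2, 1.0, dtype=np.float32)

# Stand-in for gpu_out_vec.get(): the same product computed in float32.
result = np.float32(factor) * host_vec

# Reference computed in double precision.
reference = factor * host_vec.astype(np.float64)

# Element-wise comparison with a relative tolerance (assumed value).
print(np.allclose(result, reference, rtol=1e-6, atol=0.0))

# The accumulator precision alone changes the sum of the same float32 data.
print(np.sum(result), np.sum(result, dtype=np.float64))
```

With the actual GPU output, the same check would be applied to `gpu_out_vec.get()` in place of `result`.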