This notebook shows how to use CUDA from Python, using the addition of two vectors as a running example.

Adding two vectors in pure Python:

In [92]:
N = 10000000

a = [1. for i in range(N)]
b = [2. for i in range(N)]

%time c = [a[i] + b[i] for i in range(N)]

CPU times: user 703 ms, sys: 197 ms, total: 900 ms
Wall time: 898 ms


Optimizing speed with numpy:

In [106]:
import numpy as np

a = np.ones(N)
b = np.ones(N) * 2

%time c = a + b

CPU times: user 20.5 ms, sys: 8.09 ms, total: 28.6 ms
Wall time: 28.6 ms


The first way in which we can use CUDA in Python is by using the `numba` library:

In [58]:
from numba import cuda
from numba.cuda import as_cuda_array as ca

In [107]:
@cuda.jit
def vector_add_kernel(a, b, c, n):
    i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if i < n:
        c[i] = a[i] + b[i]

In [108]:
def vector_add(a, b):
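    # Copy both inputs from host memory to the GPU.
    a = cuda.to_device(a)
    b = cuda.to_device(b)
    n = a.size
    # Allocate the output on the host and copy it to the GPU as well.
    c = np.zeros(n)
    c = cuda.to_device(c)
    threads_per_block = 128
    # Ceiling division: enough blocks to cover all n elements.
    blocks_per_grid = (n + (threads_per_block - 1)) // threads_per_block
    vector_add_kernel[blocks_per_grid, threads_per_block](a, b, c, n)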
    return c.copy_to_host()
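
The launch configuration above rounds the number of blocks up so that every element gets a thread. A minimal pure-Python sketch of that arithmetic follows; `launch_config` is a hypothetical helper name introduced only for illustration and is not part of numba.

```python
def launch_config(n, threads_per_block=128):
    """Return (blocks_per_grid, total_threads) for n elements.

    Hypothetical helper: mirrors the ceiling division used in vector_add.
    """
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, blocks_per_grid * threads_per_block


for n in (1, 128, 129, 10_000_000):
    blocks, total = launch_config(n)
    # Threads with a global index >= n are the ones the `if i < n` guard skips.
    print(f"n={n}: blocks={blocks}, threads={total}, idle={total - n}")
```

When `n` is not a multiple of the block size, the last block contains threads with no element to process, which is why the kernel checks `i < n` before touching the arrays.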

In [116]:
%time c = vector_add(a, b)

CPU times: user 84.6 ms, sys: 31.5 ms, total: 116 ms
Wall time: 114 ms


This gives no speedup over numpy; at 114 ms it is in fact slower than the 28.6 ms numpy version, because the cost of copying the data to the GPU and back to the CPU dominates such a simple operation.
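
A rough back-of-the-envelope estimate shows why the transfers matter. The sketch below counts the bytes that `vector_add` moves between host and device. The bandwidth figure is an assumed placeholder, not a measurement from this machine, so the resulting time is only indicative.

```python
N = 10_000_000
BYTES_PER_ELEMENT = 8  # np.ones and np.zeros default to float64

# vector_add copies a, b and the zero-filled c to the device, then c back.
host_to_device = 3 * N * BYTES_PER_ELEMENT
device_to_host = 1 * N * BYTES_PER_ELEMENT
total_bytes = host_to_device + device_to_host

# ASSUMPTION: illustrative effective host<->device bandwidth.
# Substitute a value measured on the actual hardware.
ASSUMED_BANDWIDTH_GB_S = 10.0

transfer_ms = total_bytes / (ASSUMED_BANDWIDTH_GB_S * 1e9) * 1e3
print(f"{total_bytes / 1e6:.0f} MB moved, "
      f"~{transfer_ms:.0f} ms at {ASSUMED_BANDWIDTH_GB_S} GB/s (assumed)")
```

Allocating the output directly on the device, for example with `cuda.device_array`, would avoid one of the three host-to-device copies.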

Another way to use CUDA in Python is through PyTorch, which can compile inline CUDA C++ source with `load_inline`:

> This requires some extra dependencies: `pip install -q wurlitzer ninja`.

In [7]:
%load_ext wurlitzer

In [8]:
import torch
from torch.utils.cpp_extension import load_inline

In [9]:
def load_cuda(cuda_src, cpp_src, funcs, opt=False, verbose=False):
    return load_inline(cuda_sources=[cuda_src], cpp_sources=[cpp_src], functions=funcs,
                       extra_cuda_cflags=["-O2"] if opt else [], verbose=verbose, name="inline_ext")

In [10]:
cuda_begin = r'''
#include <torch/extension.h>
#include <stdio.h>
#include <c10/cuda/CUDAException.h>

#define CHECK_CUDA(x) TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x " must be contiguous")
#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)

inline unsigned int cdiv(unsigned int a, unsigned int b) { return (a + b - 1) / b;}
'''

In [21]:
cuda_src = cuda_begin + r'''
__global__ void vectorAddKernel(float *out, float *a, float *b, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

torch::Tensor vectorAdd(torch::Tensor a, torch::Tensor b) {
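    CHECK_INPUT(a); CHECK_INPUT(b);
    // Guard against out-of-bounds reads when the inputs differ in length.
    TORCH_CHECK(a.numel() == b.numel(), "a and b must have the same number of elements");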
    int n = a.size(0);
    auto output = torch::empty(n, a.options());
    int threads = 256;
    vectorAddKernel<<<cdiv(n, threads), threads>>>(output.data_ptr<float>(), a.data_ptr<float>(), b.data_ptr<float>(), n);
    C10_CUDA_KERNEL_LAUNCH_CHECK();
    return output;
}'''

In [22]:
cpp_src = "torch::Tensor vectorAdd(torch::Tensor a, torch::Tensor b);"

In [23]:
module = load_cuda(cuda_src, cpp_src, ['vectorAdd'], verbose=True)

Using /home/juan/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
The input conditions for extension module inline_ext have changed. Bumping to version 4 and re-building as inline_ext_v4...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/juan/.cache/torch_extensions/py38_cu121/inline_ext/build.ninja...
Building extension module inline_ext_v4...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


[1/3] c++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=inline_ext_v4 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/juan/miniconda3/envs/cuda/lib/python3.8/site-packages/torch/include -isystem /home/juan/miniconda3/envs/cuda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/juan/miniconda3/envs/cuda/lib/python3.8/site-packages/torch/include/TH -isystem /home/juan/miniconda3/envs/cuda/lib/python3.8/site-packages/torch/include/THC -isystem /home/juan/miniconda3/envs/cuda/include -isystem /home/juan/miniconda3/envs/cuda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -c /home/juan/.cache/torch_extensions/py38_cu121/inline_ext/main.cpp -o main.o 
[2/3] /home/juan/miniconda3/envs/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output cuda.cuda.o.d -DTORCH_EXTENSION_NAME=inline_ext_v4 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILE

Loading extension module inline_ext_v4...


In [24]:
dir(module)

['__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'vectorAdd']

In [136]:
%%time
 
a = torch.ones(N, dtype=torch.float32, device='cuda')
b = torch.ones(N, dtype=torch.float32, device='cuda') * 2

c = module.vectorAdd(a, b)

CPU times: user 1.77 ms, sys: 25 µs, total: 1.8 ms
Wall time: 1.4 ms


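That is fast! Note, however, that this timing is not directly comparable to the previous ones: the tensors are created on the GPU, so no host-to-device copy is timed, and CUDA kernel launches are asynchronous, so without a `torch.cuda.synchronize()` call the wall time may not include the kernel's full execution.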

We can also simulate a CUDA kernel in pure Python by calling the kernel function once per thread index, which makes development and debugging easier:

In [137]:
def run_kernel(f, times, *args):
    for i in range(times): f(i, *args)

In [138]:
def vector_add_kernel(i, out, a, b, n):
    if i < n:
        out[i] = a[i] + b[i]

In [140]:
a = np.ones(N)
b = np.ones(N) * 2
c = np.zeros(N)

%time run_kernel(vector_add_kernel, N, c, a, b, N)

CPU times: user 3.53 s, sys: 30.9 ms, total: 3.56 s
Wall time: 3.55 s
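
The simulation is slower than every other version, but the kernel body runs as ordinary Python, so it can be inspected with `print` or a debugger. It also flattens the launch into a single loop over a global index. A slightly closer sketch, assuming blocks and threads are simply iterated in order, passes the block index, thread index and block size to the kernel so that the index arithmetic matches the real kernels. The names `run_blocked_kernel` and `vector_add_blocked` are hypothetical, and plain lists are used to keep the sketch dependency-free.

```python
import math


def run_blocked_kernel(f, blocks, threads, *args):
    # Sequentially emulate a 1D grid: every thread of every block runs once.
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            f(block_idx, thread_idx, threads, *args)


def vector_add_blocked(block_idx, thread_idx, block_dim, out, a, b, n):
    # Same global-index computation as in the CUDA kernels.
    i = thread_idx + block_idx * block_dim
    if i < n:
        out[i] = a[i] + b[i]


n = 1000  # small on purpose: the Python loop is slow
threads = 256
a = [1.0] * n
b = [2.0] * n
c = [0.0] * n

run_blocked_kernel(vector_add_blocked, math.ceil(n / threads), threads, c, a, b, n)
print(c[:3], c[-1])
```

Removing the `if i < n` guard makes this version raise an `IndexError`, so out-of-bounds mistakes become visible on the CPU before the kernel is ported to CUDA.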
