In [None]:
height = 2_000
width = 3_000
maxiterations = 20

In [None]:
import numpy as np
import numba
import math
import matplotlib.pyplot as plt
import cupy as cp
import cupyx

Let's start by checking which GPU we have, using a shell command (sadly, CuPy's `cp.cuda.Device().attributes` only exposes numerical attributes, not the device name):


In [None]:
!nvidia-smi

If you have a V100, the examples should run roughly 5x faster than on a K40.

# Mandelbrot Fractal

In the CPU course, we used the Mandelbrot fractal as an example; we will use it again today.

You can generate a Mandelbrot fractal by applying the transform:

$$
z_{n+1}=z_{n}^{2}+c
$$

repeatedly to a regular matrix of complex numbers $c$, and recording the iteration number at which $|z|$ first exceeds some bound $N$, usually $N=2$. The iteration starts at $z_0 = c$.
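
As a minimal sketch of this rule for a single point, the plain-Python function below iterates the transform and returns the first iteration at which $|z|$ exceeds the bound. The name `escape_iteration` and its default arguments are illustrative, not part of the course code, and returning `maxiterations` for points that never escape is just one possible convention.

```python
def escape_iteration(c, maxiterations=20, bound=2.0):
    """Return the first iteration n at which |z_n| > bound (hypothetical helper)."""
    z = c  # z_0 = c
    for n in range(1, maxiterations + 1):
        z = z * z + c  # z_{n} = z_{n-1}^2 + c
        if abs(z) > bound:
            return n
    return maxiterations  # Never escaped within the iteration budget


print(escape_iteration(-1 + 0j))  # Orbit alternates between 0 and -1, never escapes
print(escape_iteration(1 + 0j))   # 1 -> 2 -> 5, escapes on the second iteration
```

Points that never escape within the iteration budget are treated as belonging to the set; everything else is labeled by how quickly it escaped.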



Let's set up some initial parameters and a helper matrix:

In [None]:
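def prepare(height, width, xp=np):
    # Axis 0 (height) spans the imaginary part, axis 1 (width) the real part
    imag, real = xp.ogrid[-1.5j:1.5j:height*1j, -2:2:width*1j]
    c = real + imag                              # Broadcasts to (height, width)
    fractal = xp.zeros(c.shape, dtype=xp.int32)  # Iteration counts
    return c, fractal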

## NumPy

Let's try a NumPy run (we will use `%%time` instead of `%%timeit`, since this takes several seconds to run, so we don't need a precise measurement and don't want to waste time):

In [None]:
def fractal_x(c, f, maxiterations):
    xp = cp.get_array_module(c)
    f *= 0 # set to 0
    z = c.copy()

    for i in range(1, maxiterations+1):
        z = z**2 + c                    # Compute z
        diverge = xp.abs(z**2) > 2**2   # Divergence criteria

        z[diverge] = 2               # Keep number size small
        f[~diverge] = i              # Fill in non-diverged iteration number
        
    return f
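
To see what the two masked assignments do, here is the same loop body traced on a toy three-element array (the values are chosen purely for illustration). Once a point diverges, its `z` is reset to 2, which keeps the numbers small; for these values it is then flagged again on every later pass, so its entry in `f` stops being updated.

```python
import numpy as np

# Toy input: one point that never diverges, one that diverges on pass 2,
# and one that diverges on pass 1
c = np.array([0.0 + 0.0j, 1.0 + 0.0j, 2.0 + 0.0j])
f = np.zeros(c.shape, dtype=np.int32)
z = c.copy()

for i in range(1, 4):
    z = z**2 + c
    diverge = np.abs(z**2) > 2**2  # Same test as |z| > 2
    z[diverge] = 2                 # Reset diverged points
    f[~diverge] = i                # Only surviving points get the new count

print(f.tolist())  # [3, 1, 0]
```

Each entry of `f` therefore holds the last iteration at which that point had not yet diverged.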

In [None]:
c, fractal = prepare(height, width, np)

In [None]:
%%time
_ = fractal_x(c, fractal, 20)

In [None]:
plt.imshow(fractal)

## Numba

Let's do a quick check with Numba from the CPU course, just to see how fast we can get on a single CPU core:

In [None]:
@numba.vectorize([numba.int32(numba.complex128, numba.int32)])
def on_each_numba(cxy, maxiterations):
    z = cxy
    for i in range(maxiterations):
        z = z**2 + cxy
        if abs(z) > 2:
            return i
    return maxiterations

In [None]:
c, fractal = prepare(height, width, np)

In [None]:
%%time
r = on_each_numba(c, 20)

In [None]:
plt.imshow(r);

## CuPy: NumPy interface

Now, let's try a CuPy run. CuPy launches GPU work asynchronously, and we are not using the output, so we add a synchronize call to make sure the timing includes the time the GPU needs to finish:

In [None]:
import cupy as cp

In [None]:
c, fractal = prepare(height, width, cp)

In [None]:
%%timeit
fractal_x(c, fractal, 20)
cp.cuda.get_current_stream().synchronize()

## CuPy: Fuse interface

This is a "Numba vectorize"-like interface for writing elementwise operations and simple reductions: `cp.fuse` combines the operations in the decorated function into a single kernel. It's quite limited, though.

In [None]:
@cp.fuse()
def cupy_fuse_combine(z, c):
    x = z**2 + c
    return x, cp.abs(x**2)

def fractal_fuse(c, f, maxiterations):
    xp = cp.get_array_module(c)
    f *= 0 # set to 0
    z = c.copy()

    for i in range(1, maxiterations+1):
        z, az2 = cupy_fuse_combine(z, c)     # Compute z
        diverge = az2  > 2**2       # Divergence criteria

        z[diverge] = 2               # Keep number size small
        f[~diverge] = i              # Fill in non-diverged iteration number
        
    return f

In [None]:
c, fractal = prepare(height, width, cp)

In [None]:
%%timeit
fractal_fuse(c, fractal, 20)
cp.cuda.get_current_stream().synchronize()

## CuPy: Elementwise Kernel

Now, let's try a custom elementwise kernel.

In [None]:
cupy_single = cp.ElementwiseKernel(
    "complex128 c, int32 maxiterations",
    "int32 res",
    """
    res = 0;
    complex<double> z = c;

    for (int i=0; i<maxiterations; i++) {
        z = z*z + c;

        if(z.real()*z.real() + z.imag()*z.imag() > 4)
            break;

        res = i;
    }
    
    """,                                
    "fract_el")

In [None]:
%%timeit
f = cupy_single(c, 20).get()
cp.cuda.get_current_stream().synchronize()

In [None]:
f = cupy_single(c, 20)
plt.imshow(f.get())

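We could also write everything ourselves as a raw CUDA kernel:

> Note: the width/height naming is confusing here. The kernel bounds `x` by `height` and `y` by `width`, and uses `x + height*y` as the flat index, which does not match NumPy's row-major `(row, column)` layout. The result is still correct, because `c` and `fractal` are read and written at the same flat index and every index is visited exactly once.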

In [None]:
cupy_kernel = cp.RawKernel("""
extern "C" 
__global__ void fractal(double* c, int* fractal, int height, int width, int maxiterations) {
    const int x = threadIdx.x + blockIdx.x*blockDim.x;
    const int y = threadIdx.y + blockIdx.y*blockDim.y;
    
    // Manual check for out-of-bounds (since blocks may be partial)
    if (x >= height || y >= width)
        return;
    
    // Access c
    double creal = c[2 * (x + height*y)];
    double cimag = c[2 * (x + height*y) + 1];
    
    // z = c
    double zreal = creal;
    double zimag = cimag;
    
    fractal[x + height*y] = 0;
    for (int i = 0;  i < maxiterations;  i++) {
        // z = z*z + c
        double zreal_new = zreal*zreal - zimag*zimag + creal;
        double zimag_new = 2*zreal*zimag + cimag;
        zreal = zreal_new;
        zimag = zimag_new;
        
        if (zreal*zreal + zimag*zimag > 4) {
            break;
        }
        fractal[x + height*y] = i;
    }
}
""", "fractal")

In [None]:
def prepare_pycuda(c, fractal, maxiterations):
    height, width = c.shape  # Use the array's own shape, not the globals
    threadsperblock = (32, 32)
    blockspergrid = (
        math.ceil(height / threadsperblock[0]),
        math.ceil(width / threadsperblock[1]),
    )
    
    return (
        blockspergrid,
        threadsperblock,
        (
            c.view(cp.double),  # complex128 viewed as interleaved (real, imag) doubles
            fractal,
            cp.int32(height),
            cp.int32(width),
            cp.int32(maxiterations),
        ),
    )
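
The grid size above comes from a ceiling division, which can be checked without a GPU. The sketch below uses a hypothetical helper, `blocks_per_grid`, with the image shape and block size from this notebook.

```python
import math


def blocks_per_grid(shape, threads_per_block):
    """Number of blocks needed to cover each axis (hypothetical helper)."""
    return tuple(math.ceil(n / t) for n, t in zip(shape, threads_per_block))


shape = (2_000, 3_000)
threads = (32, 32)

grid = blocks_per_grid(shape, threads)
covered = tuple(g * t for g, t in zip(grid, threads))

print(grid, covered)  # (63, 94) (2016, 3008)
```

The launched threads cover 2016 x 3008 positions, slightly more than the 2000 x 3000 image, which is why the kernel needs its manual out-of-bounds check.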

In [None]:
c, fractal = prepare(height, width, cp)
args = prepare_pycuda(c, fractal, maxiterations)

In [None]:
%%timeit
cupy_kernel(*args)
fractal.get()
cp.cuda.get_current_stream().synchronize()

In [None]:
plt.imshow(fractal.get());
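
The index arithmetic in the raw kernel above can be checked on the CPU. This plain-Python sketch uses a tiny stand-in shape rather than the real image size, and mirrors only the kernel's indexing, not its computation.

```python
height, width = 4, 6  # Tiny stand-ins for the real image size

visited = []
for y in range(width):       # Kernel's y index, bounded by width
    for x in range(height):  # Kernel's x index, bounded by height
        visited.append(x + height * y)  # Same flat index as the kernel

# Every flat index is produced exactly once, so no element is missed or repeated
assert sorted(visited) == list(range(height * width))

# However, (x, y) is not the NumPy (row, column) of that element
flat = 1 + height * 2            # x=1, y=2
row, col = divmod(flat, width)   # Row-major position in a (height, width) array
print(flat, (row, col))          # 9 (1, 3)
```

Because each thread only touches its own element, the unusual mapping does not change the output image; it only makes the kernel harder to read.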

# Extra features

I've skipped one key feature: reduction kernels. These let you perform an elementwise calculation followed by a binary reduction (like a sum).

You can also use generic types (like templates in C++), such as "T", and you can use "raw" arguments, which are arrays that do not participate in the elementwise portion of the kernel (that is, they are not broadcast, in NumPy terms).
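
As a rough mental model only (plain Python on the CPU, not CuPy code), a reduction kernel can be thought of as three stages: an elementwise map, a binary reduction starting from an identity value, and a final post-reduction step. The function and argument names below are illustrative and are not CuPy's API; the example computes an L2 norm.

```python
import math
from functools import reduce


def model_reduction(values, map_fn, reduce_fn, post_fn, identity):
    """CPU model of map -> binary reduce -> post-map (hypothetical helper)."""
    mapped = (map_fn(x) for x in values)         # Elementwise stage
    total = reduce(reduce_fn, mapped, identity)  # Binary reduction
    return post_fn(total)                        # Final step on the reduced value


l2norm = model_reduction(
    [3.0, 4.0],
    map_fn=lambda x: x * x,
    reduce_fn=lambda a, b: a + b,
    post_fn=math.sqrt,
    identity=0.0,
)
print(l2norm)  # 5.0
```

In CuPy, these stages are supplied to `cp.ReductionKernel` as snippets of CUDA C++ source rather than as Python callables.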

# New version

CuPy 7.0 brought a host of new features, including:

* Python 2 support removed
* RawModule, for building larger projects
* NVCC support (instead of just NVRTC)
* Tensor Core support
* High-speed CUB routines, such as sum

CuPy 8.0 brought even more:

* Optional activation of more CUB routines
* More kernel fusion, with more reducers
* More SciPy support and better external library integration

CuPy 9.0 brought further performance improvements and filled out more of the library. JIT support is experimental.