# Prerequisites

This notebook presents libraries and tools for running Python code on
NVIDIA GPUs. The path to the GPU goes through improving CPU performance
first, but the end target is the GPU.

The notebook was tested on the Anaconda Python distribution for Python 3.x.
This container should have all of the package dependencies.

This notebook follows the accompanying slides.

Let's check which CPU and GPU you're using. 

In [1]:
!nvidia-smi

zsh:1: command not found: nvidia-smi


In [None]:
!lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:            0
CPU MHz:             2199.998
BogoMIPS:            4399.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            56320K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_sin

We can write Python and execute it inside a code block: 

In [None]:
print("Goodbye world!!!!")

Goodbye world!!!!


<a id='Simple Python Example with Numba'></a>
# 1. Simple Python Example Using Numba - CPU and GPU


This is a simple example that sums the tanh of the diagonal elements of a
matrix and adds that sum to every element of the matrix.
The computational work is done in a function. To create enough computational
work, the matrix is quite large (10,000 x 10,000, or 100,000,000 elements).
The elements use NumPy's default integer type.

The example starts with plain Python code with no compiling.
In case you are interested, a simple timing of the operation is
printed at the end.

<a id='Simple Python Example'></a>
## 1.1 Starting Point - Simple Python Code

The code below is the baseline Python code (the starting point). It takes the tanh of each element along the diagonal of a matrix, sums those values, and adds the sum to every element of the matrix.
The matrix is created using NumPy. The time it takes to perform the operation
is measured using <tt>default_timer</tt> from the "timeit" module in Python.

In [None]:
import numpy as np
from timeit import default_timer as timer

x = np.arange(100000000).reshape(10000, 10000)


def trace_func(a): 
    trace = 0.0
    for i in range(a.shape[0]):   
        trace += np.tanh(a[i, i])
    return a + trace             

start = timer()
trace_func(x)
dt = timer() - start

print("Computed in %f s" % dt)

Computed in 0.201477 s


<a id='Simple Python Example - CPU'></a>
## 1.2 Simple Python - @jit Examples

The following examples use Numba <tt>@jit</tt> to compile the <tt>trace_func</tt> function. Various decorators and
options are presented, both for compiling for the CPU and for compiling for the
NVIDIA GPU. The discussion that accompanies these examples is in the slide deck.

<a id='Simple Python Example single core'></a>
### 1.2.1 <tt>@jit</tt> with default target (single CPU core)

This example uses the <tt>@jit</tt> decorator in Numba. By default this targets a
single core on the CPU.

You might try running the cell a few times to get an idea of the average
wall clock time.

In [None]:
from numba import jit
import numpy as np
from timeit import default_timer as timer

x = np.arange(100000000).reshape(10000, 10000)

@jit
def trace_func(a): 
    trace = 0.0
    for i in range(a.shape[0]):   # Numba likes loops
        trace += np.tanh(a[i, i]) # Numba likes NumPy functions
    return a + trace              # Numba likes NumPy broadcasting

trace_func(x)

start = timer()
trace_func(x)
dt = timer() - start

print("Computed in %f s" % dt)

Computed in 0.152036 s
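Re-running the cell by hand can also be automated. The sketch below is one possible way to do it using only the standard library. The helper name `time_repeated` and the default of five repeats are illustrative choices, not part of the notebook.

```python
from timeit import default_timer as timer


def time_repeated(func, *args, repeats=5):
    """Call func(*args) once to warm up, then time `repeats` further calls.

    Returns (best, mean) wall clock times in seconds.
    """
    func(*args)  # warm-up call; for a Numba function this triggers compilation
    times = []
    for _ in range(repeats):
        start = timer()
        func(*args)
        times.append(timer() - start)
    return min(times), sum(times) / len(times)


# Stand-in workload so the sketch runs on its own; in the notebook you
# would pass trace_func and x instead of sum and range(...).
best, mean = time_repeated(sum, range(100_000))
print("best %f s, mean %f s" % (best, mean))
```

The best time is usually the more stable number, since the mean is easily skewed by other activity on the machine.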


<a id='Simple Python Example multi-core'></a>
### 1.2.2 <tt>@jit</tt> with multi-core target and no Python fallback

This targets multiple CPU cores using the  <tt>parallel=True</tt>  option in  <tt>@jit</tt>.

I also use the option  <tt>nopython=True</tt>  so I can catch my code
mistakes. It is the default in recent Numba releases (0.59 and later), but I
like to specify it so I know exactly what options I'm using. This option
disables any Python fall-back in case Numba cannot compile the code.

You might try running the cell a few times to get an idea of the average
wall clock time.

In [None]:
from numba import jit
import numpy as np
from timeit import default_timer as timer

x = np.arange(100000000).reshape(10000, 10000)

@jit(nopython=True, parallel=True)
def trace_func(a): 
    trace = 0.0
    for i in range(a.shape[0]):  
        trace += np.tanh(a[i, i]) 
    return a + trace              

trace_func(x)

start = timer()
trace_func(x)
dt = timer() - start

print("Computed in %f s" % dt)



Computed in 0.209793 s


<strong>Note</strong>: Currently, we only measure the time for the compiled
code: the function is called once before the timer starts, so compilation
happens during that first call and is not part of the measured time.
Therefore, the elapsed wall clock times should not vary too much from run to run.
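One detail worth knowing: with <tt>parallel=True</tt>, Numba parallelizes array expressions (such as <tt>a + trace</tt>) automatically, but a plain <tt>range</tt> loop still runs on one core. To spread the loop itself across cores, Numba provides <tt>numba.prange</tt>. The sketch below writes the diagonal sum that way. The function name `diag_tanh_sum` is made up for illustration, and the `try/except` fallback exists only so the snippet also runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption for portability): run as plain Python without Numba.
    prange = range

    def njit(*args, **kwargs):
        return lambda func: func


@njit(parallel=True)
def diag_tanh_sum(a):
    total = 0.0
    # prange marks the loop as parallel; Numba treats `total +=` as a reduction.
    for i in prange(a.shape[0]):
        total += np.tanh(a[i, i])
    return total


x = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)
print(diag_tanh_sum(x))
```

Because the partial sums are combined in a different order than in the serial loop, the result can differ from the serial one in the last few bits.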

<a id='Simple Python Example defaults cache'></a>
### 1.2.3 <tt>@jit</tt> with defaults, caching, and no Python fallback

This example uses the defaults again, but it also caches the compiled code
so it can be reused (<tt>cache=True</tt>). 

In [None]:
from numba import jit
import numpy as np
from timeit import default_timer as timer

x = np.arange(100000000).reshape(10000, 10000)

@jit(nopython=True, cache=True)
def trace_func(a): 
    trace = 0.0
    for i in range(a.shape[0]):  
        trace += np.tanh(a[i, i]) 
    return a + trace              

trace_func(x)

start = timer()
trace_func(x)
dt = timer() - start

print("Computed in %f s" % dt)

<a id='Simple Python Example multicore cache'></a>
### 1.2.4 <tt>@jit</tt> with parallel, caching, and no Python fallback

This example adds the option <tt>parallel=True</tt> (multi-core) to the
previous options.

In [None]:
from numba import jit
import numpy as np
from timeit import default_timer as timer

x = np.arange(100000000).reshape(10000, 10000)

@jit(nopython=True, parallel=True, cache=True)
def trace_func(a): 
    trace = 0.0
    for i in range(a.shape[0]):  
        trace += np.tanh(a[i, i]) 
    return a + trace              

trace_func(x)

start = timer()
trace_func(x)
dt = timer() - start

print("Computed in %f s" % dt)

<a id='Simple Python Example vectorize no type signature single core'></a>
## 1.3 Numba <tt>@vectorize</tt> decorator


Below is the Python code example for this exercise, without any decorator applied.

### 1.3.1 Python Example

In [None]:
import numpy as np
from timeit import default_timer as timer

def rel_diff(x, y): 
  return 2 * (x - y) / (x + y)

a = np.arange(1000, dtype = np.float32)
b = a * 2 + 1        

start = timer()
rel_diff(a, b)
dt = timer() - start

print("Computed in %f s" % dt)

This example uses the @vectorize decorator targeting the CPU (default target). 
Remember that the <tt>@vectorize</tt> decorator compiles functions that perform
element-by-element operations (that is, the same operation to every element
in the array). 

In general, you give the <tt>@vectorize</tt> decorator a type signature that
tells Numba how to build the compiled code for a specific data type (you can
specify more than one which is really cool). Here is an example,


<codeblock>
    @vectorize(['float32(float32, float32)'])
</codeblock>


The data type specification after the decorator is the "type signature".
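As a sketch of what specifying more than one signature can look like, the snippet below gives the same function a single-precision and a double-precision signature. Numba's documentation asks for the most specific signature to come first, so <tt>float32</tt> is listed before <tt>float64</tt>. The `try/except` fallback is an assumption added so the snippet also runs without Numba; it simply leaves the function undecorated.

```python
import numpy as np

try:
    from numba import vectorize
except ImportError:
    # Fallback (assumption for portability): leave the function undecorated,
    # which still works here because the body is plain array arithmetic.
    def vectorize(*args, **kwargs):
        return lambda func: func


# Most specific signature first: float32 before float64.
@vectorize(['float32(float32, float32)',
            'float64(float64, float64)'])
def rel_diff(x, y):
    return 2 * (x - y) / (x + y)


a32 = np.arange(1, 5, dtype=np.float32)
a64 = np.arange(1, 5, dtype=np.float64)

# The output precision follows the precision of the inputs.
print(rel_diff(a32, a32 * 2).dtype)
print(rel_diff(a64, a64 * 2).dtype)
```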

You don't have to specify a type signature. If you don't, then Numba will
create a dynamic universal function (DUFunc). This dynamically compiles a
new kernel when the function is called with a data type that wasn't previously
used. That is, it will compile the function for every specific data type in
your code. This can have an impact if you use multiple data types with the same
function. You can think of this approach as "call-time" or "lazy" compilation.

Below, see the example function with the addition of the @vectorize decorator. Note that no type signature is supplied in the function definition, so Numba will infer the types when the function is first called.

### 1.3.2 <tt>@vectorize</tt> Without Type Signature

In [None]:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize
def rel_diff(x, y): 
  return 2 * (x - y) / (x + y)

a = np.arange(1000, dtype = np.float32)
b = a * 2 + 1        

rel_diff(a, b)

start = timer()
rel_diff(a, b)
dt = timer() - start

print("Computed in %f s" % dt)

And finally the same function with the type signature defined. Since the signature is supplied with the function definition, Numba will know the type signature when the function definition is first encountered.

### 1.3.3 <tt>@vectorize</tt> With Type Signature

In [None]:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(['float32(float32, float32)'])
def rel_diff(x, y): 
  return 2 * (x - y) / (x + y)

a = np.arange(1000, dtype = np.float32)
b = a * 2 + 1        

rel_diff(a, b)

start = timer()
rel_diff(a, b)
dt = timer() - start

print("Computed in %f s" % dt)

<a id='Simple Python Example vectorize type signature single core'></a>
### 1.3.4 @vectorize decorator for CPUs with single CPU core target

There are multiple hardware targets available with the @vectorize decorator. This is a trivial example, but it illustrates that the default target for the
<tt>@vectorize</tt> decorator is a single core ( <tt>target='cpu'</tt> ). 

In [None]:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(['float32(float32, float32)'], target='cpu')
def rel_diff(x, y): 
  return 2 * (x - y) / (x + y)

a = np.arange(1000, dtype = np.float32)
b = a * 2 + 1        

rel_diff(a, b)

start = timer()
rel_diff(a, b)
dt = timer() - start

print("Computed in %f s" % dt)

<a id='Simple Python Example vectorize type signature parallel'></a>
### 1.3.5 @vectorize decorator using all CPU cores

This example changes the target to use all of the cores (multi-core). The target name
is simply <tt>parallel</tt>. Even though all of the cores are being used, the performance
may not improve. It takes time to move data around as needed across the cores, and with
only 1,000 elements that overhead can outweigh the gain.

In [None]:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(['float32(float32, float32)'], target='parallel')
def rel_diff(x, y): 
  return 2 * (x - y) / (x + y)

a = np.arange(1000, dtype = np.float32)
b = a * 2 + 1        

rel_diff(a, b)

start = timer()
rel_diff(a, b)
dt = timer() - start

print("Computed in %f s" % dt)

<a id='Simple Python Example - GPU'></a>
## 1.4 Simple Python - GPU Examples

The examples below explore using the GPU as a target with the vectorize decorator
and the <tt>cuda</tt> target, as well as using the  <tt>@cuda.jit</tt>  decorator.

<a id='Simple Python Example - Vectorize GPU'></a>
### 1.4.1 <tt>@vectorize</tt> Decorator with a CUDA Target

Porting Python code to the GPU can be very easy using the code from the  <tt>@vectorize</tt>
decorator. For most code, all you need to do is change the *target* in the decorator to
<tt>cuda</tt>.

Notice how Numba takes care of copying the data to the GPU and copying it back. Numba
also takes care of defining the arrays on the GPU.

Try running the cell several times to get an idea of the compute time.

In [None]:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(['float32(float32, float32)'], target='cuda')
def rel_diff(x, y): 
  return 2 * (x - y) / (x + y)

a = np.arange(1000, dtype = np.float32)
b = a * 2 + 1        

rel_diff(a, b)

start = timer()
rel_diff(a, b)
dt = timer() - start

print("Computed in %f s" % dt)

<a id='Simple Python Example - cud.jit GPU'></a>
### 1.4.2 <tt>@cuda.jit</tt> Decorator

You can also use a different decorator,  <tt>@cuda.jit</tt>, that allows you to
write Python code that is more CUDA-like. This offers
you greater flexibility and control over your code and possibly better performance.
However, you will need to know a fair amount about CUDA  before using it.
The link below is a good place to get more information.


http://numba.pydata.org/numba-doc/dev/cuda/index.html

    
As explained in the slides, using the  <tt>cuda.jit</tt>  decorator requires some extra coding.
You have to explicitly write the loops (or the per-thread indexing) in your code.
Also, a kernel cannot return a value, so every result has to be written into an array
that is passed in as an argument (even a scalar result goes into an array of size 1).
Scalar inputs can be passed directly, and you can define simple scalars in the function
that are not NumPy arrays. For example, loop counters.

Pay close attention to how the data is passed to the function - it is much more
like functions in C, where all data is passed through the function call.

To prepare for using <tt>@cuda.jit</tt>, let's introduce a new example. In this exercise, we will accelerate a Mandelbrot fractal computation using CUDA Python via Numba. Starting with the plain Python implementation...

In [None]:
import numpy as np
from pylab import imshow, show
from timeit import default_timer as timer

def mandel(x, y, max_iters):
  """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
  """
  c = complex(x, y)
  z = 0.0j
  for i in range(max_iters):
    z = z*z + c
    if (z.real*z.real + z.imag*z.imag) >= 4:
      return i

  return max_iters


#The whole image loop...
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
  height = image.shape[0]
  width = image.shape[1]

  pixel_size_x = (max_x - min_x) / width
  pixel_size_y = (max_y - min_y) / height

  for x in range(width):
    real = min_x + x * pixel_size_x
    for y in range(height):
      imag = min_y + y * pixel_size_y
      color = mandel(real, imag, iters)
      image[y, x] = color

      
image = np.zeros((1024, 1536), dtype = np.uint8)
start = timer()
create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20) 
dt = timer() - start

print ("Mandelbrot created in %f s" % dt)
imshow(image)
show()

Now let's add @jit decorators

In [None]:
import numpy as np
from pylab import imshow, show
from timeit import default_timer as timer
from numba import jit

@jit
def mandel(x, y, max_iters):
  """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
  """
  c = complex(x, y)
  z = 0.0j
  for i in range(max_iters):
    z = z*z + c
    if (z.real*z.real + z.imag*z.imag) >= 4:
      return i

  return max_iters


#The whole image loop...
@jit
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
  height = image.shape[0]
  width = image.shape[1]

  pixel_size_x = (max_x - min_x) / width
  pixel_size_y = (max_y - min_y) / height

  for x in range(width):
    real = min_x + x * pixel_size_x
    for y in range(height):
      imag = min_y + y * pixel_size_y
      color = mandel(real, imag, iters)
      image[y, x] = color

      
image = np.zeros((1024, 1536), dtype = np.uint8)
start = timer()
create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20) 
dt = timer() - start

print ("Mandelbrot created in %f s" % dt)
imshow(image)
show()

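And now with @cuda.jit. Try changing the decorator and see what happens. Simply swapping the decorator is not enough: a CUDA kernel cannot return a value and has to be launched with a thread configuration, so this cell is expected to fail.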

In [None]:
import numpy as np
from pylab import imshow, show
from timeit import default_timer as timer
from numba import cuda

@cuda.jit
def mandel(x, y, max_iters):
  """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
  """
  c = complex(x, y)
  z = 0.0j
  for i in range(max_iters):
    z = z*z + c
    if (z.real*z.real + z.imag*z.imag) >= 4:
      return i

  return max_iters


#The whole image loop...
@cuda.jit
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
  height = image.shape[0]
  width = image.shape[1]

  pixel_size_x = (max_x - min_x) / width
  pixel_size_y = (max_y - min_y) / height

  for x in range(width):
    real = min_x + x * pixel_size_x
    for y in range(height):
      imag = min_y + y * pixel_size_y
      color = mandel(real, imag, iters)
      image[y, x] = color

      
image = np.zeros((1024, 1536), dtype = np.uint8)
start = timer()
create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20) 
dt = timer() - start

print ("Mandelbrot created in %f s" % dt)
imshow(image)
show()

First the caller. We have to define a configuration of threads to do our work for us.

In [None]:
image = np.zeros((1024, 1536), dtype = np.uint8)

#Create a grid of blocks with 32x32 threads each, one thread per pixel
height = image.shape[0]
width = image.shape[1]

nthreads = 32
nblocksy = (height // nthreads) + 1 #33
nblocksx = (width // nthreads) + 1 #49


config = (nblocksx, nblocksy), (nthreads, nthreads)



#start = timer()
#create_fractal[config](-2.0, 1.0, -1.0, 1.0, image, 20) 
#dt = timer() - start

#print ("Mandelbrot created in %f s" % dt)
#imshow(image)
#show()
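The <tt>// nthreads + 1</tt> rule above always gives enough blocks, but it launches one spare block per dimension whenever the image size is an exact multiple of the block size (as it is here). That is harmless as long as the kernel checks its indices against the image bounds. A common alternative is ceiling division, sketched below. The helper name `blocks_needed` is hypothetical.

```python
def blocks_needed(n_items, threads_per_block):
    """Smallest number of blocks whose threads cover n_items (ceiling division)."""
    return (n_items + threads_per_block - 1) // threads_per_block


height, width = 1024, 1536
nthreads = 32

nblocksy = blocks_needed(height, nthreads)
nblocksx = blocks_needed(width, nthreads)

config = (nblocksx, nblocksy), (nthreads, nthreads)
print(config)
```

For this image the result is 48 x 32 blocks instead of 49 x 33. For sizes that are not a multiple of 32, both rules give the same answer.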

Next let's look at the whole-image loop, create_fractal. We will have a team of many GPU threads doing our work for us - one thread per pixel. So we can remove the loops that distribute multiple work items to a single thread.

In [None]:
from numba import cuda
from numba import *

#mandel_gpu = cuda.jit(device=True)(mandel)

@cuda.jit
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
  height = image.shape[0]
  width = image.shape[1]

  pixel_size_x = (max_x - min_x) / width
  pixel_size_y = (max_y - min_y) / height

  x, y = cuda.grid(2) # x = blockIdx.x * blockDim.x + threadIdx.x
  if x < width and y < height:
      real = min_x + x * pixel_size_x
      imag = min_y + y * pixel_size_y
      color = mandel_gpu(real, imag, iters)
      image[y,x] = color
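To see why the <tt>if x < width and y < height</tt> check matters, it can help to emulate the index arithmetic on the CPU. The sketch below is a pure-Python emulation, not a Numba API. It walks over every block and thread of a tiny launch, computes the same <tt>(x, y)</tt> pair that <tt>cuda.grid(2)</tt> would hand to each thread, and counts how often each pixel is touched. The function name `emulate_grid_2d` and the tiny 5 x 3 "image" are made up for illustration.

```python
def emulate_grid_2d(grid_dim, block_dim):
    """Yield the (x, y) pair that cuda.grid(2) would give each thread.

    On a GPU these threads run concurrently; here they are simply enumerated.
    """
    (grid_x, grid_y), (block_x, block_y) = grid_dim, block_dim
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(block_y):
                for tx in range(block_x):
                    # x = blockIdx.x * blockDim.x + threadIdx.x (same for y)
                    yield bx * block_x + tx, by * block_y + ty


width, height = 5, 3      # a tiny "image"
threads = 2               # 2x2 threads per block
grid = (width // threads + 1, height // threads + 1)   # same rule as the notebook

visits = [[0] * width for _ in range(height)]
launched = 0
for x, y in emulate_grid_2d(grid, (threads, threads)):
    launched += 1
    if x < width and y < height:   # the same bounds check as in the kernel
        visits[y][x] += 1

print("threads launched:", launched)
print("pixels in image: ", width * height)
print("every pixel visited exactly once:",
      all(v == 1 for row in visits for v in row))
```

More threads are launched than there are pixels, and the bounds check is what keeps the surplus threads from writing outside the image.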


Finally let's look at mandel. We want it to be a function, which is itself called from another GPU function create_fractal. To specify that it's a device function, use @cuda.jit(device=True) - which is similar to \_\_device\_\_ in CUDA C.

In [None]:
@cuda.jit(device=True)
def mandel_gpu(x, y, max_iters):
  """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
  """
  c = complex(x, y)
  z = 0.0j
  for i in range(max_iters):
    z = z*z + c
    if (z.real*z.real + z.imag*z.imag) >= 4:
      return i

  return max_iters

Putting it all together...

In [None]:
import numpy as np
from pylab import imshow, show
from timeit import default_timer as timer
from numba import cuda

@cuda.jit(device=True)
def mandel_gpu(x, y, max_iters):
  """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
  """
  c = complex(x, y)
  z = 0.0j
  for i in range(max_iters):
    z = z*z + c
    if (z.real*z.real + z.imag*z.imag) >= 4:
      return i

  return max_iters

@cuda.jit
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
  height = image.shape[0]
  width = image.shape[1]

  pixel_size_x = (max_x - min_x) / width
  pixel_size_y = (max_y - min_y) / height

  x, y = cuda.grid(2) # x = blockIdx.x * blockDim.x + threadIdx.x
  if x < width and y < height:
      real = min_x + x * pixel_size_x
      imag = min_y + y * pixel_size_y
      color = mandel_gpu(real, imag, iters)
      image[y,x] = color

image = np.zeros((1024, 1536), dtype = np.uint8)

#Create a grid of blocks with 32x32 threads each, one thread per pixel
height = image.shape[0]
width = image.shape[1]

nthreads = 32
nblocksy = (height // nthreads) + 1 # = 33
nblocksx = (width // nthreads) + 1 # = 49


config = (nblocksx, nblocksy), (nthreads, nthreads)


start = timer()
create_fractal[config](-2.0, 1.0, -1.0, 1.0, image, 20) 
dt = timer() - start

print ("Mandelbrot created in %f s" % dt)
imshow(image)
show()

<a id='Python Porting Example'></a>
# 2. Python Porting Example

This is an example of porting functions to use Numba. It is really an example of the porting
<em>lifecycle</em> of a Python application. It starts with a serial Python code
that has been written to test the idea and to make sure it works and gets correct answers.

Then it moves to putting computational intensive portions of the code into functions.

Then it uses Numba to compile for the CPU, using the <tt>@jit</tt> decorator. 

The next step is to switch to the <tt>@vectorize</tt> decorator, targeting a single core
(target is <tt>cpu</tt>), then all of the CPU cores (target is <tt>parallel</tt>), and
finally to NVIDIA GPUs (target is <tt>cuda</tt>).

The last step is to modify the code to use the <tt>@cuda.jit</tt> decorator. This
requires some code changes.

The new example code is a very, very simplified version of the start of a Molecular
Dynamics (MD) mini-app. It focuses on loops that initialize arrays (the values are arbitrary
and don't mean anything). The code can obviously be made much simpler and faster, but that is
not the point. The point is to show you how to start with a code that uses loops and "port"
it to use Numba and GPUs. Several steps will be used in this porting process. Hopefully it
gives you some "feel" for porting your applications to use Numba.
                                                    
Let's start with the serial Python code.

In [None]:
import numpy as np
from time import perf_counter

# main loop
start_time = perf_counter( )
d_num = 5000
p_num = 5000

pos = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
accel = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
for j in range(0, p_num):
    for i in range(0, d_num):
        pos[i,j] = 6.5
    # end for i
    for i in range(0, d_num):
        accel[i,j] = 4.2*pos[i,j]
    # end for
# end for j

stop_time = perf_counter( )

print(pos)
print(accel)
print('')
print('    Elapsed wall clock time = %g seconds.' % (stop_time - start_time) )
print('')

<a id='Non-Numba Python Code - Function'></a>
## 2.1 Non-Numba Python Code

The next code creates a function for the computationally intense part of the code (the loop).
This gets the code ready for Numba (remember that Numba compiles functions, not entire codes).
Notice that the two arrays are created in the function and both are returned to the caller.

In [None]:
import numpy as np
from time import perf_counter

def init(p_num, d_num):
    pos = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
    accel = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
    for j in range(0, p_num):
        for i in range(0, d_num):
            pos[i,j] = 6.5
        # end for i
        for i in range(0, d_num):
            accel[i,j] = 4.2*pos[i,j]
        # end for i
    # end for j
    return pos, accel
# end def


# main
d_num = 5000
p_num = 5000

start_time = perf_counter( )
pos, accel = init(p_num, d_num)
stop_time = perf_counter( )

print(pos)
print(accel)
print('')
print('    Elapsed wall clock time = %g seconds.' % (stop_time - start_time) )
print('')

Always be sure to check that your answers are correct after each step
in porting the code. Checking answers for this simple code is fairly easy
and can be done by simply printing the arrays.
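For larger arrays, printing stops being practical, and a programmatic comparison against a reference is more reliable. A minimal sketch, assuming a small problem size so the loop version stays fast: `init_reference` is a hypothetical whole-array version written only for checking, and the tolerance is an assumption chosen to allow for single-precision rounding.

```python
import numpy as np


def init_loops(p_num, d_num):
    """Loop version with the same structure as the notebook's init()."""
    pos = np.zeros(shape=(d_num, p_num), dtype=np.float32)
    accel = np.zeros(shape=(d_num, p_num), dtype=np.float32)
    for j in range(p_num):
        for i in range(d_num):
            pos[i, j] = 6.5
        for i in range(d_num):
            accel[i, j] = 4.2 * pos[i, j]
    return pos, accel


def init_reference(p_num, d_num):
    """Whole-array reference, used only to check the loop version."""
    pos = np.full((d_num, p_num), 6.5, dtype=np.float32)
    accel = (4.2 * pos).astype(np.float32)
    return pos, accel


pos, accel = init_loops(8, 6)
pos_ref, accel_ref = init_reference(8, 6)

print("pos matches:  ", np.allclose(pos, pos_ref))
print("accel matches:", np.allclose(accel, accel_ref, rtol=1e-6))
```

The same two `np.allclose` calls can be reused after every porting step by swapping `init_loops` for the newly decorated function.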

<a id='Python Porting Example jit decorator single'></a>
## 2.2 <tt>@jit</tt> Decorator Targeting a Single Core on the CPU

Next, let us use the <tt>@jit</tt> decorator to compile the function. The first one uses the
default target which is a single core.

Note: If you want to eliminate the time to compile, run the cell again without changing the code.
You can do this several times to get an understanding of the true run time without including
the compilation time.

In [None]:
import numpy as np
from time import perf_counter
from numba import jit

@jit
def init(p_num, d_num):
    pos = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
    accel = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
    for j in range(0, p_num):
        for i in range(0, d_num):
            pos[i,j] = 6.5
        # end for i
        for i in range(0, d_num):
            accel[i,j] = 4.2*pos[i,j]
        # end for i
    # end for j
    return pos, accel
# end def


# main
d_num = 5000
p_num = 5000

start_time = perf_counter( )
pos, accel = init(p_num, d_num)
stop_time = perf_counter( )

print(pos)
print(accel)
print('')
print('    Elapsed wall clock time = %g seconds.' % (stop_time - start_time) )
print('')

<a id='Python Porting Example jit decorator parallel'></a>
## 2.3 <tt>@jit</tt> Decorator Targeting Multi-Core (parallel=True)

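This code compiles with <tt>parallel=True</tt> (multi-core) and also disables Python fallback. Note that Numba only spreads explicit loops across cores when they are written with <tt>numba.prange</tt>; the plain <tt>range</tt> loops here still run serially.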

Note: If you want to eliminate the time to compile, run the cell again without changing the code.
You can do this several times to get an understanding of the true run time without including
the compilation time.

In [None]:
import numpy as np
from time import perf_counter
from numba import jit

@jit(nopython=True, parallel=True)
def init(p_num, d_num):
    pos = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
    accel = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
    for j in range(0, p_num):
        for i in range(0, d_num):
            pos[i,j] = 6.5
        # end for i
        for i in range(0, d_num):
            accel[i,j] = 4.2*pos[i,j]
        # end for i
    # end for j
    return pos, accel
# end def


# main
d_num = 5000
p_num = 5000

start_time = perf_counter( )
pos, accel = init(p_num, d_num)
stop_time = perf_counter( )

print(pos)
print(accel)
print('')
print('    Elapsed wall clock time = %g seconds.' % (stop_time - start_time) )
print('')

<a id='Python Porting Example vectorize single'></a>
## 2.4 <tt>@vectorize</tt> Decorator for a Single Core (default)

This example rewrites the code to use the <tt>@vectorize</tt> decorator. Remember that this
requires the code to be executed element-by-element. It really also requires that
each function return one result. This will require some code changes to split the
two loops so that each has its own function.

Recall that we write the function as if it were a scalar. Numba takes care of
all of the details of creating the ufunc based on the code.

Another thing to note is that we have to create the arrays in the main routine. They
can no longer be created in the functions. The <tt>@vectorize</tt> decorator only wants
very simple element-by-element code. This precludes creating the arrays in the functions.

Pay close attention to the type signature(s) for the decorator and how the functions
are called. It is a bit counterintuitive: <tt>set_pos</tt> ignores the value it receives,
and the input array serves only to give the output its shape.

Note: If you want to eliminate the time to compile, run the cell again without changing the code.
You can do this several times to get an understanding of the true run time without including
the compilation time.

In [None]:
import numpy as np
from time import perf_counter
from numba import vectorize

@vectorize(['float32(float32)'])
def set_pos(pos):
    return 6.5
# end def

@vectorize(['float32(float32, float32)'])
def set_accel(pos, accel):
    return 4.2*pos
# end def


# main
d_num = 5000
p_num = 5000
pos = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
accel = np.zeros( shape=(d_num, p_num), dtype=np.float32 )

start_time = perf_counter( )
pos = set_pos(pos)
accel = set_accel(pos,accel)
stop_time = perf_counter( )

print(pos)
print(accel)
print('')
print('    Elapsed wall clock time = %g seconds.' % (stop_time - start_time) )
print('')

<a id='Python Porting Example vectorize parallel'></a>
## 2.5 <tt>@vectorize</tt> Decorator for Multi-Core

This example simply changes the target for the <tt>@vectorize</tt> decorator to multi-core (parallel).

Note: If you want to eliminate the time to compile, run the cell again without changing the code.
You can do this several times to get an understanding of the true run time without including
the compilation time.

In [None]:
import numpy as np
from time import perf_counter
from numba import vectorize

@vectorize(['float32(float32)'], target='parallel')
def set_pos(pos):
    return 6.5
# end def

@vectorize(['float32(float32, float32)'], target='parallel')
def set_accel(pos, accel):
    return 4.2*pos
# end def


# main
d_num = 5000
p_num = 5000
pos = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
accel = np.zeros( shape=(d_num, p_num), dtype=np.float32 )

start_time = perf_counter( )
pos = set_pos(pos)
accel = set_accel(pos,accel)
stop_time = perf_counter( )

print(pos)
print(accel)
print('')
print('    Elapsed wall clock time = %g seconds.' % (stop_time - start_time) )
print('')

<a id='Python Porting Example vectorize cuda'></a>
## 2.6 <tt>@vectorize</tt> Decorator - CUDA Target

This next example also simply changes the target, but this time it is for NVIDIA GPUs.
This is a benefit of being able to write your function code as scalars and using Numba
to vectorize it. You can just change the target to get either a single CPU core,
multiple CPU cores, or an NVIDIA GPU.

Note: If you want to eliminate the time to compile, run the cell again without changing the code.
You can do this several times to get an understanding of the true run time without including
the compilation time.

In [None]:
import numpy as np
from time import perf_counter
from numba import vectorize

@vectorize(['float32(float32)'], target='cuda')
def set_pos(pos):
    return 6.5
# end def

@vectorize(['float32(float32, float32)'], target='cuda')
def set_accel(pos, accel):
    return 4.2*pos
# end def


# main
d_num = 5000
p_num = 5000
pos = np.zeros( shape=(d_num, p_num), dtype=np.float32 )
accel = np.zeros( shape=(d_num, p_num), dtype=np.float32 )

start_time = perf_counter( )
pos = set_pos(pos)
accel = set_accel(pos,accel)
stop_time = perf_counter( )

print(pos)
print(accel)
print('')
print('    Elapsed wall clock time = %g seconds.' % (stop_time - start_time) )
print('')

<a id='CUPY'></a>
# 3. CuPy

As discussed in the slides, CuPy is roughly a GPU equivalent for NumPy. It covers virtually
all of the NumPy functions but runs them on a GPU. In addition, CuPy is also starting to port
some SciPy routines to the GPU in a new library named <tt>cupyx.scipy</tt>.

The following examples illustrate how CuPy can be used in your Python code.

<a id='CUPY SVD Data on GPU'></a>
## 3.1 SVD Example, Leaving the Data on the GPU

This example performs a Singular Value Decomposition (SVD) on a random matrix that is
created on the GPU. Note that the results of the SVD, the <tt>u</tt>, <tt>v</tt>, and
<tt>s</tt> matrices, are available on the GPU after the computation but in this example,
they are not copied back to the host.

In [None]:
@


    Elapsed wall clock time for numpy = 0.976656 seconds.


    Elapsed wall clock time for cupy = 0.399943 seconds.
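A minimal sketch of what such a timing comparison might look like. The matrix size (256) is a hypothetical value, since the size behind the timings above is not stated, and the helper name `timed_svd` is made up for illustration. The CuPy branch runs only when CuPy and a usable GPU are present, so the snippet also works on a CPU-only machine.

```python
from time import perf_counter

import numpy as np

try:
    import cupy as cp
    gpu_ok = cp.cuda.is_available()
except Exception:
    # CuPy is not installed, or no usable GPU was found.
    cp = None
    gpu_ok = False


def timed_svd(xp, size):
    """Create a random float32 matrix with module xp (numpy or cupy) and time its SVD."""
    a = xp.random.uniform(low=-1.0, high=1.0, size=(size, size)).astype(xp.float32)
    start = perf_counter()
    u, s, v = xp.linalg.svd(a)
    if xp is not np:
        cp.cuda.Device(0).synchronize()  # wait for the GPU before stopping the clock
    return perf_counter() - start, a, u, s, v


size = 256  # hypothetical size, chosen only to keep the sketch quick

dt, a, u, s, v = timed_svd(np, size)
print('    Elapsed wall clock time for numpy = %g seconds.' % dt)

if gpu_ok:
    # The matrix and the SVD results stay on the GPU; nothing is copied back.
    dt, a_gpu, u_gpu, s_gpu, v_gpu = timed_svd(cp, size)
    print('    Elapsed wall clock time for cupy = %g seconds.' % dt)
```

The first GPU call typically includes one-time library initialization, so running the cell a second time gives a fairer number.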



<a id='CUPY SVD Data back to host'></a>
## 3.2 SVD Example - Copy Data Back to the Host

This example is the same as the previous, but the <tt>u</tt> matrix is copied
back to the host using the <tt>asnumpy</tt> function. The "type" of the <tt>u</tt>
matrix on the GPU and the <tt>u</tt> matrix on the CPU are both printed so you
can tell that one is on the device (GPU) and one is on the host (CPU). It also
checks the difference between the reconstructed <tt>A</tt> matrix from the SVD
components, versus the original <tt>A</tt> matrix.

Reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html

In [None]:
import cupy as cp
import numpy as np

A_cpu = np.random.uniform(low=-1., high=1., size=(64, 64)).astype(np.float32)
A_gpu = cp.asarray(A_cpu)

u_gpu, s_gpu, v_gpu = cp.linalg.svd(A_gpu)
print("type(u_gpu) = ",type(u_gpu) )

u_cpu = cp.asnumpy(u_gpu)
print("type(u_cpu) = ",type(u_cpu) )

# ----- Check answer -----
v_cpu = cp.asnumpy(v_gpu)
s_cpu = cp.asnumpy(s_gpu)

s_cpu_test = np.diag(s_cpu)
result = np.allclose(A_cpu, np.dot(u_cpu, np.dot(s_cpu_test, v_cpu)), atol=1e-05)
print("Check result = ",result)

<a id='CUPY Matrix Mult Data on GPU'></a>
## 3.3 Matrix Multiplication Example - Create Data on GPU

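This example creates random data on the CPU (for NumPy) and on the GPU (for CuPy) and
then performs the matrix multiplication. Note that the result of the multiplication,
the <tt>C</tt> matrix, is available on the GPU if you need it. Also note that the NumPy
timing includes creating the random matrices, while the CuPy timing covers only the
multiplication and the device synchronization, so the two numbers are not directly comparable.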

Notice the similarity of the two parts of the code (numpy and cupy).
They are virtually identical.

In [None]:
import math
import cupy as cp
import numpy as np
from time import perf_counter

size = 8000

start_time = perf_counter( )
A = np.random.uniform(low=-1.0, high=1.0, size=(size,size) ).astype(np.float32)
B = np.random.uniform(low=-1., high=1., size=(size,size) ).astype(np.float32)
C = np.matmul(A,B)
stop_time = perf_counter( )

print('')
print('    Elapsed wall clock time for numpy = %g seconds.' % (stop_time - start_time) )
print('')

del A
del B
del C


A = cp.random.uniform(low=-1.0, high=1.0, size=(size,size) ).astype(cp.float32)
B = cp.random.uniform(low=-1., high=1., size=(size,size) ).astype(cp.float32)

start_time = perf_counter( )
#A = cp.random.uniform(low=-1.0, high=1.0, size=(size,size) ).astype(cp.float32)
#B = cp.random.uniform(low=-1., high=1., size=(size,size) ).astype(cp.float32)
C = cp.matmul(A,B)
cp.cuda.Device(0).synchronize()
stop_time = perf_counter( )

print('')
print('    Elapsed wall clock time for cupy = %g seconds.' % (stop_time - start_time) )
print('')

del A
del B
del C


    Elapsed wall clock time for numpy = 2.18571 seconds.


    Elapsed wall clock time for cupy = 1.89381 seconds.



<a id='CUPY Matrix Mult Data on CPU'></a>
## 3.4 Matrix Multiplication Example - Copy Data from Host to GPU

This example creates the data on the CPU and then copies it to
the GPU. Matrix Multiplication on the CPU and GPU is timed. The
GPU timing includes the time for the data movement.

In [None]:
import math
import cupy as cp
import numpy as np
from time import perf_counter

size = 8000

start_time = perf_counter( )
A = np.random.uniform(low=-1.0, high=1.0, size=(size,size) ).astype(np.float32)
B = np.random.uniform(low=-1., high=1., size=(size,size) ).astype(np.float32)
C = np.matmul(A,B)
stop_time = perf_counter( )

print('')
print('    Elapsed wall clock time for numpy = %g seconds.' % (stop_time - start_time) )
print('')

start_time = perf_counter( )
A_gpu = cp.asarray(A)
B_gpu = cp.asarray(B)
C_gpu = cp.matmul(A_gpu,B_gpu)
C_cpu = cp.asnumpy(C_gpu)
stop_time = perf_counter( )

print('')
print('    Elapsed wall clock time for cupy = %g seconds.' % (stop_time - start_time) )
print('')




# Appendix

Setup for the Colab environment:


In [None]:
%%bash
MINICONDA_INSTALLER_SCRIPT=Miniconda3-4.5.4-Linux-x86_64.sh
MINICONDA_PREFIX=/usr/local
wget https://repo.continuum.io/miniconda/$MINICONDA_INSTALLER_SCRIPT
chmod +x $MINICONDA_INSTALLER_SCRIPT
./$MINICONDA_INSTALLER_SCRIPT -b -f -p $MINICONDA_PREFIX

In [None]:
%%bash
conda install --channel defaults conda python=3.6 --yes
conda update --channel defaults --all --yes

In [None]:
!conda --version

In [None]:
import sys
_ = (sys.path.append("/usr/local/lib/python3.6/site-packages"))

In [None]:
!conda install --channel conda-forge cupy numba dask tbb --yes

In [None]:
!conda list -e

In [None]:
!conda --version

In [None]:
!python --version

In [None]:
%env NUMBA_THREADING_LAYER=omp