## Homework 10: GPUs

## Due Date: April 26, 2023, 11:59pm

#### Firstname Lastname: Giulio Duregon

#### E-mail: gjd9961@nyu.edu

#### Enter your solutions and submit this notebook

---

**Problem 1 (100p)**


Write two programs which will be able to run in parallel on a GPU, one using Numba/CUDA (50p), one using PyOpenCL (50p).


Each program will:

- draw two random vectors $\vec u$ and $\vec v$ from $[0,1]^N$ where $N = 10^7$;


- calculate and output similarity between $\vec u$ and $\vec v$.




The similarity between two vectors $\vec u$ and $\vec v$ is defined here as the cosine of the angle between them, $\measuredangle \left( \vec u, \vec v \right)$. That is, the program returns:

$$\cos \left( \measuredangle \left( \vec u, \vec v \right) \right).$$


Note that the output is a real value and must belong to $[-1, 1]$.
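
As a sanity check on the expected output (this derivation is an addition, not part of the assignment text), the cosine can be written in terms of the dot product and the Euclidean norms:

$$\cos \left( \measuredangle \left( \vec u, \vec v \right) \right) = \frac{\vec u \cdot \vec v}{\lVert \vec u \rVert \, \lVert \vec v \rVert} = \frac{\sum_{i=1}^{N} u_i v_i}{\sqrt{\sum_{i=1}^{N} u_i^2} \, \sqrt{\sum_{i=1}^{N} v_i^2}}.$$

Assuming the entries are drawn independently and uniformly from $[0,1]$, we have $E[u_i v_i] = \tfrac{1}{4}$ and $E[u_i^2] = E[v_i^2] = \tfrac{1}{3}$, so for large $N$ the value should be close to

$$\frac{N/4}{\sqrt{N/3} \, \sqrt{N/3}} = \frac{3}{4}.$$

Because all entries are non-negative, the result in this setting actually lies in $[0, 1]$.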

## PyOpenCL implementation

In [1]:
# Cosine similarity of two random vectors, computed with PyOpenCL
from time import time
import numpy as np
import pyopencl as cl

np.random.seed(10)
# Set start time / number of samples to draw
num_samples = 10 ** 7
start_time = time()

# generate random vectors
v, u  = np.random.rand(num_samples).astype(np.float32), np.random.rand(num_samples).astype(np.float32)

# PyOpenCL setup
ctx   = cl.create_some_context()
q = cl.CommandQueue(ctx)
memory_flags = cl.mem_flags

# Initialize buffers to read in data to GPU
u_buffer = cl.Buffer(ctx, memory_flags.READ_ONLY | memory_flags.COPY_HOST_PTR, hostbuf=u)
v_buffer = cl.Buffer(ctx, memory_flags.READ_ONLY | memory_flags.COPY_HOST_PTR, hostbuf=v)

# Compile the OpenCL kernel: each work-item computes the three element-wise products for one index
prg = cl.Program(ctx, """
__kernel void fnct(
__global const float *u_buffer, 
__global const float *v_buffer, 
__global float *u_norm_buffer,
__global float *v_norm_buffer,
__global float *dot_product_buffer){

int gid = get_global_id(0);
u_norm_buffer[gid] = u_buffer[gid] * u_buffer[gid];
v_norm_buffer[gid] = v_buffer[gid] * v_buffer[gid];
dot_product_buffer[gid] = u_buffer[gid] * v_buffer[gid];
}""").build()

# Build output buffers for the element-wise products used in the norms and the dot product
u_norm_buffer = cl.Buffer(ctx, memory_flags.WRITE_ONLY, u.nbytes)
v_norm_buffer = cl.Buffer(ctx, memory_flags.WRITE_ONLY, u.nbytes)
dot_product_buffer = cl.Buffer(ctx, memory_flags.WRITE_ONLY, u.nbytes)

# Launch the kernel with one work-item per vector element
prg.fnct(q, u.shape, None, u_buffer, v_buffer, u_norm_buffer, v_norm_buffer, dot_product_buffer)

# Initialize empty np.arrays for the results
uv = np.empty_like(u)
uu = np.empty_like(u)
vv = np.empty_like(v)

# Copy the results from the device buffers back to the host arrays
cl.enqueue_copy(q, uu, u_norm_buffer)
cl.enqueue_copy(q, vv, v_norm_buffer)
cl.enqueue_copy(q, uv, dot_product_buffer)

# Calculate cosine similarity as (u . v) / (|u| * |v|), reducing the products on the host
cosine_similarity = np.sum(uv) / (np.sqrt(np.sum(uu)) * np.sqrt(np.sum(vv)))

# Print output
print(f"Run Time: {(time() - start_time):.05f}s")
print(f"Cosine Similarity Value: {cosine_similarity:.05f}")

Run Time: 0.27574s
Cosine Similarity Value: 0.75027
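
The GPU result above can be cross-checked against a plain NumPy computation. The following is a minimal sketch, not part of the original submission: the helper name `cosine_similarity_cpu`, the seed, and the reduced vector length are all assumptions chosen to keep the check quick.

```python
import numpy as np


def cosine_similarity_cpu(u, v):
    """Cosine of the angle between u and v, accumulated in float64 (hypothetical helper)."""
    u64 = u.astype(np.float64)
    v64 = v.astype(np.float64)
    return float(np.dot(u64, v64) / (np.linalg.norm(u64) * np.linalg.norm(v64)))


# Assumed seed and a smaller N than the assignment's 10**7, to keep the check fast
rng = np.random.default_rng(10)
n = 10 ** 6
u = rng.random(n, dtype=np.float32)
v = rng.random(n, dtype=np.float32)

reference = cosine_similarity_cpu(u, v)
print(f"CPU reference: {reference:.5f}")
```

For uniform entries on $[0,1]$ the printed value should land near 0.75, which is consistent with the PyOpenCL output above.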


## Numba + CUDA Implementation

I don't have a GPU on my laptop, so the two cells below enable Numba's CUDA simulator. If testing with a GPU, comment them out.

When running on a Google Colab GPU, the results are as follows:
- Total Time (s): 0.34177327156066895, Cosine Similarity: 0.7498810839929098

In [2]:
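# Note: `!export` runs in a subshell, so it does not change this kernel's environment;
# the `%env` magic in the next cell is what actually sets the variable.
!export NUMBA_ENABLE_CUDASIM=1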


In [3]:
%env NUMBA_ENABLE_CUDASIM=1

env: NUMBA_ENABLE_CUDASIM=1


In [4]:
from numba import cuda
import math
import numpy as np

print(cuda.gpus)
cuda.select_device(0)


@cuda.jit
def cosine_similarity_gpu(u, v, uu, vv, uv, res):
    # Get the global id of the thread
    x = cuda.grid(1)
    if x >= u.shape[0]:
        return
    
    # Accumulate the squared norms of u and v and their dot product
    cuda.atomic.add(uu, 0, u[x]*u[x])
    cuda.atomic.add(uv, 0, u[x]*v[x])
    cuda.atomic.add(vv, 0, v[x]*v[x])

    # Block-level barrier only: threads in other blocks may still be accumulating
    cuda.syncthreads()
    
    if x == 0:
        res[0] = uv[0] / (math.sqrt(uu[0]) * math.sqrt(vv[0]))


<Managed Device 0>


In [5]:
from numba import cuda
import math
import numpy as np

print(cuda.gpus)
cuda.select_device(0)


@cuda.jit
def cosine_similarity_gpu(u, v, uu, vv, uv, res):
    # Get the global id of the thread
    x = cuda.grid(1)
    if x >= u.shape[0]:
        return
    
    # Accumulate the squared norms of u and v and their dot product
    cuda.atomic.add(uu, 0, u[x]*u[x])
    cuda.atomic.add(uv, 0, u[x]*v[x])
    cuda.atomic.add(vv, 0, v[x]*v[x])

    # Block-level barrier only: threads in other blocks may still be accumulating
    cuda.syncthreads()
    
    # Every thread overwrites res[0]; writes made before all blocks finish use partial sums
    res[0] = uv[0] / (math.sqrt(uu[0]) * math.sqrt(vv[0]))


<Managed Device 0>


In [6]:
from time import time

# Init start time
start_time = time()

# Threads per block (a multiple of the warp size, 32, is usually preferable)
TPB = 16

# Initialize 2 * 10^7 randomly drawn numbers for our two vectors
num_samples = 10 ** 7
v, u  = np.random.rand(num_samples).astype(np.float32), np.random.rand(num_samples).astype(np.float32)

# Accumulators for the squared norms and the dot product, plus the kernel's result
vv = np.zeros(1)
uv = np.zeros(1)
uu = np.zeros(1)
res = np.zeros(1, float)

# Calculate blocks per grid
blocks_per_grid = int(math.ceil(u.shape[0] / TPB))

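# Launch the kernel; passing NumPy arrays copies them to the device and back
cosine_similarity_gpu[blocks_per_grid, TPB](u, v, uu, vv, uv, res)

# Recompute on the host from the completed sums, since res[0] may have been
# written inside the kernel before every block finished accumulating
cosine_similarity = uv[0] / (math.sqrt(uu[0]) * math.sqrt(vv[0]))

print(f"Total Time (s): {time()- start_time}, Cosine Similarity: {cosine_similarity}")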

KeyboardInterrupt: 
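
The kernel above issues three atomic adds per element, all targeting the same three addresses. A common alternative is a two-stage reduction: each block first sums its own slice, and only the per-block partial sums are combined at the end. The sketch below emulates that pattern on the CPU with NumPy to show the arithmetic; it is not GPU code, and the function name `blockwise_sums`, the seed, the vector length, and the block size of 256 are assumptions made for illustration.

```python
import math

import numpy as np


def blockwise_sums(u, v, threads_per_block):
    """CPU emulation of a two-stage reduction (hypothetical helper, not a CUDA kernel)."""
    n = u.shape[0]
    blocks = math.ceil(n / threads_per_block)
    pad = blocks * threads_per_block - n

    # Zero-pad so the last, partially filled block adds nothing extra;
    # this plays the role of the `x >= u.shape[0]` bounds check in a kernel.
    u_blocks = np.pad(u.astype(np.float64), (0, pad)).reshape(blocks, threads_per_block)
    v_blocks = np.pad(v.astype(np.float64), (0, pad)).reshape(blocks, threads_per_block)

    # Stage 1: each "block" reduces its own slice to one partial sum
    uu_partial = (u_blocks * u_blocks).sum(axis=1)
    vv_partial = (v_blocks * v_blocks).sum(axis=1)
    uv_partial = (u_blocks * v_blocks).sum(axis=1)

    # Stage 2: combine one value per block instead of one value per element
    return uu_partial.sum(), vv_partial.sum(), uv_partial.sum(), blocks


rng = np.random.default_rng(0)  # assumed seed
u = rng.random(100_003, dtype=np.float32)  # length deliberately not a multiple of the block size
v = rng.random(100_003, dtype=np.float32)

uu, vv, uv, blocks = blockwise_sums(u, v, threads_per_block=256)
cosine = uv / math.sqrt(uu * vv)
print(f"blocks: {blocks}, cosine similarity: {cosine:.5f}")
```

On a real device, stage 1 would typically use shared memory within each block, so that only one atomic add (or one write to a partials array) per block reaches global memory.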