<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/ml_tricks/gpu_jit_programming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CuPy
CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python. CuPy acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms.

https://pypi.org/project/cupy/

https://docs.cupy.dev/en/stable/

https://docs.cupy.dev/en/stable/user_guide/basic.html
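
As a rough illustration of the "drop-in replacement" idea, the sketch below runs the same array code through whichever module is available. The helper name `get_array_module` is an assumption made for this example (it is not a CuPy function), and the fallback to NumPy is only there so the snippet also runs on a machine without a GPU.

```python
import numpy as np


def get_array_module():
    """Return CuPy when it is installed and a GPU is usable, otherwise NumPy.

    Hypothetical helper for this sketch, not part of CuPy itself.
    """
    try:
        import cupy as cp
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp
    except Exception:
        # CuPy missing, or no usable CUDA device: fall back to the CPU
        pass
    return np


xp = get_array_module()

# The same calls work on both modules because CuPy mirrors the NumPy API
a = xp.arange(10, dtype=xp.float64)
b = xp.sqrt(a) + 1.0
total = float(b.sum())  # float() brings a GPU scalar back to the host

print(f"array module: {xp.__name__}, total = {total:.6f}")
```

Only the import changes between the CPU and GPU versions; the array expressions stay the same.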



# Numba
Numba is an open source, NumPy-aware optimizing compiler for Python sponsored by Anaconda, Inc. It uses the LLVM compiler project to generate machine code from Python syntax.

It analyses Python code, turns it into LLVM IR (intermediate representation), and then generates machine code for the selected architecture (by default, the architecture the host Python runtime is running on). This allows additional enhancements, such as parallelisation and compiling for CUDA. Given the near-ubiquitous support for LLVM, code can be generated for a fairly wide range of architectures (x86, x86_64, PPC, ARMv7, ARMv8) and operating systems (Windows, macOS, Linux), as well as for CUDA and AMD’s equivalent, ROCm.

https://github.com/numba/numba


# Install
conda create -n gpu-prog python=3.11 pip

conda activate gpu-prog

pip install cupy-cuda12x numba

In [2]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [1]:
! nvidia-smi

Sun Mar  3 10:02:32 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
import numpy as np
import datetime
from cupyx.profiler import benchmark
import cupy as cp
from numba import jit

# NOTE: the timed region covers both creating the random array and sorting it
t1 = datetime.datetime.now()
size = 4096 * 4096
data = np.random.random(size).astype(np.float64)
# sort the array on the CPU
data = np.sort(data)
t2 = datetime.datetime.now()
extime = t2 - t1
extime = extime.total_seconds()
print(f"time execution in CPU = {extime:.6f} seconds")

# copy the (already sorted) array to the GPU as a CuPy array
input_gpu = cp.asarray(data)
# time a single run; the host-to-device copy above is outside the timed call
execution_gpu = benchmark(cp.sort, (input_gpu,), n_repeat=1)
gpu_avg_time = np.average(execution_gpu.gpu_times)

print(f"GPU Time {gpu_avg_time:.6f} seconds")
print(f"GPU is  {(extime/gpu_avg_time):.6f} faster")

time execution in CPU = 0.551503 seconds
GPU Time 0.023199 seconds
GPU is  23.773156 faster
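
The CPU figure above also includes the time spent generating the random array. A minimal sketch of a timing helper that measures only the call of interest could look like the following. The name `time_call`, the `repeats` parameter and the smaller array size are assumptions chosen for illustration.

```python
import time

import numpy as np


def time_call(func, *args, repeats=3):
    """Return (best wall-clock seconds, last result) over `repeats` runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


rng = np.random.default_rng(0)
data = rng.random(1_000_000)  # created outside the timed region

# np.sort returns a sorted copy, so every repeat sorts the same unsorted input
sort_seconds, sorted_data = time_call(np.sort, data)
print(f"np.sort on {data.size} elements: {sort_seconds:.6f} seconds")
```

`time.perf_counter` is a monotonic, high-resolution clock, which makes it a better fit for short measurements than `datetime.datetime.now()`.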


In [5]:
def sum2d(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

In [6]:
test = np.random.random((10000,10000)).astype(np.float64)
t1 = datetime.datetime.now()
sum2d(test)
t2 = datetime.datetime.now()
extime = t2 - t1
extime1 = extime.total_seconds()
print(f"time execution in CPU = {extime1:.6f} seconds")

time execution in CPU = 24.947120 seconds


Numba has two compilation modes: nopython mode and object mode. The former produces much faster code, but has limitations that can force Numba to fall back to the latter. To prevent Numba from falling back, and instead raise an error, pass nopython=True to the jit decorator.
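
To make the "raise an error" behaviour concrete, here is a small sketch that passes an object Numba cannot type to a nopython function. The `Point` class and `get_x` function are hypothetical names invented for this example, and the import guard only exists so the snippet still runs where Numba is not installed.

```python
try:
    from numba import jit
    from numba.core.errors import NumbaError
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False


class Point:
    """Hypothetical plain Python class that Numba has no type for."""

    def __init__(self, x):
        self.x = x


status = "numba not installed"
if HAVE_NUMBA:
    @jit(nopython=True)
    def get_x(p):
        return p.x

    try:
        get_x(Point(1.0))  # compilation is attempted on this first call
        status = "compiled"
    except NumbaError as err:
        # nopython mode refuses to fall back to object mode and raises instead
        status = "rejected in nopython mode"
        print(type(err).__name__)

print(status)
```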

In [7]:
@jit(nopython=True)
def sum2dj(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i,j]
    return result

In [8]:
test = np.random.random((10000,10000)).astype(np.float64)
t1 = datetime.datetime.now()
sum2dj(test)
t2 = datetime.datetime.now()
extime = t2 - t1
extime2 = extime.total_seconds()
print(f"time execution in JiT = {extime2:.6f} seconds")

time execution in JiT = 0.702635 seconds


In [9]:
print(f"JiT is  {(extime1/extime2):.6f} faster")

JiT is  35.505092 faster
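
The JiT time measured above comes from the first call, so it includes compilation. One way to separate the two, sketched here under the assumption that a tiny warm-up call with the same argument type is acceptable, is to call the function once before timing it. The name `sum2d_warm` and the array sizes are illustrative, and the stand-in decorator is only used when Numba is not installed.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    def njit(func):  # hypothetical stand-in so the sketch runs without Numba
        return func


@njit
def sum2d_warm(arr):
    M, N = arr.shape
    result = 0.0
    for i in range(M):
        for j in range(N):
            result += arr[i, j]
    return result


# Warm-up: the first call compiles the function for 2D float64 arrays
sum2d_warm(np.ones((2, 2)))

data = np.random.random((300, 300))
start = time.perf_counter()
total = sum2d_warm(data)  # reuses the compiled code, no compilation here
elapsed = time.perf_counter() - start

print(f"sum = {total:.3f}, run time without compilation = {elapsed:.6f} seconds")
```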


### On the first execution, Numba has to compile the kernel and load it onto the GPU

In [10]:
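from numba import cuda

# Naive CUDA kernel: each thread that runs it walks the whole array serially.
# CUDA kernels cannot return a value, so the sum is written to res[0].
@cuda.jit(cache=True)
def sum2dc(arr, res):
    M, N = arr.shape
    for i in range(M):
        for j in range(N):
            res[0] += arr[i,j]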


In [11]:
test = np.random.random((10000,10000)).astype(np.float32)
t1 = datetime.datetime.now()
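res = np.zeros(1, np.float32)  # one-element output array
# [1,1] launches 1 block of 1 thread; Numba copies the NumPy arguments to the GPU and back on each call
sum2dc[1,1](test, res)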
t2 = datetime.datetime.now()
extime = t2 - t1
extime3 = extime.total_seconds()
print(f"time execution in GPU = {extime3:.6f} seconds")



time execution in GPU = 6.720016 seconds


In [12]:
res[0]

16777216.0
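
The value returned above, 16777216.0, is exactly 2**24. A float32 has a 24-bit significand, so once a float32 accumulator reaches 2**24, adding a number smaller than 1 no longer changes it. The short sketch below reproduces that effect on the CPU with NumPy scalars; the variable names are chosen for illustration only.

```python
import numpy as np

limit = np.float32(2**24)          # 16777216.0
stuck = limit + np.float32(0.5)    # float32 result rounds back to 2**24
print(stuck == limit)

# The same addition in float64 keeps the fractional part
wide = np.float64(2**24) + np.float64(0.5)
print(wide)

# Spacing between neighbouring float32 values at 2**24
print(np.spacing(limit))
```

Accumulating into a float64 output array is one way around this in a kernel like the one above.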

In [13]:
print(f"cuda is  {(extime1/extime3):.6f} faster")

cuda is  3.712360 faster


#### On the second call the kernel is already compiled, so the compilation overhead is gone

In [14]:
test = np.random.random((10000,10000)).astype(np.float32)
t1 = datetime.datetime.now()
res = np.zeros((1), np.float32)
sum2dc[1,1](test, res)
t2 = datetime.datetime.now()
extime = t2 - t1
extime3 = extime.total_seconds()
print(f"time execution in GPU = {extime3:.6f} seconds")
print(f"cuda is  {(extime1/extime3):.6f} faster")

time execution in GPU = 6.539455 seconds
cuda is  3.814862 faster


#### If we execute the kernel 10 more times, it is already compiled and loaded on the GPU after the first execution, so these runs carry no compilation overhead

In [15]:
%timeit -n 10 -r 1 sum2dc[1,1](test, res)

6.43 s ± 0 ns per loop (mean ± std. dev. of 1 run, 10 loops each)


In [16]:
extime4 = 6.43  # seconds per loop, copied by hand from the %timeit output above

In [17]:
print(f"cuda is  {(extime1/extime4):.6f} faster")

cuda is  3.879801 faster
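
The launches above use `sum2dc[1,1]`, so a single GPU thread performs the whole sum, and each call with NumPy arguments also copies the array to the device and back. As a hedged sketch of a parallel alternative (not the notebook's method), the code below uses a grid-stride loop, a float64 atomic add and explicit transfers. The names `gpu_sum` and `sum_kernel` and the launch configuration of 128 blocks of 256 threads are assumptions for illustration; without Numba or a CUDA device it falls back to NumPy.

```python
import numpy as np

try:
    from numba import cuda
    HAVE_CUDA = cuda.is_available()
except ImportError:
    cuda = None
    HAVE_CUDA = False


def gpu_sum(arr):
    """Sum an array on the GPU; falls back to NumPy when CUDA is unavailable."""
    if not HAVE_CUDA:
        return float(arr.sum(dtype=np.float64))

    # Defined here so nothing CUDA-related runs on machines without a GPU;
    # in real code the kernel would live at module level and be compiled once.
    @cuda.jit
    def sum_kernel(flat, out):
        start = cuda.grid(1)       # global index of this thread
        stride = cuda.gridsize(1)  # total number of threads in the grid
        partial = 0.0              # per-thread float64 accumulator
        for i in range(start, flat.shape[0], stride):
            partial += flat[i]
        cuda.atomic.add(out, 0, partial)  # combine per-thread partial sums

    d_flat = cuda.to_device(np.ascontiguousarray(arr).ravel())  # explicit copy in
    d_out = cuda.to_device(np.zeros(1, dtype=np.float64))
    sum_kernel[128, 256](d_flat, d_out)  # 128 blocks of 256 threads (assumed)
    return float(d_out.copy_to_host()[0])


x = np.random.default_rng(0).random((200, 300)).astype(np.float32)
print(f"gpu_sum = {gpu_sum(x):.3f}, numpy = {x.sum(dtype=np.float64):.3f}")
```

Keeping the data on the device between launches, rather than passing NumPy arrays each time, avoids repeating the host-to-device copy on every call.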


In [18]:
test.sum()

49996884.0

In [19]:

# copy the array to the GPU as a CuPy array
input_gpu = cp.asarray(test)
# time a single run; the host-to-device copy above is outside the timed call
execution_gpu = benchmark(cp.sum, (input_gpu,), n_repeat=1)
gpu_avg_time_1 = np.average(execution_gpu.gpu_times)

print(f"GPU Time {gpu_avg_time_1:.6f} seconds")
print(f"GPU is  {(extime1/gpu_avg_time_1):.6f} faster")

GPU Time 0.001561 seconds
GPU is  15985.841514 faster
