# Intro to CuPy

Query information about the GPUs available.

In [1]:
!nvidia-smi

Fri Jun 20 16:43:47 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A40                     On  |   00000000:21:00.0 Off |                    0 |
|  0%   29C    P0             69W /  300W |     271MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     On  |   00

Query information about the CPUs available.

In [2]:
!lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7313 16-Core Processor
Stepping:            1
CPU MHz:             2994.624
BogoMIPS:            5989.24
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            32768K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy sv

## CuPy vs NumPy

CuPy (CUDA Python) has very similar syntax to NumPy (Numerical Python).

While NumPy arrays are stored on the CPU, CuPy arrays are stored on the GPU.

In [3]:
import numpy as np
import cupy as cp

print(f'cupy version: {cp.__version__}')
print(f'numpy version: {np.__version__}')

cupy version: 13.4.1
numpy version: 2.2.6


In [4]:
size = 2048

# Initializes a random 2048x2048 matrix on the CPU
A_cpu = np.random.rand(size, size)#.astype(np.float64)

# Initializes a random 2048x2048 matrix on the GPU
A_gpu = cp.random.rand(size, size)#.astype(np.float64)

In [5]:
print(A_cpu.dtype)
print(A_gpu.dtype)

float64
float64


NumPy arrays can be changed into CuPy arrays by copying them from the CPU to the GPU, and vice versa. This conversion is not implicit, so you can't apply CuPy operations on NumPy arrays without copying them over first.

In [6]:
# Array is initialized on the CPU
B_cpu = np.random.randn(size, size)
print(f"B_cpu type: {type(B_cpu)}")

# Copy array from CPU(host) —> GPU(device)
B_gpu = cp.asarray(B_cpu)
print(f"B_gpu type: {type(B_gpu)}")

# Apply calculations on the GPU
B_gpu = cp.sin(B_gpu)

# Copy array from GPU(device) —> CPU(host)
B_cpu = cp.asnumpy(B_gpu) 
print(f"B_cpu type: {type(B_cpu)}")

B_cpu type: <class 'numpy.ndarray'>
B_gpu type: <class 'cupy.ndarray'>
B_cpu type: <class 'numpy.ndarray'>


In [7]:
# Cannot do:
cp.sin(B_cpu)

TypeError: Unsupported type <class 'numpy.ndarray'>
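
Because the conversion is never implicit, one common pattern is to pick the array module from the input and write the function once. The following is a minimal sketch, not part of the original notebook: the helper names `array_module` and `sin_squared` are hypothetical, and the `try/except` fallback is an assumption added so the example also runs where CuPy is not installed.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:          # assumption: fall back to NumPy-only operation
    cp = None


def array_module(x):
    """Return cupy for CuPy arrays and numpy for everything else."""
    if cp is not None:
        return cp.get_array_module(x)
    return np


def sin_squared(x):
    xp = array_module(x)     # numpy or cupy, matching where x lives
    return xp.sin(x) ** 2


x_cpu = np.linspace(0.0, 1.0, 5)
y_cpu = sin_squared(x_cpu)   # stays on the CPU, no copy to the GPU
print(type(y_cpu))
```

Passing a CuPy array instead would take the `cupy` branch and keep the result on the GPU.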

CuPy also lets us work with data on multiple GPUs. As with host/device transfers, data has to be copied explicitly from one GPU to another.

In [8]:
# (Will only work if you have more than 1 GPU)

# Create array on GPU 1
with cp.cuda.Device(1):
    C_gpu1 = cp.zeros((size, size), dtype=cp.float64)

# Copy array from GPU 1 —> GPU 0
with cp.cuda.Device(0): # not necessary, default device is 0
    C_gpu0 = cp.asarray(C_gpu1)

Since operations on CuPy arrays are done on the GPU, they can be much faster than NumPy operations on the CPU, especially for dense linear algebra on large matrices.

Note: CuPy launches GPU work asynchronously, so `cp.cuda.Device().synchronize()` is called here to make sure the GPU operations have finished before the timer stops. Outside of timing code it is usually not necessary.
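
The timing pattern can be wrapped in a small helper. This is a sketch under stated assumptions: `time_call` and its `sync` parameter are hypothetical names, and the demonstration uses a CPU-only workload so that it runs without a GPU. CuPy also provides `cupyx.profiler.benchmark`, which takes care of synchronization itself.

```python
import time


def time_call(fn, sync=None, repeats=5):
    """Return the best wall-clock time (seconds) of fn() over `repeats` runs.

    `sync` is an optional zero-argument callable that is run before the clock
    starts and again before it stops, so asynchronous work is fully counted.
    """
    best = float("inf")
    for _ in range(repeats):
        if sync is not None:
            sync()                      # drain work queued earlier
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                      # wait for the work launched by fn()
        best = min(best, time.perf_counter() - start)
    return best


# CPU-only demonstration so the sketch runs anywhere.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed * 1e3:.3f} ms")

# With CuPy, assuming A_gpu and B_gpu from the cells above:
# time_call(lambda: cp.matmul(A_gpu, B_gpu),
#           sync=cp.cuda.Device().synchronize)
```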

In [9]:
# NumPy matrix multiplication
%timeit -n 5 C_cpu = np.matmul(A_cpu, B_cpu);

31.2 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [10]:
# CuPy matrix multiplication
%timeit -n 5 C_gpu = cp.matmul(A_gpu, B_gpu); cp.cuda.Device().synchronize()

37.2 ms ± 4.98 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [11]:
# WAIT!!! Let's rerun this!
%timeit -n 5 C_gpu = cp.matmul(A_gpu, B_gpu); cp.cuda.Device().synchronize()

35.2 ms ± 23.1 μs per loop (mean ± std. dev. of 7 runs, 5 loops each)


## Overhead

There are 2 types of overhead to keep in mind when using the GPU with CuPy: **kernel overhead** and **data movement overhead**.

### Kernel & Launch Overheads in CuPy

| Overhead Type              | Description                                                                                                     | File Format       | Location                        | Approx. Latency |
|----------------------------|-----------------------------------------------------------------------------------------------------------------|-------------------|---------------------------------|-----------------|
| **First-call JIT compile** | JIT-compiles your kernel via NVRTC/nvcc on first invocation → noticeable latency                                  | PTX → CUBIN       | *(in RAM until persisted)*      | ~50–200 ms      |
| **In-process cache**       | Keeps the compiled kernel in memory for the life of your Python process → instant subsequent calls              | N/A               | RAM                             | ~0 ms           |
| **Persistent disk cache**  | Loads the cached CUBIN from disk so new processes skip recompilation                                             | CUBIN             | `~/.cupy/kernel_cache`          | ~5–10 ms        |
| **Driver compute cache**   | NVIDIA driver JIT-compiles embedded PTX and stores device binaries for all CUDA apps                             | CUBIN             | `~/.nv/ComputeCache`            | ~5–10 ms        |
| **Kernel launch latency**  | Each kernel launch has a fixed dispatch overhead, independent of JIT → amortizes with larger grid/block sizes   | N/A               | N/A                             | ~2–20 µs        |
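
To see whether the persistent disk cache has been populated, the cache directory can be inspected from Python. A minimal sketch, assuming the default location from the table and the `CUPY_CACHE_DIR` environment variable that CuPy reads to override it; it only lists files and does not touch the GPU.

```python
import os
from pathlib import Path

# Default location from the table, overridable through CUPY_CACHE_DIR.
cache_dir = Path(
    os.environ.get("CUPY_CACHE_DIR", "~/.cupy/kernel_cache")
).expanduser()

if cache_dir.is_dir():
    files = [p for p in cache_dir.iterdir() if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    print(f"{len(files)} cached files, {total_bytes / 2**20:.2f} MiB in {cache_dir}")
else:
    print(f"No kernel cache directory at {cache_dir} yet")
```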

> **Note:** Cache files may take a moment to appear due to OS write buffering.


In [12]:
size = 256

for i in range(4):
    D_gpu  = cp.random.rand(size,size)#.astype(np.float64)
    Dh_gpu = 0.5*(D_gpu+D_gpu.T) 
    %time cp.linalg.eigh(Dh_gpu); cp.cuda.Device().synchronize() 

#Note: `cp.linalg.eig` is coming to the next version!

CPU times: user 21.2 ms, sys: 13 ms, total: 34.1 ms
Wall time: 36.1 ms
CPU times: user 8.55 ms, sys: 16 μs, total: 8.57 ms
Wall time: 8.6 ms
CPU times: user 8.57 ms, sys: 0 ns, total: 8.57 ms
Wall time: 8.59 ms
CPU times: user 8.54 ms, sys: 0 ns, total: 8.54 ms
Wall time: 8.57 ms


- Wall time is the elapsed "clock on the wall" duration.
- CPU time is the total CPU work (`user` + `sys`) summed across all cores/threads, so when more than one core is busy, `CPU time > Wall time`.
  - For `i = 1, 2, 3`, wall and CPU times are both about 8.6 ms with almost no `sys` time; at `i = 0` both are about four times larger.
  - The first call triggers JIT compilation, which involves a lot of I/O, memory mapping, and driver interaction; this shows up as `sys` time (13 ms at `i = 0`).
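
The difference between the two clocks can be reproduced without a GPU. The sketch below uses only the standard library (`time.perf_counter` for wall time, `time.process_time` for CPU time); sleeping stands in for any period in which the process waits rather than computes.

```python
import time

wall_start = time.perf_counter()   # wall clock
cpu_start = time.process_time()    # user + sys CPU time of this process

time.sleep(0.2)                    # waiting: the wall clock advances, the CPU is idle

wall = time.perf_counter() - wall_start
cpu_used = time.process_time() - cpu_start
print(f"wall: {wall * 1e3:.1f} ms, CPU: {cpu_used * 1e3:.1f} ms")
```

Here the wall time is dominated by waiting, so the CPU time stays far below it.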

There is also a CUDA kernel launch overhead of a couple of microseconds every time a new GPU kernel is launched. This overhead is amortized by larger problem sizes.
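
A toy model makes the amortization concrete. Both constants below are hypothetical round numbers chosen for illustration, not measurements from this machine, and the model assumes that compute time grows linearly with the number of elements.

```python
LAUNCH_OVERHEAD_S = 10e-6        # assumed fixed cost per kernel launch (hypothetical)
THROUGHPUT_ELEMS_PER_S = 1e10    # assumed processing rate (hypothetical)


def overhead_fraction(n_elements):
    """Fraction of the total kernel time spent on the fixed launch cost."""
    compute_s = n_elements / THROUGHPUT_ELEMS_PER_S
    return LAUNCH_OVERHEAD_S / (LAUNCH_OVERHEAD_S + compute_s)


for n in (10**3, 10**5, 10**7, 10**9):
    print(f"n = {n:>13,d}: launch overhead is {overhead_fraction(n):6.1%} of the total")
```

Under these assumptions the fixed cost dominates for small arrays and becomes negligible for large ones.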

In [13]:
for size in [128, 256, 512, 1024]:
    print(f"\nArray size {size}x{size}")
    
    # NumPy
    print("- NumPy time")
    E_cpu = np.random.rand(size,size).astype(np.float64)
    %time np.linalg.eigh(E_cpu);

    # CuPy
    print("- CuPy time")
    E_gpu = cp.random.rand(size,size).astype(np.float64)
    cp.linalg.eigh(E_gpu); #isolate out JIT compilation overhead
    %time cp.linalg.eigh(E_gpu); cp.cuda.Device().synchronize()
    
    print()


Array size 128x128
- NumPy time
CPU times: user 24.7 ms, sys: 573 μs, total: 25.3 ms
Wall time: 2.44 ms
- CuPy time
CPU times: user 119 ms, sys: 4 μs, total: 119 ms
Wall time: 6.06 ms


Array size 256x256
- NumPy time
CPU times: user 179 ms, sys: 5 μs, total: 179 ms
Wall time: 9.71 ms
- CuPy time
CPU times: user 338 ms, sys: 0 ns, total: 338 ms
Wall time: 16.7 ms


Array size 512x512
- NumPy time
CPU times: user 996 ms, sys: 6 μs, total: 996 ms
Wall time: 49.9 ms
- CuPy time
CPU times: user 657 ms, sys: 19 μs, total: 657 ms
Wall time: 32.7 ms


Array size 1024x1024
- NumPy time
CPU times: user 3.69 s, sys: 8.01 ms, total: 3.7 s
Wall time: 185 ms
- CuPy time
CPU times: user 761 ms, sys: 641 μs, total: 762 ms
Wall time: 45.3 ms



---
**WAIT!** Why is `CPU time` now larger than `Wall time`?
- The `user` + `sys` CPU time is summed across threads. For the 1024x1024 NumPy run above, 3.7 s of CPU time against 185 ms of wall time is a ratio of about 20, which is what you would see if roughly 20 threads were busy at once during the factorization.
- This method cannot measure the *pure* GPU time. For the CuPy calls, `%time` reports the CPU-side work that wraps the GPU work.

Why do we see the same behavior (CPU time well above wall time, about 17x at 1024x1024) when running the `cupy` function?
<details>
    Multiple host threads are involved here too, but they are not doing the eigendecomposition themselves. They are either:
	•	spinning in the driver’s synchronization routine, or
	•	participating in host-side parallel work (e.g., multi-threaded marshalling)
Either way, IPython sums all their CPU usage, yielding the inflated number.
</details>
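
The opposite case, CPU time exceeding wall time, can be reproduced on the CPU alone. This sketch uses four threads hashing a large buffer; `hashlib` releases the GIL for large inputs, so the threads can run on separate cores. The buffer size and thread count are arbitrary choices for illustration.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

data = bytes(32 * 1024 * 1024)     # 32 MiB of zeros


def work(_):
    # hashlib releases the GIL for large inputs, so threads can run in parallel
    return hashlib.sha256(data).hexdigest()


wall_start = time.perf_counter()
cpu_start = time.process_time()
with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(work, range(4)))
wall = time.perf_counter() - wall_start
cpu_used = time.process_time() - cpu_start

print(f"wall: {wall:.3f} s, CPU: {cpu_used:.3f} s, ratio: {cpu_used / wall:.1f}")
```

On a multi-core machine the ratio typically comes out above 1, because the CPU time of all threads is added together while the wall clock only runs once.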

---
The CUDA kernel launch overhead can also be reduced by fusing multiple kernels into one. With the `@cupy.fuse` decorator, the fused version below runs faster than the unfused one because it launches a single kernel instead of one kernel per elementwise operation.

In [14]:
def double_multiply(x, y):
    return 2*x*y

@cp.fuse
def double_multiply_fused(x,y):
    return 2*x*y

In [15]:
size = 2**16
F1 = cp.random.rand(size)
F2 = cp.random.rand(size)

double_multiply(F1, F2) #isolate out JIT compilation overhead
%timeit -n 7 double_multiply(F1, F2); cp.cuda.Device().synchronize()

double_multiply_fused(F1, F2) #isolate out JIT compilation overhead
%timeit -n 7 double_multiply_fused(F1, F2); cp.cuda.Device().synchronize()

30.7 μs ± 3.71 μs per loop (mean ± std. dev. of 7 runs, 7 loops each)
16.7 μs ± 2.02 μs per loop (mean ± std. dev. of 7 runs, 7 loops each)


### Data Movement Overhead

Transferring data between the CPU and the GPU is slower than processing the data on the GPU, so minimizing data movement in or out of the GPU is best for performance.
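
A back-of-the-envelope estimate shows how much data the transfer cell below moves. The bandwidth and per-copy latency are hypothetical placeholders, not measured values for this system; only the byte count follows directly from the array sizes used in this notebook.

```python
size = 2**16                      # vector length used in the cells of this section
bytes_per_element = 8             # float64
n_arrays = 2                      # G and H

total_bytes = n_arrays * size * bytes_per_element

ASSUMED_BANDWIDTH_B_PER_S = 20e9  # hypothetical host-to-device rate
ASSUMED_LATENCY_S = 10e-6         # hypothetical fixed cost per copy

transfer_s = n_arrays * ASSUMED_LATENCY_S + total_bytes / ASSUMED_BANDWIDTH_B_PER_S
print(f"{total_bytes / 2**20:.1f} MiB to copy, "
      f"roughly {transfer_s * 1e6:.0f} µs under these assumptions")
```

For transfers this small, the fixed per-copy cost is a sizeable share of the estimate, which is one reason to batch data into fewer, larger copies.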

In [16]:
import time

In [17]:
# All data and operations on CPU

times = []
for i in range(10):
    start = time.perf_counter()
    
    G_cpu = np.random.rand(size).astype(np.float64)
    H_cpu = np.random.rand(size).astype(np.float64)
    np.vdot(H_cpu, G_cpu);
    
    times.append(time.perf_counter() - start)

print(f"All CPU takes on average {np.mean(times[-9:])*1000} ms")

All CPU takes on average 1.2337859047369824 ms


In [18]:
# All data and operations on GPU

times = []
for i in range(10):
    start = time.perf_counter()
    
    G_gpu = cp.random.rand(size).astype(np.float64)
    H_gpu = cp.random.rand(size).astype(np.float64)
    cp.vdot(H_gpu, G_gpu)
    cp.cuda.Device().synchronize()
    
    times.append(time.perf_counter() - start)

print(f"All GPU takes on average {np.mean(times[-9:])*1000} ms")

All GPU takes on average 0.16291385206083456 ms


In [19]:
# Transfer data from CPU to GPU to operate on GPU

times = []
for i in range(10):
    start = time.perf_counter()
    
    G_gpu = cp.asarray(G_cpu)
    H_gpu = cp.asarray(H_cpu)
    cp.vdot(H_gpu, G_gpu)
    cp.cuda.Device().synchronize()

    times.append(time.perf_counter() - start)

print(f"CPU —> GPU takes on average {np.mean(times[-9:])*1000} ms")

CPU —> GPU takes on average 0.5159224124832286 ms


## GPU Memory Management

Query the free and total memory with `nvidia-smi` shell commands or in Python using CuPy.

In [34]:
!nvidia-smi -i 0 --query-gpu=memory.free,memory.total --format=csv

memory.free [MiB], memory.total [MiB]
45008 MiB, 46068 MiB


In [35]:
print("(memory free, memory total) in bytes:")
print(cp.cuda.Device().mem_info)

(memory free, memory total) in bytes:
(47194177536, 47729344512)


If you try to allocate more memory than the GPU has available, CuPy raises an `OutOfMemoryError`.

In [36]:
size = 2**16
I_gpu = cp.zeros((size, size))
J_gpu = cp.zeros((size, size)) 

OutOfMemoryError: Out of memory allocating 34,359,738,368 bytes (allocated so far: 34,493,956,096 bytes).
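
The numbers in the error message can be checked by hand. The sketch below uses only values shown in this notebook (the array shape, the default `float64` dtype of `cp.zeros`, and the `memory.total` reported by `nvidia-smi`).

```python
size = 2**16
bytes_per_element = 8                          # cp.zeros defaults to float64
array_bytes = size * size * bytes_per_element
gpu_total_bytes = 46068 * 2**20                # memory.total from nvidia-smi, in bytes

print(f"one array : {array_bytes:,} bytes = {array_bytes / 2**30:.0f} GiB")
print(f"two arrays: {2 * array_bytes / 2**30:.0f} GiB")
print(f"GPU total : {gpu_total_bytes / 2**30:.1f} GiB")
print("both fit  :", 2 * array_bytes <= gpu_total_bytes)
```

One array fits on the device, but the second allocation pushes the total past the GPU's capacity.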

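Release the unused blocks held by CuPy's memory pool. Arrays that are still referenced (here `I_gpu`, about 32 GiB) stay allocated.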

In [37]:
cp.get_default_memory_pool().free_all_blocks()

In [38]:
!nvidia-smi -i 0 --query-gpu=memory.free,memory.total --format=csv

memory.free [MiB], memory.total [MiB]
12306 MiB, 46068 MiB


One way to work around an `OutOfMemoryError` is unified (managed) memory, where CUDA migrates pages between CPU and GPU memory on demand, whenever a page fault occurs.

In [39]:
cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

size = 2**16
I_gpu = cp.zeros((size, size))
J_gpu = cp.zeros((size, size))
# works with unified memory

Operations on these arrays can be slower due to the GPU moving pages in and out of its memory.

In [40]:
%time
cp.multiply(I_gpu, J_gpu)
cp.cuda.Device().synchronize()

CPU times: user 5 μs, sys: 0 ns, total: 5 μs
Wall time: 7.39 μs

Note: a bare `%time` on its own line times an empty statement, so the roughly 7 μs reported above does not include the multiplication. To time the whole cell, put the `%%time` cell magic on its first line.


### Bonus Overhead:
There is also an overhead on the **very first CuPy call of a program**, caused by GPU warm-up, memory pool allocation, and the CUDA driver creating the CUDA context (conceptually, a container that bundles all the GPU-side state in one place).

Different kernels in one program share the same context, but different programs on the same GPU each have their own context.

## Cleanup

In [41]:
# restart the kernel
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}