Summary
GPUArray.transpose() and GPUArray.reshape() currently fall back to the CPU (numpy) in all cases, incurring GPU→CPU→GPU data-transfer overhead on every call.
Current Implementation
def transpose(self, *axes: int) -> GPUArray:
    np_data = self.to_numpy()          # GPU→CPU copy
    result = np_data.transpose(*axes)  # strided view on host
    return from_numpy(result.copy())   # make contiguous, CPU→GPU copy
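As a side note on the fallback, the `.copy()` is load-bearing: numpy's transpose returns a strided view, and a device upload generally needs a contiguous buffer. A small numpy check:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
t = a.transpose()                     # strided view, no data movement
assert np.shares_memory(t, a)
assert not t.flags["C_CONTIGUOUS"]    # transposed strides

c = t.copy()                          # materialize a contiguous buffer
assert c.flags["C_CONTIGUOUS"]        # safe to hand to a device upload
```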
Existing Native Implementations
- 2D transpose: ops.matmul.transpose() - native CUDA
- 3D (0,2,1): ops.tensor.transpose_3d_021() - native CUDA
- 4D transpose: not implemented
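For reference, the semantics of the 3D (0,2,1) kernel can be sketched in numpy (the helper name here is illustrative, not the library's API):

```python
import numpy as np

# Illustrative numpy reference for what the native transpose_3d_021
# kernel computes: out[b, j, i] = in[b, i, j] for each batch slice b.
def transpose_3d_021_ref(x: np.ndarray) -> np.ndarray:
    assert x.ndim == 3
    return np.ascontiguousarray(x.transpose(0, 2, 1))

x = np.arange(24).reshape(2, 3, 4)
y = transpose_3d_021_ref(x)
assert y.shape == (2, 4, 3)
assert y[1, 2, 0] == x[1, 0, 2]
```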
Proposed Changes
- GPUArray.transpose(): use native implementations when available
  - 2D: delegate to matmul.transpose()
  - 3D with axes (0,2,1): delegate to transpose_3d_021()
  - Other cases: CPU fallback (or implement new kernels)
- GPUArray.reshape():
  - For contiguous reshapes, return a zero-copy view via narrow()
  - For non-contiguous arrays, CPU fallback
- New native kernels (optional):
  - transpose_4d_0213 for the attention Q/K/V reshape
  - transpose_4d_0132 for K^T in attention
  - A general transpose_nd kernel
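The transpose dispatch above can be sketched as a pure routing function (the path names mirror the kernels listed; how this wires into GPUArray.transpose() is an assumption):

```python
def choose_transpose_path(ndim: int, axes: tuple) -> str:
    """Pick a backend for a given rank and axis permutation.

    Path names mirror the kernels listed above; the actual call
    signatures are assumptions, not the library's API.
    """
    # No axes given means full reversal, numpy-style.
    axes = tuple(axes) if axes else tuple(reversed(range(ndim)))
    if ndim == 2 and axes == (1, 0):
        return "matmul.transpose"          # native CUDA
    if ndim == 3 and axes == (0, 2, 1):
        return "tensor.transpose_3d_021"   # native CUDA
    return "cpu_fallback"                  # GPU→CPU→GPU round trip

assert choose_transpose_path(2, ()) == "matmul.transpose"
assert choose_transpose_path(3, (0, 2, 1)) == "tensor.transpose_3d_021"
assert choose_transpose_path(4, (0, 2, 1, 3)) == "cpu_fallback"
```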
Use Case
Whisper encoder/decoder attention uses 4D transposes heavily:
q = q.transpose(0, 2, 1, 3) # [batch, seq, heads, dim] → [batch, heads, seq, dim]
k.transpose(0, 1, 3, 2) # For K^T in attention scores
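A numpy illustration of those two permutations (shapes follow the comments above; the batched matmul is just for shape checking):

```python
import numpy as np

batch, seq, heads, dim = 2, 5, 4, 8
q = np.random.rand(batch, seq, heads, dim)   # [batch, seq, heads, dim]
k = np.random.rand(batch, heads, seq, dim)   # [batch, heads, seq, dim]

q_t = q.transpose(0, 2, 1, 3)   # (0,2,1,3) -> [batch, heads, seq, dim]
k_t = k.transpose(0, 1, 3, 2)   # (0,1,3,2) -> [batch, heads, dim, seq], K^T

scores = q_t @ k_t              # [batch, heads, seq, seq] attention scores
assert q_t.shape == (batch, heads, seq, dim)
assert scores.shape == (batch, heads, seq, seq)
```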
Priority
Medium - Current CPU fallback works but adds latency for real-time ASR.