perf: Add native GPU implementation for GPUArray.transpose() and reshape() #106

@m96-chan

Description

Summary

GPUArray.transpose() and GPUArray.reshape() currently fall back to the CPU (NumPy) in all cases, which incurs GPU→CPU→GPU data-transfer overhead.

Current Implementation

def transpose(self, *axes: int) -> GPUArray:
    np_data = self.to_numpy()           # GPU→CPU transfer
    result = np_data.transpose(*axes)   # NumPy returns a strided view, not a copy
    return from_numpy(result.copy())    # materialize the view, then CPU→GPU transfer

Existing Native Implementations

  • 2D transpose: ops.matmul.transpose() - native CUDA
  • 3D (0,2,1): ops.tensor.transpose_3d_021() - native CUDA
  • 4D transpose: Not implemented

Proposed Changes

  1. GPUArray.transpose(): Use native implementations when available

    • 2D: delegate to matmul.transpose()
    • 3D with axes (0,2,1): delegate to transpose_3d_021()
    • Other cases: CPU fallback (or implement new kernels)
  2. GPUArray.reshape():

    • For contiguous reshapes, use zero-copy view via narrow()
    • For non-contiguous, CPU fallback
  3. New native kernels (optional):

    • transpose_4d_0213 for attention Q/K/V reshape
    • transpose_4d_0132 for K^T in attention
    • General transpose_nd kernel
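The dispatch and the contiguity check above can be sketched in plain Python. This is an illustrative sketch only: `pick_transpose_path` and `is_c_contiguous` are hypothetical helper names, and the mapping to `ops.matmul.transpose()` / `ops.tensor.transpose_3d_021()` follows the list of existing native implementations above.

```python
def pick_transpose_path(ndim: int, axes: tuple) -> str:
    """Choose the fastest available transpose path (hypothetical helper)."""
    if not axes:
        # No axes given: reverse all axes, matching NumPy semantics.
        axes = tuple(range(ndim - 1, -1, -1))
    if ndim == 2:
        return "native_2d"        # delegate to ops.matmul.transpose()
    if ndim == 3 and axes == (0, 2, 1):
        return "native_3d_021"    # delegate to ops.tensor.transpose_3d_021()
    return "cpu_fallback"         # GPU→CPU→GPU round trip


def is_c_contiguous(shape: tuple, strides: tuple, itemsize: int) -> bool:
    """C-contiguity test on byte strides; a contiguous reshape can be a zero-copy view."""
    expected = itemsize
    for dim, stride in zip(reversed(shape), reversed(strides)):
        if dim > 1 and stride != expected:
            return False
        expected *= dim
    return True
```

`reshape()` would consult `is_c_contiguous` first and only fall back to the CPU path when the check fails.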

Use Case

Whisper encoder/decoder attention uses 4D transposes heavily:

q = q.transpose(0, 2, 1, 3)  # [batch, seq, heads, dim] → [batch, heads, seq, dim]
k = k.transpose(0, 1, 3, 2)  # For K^T in attention scores
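As a sanity check on those axis orders (pure NumPy, independent of any GPU path; the shapes are made-up examples):

```python
import numpy as np

batch, seq, heads, dim = 2, 5, 4, 8
q = np.zeros((batch, seq, heads, dim))

# (0, 2, 1, 3): [batch, seq, heads, dim] → [batch, heads, seq, dim]
assert q.transpose(0, 2, 1, 3).shape == (batch, heads, seq, dim)

k = np.zeros((batch, heads, seq, dim))
# (0, 1, 3, 2): swap the last two axes, giving K^T per head
assert k.transpose(0, 1, 3, 2).shape == (batch, heads, dim, seq)
```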

Priority

Medium - the current CPU fallback works, but it adds latency to real-time ASR.
