ENH: Native bfloat16 (bf16) dtype support for unified memory AI workloads #30659

@bkpaine1

Description

Proposed new feature or change:

Summary

As unified memory architectures become mainstream for AI inference (AMD APUs, Apple Silicon, NVIDIA Grace Hopper, upcoming DGX Spark), the lack of native bfloat16 support in numpy is causing systematic failures at GPU↔CPU boundaries.

The Problem

Modern AI models output bfloat16 tensors for memory efficiency. On discrete GPUs, the explicit .cpu() copy across PCIe often triggers implicit dtype conversion, masking the issue.

On unified memory architectures, GPU and CPU share the same memory space - no copy occurs. When code calls .numpy() on a bf16 tensor:

# Works on discrete GPU (implicit conversion during PCIe copy)
# Crashes on unified memory (tensor already visible to CPU)
waveform = audio_tensor.squeeze(0).numpy()

Error:

TypeError: Got unsupported ScalarType BFloat16

Affected Architectures

This impacts ALL unified memory systems:

| Platform | Status | Use Case |
| --- | --- | --- |
| AMD Strix Halo APU | Shipping now | 128 GB unified memory, AI workstations |
| Apple Silicon M1-M4 | Shipping | Mac AI/ML workflows |
| NVIDIA Grace Hopper | Shipping | Datacenter AI |
| NVIDIA DGX Spark | Coming 2025 | Desktop AI workstation |
| Intel Arc w/ unified memory | Shipping | Consumer AI |

Current Workaround

Every GPU→numpy boundary requires explicit conversion:

# Before (breaks on unified memory)
array = tensor.numpy()

# After (works everywhere)
array = tensor.float().cpu().numpy()

This requires patching dozens of libraries: torchaudio, librosa, soundfile integrations, video I/O, etc.
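Rather than patching each call site individually, the conversion can be centralized in a single helper. A minimal sketch, assuming a PyTorch-style tensor API (`.float()`, `.cpu()`, `.numpy()`); the `safe_numpy` name and the duck-typed dtype check are illustrative, not an existing API:

```python
import numpy as np

def safe_numpy(tensor):
    """Convert a tensor-like object to a NumPy array, upcasting
    bfloat16 (which NumPy cannot represent) to float32 first."""
    dtype = getattr(tensor, "dtype", None)
    if dtype is not None and "bfloat16" in str(dtype):
        # torch.Tensor path: upcast, then move to host memory
        tensor = tensor.float().cpu()
    if hasattr(tensor, "numpy"):
        return tensor.numpy()
    # Plain sequences and ndarrays pass straight through
    return np.asarray(tensor)
```

Used as a drop-in replacement for `tensor.numpy()` at GPU→numpy boundaries, this reduces the fix to one import per library instead of one edit per call site.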

Real-World Impact

In a single ComfyUI AI video pipeline, I've had to patch:

  • torchaudio resampler
  • VibeVoice TTS output
  • LatentSync lip-sync
  • Wav2Lip lip-sync
  • Video frame I/O

Each library assumes numpy can handle whatever dtype PyTorch gives it. On unified memory, this assumption fails.

Proposed Solution

Add native bfloat16 dtype to numpy:

import numpy as np

# Proposed
arr = np.array([1.0, 2.0, 3.0], dtype=np.bfloat16)

# Or at minimum, automatic conversion on array creation
bf16_tensor = torch.tensor([1.0], dtype=torch.bfloat16, device='cuda')
arr = bf16_tensor.numpy()  # Should auto-convert to float32 with warning

Alternatives Considered

  1. Do nothing - Force every AI library to add .float() calls (current state, not scalable)
  2. Warn and convert - numpy accepts bf16, warns, converts to float32 automatically
  3. Full support - Native bf16 dtype with arithmetic operations

Option 2 would solve 90% of the pain with minimal implementation effort.
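The semantics of option 2 can be approximated in user space today. A hedged sketch (the function name is hypothetical, and the fallback assumes a PyTorch-style tensor with `.float()` and `.cpu()`); for option 3, note that the `ml_dtypes` package already ships a bfloat16 NumPy dtype extension used by JAX:

```python
import warnings
import numpy as np

def asarray_bf16_aware(obj):
    """Sketch of option 2: attempt the normal conversion; on the
    bfloat16 TypeError, warn and fall back to float32."""
    try:
        return np.asarray(obj)
    except TypeError:
        warnings.warn(
            "bfloat16 is not supported by NumPy; converting to float32",
            stacklevel=2,
        )
        # Hypothetical torch-style fallback: upcast, move to host
        return np.asarray(obj.float().cpu())
```

Building this behavior into numpy itself would make the conversion automatic at every `np.asarray`/`np.array` boundary rather than requiring each library to adopt a wrapper.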

Context

bfloat16 was designed by Google Brain specifically for ML workloads. It's now the default output dtype for:

  • Hugging Face Transformers
  • PyTorch model inference
  • Most modern AI models

As unified memory becomes the standard for AI hardware (it eliminates redundant host/device copies and duplicated buffers), this gap between PyTorch's bf16 support and numpy's lack thereof will cause increasing friction.
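Part of why options 2 and 3 are cheap to implement: bfloat16 is by design the upper 16 bits of an IEEE-754 float32 (same sign bit and 8-bit exponent, mantissa truncated to 7 bits), so widening to float32 is a lossless zero-pad. A pure-NumPy sketch over raw bit patterns (the function name is illustrative):

```python
import numpy as np

def bf16_bits_to_float32(bits):
    """Widen raw bfloat16 bit patterns to exact float32 values.

    Appending 16 zero bits to a bfloat16 pattern yields the
    float32 with the same sign, exponent, and (truncated) mantissa.
    """
    bits = np.asarray(bits, dtype=np.uint16)
    return (bits.astype(np.uint32) << 16).view(np.float32)
```

For example, the bfloat16 pattern `0x3F80` widens to `0x3F800000`, which is exactly 1.0 in float32.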

Environment

  • numpy: 1.26.x / 2.x
  • PyTorch: 2.5+ with ROCm/CUDA
  • Hardware: AMD Strix Halo (gfx1151), 128GB unified memory
  • OS: Ubuntu 24.04

The unified memory future is here. numpy's dtype support should reflect that reality.
