Proposed new feature or change:
Summary
As unified memory architectures become mainstream for AI inference (AMD APUs, Apple Silicon, NVIDIA Grace Hopper, upcoming DGX Spark), the lack of native bfloat16 support in numpy is causing systematic failures at GPU↔CPU boundaries.
The Problem
Modern AI models output bfloat16 tensors for memory efficiency. On discrete GPUs, the explicit .cpu() copy across PCIe often triggers implicit dtype conversion, masking the issue.
On unified memory architectures, GPU and CPU share the same memory space - no copy occurs. When code calls .numpy() on a bf16 tensor:
```python
# Works on a discrete GPU (implicit conversion during the PCIe copy)
# Crashes on unified memory (the tensor is already visible to the CPU)
waveform = audio_tensor.squeeze(0).numpy()
```
Error:
```
TypeError: Got unsupported ScalarType BFloat16
```
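For context on why a conversion path would be cheap: bfloat16 is exactly the upper 16 bits of an IEEE 754 float32, so stock numpy can already decode a raw bf16 buffer with a widening shift and a bit-level view. A minimal sketch (the bit patterns below are hand-picked for illustration, not taken from a real tensor):

```python
import numpy as np

# Raw bfloat16 bit patterns, e.g. bytes copied out of a bf16 buffer:
# 0x3F80 -> 1.0, 0x4000 -> 2.0, 0xC040 -> -3.0
raw = np.array([0x3F80, 0x4000, 0xC040], dtype=np.uint16)

# Widen each 16-bit pattern into the high half of a 32-bit word,
# then reinterpret those bits as float32.
f32 = (raw.astype(np.uint32) << 16).view(np.float32)
print(f32.tolist())  # [1.0, 2.0, -3.0]
```

This widening is lossless, which is why the "convert on the way out" options below are so attractive.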
Affected Architectures
This impacts ALL unified memory systems:
| Platform | Status | Use Case |
|---|---|---|
| AMD Strix Halo APU | Shipping now | 128GB unified, AI workstations |
| Apple Silicon M1-M4 | Shipping | Mac AI/ML workflows |
| NVIDIA Grace Hopper | Shipping | Datacenter AI |
| NVIDIA DGX Spark | Coming 2025 | Desktop AI workstation |
| Intel Arc w/ unified mem | Shipping | Consumer AI |
Current Workaround
Every GPU→numpy boundary requires explicit conversion:
```python
# Before (breaks on unified memory)
array = tensor.numpy()

# After (works everywhere)
array = tensor.float().cpu().numpy()
```
This requires patching dozens of libraries: torchaudio, librosa, soundfile integrations, video I/O, etc.
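Until numpy handles this, one pipeline-wide workaround is a small helper that upcasts before crossing the boundary. `safe_numpy` and the dtype set below are hypothetical names, not part of any library; the helper only assumes the tensor exposes `.dtype`, `.float()`, `.cpu()`, and `.numpy()`, as torch tensors do:

```python
# Dtype names PyTorch can emit that numpy cannot represent
# (illustrative; extend as needed, e.g. for float8 variants).
NUMPY_UNSUPPORTED = {"torch.bfloat16"}

def safe_numpy(tensor):
    """Convert a torch-like tensor to a numpy array, upcasting
    dtypes numpy does not understand to float32 first."""
    if str(tensor.dtype) in NUMPY_UNSUPPORTED:
        tensor = tensor.float()  # bf16 -> float32 is a lossless widening
    return tensor.cpu().numpy()
```

With a helper like this, every `tensor.numpy()` call site becomes `safe_numpy(tensor)` and behaves identically on discrete and unified-memory machines, but it still has to be patched into each library by hand.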
Real-World Impact
In a single ComfyUI AI video pipeline, I've had to patch:
- torchaudio resampler
- VibeVoice TTS output
- LatentSync lip-sync
- Wav2Lip lip-sync
- Video frame I/O
Each library assumes numpy can handle whatever dtype PyTorch gives it. On unified memory, this assumption fails.
Proposed Solution
Add a native bfloat16 dtype to numpy:

```python
import numpy as np
import torch

# Proposed: a first-class bf16 dtype
arr = np.array([1.0, 2.0, 3.0], dtype=np.bfloat16)

# Or, at minimum, automatic conversion on array creation:
bf16_tensor = torch.tensor([1.0], dtype=torch.bfloat16, device='cuda')
arr = bf16_tensor.numpy()  # should auto-convert to float32 with a warning
```
Alternatives Considered
- **Do nothing**: force every AI library to add `.float()` calls (current state, not scalable)
- **Warn and convert**: numpy accepts bf16, warns, and converts to float32 automatically
- **Full support**: a native bf16 dtype with arithmetic operations
Option 2 would solve 90% of the pain with minimal implementation effort.
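There is also prior art for full support: the third-party `ml_dtypes` package (maintained for the JAX/TensorFlow ecosystem) already registers bfloat16 as a numpy extension dtype, which suggests numpy's dtype machinery can carry it. A hedged sketch that uses `ml_dtypes` when installed and falls back to manual bit truncation otherwise:

```python
import numpy as np

try:
    from ml_dtypes import bfloat16  # third-party: pip install ml_dtypes
except ImportError:
    bfloat16 = None

vals = [1.0, 2.0, 3.0]

if bfloat16 is not None:
    # Extension-dtype path: behaves like a native numpy dtype.
    back = np.array(vals, dtype=bfloat16).astype(np.float32)
else:
    # Manual emulation: bf16 keeps only the top 16 bits of a float32.
    # Zeroing the low mantissa bits truncates; real bf16 conversion
    # rounds to nearest, which is identical for these exact values.
    bits = np.asarray(vals, dtype=np.float32).view(np.uint32)
    back = (bits & 0xFFFF0000).view(np.float32)

print(back.tolist())  # [1.0, 2.0, 3.0] (all exactly representable in bf16)
```

That an external package can do this today is evidence that option 3 is feasible; option 2 remains the cheaper first step.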
Context
bfloat16 was designed by Google Brain specifically for ML workloads. It's now the default output dtype for:
- Hugging Face Transformers
- PyTorch model inference
- Most modern AI models
As unified memory becomes the standard for AI hardware (it's more efficient), this gap between PyTorch's bf16 support and numpy's lack thereof will cause increasing friction.
Environment
- numpy: 1.26.x / 2.x
- PyTorch: 2.5+ with ROCm/CUDA
- Hardware: AMD Strix Halo (gfx1151), 128GB unified memory
- OS: Ubuntu 24.04
References
The unified memory future is here. numpy's dtype support should reflect that reality.