feat: Heterogeneous distributed training (CUDA + wgpu multi-node) #393

@noahgift

Description

SPEC-DIST-2026-001: Heterogeneous Distributed Training

Spec: docs/specifications/distributed-training-spec.md
Contract: provable-contracts/contracts/entrenar/distributed-training-v1.yaml
Status: Phase 1 complete (single-machine multi-GPU), Phase 2 in progress (multi-node)

Motivation

SSC v3 classifier training on the intel machine (2× Radeon Pro W5700X) was effectively non-functional: the wgpu compute path performed 14,000 CPU↔GPU round-trips per training step. The CUDA path, meanwhile, ran only on NVIDIA hardware, and no mechanism existed to combine the two machines' GPU resources.

Architecture

Data-parallel training across heterogeneous GPU backends:

Coordinator (intel:9000)
├── Worker 0: intel:gpu0 (wgpu, Radeon 8GB)
├── Worker 1: intel:gpu1 (wgpu, Radeon 8GB)
└── Worker 2: lambda:gpu0 (CUDA, RTX 4090 24GB)

  • Replicated model: each worker holds the full frozen Qwen3-4B plus LoRA adapters
  • Sharded data: each mini-batch is split across workers (F-DP-002)
  • CPU AllReduce: LoRA gradients are ~5.3 MB and are averaged over TCP (<53 ms over GbE)
  • Backend fallback: CUDA → wgpu → CPU (F-DP-004)
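The CPU AllReduce step above can be sketched as a plain gradient average with the Jidoka NaN halt (F-DP-003) folded in. This is an illustrative sketch, not the entrenar API; `average_gradients` is a hypothetical name, and gradients are assumed to arrive as flat `Vec<f32>` buffers.

```rust
// Sketch: average per-worker LoRA gradients on the CPU coordinator.
// Assumes worker_grads is non-empty and all gradients have equal length.
fn average_gradients(worker_grads: &[Vec<f32>]) -> Result<Vec<f32>, String> {
    let n = worker_grads.len() as f32;
    let len = worker_grads[0].len();
    let mut avg = vec![0.0f32; len];
    for g in worker_grads {
        assert_eq!(g.len(), len, "gradient shapes must match across workers");
        for (a, v) in avg.iter_mut().zip(g) {
            *a += *v / n;
        }
    }
    // Jidoka (F-DP-003): a NaN anywhere halts training rather than
    // silently corrupting the replicated weights.
    if avg.iter().any(|v| v.is_nan()) {
        return Err("NaN in averaged gradient; halting training".into());
    }
    Ok(avg)
}
```

At ~5.3 MB per gradient and roughly 100 MB/s of usable GbE bandwidth, the transfer alone stays under the quoted 53 ms budget.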

Contracts (7 invariants)

ID          Invariant                                        Severity
F-DP-001    All workers have identical weights after sync    P0
F-DP-002    Sharding covers all samples exactly once         P0
F-DP-003    NaN in gradient halts training (Jidoka)          P0
F-DP-004    Backend fallback always produces finite output   P0
F-DP-005    Multi-GPU loss within 1% of single-GPU           P1
F-DP-006    Training continues after worker disconnect       P1
F-WGPU-001  GPU matmul matches CPU within 1e-4               P0
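F-DP-002 (every sample assigned exactly once) is straightforward to satisfy with round-robin sharding. A minimal sketch, with an illustrative function name rather than the actual entrenar sharder:

```rust
// Sketch: round-robin assignment of sample indices to workers.
// Every index in 0..num_samples lands in exactly one shard (F-DP-002).
fn shard_indices(num_samples: usize, num_workers: usize) -> Vec<Vec<usize>> {
    let mut shards = vec![Vec::new(); num_workers];
    for i in 0..num_samples {
        shards[i % num_workers].push(i);
    }
    shards
}
```

The invariant check is then a set equality: the concatenation of all shards, sorted, must equal 0..num_samples with no duplicates or gaps.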

Implementation Phases

Phase 1 (COMPLETE): Single-machine multi-GPU

  • trueno: GpuDevice::new_with_adapter_index(), GpuCommandBatch::matmul(), GpuDevicePool
  • entrenar: ComputeDevice::Wgpu, WgpuForwardPass, DataParallelCoordinator
  • aprender: --gpus 0,1, --gpu-backend wgpu|cuda|auto

Phase 2 (IN PROGRESS): Multi-node heterogeneous training

  • entrenar: DistributedConfig, GradientServer, WorkerClient, NodeDiscovery
  • aprender: --nodes, --role coordinator|worker, --bind, --coordinator
  • forjar: recipes/gpu-training.yaml for environment provisioning
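One plausible shape for the Phase 2 configuration, mirroring the CLI flags (--role, --bind, --coordinator, --gpus, --gpu-backend). Field and type names here are assumptions for illustration, not the actual entrenar `DistributedConfig`:

```rust
// Hypothetical sketch of the distributed-training configuration.
#[derive(Debug)]
enum Role {
    Coordinator { bind: String, expect_workers: usize },
    Worker { coordinator: String },
}

#[derive(Debug)]
struct DistributedConfig {
    role: Role,
    gpus: Vec<usize>,
    backend: String, // "wgpu" | "cuda" | "auto", with fallback per F-DP-004
}
```

A worker on the lambda node, for example, would carry `Role::Worker { coordinator: "intel:9000" }` with `gpus: vec![0]` and `backend: "cuda"`.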

Phase 3 (FUTURE): DiLoCo local SGD (arXiv:2311.08105)

  • LocalSGDTrainer with H=500 inner steps before AllReduce (500× less communication)
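The DiLoCo schedule above reduces communication by syncing once per H inner steps rather than every step. A minimal sketch of that schedule, where `step` stands in for the real local LoRA update and the names are illustrative, not the `LocalSGDTrainer` API:

```rust
// Sketch: DiLoCo-style local SGD. Run H local optimizer steps between
// AllReduce rounds; returns the number of communication rounds used.
// With H = 500, 10_000 steps need 20 syncs instead of 10_000 (500x less).
fn train_local_sgd(total_steps: usize, h: usize, mut step: impl FnMut()) -> usize {
    let mut sync_rounds = 0;
    let mut done = 0;
    while done < total_steps {
        let inner = h.min(total_steps - done);
        for _ in 0..inner {
            step(); // local update, no network traffic
        }
        sync_rounds += 1; // AllReduce the accumulated delta; F-DP-001 holds after this
        done += inner;
    }
    sync_rounds
}
```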

Verification

# Single-machine multi-GPU (Phase 1, must complete step 1 within 5 min)
apr finetune --task classify --gpus 0,1 --gpu-backend wgpu \
  model_dir --data train.jsonl --num-classes 2 -o checkpoints/

# Multi-node (Phase 2)
# Coordinator:
apr finetune --task classify --role coordinator --bind 0.0.0.0:9000 \
  --expect-workers 3 --gpus 0,1 --gpu-backend wgpu model_dir --data train.jsonl

# Worker:
apr finetune --task classify --role worker --coordinator intel:9000 \
  --gpus 0 --gpu-backend cuda model_dir --data train.jsonl

Labels

  • enhancement
  • P1
  • training
