# SPEC-DIST-2026-001: Heterogeneous Distributed Training

- Spec: `docs/specifications/distributed-training-spec.md`
- Contract: `provable-contracts/contracts/entrenar/distributed-training-v1.yaml`
- Status: Phase 1 complete (single-machine multi-GPU), Phase 2 in progress (multi-node)
## Motivation
SSC v3 classifier training on the intel machine (2× Radeon Pro W5700X) was non-functional: the wgpu compute path performed 14,000 CPU↔GPU round-trips per step. Meanwhile, the CUDA path worked only on NVIDIA hardware, and no mechanism existed to combine both machines' GPU resources.
## Architecture

Data-parallel training across heterogeneous GPU backends:

```
Coordinator (intel:9000)
├── Worker 0: intel:gpu0 (wgpu, Radeon 8GB)
├── Worker 1: intel:gpu1 (wgpu, Radeon 8GB)
└── Worker 2: lambda:gpu0 (CUDA, RTX 4090 24GB)
```
- Replicated model: Each worker holds full frozen Qwen3-4B + LoRA adapters
- Sharded data: Mini-batch split across workers (F-DP-002)
- CPU AllReduce: LoRA gradients are ~5.3 MB, averaged over TCP (<53ms over GbE)
- Backend fallback: CUDA → wgpu → CPU (F-DP-004)
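The CPU AllReduce step above amounts to an element-wise average of each worker's flattened LoRA gradient, with a Jidoka-style finiteness check (F-DP-003) before the update is accepted. A minimal sketch, assuming flattened `f32` gradients; the function name is illustrative, not the actual entrenar API:

```rust
/// Average same-shaped gradient vectors from all workers (sum, then divide).
/// Halts on any non-finite value, per F-DP-003 (Jidoka).
fn allreduce_mean(worker_grads: &[Vec<f32>]) -> Vec<f32> {
    let n = worker_grads.len() as f32;
    let len = worker_grads[0].len();
    let mut avg = vec![0.0f32; len];
    for g in worker_grads {
        assert_eq!(g.len(), len, "workers must send same-shaped gradients");
        for (a, x) in avg.iter_mut().zip(g) {
            assert!(x.is_finite(), "F-DP-003: non-finite gradient halts training");
            *a += *x;
        }
    }
    for a in avg.iter_mut() {
        *a /= n;
    }
    avg
}

fn main() {
    // Three workers (2x wgpu + 1x CUDA), each contributing a 2-element gradient.
    let grads = vec![vec![1.0, 2.0], vec![2.0, 4.0], vec![3.0, 6.0]];
    println!("{:?}", allreduce_mean(&grads)); // [2.0, 4.0]
}
```

At ~5.3 MB per worker, shipping gradients over GbE keeps each round under the quoted ~53 ms, which is why a plain TCP AllReduce is viable here.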
## Contracts (7 invariants)
| ID | Invariant | Severity |
|---|---|---|
| F-DP-001 | All workers have identical weights after sync | P0 |
| F-DP-002 | Sharding covers all samples exactly once | P0 |
| F-DP-003 | NaN in gradient halts training (Jidoka) | P0 |
| F-DP-004 | Backend fallback always produces finite output | P0 |
| F-DP-005 | Multi-GPU loss within 1% of single-GPU | P1 |
| F-DP-006 | Training continues after worker disconnect | P1 |
| F-WGPU-001 | GPU matmul matches CPU within 1e-4 | P0 |
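F-DP-002 (every sample covered exactly once) can be sketched with a round-robin index assignment; this is illustrative, not the actual entrenar sharding code:

```rust
/// Assign each sample index in a mini-batch to exactly one worker (round-robin),
/// so the shards partition the batch: no gaps, no overlaps (F-DP-002).
fn shard_batch(batch_size: usize, num_workers: usize) -> Vec<Vec<usize>> {
    let mut shards = vec![Vec::new(); num_workers];
    for i in 0..batch_size {
        shards[i % num_workers].push(i);
    }
    shards
}

fn main() {
    let shards = shard_batch(8, 3);
    // Union of shards recovers every index 0..8 exactly once.
    let mut all: Vec<usize> = shards.concat();
    all.sort();
    assert_eq!(all, (0..8).collect::<Vec<_>>());
    println!("{:?}", shards); // [[0, 3, 6], [1, 4, 7], [2, 5]]
}
```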
## Implementation Phases

### Phase 1 (COMPLETE): Single-machine multi-GPU

- trueno: `GpuDevice::new_with_adapter_index()`, `GpuCommandBatch::matmul()`, `GpuDevicePool`
- entrenar: `ComputeDevice::Wgpu`, `WgpuForwardPass`, `DataParallelCoordinator`
- aprender: `--gpus 0,1`, `--gpu-backend wgpu|cuda|auto`
### Phase 2 (IN PROGRESS): Multi-node heterogeneous training

- entrenar: `DistributedConfig`, `GradientServer`, `WorkerClient`, `NodeDiscovery`
- aprender: `--nodes`, `--role coordinator|worker`, `--bind`, `--coordinator`
- forjar: `recipes/gpu-training.yaml` for environment provisioning
### Phase 3 (FUTURE): DiLoCo local SGD (arXiv:2311.08105)

- `LocalSGDTrainer` with H=500 inner steps before each AllReduce (500× less communication)
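The communication saving in the DiLoCo-style schedule comes from syncing only every H-th step. A sketch of the schedule's arithmetic, assuming a fixed sync period; `LocalSGDTrainer` and H=500 come from the issue, the rest is illustrative:

```rust
/// Count AllReduce rounds for a run of `total_steps` optimizer steps when
/// workers only synchronize after every `h` local (inner) steps.
fn sync_count(total_steps: usize, h: usize) -> usize {
    let mut syncs = 0;
    for step in 1..=total_steps {
        // Inner steps run purely locally; only every h-th step communicates.
        if step % h == 0 {
            syncs += 1;
        }
    }
    syncs
}

fn main() {
    // 10_000 steps with H=500: 20 AllReduce rounds instead of 10_000 (500x fewer).
    println!("{}", sync_count(10_000, 500)); // 20
}
```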
## References
- R1: Li et al. (2020). PyTorch DDP. arXiv:2006.15704
- R2: Dettmers et al. (2023). QLoRA. arXiv:2305.14314
- R3: Hu et al. (2021). LoRA. arXiv:2106.09685
- R4: Douillard et al. (2024). DiLoCo. arXiv:2311.08105
- R5: Ben-Nun & Hoefler (2019). Parallel DL survey. arXiv:1802.09941
## Verification

```bash
# Single-machine multi-GPU (Phase 1, must complete step 1 within 5 min)
apr finetune --task classify --gpus 0,1 --gpu-backend wgpu \
  model_dir --data train.jsonl --num-classes 2 -o checkpoints/

# Multi-node (Phase 2)
# Coordinator:
apr finetune --task classify --role coordinator --bind 0.0.0.0:9000 \
  --expect-workers 3 --gpus 0,1 --gpu-backend wgpu model_dir --data train.jsonl

# Worker:
apr finetune --task classify --role worker --coordinator intel:9000 \
  --gpus 0 --gpu-backend cuda model_dir --data train.jsonl
```