feat: Heterogeneous distributed training (CUDA + wgpu multi-node) #393

@noahgift

Description

SPEC-DIST-2026-001: Heterogeneous Distributed Training

Spec: docs/specifications/distributed-training-spec.md
Contract: provable-contracts/contracts/entrenar/distributed-training-v1.yaml
Status: Phase 1 complete (single-machine multi-GPU), Phase 2 in progress (multi-node)

Motivation

SSC v3 classifier training on the intel machine (2× Radeon Pro W5700X) was effectively non-functional: the wgpu compute path performed 14,000 CPU↔GPU round-trips per training step. The CUDA path, meanwhile, ran only on NVIDIA hardware, and no mechanism existed to combine the two machines' GPU resources.

Architecture

Data-parallel training across heterogeneous GPU backends:

Coordinator (intel:9000)
├── Worker 0: intel:gpu0 (wgpu, Radeon 8GB)
├── Worker 1: intel:gpu1 (wgpu, Radeon 8GB)
└── Worker 2: lambda:gpu0 (CUDA, RTX 4090 24GB)

  • Replicated model: each worker holds the full frozen Qwen3-4B plus LoRA adapters
  • Sharded data: each mini-batch is split across workers (F-DP-002)
  • CPU AllReduce: LoRA gradients are ~5.3 MB and are averaged over TCP (<53 ms over GbE)
  • Backend fallback: CUDA → wgpu → CPU (F-DP-004)
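The CPU AllReduce step above can be sketched as a plain gradient average with the Jidoka NaN halt (F-DP-003) folded in. This is an illustrative sketch, not the entrenar API; `average_gradients` is a hypothetical name, and gradients are assumed to arrive as flat `Vec<f32>` buffers.

```rust
// Sketch: average per-worker LoRA gradients on the CPU coordinator.
// Assumes worker_grads is non-empty and all gradients have equal length.
fn average_gradients(worker_grads: &[Vec<f32>]) -> Result<Vec<f32>, String> {
    let n = worker_grads.len() as f32;
    let len = worker_grads[0].len();
    let mut avg = vec![0.0f32; len];
    for g in worker_grads {
        assert_eq!(g.len(), len, "gradient shapes must match across workers");
        for (a, v) in avg.iter_mut().zip(g) {
            *a += *v / n;
        }
    }
    // Jidoka (F-DP-003): a NaN anywhere halts training rather than
    // silently corrupting the replicated weights.
    if avg.iter().any(|v| v.is_nan()) {
        return Err("NaN in averaged gradient; halting training".into());
    }
    Ok(avg)
}
```

At ~5.3 MB per gradient and roughly 100 MB/s of usable GbE bandwidth, the transfer alone stays under the quoted 53 ms budget.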

Contracts (7 invariants)

ID          Invariant                                        Severity
F-DP-001    All workers have identical weights after sync    P0
F-DP-002    Sharding covers all samples exactly once         P0
F-DP-003    NaN in gradient halts training (Jidoka)          P0
F-DP-004    Backend fallback always produces finite output   P0
F-DP-005    Multi-GPU loss within 1% of single-GPU           P1
F-DP-006    Training continues after worker disconnect       P1
F-WGPU-001  GPU matmul matches CPU within 1e-4               P0
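F-DP-002 (every sample assigned exactly once) is straightforward to satisfy with round-robin sharding. A minimal sketch, with an illustrative function name rather than the actual entrenar sharder:

```rust
// Sketch: round-robin assignment of sample indices to workers.
// Every index in 0..num_samples lands in exactly one shard (F-DP-002).
fn shard_indices(num_samples: usize, num_workers: usize) -> Vec<Vec<usize>> {
    let mut shards = vec![Vec::new(); num_workers];
    for i in 0..num_samples {
        shards[i % num_workers].push(i);
    }
    shards
}
```

The invariant check is then a set equality: the concatenation of all shards, sorted, must equal 0..num_samples with no duplicates or gaps.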

Implementation Phases

Phase 1 (COMPLETE): Single-machine multi-GPU

  • trueno: GpuDevice::new_with_adapter_index(), GpuCommandBatch::matmul(), GpuDevicePool
  • entrenar: ComputeDevice::Wgpu, WgpuForwardPass, DataParallelCoordinator
  • aprender: --gpus 0,1, --gpu-backend wgpu|cuda|auto

Phase 2 (IN PROGRESS): Multi-node heterogeneous training

  • entrenar: DistributedConfig, GradientServer, WorkerClient, NodeDiscovery
  • aprender: --nodes, --role coordinator|worker, --bind, --coordinator
  • forjar: recipes/gpu-training.yaml for environment provisioning
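One plausible shape for the Phase 2 configuration, mirroring the CLI flags (--role, --bind, --coordinator, --gpus, --gpu-backend). Field and type names here are assumptions for illustration, not the actual entrenar `DistributedConfig`:

```rust
// Hypothetical sketch of the distributed-training configuration.
#[derive(Debug)]
enum Role {
    Coordinator { bind: String, expect_workers: usize },
    Worker { coordinator: String },
}

#[derive(Debug)]
struct DistributedConfig {
    role: Role,
    gpus: Vec<usize>,
    backend: String, // "wgpu" | "cuda" | "auto", with fallback per F-DP-004
}
```

A worker on the lambda node, for example, would carry `Role::Worker { coordinator: "intel:9000" }` with `gpus: vec![0]` and `backend: "cuda"`.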

Phase 3 (FUTURE): DiLoCo local SGD (arXiv:2311.08105)

  • LocalSGDTrainer with H=500 inner steps before AllReduce (500× less communication)
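The DiLoCo schedule above reduces communication by syncing once per H inner steps rather than every step. A minimal sketch of that schedule, where `step` stands in for the real local LoRA update and the names are illustrative, not the `LocalSGDTrainer` API:

```rust
// Sketch: DiLoCo-style local SGD. Run H local optimizer steps between
// AllReduce rounds; returns the number of communication rounds used.
// With H = 500, 10_000 steps need 20 syncs instead of 10_000 (500x less).
fn train_local_sgd(total_steps: usize, h: usize, mut step: impl FnMut()) -> usize {
    let mut sync_rounds = 0;
    let mut done = 0;
    while done < total_steps {
        let inner = h.min(total_steps - done);
        for _ in 0..inner {
            step(); // local update, no network traffic
        }
        sync_rounds += 1; // AllReduce the accumulated delta; F-DP-001 holds after this
        done += inner;
    }
    sync_rounds
}
```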

Verification

# Single-machine multi-GPU (Phase 1, must complete step 1 within 5 min)
apr finetune --task classify --gpus 0,1 --gpu-backend wgpu \
  model_dir --data train.jsonl --num-classes 2 -o checkpoints/

# Multi-node (Phase 2)
# Coordinator:
apr finetune --task classify --role coordinator --bind 0.0.0.0:9000 \
  --expect-workers 3 --gpus 0,1 --gpu-backend wgpu model_dir --data train.jsonl

# Worker:
apr finetune --task classify --role worker --coordinator intel:9000 \
  --gpus 0 --gpu-backend cuda model_dir --data train.jsonl

Labels

  • enhancement
  • P1
  • training
