Personal-scale GPU cluster orchestration for distributed AI training
UGRO transforms your multi-GPU setup into a cohesive training platform with one-command job launching, automatic resource management, and intelligent failure recovery.
```bash
# Clone and setup
git clone git@github.com:ollieb89/ugro.git
cd ugro
pip install -e .

# Check cluster health
ugro health

# Launch distributed training
ugro launch --name exp1 --model unsloth/tinyllama-bnb-4bit --epochs 3

# Monitor progress
ugro logs exp1
ugro results exp1
```

**Before UGRO** (manual multi-node training):
```bash
# Terminal 1 (master):
torchrun --nnodes=3 --node_rank=0 --master_addr=192.168.1.100 --master_port=29500 train_production.py

# Terminal 2 (worker):
ssh ob@192.168.1.101 "torchrun --nnodes=3 --node_rank=1 --master_addr=192.168.1.100 --master_port=29500 train_production.py"

# Terminal 3 (worker):
ssh ollie@192.168.1.102 "torchrun --nnodes=3 --node_rank=2 --master_addr=192.168.1.100 --master_port=29500 train_production.py"
```

**After UGRO** (one command):
```bash
ugro launch --name exp1 --model llama-7b --epochs 3
# UGRO handles all nodes, monitoring, checkpointing, and recovery automatically
```

UGRO orchestrates your GPU cluster through three layers:
```
┌─────────────────────────────────────────────────────────────────┐
│                   LAYER 1: USER INTERFACE                       │
│        CLI + Dashboard for job submission & monitoring          │
└────────────┬────────────────────────────────────────────────────┘
             │
┌────────────▼────────────────────────────────────────────────────┐
│              LAYER 2: CONTROL PLANE (UGRO CORE)                 │
│      Scheduler, job registry, resource allocation, recovery     │
│      - Runs on: gpu-master (192.168.1.100)                      │
└────────────┬────────────────────────────────────────────────────┘
             │
┌────────────▼────────────────────────────────────────────────────┐
│              LAYER 3: EXECUTION (WORKER AGENTS)                 │
│         GPU control, resource reporting, job execution          │
│         - Runs on: gpu1 (192.168.1.101), gpu2 (192.168.1.102)   │
└─────────────────────────────────────────────────────────────────┘
```
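To illustrate the Layer 2 / Layer 3 split, a worker agent's resource report to the control plane might look like the sketch below. The `NodeStatus` type and `report` helper are assumptions for illustration, not UGRO's actual agent API.

```python
# Illustrative sketch only: NodeStatus and report() are hypothetical
# names, not UGRO's real worker-agent interface.
from dataclasses import dataclass, asdict

@dataclass
class NodeStatus:
    name: str            # e.g. "gpu1"
    ip: str              # e.g. "192.168.1.101"
    vram_free_gb: float  # free VRAM the scheduler can allocate
    healthy: bool        # result of the agent's local health check

def report(status: NodeStatus) -> dict:
    """Serialize a worker's status for the Layer-2 scheduler."""
    return asdict(status)

print(report(NodeStatus("gpu1", "192.168.1.101", 6.2, True)))
```

The control plane would aggregate one such report per worker to drive scheduling and failure detection.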
| Node | IP | GPU | VRAM | Role |
|---|---|---|---|---|
| gpu-master | 192.168.1.100 | RTX 5070 Ti | 12GB | Control Plane + Rank 0 |
| gpu1 | 192.168.1.101 | RTX 4070 | 8GB | Worker + Rank 1 |
| gpu2 | 192.168.1.102 | RTX 3070 Ti | 8GB | Worker + Rank 2 |
Total Cluster Capacity:
- VRAM: 28 GB (45 GB effective with checkpointing)
- Training Speed: ~2.5x single GPU
- Network Efficiency: ~85% (15% overhead)
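The quoted speedup and network-efficiency figures are mutually consistent under a simple linear-scaling model; a quick back-of-envelope check:

```python
# Sanity check of the capacity figures above: 3 GPUs at ~85% network
# efficiency give roughly the quoted ~2.5x speedup over a single GPU.
num_gpus = 3
network_efficiency = 0.85  # ~15% communication overhead
speedup = num_gpus * network_efficiency
print(f"Estimated speedup: {speedup:.2f}x")  # roughly 2.5x
```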
- One-Command Launch - Single command starts multi-node training
- Automatic Resource Management - Smart GPU allocation and load balancing
- Intelligent Failure Recovery - Checkpoint-aware restart with optimized parameters
- Real-time Monitoring - Live metrics, logs, and progress tracking
- Job Queue Management - FIFO with priority support
- Multi-User Support - Isolated job queues and permissions
- Distributed First - Built around PyTorch DDP and NCCL
- Human-in-the-Loop - Automation assists, never surprises
- Visibility Over Abstraction - Users see where & why jobs run
- Composable Architecture - Each layer is separable & testable
- Backward Compatible - Existing training scripts work unchanged
- Python 3.10+ on all nodes
- PyTorch 2.1+ with CUDA support
- SSH passwordless authentication between nodes
- NVIDIA drivers and CUDA toolkit
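A minimal version gate mirroring these prerequisites might look like the following sketch (illustrative only; it compares major.minor tuples and is not part of UGRO):

```python
# Illustrative prerequisite check; the thresholds mirror the list above.
import sys

def version_tuple(version: str) -> tuple:
    """Parse 'major.minor' from a version string like '2.1.0'."""
    return tuple(int(p) for p in version.split(".")[:2])

def meets_prereqs(torch_version: str) -> bool:
    """True when both the Python 3.10+ and PyTorch 2.1+ floors are met."""
    return sys.version_info >= (3, 10) and version_tuple(torch_version) >= (2, 1)
```

For example, `meets_prereqs("2.1.0")` passes on a Python 3.10+ interpreter, while `"1.13.1"` fails the PyTorch floor.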
```bash
# Clone repository
git clone git@github.com:ollieb89/ugro.git
cd ugro

# Install dependencies
pip install -e .

# Verify installation
ugro --help
```

```bash
# Clone with development dependencies
git clone git@github.com:ollieb89/ugro.git
cd ugro

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Code formatting
black src/
ruff check src/
```

UGRO uses YAML configuration files in the `config/` directory:
```yaml
cluster:
  name: "Home AI Lab"
  master:
    hostname: "gpu-master"
    ip: "192.168.1.100"
    user: "${USER}"
  workers:
    - name: "gpu1"
      ip: "192.168.1.101"
      user: "ob"
      hardware:
        gpu_model: "RTX 4070"
        vram_gb: 8
    - name: "gpu2"
      ip: "192.168.1.102"
      user: "ollie"
      hardware:
        gpu_model: "RTX 3070 Ti"
        vram_gb: 8
training:
  default_model: "unsloth/tinyllama-bnb-4bit"
  batch_size_per_gpu: 1
  gradient_accumulation_steps: 8
```

```yaml
model:
  max_seq_length: 2048
  dtype: "float16"
  load_in_4bit: true
training:
  learning_rate: 0.0002
  warmup_steps: 100
  weight_decay: 0.01
lora:
  enabled: true
  rank: 16
  alpha: 32
```

```bash
# Check cluster health
ugro health

# Show cluster status
ugro status

# Launch training job
ugro launch --name exp1 --model llama-7b --dataset wikitext --epochs 3

# View job logs
ugro logs exp1

# See job results
ugro results exp1

# View specific rank logs
ugro logs exp1 --rank 1
```

```bash
# Launch with custom parameters
ugro launch \
  --name "llama-7b-finetune" \
  --model "meta-llama/Llama-2-7b-hf" \
  --dataset "custom-dataset" \
  --epochs 5 \
  --lr 0.0001 \
  --batch-size 2 \
  --verbose

# Launch with priority
ugro launch --name urgent-exp --model tinyllama --priority high

# Cancel running job
ugro cancel exp1

# Resume from checkpoint
ugro resume exp1 --checkpoint step-5000
```

```bash
# View current configuration
ugro config show

# Test cluster connectivity
ugro config test

# Update worker configuration
ugro config set worker.gpu1.memory_limit 6GB
```

- GPU utilization and memory usage
- Training loss and throughput
- Network latency and bandwidth
- Node health and availability
- Structured JSON logs with timestamps
- Per-rank log aggregation
- Automatic log rotation and archival
- Integration with external logging systems
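The structured, timestamped JSON log lines described above might look like the following sketch; the field names (`ts`, `rank`, `event`) are assumptions for illustration, not UGRO's actual schema.

```python
# Sketch of a structured JSON log line with a timestamp and rank tag;
# the field names are illustrative, not UGRO's real log schema.
import json
import time

def log_line(rank: int, event: str, **fields) -> str:
    """Render one machine-parseable log record as a JSON string."""
    record = {"ts": time.time(), "rank": rank, "event": event, **fields}
    return json.dumps(record)

print(log_line(0, "train_step", step=100, loss=1.23))
```

One-record-per-line JSON keeps per-rank logs trivially greppable and easy to aggregate.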
- Automatic experiment directory creation
- Configuration and metadata storage
- Checkpoint management
- Results visualization
```
ugro/
├── src/            # Core orchestration code
│   ├── cli.py      # CLI interface
│   ├── agent.py    # Main orchestrator
│   ├── cluster.py  # Cluster management
│   ├── job.py      # Job management
│   └── ssh_utils.py  # SSH operations
├── config/         # Configuration files
├── scripts/        # Training scripts
├── data/           # Runtime data and experiments
├── tests/          # Unit tests
├── docs/           # Documentation
└── tools/          # Utility scripts
```
- New CLI Commands: Add to `src/cli.py`
- Worker Agents: Extend `src/agent.py`
- Cluster Operations: Modify `src/cluster.py`
- Job Management: Update `src/job.py`
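Since UGRO's CLI is built on Click, a new command added to `src/cli.py` might look like this hypothetical sketch; the `cli` group name and the `pause` command are assumptions, not existing UGRO commands.

```python
# Hypothetical sketch of extending the CLI with Click; the group name
# and the pause command are assumptions, not real UGRO commands.
import click

@click.group()
def cli():
    """UGRO command-line interface."""

@cli.command()
@click.argument("job_name")
def pause(job_name: str):
    """Pause a running job (illustrative only)."""
    click.echo(f"Pausing {job_name}...")

if __name__ == "__main__":
    cli()
```

Commands registered on the group this way show up automatically in `ugro --help`, which is one reason Click suits an extensible CLI.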
```bash
# Run all tests
pytest

# Run specific test categories
pytest tests/test_cluster.py
pytest tests/test_ssh_utils.py

# Run with coverage
pytest --cov=src tests/
```

- Basic cluster orchestration
- SSH integration and health monitoring
- Job launching and tracking
- CLI interface
- Advanced job queueing and scheduling
- Web dashboard for monitoring
- Metrics collection and visualization
- Enhanced failure recovery
- Multi-user support with authentication
- Model registry and versioning
- Automatic hyperparameter optimization
- Integration with MLflow and Weights & Biases
- Kubernetes integration
- Cloud provider support
- Advanced security features
- SLA monitoring and alerting
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make your changes and add tests
- Run the test suite: `pytest`
- Submit a pull request
- Architecture Guide - Complete system design
- Setup Instructions - Detailed installation guide
- API Reference - API documentation
- Troubleshooting - Common issues and solutions
**SSH Connection Failed**

```bash
# Test SSH manually
ssh ob@192.168.1.101 "echo OK"

# Check SSH keys
ssh-copy-id ob@192.168.1.101
```

**GPU Not Available**

```bash
# Check GPU status on each node
ssh ob@192.168.1.101 "nvidia-smi"
ssh ollie@192.168.1.102 "nvidia-smi"

# Check PyTorch CUDA
ssh ob@192.168.1.101 "python -c 'import torch; print(torch.cuda.is_available())'"
```

**Job Fails to Start**

```bash
# Check cluster health
ugro health

# View detailed logs
ugro logs job-name --verbose

# Check configuration
ugro config test
```

For more troubleshooting, see the Troubleshooting Guide.
This project is licensed under the MIT License - see the LICENSE file for details.
- PyTorch - For the excellent distributed training framework
- Click - For the beautiful CLI interface
- FastAPI - For the high-performance API server
- Paramiko - For reliable SSH operations
- 📧 Email: ollie@example.com
- 💬 Discord: Join our community
- 🐛 Issues: GitHub Issues
- 📖 Docs: Full Documentation

UGRO: Making distributed AI training accessible to everyone 🚀