UGRO: Unified GPU Resource Orchestrator

License: MIT | Python 3.10+ | PyTorch

Personal-scale GPU cluster orchestration for distributed AI training

UGRO transforms your multi-GPU setup into a cohesive training platform with one-command job launching, automatic resource management, and intelligent failure recovery.


🚀 Quick Start

# Clone and setup
git clone git@github.com:ollieb89/ugro.git
cd ugro
pip install -e .

# Check cluster health
ugro health

# Launch distributed training
ugro launch --name exp1 --model unsloth/tinyllama-bnb-4bit --epochs 3

# Monitor progress
ugro logs exp1
ugro results exp1

🎯 What UGRO Solves

Before UGRO (Manual multi-node training):

# Terminal 1 (master):
torchrun --nnodes=3 --node_rank=0 --master_addr=192.168.1.100 --master_port=29500 train_production.py

# Terminal 2 (worker):
ssh ob@192.168.1.101 "torchrun --nnodes=3 --node_rank=1 --master_addr=192.168.1.100 --master_port=29500 train_production.py"

# Terminal 3 (worker):
ssh ollie@192.168.1.102 "torchrun --nnodes=3 --node_rank=2 --master_addr=192.168.1.100 --master_port=29500 train_production.py"

After UGRO (One command):

ugro launch --name exp1 --model llama-7b --epochs 3
# UGRO handles all nodes, monitoring, checkpointing, and recovery automatically
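
Under the hood, a launcher like this fans out per-rank torchrun commands over SSH. A minimal sketch of the idea (illustrative only; the node list and helper below are hypothetical, not UGRO's actual internals):

# Illustrative sketch: how a launcher can fan out torchrun across nodes.
# The node list and helper names are hypothetical, not UGRO's internals.
import subprocess

MASTER_ADDR, MASTER_PORT = "192.168.1.100", 29500
NODES = [  # (ssh target, node rank); rank 0 runs locally on the master
    (None, 0),
    ("ob@192.168.1.101", 1),
    ("ollie@192.168.1.102", 2),
]

def torchrun_cmd(rank: int) -> str:
    return (
        f"torchrun --nnodes={len(NODES)} --node_rank={rank} "
        f"--master_addr={MASTER_ADDR} --master_port={MASTER_PORT} "
        f"train_production.py"
    )

procs = []
for target, rank in NODES:
    cmd = torchrun_cmd(rank)
    # Workers get the command over SSH; rank 0 runs on the local shell.
    argv = ["ssh", target, cmd] if target else ["bash", "-lc", cmd]
    procs.append(subprocess.Popen(argv))

for p in procs:
    p.wait()  # a real orchestrator would also monitor, retry, and checkpoint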

🏗️ Architecture

UGRO orchestrates your GPU cluster through three layers:

┌────────────────────────────────────────────────────────────┐
│                  LAYER 1: USER INTERFACE                   │
│      CLI + Dashboard for job submission & monitoring       │
└────────────┬───────────────────────────────────────────────┘
             │
┌────────────▼───────────────────────────────────────────────┐
│             LAYER 2: CONTROL PLANE (UGRO CORE)             │
│   Scheduler, job registry, resource allocation, recovery   │
│   - Runs on: gpu-master (192.168.1.100)                    │
└────────────┬───────────────────────────────────────────────┘
             │
┌────────────▼───────────────────────────────────────────────┐
│            LAYER 3: EXECUTION (WORKER AGENTS)              │
│       GPU control, resource reporting, job execution       │
│   - Runs on: gpu1 (192.168.1.101), gpu2 (192.168.1.102)    │
└────────────────────────────────────────────────────────────┘

Current Cluster Configuration

Node         IP              GPU           VRAM    Role
gpu-master   192.168.1.100   RTX 5070 Ti   12 GB   Control Plane + Rank 0
gpu1         192.168.1.101   RTX 4070      8 GB    Worker + Rank 1
gpu2         192.168.1.102   RTX 3070 Ti   8 GB    Worker + Rank 2

Total Cluster Capacity:

  • VRAM: 28 GB total (~45 GB effective with checkpointing)
  • Training Speed: ~2.5x a single GPU (3 GPUs at ~85% scaling efficiency ≈ 2.55x)
  • Network Efficiency: ~85% (~15% communication overhead)

✨ Features

🎯 Core Capabilities

  • One-Command Launch - Single command starts multi-node training
  • Automatic Resource Management - Smart GPU allocation and load balancing
  • Intelligent Failure Recovery - Checkpoint-aware restart with optimized parameters
  • Real-time Monitoring - Live metrics, logs, and progress tracking
  • Job Queue Management - FIFO with priority support
  • Multi-User Support - Isolated job queues and permissions

🛠️ Technical Features

  • Distributed First - Built around PyTorch DDP and NCCL (see the DDP sketch after this list)
  • Human-in-the-Loop - Automation assists, never surprises
  • Visibility Over Abstraction - Users see where & why jobs run
  • Composable Architecture - Each layer is separable & testable
  • Backward Compatible - Existing training scripts work unchanged
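
Because the stack is DDP-first, the launched training scripts follow the standard torchrun contract. A minimal sketch of that contract, assuming a stand-in model rather than this repo's actual training code:

# Minimal DDP skeleton of the kind torchrun expects; the toy model is
# illustrative, not this repo's training script.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # torchrun sets RANK/WORLD_SIZE/etc.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 128).cuda(local_rank)  # stand-in model
model = DDP(model, device_ids=[local_rank])

# ... training loop: DDP all-reduces gradients across all 3 ranks ...
dist.destroy_process_group()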

📦 Installation

Prerequisites

  • Python 3.10+ on all nodes
  • PyTorch 2.1+ with CUDA support
  • SSH passwordless authentication between nodes
  • NVIDIA drivers and CUDA toolkit

Quick Install

# Clone repository
git clone git@github.com:ollieb89/ugro.git
cd ugro

# Install dependencies
pip install -e .

# Verify installation
ugro --help

Development Setup

# Clone with development dependencies
git clone git@github.com:ollieb89/ugro.git
cd ugro

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Code formatting
black src/
ruff check src/

⚙️ Configuration

UGRO uses YAML configuration files in the config/ directory:

Cluster Configuration (config/cluster.yaml)

cluster:
  name: "Home AI Lab"
  master:
    hostname: "gpu-master"
    ip: "192.168.1.100"
    user: "${USER}"

workers:
  - name: "gpu1"
    ip: "192.168.1.101"
    user: "ob"
    hardware:
      gpu_model: "RTX 4070"
      vram_gb: 8
  - name: "gpu2"
    ip: "192.168.1.102"
    user: "ollie"
    hardware:
      gpu_model: "RTX 3070 Ti"
      vram_gb: 8

training:
  default_model: "unsloth/tinyllama-bnb-4bit"
  batch_size_per_gpu: 1
  gradient_accumulation_steps: 8
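
How UGRO itself parses this file isn't shown here, but a minimal sketch of reading it with PyYAML (an assumed dependency) and expanding ${USER} looks like:

# Sketch: load cluster.yaml and expand ${USER}-style variables.
# Assumes PyYAML (pip install pyyaml); not UGRO's actual loader.
import os
import yaml

with open("config/cluster.yaml") as f:
    cfg = yaml.safe_load(os.path.expandvars(f.read()))

master = cfg["cluster"]["master"]
print(f"master: {master['user']}@{master['ip']}")
for w in cfg["workers"]:
    hw = w["hardware"]
    print(f"worker {w['name']}: {w['user']}@{w['ip']} "
          f"({hw['gpu_model']}, {hw['vram_gb']} GB)")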

Training Defaults (config/training_defaults.yaml)

model:
  max_seq_length: 2048
  dtype: "float16"
  load_in_4bit: true

training:
  learning_rate: 0.0002
  warmup_steps: 100
  weight_decay: 0.01

lora:
  enabled: true
  rank: 16
  alpha: 32
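
The lora block corresponds to the usual LoRA hyperparameters. As a hedged illustration, here is how rank and alpha would map onto Hugging Face PEFT's LoraConfig (the target_modules value is an assumption; the repo may apply LoRA via Unsloth instead):

# Sketch: how rank/alpha from training_defaults.yaml map to PEFT's
# LoraConfig. target_modules is an assumed value, not from this repo.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,           # lora.rank
    lora_alpha=32,  # lora.alpha
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_cfg)  # base_model: a HF causal LM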

🚀 Usage Guide

Basic Commands

# Check cluster health
ugro health

# Show cluster status
ugro status

# Launch training job
ugro launch --name exp1 --model llama-7b --dataset wikitext --epochs 3

# View job logs
ugro logs exp1

# See job results
ugro results exp1

# View specific rank logs
ugro logs exp1 --rank 1

Advanced Usage

# Launch with custom parameters
ugro launch \
  --name "llama-7b-finetune" \
  --model "meta-llama/Llama-2-7b-hf" \
  --dataset "custom-dataset" \
  --epochs 5 \
  --lr 0.0001 \
  --batch-size 2 \
  --verbose

# Launch with priority
ugro launch --name urgent-exp --model tinyllama --priority high

# Cancel running job
ugro cancel exp1

# Resume from checkpoint
ugro resume exp1 --checkpoint step-5000
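
Checkpoint-aware resume relies on checkpoints that carry model, optimizer, and step state. A generic PyTorch pattern that supports a step-5000 style resume (illustrative; not necessarily UGRO's exact checkpoint format):

# Generic PyTorch checkpoint save/restore pattern that makes a
# "--checkpoint step-5000" style resume possible. Paths and keys are
# illustrative, not UGRO's actual checkpoint format.
import torch

def save_checkpoint(model, optimizer, step, path):
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]  # resume the training loop from here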

Configuration Management

# View current configuration
ugro config show

# Test cluster connectivity
ugro config test

# Update worker configuration
ugro config set worker.gpu1.memory_limit 6GB

📊 Monitoring & Observability

Real-time Metrics

  • GPU utilization and memory usage (see the NVML sketch below this list)
  • Training loss and throughput
  • Network latency and bandwidth
  • Node health and availability
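
As a sketch of how GPU metrics like these can be collected, here is an NVML-based poll using the nvidia-ml-py bindings (an assumption about tooling; UGRO may shell out to nvidia-smi instead):

# Sketch: polling GPU utilization and memory via NVML
# (pip install nvidia-ml-py). An assumption; UGRO may use nvidia-smi.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU util: {util.gpu}%  "
      f"VRAM: {mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
pynvml.nvmlShutdown()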

Logging

  • Structured JSON logs with timestamps (see the sketch below this list)
  • Per-rank log aggregation
  • Automatic log rotation and archival
  • Integration with external logging systems
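
For illustration, a stdlib-only formatter that emits one JSON object per log line with a timestamp and rank (the field names are assumptions, not UGRO's actual schema):

# Sketch: structured JSON logging with the stdlib only; the field names
# (ts/level/rank/msg) are illustrative, not UGRO's schema.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)),
            "level": record.levelname,
            "rank": getattr(record, "rank", 0),
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("ugro")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("step 5000: loss=1.23", extra={"rank": 1})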

Experiment Tracking

  • Automatic experiment directory creation
  • Configuration and metadata storage
  • Checkpoint management
  • Results visualization

🔧 Development

Project Structure

ugro/
├── src/                   # Core orchestration code
│   ├── cli.py             # CLI interface
│   ├── agent.py           # Main orchestrator
│   ├── cluster.py         # Cluster management
│   ├── job.py             # Job management
│   └── ssh_utils.py       # SSH operations
├── config/                # Configuration files
├── scripts/               # Training scripts
├── data/                  # Runtime data and experiments
├── tests/                 # Unit tests
├── docs/                  # Documentation
└── tools/                 # Utility scripts

Adding New Features

  1. New CLI Commands: Add to src/cli.py (see the Click sketch after this list)
  2. Worker Agents: Extend src/agent.py
  3. Cluster Operations: Modify src/cluster.py
  4. Job Management: Update src/job.py
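
For instance, a new command in src/cli.py would follow the Click pattern the project uses (a sketch; the group and command names here are assumptions, not existing UGRO commands):

# Sketch of a new Click command; "inspect" is a hypothetical example,
# not an existing UGRO command.
import click

@click.group()
def cli():
    """UGRO command-line interface (sketch)."""

@cli.command()
@click.argument("job_name")
def inspect(job_name):
    """Hypothetical new command: print a job's metadata."""
    click.echo(f"inspecting {job_name}...")

if __name__ == "__main__":
    cli()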

Testing

# Run all tests
pytest

# Run specific test categories
pytest tests/test_cluster.py
pytest tests/test_ssh_utils.py

# Run with coverage
pytest --cov=src tests/

📈 Roadmap

Phase 1: Foundation ✅

  • Basic cluster orchestration
  • SSH integration and health monitoring
  • Job launching and tracking
  • CLI interface

Phase 2: Production Features (In Progress)

  • Advanced job queueing and scheduling
  • Web dashboard for monitoring
  • Metrics collection and visualization
  • Enhanced failure recovery

Phase 3: Advanced Features (Planned)

  • Multi-user support with authentication
  • Model registry and versioning
  • Automatic hyperparameter optimization
  • Integration with MLflow and Weights & Biases

Phase 4: Enterprise Features (Future)

  • Kubernetes integration
  • Cloud provider support
  • Advanced security features
  • SLA monitoring and alerting

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes and add tests
  4. Run the test suite: pytest
  5. Submit a pull request

📚 Documentation

Detailed guides live in the docs/ directory of the repository.

🐛 Troubleshooting

Common Issues

SSH Connection Failed

# Test SSH manually
ssh ob@192.168.1.101 "echo OK"

# Check SSH keys
ssh-copy-id ob@192.168.1.101

GPU Not Available

# Check GPU status on each node
ssh ob@192.168.1.101 "nvidia-smi"
ssh ollie@192.168.1.102 "nvidia-smi"

# Check PyTorch CUDA
ssh ob@192.168.1.101 "python -c 'import torch; print(torch.cuda.is_available())'"

Job Fails to Start

# Check cluster health
ugro health

# View detailed logs
ugro logs job-name --verbose

# Check configuration
ugro config test

For more troubleshooting, see the Troubleshooting Guide.


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • PyTorch - For the excellent distributed training framework
  • Click - For the beautiful CLI interface
  • FastAPI - For the high-performance API server
  • Paramiko - For reliable SSH operations

📞 Support

For questions and bug reports, please open an issue on the GitHub repository.

UGRO: Making distributed AI training accessible to everyone 🚀
