Summary
When the cluster detects a GPU worker (e.g. the Fedora machine with an RTX 3060), the memory system should automatically route LLM extraction to the better or faster model. If the worker goes offline, it should fall back to on-device models seamlessly.
Architecture
- Orange Pi (host/server): runs taOSmd with MiniLM ONNX embeddings + Qwen3-4B on NPU for extraction
- Fedora (worker/node): runs Ollama with Qwen3-4B on RTX 3060 (10x faster) or potentially Gemma 4 (better quality when supported)
- Smart routing: check worker health, route extraction tasks to the fastest available backend, and fall back to local when the worker is unreachable (see the sketch after this list)
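A minimal sketch of that routing decision, assuming the worker exposes the health endpoint described under Implementation; the hostname, port, timeout, and backend labels are illustrative, not fixed names:

```python
import requests

# Hypothetical worker address; real discovery happens via the cluster.
WORKER_HEALTH_URL = "http://fedora.local:8080/api/worker/health"

def pick_extraction_backend(timeout_s: float = 0.5) -> str:
    """Route to the GPU worker when healthy, else fall back to the local NPU."""
    try:
        if requests.get(WORKER_HEALTH_URL, timeout=timeout_s).ok:
            return "gpu-worker"   # Qwen3-4B on the RTX 3060, ~10x faster
    except requests.RequestException:
        pass                      # worker offline or unreachable
    return "local-npu"            # Qwen3-4B on the Pi NPU
```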
Model Priority Chain
- Gemma 4 on GPU worker (when RKLLM supports it) — best quality
- Qwen3-4B on GPU worker — same quality as Pi, 10x faster
- Qwen3-4B on Pi NPU — benchmark baseline (72% extraction recall)
- Regex extraction on Pi CPU — instant fallback (39% recall)
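One way to encode this chain is an ordered backend table that the router walks top to bottom; the names, models, and recall figures mirror the list above, while the field layout itself is only an assumption:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Backend:
    name: str
    model: str
    recall: Optional[float]  # benchmarked extraction recall, where known

# Best-first; the router picks the first entry whose backend is reachable.
PRIORITY_CHAIN = [
    Backend("gpu-gemma", "gemma4",   None),  # pending support, best quality
    Backend("gpu-qwen",  "qwen3:4b", 0.72),  # same quality as the Pi, ~10x faster
    Backend("npu-qwen",  "qwen3:4b", 0.72),  # benchmark baseline on the Pi NPU
    Backend("cpu-regex", "regex",    0.39),  # instant fallback, always available
]
```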
Implementation
- Health check endpoint on workers: GET /api/worker/health
- Model capability advertisement: worker reports available models + speed (a possible response shape is sketched below)
- Extraction router: picks best available backend per request
- Automatic failover: if the GPU worker times out, route to the local NPU (see the router sketch below)
- No user configuration needed — discovery is automatic via cluster
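A possible shape for the health response, folding the capability advertisement into the same payload; every field name here is an assumption rather than a fixed schema:

```python
# Example body a worker might return for GET /api/worker/health.
HEALTH_RESPONSE = {
    "status": "ok",
    "gpu": "RTX 3060",
    "models": [
        # tokens_per_sec would be measured by the worker, not hard-coded
        {"name": "qwen3:4b", "tokens_per_sec": None},
    ],
}
```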
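Tying the pieces together, a sketch of the extraction router with automatic failover; the backend URLs and the Ollama-style /api/generate call are assumptions, and the regex fallback is a deliberately crude stand-in for the real CPU path:

```python
import re
import requests

BACKEND_URLS = {  # hypothetical cluster addresses
    "gpu-worker": "http://fedora.local:11434",  # Ollama's default port
    "local-npu":  "http://localhost:8081",      # NPU-backed service on the Pi
}

def healthy(base_url: str, timeout_s: float = 0.5) -> bool:
    """Probe the worker's health endpoint; any failure counts as offline."""
    try:
        return requests.get(f"{base_url}/api/worker/health", timeout=timeout_s).ok
    except requests.RequestException:
        return False

def extract_with_regex(text: str) -> str:
    """Crude CPU fallback (~39% recall in the benchmark); pattern is illustrative."""
    return ", ".join(re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b", text))

def extract(text: str) -> str:
    """Try backends best-first; fail over on timeout or error."""
    for name in ("gpu-worker", "local-npu"):
        base = BACKEND_URLS[name]
        if not healthy(base):
            continue
        try:
            resp = requests.post(
                f"{base}/api/generate",  # Ollama-compatible generate endpoint
                json={"model": "qwen3:4b", "prompt": text, "stream": False},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException:
            continue  # timed out or errored: fall through to the next backend
    return extract_with_regex(text)
```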
Related