
Smart model routing — use best available LLM across cluster nodes #200

@jaylfc

Description


Summary

When the cluster detects a GPU worker (e.g. a Fedora node with an RTX 3060), the memory system should automatically route LLM extraction to the better/faster model. If the worker goes offline, it should fall back to on-device models seamlessly.

Architecture

  • Orange Pi (host/server): runs taOSmd with MiniLM ONNX embeddings + Qwen3-4B on the NPU for extraction
  • Fedora (worker/node): runs Ollama with Qwen3-4B on an RTX 3060 (10x faster) or potentially Gemma 4 (better quality, when supported)
  • Smart routing: check worker health, route extraction tasks to the fastest available backend, fall back to local on failure (topology sketched below)
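
For concreteness, the topology above could be modeled roughly like this. This is a minimal sketch; every name here (`Node`, `CLUSTER`, the endpoint URL) is hypothetical, not existing taOSmd code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    role: str                # "host" or "worker"
    accelerator: str         # "npu", "gpu", or "cpu"
    models: tuple[str, ...]  # models this node can serve
    endpoint: str | None     # Ollama base URL for remote workers, None for local

# Static view of the cluster described above; real discovery would build
# this dynamically (see Implementation below).
CLUSTER = (
    Node("orange-pi", "host", "npu", ("qwen3:4b",), None),
    Node("fedora-worker", "worker", "gpu", ("qwen3:4b",),
         "http://fedora-worker:11434"),
)
```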

Model Priority Chain

  1. Gemma 4 on GPU worker (when RKLLM supports it) — best quality
  2. Qwen3-4B on GPU worker — same quality as the Pi, 10x faster
  3. Qwen3-4B on Pi NPU — benchmark baseline (72% extraction recall)
  4. Regex extraction on Pi CPU — instant fallback (39% recall); one possible encoding of this chain is sketched below
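
As data the router can walk, the chain above might look like the following sketch. The model/node identifiers are illustrative, and the recall figures are the ones quoted in this issue:

```python
# Ordered best-to-worst; the router takes the first usable entry.
PRIORITY_CHAIN = [
    ("gemma4",   "fedora-worker"),  # best quality (pending support)
    ("qwen3:4b", "fedora-worker"),  # same quality as the Pi, ~10x faster
    ("qwen3:4b", "orange-pi"),      # NPU baseline, 72% extraction recall
    ("regex",    "orange-pi"),      # CPU fallback, 39% recall, never fails
]

def pick_backend(healthy_nodes: set[str], supported_models: set[str]):
    """Return the first (model, node) pair that is healthy and supported."""
    for model, node in PRIORITY_CHAIN:
        if node in healthy_nodes and (model == "regex" or model in supported_models):
            return model, node
    return PRIORITY_CHAIN[-1]  # the regex path on the host is always available
```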

Implementation

  • Health check endpoint on workers: GET /api/worker/health
  • Model capability advertisement: each worker reports its available models + speed
  • Extraction router: picks the best available backend per request
  • Automatic failover: if the GPU worker times out, route to the local NPU (see the sketch below)
  • No user configuration needed — discovery is automatic via the cluster
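
A hedged sketch of the health check plus failover path, assuming workers expose the `GET /api/worker/health` endpoint named above and serve models through the standard Ollama `/api/generate` API. The local helpers (`extract_on_npu`, `extract_with_regex`), hostnames, and timeouts are all placeholders:

```python
import requests

WORKER_HEALTH_URL = "http://fedora-worker:8080/api/worker/health"  # endpoint from this issue
OLLAMA_GENERATE_URL = "http://fedora-worker:11434/api/generate"    # standard Ollama API

def extract_on_npu(text: str) -> str:
    raise RuntimeError("placeholder for the RKLLM-backed local NPU path")

def extract_with_regex(text: str) -> str:
    return ""  # placeholder for the instant regex fallback (39% recall)

def worker_is_healthy(timeout: float = 0.5) -> bool:
    """Cheap liveness probe; a slow or unreachable worker counts as down."""
    try:
        return requests.get(WORKER_HEALTH_URL, timeout=timeout).ok
    except requests.RequestException:
        return False

def extract(text: str) -> str:
    """Route extraction to the best available backend, failing over in order."""
    if worker_is_healthy():
        try:
            resp = requests.post(
                OLLAMA_GENERATE_URL,
                json={"model": "qwen3:4b", "prompt": text, "stream": False},
                timeout=30,  # a GPU stall longer than this triggers failover
            )
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException:
            pass  # GPU worker timed out or errored: fall through to local NPU
    try:
        return extract_on_npu(text)
    except RuntimeError:
        return extract_with_regex(text)  # instant CPU fallback, lowest recall
```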
