
Smart model routing — use best available LLM across cluster nodes #200

@jaylfc

Description


Summary

When the cluster detects a GPU worker (e.g. a Fedora node with an RTX 3060), the memory system should automatically route LLM extraction to the better/faster model. If the worker goes offline, it should fall back to on-device models seamlessly.

Architecture

  • Orange Pi (host/server): runs taOSmd with MiniLM ONNX embeddings + Qwen3-4B on the NPU for extraction
  • Fedora (worker/node): runs Ollama with Qwen3-4B on an RTX 3060 (10x faster) or potentially Gemma 4 (better quality, when supported)
  • Smart routing: check worker health, route extraction tasks to the fastest available backend, fall back to local on failure (topology sketched below)
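
For concreteness, the topology above could be modeled roughly like this. This is a minimal sketch; every name here (`Node`, `CLUSTER`, the endpoint URL) is hypothetical, not existing taOSmd code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    role: str                # "host" or "worker"
    accelerator: str         # "npu", "gpu", or "cpu"
    models: tuple[str, ...]  # models this node can serve
    endpoint: str | None     # Ollama base URL for remote workers, None for local

# Static view of the cluster described above; real discovery would build
# this dynamically (see Implementation below).
CLUSTER = (
    Node("orange-pi", "host", "npu", ("qwen3:4b",), None),
    Node("fedora-worker", "worker", "gpu", ("qwen3:4b",),
         "http://fedora-worker:11434"),
)
```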

Model Priority Chain

  1. Gemma 4 on GPU worker (when RKLLM supports it) — best quality
  2. Qwen3-4B on GPU worker — same quality as the Pi, 10x faster
  3. Qwen3-4B on Pi NPU — benchmark baseline (72% extraction recall)
  4. Regex extraction on Pi CPU — instant fallback (39% recall); one possible encoding of this chain is sketched below
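
As data the router can walk, the chain above might look like the following sketch. The model/node identifiers are illustrative, and the recall figures are the ones quoted in this issue:

```python
# Ordered best-to-worst; the router takes the first usable entry.
PRIORITY_CHAIN = [
    ("gemma4",   "fedora-worker"),  # best quality (pending support)
    ("qwen3:4b", "fedora-worker"),  # same quality as the Pi, ~10x faster
    ("qwen3:4b", "orange-pi"),      # NPU baseline, 72% extraction recall
    ("regex",    "orange-pi"),      # CPU fallback, 39% recall, never fails
]

def pick_backend(healthy_nodes: set[str], supported_models: set[str]):
    """Return the first (model, node) pair that is healthy and supported."""
    for model, node in PRIORITY_CHAIN:
        if node in healthy_nodes and (model == "regex" or model in supported_models):
            return model, node
    return PRIORITY_CHAIN[-1]  # the regex path on the host is always available
```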

Implementation

  • Health check endpoint on workers: GET /api/worker/health
  • Model capability advertisement: each worker reports its available models + speed
  • Extraction router: picks the best available backend per request
  • Automatic failover: if the GPU worker times out, route to the local NPU (see the sketch below)
  • No user configuration needed — discovery is automatic via the cluster
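
A hedged sketch of the health check plus failover path, assuming workers expose the `GET /api/worker/health` endpoint named above and serve models through the standard Ollama `/api/generate` API. The local helpers (`extract_on_npu`, `extract_with_regex`), hostnames, and timeouts are all placeholders:

```python
import requests

WORKER_HEALTH_URL = "http://fedora-worker:8080/api/worker/health"  # endpoint from this issue
OLLAMA_GENERATE_URL = "http://fedora-worker:11434/api/generate"    # standard Ollama API

def extract_on_npu(text: str) -> str:
    raise RuntimeError("placeholder for the RKLLM-backed local NPU path")

def extract_with_regex(text: str) -> str:
    return ""  # placeholder for the instant regex fallback (39% recall)

def worker_is_healthy(timeout: float = 0.5) -> bool:
    """Cheap liveness probe; a slow or unreachable worker counts as down."""
    try:
        return requests.get(WORKER_HEALTH_URL, timeout=timeout).ok
    except requests.RequestException:
        return False

def extract(text: str) -> str:
    """Route extraction to the best available backend, failing over in order."""
    if worker_is_healthy():
        try:
            resp = requests.post(
                OLLAMA_GENERATE_URL,
                json={"model": "qwen3:4b", "prompt": text, "stream": False},
                timeout=30,  # a GPU stall longer than this triggers failover
            )
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException:
            pass  # GPU worker timed out or errored: fall through to local NPU
    try:
        return extract_on_npu(text)
    except RuntimeError:
        return extract_with_regex(text)  # instant CPU fallback, lowest recall
```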
