v4.6.0

Latest

@mudler mudler released this 04 Jul 07:20
38350d3

🎉 LocalAI 4.6.0 Release! 🚀




LocalAI 4.6.0 is out!

This is a reliability-focused release: AMD ROCm backends now run on-GPU at full speed, distributed model loads no longer wedge when a worker dies, and realtime sessions warm up predictably. It also brings conversation forking to the built-in chat UI, a Prometheus counter for PII/audit events, and an SSRF fix for the model gallery.

Highlights:

  • 🔴 AMD ROCm runs correctly - ggml audio backends offload to the GPU, hipBLASLt kernel-tuning data is bundled (no more slow generic kernels), rocm-vllm installs the right wheel, and the ASIC ID table is found.
  • 🎙️ Predictable realtime - sessions eagerly warm the whole pipeline (VAD, ASR, LLM, TTS) up front, so the first turn no longer pays per-model cold-start stalls, plus a new POST /backend/load API and "Load into memory" UI button.
  • 🌿 Forking chat - retry any assistant answer, branch a new chat from any point, duplicate, or copy the whole conversation, directly in the built-in UI.
  • 🛡️ Distributed hardening - a dead worker can no longer pin the model-load advisory lock (the ~15-minute wedge is gone), and orphaned backend workers self-terminate instead of holding VRAM.
  • 📊 PII/audit metrics - PII detections/masks/blocks are exported as a Prometheus counter, so you can alert when the filter stops firing.
  • 🔒 Gallery SSRF fix - POST /models/apply config-URL fetches are validated against private/loopback/metadata addresses.

Plus idempotent backend installs, tool-calling and reasoning fixes across the vLLM and Python/MLX backends, cloud-proxy compatibility with the newest reasoning models, and the usual set of dependency updates.


📌 TL;DR

| Area | Summary |
| --- | --- |
| 🔴 AMD ROCm reliability | ggml audio backends now compile with -DGGML_HIP=ON and link HIP (real GPU offload); hipBLASLt TensileLibrary data bundled + HIPBLASLT_TENSILE_LIBPATH exported; rocm-vllm installs from the AMD wheel index on Python 3.12; amdgpu.ids symlinked so the ASIC table is found. |
| 🎙️ Realtime warm-up + load API | Sessions block-warm the full pipeline at start (errors surface up front); new POST /backend/load / POST /v1/backend/load, a "Load into memory" UI action, and a load_model MCP tool. Opt out per pipeline with disable_warmup: true. |
| 🌿 Forking chat | Regenerate any assistant answer (not just the last), branch a new chat from any turn, duplicate a chat, or copy it as Markdown - all client-side in the React UI. |
| 🛡️ Process & distributed lifecycle | A dead worker no longer pins the per-model PostgreSQL advisory lock (bounded load ceiling + context-scoped lock_timeout); backend workers self-terminate on parent death (LOCALAI_BACKEND_PARENT_WATCH); the watchdog stops logging optional Free() as an error. |
| ⚙️ Idempotent backend installs | POST /backends/apply and the LOCALAI_EXTERNAL_BACKENDS boot loop no longer re-pull an already-installed backend unless force: true. |
| 📊 PII/audit Prometheus counter | localai_pii_events_total{kind,origin,action,direction} on /metrics, complementing the /api/pii/events ring buffer. |
| 🔒 Gallery SSRF hardening | Gallery config URL fetches run through ValidateExternalURL, blocking private, loopback, link-local, and cloud-metadata addresses. |
| 🧩 Tool-calling & reasoning fixes | Non-streaming vLLM tool calls restored; MLX/Python backends decode tool-call arguments for chat templates and split closing-only </think> reasoning blocks. |

🚀 New Features & Major Enhancements

🔴 AMD ROCm backends run correctly on-GPU

Four coupled fixes make ROCm/hipBLAS backends actually run on AMD hardware, and at full speed, instead of silently falling back to CPU or slow generic kernels:

  • GPU offload for ggml audio backends (#10667): rocm-qwen3-tts-cpp, rocm-omnivoice-cpp, acestep-cpp, and vibevoice-cpp were building CPU-only because their Makefiles passed the no-op -DGGML_HIPBLAS=ON (upstream ggml only understands -DGGML_HIP=ON) and the CMake link loop omitted hip. They now use the same hipblas recipe as llama-cpp and link the HIP backend.
  • hipBLASLt kernel-tuning data (#10660, #10672): the packager bundled rocBLAS data but not the parallel hipBLASLt TensileLibrary_lazy_gfx*.dat files, so every arch silently used slow kernels and logged Cannot read "TensileLibrary_lazy_gfx*.dat". The data is now bundled and HIPBLASLT_TENSILE_LIBPATH is exported by the llama-cpp and turboquant run.sh.
  • rocm-vllm installs the right wheel (#10642, #10651): the backend was pulling the CUDA-only PyPI vllm (fatal ModuleNotFoundError: No module named 'vllm' on AMD). It now pins CPython 3.12 and installs vLLM from the ROCm wheel index (https://wheels.vllm.ai/rocm/).
  • ASIC ID table found (#10624, #10627): the compute-only hipblas image lacks /opt/amdgpu/share/libdrm/amdgpu.ids, so every model load warned. Ubuntu's libdrm-common copy is now symlinked into place.
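
To check whether the hipBLASLt tuning data is visible to a process, a quick diagnostic along these lines may help. It is a minimal sketch, not part of LocalAI: the function name find_tensile_data is hypothetical, and it only assumes the environment variable and file-name pattern quoted in the bullets above.

```python
import os
from pathlib import Path


def find_tensile_data(env=None):
    """Return hipBLASLt tuning files found under HIPBLASLT_TENSILE_LIBPATH.

    Hypothetical helper for illustration; an empty list means the variable
    is unset or the directory holds no TensileLibrary_lazy_gfx*.dat files.
    """
    env = os.environ if env is None else env
    lib_path = env.get("HIPBLASLT_TENSILE_LIBPATH")
    if not lib_path:
        return []
    directory = Path(lib_path)
    if not directory.is_dir():
        return []
    # File-name pattern taken from the warning text quoted in the notes.
    return sorted(str(p) for p in directory.glob("TensileLibrary_lazy_gfx*.dat"))


if __name__ == "__main__":
    files = find_tensile_data()
    print(f"{len(files)} hipBLASLt tuning file(s) found")
```

An empty result inside a ROCm backend's environment would suggest the slow generic kernels are still in play.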

🔗 PRs: #10667, #10672, #10651, #10627

🎙️ Realtime: eager pipeline warm-up + a load-into-memory API

Realtime voice sessions now warm the entire pipeline (VAD, transcription, LLM, TTS, sound detection, voice recognition) eagerly at session start, blocking until every stage is loaded, instead of lazy-loading each sub-model on first use. The first turn no longer pays per-model cold-start stalls, and model-load errors surface up front at session start (as model_load_error) rather than mid-stream. Pipeline sub-models load concurrently, so a session warms in the time of its slowest stage, not the sum, and a failed warm-up names every broken model in a single joined error.
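
The concurrent warm-up with a joined error can be sketched roughly as follows. This is an illustration of the pattern in Python, not LocalAI's Go implementation; ModelLoadError, warm_pipeline, and the stage names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class ModelLoadError(Exception):
    """Raised when one or more pipeline stages fail to warm up (hypothetical)."""


def warm_pipeline(loaders):
    """Warm every stage concurrently and block until all have finished.

    `loaders` maps a stage name (e.g. "vad") to a zero-argument callable.
    If any stage fails, a single error naming every failed stage is raised.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=max(1, len(loaders))) as pool:
        # Submit all stages first so they load in parallel.
        futures = {name: pool.submit(load) for name, load in loaders.items()}
        for name, future in futures.items():
            try:
                future.result()  # blocks until this stage has finished loading
            except Exception as exc:
                failures.append(f"{name}: {exc}")
    if failures:
        raise ModelLoadError("; ".join(failures))
```

Because all stages are submitted before any result is awaited, the total wait is bounded by the slowest stage, and collecting failures instead of raising on the first one is what lets the error list every broken model.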

This also adds a LocalAI-native POST /backend/load (and /v1/backend/load), the inverse of /backend/shutdown, exposed as a "Load into memory" UI action and a load_model MCP admin tool, so admins can pre-warm any model (including full pipelines) on demand. The --load-to-memory startup flag now routes through the same engine. Opt out per pipeline with disable_warmup: true.
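
A pre-warm call might be built like the sketch below. The notes do not spell out the request body, so the {"model": ...} JSON shape, the base URL, and the helper name build_load_request are all assumptions; check the API reference before relying on them.

```python
import json
import urllib.request


def build_load_request(model, base_url="http://localhost:8080"):
    """Build (but do not send) a POST /backend/load request.

    ASSUMPTION: the body is {"model": <name>}; the release notes only name
    the endpoint, not its schema. base_url is also just an example value.
    """
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/backend/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_load_request("my-model")  # "my-model" is a placeholder name
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```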

🔗 PRs: #10662

🌿 Forking chat in the built-in UI

The React chat UI gains conversation-management tools: regenerate any assistant answer (not just the last), branch a new chat from any answer, duplicate a chat into an independent copy, or copy the whole conversation to the clipboard as Markdown. Retrying a mid-conversation answer correctly truncates the conversation before re-asking, both in the DOM and in the request payload (this also fixes a latent stale-closure bug where a mid-conversation retry sent the downstream turns back to the model). All client-side, no backend changes.
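
The truncate-before-retry and branch behaviours can be illustrated with a small sketch. The real logic lives in the React UI; this Python version, with the hypothetical helpers truncate_for_retry and branch_from and an assumed list-of-dicts message shape, only shows the intended slicing.

```python
def truncate_for_retry(messages, retry_index):
    """Return the history to resend when regenerating messages[retry_index].

    Everything from the retried answer onward is dropped, so downstream
    turns are not sent back to the model.
    """
    if messages[retry_index]["role"] != "assistant":
        raise ValueError("only assistant answers can be regenerated")
    return [dict(m) for m in messages[:retry_index]]


def branch_from(messages, answer_index):
    """Return an independent copy of the chat up to and including an answer."""
    return [dict(m) for m in messages[: answer_index + 1]]
```

Copying each message keeps the new chat independent of the original, so later edits to one do not leak into the other.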

🔗 PRs: #10654

🛡️ Sturdier process and distributed lifecycle

  • Dead-worker advisory-lock wedge (#10600): a distributed worker dying mid-load could pin a per-model PostgreSQL advisory lock and fail every subsequent request to that model with 55P03 for ~15 minutes. The detached load context is now bounded by a model-load ceiling, the install wait honors cancellation via singleflight.DoChan, and lock_timeout is scoped to the caller's context budget instead of a deployment-global GUC.
  • Parent-death safety net (#10639): if LocalAI is SIGKILLed before teardown, spawned backend workers used to get reparented to init and linger, holding VRAM and their port. Each backend now polls its parent PID and self-terminates on reparenting. Configurable via LOCALAI_BACKEND_PARENT_WATCH (default on, auto-off on Windows) and LOCALAI_BACKEND_PARENT_WATCH_INTERVAL (default 2s). C++ coverage is llama-cpp for now; Python covers all backends.
  • Quieter watchdog (#10602, #10607): the optional Free() RPC returns gRPC Unimplemented for many backends and the federation proxy, so the watchdog no longer logs a misleading Error freeing GPU resources on eviction. A new grpcerrors.IsUnimplemented helper distinguishes it from genuine failures.
  • Idempotent backend installs (#10643): POST /backends/apply and the LOCALAI_EXTERNAL_BACKENDS boot loop no longer re-download and re-extract an already-installed backend on every apply/boot. Pass "force": true to reinstall anyway (the UI's install button still does, so it doubles as "Reinstall").
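
The parent-death safety net described above boils down to polling the parent PID. A minimal sketch of that idea, assuming a POSIX system and using a hypothetical watch_parent helper (the actual backends may differ in detail):

```python
import os
import time


def watch_parent(original_ppid, interval_seconds=2.0, get_ppid=os.getppid,
                 sleep=time.sleep, on_orphaned=lambda: os._exit(1)):
    """Poll the parent PID and call on_orphaned once it changes.

    A changed parent PID means the original parent died and this process
    was reparented. The 2-second default mirrors the documented default of
    LOCALAI_BACKEND_PARENT_WATCH_INTERVAL; reading that variable is left
    out of this sketch.
    """
    while get_ppid() == original_ppid:
        sleep(interval_seconds)
    on_orphaned()


if __name__ == "__main__":
    import threading

    # Run the watcher in a daemon thread so it never blocks real work.
    threading.Thread(
        target=watch_parent, args=(os.getppid(),), daemon=True
    ).start()
```

The get_ppid, sleep, and on_orphaned parameters exist only to make the loop easy to exercise without killing a real process.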

🔗 PRs: #10600, #10639, #10607, #10643

📊 PII/audit events as a Prometheus counter

The PII middleware / MITM audit pipeline now emits a single monotonic counter, localai_pii_events_total{kind, origin, action, direction}, on /metrics, instrumented at the EventStore.Record choke point. Labels are cardinality-bounded (no pattern or user IDs). This complements the capacity-bound /api/pii/events ring buffer and, crucially, makes silent filter failure alertable: rate() on the counter detects that the PII filter stopped firing after a deploy.
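
As a rough illustration of alerting on a silent filter, the sketch below sums the counter from a /metrics scrape and compares two scrapes. The helper names are hypothetical and the parser handles only the simple text exposition lines it needs; in practice a Prometheus rate() rule would do this job.

```python
def pii_event_total(metrics_text, metric="localai_pii_events_total"):
    """Sum all samples of `metric` in Prometheus text exposition output."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name = line.split("{", 1)[0].split()[0]
        if name != metric:
            continue
        # The value is the first field after the label set (or after the name).
        if "}" in line:
            fields = line[line.rindex("}") + 1:].split()
        else:
            fields = line.split()[1:]
        total += float(fields[0])
    return total


def filter_looks_silent(previous_total, current_total):
    """True when the counter did not advance between two scrapes."""
    return current_total <= previous_total
```

A flat counter over a window in which PII-bearing traffic is expected is the signal worth alerting on; a flat counter during idle periods is not.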

🔗 PRs: #10641

🔒 Gallery SSRF hardening

POST /models/apply with an empty id fetches the supplied url directly. In a default Docker setup (no API key), any client that could reach the API could use this to probe internal services or the cloud-metadata endpoint (169.254.169.254) and exfiltrate a slice of the response via the job error. Gallery config fetches now run through the existing ValidateExternalURL guard (the same one protecting the CORS proxy and media downloads), blocking private, loopback, link-local, unspecified, and metadata addresses. Only plain http(s):// is validated; huggingface://, github:, oci://, ollama://, and file:// are untouched.
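
The kind of check involved can be sketched with Python's standard ipaddress module. This is not ValidateExternalURL itself (which is Go code in LocalAI); passes_url_guard and is_blocked_address are hypothetical names, and the sketch inspects only literal IP hosts, whereas a real guard would also resolve hostnames and check every resulting address.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_address(host):
    """True for private, loopback, link-local, or unspecified literal IPs."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP. A real guard would resolve the name and check
        # each address; this sketch does not perform DNS lookups.
        return False
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_unspecified


def passes_url_guard(url):
    """Apply the address check to http(s) URLs; other schemes pass through."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return True  # e.g. huggingface://, oci://: not validated by this check
    if parts.hostname is None:
        return False
    return not is_blocked_address(parts.hostname)
```

The metadata address 169.254.169.254 falls in the link-local range, so it is rejected by the same test as any other link-local address.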

🔗 PRs: #10673


🐛 Bug Fixes (recap)

  • fix(vllm): restore non-streaming tool-call extraction that regressed after #10351 (a capability flag was mistaken for run state) - #10638
  • fix(python-backends): decode tool-call arguments for chat templates (unbreaks MLX/Qwen3.5 agent loops) and split reasoning when a model emits only a closing </think> - #10658
  • fix(cloud-proxy): drop temperature/top_p and send max_completion_tokens so routing to the newest reasoning models (Claude Opus 4.x, GPT-5.x) stops 400ing - #10640
  • fix(config): revert defaulting swa_full:true for sliding-window-attention models (restores the memory-light reduced KV cache; still available as an explicit per-model opt-in) - #10674
  • fix(kokoros): implement the AudioTranscriptionLive trait stub so the backend compiles against the updated proto - #10612
  • fix(launcher): keep the desktop launcher's data/config under ~/.localai instead of the GUI's working directory - #10610, #10613
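
The closing-only </think> case from the python-backends fix can be illustrated as below. This is a sketch of the splitting idea, not the backend's actual code; split_reasoning is a hypothetical name.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split model output into (reasoning, content).

    Handles output that contains only the closing tag: everything before
    it is treated as reasoning even when no opening tag was emitted.
    """
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text  # no closing tag: the whole text is content
    head = head.lstrip()
    if head.startswith(open_tag):
        head = head[len(open_tag):]  # drop the opening tag when present
    return head.strip(), tail.strip()
```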

👒 Dependencies

Submodule and backend bumps this cycle:

  • ggml-org/llama.cpp x4
  • ikawrakow/ik_llama.cpp x4
  • CrispStrobe/CrispASR x4
  • leejet/stable-diffusion.cpp x3
  • vllm-metal (darwin) x3
  • ggml-org/whisper.cpp x2
  • mudler/parakeet.cpp x1
  • localai-org/privacy-filter.cpp x1
  • vllm-project/vllm cu130 wheel to 0.24.0

Plus new gallery models added via the gallery agent (#10663, #10644).


📖 Documentation

  • Docs version bump for the release - #10614

🙌 New Contributors


Full Changelog: v4.5.6...v4.6.0