v4.5.0 #10470
mudler
announced in
Announcements
🎉 LocalAI 4.5.0 Release! 🚀
LocalAI 4.5.0 is out!
This release widens what LocalAI can perceive, sharpens the realtime voice API, and makes multi-user serving fast with zero configuration. Four new backends land, the React UI redesign ships in full, and distributed mode gets a robustness pass.
Highlights:
- `depth-anything` backend (Depth Anything 3): monocular metric depth + camera pose, with a typed `Depth` RPC and `POST /v1/depth`.
- `ced` backend tags 527 AudioSet sound classes (baby cry, glass breaking, alarms) over REST and a VAD-decoupled realtime stream.
- `supertonic` ONNX TTS backend: multilingual, espeak-free, fast cold start.
- `privacy-filter.cpp` engine adds named-entity token classification alongside a regex secret detector.

Plus model aliases, word-level ASR timestamps, self-contained Vulkan backends, ds4 SSD streaming for 128 GB-class models, hardened distributed staging, and a broad set of fixes.
The redesigned Home: console with a built-in assistant and chat.
📌 TL;DR
- `depth-anything` C++/ggml backend (Depth Anything 3) - metric depth + camera pose, typed `Depth` RPC + `POST /v1/depth`, 8 GGUFs. Plus Depth Anything V2 gallery models.
- `ced` backend (CED AudioSet tagger, 527 classes) - `POST /v1/audio/classification` + VAD-decoupled realtime sound detection.
- `supertonic` ONNX backend - multilingual, no espeak/G2P, 10 voices, fast cold start (CPU).
- `privacy-filter.cpp` backend - encoder/NER token classification scanning whole conversations, alongside a restricted-regex secret detector; NER-centric PII editor in the UI.
- Realtime voice: OpenAI-parity history events `item.delete` / `item.truncate` / `input_audio_buffer.clear`.
- Multi-user serving: VRAM-scaled `n_parallel` (continuous batching on out of the box) - concurrent throughput with no KV blow-up.

🚀 New Features & Major Enhancements
👁️ Depth Perception: `depth-anything`
A new native Go gRPC backend (#10352) dlopens depth-anything.cpp (a ggml port of Depth Anything 3) via purego - no Python at inference - for monocular metric depth + camera pose estimation on CPU. Depth has no native OpenAI endpoint, so the model is exposed three ways:
- `Depth` gRPC RPC + `POST /v1/depth` that returns the full output surface (depth map, stats, camera extrinsics 3×4 / intrinsics 3×3).
- `GenerateImage(src, dst)` writes a min-max-normalized grayscale depth PNG.
- `Predict` returns the depth + pose JSON blob.

Eight Depth Anything 3 GGUFs ship at `mudler/depth-anything.cpp-gguf` (base/small/large/giant + a monocular `mono-large`, q4_k/q8_0/f16/f32), with per-CPU-variant self-contained `.so` builds and the full hardware matrix (cpu, cuda12/13, intel-sycl, vulkan, l4t-arm64). This cycle also adds Depth Anything V2 gallery models (#10413, native version bump) and metric-large + nested metric entries (#10363).

🔊 Sound-Event Classification: `ced`
A new backend (#10425) backed by ced.cpp - a C++/ggml port of CED (Xiaomi), a 527-class AudioSet tagger (baby cry, footsteps, glass breaking, alarms, dog bark...) with full PyTorch parity (f32 e2e 1.7e-7) and Apache-2.0 weights. CPU perf: f16 is ~1.55× faster than the PyTorch reference (~100× realtime), q8_0 uses 6.5× less memory.
- `POST /v1/audio/classification` (fully capability-registered: swagger, `/api/instructions`, auth feature, React `capabilities.js`, docs).
- `pipeline.sound_detection` emits `conversation.item.sound_detection` events, decoupled from VAD (a sound-only session runs with `turn_detection: none`, activating on sounds not speech), with client-driven or server-side windowing.
- `ced-{base,tiny,mini,small}-{f16,q8}` (6 MB → 86 MB) at `mudler/ced-gguf`.

🗣️ On-Device TTS: `supertonic`
A new native Go gRPC TTS backend (#10342) runs Supertone's `supertonic-3` flow-matching model (4 ONNX graphs) via ONNX Runtime - no Python, no espeak-ng / G2P (text preprocessing is NFKD + a Unicode-codepoint→token-id lookup). Upstream's MIT Go pipeline is vendored at a pinned commit and driven from a LocalAI gRPC server, mirroring `sherpa-onnx`'s ONNX-runtime bundling - small image, fast cold start. Ships a `supertonic-3` gallery entry (4 ONNX + 10 voice styles F1-F5/M1-M5, SHA256-pinned), with `voice` / `language` request mapping and `steps` / `speed` / `silence` knobs. CPU-only in this release; CUDA wiring is scaffolded for a follow-up.

🛡️ PII Filtering Gets a NER Tier: `privacy-filter.cpp`
PII filtering moves off the patched llama.cpp `TokenClassify` path onto a new standalone GGML backend, `privacy-filter.cpp` (#10360), serving OpenAI Privacy Filter NER token-classification models (CPU/CUDA/Vulkan). The filter is reworked to be NER-centric - an encoder/NER detection tier scans whole conversations as a single document - alongside a bounded restricted-regex secret-matching detector tier. Detections are labelled by source (`ner` vs `pattern`) with backend trace / confidence / debug observability, `analyze` / `redact` exposed as a synchronous API, and request filtering extended to completions, embeddings, edits and Ollama. The React UI gains a NER-centric PII editor, detector-models table, and middleware default-policy controls; the gallery gets a `privacy-filter-multilingual` token-classify model + an `/import-model` auto-detect importer. A post-merge pass (#10401) added live NER e2e coverage and review fixes.

🎙️ Realtime Voice: Speaker-Aware and Self-Compacting
- **Speaker-aware conversations** (#10424). The realtime voice-recognition gate now surfaces the recognized speaker to the client (a new `conversation.item.speaker` event - a non-breaking LocalAI extension) and feeds identity to the LLM for personalized replies (per-message OpenAI `name` field and/or a `The current speaker is <Name>.` system note). New `pipeline.voice_recognition` keys decouple surfacing from authorization: `enforce: false` resolves and surfaces a speaker without ever dropping a turn, while the gate still fails closed when enforcing. Multi-speaker histories stay correctly attributed (each user item carries its own speaker).
- **Conversation compaction - summarize-then-drop** (#10446). Long realtime sessions used to either feed the whole growing buffer to the LLM (expensive on CPU as it grows) or silently forget old turns. Now the server can fold aged-out turns into a rolling summary instead, via an async, post-turn snapshot → summarize → commit compactor that never holds the conversation lock across the summarizer call and never evicts items without a summary replacing them. Plus the OpenAI-parity history events that were missing: `conversation.item.delete`, `conversation.item.truncate`, `input_audio_buffer.clear`.
- **Also:** configurable `pipeline.max_history_items` (#10331) and a WebRTC data-channel max-message-size raise + keep-alive fix (#10407).

⚙️ Multi-User Serving, On by Default
Two related, config-only (no kernel) changes make concurrent serving fast without any tuning. Both only fill values the user left unset - explicit config always wins.
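The "explicit config always wins" rule amounts to a fill-only-if-unset merge. The sketch below illustrates that idea in Python and is not LocalAI's Go implementation: `fill_unset`, the dict-based config, and the use of `None` for "unset" are all assumptions made for the example. The `n_cache_reuse` value of 256 is the default named in these notes; the `n_parallel` value is a placeholder, not the VRAM-scaled heuristic.

```python
from typing import Any, Optional


def fill_unset(config: dict[str, Optional[Any]], defaults: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of config where only unset (missing or None) keys take a default."""
    merged = dict(config)
    for key, value in defaults.items():
        if merged.get(key) is None:  # unset: key absent or explicitly None
            merged[key] = value
    return merged


# Hypothetical defaults table: 256 is the prefix-reuse default from this release;
# the n_parallel value is a placeholder, not LocalAI's VRAM-scaled heuristic.
SERVING_DEFAULTS = {"n_cache_reuse": 256, "n_parallel": 4}

user_config = {"n_parallel": 1}  # operator explicitly pinned a single slot
effective = fill_unset(user_config, SERVING_DEFAULTS)
print(effective)  # n_parallel stays 1; n_cache_reuse is filled with 256
```

Because the check is `is None` rather than truthiness, an explicit `n_cache_reuse: 0` would also be preserved, which is what lets an operator keep a feature switched off.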
- **Hardware-tuned defaults** (#10411). When `batch:` is unset, `n_batch` / `n_ubatch` now default to 2048 on NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10 / DGX Spark) for a higher prefill ceiling. More importantly, the llama.cpp backend ships `n_parallel = 1`, which serializes concurrent requests and leaves continuous batching off - so multi-user serving was effectively disabled. This release folds in a VRAM-scaled parallel-slot default. Because the unified KV cache makes slots share the context budget, this is concurrency without multiplying KV memory. It works for both single-host (`LocalGPU()`) and distributed setups (the worker reports compute capability + VRAM on registration, and the router re-applies the heuristics for the selected node).
- **Prefix caching on by default** (#10415). The backend ships `n_cache_reuse = 0` (cross-request KV prefix reuse disabled). This release enables it by default (256), so system prompts, RAG context, agent scaffolds and multi-turn chat aren't recomputed every request - a TTFT + throughput win for shared-prefix workloads, no-op for unique prompts. The same PR consolidates `SetDefaults` into clean domain-grouped tiers (`ApplyInferenceDefaults` / `ApplyHardwareDefaults` / `ApplyServingDefaults` / `ApplyGenericDefaults`), completed by a single-source-of-truth defaults refactor (#10418).

🔀 Model Aliases
A new `alias:` field (#10414) makes a model config transparently route all its traffic to another configured model, so operators can rename or redirect a model without reconfiguring any clients - and swap the target live.
Aliases are 1:1 and runtime-swappable, and strict: the target must be an existing, enabled, non-alias model, and alias→alias chains are rejected at load, request and create/swap time. Both names appear in `GET /v1/models`, usage accounting records `requested=alias` / `served=target`, and resolution lives in the universal request middleware so all modalities (chat, completions, embeddings, audio, images) inherit it - including composition with the Router.

⏱️ Word-Level ASR Timestamps, Everywhere
`AudioTranscription` wrapper now forwarding word timestamps end to end (#10402) and a filter for garbage words on the parakeet path (#10421).

🧰 vLLM, ds4, Vulkan & Watchdog
- `parser.extract_tool_calls_streaming` (follow-up to #10346), plus a fix for structured outputs silently ignored on vLLM ≥ 0.23 (`GuidedDecodingParams` removed upstream) (#10343).
- `.so` + deps (rewriting `library_path` to bare sonames), so you can mix a CPU/native/Intel core image with a Vulkan backend and actually get the GPU instead of a silent CPU fallback.
- `--size-aware-eviction` / `LOCALAI_SIZE_AWARE_EVICTION`, live-reloadable via `POST /api/settings`.

🖼️ A Calmer, Sharper Interface
The React UI (`core/http/react-ui`) gets a top-to-bottom redesign this cycle - a calmer, more editorial look with the rough edges sanded off:
- `PageHeader`, `SectionHeading`, `EmptyState`, `Skeleton`, `StatusPill`, a rebuilt Home landing page, and `PageHeader` rolled across all ~29 pages with a navigation scroll-reset fix.
- `ResponsiveTable` that reflows into label/value cards on mobile, and an `UnsavedChangesGuard` protecting Settings / Agent / Fine-Tuning forms. 195/195 Playwright specs green.
- `ClusterPulse` header + conditional attention callout replace the metric-card grid; a `NodePanel` roster shows per-node models without a click (new `GET /api/nodes/models`); a deep-linkable `/app/nodes/:id` detail page replaces nested table drawers; Scheduling moves to its own `/app/scheduling` page. `Nodes.jsx` drops from ~1743 to ~360 lines.
- **More:** localized model strings + "Import" typo fix (#10341), paste images from the clipboard into chat (#10428), and console-based navigation + a drop-in API endpoint section (#10377).

Left: the conversation canvas. Right: the Operate console (system resources, sortable model tables).
🛰️ Distributed Staging Robustness
- `context.WithoutCancel(ctx)` (keeping request values, dropping cancellation), each long step keeping its own bound.
- `/api/operations` polls rotated between frontends. Now mirrored over NATS (`staging.<model>.progress`) with leading-edge debounce, TTL'd remote ops, and locally-owned ops staying authoritative - the same pattern as gallery-install progress.
- `UpdateProgress` signature updated in #10460, plus staging of backend companion assets to remote nodes (#10330).
- `DownloadStallTimeout` (60s) turns an indefinite hang into a fast retryable error, cancellation keeps the `.partial` so the next attempt resumes via `Range`, and stale partials older than 24h are reaped on startup.

🧩 Other Enhancements
- Generic `chat_template_kwargs` (#10359). Pass arbitrary Jinja chat-template variables (e.g. Qwen3's `preserve_thinking`) from model YAML (`chat_template_kwargs:`) or per-request via the OpenAI `metadata` field - no more hardcoded template levers in `grpc-server.cpp`. (Closes #10329.)
- OCI registry pulls identify as `LocalAI/<version>` (via a new `oci.UserAgent()` helper) so operators can attribute traffic. (Implements #6258.)

🐛 Bug Fixes (recap)
- fix(distributed): detach cold-load staging from the request context - #10438
- fix(distributed): broadcast file-staging progress across replicas - #10440
- fix(distributed): stage backend companion assets to remote nodes - #10330
- fix(galleryop): persist cancellable so restarted in-flight ops stay cancellable - #10454
- fix(downloader): stall timeout, resume-safe cancel, and stale-partial reaping - #10406
- fix(vllm): structured outputs silently ignored on vLLM >= 0.23 (GuidedDecodingParams removed) - #10343
- fix(grpc): forward word-level timestamps in AudioTranscription wrapper - #10402
- fix(crispasr): filter garbage words from parakeet word-level timestamps - #10421
- fix(whisperx): use whisperx.diarize.DiarizationPipeline with token kwarg - #10389
- fix(diffusers): pin diffusers and transformers to a known-good pair (#9979) - #10442
- fix: the trl backend's _do_training method directly initializes the trainer ... in backend.py - #10422
- fix(realtime): raise WebRTC data-channel max-message-size + keep sendLoop alive - #10407
- fix(settings): merge partial /api/settings updates instead of overwriting - #10463
- fix(settings): start watchdog on cold-enable from the React UI (#9125) - #10287
- fix(ui): keep row action menu anchored and stop scroll snap on /app/manage - #10419
- fix(react-ui): restore sidebar collapse in dev + stop Talk page auto-scroll - #10383
- fix(launcher): truncate download status labels to stop progress dialog blowout - #10357
- fix(backend): call vram.EstimateModelMultiContext (master build broken: undefined vram.EstimateModel) - #10426
- fix(nix flake): ensure nix flake builds successfully - #10399
- fix(gallery): hide broken Gemma 4 QAT MTP entries - #10348

👒 Dependencies
Another steady bump cycle across submodules and Go/Python deps:
- `ggml-org/llama.cpp`: 7 bumps · `ikawrakow/ik_llama.cpp`: 7 bumps
- `ggml-org/whisper.cpp`: 5 bumps · `leejet/stable-diffusion.cpp`: 5 bumps
- `antirez/ds4`: 3 bumps · `mudler/parakeet.cpp`: 2 bumps · `CrispStrobe/CrispASR`: 2 bumps · `ServeurpersoCom/qwentts.cpp`: 2 bumps
- `ServeurpersoCom/omnivoice.cpp`: 1 bump · `localai-org/privacy-filter.cpp`: 1 bump
- `grpcio` 1.81.0→1.81.1 (vllm)
- `actions/checkout` 6→7 :)

📖 Documentation
- docs: document all available backends and add "built by us" list - #10376
- docs: document the privacy-filter.cpp backend - #10386
- docs: mention apex-quant in the README - #10412
- docs: add translated README links - #10353
- fix(docs): use relearn notice shortcode instead of unsupported alert - #10364
- docs: update docs version - #10333

🙌 New Contributors
Enjoy!
Full Changelog: v4.4.3...v4.5.0
This discussion was created from the release v4.5.0.