Releases: mohitsoni48/TurboLLM
v1.4.2
Bugfix release — vLLM (safetensors) models now load and chat correctly.
Fixed
- Chat on vLLM no longer fails with "Engine returned 400." Tool definitions were attached to every engine, but vLLM rejects a `tools` array unless launched with `--enable-auto-tool-choice` + a `--tool-call-parser`. Tools are now sent only to engines that accept them (the llama.cpp family). Tool-calling on vLLM remains unsupported for now.
- Correct quant classification for vLLM/safetensors models. Compressed-tensors checkpoints were mislabeled as MLX fp16; the quant is now read from `quantization_config` (e.g. `w4a16`), so the model card shows the real quant instead of "MLX".
- The vLLM "Max model length" control is settable again. Multimodal configs nest `max_position_embeddings` under `text_config`; the scanner now reads it, so a model's native context length is no longer reported as `0` (which had clamped the input to 0).
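The context-length fix above can be sketched as a lookup that checks the top-level field first and then the nested `text_config`. This is a minimal illustration, not TurboLLM's scanner: the `ModelConfig` interface and the `nativeContextLength` name are hypothetical, and only the two field names come from the notes.

```typescript
// Hypothetical subset of a Hugging Face config.json; only the fields this sketch reads.
interface ModelConfig {
  max_position_embeddings?: number;
  text_config?: { max_position_embeddings?: number };
}

// Returns the first positive context length found, checking the top level
// before the nested text_config. Returns undefined (not 0) when nothing is
// found, so a caller can leave the input unclamped instead of clamping it to 0.
function nativeContextLength(config: ModelConfig): number | undefined {
  const candidates = [
    config.max_position_embeddings,
    config.text_config?.max_position_embeddings,
  ];
  return candidates.find((n): n is number => typeof n === "number" && n > 0);
}
```

Returning `undefined` rather than `0` for a missing value is a design choice of this sketch: it keeps "unknown" distinct from a real limit.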
v1.4.1 — repo rename to TurboLLM + link consistency
Maintenance — brand-name consistency. The GitHub repository was renamed Turbo-LLM → TurboLLM so the name matches everywhere (product = TurboLLM, npm package = turbollm, repo = TurboLLM). No runtime or behavior changes.
Changed
- Repository renamed to `github.com/mohitsoni48/TurboLLM`; all in-repo links updated (package metadata, README badges/images + license link, and the in-app "Register your engine" issue link). Old links continue to redirect.
Verified: 495/498 tests pass (3 skipped), tsc clean, CI green.
v1.4.0
Hygiene release — the small gaps that made the tool feel unfinished.
Added
- `turbollm --stop`: gracefully stop a running daemon from any terminal (pidfile at `~/.turbollm/daemon.pid`). Unix: SIGTERM→SIGKILL; Windows: `taskkill /T /F`. Confirms a TurboLLM daemon is answering on the recorded port before killing, so a stale pidfile whose PID the OS reused is never mistaken for the daemon.
- `turbollm launch claude --model <key|name>`: load a specific model, then launch Claude Code against it (resolves by key, exact name, or partial name).
- `turbollm launch claude` auto-loads a model when none is running: the last-used model if known, otherwise the first in your library.
- Import chats from OpenAI-format JSON: the importer now also accepts a standard `[{role, content}]` array (or `{messages: [...]}`) from ChatGPT/Claude/LM Studio, auto-detected alongside `.turbollm-chat.json`.
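As an illustration of the two OpenAI-style shapes the importer accepts, the sketch below normalizes either a bare message array or a `{messages: [...]}` wrapper into one list. The names `ChatMessage`, `isMessage`, and `extractMessages` are hypothetical, and the real importer's detection rules may differ.

```typescript
// Hypothetical message shape: just the two fields named in the release note.
type ChatMessage = { role: string; content: string };

function isMessage(value: unknown): value is ChatMessage {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.role === "string" && typeof record.content === "string";
}

// Accepts either [{role, content}, ...] or {messages: [{role, content}, ...]}.
// Returns null when the input matches neither shape.
function extractMessages(parsed: unknown): ChatMessage[] | null {
  let candidate: unknown = parsed;
  if (!Array.isArray(parsed) && typeof parsed === "object" && parsed !== null) {
    candidate = (parsed as Record<string, unknown>).messages;
  }
  if (!Array.isArray(candidate) || candidate.length === 0) return null;
  const messages: ChatMessage[] = [];
  for (const item of candidate) {
    if (!isMessage(item)) return null; // reject the whole file on one bad entry
    messages.push({ role: item.role, content: item.content });
  }
  return messages;
}
```

A caller would run `JSON.parse` on the file first and try this path only when the native `.turbollm-chat.json` format is not recognized.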
Fixed
- Models loaded by the gateway now show as loaded on the Models page (keep-N > 1 auto-swapped models were previously invisible).
v1.3.2
TurboQuant now installs on macOS end-to-end.
Fixed
- macOS Gatekeeper blocking engine binaries. Downloaded engine binaries carry the `com.apple.quarantine` attribute, so Gatekeeper blocked execution and the probe timed out even after the right binary downloaded. TurboLLM now strips the quarantine attribute from every extracted engine on macOS (and on re-install). Thanks @manish026 (#16).
- TurboQuant install/update failing on macOS with `no_release_asset`: now scans for the newest release carrying a binary for the current platform.
- Metal engines timing out on first launch (shader JIT): probe now allows 60s on macOS, 15s elsewhere.
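The release scan and the per-platform probe timeout described above could look roughly like this. The `Release` and `ReleaseAsset` shapes, the function names, and the asset-name matcher are assumptions for illustration; only the 60s/15s values and the "newest release carrying a platform binary" rule come from the notes.

```typescript
// Hypothetical release shapes; a real release API returns many more fields.
interface ReleaseAsset { name: string }
interface Release { tag: string; publishedAt: string; assets: ReleaseAsset[] }

// Walks releases newest-first and returns the first one that ships an asset
// for the current platform, instead of failing when the latest release lacks one.
function newestReleaseWithAsset(
  releases: Release[],
  matchesPlatform: (assetName: string) => boolean,
): { release: Release; asset: ReleaseAsset } | null {
  const newestFirst = [...releases].sort(
    (a, b) => Date.parse(b.publishedAt) - Date.parse(a.publishedAt),
  );
  for (const release of newestFirst) {
    const asset = release.assets.find((a) => matchesPlatform(a.name));
    if (asset) return { release, asset };
  }
  return null;
}

// Probe timeout from the notes: 60s on macOS (Node reports it as "darwin"), 15s elsewhere.
function probeTimeoutMs(platform: string): number {
  return platform === "darwin" ? 60_000 : 15_000;
}
```

Passing the matcher in as a function keeps the asset naming scheme, which the notes do not specify, out of the scan itself.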
Added
- macOS CPU backend variant; TurboQuant listed installable on Linux (x64 Vulkan).
Changed
- Unified TurboQuant's install and update paths onto one per-platform resolver.
Supersedes v1.3.1, which was tagged but never published to npm.
v1.3.1
⚠️ Superseded by v1.3.2 — this version was never published to npm. Its macOS fix was incomplete (Gatekeeper quarantine still blocked the binary); v1.3.2 completes it. Install v1.3.2 instead.
(original notes) TurboQuant install/update macOS fixes, Metal probe timeout, Linux build, macOS CPU backend.
v1.3.0
End-to-end engine builds — compile a CUDA llama.cpp (or any fork) from inside the app, downloading CUDA itself if you don't have it — plus each llama.cpp backend is now its own engine.
Added
- 1-click build from source (Windows + CUDA). The build guide now compiles for you: clone → `cmake` configure → compile `llama-server` → bundle its CUDA runtime → auto-register + activate, with a live phase + streaming compiler log and a success screen. Builds with Ninja inside the MSVC dev environment (driving `nvcc` directly), so a standalone / conda CUDA works where the Visual Studio generator can't. Manual command path kept as a fallback.
- Automatic CUDA download. No CUDA Toolkit? Click Download CUDA: TurboLLM fetches NVIDIA's official build components (nvcc + cudart + cuBLAS + headers, ~0.5 GB) and assembles a toolkit, picking a version your GPU driver supports. No NVIDIA installer, no account.
- Self-contained builds. The built engine bundles the CUDA runtime DLLs next to its binary, so it runs even without a CUDA Toolkit on PATH.
- Build environment (PATH override). CUDA / compiler in a conda env or custom path? Add that folder under Build environment and hit Re-check — those dirs are prepended to PATH for both detection and the build.
- One-click rebuild. The "newer source available" chip on source-built engines recompiles at the latest commit in place.
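The Build environment override amounts to prepending user-chosen folders to PATH before detection and the build run. A minimal sketch, assuming a hypothetical `prependToPath` helper; the exact-match de-duplication is this sketch's choice, not documented behavior.

```typescript
// Prepends extra directories to a PATH string, keeping the first occurrence of
// each entry so user-supplied folders win over existing ones.
// Note: comparison is exact; Windows paths that differ only in case are not merged.
function prependToPath(currentPath: string, extraDirs: string[], delimiter: string): string {
  const existing = currentPath.split(delimiter).filter((dir) => dir.length > 0);
  const seen = new Set<string>();
  const merged: string[] = [];
  for (const dir of [...extraDirs, ...existing]) {
    if (!seen.has(dir)) {
      seen.add(dir);
      merged.push(dir);
    }
  }
  return merged.join(delimiter);
}
```

The delimiter is a parameter because it is `;` on Windows and `:` on Unix-like systems; the resulting string would be passed as `PATH` in the environment of both the prerequisite check and the build process.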
Changed
- Compile-from-source is no longer guidance-only; the prerequisite checker and the build both honor the configured toolchain dirs.
- Each llama.cpp backend is its own engine. CUDA, ROCm, CPU, Vulkan and SYCL builds now appear individually — switch between them (and TurboQuant, forks, …) straight from the Running now dropdown, with no per-row "Use" button. The recommended backend sits in Install & manage; the rest live in a collapsible Other llama.cpp builds section. Multiple builds of one backend collapse into that engine, newest first.
Fixed
- Updating / managing a llama.cpp backend no longer claims it's "not installed." A backend updated to a newer build than the bundled default was wrongly reported as missing — so Update could fail, or re-download a duplicate. Backend install state is now resolved by what's actually on disk, regardless of build number; deleting a backend cleans up all of its builds.
Full changelog: https://github.com/mohitsoni48/Turbo-LLM/blob/main/turbollm/CHANGELOG.md
v1.2.1
Auto-tuning that knows the model, a roomier config panel, and a built-in update check. Bundles the work tracked internally as 1.1.0 + 1.2.0 + 1.2.1 into one release off 1.0.0.
Added
- Auto-tune reads the model card — after a sweep, TurboLLM reads the model's Hugging Face card and prefills the profile's sampling (temperature / top_k / top_p / min_p) with the author's recommended values, shown in the results dialog and applied on Save. Hybrid extraction: a deterministic scan first, then the just-tuned model itself as a fallback for prose-only cards. No card / no recommendation → your sampling is left unchanged.
- Base-model fallback for recommended sampling: most local GGUFs are third-party requants whose card omits the recommendation, so TurboLLM resolves the original model (via HF `base_model`) and reads its card. Well-known models (Gemma, Qwen, GLM, …) now get their recommended sampling even from a bare requant repo. Gated bases (e.g. Gemma) need a configured HuggingFace token.
- Complete tuned config as a table in the auto-tune results dialog: runtime (GPU layers, MoE offload, context, KV cache, flash attention), the full sampling (card values tagged "from card"), and measured speed / VRAM / first-token latency.
- App self-update check: Settings → About shows the running version and, when a newer TurboLLM is published on npm, an "update available" chip with a copy-paste `npm i -g turbollm` command. Cached 24h; silent when offline; never auto-updates.
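The self-update check reduces to two decisions: is the cached answer still fresh, and is the published version newer than the running one. The sketch below shows both under simple assumptions (plain `major.minor.patch` versions, pre-release suffixes ignored); `parseVersion`, `isNewer`, `CachedCheck`, and `cacheIsFresh` are hypothetical names.

```typescript
// Splits "1.4.2" (or "v1.4.2", or "1.4.2-beta") into numeric parts.
// Assumption: pre-release suffixes are dropped rather than ordered.
function parseVersion(version: string): number[] {
  const core = version.replace(/^v/, "").split("-")[0] ?? "";
  return core.split(".").map((part) => Number.parseInt(part, 10) || 0);
}

// True when `latest` is strictly newer than `running`.
function isNewer(latest: string, running: string): boolean {
  const a = parseVersion(latest);
  const b = parseVersion(running);
  for (let i = 0; i < 3; i++) {
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) return diff > 0;
  }
  return false;
}

// 24h cache window, as stated in the notes.
const CACHE_TTL_MS = 24 * 60 * 60 * 1000;

interface CachedCheck { latest: string; checkedAt: number }

// A cached result is reused until it is 24 hours old.
function cacheIsFresh(cached: CachedCheck | undefined, now: number): boolean {
  return cached !== undefined && now - cached.checkedAt < CACHE_TTL_MS;
}
```

Comparing parts numerically matters: a string comparison would rank `1.9.9` above `1.10.0`.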
Changed
- Model config is now a resizable side panel — load/tune settings open as a right-docked panel that resizes the page instead of overlaying it (drag to resize; width remembered), shared by the Models screen and the Chat header. On narrow screens it becomes a full-screen takeover.
Fixed
- Card-sampling extraction now works on reasoning models (Gemma 4, Qwen3) — thinking is disabled for the extraction step, so they emit usable JSON instead of empty or truncated output.
- Large model cards (e.g. Qwen3.5, ~80–95k chars) — the recommended-settings block deep in the card is now within the extraction window; values inside usage code blocks are ignored so demo numbers aren't mistaken for recommendations.
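To illustrate why ignoring usage code blocks matters, here is a rough deterministic scan that removes fenced code from a model card before looking for sampling values. It is a sketch under stated assumptions: the `key=value` / `key: value` pattern, the `Sampling` type, and both function names are hypothetical, and real cards phrase recommendations in many more ways.

```typescript
type SamplingKey = "temperature" | "top_k" | "top_p" | "min_p";
type Sampling = Partial<Record<SamplingKey, number>>;

const SAMPLING_KEYS: readonly SamplingKey[] = ["temperature", "top_k", "top_p", "min_p"];

// Three backticks, built at runtime so this example nests cleanly in markdown.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");

// Drops fenced code blocks so demo values inside usage snippets are never scanned.
function stripFencedCode(markdown: string): string {
  return markdown.replace(FENCED_BLOCK, "");
}

// Finds the first "key=value" or "key: value" mention of each sampling key in prose.
function scanSampling(card: string): Sampling {
  const prose = stripFencedCode(card);
  const found: Sampling = {};
  for (const key of SAMPLING_KEYS) {
    const pattern = new RegExp("\\b" + key + "\\b\\s*[:=]\\s*(\\d+(?:\\.\\d+)?)", "i");
    const match = pattern.exec(prose);
    if (match && match[1] !== undefined) found[key] = Number(match[1]);
  }
  return found;
}
```

Keys with no match are simply absent from the result, which lines up with the rule that a missing recommendation leaves the user's sampling unchanged.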
v1.0.0 — the engine overhaul
TurboLLM 1.0.0 — engines reimagined: hardware-aware, self-updating, and bring-your-own from any source.
Added
- Hardware-aware recommendation + a unified, fit-labeled engine catalog (llama.cpp, KoboldCpp, llamafile, MLX, vLLM, + ik_llama / TurboQuant forks).
- KoboldCpp and llamafile as first-class engine kinds (GGUF, OpenAI-compatible), verified end-to-end.
- Guided "Add your own engine" (folder scan) + a build-from-source guide (Windows + CUDA: prereq check + commands).
- Honest engine updates — real upstream check, per-engine Off/Notify/Auto (default Notify), rollback-safe apply, "Rebuild available" for source builds.
- "Register my engine" funnel, HF-cache default model dir (zero-config onboarding), grouped engine/version dropdown.
Changed
- Redesigned, beginner-first Engines screen (status hero + Running-now switcher → unified Install & manage catalog → collapsed Advanced).
- De-pinned official llama.cpp for updates. Route-level code-splitting (~1 MB → ~314 kB initial JS).
Fixed
- The misleading "you're on the latest" for official llama.cpp (now checks real upstream).
- llamafile launch on current versions (`--no-webui`); cross-engine KV-cache-type bleed (turbo* gated to supporting engines).
- Loopback guard on engine add/scan (block LAN-triggered arbitrary binary execution).
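A loopback guard of the kind mentioned above typically checks the remote address of the request before allowing a sensitive route. The sketch below is one way to express that check, assuming the server exposes the peer address as a string; `isLoopbackAddress` is a hypothetical name and the actual guard may differ.

```typescript
// True only for loopback peers: IPv4 127.0.0.0/8, IPv6 ::1, and
// IPv4-mapped IPv6 forms such as ::ffff:127.0.0.1.
function isLoopbackAddress(remoteAddress: string): boolean {
  const address = remoteAddress.toLowerCase().replace(/^::ffff:/, "");
  if (address === "::1") return true;
  const parts = address.split(".");
  if (parts.length !== 4) return false;
  const allOctets = parts.every((part) => /^\d{1,3}$/.test(part) && Number(part) <= 255);
  return allOctets && parts[0] === "127";
}
```

Routes that register or scan engine binaries would reject the request when this returns `false`, so another machine on the LAN cannot trigger execution of an arbitrary binary.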
Full changelog: see CHANGELOG.md.
v0.8.0
TurboLLM v0.8.0 — Research v2, chat portability, engine lifecycle, and an auto-tune overhaul.
Added
- Research v2 — pluggable web-search providers (Tavily / Kagi / SearXNG); a deterministic retrieval service with a confidence loop and a sources panel; and a heuristic referee that flags reply claims not supported by their cited sources.
- Chat portability: share a chat via a LAN link or a debug snapshot, and export/import chats as `.turbollm-chat.json` (imported chats are fully continuable).
- Agentic tool security: SSRF/RFC-1918 block on `fetch_url` and a confirmation gate on `run_code`.
- vLLM load controls: max model length, GPU memory utilization, max concurrent sequences, dtype, KV-cache dtype, enforce-eager, trust-remote-code.
- Engine lifecycle — 3-state engine rows (Install / Update / Disable / Enable / Delete) for both the catalog engines (vLLM / MLX / TurboQuant) and the llama.cpp backends.
- "All" models view — list models unfiltered by the active engine, with compatibility badges.
- Auto-tune — live prefill-% progress and a Save / Cancel results dialog.
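For the SSRF/RFC-1918 block listed above, the core test is whether a target IPv4 address falls in a private range. The sketch below covers the three RFC 1918 blocks and, as an added assumption of this example, loopback and link-local addresses too. `isBlockedIPv4` is a hypothetical name; a complete guard would also need to resolve hostnames and handle IPv6, which this sketch does not attempt.

```typescript
// True when the IPv4 address is in a range an agent's URL fetcher should refuse:
// RFC 1918 private blocks, plus loopback and link-local (assumed extras).
function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split(".");
  if (parts.length !== 4 || !parts.every((part) => /^\d{1,3}$/.test(part))) return false;
  const octets = parts.map(Number);
  if (octets.some((octet) => octet > 255)) return false;
  const [a = -1, b = -1] = octets;
  return (
    a === 10 ||                          // 10.0.0.0/8      (RFC 1918)
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12   (RFC 1918)
    (a === 192 && b === 168) ||          // 192.168.0.0/16  (RFC 1918)
    a === 127 ||                         // 127.0.0.0/8     loopback
    (a === 169 && b === 254)             // 169.254.0.0/16  link-local
  );
}
```

The check should run against the address the hostname resolves to, not the hostname text, since a public-looking name can point at a private address.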
Changed
- Auto-tune rewritten: binary search over GPU offload, a realistic bench prompt (`min(50k, 0.75 × ctx)`), a 3-minute-per-test cap, GPU settle between candidates, and a spill-aware peak confirmation (a config that spills VRAM to system memory is PCIe-bottlenecked, so throughput peaks at the no-spill edge).
- Stop / restart / load now act as kill switches: they cancel a running auto-tune and abort in-flight chat generations.
- The model load dialog is driven by the active engine kind (vLLM shows its real controls, not MLX copy); slim custom scrollbar; real GPU-layer count instead of "99".
- `turbollm launch claude` raises the request timeout so slow local models don't trigger retries.
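Two pieces of the rewritten auto-tune lend themselves to a small sketch: the bench prompt size and the binary search over GPU offload. The code below assumes the `50k` figure is a token count and that "fits" is monotonic (if N layers fit, fewer layers fit too). Both function names and the synchronous `fits` predicate are hypothetical; a real probe would load the model and measure.

```typescript
// Bench prompt size from the notes: min(50k, 0.75 × ctx), rounded down.
// Assumption: the unit is tokens.
function benchPromptTokens(contextLength: number): number {
  return Math.floor(Math.min(50_000, 0.75 * contextLength));
}

// Binary search for the largest GPU layer count that still fits.
// Assumes 0 layers (CPU only) always fits and that `fits` is monotonic.
function maxFittingLayers(totalLayers: number, fits: (layers: number) => boolean): number {
  let low = 0;            // known to fit
  let high = totalLayers; // upper bound still to be tested
  while (low < high) {
    const mid = Math.ceil((low + high) / 2);
    if (fits(mid)) {
      low = mid;      // mid fits, so search above it
    } else {
      high = mid - 1; // mid does not fit, so search below it
    }
  }
  return low;
}
```

With 48 layers this needs at most about six probes instead of 48, which matters when each probe can run up to the 3-minute cap.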
Fixed
- Claude Code context meter and cache-hit now show real numbers.
- Qwen tool-loop empty reply after web searches.
- vLLM now fails fast with a clear message where it can't run (e.g. Windows).
- ComfyUI reverse-gate log noise when ComfyUI is configured but not running.
- A stale engine error now resets when you switch the active engine.
Install / upgrade: npx turbollm@latest
v0.7.2
Engine lifecycle hardening
Reliability fix for a user-reported cascade: cancelling/closing Claude left requests queuing forever, and stopping TurboLLM left the model loaded in RAM with the UI showing nothing.
Fixed
- Engine load lock: static `Manager.loadGate` shared across every Manager instance ensures at most one model load/reload is ever in flight. New `load()` method is the single atomic entry point: stop → ComfyUI reverse gate → spawn → readiness wait, all under the lock. Eliminates the double-VRAM-allocation race between gateway auto-swap and a concurrent HTTP load.
- Orphan-engine reaping: each engine writes a pidfile (`run/engine-{pid}.pid`) with its port and owner-daemon pid. On startup, `reapStaleEngines()` kills any engine whose port is live but whose owner daemon is gone (terminal closed, killed, crashed). `killTrackedEnginesSync()` on process `exit` covers abrupt exits that bypass signal handlers. Owner-aware: a restarting daemon never reaps engines the incoming process already owns.
- Client-cancel propagation: gateway wires an `AbortController` into every upstream engine fetch. `stream.onAbort` fires `ac.abort()` so a cancelled Claude turn actually stops the engine generating rather than running to completion and clogging the queue. `streamToAnthropic` now uses `reader.cancel()` (not `releaseLock()`) to tear down the upstream body on disconnect.
- Daemon crash on client disconnect: guarded the final `writeSSE('done')` in chat routes; `unhandledRejection` handler in CLI swallows expected `AbortError`s. A disconnecting client can no longer crash the daemon and orphan the engine.
- `SIGHUP` handled: added to graceful-shutdown signals.
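One common way to get a process-wide load lock like the one described above is a promise chain held in a static field, so every instance queues behind the same tail. The sketch below shows that pattern only. The `LoadGate` class, its `run` method, and the simulated load are assumptions; only the idea of a static `Manager.loadGate` shared across instances and a single `load()` entry point comes from the notes.

```typescript
// A minimal async mutex: each task starts only after the previous one settles.
class LoadGate {
  private tail: Promise<void> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(task);
    // Keep the chain alive even if a task rejects; the caller still sees the error.
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

class Manager {
  // Static, so every Manager instance serialises on the same gate.
  static readonly loadGate = new LoadGate();

  private readonly name: string;
  private readonly log: string[];

  constructor(name: string, log: string[]) {
    this.name = name;
    this.log = log;
  }

  // Single entry point; the real sequence (stop, spawn, readiness wait)
  // is replaced here by a short simulated delay.
  load(model: string): Promise<void> {
    return Manager.loadGate.run(async () => {
      this.log.push(this.name + ":start:" + model);
      await new Promise<void>((resolve) => setTimeout(resolve, 10));
      this.log.push(this.name + ":done:" + model);
    });
  }
}
```

Two concurrent `load()` calls on different Manager instances then run strictly one after the other, which is the property that rules out a double VRAM allocation.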
Tests added
- `manager.loadlock.test.ts`: proves two concurrent `load()` calls on different Managers serialise under the global lock
- `manager.reap.test.ts`: reap live orphan, skip live-owner, skip dead-port (recycled-pid guard)
- `anthropic.cancel.test.ts`: `reader.cancel()` propagates to the upstream body on generator teardown