Releases: mohitsoni48/TurboLLM

v1.4.2

23 Jun 16:08
524087e

Bugfix release — vLLM (safetensors) models now load and chat correctly.

Fixed

  • Chat on vLLM no longer fails with "Engine returned 400." Tool definitions were attached to every engine, but vLLM rejects a tools array unless launched with --enable-auto-tool-choice + a --tool-call-parser. Tools are now sent only to engines that accept them (the llama.cpp family). Tool-calling on vLLM remains unsupported for now.
  • Correct quant classification for vLLM/safetensors models. Compressed-tensors checkpoints were mislabeled as MLX fp16; the quant is now read from quantization_config (e.g. w4a16), so the model card shows the real quant instead of "MLX".
  • The vLLM "Max model length" control is settable again. Multimodal configs nest max_position_embeddings under text_config; the scanner now reads it, so a model's native context length is no longer reported as 0 (which had clamped the input to 0).
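
The context-length fix can be pictured with a minimal sketch, assuming a Hugging Face style config.json has already been parsed into an object. The `ModelConfig` interface and the `nativeContextLength` helper are hypothetical names for illustration, not TurboLLM's actual scanner code.

```typescript
// Hypothetical subset of a parsed config.json (illustration only).
interface ModelConfig {
  max_position_embeddings?: number;
  text_config?: { max_position_embeddings?: number };
}

// Looks at the top level first, then at the nested text_config that
// multimodal configs use. Returns undefined when neither holds a positive
// number, so a caller can skip clamping instead of clamping to 0.
function nativeContextLength(config: ModelConfig): number | undefined {
  const candidates = [
    config.max_position_embeddings,
    config.text_config?.max_position_embeddings,
  ];
  return candidates.find((n) => typeof n === "number" && n > 0);
}
```

Returning undefined rather than 0 for a missing value is a design choice of this sketch: it keeps "unknown" distinct from a real limit, which is the failure mode the fix above describes.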

v1.4.1 — repo rename to TurboLLM + link consistency

23 Jun 12:44

Maintenance: brand-name consistency. The GitHub repository was renamed from Turbo-LLM to TurboLLM so the name matches everywhere (product = TurboLLM, npm package = turbollm, repo = TurboLLM). No runtime or behavior changes.

Changed

  • Repository renamed to github.com/mohitsoni48/TurboLLM; all in-repo links updated (package metadata, README badges/images + license link, and the in-app "Register your engine" issue link). Old links continue to redirect.

Verified: 495/498 tests pass (3 skipped), tsc clean, CI green.

v1.4.0

23 Jun 10:33
e57b60a

Hygiene release — the small gaps that made the tool feel unfinished.

Added

  • turbollm --stop — gracefully stop a running daemon from any terminal (pidfile at ~/.turbollm/daemon.pid). Unix SIGTERM→SIGKILL; Windows taskkill /T /F. Confirms a TurboLLM daemon is answering on the recorded port before killing, so a stale pidfile whose PID the OS reused is never mistaken for the daemon.
  • turbollm launch claude --model <key|name> — load a specific model, then launch Claude Code against it (resolves by key, exact name, or partial name).
  • turbollm launch claude auto-loads a model when none is running — the last-used model if known, otherwise the first in your library.
  • Import chats from OpenAI-format JSON — the importer now also accepts a standard [{role, content}] array (or {messages: [...]}) from ChatGPT/Claude/LM Studio, auto-detected alongside .turbollm-chat.json.
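
The two OpenAI-format shapes named above can be told apart with a small amount of structural checking. The following is a minimal sketch, assuming every message carries a string `role` and a string `content`; `ChatMessage`, `isMessage` and `parseOpenAiStyleChat` are hypothetical names, and the real importer may accept more than this.

```typescript
type ChatMessage = { role: string; content: string };

// Type guard: true only for objects with string role and string content.
function isMessage(value: unknown): value is ChatMessage {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.role === "string" && typeof v.content === "string";
}

// Accepts a bare [{role, content}] array or a {messages: [...]} wrapper.
// Returns null for invalid JSON, an empty list, or any other shape.
function parseOpenAiStyleChat(json: string): ChatMessage[] | null {
  let data: unknown;
  try {
    data = JSON.parse(json);
  } catch {
    return null;
  }
  let list: unknown[] | null = null;
  if (Array.isArray(data)) {
    list = data;
  } else if (typeof data === "object" && data !== null) {
    const inner = (data as { messages?: unknown }).messages;
    if (Array.isArray(inner)) list = inner;
  }
  if (list === null || list.length === 0) return null;
  const messages = list.filter(isMessage);
  // Reject the whole file if any entry is not a well-formed message.
  return messages.length === list.length ? messages : null;
}
```

Rejecting the whole file when one entry is malformed is an assumption of this sketch; an importer could equally choose to skip bad entries.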

Fixed

  • Models loaded by the gateway now show as loaded on the Models page (keep-N > 1 auto-swapped models were previously invisible).

v1.3.2

22 Jun 12:42
ad2af6d

TurboQuant now installs on macOS end-to-end.

Fixed

  • macOS Gatekeeper blocking engine binaries. Downloaded engine binaries carry the com.apple.quarantine attribute, so Gatekeeper blocked execution and the probe timed out even after the right binary downloaded. TurboLLM now strips the quarantine attribute from every extracted engine on macOS (and on re-install). Thanks @manish026 (#16).
  • TurboQuant install/update failing on macOS with no_release_asset — now scans for the newest release carrying a binary for the current platform.
  • Metal engines timing out on first launch (shader JIT) — probe now allows 60s on macOS, 15s elsewhere.
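
As a rough illustration of the quarantine and probe-timeout fixes, here is a minimal sketch keyed on Node's `process.platform` values. `probeTimeoutMs` and `quarantineStripCommand` are hypothetical helpers; the sketch only builds the command and does not run it, and using macOS's `xattr -d` is an assumption about how the attribute is stripped.

```typescript
// Probe budget from the notes above: 60s on macOS (Metal shader JIT on
// first launch), 15s elsewhere. "darwin" is Node's platform name for macOS.
function probeTimeoutMs(platform: string): number {
  return platform === "darwin" ? 60000 : 15000;
}

// Builds the argv for removing the quarantine attribute from one binary.
// Returns null on other platforms, where there is nothing to strip.
// Note: xattr -d exits non-zero if the attribute is absent, so a caller
// would need to tolerate that failure.
function quarantineStripCommand(platform: string, binaryPath: string): string[] | null {
  if (platform !== "darwin") return null;
  return ["xattr", "-d", "com.apple.quarantine", binaryPath];
}
```

A caller might pass `process.platform` and hand the returned array to `child_process.execFile`, using the first element as the command and the rest as arguments.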

Added

  • macOS CPU backend variant; TurboQuant listed installable on Linux (x64 Vulkan).

Changed

  • Unified TurboQuant's install and update paths onto one per-platform resolver.

Supersedes v1.3.1, which was tagged but never published to npm.

v1.3.1

22 Jun 11:37
dd7b160

⚠️ Superseded by v1.3.2 — this version was never published to npm. Its macOS fix was incomplete (Gatekeeper quarantine still blocked the binary); v1.3.2 completes it. Install v1.3.2 instead.


(original notes) TurboQuant install/update macOS fixes, Metal probe timeout, Linux build, macOS CPU backend.

v1.3.0

22 Jun 10:32
60374bc

End-to-end engine builds — compile a CUDA llama.cpp (or any fork) from inside the app, downloading CUDA itself if you don't have it — plus each llama.cpp backend is now its own engine.

Added

  • 1-click build from source (Windows + CUDA). The build guide now compiles for you: clone → cmake configure → compile llama-server → bundle its CUDA runtime → auto-register + activate, with a live phase + streaming compiler log and a success screen. Builds with Ninja inside the MSVC dev environment (driving nvcc directly), so a standalone / conda CUDA works where the Visual Studio generator can't. Manual command path kept as a fallback.
  • Automatic CUDA download. No CUDA Toolkit? Click Download CUDA — TurboLLM fetches NVIDIA's official build components (nvcc + cudart + cuBLAS + headers, ~0.5 GB) and assembles a toolkit, picking a version your GPU driver supports. No NVIDIA installer, no account.
  • Self-contained builds. The built engine bundles the CUDA runtime DLLs next to its binary, so it runs even without a CUDA Toolkit on PATH.
  • Build environment (PATH override). CUDA / compiler in a conda env or custom path? Add that folder under Build environment and hit Re-check — those dirs are prepended to PATH for both detection and the build.
  • One-click rebuild. The "newer source available" chip on source-built engines recompiles at the latest commit in place.
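
The Build environment override amounts to putting the configured folders at the front of PATH before detection and before the build. A minimal sketch, assuming the delimiter is passed in (";" on Windows, ":" elsewhere); `prependToPath` is a hypothetical helper, and the de-duplication step is an addition of this sketch rather than documented behavior.

```typescript
// Returns a PATH string with extraDirs first, followed by the existing
// entries. Duplicates are dropped, keeping the earliest occurrence, so an
// override always wins over the same directory further down.
// Comparison is exact; Windows paths differing only in case are not merged.
function prependToPath(currentPath: string, extraDirs: string[], delimiter: string): string {
  const existing = currentPath.split(delimiter).filter((p) => p.length > 0);
  const seen = new Set<string>();
  const merged: string[] = [];
  for (const dir of [...extraDirs, ...existing]) {
    if (!seen.has(dir)) {
      seen.add(dir);
      merged.push(dir);
    }
  }
  return merged.join(delimiter);
}
```

For example, `prependToPath("C:\\a;C:\\b", ["D:\\cuda\\bin"], ";")` yields `D:\cuda\bin;C:\a;C:\b`, so a tool in the override folder is found before any same-named tool already on PATH.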

Changed

  • Compile-from-source is no longer guidance-only; the prerequisite checker and the build both honor the configured toolchain dirs.
  • Each llama.cpp backend is its own engine. CUDA, ROCm, CPU, Vulkan and SYCL builds now appear individually — switch between them (and TurboQuant, forks, …) straight from the Running now dropdown, with no per-row "Use" button. The recommended backend sits in Install & manage; the rest live in a collapsible Other llama.cpp builds section. Multiple builds of one backend collapse into that engine, newest first.

Fixed

  • Updating / managing a llama.cpp backend no longer claims it's "not installed." A backend updated to a newer build than the bundled default was wrongly reported as missing — so Update could fail, or re-download a duplicate. Backend install state is now resolved by what's actually on disk, regardless of build number; deleting a backend cleans up all of its builds.

Full changelog: https://github.com/mohitsoni48/Turbo-LLM/blob/main/turbollm/CHANGELOG.md

v1.2.1

22 Jun 04:40

Auto-tuning that knows the model, a roomier config panel, and a built-in update check. Bundles the work tracked internally as 1.1.0 + 1.2.0 + 1.2.1 into one release off 1.0.0.

Added

  • Auto-tune reads the model card — after a sweep, TurboLLM reads the model's Hugging Face card and prefills the profile's sampling (temperature / top_k / top_p / min_p) with the author's recommended values, shown in the results dialog and applied on Save. Hybrid extraction: a deterministic scan first, then the just-tuned model itself as a fallback for prose-only cards. No card / no recommendation → your sampling is left unchanged.
  • Base-model fallback for recommended sampling — most local GGUFs are third-party requants whose card omits the recommendation, so TurboLLM resolves the original model (via HF base_model) and reads its card. Well-known models (Gemma, Qwen, GLM, …) now get their recommended sampling even from a bare requant repo. Gated bases (e.g. Gemma) need a configured HuggingFace token.
  • Complete tuned config as a table in the auto-tune results dialog — runtime (GPU layers, MoE offload, context, KV cache, flash attention), the full sampling (card values tagged "from card"), and measured speed / VRAM / first-token latency.
  • App self-update check — Settings → About shows the running version and, when a newer TurboLLM is published on npm, an "update available" chip with a copy-paste npm i -g turbollm command. Cached 24h; silent when offline; never auto-updates.
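
The self-update check reduces to two decisions: whether the cached answer is still fresh, and whether the published version is newer than the running one. A minimal sketch, assuming plain MAJOR.MINOR.PATCH version strings with no pre-release suffix; `isCacheFresh` and `isNewer` are hypothetical names, and no registry request is shown.

```typescript
// 24 hours, matching the cache window described above.
const CACHE_TTL_MS = 24 * 60 * 60 * 1000;

// True while a previous check is recent enough to reuse.
function isCacheFresh(checkedAtMs: number, nowMs: number): boolean {
  return nowMs - checkedAtMs < CACHE_TTL_MS;
}

// Compares versions numerically, part by part, so 1.10.0 > 1.9.9.
// Non-numeric parts (e.g. "2-beta") make the comparison return false.
function isNewer(latest: string, current: string): boolean {
  const a = latest.split(".").map(Number);
  const b = current.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return false;
}
```

Comparing numerically rather than as strings matters once any part reaches two digits; a string comparison would rank "1.9.9" above "1.10.0".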

Changed

  • Model config is now a resizable side panel — load/tune settings open as a right-docked panel that resizes the page instead of overlaying it (drag to resize; width remembered), shared by the Models screen and the Chat header. On narrow screens it becomes a full-screen takeover.

Fixed

  • Card-sampling extraction now works on reasoning models (Gemma 4, Qwen3) — thinking is disabled for the extraction step, so they emit usable JSON instead of empty or truncated output.
  • Large model cards (e.g. Qwen3.5, ~80–95k chars) — the recommended-settings block deep in the card is now within the extraction window; values inside usage code blocks are ignored so demo numbers aren't mistaken for recommendations.
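
One way to picture the deterministic scan, including the rule that values inside usage code blocks are ignored, is the sketch below. It is an illustration under stated assumptions, not TurboLLM's extractor: `Sampling` and `scanCardForSampling` are hypothetical names, and the pattern only recognises simple forms such as "temperature=0.7", "top_p: 0.95" or "top_k of 20".

```typescript
type SamplingKey = "temperature" | "top_k" | "top_p" | "min_p";
type Sampling = Partial<Record<SamplingKey, number>>;

function scanCardForSampling(card: string): Sampling {
  // Build the Markdown fence marker at runtime, then drop fenced blocks so
  // demo values in usage snippets are not mistaken for recommendations.
  const fence = "`".repeat(3);
  const prose = card.replace(new RegExp(fence + "[\\s\\S]*?" + fence, "g"), " ");
  const keys: SamplingKey[] = ["temperature", "top_k", "top_p", "min_p"];
  const result: Sampling = {};
  for (const key of keys) {
    // key, optional closing backtick, optional "=", ":", "of" or "to",
    // then a non-negative decimal number. First match in the prose wins.
    const pattern = new RegExp(
      "\\b" + key + "\\b`?\\s*(?:=|:|of|to)?\\s*(\\d+(?:\\.\\d+)?)",
      "i"
    );
    const match = pattern.exec(prose);
    if (match) result[key] = Number(match[1]);
  }
  return result;
}
```

Keys with no match are simply absent from the result, which lines up with the earlier note that sampling is left unchanged when the card gives no recommendation.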

v1.0.0 — the engine overhaul

21 Jun 06:58
2a41d0a

TurboLLM 1.0.0 — engines reimagined: hardware-aware, self-updating, and bring-your-own from any source.

Added

  • Hardware-aware recommendation + a unified, fit-labeled engine catalog (llama.cpp, KoboldCpp, llamafile, MLX, vLLM, + ik_llama / TurboQuant forks).
  • KoboldCpp and llamafile as first-class engine kinds (GGUF, OpenAI-compatible), verified end-to-end.
  • Guided "Add your own engine" (folder scan) + a build-from-source guide (Windows + CUDA: prereq check + commands).
  • Honest engine updates — real upstream check, per-engine Off/Notify/Auto (default Notify), rollback-safe apply, "Rebuild available" for source builds.
  • "Register my engine" funnel, HF-cache default model dir (zero-config onboarding), grouped engine/version dropdown.

Changed

  • Redesigned, beginner-first Engines screen (status hero + Running-now switcher → unified Install & manage catalog → collapsed Advanced).
  • De-pinned official llama.cpp for updates. Route-level code-splitting (~1 MB → ~314 kB initial JS).

Fixed

  • The misleading "you're on the latest" for official llama.cpp (now checks real upstream).
  • llamafile launch on current versions (--no-webui); cross-engine KV-cache-type bleed (turbo* gated to supporting engines).
  • Loopback guard on engine add/scan (block LAN-triggered arbitrary binary execution).
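
The loopback guard can be sketched as a check on the request's remote address before any engine add or scan is allowed. This is a minimal sketch under assumptions: `isLoopbackAddress` is a hypothetical helper, it takes the address as a string (as Node exposes it on `socket.remoteAddress`), and it only covers 127.0.0.0/8, ::1 and IPv4-mapped IPv6 loopback.

```typescript
// True only for loopback addresses: 127.0.0.0/8, ::1, and the
// IPv4-mapped form ::ffff:127.x.y.z. Hostnames are never accepted.
function isLoopbackAddress(address: string): boolean {
  const a = address.trim().toLowerCase();
  if (a === "::1") return true;
  const v4 = a.startsWith("::ffff:") ? a.slice("::ffff:".length) : a;
  const parts = v4.split(".");
  if (parts.length !== 4) return false;
  // Each part must be 1 to 3 digits and at most 255.
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  if (octets.some((o) => Number.isNaN(o) || o > 255)) return false;
  return octets[0] === 127;
}
```

Parsing the octets rather than testing for a "127." prefix keeps look-alike strings such as "127.0.0.1.example.com" from passing.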

Full changelog: see CHANGELOG.md.

v0.8.0

19 Jun 13:45
4af2036

TurboLLM v0.8.0 — Research v2, chat portability, engine lifecycle, and an auto-tune overhaul.

Added

  • Research v2 — pluggable web-search providers (Tavily / Kagi / SearXNG); a deterministic retrieval service with a confidence loop and a sources panel; and a heuristic referee that flags reply claims not supported by their cited sources.
  • Chat portability — share a chat via a LAN link or a debug snapshot, and export/import chats as .turbollm-chat.json (imported chats are fully continuable).
  • Agentic tool security — SSRF/RFC-1918 block on fetch_url and a confirmation gate on run_code.
  • vLLM load controls — max model length, GPU memory utilization, max concurrent sequences, dtype, KV-cache dtype, enforce-eager, trust-remote-code.
  • Engine lifecycle — 3-state engine rows (Install / Update / Disable / Enable / Delete) for both the catalog engines (vLLM / MLX / TurboQuant) and the llama.cpp backends.
  • "All" models view — list models unfiltered by the active engine, with compatibility badges.
  • Auto-tune — live prefill-% progress and a Save / Cancel results dialog.

Changed

  • Auto-tune rewritten — binary search over GPU offload, a realistic bench prompt (min(50k, 0.75 × ctx)), a 3-minute-per-test cap, GPU settle between candidates, and a spill-aware peak confirmation (a config that spills VRAM to system memory is PCIe-bottlenecked, so throughput peaks at the no-spill edge).
  • Stop / restart / load now act as kill switches — they cancel a running auto-tune and abort in-flight chat generations.
  • The model load dialog is driven by the active engine kind (vLLM shows its real controls, not MLX copy); slim custom scrollbar; real GPU-layer count instead of "99".
  • turbollm launch claude raises the request timeout so slow local models don't trigger retries.
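
The bench-prompt size and the binary search over GPU offload from the auto-tune rewrite can be sketched as follows. The assumptions are labeled here and in the comments: "50k" is read as 50,000 tokens, the result is rounded down, and `fits` stands in for whatever real measurement decides that a layer count runs without spilling. `benchPromptTokens` and `maxLayersThatFit` are hypothetical names.

```typescript
// min(50k, 0.75 x ctx), reading 50k as 50,000 tokens and rounding down.
function benchPromptTokens(ctx: number): number {
  return Math.min(50000, Math.floor(0.75 * ctx));
}

// Finds the largest n in [0, totalLayers] for which fits(n) is true.
// Assumes fits is monotonic (if n layers fit, any smaller count fits)
// and that offloading 0 layers always fits.
function maxLayersThatFit(totalLayers: number, fits: (n: number) => boolean): number {
  let lo = 0;
  let hi = totalLayers;
  while (lo < hi) {
    // Round up so the loop always makes progress when lo + 1 === hi.
    const mid = Math.ceil((lo + hi) / 2);
    if (fits(mid)) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return lo;
}
```

With 48 layers this needs at most six probes instead of up to 48, which matters when each probe is a real load and benchmark subject to the per-test time cap.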

Fixed

  • Claude Code context meter and cache-hit now show real numbers.
  • Qwen tool-loop empty reply after web searches.
  • vLLM now fails fast with a clear message where it can't run (e.g. Windows).
  • ComfyUI reverse-gate log noise when ComfyUI is configured but not running.
  • A stale engine error now resets when you switch the active engine.

Install / upgrade: npx turbollm@latest

v0.7.2

19 Jun 05:37

Engine lifecycle hardening

Reliability fix for a user-reported cascade: cancelling/closing Claude left requests queuing forever, and stopping TurboLLM left the model loaded in RAM with the UI showing nothing.

Fixed

  • Engine load lock — static Manager.loadGate shared across every Manager instance ensures at most one model load/reload is ever in flight. New load() method is the single atomic entry point: stop → ComfyUI reverse gate → spawn → readiness wait, all under the lock. Eliminates the double-VRAM-allocation race between gateway auto-swap and a concurrent HTTP load.

  • Orphan-engine reaping — each engine writes a pidfile (run/engine-{pid}.pid) with its port and owner-daemon pid. On startup, reapStaleEngines() kills any engine whose port is live but whose owner daemon is gone (terminal closed, killed, crashed). killTrackedEnginesSync() on process exit covers abrupt exits that bypass signal handlers. Owner-aware: a restarting daemon never reaps engines the incoming process already owns.

  • Client-cancel propagation — gateway wires an AbortController into every upstream engine fetch. stream.onAbort fires ac.abort() so a cancelled Claude turn actually stops the engine generating rather than running to completion and clogging the queue. streamToAnthropic now uses reader.cancel() (not releaseLock()) to tear down the upstream body on disconnect.

  • Daemon crash on client disconnect — guarded the final writeSSE('done') in chat routes; unhandledRejection handler in CLI swallows expected AbortErrors. A disconnecting client can no longer crash the daemon and orphan the engine.

  • SIGHUP handled — added to graceful-shutdown signals.
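
The load lock described in the first item above can be illustrated with a promise chain held in a static field. This is a minimal sketch, not the project's Manager: `SketchManager` is a hypothetical stand-in, and its `load` takes a `work` callback in place of the real stop, gate, spawn and readiness sequence.

```typescript
class SketchManager {
  // Shared by every instance: resolves when the most recently queued
  // load has finished. It never rejects, so the chain cannot break.
  private static loadGate: Promise<void> = Promise.resolve();

  // Runs work() only after all earlier loads have finished, regardless
  // of which instance queued them.
  async load<T>(work: () => Promise<T>): Promise<T> {
    const previous = SketchManager.loadGate;
    let release!: () => void;
    SketchManager.loadGate = new Promise<void>((resolve) => {
      release = resolve;
    });
    await previous;
    try {
      return await work();
    } finally {
      // Release even if work() throws, so later loads are not blocked.
      release();
    }
  }
}
```

Because the gate is static, two calls on different instances still queue behind each other, which is the property the loadlock test listed below is described as proving.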

Tests added

  • manager.loadlock.test.ts: proves two concurrent load() calls on different Managers serialise under the global lock
  • manager.reap.test.ts: reap live orphan, skip live-owner, skip dead-port (recycled-pid guard)
  • anthropic.cancel.test.ts: reader.cancel() propagates to the upstream body on generator teardown