Releases · mohitsoni48/TurboLLM

23 Jun 16:08

mohitsoni48

v1.4.2

524087e

v1.4.2 Latest


Bugfix release — vLLM (safetensors) models now load and chat correctly.

Fixed

Chat on vLLM no longer fails with "Engine returned 400." Tool definitions were attached to requests for every engine, but vLLM rejects a tools array unless it was launched with both --enable-auto-tool-choice and a --tool-call-parser. Tools are now sent only to engines that accept them (the llama.cpp family). Tool-calling on vLLM remains unsupported for now.
Correct quant classification for vLLM/safetensors models. Compressed-tensors checkpoints were mislabeled as MLX fp16; the quant is now read from quantization_config (e.g. w4a16), so the model card shows the real quant instead of "MLX".
The vLLM "Max model length" control is settable again. Multimodal configs nest max_position_embeddings under text_config; the scanner now reads it, so a model's native context length is no longer reported as 0 (which had clamped the input to 0).
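
The context-length and tool-gating fixes above can be illustrated with a minimal sketch. This is not TurboLLM's actual code: the `ModelConfig` and `ChatBody` shapes, the `EngineKind` values, and both function names are assumptions made for illustration. Only `max_position_embeddings`, `text_config`, and the rule "send tools only to the llama.cpp family" come from the notes.

```typescript
// Hypothetical subset of a Hugging Face config.json (field names from the notes).
interface ModelConfig {
  max_position_embeddings?: number;
  text_config?: { max_position_embeddings?: number };
}

// Prefer the top-level value, then fall back to the one nested under
// text_config (as multimodal configs do). 0 means "unknown".
function nativeContextLength(config: ModelConfig): number {
  return (
    config.max_position_embeddings ??
    config.text_config?.max_position_embeddings ??
    0
  );
}

// Assumed engine kinds; the real set of identifiers may differ.
type EngineKind = "llama.cpp" | "vllm" | "mlx";

// Assumption: only the llama.cpp kind accepts a tools array by default.
const TOOL_CAPABLE: ReadonlySet<EngineKind> = new Set<EngineKind>(["llama.cpp"]);

interface ChatBody {
  model: string;
  messages: unknown[];
  tools?: unknown[];
}

// Attach tools only when the target engine is known to accept them.
function attachTools(kind: EngineKind, body: ChatBody, tools: unknown[]): ChatBody {
  if (!TOOL_CAPABLE.has(kind) || tools.length === 0) return body;
  return { ...body, tools };
}
```

With this shape, a multimodal config that only defines `text_config.max_position_embeddings` yields its nested value instead of 0, and a vLLM request body never carries a `tools` key.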


23 Jun 12:44

mohitsoni48

v1.4.1

9f0e3ff

v1.4.1 — repo rename to TurboLLM + link consistency

Maintenance — brand-name consistency. The GitHub repository was renamed Turbo-LLM → TurboLLM so the name matches everywhere (product = TurboLLM, npm package = turbollm, repo = TurboLLM). No runtime or behavior changes.

Changed

Repository renamed to github.com/mohitsoni48/TurboLLM; all in-repo links updated (package metadata, README badges/images + license link, and the in-app "Register your engine" issue link). Old links continue to redirect.

Verified: 495/498 tests pass (3 skipped), tsc clean, CI green.


23 Jun 10:33

mohitsoni48

v1.4.0

e57b60a

v1.4.0

Hygiene release — the small gaps that made the tool feel unfinished.

Added

turbollm --stop — gracefully stop a running daemon from any terminal (pidfile at ~/.turbollm/daemon.pid). Unix SIGTERM→SIGKILL; Windows taskkill /T /F. Confirms a TurboLLM daemon is answering on the recorded port before killing, so a stale pidfile whose PID the OS reused is never mistaken for the daemon.
turbollm launch claude --model <key|name> — load a specific model, then launch Claude Code against it (resolves by key, exact name, or partial name).
turbollm launch claude auto-loads a model when none is running — the last-used model if known, otherwise the first in your library.
Import chats from OpenAI-format JSON — the importer now also accepts a standard [{role, content}] array (or {messages: [...]}) from ChatGPT/Claude/LM Studio, auto-detected alongside .turbollm-chat.json.
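
The OpenAI-format import described above accepts either a bare message array or an object with a `messages` field. A minimal sketch of that auto-detection might look like the following. The function name `parseOpenAiChat` and the `ChatMessage` type are hypothetical, and the real importer may accept richer content (for example non-string `content` parts) than this sketch does.

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

// Type guard: a plain object with string role and string content.
function isMessage(v: unknown): v is ChatMessage {
  if (typeof v !== "object" || v === null) return false;
  const rec = v as Record<string, unknown>;
  return typeof rec.role === "string" && typeof rec.content === "string";
}

// Returns the messages if the JSON is [{role, content}] or {messages: [...]},
// otherwise null so the caller can try another format.
function parseOpenAiChat(json: string): ChatMessage[] | null {
  let data: unknown;
  try {
    data = JSON.parse(json);
  } catch {
    return null; // not JSON at all
  }
  let candidate: unknown = data;
  if (!Array.isArray(data) && typeof data === "object" && data !== null) {
    candidate = (data as Record<string, unknown>).messages;
  }
  if (!Array.isArray(candidate) || candidate.length === 0) return null;
  const out: ChatMessage[] = [];
  for (const item of candidate) {
    if (!isMessage(item)) return null; // reject mixed or malformed arrays
    out.push({ role: item.role, content: item.content });
  }
  return out;
}
```

Returning `null` rather than throwing keeps the detection step cheap to chain with other format probes.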

Fixed

Models loaded by the gateway now show as loaded on the Models page (keep-N > 1 auto-swapped models were previously invisible).


22 Jun 12:42

mohitsoni48

v1.3.2

ad2af6d

v1.3.2

TurboQuant now installs on macOS end-to-end.

Fixed

macOS Gatekeeper blocking engine binaries. Downloaded engine binaries carry the com.apple.quarantine attribute, so Gatekeeper blocked execution and the probe timed out even after the right binary downloaded. TurboLLM now strips the quarantine attribute from every extracted engine on macOS (and on re-install). Thanks @manish026 (#16).
TurboQuant install/update failing on macOS with no_release_asset — now scans for the newest release carrying a binary for the current platform.
Metal engines timing out on first launch (shader JIT) — probe now allows 60s on macOS, 15s elsewhere.
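
The Gatekeeper and probe-timeout fixes above can be sketched as two small platform-dependent helpers. The notes do not say how TurboLLM removes the attribute; using the macOS `xattr` tool with `-dr` is one common way and is an assumption here, as are the function names. The 60s and 15s values come from the notes.

```typescript
// Probe timeout in milliseconds: Metal shader JIT on first launch gets more time.
// `platform` is a Node-style platform string such as "darwin", "linux", "win32".
function probeTimeoutMs(platform: string): number {
  return platform === "darwin" ? 60_000 : 15_000;
}

// Build the argv that would strip com.apple.quarantine recursively from an
// extracted engine directory. Returns null where no action is needed.
// Assumption: the real implementation may use a different mechanism.
function quarantineStripArgv(platform: string, engineDir: string): string[] | null {
  if (platform !== "darwin") return null;
  return ["xattr", "-dr", "com.apple.quarantine", engineDir];
}
```

A caller would pass the returned array to a process-spawning API after each extraction or re-install, and skip the step entirely when it is `null`.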

Added

macOS CPU backend variant; TurboQuant is now listed as installable on Linux (x64 Vulkan).

Changed

Unified TurboQuant's install and update paths onto one per-platform resolver.

Supersedes v1.3.1, which was tagged but never published to npm.

Contributors

manish026


22 Jun 11:37

mohitsoni48

v1.3.1

dd7b160

v1.3.1

⚠️ Superseded by v1.3.2 — this version was never published to npm. Its macOS fix was incomplete (Gatekeeper quarantine still blocked the binary); v1.3.2 completes it. Install v1.3.2 instead.

(original notes) TurboQuant install/update macOS fixes, Metal probe timeout, Linux build, macOS CPU backend.


22 Jun 10:32

mohitsoni48

v1.3.0

60374bc

v1.3.0

End-to-end engine builds — compile a CUDA llama.cpp (or any fork) from inside the app, downloading CUDA itself if you don't have it — plus each llama.cpp backend is now its own engine.

Added

1-click build from source (Windows + CUDA). The build guide now compiles for you: clone → cmake configure → compile llama-server → bundle its CUDA runtime → auto-register + activate, with a live phase + streaming compiler log and a success screen. Builds with Ninja inside the MSVC dev environment (driving nvcc directly), so a standalone / conda CUDA works where the Visual Studio generator can't. Manual command path kept as a fallback.
Automatic CUDA download. No CUDA Toolkit? Click Download CUDA — TurboLLM fetches NVIDIA's official build components (nvcc + cudart + cuBLAS + headers, ~0.5 GB) and assembles a toolkit, picking a version your GPU driver supports. No NVIDIA installer, no account.
Self-contained builds. The built engine bundles the CUDA runtime DLLs next to its binary, so it runs even without a CUDA Toolkit on PATH.
Build environment (PATH override). CUDA / compiler in a conda env or custom path? Add that folder under Build environment and hit Re-check — those dirs are prepended to PATH for both detection and the build.
One-click rebuild. The "newer source available" chip on source-built engines recompiles at the latest commit in place.
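
The "Build environment (PATH override)" behavior above amounts to prepending user-supplied directories to PATH before detection and the build run. A minimal sketch, assuming a hypothetical helper `prependToPath` and an explicit delimiter argument (";" on Windows, ":" elsewhere):

```typescript
// Prepend extraDirs to an existing PATH string, keeping order and dropping
// empty entries and exact duplicates. Earlier entries win lookups, so the
// configured toolchain dirs shadow anything already on PATH.
function prependToPath(currentPath: string, extraDirs: string[], delimiter: string): string {
  const existing = currentPath.split(delimiter);
  const seen = new Set<string>();
  const merged: string[] = [];
  for (const dir of [...extraDirs, ...existing]) {
    if (dir.length === 0 || seen.has(dir)) continue;
    seen.add(dir);
    merged.push(dir);
  }
  return merged.join(delimiter);
}
```

This sketch compares entries as exact strings. Windows paths are case-insensitive, so a real implementation would probably normalize case before de-duplicating.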

Changed

Compile-from-source is no longer guidance-only; the prerequisite checker and the build both honor the configured toolchain dirs.
Each llama.cpp backend is its own engine. CUDA, ROCm, CPU, Vulkan and SYCL builds now appear individually — switch between them (and TurboQuant, forks, …) straight from the Running now dropdown, with no per-row "Use" button. The recommended backend sits in Install & manage; the rest live in a collapsible Other llama.cpp builds section. Multiple builds of one backend collapse into that engine, newest first.

Fixed

Updating / managing a llama.cpp backend no longer claims it's "not installed." A backend updated to a newer build than the bundled default was wrongly reported as missing — so Update could fail, or re-download a duplicate. Backend install state is now resolved by what's actually on disk, regardless of build number; deleting a backend cleans up all of its builds.

Full changelog: https://github.com/mohitsoni48/Turbo-LLM/blob/main/turbollm/CHANGELOG.md


22 Jun 04:40

mohitsoni48

v1.2.1

52eab08

v1.2.1

Auto-tuning that knows the model, a roomier config panel, and a built-in update check. Bundles the work tracked internally as 1.1.0 + 1.2.0 + 1.2.1 into one release on top of 1.0.0.

Added

Auto-tune reads the model card — after a sweep, TurboLLM reads the model's Hugging Face card and prefills the profile's sampling (temperature / top_k / top_p / min_p) with the author's recommended values, shown in the results dialog and applied on Save. Hybrid extraction: a deterministic scan first, then the just-tuned model itself as a fallback for prose-only cards. No card / no recommendation → your sampling is left unchanged.
Base-model fallback for recommended sampling — most local GGUFs are third-party requants whose card omits the recommendation, so TurboLLM resolves the original model (via HF base_model) and reads its card. Well-known models (Gemma, Qwen, GLM, …) now get their recommended sampling even from a bare requant repo. Gated bases (e.g. Gemma) need a configured HuggingFace token.
Complete tuned config as a table in the auto-tune results dialog — runtime (GPU layers, MoE offload, context, KV cache, flash attention), the full sampling (card values tagged "from card"), and measured speed / VRAM / first-token latency.
App self-update check — Settings → About shows the running version and, when a newer TurboLLM is published on npm, an "update available" chip with a copy-paste npm i -g turbollm command. Cached 24h; silent when offline; never auto-updates.
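
The self-update check above needs two decisions: whether a cached result is still fresh (24h), and whether the published version is newer than the running one. The following is a minimal sketch under stated assumptions: the names `isNewer`, `isCacheFresh`, and `CachedCheck` are hypothetical, and the comparison only handles plain `major.minor.patch` versions, ignoring pre-release ordering.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Parse "v1.4.2" or "1.4.2-beta.1" into [1, 4, 2]; non-numeric parts become 0.
function parseVersion(v: string): number[] {
  const core = v.replace(/^v/, "").split("-")[0] ?? "";
  return core.split(".").map((n) => Number.parseInt(n, 10) || 0);
}

// True when `latest` is strictly greater than `current`.
function isNewer(latest: string, current: string): boolean {
  const a = parseVersion(latest);
  const b = parseVersion(current);
  for (let i = 0; i < 3; i++) {
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) return diff > 0;
  }
  return false;
}

interface CachedCheck {
  latest: string;    // version string seen at the last successful check
  checkedAt: number; // epoch milliseconds
}

// A cached result is reused for 24 hours before the registry is asked again.
function isCacheFresh(cache: CachedCheck | null, now: number): boolean {
  return cache !== null && now - cache.checkedAt < DAY_MS;
}
```

Comparing numerically rather than as strings matters once a component reaches two digits: `"1.10.0"` sorts before `"1.9.9"` as text but is the newer version.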

Changed

Model config is now a resizable side panel — load/tune settings open as a right-docked panel that resizes the page instead of overlaying it (drag to resize; width remembered), shared by the Models screen and the Chat header. On narrow screens it becomes a full-screen takeover.

Fixed

Card-sampling extraction now works on reasoning models (Gemma 4, Qwen3) — thinking is disabled for the extraction step, so they emit usable JSON instead of empty or truncated output.
Large model cards (e.g. Qwen3.5, ~80–95k chars) — the recommended-settings block deep in the card is now within the extraction window; values inside usage code blocks are ignored so demo numbers aren't mistaken for recommendations.


21 Jun 06:58

mohitsoni48

v1.0.0

2a41d0a

v1.0.0 — the engine overhaul

TurboLLM 1.0.0 — engines reimagined: hardware-aware, self-updating, and bring-your-own from any source.

Added

Hardware-aware recommendation + a unified, fit-labeled engine catalog (llama.cpp, KoboldCpp, llamafile, MLX, vLLM, + ik_llama / TurboQuant forks).
KoboldCpp and llamafile as first-class engine kinds (GGUF, OpenAI-compatible), verified end-to-end.
Guided "Add your own engine" (folder scan) + a build-from-source guide (Windows + CUDA: prereq check + commands).
Honest engine updates — real upstream check, per-engine Off/Notify/Auto (default Notify), rollback-safe apply, "Rebuild available" for source builds.
"Register my engine" funnel, HF-cache default model dir (zero-config onboarding), grouped engine/version dropdown.

Changed

Redesigned, beginner-first Engines screen (status hero + Running-now switcher → unified Install & manage catalog → collapsed Advanced).
De-pinned official llama.cpp for updates.
Route-level code-splitting (~1 MB → ~314 kB initial JS).

Fixed

The misleading "you're on the latest" for official llama.cpp (now checks real upstream).
llamafile launch on current versions (--no-webui).
Cross-engine KV-cache-type bleed (turbo* gated to supporting engines).
Loopback guard on engine add/scan (block LAN-triggered arbitrary binary execution).

Full changelog: see CHANGELOG.md.


19 Jun 13:45

mohitsoni48

v0.8.0

4af2036

v0.8.0

TurboLLM v0.8.0 — Research v2, chat portability, engine lifecycle, and an auto-tune overhaul.

Added

Research v2 — pluggable web-search providers (Tavily / Kagi / SearXNG); a deterministic retrieval service with a confidence loop and a sources panel; and a heuristic referee that flags reply claims not supported by their cited sources.
Chat portability — share a chat via a LAN link or a debug snapshot, and export/import chats as .turbollm-chat.json (imported chats are fully continuable).
Agentic tool security — SSRF/RFC-1918 block on fetch_url and a confirmation gate on run_code.
vLLM load controls — max model length, GPU memory utilization, max concurrent sequences, dtype, KV-cache dtype, enforce-eager, trust-remote-code.
Engine lifecycle — 3-state engine rows (Install / Update / Disable / Enable / Delete) for both the catalog engines (vLLM / MLX / TurboQuant) and the llama.cpp backends.
"All" models view — list models unfiltered by the active engine, with compatibility badges.
Auto-tune — live prefill-% progress and a Save / Cancel results dialog.
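
The SSRF/RFC-1918 block on fetch_url mentioned above can be illustrated with a check on literal IPv4 hosts. This is a sketch, not the project's implementation: the function names are hypothetical, and blocking loopback (127.0.0.0/8) and link-local (169.254.0.0/16) in addition to the three RFC 1918 ranges is an assumption. Hostnames would need to be resolved first and each resulting address checked.

```typescript
// Parse a dotted-quad string into four octets, or null if it is not one.
function parseIPv4(host: string): number[] | null {
  const parts = host.split(".");
  if (parts.length !== 4) return null;
  const octets = parts.map((p) => (/^\d{1,3}$/.test(p) ? Number(p) : NaN));
  return octets.every((o) => Number.isInteger(o) && o <= 255) ? octets : null;
}

// True for addresses a fetch tool should refuse to contact.
function isBlockedIPv4(host: string): boolean {
  const octets = parseIPv4(host);
  if (octets === null) return false; // not a literal IPv4 address
  const a = octets[0] ?? -1;
  const b = octets[1] ?? -1;
  return (
    a === 10 ||                          // RFC 1918: 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // RFC 1918: 172.16.0.0/12
    (a === 192 && b === 168) ||          // RFC 1918: 192.168.0.0/16
    a === 127 ||                         // loopback (extra, assumed)
    (a === 169 && b === 254)             // link-local (extra, assumed)
  );
}
```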

Changed

Auto-tune rewritten — binary search over GPU offload, a realistic bench prompt (min(50k, 0.75 × ctx)), a 3-minute-per-test cap, GPU settle between candidates, and a spill-aware peak confirmation (a config that spills VRAM to system memory is PCIe-bottlenecked, so throughput peaks at the no-spill edge).
Stop / restart / load now act as kill switches — they cancel a running auto-tune and abort in-flight chat generations.
The model load dialog is driven by the active engine kind (vLLM shows its real controls, not MLX copy); slim custom scrollbar; real GPU-layer count instead of "99".
turbollm launch claude raises the request timeout so slow local models don't trigger retries.
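
Two pieces of the auto-tune rewrite above lend themselves to a short sketch: the bench prompt size `min(50k, 0.75 × ctx)` and a binary search over GPU offload. The notes do not state the unit of "50k" (tokens is assumed here), and the real search also involves timing caps, GPU settling, and spill detection. The `fits` predicate below is a hypothetical stand-in for "this layer count loads and runs without spilling", assumed to be monotonic.

```typescript
// Bench prompt length for a given context size: min(50k, 0.75 × ctx).
function benchPromptTokens(ctx: number): number {
  return Math.min(50_000, Math.floor(0.75 * ctx));
}

// Largest layer count in [0, totalLayers] for which fits() is true.
// Assumes fits(0) is true and that fits is monotonic (once a count
// fails, every larger count fails too).
function maxFittingLayers(totalLayers: number, fits: (layers: number) => boolean): number {
  let lo = 0;           // invariant: lo is known (or assumed) to fit
  let hi = totalLayers; // invariant: nothing above hi fits
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2); // round up so the loop always progresses
    if (fits(mid)) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return lo;
}
```

For a 48-layer model this needs at most six trial loads instead of up to 48 with a linear sweep, which matters when each trial can take minutes.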

Fixed

Claude Code context meter and cache-hit now show real numbers.
Qwen tool-loop empty reply after web searches.
vLLM now fails fast with a clear message where it can't run (e.g. Windows).
ComfyUI reverse-gate log noise when ComfyUI is configured but not running.
A stale engine error now resets when you switch the active engine.

Install / upgrade: npx turbollm@latest


19 Jun 05:37

mohitsoni48

v0.7.2

5ece3e0

v0.7.2

Engine lifecycle hardening

Reliability fix for a user-reported cascade: cancelling/closing Claude left requests queuing forever, and stopping TurboLLM left the model loaded in RAM with the UI showing nothing.

Fixed

Engine load lock — static Manager.loadGate shared across every Manager instance ensures at most one model load/reload is ever in flight. New load() method is the single atomic entry point: stop → ComfyUI reverse gate → spawn → readiness wait, all under the lock. Eliminates the double-VRAM-allocation race between gateway auto-swap and a concurrent HTTP load.
Orphan-engine reaping — each engine writes a pidfile (run/engine-{pid}.pid) with its port and owner-daemon pid. On startup, reapStaleEngines() kills any engine whose port is live but whose owner daemon is gone (terminal closed, killed, crashed). killTrackedEnginesSync() on process exit covers abrupt exits that bypass signal handlers. Owner-aware: a restarting daemon never reaps engines the incoming process already owns.
Client-cancel propagation — gateway wires an AbortController into every upstream engine fetch. stream.onAbort fires ac.abort() so a cancelled Claude turn actually stops the engine generating rather than running to completion and clogging the queue. streamToAnthropic now uses reader.cancel() (not releaseLock()) to tear down the upstream body on disconnect.
Daemon crash on client disconnect — guarded the final writeSSE('done') in chat routes; unhandledRejection handler in CLI swallows expected AbortErrors. A disconnecting client can no longer crash the daemon and orphan the engine.
SIGHUP handled — added to graceful-shutdown signals.

Tests added

manager.loadlock.test.ts — proves two concurrent load() calls on different Managers serialise under the global lock
manager.reap.test.ts — reap live orphan, skip live-owner, skip dead-port (recycled-pid guard)
anthropic.cancel.test.ts — reader.cancel() propagates to the upstream body on generator teardown
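
The orphan-reaping rule and the three cases named for manager.reap.test.ts can be expressed as a pure decision function. This is a sketch of the stated rule only (reap when the port is live and the owner daemon is gone), not the actual reapStaleEngines() code. The `EnginePidfile` and `Probes` shapes and the `shouldReap` name are assumptions.

```typescript
// Hypothetical contents of run/engine-{pid}.pid as described in the notes.
interface EnginePidfile {
  enginePid: number;
  port: number;
  ownerPid: number;
}

// Injected probes so the decision can be tested without real processes.
interface Probes {
  isPortLive(port: number): boolean;
  isProcessAlive(pid: number): boolean;
}

// Decide whether the engine behind a pidfile is an orphan that should be killed.
function shouldReap(entry: EnginePidfile, selfPid: number, probes: Probes): boolean {
  // Never reap engines the current (possibly restarting) daemon already owns.
  if (entry.ownerPid === selfPid) return false;
  // The owner daemon is still running: not an orphan.
  if (probes.isProcessAlive(entry.ownerPid)) return false;
  // Owner is gone. Only reap if something is still answering on the recorded
  // port, which guards against a stale pidfile whose PID was recycled.
  return probes.isPortLive(entry.port);
}
```

Keeping the decision separate from the kill makes the three documented cases (live orphan, live owner, dead port) easy to cover with plain unit tests.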

