
Model-loading: pre-painted dialog, throttled cold loads, keep_alive setting, startup warm #210

Merged: mrjeeves merged 12 commits into main from claude/gifted-tesla-HAIae on May 29, 2026

@mrjeeves mrjeeves (Owner) commented May 29, 2026

Makes loading a local model far less disruptive — especially on laptops where a cold load could freeze the whole machine.

1. Cold-start "loading the model" dialog

A non-blocking dialog over the chat surface while the model loads, with a Cancel button. Adapts copy for local vs. mesh ("Loading <model>…" / "Waiting for <host>…").

  • Pre-painted before the freeze: before sending, we check Ollama's /api/ps to see if the model is resident. On a predicted cold start we show the dialog and force an actual paint (tick + double rAF) before firing the request — so it's on screen ahead of any load freeze. Warm sends keep a lightweight 5s reactive fallback timer.
  • Tears down on the first agent frame; not re-armed for warm later turns.

2. Throttle the load so the machine stays usable

The freeze happens inside the Ollama server while it pages multi-GB weights in from disk. Since the app spawns that server itself, we now throttle it:

  • Lower IO priority of the spawned server — Linux ionice -c3 (idle) + a small renice; macOS taskpolicy -b; Windows BelowNormal priority class. IO is the real lever (loading is disk-bound), so this keeps the desktop responsive with negligible impact on inference (compute-bound). Best-effort; only applies when we own the process (not a system/tray Ollama).
  • Cap memory pressure on spawn via OLLAMA_MAX_LOADED_MODELS=1 and OLLAMA_NUM_PARALLEL=1, cutting the swap thrash behind the hardest freezes.
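
As a rough illustration of the memory caps above, the following minimal sketch builds (but does not spawn) a server command with both variables set. The function names and the `ollama_bin` argument are hypothetical and are not taken from this PR's code; only the two environment variables and their values come from the description.

```rust
use std::ffi::OsStr;
use std::process::Command;

/// Build the command for an app-owned Ollama server without spawning it.
/// `ollama_bin` is a hypothetical argument; the real binary lookup is not shown here.
fn ollama_serve_command(ollama_bin: &str) -> Command {
    let mut cmd = Command::new(ollama_bin);
    cmd.arg("serve")
        // Keep at most one model resident at a time.
        .env("OLLAMA_MAX_LOADED_MODELS", "1")
        // Handle one request at a time per loaded model.
        .env("OLLAMA_NUM_PARALLEL", "1");
    cmd
}

/// Look up an explicitly set environment variable on a not-yet-spawned command.
fn env_value<'a>(cmd: &'a Command, key: &str) -> Option<&'a OsStr> {
    cmd.get_envs()
        .find(|(k, _)| *k == OsStr::new(key))
        .and_then(|(_, v)| v)
}

fn main() {
    let cmd = ollama_serve_command("ollama");
    println!(
        "OLLAMA_MAX_LOADED_MODELS = {:?}",
        env_value(&cmd, "OLLAMA_MAX_LOADED_MODELS")
    );
    println!(
        "OLLAMA_NUM_PARALLEL = {:?}",
        env_value(&cmd, "OLLAMA_NUM_PARALLEL")
    );
}
```

Because the variables are set on the child command only, a system or tray Ollama that the app did not spawn is unaffected, which matches the "only when we own the process" caveat.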

3. Warm the model proactively at startup

After launch we warm the active chat model in the background (throttled), so the one-time cold load happens at a predictable moment — with the dialog shown — instead of on the first message. Skipped when the model isn't downloaded or keep_alive is 0.

4. Configurable keep_alive (the repeated-cold-start fix)

Previously, chat requests never set Ollama's keep_alive, so they relied on Ollama's 5-minute default; once the app sat idle longer than that, the model unloaded and the next message was a cold start.

  • Chat requests (streaming + one-shot) now send keep_alive, read in Rust from config.
  • Configurable in Settings → Hardware → Performance → Model memory, default 30m, from "Unload immediately" (frees RAM/VRAM for transcription on tight machines) to "Until the app quits".
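
One way to picture the setting is as a small enum that maps each option to the value sent in the request. This is a sketch under assumptions: the `KeepAlive` type, its variant names, and both methods are hypothetical, and the choice of a negative number for "Until the app quits" is an assumption based on Ollama treating a negative keep_alive as "keep loaded indefinitely". The 30m default and the skip-warm-when-0 rule come from the description.

```rust
/// Hypothetical model of the "Model memory" setting; names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum KeepAlive {
    UnloadImmediately,
    Minutes(u32),
    UntilAppQuits,
}

impl Default for KeepAlive {
    fn default() -> Self {
        // The PR states a default of 30m.
        KeepAlive::Minutes(30)
    }
}

impl KeepAlive {
    /// JSON literal for the `keep_alive` field of a chat request.
    /// Durations are sent as quoted strings such as "30m"; 0 and -1 as bare numbers.
    fn to_json_literal(&self) -> String {
        match self {
            KeepAlive::UnloadImmediately => "0".to_string(),
            KeepAlive::Minutes(m) => format!("\"{m}m\""),
            // Assumption: a negative number asks Ollama to keep the model loaded.
            KeepAlive::UntilAppQuits => "-1".to_string(),
        }
    }

    /// Startup warming is pointless when the model would unload right away.
    fn allows_startup_warm(&self) -> bool {
        *self != KeepAlive::UnloadImmediately
    }
}

fn main() {
    for setting in [
        KeepAlive::UnloadImmediately,
        KeepAlive::default(),
        KeepAlive::UntilAppQuits,
    ] {
        println!(
            "{:?}: keep_alive = {}, warm at startup = {}",
            setting,
            setting.to_json_literal(),
            setting.allows_startup_warm()
        );
    }
}
```

Keeping the mapping in one place means the streaming and one-shot chat paths, and the warm call, would all send the same value.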

5. Live resource readout in the dialog

Shows live CPU / RAM / GPU + disk free so the user can see why a load is slow (e.g. RAM near full → paging from disk).

  • Reuses the existing usage_live_snapshot command; the LiveSnapshot type was promoted to types.ts and is now shared with the Usage settings tab.
  • Polled only while the dialog is visible (1.2s), with interval cleanup on teardown.

New backend surface

  • ollama_model_loaded(model) → bool (via /api/ps)
  • ollama_warm(model) (ensures running, then warms)
  • process::lower_priority(pid) — cross-platform best-effort throttle
  • ollama_keep_alive config field (default 30m)

Verification

  • pnpm run check — 0 errors, 0 warnings
  • pnpm run build — succeeds
  • cargo check + cargo clippy — my code clean (only pre-existing mesh/ unused-import warnings)
  • cargo test ... ollama — 4 passed

Notes / possible follow-ups

  • Throttling/mem-caps only apply when MyOwnLLM spawns Ollama. If Ollama runs as a system/tray service (common on Windows/macOS), we don't own the PID — could be extended to discover and throttle it.
  • The dialog's resource readout is whole-machine, not Ollama-process-specific; an /api/ps-based "model X — 100% GPU, 3.2 GB VRAM" line is a possible follow-up.

https://claude.ai/code/session_01Eze77o5msnfo5CBnJjd3Sd

claude added 2 commits May 29, 2026 01:19
When the first token doesn't arrive within 5s of sending, surface a
small non-blocking dialog explaining the model is loading into memory,
with a Cancel button. The dialog tears down on the first frame (delta,
tool call, or terminal event) and isn't re-armed for warm later turns.

Two improvements to the model-loading experience:

1. Cold-start fix: chat requests now send Ollama a keep_alive so the
   model stays resident between turns instead of relying on Ollama's
   5-minute default (which caused repeated cold-start reloads). The
   value is user-configurable in Settings > Hardware > Performance,
   defaulting to 30m, with options from 'unload immediately' (for
   memory-tight machines coexisting with transcription) to 'keep
   until the app quits'. Read in Rust from config so both the
   streaming and one-shot chat paths pick it up.

2. The model-loading dialog now shows a live CPU / RAM / GPU readout
   (and disk free) so the user can see why a load is slow — e.g. RAM
   near full means the model is paging in from disk. Reuses the
   existing usage_live_snapshot command and the LiveSnapshot type,
   now promoted to types.ts and shared with the Usage settings tab.

@mrjeeves mrjeeves changed the title from "Add cold-start model-loading dialog to chat" to "Model-loading dialog: cold-start UX, keep_alive setting, live resource readout" May 29, 2026

Addresses laptops freezing so hard during a cold model load that the
load dialog never paints, and warms the model without locking up the
machine:

- Throttle the Ollama server we spawn: lower its IO priority (Linux
  ionice idle + small renice; macOS taskpolicy -b; Windows BelowNormal)
  so the disk thrash of paging weights in no longer starves the
  desktop. Best-effort, only when we own the process.
- Cap memory pressure via OLLAMA_MAX_LOADED_MODELS=1 and
  OLLAMA_NUM_PARALLEL=1 on spawn, cutting the swap thrash that causes
  the hardest freezes.
- Pre-paint the load dialog: before sending, check /api/ps to see if
  the model is resident; on a predicted cold start, show the dialog and
  force a paint BEFORE firing the request so it's on screen ahead of
  any freeze. Warm starts keep the lightweight 5s reactive timer.
- Warm the chat model in the background at startup (throttled) so the
  one-time cold load happens at a predictable moment instead of on the
  first message. Skipped when the model is missing or keep_alive is 0.
- warm() now honors the configured keep_alive instead of a fixed 10m.

@mrjeeves mrjeeves changed the title from "Model-loading dialog: cold-start UX, keep_alive setting, live resource readout" to "Model-loading: pre-painted dialog, throttled cold loads, keep_alive setting, startup warm" May 29, 2026

claude added 9 commits May 29, 2026 05:04
The live usage sampler returned None for system-wide CPU% and RAM-used
on macOS (only the per-app figures populated), so the load dialog and
Usage tab showed app metrics but blank system metrics.

- System CPU%: sum every process's ps %cpu and normalise by core count
  (single fast call; no host_statistics FFI, no top -l 2 stall).
- System RAM used: (active + wired + compressed) pages x page size from
  vm_stat — the components Activity Monitor reports as Memory Used —
  using vm_stat's own header page size for self-consistent math.
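
The two calculations above can be sketched as pure string-parsing helpers. This is an illustration, not the code from this commit: the function names are hypothetical, the `ps` input is assumed to be one %cpu value per line (for example from `ps -A -o %cpu`), and the clamp to 100% is an added assumption.

```rust
/// Read the page size from vm_stat's header line,
/// e.g. "Mach Virtual Memory Statistics: (page size of 16384 bytes)".
fn parse_page_size(vm_stat: &str) -> Option<u64> {
    let header = vm_stat.lines().next()?;
    let after = header.split("page size of ").nth(1)?;
    after.split_whitespace().next()?.parse().ok()
}

/// Read one counter such as "Pages active:   146874." (note the trailing dot).
fn parse_pages(vm_stat: &str, label: &str) -> Option<u64> {
    vm_stat.lines().find_map(|line| {
        let (name, value) = line.split_once(':')?;
        if name.trim() != label {
            return None;
        }
        value.trim().trim_end_matches('.').parse().ok()
    })
}

/// (active + wired + compressed) pages times the page size from the same output.
fn ram_used_bytes(vm_stat: &str) -> Option<u64> {
    let page_size = parse_page_size(vm_stat)?;
    let active = parse_pages(vm_stat, "Pages active")?;
    let wired = parse_pages(vm_stat, "Pages wired down")?;
    let compressed = parse_pages(vm_stat, "Pages occupied by compressor")?;
    Some((active + wired + compressed) * page_size)
}

/// Sum per-process %cpu values and normalise by core count.
/// Non-numeric lines (such as the "%CPU" header) are skipped.
fn system_cpu_percent(ps_output: &str, cores: usize) -> Option<f64> {
    if cores == 0 {
        return None;
    }
    let total: f64 = ps_output
        .lines()
        .filter_map(|line| line.trim().parse::<f64>().ok())
        .sum();
    // Assumption: clamp, since per-process samples can briefly sum past 100% per core.
    Some((total / cores as f64).min(100.0))
}

fn main() {
    let vm_stat = "Mach Virtual Memory Statistics: (page size of 16384 bytes)\n\
                   Pages active:                    100.\n\
                   Pages wired down:                 50.\n\
                   Pages occupied by compressor:     25.\n";
    println!("RAM used: {:?} bytes", ram_used_bytes(vm_stat));
    println!("CPU: {:?} %", system_cpu_percent("%CPU\n 50.0\n150.0\n", 4));
}
```

Taking captured text rather than running the commands is what lets helpers like these be unit-tested on a non-macOS host.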

Parsing is factored into pure helpers with unit tests so the logic is
verified on any host even though the macOS shell calls only run there.

Mechanical, behavior-preserving cleanup via cargo fix / clippy --fix:
- Remove unused re-exports (mesh identity/roster/signing).
- Inline format args in format!/anyhow!/write! across asr, diarize,
  mesh, transcribe, cli, main.

cargo check is now warning-free. The remaining clippy-only lints
(result_large_err, doc-list indentation) need invasive manual changes
and are left for a focused follow-up.

macOS inference was crippled because the throttle used taskpolicy -b
(background QoS), which demotes the whole process to efficiency cores
and throttles compute, not just disk. Switch to IO-only throttling so
the machine stays responsive during a load while token generation runs
at full speed:
- macOS: taskpolicy -d throttle (disk IO policy only; CPU/QoS untouched).
- Linux: ionice best-effort low (-c2 -n7) instead of idle, and drop the
  renice so inference keeps full CPU.
- Windows: unchanged (BelowNormal is a mild priority nudge, not a
  compute throttle).

Daemon binary search no longer logs a 'skipping ...' line for every
probed-but-inapplicable location on the happy path. Reasons are now
collected and printed only when the search actually fails (no usable
binary, or every candidate fails to spawn).
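
A minimal sketch of this "collect reasons, report only on failure" pattern follows. The function name, the `probe` callback, and the message wording are hypothetical; the commit's actual search and spawn logic is not shown in this thread.

```rust
use std::path::{Path, PathBuf};

/// Try each candidate in order. `probe` returns Ok(()) for a usable binary
/// or Err(reason) when the candidate should be skipped.
fn find_daemon<F>(candidates: &[PathBuf], mut probe: F) -> Result<PathBuf, String>
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut skipped = Vec::new();
    for candidate in candidates {
        match probe(candidate.as_path()) {
            // Happy path: return immediately; collected reasons are never printed.
            Ok(()) => return Ok(candidate.clone()),
            Err(reason) => {
                skipped.push(format!("skipping {}: {}", candidate.display(), reason))
            }
        }
    }
    // Only a failed search surfaces the per-candidate reasons.
    Err(format!(
        "no usable daemon binary found:\n{}",
        skipped.join("\n")
    ))
}

fn main() {
    let candidates = vec![
        PathBuf::from("/opt/example/ollama"),
        PathBuf::from("/usr/local/bin/ollama"),
    ];
    // Hypothetical probe: accept only paths that exist on this machine.
    let result = find_daemon(&candidates, |p| {
        if p.exists() {
            Ok(())
        } else {
            Err("not found".to_string())
        }
    });
    match result {
        Ok(path) => println!("using {}", path.display()),
        Err(report) => eprintln!("{report}"),
    }
}
```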

Clean up the warnings surfaced by a Windows build (verified
via x86_64-pc-windows-gnu cross-check):
- usage.rs: drop unused std::ffi::c_void import.
- process.rs: drop redundant CommandExt import (tokio Command has an
  inherent creation_flags).
- ollama.rs: allow(unreachable_code) on install() — the tail Ok(()) is
  the Linux/unsupported fallback, unreachable on macOS/Windows by design.
- hardware.rs: cfg-gate the Linux-only parsers' dead_code allowance.

Promote performance settings out of the Hardware tab into their own
Performance tab (listed right after Hardware), and make the load
throttle user-tunable:

- New ollama_throttle config (off | io | aggressive), default io.
  - off: no throttle (fastest load, can bog the machine down).
  - io: ease disk IO priority only; inference stays full speed (default).
  - aggressive: also demote CPU/QoS; most responsive desktop, slower
    inference.
- lower_priority() now takes the mode and branches per platform; the
  Ollama spawn reads the config and skips throttling entirely when off.
- New PerformanceSection.svelte hosts both the keep-model-loaded
  (keep_alive) and load-throttle settings; removed the inline
  Performance group from HardwareSection.

cargo fmt --check failed on two lines the earlier cargo fix / clippy
--fix pass left wrapped non-canonically (embedder.rs anyhow! call and
roster.rs re-export list). Reflow to rustfmt's canonical form. fmt
--check, clippy --all-targets, and cargo test all pass locally on the
pinned 1.88.0 toolchain.

The io throttle was applied post-spawn via taskpolicy -p, which is a
no-op on macOS, so the server ran unthrottled and a load could starve
the display/networking and freeze the machine. And the previous fix
left the CPU fully open to the server (IO-only), which is what starved
the system in the first place.

Fix: throttle at launch with a moderate nice 10 so the server yields CPU
to the system (display, networking, WebView) when they need it, but
still gets the bulk of the cores when nothing competes — responsive
machine, inference not crippled. Applied as an argv prefix (nice execs
the target), which is also the only reliable way to set macOS IO policy.

- io (balanced, default): nice -n 10 (+ low best-effort ionice on Linux).
- aggressive: nice -n 19 + idle ionice (Linux) / background QoS (macOS).
- Windows: post-spawn priority class (BelowNormal / Idle).
- Fallback to a direct spawn if the wrapper can't bring Ollama up, so a
  missing/incompatible tool never disables the LLM.
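
The per-mode, per-platform wrapper described in the list above could be sketched like this. The type and function names are hypothetical, and the exact flags (`ionice -c2 -n7`, `ionice -c3`, `taskpolicy -b`) are borrowed from earlier messages in this PR as an assumption about what the final wrapper passes. Windows gets no prefix because its priority class is set after spawn.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Throttle {
    Off,
    Io,
    Aggressive,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Os {
    Linux,
    MacOs,
    Windows,
}

/// Parse the ollama_throttle config value; unknown values fall back to the
/// documented default, io.
fn parse_throttle(value: &str) -> Throttle {
    match value.trim() {
        "off" => Throttle::Off,
        "aggressive" => Throttle::Aggressive,
        _ => Throttle::Io,
    }
}

/// argv prefix that runs the real command under the chosen throttle.
fn launch_prefix(mode: Throttle, os: Os) -> Vec<&'static str> {
    match (mode, os) {
        // No wrapper when throttling is off, or on Windows (handled post-spawn).
        (Throttle::Off, _) | (_, Os::Windows) => vec![],
        (Throttle::Io, Os::Linux) => vec!["nice", "-n", "10", "ionice", "-c2", "-n7"],
        (Throttle::Io, Os::MacOs) => vec!["nice", "-n", "10"],
        (Throttle::Aggressive, Os::Linux) => vec!["nice", "-n", "19", "ionice", "-c3"],
        (Throttle::Aggressive, Os::MacOs) => vec!["nice", "-n", "19", "taskpolicy", "-b"],
    }
}

/// Full argv: wrapper prefix followed by the target command and its arguments.
fn wrapped_argv(mode: Throttle, os: Os, target: &[&str]) -> Vec<String> {
    launch_prefix(mode, os)
        .into_iter()
        .chain(target.iter().copied())
        .map(String::from)
        .collect()
}

fn main() {
    let mode = parse_throttle("io");
    for os in [Os::Linux, Os::MacOs, Os::Windows] {
        println!("{:?}: {:?}", os, wrapped_argv(mode, os, &["ollama", "serve"]));
    }
}
```

A caller would try the wrapped argv first and retry with the bare target if the server does not come up, which is the fallback the last bullet describes.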

Restore warm_on_startup to default ON (the load now runs under the
throttle, so it won't lock up the machine); it remains a toggle in
Settings → Performance.

Replace the floating load dialog with an in-bubble indicator that takes
the place of the typing dots while the model loads — no jolting overlay.
Minimal prose: a reassurance word that rotates every 3s with a moving
shine (recreated per change so it fades in), plus a quiet live CPU/RAM
line as proof the machine is still working. The composer's Stop button
already covers cancel, so the modal's Cancel/heading/spinner are gone.

Extract the cold-start indicator (rotating shining word + live CPU/RAM)
into a reusable LoadingPulse component and use it in two places:

- In chat: still shown in place of the typing dots whenever a call is
  slow (cold load or a long-running turn) — unchanged behavior, now via
  the shared component.
- At startup: when warm-on-startup runs, hold a full-screen loading
  screen (spinner + LoadingPulse beneath it) over the chat until the
  model is resident, instead of dropping into a chat that feels sluggish
  while it competes with the cold load. The chat still mounts behind the
  screen, so it's ready the moment the screen lifts; a Continue button is
  the escape hatch.

LoadingPulse self-manages its word rotation and usage poll (mount/
unmount lifecycle), so Chat no longer hand-rolls those.

The indicator covers both a cold model load and a slow in-progress
turn, so model-specific phrases (Loading the model / Reading the
weights / Warming up) wrongly implied a reload mid-chat. Swap for
neutral 'work is underway' phrases that fit either case.
@mrjeeves mrjeeves merged commit d266d1b into main May 29, 2026
4 checks passed
@mrjeeves mrjeeves deleted the claude/gifted-tesla-HAIae branch May 29, 2026 06:54