optimize TTFT for local models #847

Merged
tpae merged 5 commits into main from enhancement/ttft-optimization-for-mlx-models on Apr 13, 2026

@RaajeevChandran
Contributor

Summary

Fixes the root causes of slow Time-To-First-Token (TTFT) on local MLX models, particularly on memory-constrained devices.

Changes

  • Fixed the sandbox plugin creator guard by requiring canCreatePlugins() to check both config.enabled && config.pluginCreate instead of just pluginCreate. This prevents the 4.2k-character Sandbox Plugin Creator skill from being injected when the sandbox toggle is off.
  • Removed the preflight no-match fallback. When preflight search finds no relevant tools, it now returns .empty instead of injecting the full Sandbox Plugin Creator skill into every unrelated prompt.
  • Skipped tool injection into the system prompt for local models. When preflight finds no relevant tools and the model is a local MLX model, all 9 tool specs (~4300 tokens) are now dropped from the system prompt.
  • The preflight LLM call is now skipped for local models, since their tools are stripped anyway.
  • Fixed the TTFT measurement by moving streamStartTime to after engine.streamChat(). The previous placement included model loading time in the measurement. The displayed TTFT now reflects inference time only, not model loading.
  • Added a "Loading Model..." indicator in the assistant message cell during loadContainer, which signals that the model is still initializing and not yet ready for inference. It transitions to the three-dot indicator once the model is ready and prefill begins. Prefill is when the model actually starts processing the prompt, and it is also the timestamp from which TTFT is measured.
  • Added debug-only TTFT trace instrumentation that writes structured phase timings to /tmp/osaurus_ttft_trace.log and a full prompt dump to /tmp/osaurus_debug.log, all gated behind #if DEBUG.
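
For reviewers, the guard and tool-stripping changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SandboxConfig`, `PreflightResult`, `shouldRunPreflight`, and `toolSpecsForSystemPrompt` are hypothetical names, and the behavior for non-local models on an empty preflight result is an assumption.

```swift
// Hypothetical config type; the real osaurus type may have more fields.
struct SandboxConfig {
    var enabled: Bool
    var pluginCreate: Bool

    // Both flags must be on. A leftover pluginCreate = true no longer
    // counts when the sandbox toggle itself is off.
    func canCreatePlugins() -> Bool {
        return enabled && pluginCreate
    }
}

// Hypothetical preflight outcome: either nothing matched, or a set of tool names.
enum PreflightResult: Equatable {
    case empty
    case tools([String])
}

// Local models skip the preflight LLM call because their tools are stripped anyway.
func shouldRunPreflight(isLocalModel: Bool) -> Bool {
    return !isLocalModel
}

// Decide which tool specs are injected into the system prompt.
func toolSpecsForSystemPrompt(
    isLocalModel: Bool,
    preflight: PreflightResult,
    allToolSpecs: [String]
) -> [String] {
    switch preflight {
    case .tools(let matched):
        return matched
    case .empty:
        // Local models get no tool specs at all.
        // Assumption: other models keep the default set.
        return isLocalModel ? [] : allToolSpecs
    }
}
```

With the sandbox toggle off, `SandboxConfig(enabled: false, pluginCreate: true).canCreatePlugins()` is `false`, so the skill is never considered for injection.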

NOTE: Local models can still make tool calls mid-conversation via capabilities_search → capabilities_load.
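
The TTFT measurement fix and the indicator transition described in the changes can be pictured with a small timer that only starts its clock once the stream exists. This is a sketch under assumptions: `GenerationPhase` and `TTFTTimer` are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```swift
import Foundation

// Hypothetical phases matching the indicators described in this PR.
enum GenerationPhase: Equatable {
    case loadingModel   // "Loading Model..." indicator is shown
    case prefill        // three-dot indicator; the TTFT clock is running
    case streaming      // first token has arrived
}

struct TTFTTimer {
    private(set) var phase: GenerationPhase = .loadingModel
    private(set) var ttft: TimeInterval?
    private var streamStart: TimeInterval?

    // Call after the model has loaded and the stream has been created,
    // so model loading time is excluded from the measurement.
    mutating func markStreamStarted(at now: TimeInterval) {
        streamStart = now
        phase = .prefill
    }

    // Call when a token arrives. Only the first call records TTFT.
    mutating func markFirstToken(at now: TimeInterval) {
        guard let start = streamStart, ttft == nil else { return }
        ttft = now - start
        phase = .streaming
    }
}

// Example: 15 s of model loading, then 0.5 s until the first token.
var timer = TTFTTimer()
timer.markStreamStarted(at: 15.0)
timer.markFirstToken(at: 15.5)
```

Starting the clock before the model load would report 15.5 s in this example. Starting it at stream creation reports 0.5 s, which is the inference-only figure.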

Before

16s TTFT for a simple prompt

image

After

<1s TTFT for the same prompt

image image
  • Behavior change
  • UI change (screenshots below)
  • Refactor / chore
  • Tests
  • Docs

Checklist

  • I have read CONTRIBUTING.md
  • I added/updated tests where reasonable
  • I updated docs/README as needed
  • I verified build on macOS with Xcode 16.4+

tpae previously approved these changes Apr 13, 2026
@tpae tpae merged commit 2b33c1f into main Apr 13, 2026
5 checks passed
@tpae tpae deleted the enhancement/ttft-optimization-for-mlx-models branch April 13, 2026 12:05
2 participants