optimize TTFT for local models by RaajeevChandran · Pull Request #847 · osaurus-ai/osaurus

RaajeevChandran · 2026-04-13T10:47:03Z

Summary

Fixes the root causes of slow Time-To-First-Token (TTFT) on local MLX models, particularly on memory-constrained devices.

Changes

Fixed the sandbox plugin creator guard by requiring canCreatePlugins() to check both config.enabled && config.pluginCreate instead of just pluginCreate. This prevents the 4.2k-character Sandbox Plugin Creator skill from being injected when the sandbox toggle is off.
Removed the preflight no-match fallback: when preflight search finds no relevant tools, it now returns .empty instead of injecting the full Sandbox Plugin Creator skill into every unrelated prompt.
Skipped tool injection in the system prompt entirely for local models. When preflight finds no relevant tools and the model is a local MLX model, all 9 tool specs (~4300 tokens) are now dropped from the system prompt.
The preflight LLM call is now skipped for local models, since the tools are stripped anyway.
Fixed the TTFT math by moving streamStartTime to after engine.streamChat(). The measurement previously included model loading time; the displayed TTFT now reflects inference time only, not model loading.
Added a "Loading Model..." indicator in the assistant message cell during loadContainer, which signals that the model is still initializing and not yet ready for inference. It transitions to the three-dot indicator once the model is ready and prefill begins (this is when the model actually starts processing the prompt, and it is also the timestamp from which TTFT is measured).
Added debug-only TTFT trace instrumentation that writes structured phase timings to /tmp/osaurus_ttft_trace.log, with a full prompt dump to /tmp/osaurus_debug.log, gated behind #if DEBUG.
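
As a rough illustration of the guard and preflight changes listed above, here is a minimal Swift sketch. Only canCreatePlugins(), config.enabled, config.pluginCreate, and .empty come from this PR description; SandboxConfig, PreflightResult, the preflight(query:catalog:isLocalModel:) function, and the keyword-matching logic are hypothetical stand-ins and are not taken from the osaurus codebase.

```swift
// Hypothetical config type; only the two flag names appear in the PR description.
struct SandboxConfig {
    var enabled: Bool
    var pluginCreate: Bool
}

// Plugin creation requires BOTH the sandbox toggle and the plugin-create flag.
func canCreatePlugins(_ config: SandboxConfig) -> Bool {
    return config.enabled && config.pluginCreate
}

// Hypothetical result type: either nothing relevant, or a list of matched tool names.
enum PreflightResult: Equatable {
    case empty
    case matched([String])
}

// Assumed shape of a preflight search. `catalog` maps a tool name to its keywords.
// Local models skip the search entirely; a search with no hits returns .empty
// rather than falling back to injecting a large default skill.
func preflight(query: String, catalog: [String: [String]], isLocalModel: Bool) -> PreflightResult {
    if isLocalModel { return .empty }
    let words = Set(query.lowercased().split(separator: " ").map(String.init))
    let hits = catalog.filter { !words.isDisjoint(with: $0.value) }.keys.sorted()
    return hits.isEmpty ? .empty : .matched(hits)
}
```

With this shape, a prompt that matches no tool keywords adds nothing to the system prompt, and a local model never pays for the preflight call at all.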

NOTE: Local models can still use tool calls mid-conversation via capabilities_search → capabilities_load.
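
For the debug-only trace mentioned in the Changes list, a sketch along these lines may help picture it. The PhaseTiming type, the traceLine format, and the appendTrace helper are all hypothetical; the PR description only states that structured phase timings are written to a log file behind #if DEBUG, not what the lines look like.

```swift
import Foundation

// Hypothetical record for one timed phase (e.g. model load, prefill).
struct PhaseTiming {
    let phase: String
    let milliseconds: Int
}

// Assumed line format; the actual trace format is not specified in the PR.
func traceLine(_ timing: PhaseTiming) -> String {
    return "phase=\(timing.phase) ms=\(timing.milliseconds)"
}

// Appends one line per phase to `path`. Compiles to a no-op outside DEBUG builds.
func appendTrace(_ timings: [PhaseTiming], to path: String) {
    #if DEBUG
    let text = timings.map(traceLine).joined(separator: "\n") + "\n"
    guard let data = text.data(using: .utf8) else { return }
    if let handle = FileHandle(forWritingAtPath: path) {
        handle.seekToEndOfFile()
        handle.write(data)
        handle.closeFile()
    } else {
        // File does not exist yet: create it with the first batch of lines.
        _ = FileManager.default.createFile(atPath: path, contents: data)
    }
    #endif
}
```

Keeping the formatter separate from the gated writer means the line format can be checked in any build configuration, while release builds never touch the filesystem.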

Before

16s TTFT for a simple prompt

After

<1s TTFT for the same prompt
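
Part of the difference between these figures depends on where the timer starts. The sketch below shows the idea of starting the clock only once streaming begins, so that model loading is excluded. TTFTTimer and its method names are hypothetical, and timestamps are plain millisecond integers to keep the example self-contained; only the notion of a stream start time measured after loading comes from this PR.

```swift
// Hypothetical helper: records when streaming starts and when the first token arrives.
struct TTFTTimer {
    private(set) var streamStartMs: Int?
    private(set) var ttftMs: Int?

    // Call this after the model is loaded and the stream has been opened,
    // not before loading, so load time is excluded from TTFT.
    mutating func markStreamStart(atMs now: Int) {
        streamStartMs = now
        ttftMs = nil
    }

    // Only the first token after a stream start sets TTFT; later tokens are ignored.
    mutating func markFirstToken(atMs now: Int) {
        guard let start = streamStartMs, ttftMs == nil else { return }
        ttftMs = now - start
    }
}
```

If the stream start were marked before a multi-second model load, that load time would be counted as TTFT; marking it afterwards reports only the time spent on inference.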

Checklist

I have read CONTRIBUTING.md
I added/updated tests where reasonable
I updated docs/README as needed
I verified build on macOS with Xcode 16.4+

RaajeevChandran added 4 commits April 13, 2026 13:20

ttft trace & optimization for local models

41fd779

disable preflight search for local models

3a4185b

show loading model indicator & adjusted TTFT math to account only for…

15b104c

… inference

revered signing in xcodeproj

a0d3a32

tpae previously approved these changes Apr 13, 2026

fixed a minor typo

108fcbe

RaajeevChandran dismissed tpae’s stale review via 108fcbe April 13, 2026 11:34

tpae approved these changes Apr 13, 2026

tpae merged commit 2b33c1f into main Apr 13, 2026
5 checks passed

tpae deleted the enhancement/ttft-optimization-for-mlx-models branch April 13, 2026 12:05

github-actions bot added pending release released and removed pending release labels Apr 13, 2026

tpae merged 5 commits into main from enhancement/ttft-optimization-for-mlx-models
