Conversation

@noorbhatia
Contributor

  • Implements prewarm() for MLXLanguageModel to improve first-response time.
  • Prewarms the model with instructions, tools, and promptPrefix.

@noorbhatia force-pushed the noor/mlx-prewarm-model branch from ede4b54 to 6d7cbd5 on January 29, 2026 at 11:25

Copilot AI left a comment

Pull request overview

Implements prewarm(for:promptPrefix:) for MLXLanguageModel to reduce first-response latency by loading the model context and priming the MLX processor with session instructions, tools, and an optional prompt prefix.

Changes:

  • Add MLXLanguageModel.prewarm(for:promptPrefix:) implementation.
  • Prewarm loads/caches ModelContext and calls context.processor.prepare(input:) with a minimal chat + tool specs.
  • Include session instructions and optional prompt prefix in the prewarm input.
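For orientation, here is a rough sketch of the prewarm path those bullets describe, reconstructed from the review excerpts below rather than from the exact diff. The Chat.Message, UserInput, and processor.prepare(input:) shapes follow MLXLMCommon; the loadContext() and instructionsText(for:) helpers and the exact method signature are assumptions for illustration.

```swift
import MLXLMCommon

// Sketch only: approximates the prewarm flow described above, not the merged code.
public func prewarm(for session: LanguageModelSession, promptPrefix: Prompt? = nil) async {
    // 1. Load (and cache) the MLX ModelContext so the weights are already in memory.
    guard let context = try? await loadContext() else { return }  // assumed helper

    // 2. Build a minimal chat: session instructions plus the prompt prefix (or a placeholder turn).
    var chat: [Chat.Message] = []
    if let instructions = instructionsText(for: session) {        // assumed accessor
        chat.append(.init(role: .system, content: instructions))
    }
    chat.append(.init(role: .user, content: promptPrefix?.description ?? "."))

    // 3. Run the processor once so the first respond() call skips this preparation work.
    let input = UserInput(chat: chat)
    _ = try? await context.processor.prepare(input: input)
}
```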

Comment on lines 385 to 387
// Add prompt prefix or minimal user message
let promptText = promptPrefix?.description ?? "."
chat.append(.init(role: .user, content: promptText))
Owner

Unless "." has special significance in MLX, this makes me think that promptPrefix should be non-optional (and maybe non-empty?)

What do you think?
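If the parameter did become non-optional, the signature change being hinted at might look roughly like this (illustrative only; Prompt and the enclosing method are taken from the surrounding discussion):

```swift
// Hypothetical alternative: require an explicit, non-empty prefix instead of defaulting to ".".
public func prewarm(for session: LanguageModelSession, promptPrefix: Prompt) async {
    precondition(!promptPrefix.description.isEmpty, "promptPrefix must be non-empty")
    // ... build the chat using promptPrefix.description as the user turn ...
}
```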


let userInput = MLXLMCommon.UserInput(
chat: chat,
processing: .init(resize: .init(width: 512, height: 512)),
Owner

This seems like the kind of thing that we'd want to parameterize in the method, rather than hard-code.
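One way to lift that value into the API, as the comment suggests (parameter name and default are illustrative; UserInput.Processing(resize:) is the initializer visible in the excerpt above):

```swift
// Hypothetical: let callers control image preprocessing instead of hard-coding 512x512.
public func prewarm(
    for session: LanguageModelSession,
    promptPrefix: Prompt? = nil,
    processing: UserInput.Processing = .init(resize: .init(width: 512, height: 512))
) async {
    // ... pass `processing` through to UserInput(chat:processing:) ...
}
```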

@mattt
Owner

mattt commented Jan 29, 2026

Thanks for opening this PR, @noorbhatia!

I think this kind of functionality gets into the realm of KV cache management, which so far this implementation hasn't attempted to support. At a high level, I'd expect an API that has some concept of prewarming a common prefix of tokens, caching that, and then reusing it for various suffixes. Most likely, cache selection and management would be automatic; I'm not sure yet what controls we'd want to expose.

Can you say more about how you understand the problem?
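To make the idea above concrete, one possible shape for such a prefix-prewarming API is sketched below. Every name is purely illustrative; nothing like this exists in the project yet.

```swift
// Purely hypothetical: prewarm a shared prompt prefix once and reuse its KV cache.
protocol PrefixPrewarmable {
    associatedtype PrefixHandle

    /// Tokenize and run the common prefix once, keeping its KV cache around.
    func prewarm(prefix: Prompt) async throws -> PrefixHandle

    /// Generate from the cached prefix plus a per-request suffix,
    /// skipping recomputation of the shared tokens.
    func respond(from handle: PrefixHandle, suffix: Prompt) async throws -> String
}
```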

@noorbhatia
Contributor Author

> Thanks for opening this PR, @noorbhatia!
>
> I think this kind of functionality gets into the realm of KV cache management, which so far this implementation hasn't attempted to support. At a high level, I'd expect an API that has some concept of prewarming a common prefix of tokens, caching that, and then reusing it for various suffixes. Most likely, cache selection and management would be automatic; I'm not sure yet what controls we'd want to expose.
>
> Can you say more about how you understand the problem?

My problem: The very first respond() call has significant latency because the model must be loaded from disk, transferred to GPU memory, and the processor initialized. This cold start happens every time the app launches or when a model is first used.

My understanding of prewarm is simply to have the model ready before the user sends their first query, so respond() can start generation immediately.

I'd like to understand your vision, and I'd be happy to implement a KV-cache-based solution if you could point me in the right direction.

@mattt
Owner

mattt commented Jan 30, 2026

@noorbhatia Thanks for elaborating — that's really helpful.

Let's break the cold start into two parts:

  • First, load the model
  • Second, create + cache the context with a given prompt prefix.

I suspect that the first step—loading the model—is the bulk of the time spent waiting. So let's try solving that first.

Then, once we have that, we can implement all of the KV cache infrastructure needed to make the promptPrefix parameter do something.

Does that track with your mental model of the problem?

@noorbhatia
Contributor Author

Understood. And solving the first step, loading the model, would simply mean calling loadContext in prewarm?

Perhaps we can expose loadContext, but that would require a change in LanguageModel's API contract. What do you suggest?

@mattt
Owner

mattt commented Jan 30, 2026

> Understood. And solving the first step, loading the model, would simply mean calling loadContext in prewarm?

Yes, exactly.

> Perhaps we can expose loadContext, but that would require a change in LanguageModel's API contract. What do you suggest?

I'd be interested to see how far we can get within the constraints of the existing Foundation Models API abstraction before we expand the surface area. Let's revisit this when we dig into KV caching.
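The reduced scope agreed on here boils down to just the model-loading step. A minimal sketch, assuming the existing loadContext helper on MLXLanguageModel (its exact signature isn't shown in this thread):

```swift
// Sketch of the scoped-down prewarm: load the model now so respond() doesn't pay the cost later.
public func prewarm(for session: LanguageModelSession, promptPrefix: Prompt? = nil) async {
    // promptPrefix is accepted but unused until KV-cache support lands.
    _ = try? await loadContext()  // loads weights and builds the MLX ModelContext (assumed helper)
}
```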

@noorbhatia
Contributor Author

Great, I'll update the PR. Thanks, @mattt!


Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


@noorbhatia force-pushed the noor/mlx-prewarm-model branch from 44d0a9f to cef7d55 on February 3, 2026 at 12:59
@mattt merged commit 67d6a67 into mattt:main on Feb 3, 2026
3 checks passed