Skip to content

perf(llm): warm up connection before correction#25

Merged
missuo merged 1 commit into
missuo:mainfrom
hyspace:codex/perf-llm-warmup
Mar 29, 2026
Merged

perf(llm): warm up connection before correction#25
missuo merged 1 commit into
missuo:mainfrom
hyspace:codex/perf-llm-warmup

Conversation

@hyspace

@hyspace hyspace commented Mar 29, 2026

Copy link
Copy Markdown
Contributor

This PR reduces cold LLM latency by warming the existing HTTP connection as soon as a recording session starts, so network setup can overlap with user speech instead of waiting until ASR has already finished.

In the current Azure OpenAI-compatible environment, direct cold-vs-warm measurements on the LLM host show an expected cold-path improvement of about 300ms. Across 7 paired runs, cold requests reached first response bytes in about 368-415ms, while warm reused connections did so in about 92-102ms, for an observed savings range of roughly 274-323ms.

This benefit mainly applies to:

  • the first correction after app launch
  • the first correction after the pooled connection has expired due to idleness

It is not expected to materially improve back-to-back dictation sessions when the LLM connection is already warm.

Implementation

  • Start a best-effort LLM warmup from sp_core_session_begin
  • Reuse the existing shared reqwest::Client
  • Use GET /models/{model} as a lightweight same-origin warmup request
  • URL-encode model IDs
  • Skip warmup when:
    • LLM is disabled or not configured
    • another warmup is already in flight
    • the connection was used recently
  • Never block the real correction request on warmup

The reuse window is tied to the real HTTP client settings:

  • pool idle timeout: 90s
  • safety margin: 20s
  • warmup reuse window: 70s

For the current Rust reqwest stack, a lightweight same-origin request is the most practical warmup mechanism; unlike browsers, there is no built-in higher-level preconnect primitive that safely warms the pooled connection we actually want to reuse.

Alternative explored but not adopted

We also tested HTTP/3 as a separate optimization idea.

Result:

  • OpenAI's public endpoint could use HTTP/3 on a compatible network path
  • the current Azure OpenAI-compatible endpoint did not demonstrate a usable HTTP/3 path in our testing
  • reqwest HTTP/3 is still unstable and does not provide the fallback behavior we want

Because of that, HTTP/3 is not part of this PR. This change only warms the existing stable HTTP/2/TLS path.

Validation

Passed:

  • cargo build -q -p koe-core
  • cargo test -q -p koe-core

@missuo missuo requested a review from Copilot March 29, 2026 08:21
@missuo missuo merged commit 183ae9c into missuo:main Mar 29, 2026
3 checks passed

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces first-request LLM latency by initiating a best-effort “warmup” HTTP request at the start of a recording session so TLS/HTTP connection setup can overlap with user speech, reusing the existing shared reqwest::Client.

Changes:

  • Added OpenAiCompatibleProvider::warmup() and centralized the HTTP pool idle timeout constant.
  • Introduced global warmup state and kick off warmup from sp_core_session_begin, with TTL/in-flight gating.
  • Added urlencoding dependency to safely URL-encode model IDs for the warmup endpoint.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
koe-core/src/llm/openai_compatible.rs Adds a warmup request method and reuses a shared pool idle timeout constant for the HTTP client.
koe-core/src/lib.rs Adds warmup state tracking and triggers warmup at session start; updates last-touched on successful corrections.
koe-core/Cargo.toml Adds urlencoding dependency used by the warmup URL construction.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +71 to +76
let _ = response.bytes().await;
if !status.is_success() {
log::debug!("LLM warmup completed with HTTP {status}");
}

Ok(())

Copilot AI Mar 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warmup() ignores the result of response.bytes().await. If reading/draining the body fails, the function still returns Ok(()), which can cause the caller to mark the connection as “touched” even though the request didn’t fully complete (and may not be reusable). Handle and propagate/log the bytes read error (and consider treating it as warmup failure).

Suggested change
let _ = response.bytes().await;
if !status.is_success() {
log::debug!("LLM warmup completed with HTTP {status}");
}
Ok(())
match response.bytes().await {
Ok(_) => {
if !status.is_success() {
log::debug!("LLM warmup completed with HTTP {status}");
}
Ok(())
}
Err(e) => {
log::debug!(
"LLM warmup request to {url} failed while reading response body: {e}"
);
Err(KoeError::LlmFailed(e.to_string()))
}
}

Copilot uses AI. Check for mistakes.
Comment thread koe-core/src/lib.rs
Comment on lines +810 to +812
fn llm_is_ready(cfg: &config::LlmSection) -> bool {
cfg.enabled && !cfg.base_url.is_empty() && !cfg.api_key.is_empty() && !cfg.model.is_empty()
}

Copilot AI Mar 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

llm_is_ready() introduces a stricter “configured” check (it includes model), but the correction path still uses a different condition (llm_enabled earlier in run_session only checks enabled/base_url/api_key). This divergence can make warmup behavior inconsistent with whether correction actually runs, and makes future config validation harder. Consider reusing a single readiness predicate for both warmup and correction (or rename/split the function to make the difference explicit).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants