perf(llm): warm up connection before correction#25
Conversation
There was a problem hiding this comment.
Pull request overview
This PR reduces first-request LLM latency by initiating a best-effort “warmup” HTTP request at the start of a recording session so TLS/HTTP connection setup can overlap with user speech, reusing the existing shared reqwest::Client.
Changes:
- Added
OpenAiCompatibleProvider::warmup()and centralized the HTTP pool idle timeout constant. - Introduced global warmup state and kick off warmup from
sp_core_session_begin, with TTL/in-flight gating. - Added
urlencodingdependency to safely URL-encode model IDs for the warmup endpoint.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
koe-core/src/llm/openai_compatible.rs |
Adds a warmup request method and reuses a shared pool idle timeout constant for the HTTP client. |
koe-core/src/lib.rs |
Adds warmup state tracking and triggers warmup at session start; updates last-touched on successful corrections. |
koe-core/Cargo.toml |
Adds urlencoding dependency used by the warmup URL construction. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let _ = response.bytes().await; | ||
| if !status.is_success() { | ||
| log::debug!("LLM warmup completed with HTTP {status}"); | ||
| } | ||
|
|
||
| Ok(()) |
There was a problem hiding this comment.
warmup() ignores the result of response.bytes().await. If reading/draining the body fails, the function still returns Ok(()), which can cause the caller to mark the connection as “touched” even though the request didn’t fully complete (and may not be reusable). Handle and propagate/log the bytes read error (and consider treating it as warmup failure).
| let _ = response.bytes().await; | |
| if !status.is_success() { | |
| log::debug!("LLM warmup completed with HTTP {status}"); | |
| } | |
| Ok(()) | |
| match response.bytes().await { | |
| Ok(_) => { | |
| if !status.is_success() { | |
| log::debug!("LLM warmup completed with HTTP {status}"); | |
| } | |
| Ok(()) | |
| } | |
| Err(e) => { | |
| log::debug!( | |
| "LLM warmup request to {url} failed while reading response body: {e}" | |
| ); | |
| Err(KoeError::LlmFailed(e.to_string())) | |
| } | |
| } |
| fn llm_is_ready(cfg: &config::LlmSection) -> bool { | ||
| cfg.enabled && !cfg.base_url.is_empty() && !cfg.api_key.is_empty() && !cfg.model.is_empty() | ||
| } |
There was a problem hiding this comment.
llm_is_ready() introduces a stricter “configured” check (it includes model), but the correction path still uses a different condition (llm_enabled earlier in run_session only checks enabled/base_url/api_key). This divergence can make warmup behavior inconsistent with whether correction actually runs, and makes future config validation harder. Consider reusing a single readiness predicate for both warmup and correction (or rename/split the function to make the difference explicit).
This PR reduces cold LLM latency by warming the existing HTTP connection as soon as a recording session starts, so network setup can overlap with user speech instead of waiting until ASR has already finished.
In the current Azure OpenAI-compatible environment, direct cold-vs-warm measurements on the LLM host show an expected cold-path improvement of about 300ms. Across 7 paired runs, cold requests reached first response bytes in about 368-415ms, while warm reused connections did so in about 92-102ms, for an observed savings range of roughly 274-323ms.
This benefit mainly applies to:
It is not expected to materially improve back-to-back dictation sessions when the LLM connection is already warm.
Implementation
sp_core_session_beginreqwest::ClientGET /models/{model}as a lightweight same-origin warmup requestThe reuse window is tied to the real HTTP client settings:
90s20s70sFor the current Rust
reqweststack, a lightweight same-origin request is the most practical warmup mechanism; unlike browsers, there is no built-in higher-level preconnect primitive that safely warms the pooled connection we actually want to reuse.Alternative explored but not adopted
We also tested HTTP/3 as a separate optimization idea.
Result:
reqwestHTTP/3 is still unstable and does not provide the fallback behavior we wantBecause of that, HTTP/3 is not part of this PR. This change only warms the existing stable HTTP/2/TLS path.
Validation
Passed:
cargo build -q -p koe-corecargo test -q -p koe-core