perf(llm): reuse HTTP client and enable HTTP/2 #17
Merged
Owner
|
For now, I’d like to avoid merging Win into the main branch. macOS is still in its early stages of development, and since it uses completely different frameworks, aligning all functions between macOS and Win is challenging. Therefore, I believe it’s better to focus on polishing the Win version after the macOS version is essentially complete. |
Contributor
Author
|
I was switching between macOS and Windows to test and accidentally brought in the Windows changes. I have rebased and fixed it. |
Owner
Thx! |
Summary
This PR improves LLM correction latency by reusing a shared `reqwest::Client` across sessions and enabling HTTP/2 support in `koe-core`.
The main issue in the current implementation is that the OpenAI-compatible provider builds a new HTTP client for every voice session. That prevents connection pooling from being effective and adds unnecessary transport setup cost on a latency-sensitive path.
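To make the pooling argument concrete, here is a minimal toy model, not the Koe code and not `reqwest` itself. `ToyClient`, its fields, and both helper functions are hypothetical names invented for illustration; the model only counts simulated connection setups, assuming each client owns its own idle-connection pool.

```rust
use std::collections::HashMap;

/// Toy stand-in for an HTTP client with a per-host idle-connection pool.
/// This is NOT reqwest; it only counts simulated connection setups.
#[derive(Default)]
struct ToyClient {
    idle: HashMap<String, usize>, // host -> idle pooled connections
    handshakes: usize,            // number of new connections opened
}

impl ToyClient {
    /// Send one request: reuse an idle connection if one exists,
    /// otherwise open a new one (the expensive part).
    fn request(&mut self, host: &str) {
        let slot = self.idle.entry(host.to_string()).or_insert(0);
        if *slot == 0 {
            self.handshakes += 1; // simulated TCP/TLS setup cost
        } else {
            *slot -= 1; // take a warm connection out of the pool
        }
        *slot += 1; // return the connection to the pool afterwards
    }
}

/// Fresh client per session: every session starts with an empty pool.
fn handshakes_with_fresh_clients(sessions: usize, host: &str) -> usize {
    (0..sessions)
        .map(|_| {
            let mut client = ToyClient::default();
            client.request(host);
            client.handshakes
        })
        .sum()
}

/// One shared client: only the first session pays for connection setup.
fn handshakes_with_shared_client(sessions: usize, host: &str) -> usize {
    let mut client = ToyClient::default();
    for _ in 0..sessions {
        client.request(host);
    }
    client.handshakes
}
```

In this simplified model, 30 sessions against one host cost 30 connection setups with a fresh client each time and a single setup with a shared client. Real connections can also be closed by idle timeouts or by the server, so the real saving is smaller than the model suggests.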
This change keeps the existing config hot-read behavior for request-level settings, but moves the HTTP client lifecycle to the core level:
- `sp_core_create()`
This is a `koe-core` change, but it was implemented and validated on top of the current `windows-support` branch.
What Changed
- `http2` feature for `reqwest`
- `OpenAiCompatibleProvider::new()`
- `Core` state
- `sp_core_reload_config()`
- `llm.timeout_ms` requires restarting Koe to fully apply
- `pool_max_idle_per_host(2)`
- `tcp_keepalive(30s)`
- `http2_keep_alive_interval(30s)`
- `http2_keep_alive_timeout(30s)`
- `http2_keep_alive_while_idle(true)`
Why This Approach
`reqwest::Client` is designed to be reused and already contains an internal connection pool. Reusing the same client lets Koe keep transport state warm across voice sessions instead of paying the setup cost every time.
I intentionally did not add per-field config diffing logic for client rebuilds. The simpler policy here is:
- `base_url`, `api_key`, `model`, `temperature`, `top_p`, and token settings still apply on the next session
That keeps the implementation small and predictable.
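The policy can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `koe-core` types: `LlmConfig`, `SharedClient`, `SessionPlan`, and the method names are hypothetical, and the sketch assumes that a config reload replaces the stored config without rebuilding the client.

```rust
use std::time::Duration;

/// Hypothetical config shape; field names mirror settings named in this PR.
#[derive(Clone)]
struct LlmConfig {
    base_url: String,
    model: String,
    temperature: f32,
    timeout_ms: u64,
}

/// Stand-in for the long-lived HTTP client; its timeout is fixed at build time.
struct SharedClient {
    timeout: Duration,
}

/// Stand-in for core-level state that owns the client for the process lifetime.
struct Core {
    client: SharedClient,
    config: LlmConfig,
}

/// The settings one voice session would end up using.
struct SessionPlan {
    base_url: String,
    model: String,
    temperature: f32,
    timeout: Duration,
}

impl Core {
    /// Build the client once, capturing the timeout from the initial config.
    fn create(config: LlmConfig) -> Self {
        let client = SharedClient {
            timeout: Duration::from_millis(config.timeout_ms),
        };
        Core { client, config }
    }

    /// Assumption: reload swaps the config and leaves the client untouched.
    fn reload_config(&mut self, config: LlmConfig) {
        self.config = config;
    }

    /// Request-level settings are read fresh from the current config;
    /// the timeout still comes from the client built at startup.
    fn plan_session(&self) -> SessionPlan {
        SessionPlan {
            base_url: self.config.base_url.clone(),
            model: self.config.model.clone(),
            temperature: self.config.temperature,
            timeout: self.client.timeout,
        }
    }
}
```

With this shape, a reload that changes `model` and `timeout_ms` affects the next session's model immediately, while the timeout keeps its startup value until the process restarts. No per-field diffing is needed because the client never has to decide whether to rebuild itself.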
Benchmark Notes
I ran a real-endpoint benchmark against the current Azure Foundry OpenAI-compatible endpoint used by this setup.
Most relevant comparison:
- HTTP/1.1 + fresh client per request
- HTTP/2 + reused client
30-request result:
- current-like path: 1.390s, 1.319s, 1.780s, 1.789s
- optimized path: 1.113s, 1.079s, 1.470s, 1.584s
Observed improvement:
- 0.277s
- 19.9% faster overall in this benchmark
I also verified that the endpoint negotiates `HTTP/2` when the client has HTTP/2 enabled. This is not a forced-HTTP/2 change: if an upstream only supports HTTP/1.1, the client can still fall back automatically.
Validation
Verified locally with:
Also validated end-to-end on both Windows and macOS after rebasing onto the latest upstream `main`.
User-Facing Behavior
No workflow changes.
The only user-facing documentation change is that `llm.timeout_ms` now explicitly notes that it requires restarting Koe to fully apply, because the shared HTTP client is long-lived and the timeout is applied when that client is built.
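For reference, a long-lived client with the settings listed under What Changed might be built roughly like this. It is a sketch, not the code from this PR: `build_shared_client` is a hypothetical helper name, and it assumes the `reqwest` crate is a dependency with its `http2` feature enabled, since the `http2_keep_alive_*` builder methods are only available with that feature.

```rust
use std::time::Duration;

// Hypothetical helper; assumes the `reqwest` crate with the `http2` feature.
// The timeout is baked into the client here, which is why a later change to
// `llm.timeout_ms` cannot take effect until the client is built again.
fn build_shared_client(timeout_ms: u64) -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .timeout(Duration::from_millis(timeout_ms))
        // Keep at most two idle connections per host in the pool.
        .pool_max_idle_per_host(2)
        // TCP-level keepalive probes every 30 seconds.
        .tcp_keepalive(Duration::from_secs(30))
        // HTTP/2 PING frames every 30 seconds, with a 30 second ack timeout,
        // sent even when no requests are in flight.
        .http2_keep_alive_interval(Duration::from_secs(30))
        .http2_keep_alive_timeout(Duration::from_secs(30))
        .http2_keep_alive_while_idle(true)
        .build()
}
```

Because `reqwest::Client` is cheap to clone and shares its pool across clones, a client built this way can be stored once and handed to each session.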