perf(llm): reuse HTTP client and enable HTTP/2 #17

Merged
missuo merged 1 commit into missuo:main from hyspace:codex/llm-latency-optimization-http2 on Mar 27, 2026

@hyspace hyspace commented Mar 27, 2026

Contributor

Summary

This PR improves LLM correction latency by reusing a shared reqwest::Client across sessions and enabling HTTP/2 support in koe-core.

The main issue in the current implementation is that the OpenAI-compatible provider builds a new HTTP client for every voice session. That prevents connection pooling from being effective and adds unnecessary transport setup cost on a latency-sensitive path.

This change keeps the existing config hot-read behavior for request-level settings, but moves the HTTP client lifecycle to the core level:

  • create one shared LLM HTTP client in sp_core_create()
  • reuse it across sessions
  • rebuild it on explicit config reload
  • prefer HTTP/2 when the upstream supports it, while still allowing automatic fallback to HTTP/1.1
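
The lifecycle in the list above can be sketched with the standard library alone. This is a minimal model, not the code from this PR: `HttpClient`, `Pool`, `Core`, `client_for_session`, and `reload` are hypothetical names, and `HttpClient` is only a stand-in for `reqwest::Client`, which likewise keeps its connection pool behind an internal `Arc` so that cloning is cheap.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the pool that lives inside `reqwest::Client`.
pub struct Pool {
    generation: u64,
}

/// Hypothetical stand-in for `reqwest::Client`: cloning copies a handle,
/// not the pool itself.
#[derive(Clone)]
pub struct HttpClient {
    pool: Arc<Pool>,
}

impl HttpClient {
    pub fn build(generation: u64) -> Self {
        HttpClient { pool: Arc::new(Pool { generation }) }
    }

    pub fn generation(&self) -> u64 {
        self.pool.generation
    }

    /// True when both handles point at the same underlying pool.
    pub fn shares_pool_with(&self, other: &HttpClient) -> bool {
        Arc::ptr_eq(&self.pool, &other.pool)
    }
}

/// Hypothetical core state that owns the one long-lived LLM client.
pub struct Core {
    llm_client: RwLock<HttpClient>,
}

impl Core {
    /// Models the work done once at core creation (`sp_core_create()`).
    pub fn create() -> Self {
        Core { llm_client: RwLock::new(HttpClient::build(0)) }
    }

    /// Each session receives a cheap clone that shares the same pool.
    pub fn client_for_session(&self) -> HttpClient {
        self.llm_client.read().unwrap().clone()
    }

    /// Models an explicit reload (`sp_core_reload_config()`): the client is
    /// replaced, and only sessions started afterwards see the new one.
    pub fn reload(&self) {
        let mut guard = self.llm_client.write().unwrap();
        let next = guard.generation() + 1;
        *guard = HttpClient::build(next);
    }
}
```

In this model, two sessions started before a reload share one pool, a session started after the reload gets a fresh pool, and sessions already in flight keep the handle they were given.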

This is a koe-core change, but it was implemented and validated on top of the current windows-support branch.

What Changed

  • enabled the http2 feature for reqwest
  • moved HTTP client construction out of OpenAiCompatibleProvider::new()
  • added a shared LLM HTTP client to the global Core state
  • cloned the shared client into each session instead of rebuilding it
  • rebuilt the client during sp_core_reload_config()
  • documented that changing llm.timeout_ms requires restarting Koe to fully apply
  • tuned the transport settings for the voice-input use case:
    • pool_max_idle_per_host(2)
    • tcp_keepalive(30s)
    • http2_keep_alive_interval(30s)
    • http2_keep_alive_timeout(30s)
    • http2_keep_alive_while_idle(true)
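
As an illustration, the tuned values above can be gathered in one place. The struct and its methods are hypothetical and std-only; each field is named after the `reqwest::ClientBuilder` method listed above that would receive the value. The detection bound is a rough estimate under the assumption that a ping is sent every interval and the connection is closed if no acknowledgement arrives within the timeout.

```rust
use std::time::Duration;

/// Hypothetical holder for the transport values listed in this PR.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LlmTransportTuning {
    pub pool_max_idle_per_host: usize,
    pub tcp_keepalive: Duration,
    pub http2_keep_alive_interval: Duration,
    pub http2_keep_alive_timeout: Duration,
    pub http2_keep_alive_while_idle: bool,
}

impl LlmTransportTuning {
    /// The values from the list above.
    pub fn voice_input() -> Self {
        LlmTransportTuning {
            pool_max_idle_per_host: 2,
            tcp_keepalive: Duration::from_secs(30),
            http2_keep_alive_interval: Duration::from_secs(30),
            http2_keep_alive_timeout: Duration::from_secs(30),
            http2_keep_alive_while_idle: true,
        }
    }

    /// Rough upper bound on how long a dead peer can go unnoticed on an idle
    /// HTTP/2 connection: one ping interval plus the acknowledgement timeout.
    /// This is an estimate for illustration, not a guarantee.
    pub fn dead_peer_detection_bound(&self) -> Duration {
        self.http2_keep_alive_interval + self.http2_keep_alive_timeout
    }
}
```

With these values the bound works out to about 60 seconds, which is short compared with the gaps between voice sessions that an idle connection is meant to survive.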

Why This Approach

reqwest::Client is designed to be reused and already contains an internal connection pool. Reusing the same client lets Koe keep transport state warm across voice sessions instead of paying the setup cost every time.

I intentionally did not add per-field config diffing logic for client rebuilds. The simpler policy here is:

  • request-level settings like base_url, api_key, model, temperature, top_p, and token settings still apply on the next session
  • transport-level client settings are refreshed on explicit reload / restart

That keeps the implementation small and predictable.
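
A small sketch of that policy, under stated assumptions: all types and functions here (`Config`, `RequestSettings`, `SharedClient`, `start_session`) are hypothetical, only a few of the request-level fields are modeled, and `SharedClient` stands in for the real shared HTTP client. Request-level values are read from the current config when a session starts, while the timeout stays at whatever value the client was built with.

```rust
use std::time::Duration;

/// Request-level settings, re-read at the start of every session.
#[derive(Clone, Debug, PartialEq)]
pub struct RequestSettings {
    pub base_url: String,
    pub model: String,
    pub temperature: f32,
}

/// Hypothetical config snapshot as it currently exists on disk.
#[derive(Clone, Debug, PartialEq)]
pub struct Config {
    pub request: RequestSettings,
    pub timeout_ms: u64,
}

/// Stand-in for the shared client: transport settings are fixed at build time.
pub struct SharedClient {
    timeout: Duration,
}

impl SharedClient {
    pub fn build(config: &Config) -> Self {
        SharedClient { timeout: Duration::from_millis(config.timeout_ms) }
    }
}

/// What one session ends up using.
pub struct Session {
    pub request: RequestSettings,
    pub timeout: Duration,
}

/// Request-level values come from the current config; the timeout comes from
/// the client that was built earlier, so it can lag behind the config.
pub fn start_session(client: &SharedClient, current: &Config) -> Session {
    Session {
        request: current.request.clone(),
        timeout: client.timeout,
    }
}
```

If the config is edited after the client is built, the next session picks up the new `model` immediately but still runs with the old timeout until the client is rebuilt.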

Benchmark Notes

I ran a real-endpoint benchmark against the current Azure Foundry OpenAI-compatible endpoint used by this setup.

Most relevant comparison:

  • current-like path: HTTP/1.1 + fresh client per request
  • optimized path: HTTP/2 + reused client

30-request result:

  • current-like path

    • average: 1.390s
    • P50: 1.319s
    • P90: 1.780s
    • P95: 1.789s
  • optimized path

    • average: 1.113s
    • P50: 1.079s
    • P90: 1.470s
    • P95: 1.584s

Observed improvement:

  • average latency reduced by 0.277s (1.390s to 1.113s)
  • about 19.9% lower average latency in this benchmark
  • tail latency also improved: P90 dropped by 0.310s and P95 by 0.205s
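
For reference, the improvement figures follow directly from the 30-request results reported above. The helper names below (`reduction`, `RESULTS`, `report`) are hypothetical; the numbers are the ones from this benchmark.

```rust
/// Absolute (seconds) and relative (percent) latency reduction.
pub fn reduction(before_s: f64, after_s: f64) -> (f64, f64) {
    let abs = before_s - after_s;
    (abs, abs / before_s * 100.0)
}

/// (metric, current-like path, optimized path), in seconds.
pub const RESULTS: [(&str, f64, f64); 4] = [
    ("average", 1.390, 1.113),
    ("P50", 1.319, 1.079),
    ("P90", 1.780, 1.470),
    ("P95", 1.789, 1.584),
];

/// One formatted line per metric.
pub fn report() -> Vec<String> {
    RESULTS
        .iter()
        .map(|(name, before, after)| {
            let (abs, pct) = reduction(*before, *after);
            format!("{}: -{:.3}s ({:.1}%)", name, abs, pct)
        })
        .collect()
}
```

The relative gain is largest for the average and the median and smallest at P95, so the slowest requests benefit least from the warm connection.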

I also verified that the endpoint negotiates HTTP/2 when the client has HTTP/2 enabled. This is not a forced-HTTP/2 change: if an upstream only supports HTTP/1.1, the client can still fall back automatically.
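
The fallback behavior can be pictured with a simplified model of protocol negotiation. This is an assumption-laden sketch, not reqwest's internals: over TLS the choice is typically made through ALPN, where the client offers protocol identifiers such as `h2` and `http/1.1` and one both sides support is selected. The functions `client_offers` and `negotiate` are hypothetical, and real servers apply their own preference order.

```rust
/// Protocols a client would offer, most preferred first.
pub fn client_offers(http2_enabled: bool) -> Vec<&'static str> {
    if http2_enabled {
        vec!["h2", "http/1.1"]
    } else {
        vec!["http/1.1"]
    }
}

/// Simplified selection: the first offered protocol the server also supports.
pub fn negotiate<'a>(offers: &[&'a str], server_supports: &[&str]) -> Option<&'a str> {
    offers
        .iter()
        .copied()
        .find(|offered| server_supports.iter().any(|s| *s == *offered))
}
```

Enabling HTTP/2 only adds `h2` to the offer, so an upstream that supports HTTP/1.1 alone still ends up on HTTP/1.1.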

Validation

Verified locally with:

cargo test -p koe-core
cargo build --manifest-path koe-core/Cargo.toml --release --target x86_64-pc-windows-msvc
cmake -B KoeWin/build-x64 -S KoeWin -G "Visual Studio 18 2026" -A x64
cmake --build KoeWin/build-x64 --config Release

Also validated end-to-end on both Windows and macOS after rebasing onto the latest upstream main.

User-Facing Behavior

No workflow changes.

The only user-facing change is in the documentation: the llm.timeout_ms entry now carries an explicit note:

  • restart Koe after changing this value

The timeout is applied when the shared HTTP client is built, and that client is long-lived.

@hyspace hyspace force-pushed the codex/llm-latency-optimization-http2 branch from d58f2f6 to 11157c5 Compare March 27, 2026 08:39

missuo commented Mar 27, 2026

Owner

For now, I’d like to avoid merging the Windows work into the main branch. The macOS version is still in the early stages of development, and because the two platforms use completely different frameworks, keeping features aligned between macOS and Windows is challenging. I think it’s better to focus on polishing the Windows version once the macOS version is essentially complete.

@missuo missuo merged commit 051dc6d into missuo:main Mar 27, 2026

hyspace commented Mar 27, 2026

Contributor Author

I was switching between macOS and Windows to test and accidentally brought in the Windows changes. I have rebased and fixed it.

missuo commented Mar 27, 2026

Owner

I was switching between macOS and Windows to test and accidentally brought in the Windows changes. I have rebased and fixed it.

Thx!

@hyspace hyspace deleted the codex/llm-latency-optimization-http2 branch March 29, 2026 04:21