perf: trace and reduce remote first-turn latency #30632

Closed
apanasenko-oai wants to merge 1 commit into main from apanasenko/cca-ttft-core

Conversation

@apanasenko-oai

Collaborator

Summary

This change makes remote first-turn and command latency attributable end to end, and removes several avoidable waits found while profiling that path.

  • propagate W3C trace context across Core, exec-server RPC, and the encrypted Noise relay
  • add stage-level spans for tool dispatch, RPC queues, relay framing, local process lifecycle, terminal delivery, and thread metadata work
  • consume pushed terminal events directly instead of polling when the transport supports them
  • keep thread metadata persistence off the synchronous event-delivery path while preserving explicit flush barriers
  • make turn-diff repository discovery lazy
  • reuse the AGENTS project root once during adjacent startup skill discovery, with a pipelined fallback for later independent loads
  • preserve client/server timing correlation for Responses WebSocket requests
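For readers unfamiliar with the first bullet, the following is a minimal sketch of what carrying W3C trace context across a hop involves. It is not the code in this PR: the `TraceParent` type and its methods are hypothetical names introduced for illustration, and only the version `00` `traceparent` header layout from the W3C Trace Context specification is assumed. Each hop parses the incoming header, keeps the trace ID, and substitutes its own span ID before forwarding.

```rust
/// Hypothetical representation of a W3C `traceparent` header (version 00):
/// `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: [u8; 16],
    pub parent_id: [u8; 8],
    pub flags: u8,
}

/// Decode lowercase hex into `out`; the spec requires lowercase digits.
fn decode_hex(s: &str, out: &mut [u8]) -> Option<()> {
    let is_lower_hex = |b: u8| matches!(b, b'0'..=b'9' | b'a'..=b'f');
    if s.len() != out.len() * 2 || !s.bytes().all(is_lower_hex) {
        return None;
    }
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(())
}

fn encode_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

impl TraceParent {
    /// Parse a version-00 header; returns `None` for anything malformed.
    pub fn parse(header: &str) -> Option<Self> {
        let mut parts = header.trim().split('-');
        let version = parts.next()?;
        let trace = parts.next()?;
        let parent = parts.next()?;
        let flags = parts.next()?;
        // Version 00 has exactly four fields.
        if version != "00" || parts.next().is_some() {
            return None;
        }
        let mut tp = TraceParent { trace_id: [0; 16], parent_id: [0; 8], flags: 0 };
        decode_hex(trace, &mut tp.trace_id)?;
        decode_hex(parent, &mut tp.parent_id)?;
        let mut flag_byte = [0u8; 1];
        decode_hex(flags, &mut flag_byte)?;
        tp.flags = flag_byte[0];
        // All-zero trace or parent IDs are invalid per the spec.
        if tp.trace_id == [0; 16] || tp.parent_id == [0; 8] {
            return None;
        }
        Some(tp)
    }

    /// Context to forward to the next hop: same trace, new parent span.
    pub fn child(&self, span_id: [u8; 8]) -> Self {
        TraceParent { trace_id: self.trace_id, parent_id: span_id, flags: self.flags }
    }

    pub fn to_header(&self) -> String {
        format!(
            "00-{}-{}-{:02x}",
            encode_hex(&self.trace_id),
            encode_hex(&self.parent_id),
            self.flags
        )
    }
}
```

Because the trace ID survives every hop unchanged, spans emitted on either side of an RPC or relay boundary can be joined into one trace. How the header is actually carried inside the exec-server RPC and Noise relay frames is not shown here.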

Why

The previous telemetry exposed large envelopes such as exec_command and session_init, but could not distinguish queue residence, transport transit, child-process work, terminal notification delivery, or persistence. Profiling also showed that skill discovery repeated the same remote ancestor-marker walk that AGENTS discovery had just completed.
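To illustrate the difference between one large envelope and stage-level attribution, here is a small sketch that splits a single command's wall time into consecutive stages. The `Stage` enum and `StageTimeline` type are hypothetical and exist only for this example; the real change uses tracing spans rather than a hand-rolled timeline, and timestamps are passed in explicitly here so the sketch stays deterministic.

```rust
use std::time::Duration;

/// Hypothetical stage names mirroring the categories listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Stage {
    QueueResidence,
    TransportTransit,
    ChildProcess,
    TerminalDelivery,
    Persistence,
}

/// Records how long each consecutive stage took within one envelope.
pub struct StageTimeline {
    last_mark: Duration,
    stages: Vec<(Stage, Duration)>,
}

impl StageTimeline {
    /// `start` is an offset from an arbitrary epoch (for example, process start).
    pub fn start_at(start: Duration) -> Self {
        StageTimeline { last_mark: start, stages: Vec::new() }
    }

    /// Close the current stage at `now` and attribute the elapsed time to it.
    pub fn finish_stage(&mut self, stage: Stage, now: Duration) {
        // saturating_sub guards against out-of-order timestamps.
        let elapsed = now.saturating_sub(self.last_mark);
        self.stages.push((stage, elapsed));
        self.last_mark = self.last_mark.max(now);
    }

    /// The envelope duration: all an un-instrumented span would report.
    pub fn total(&self) -> Duration {
        self.stages.iter().map(|(_, d)| *d).sum()
    }

    /// The stage that accounts for the most time, if any were recorded.
    pub fn dominant(&self) -> Option<(Stage, Duration)> {
        self.stages.iter().copied().max_by_key(|(_, d)| *d)
    }
}
```

With only `total()` available, a slow command looks the same whether the time went to queueing, transit, or the child process; `dominant()` is the kind of question the added spans make answerable.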

The one-shot root handoff removes that duplicate startup walk without caching filesystem topology across turns. Later loads rediscover the nearest marker, so creating or removing a closer repository boundary remains observable.
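The handoff described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `RootResolver`, its fields, and the in-memory `markers` set standing in for the remote filesystem are all hypothetical. The hint is consumed on first use, so every later call performs a fresh ancestor walk.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Hypothetical resolver that accepts a single-use root hint.
pub struct RootResolver {
    /// (cwd the hint was computed for, root found for that cwd).
    one_shot_hint: Option<(PathBuf, PathBuf)>,
    /// Number of ancestor walks performed, for illustration only.
    pub walks: usize,
}

impl RootResolver {
    pub fn with_hint(cwd: PathBuf, root: PathBuf) -> Self {
        RootResolver { one_shot_hint: Some((cwd, root)), walks: 0 }
    }

    /// `markers` models the directories that currently contain a project
    /// marker; a real implementation would query the filesystem instead.
    pub fn resolve(&mut self, cwd: &Path, markers: &HashSet<PathBuf>) -> Option<PathBuf> {
        // `take()` clears the hint whether or not it applies, so it can
        // never be reused on a later turn.
        if let Some((hint_cwd, root)) = self.one_shot_hint.take() {
            if hint_cwd == cwd {
                return Some(root);
            }
        }
        self.walks += 1;
        // `ancestors()` yields cwd first, so the nearest marker wins.
        cwd.ancestors()
            .find(|dir| markers.contains(*dir))
            .map(Path::to_path_buf)
    }
}
```

Because nothing is cached after the first call, a marker created closer to the working directory is picked up by the next resolve, which is the property the named `cargo test` case in the Validation list appears to check.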

Validation

  • cargo fmt --all -- --check
  • cargo check -p codex-core
  • cargo test -p codex-core-skills repo_skill_roots_use_one_shot_hint_then_observe_new_nearest_marker -- --nocapture
  • cargo test -p codex-exec-server process_lifecycle_trace_separates_output_readers_and_exit_watcher -- --nocapture
  • cargo test -p codex-exec-server --test exec_process
  • cargo test -p codex-thread-store
  • just bazel-lock-check

The new trace boundaries were also exercised with a real remote command flow. Local process handling remained small; the added spans separated it from the dominant transport legs and exposed the repeated startup metadata RPCs.

This is an internally requested performance and observability enhancement; there is no public issue.

@apanasenko-oai

Collaborator Author

Superseded by focused draft PRs:

The original branch is preserved, but this umbrella is closed to avoid duplicate review and CI.
