Experimental Audio Support #1475
Merged
Adds ALSA capture from the USB UAC1 gadget, G.722/PCMU encoding, a WebRTC audio track, and an e2e remote-agent flow that plays a tone on the remote host and verifies it reaches the browser. Snapshot of codex-driven implementation before simplification.
Backend:
- audio.go: drop source rotation + "no-data reopen with next device" loop. One source (UAC1Gadget; falls back to hw:1,0 only if sysfs lookup fails).
- internal/audio: remove unused Reader interface and unavailableCapture stub. Stub now returns the concrete type with an error.
- webrtc.go: inline single-use resolveAudioCodec helper; DRY video/audio RTCP drain into drainRTCP; fold startSessionAudio into the connect callback.

Frontend:
- devices.$id.tsx: drop remoteMediaStreamRef track-merging. Backend tracks share stream ID "kvm", so pion delivers them in one MediaStream; just assign event.streams[0].
- WebRTCVideo.tsx: replace dynamic per-track <audio> creation + ref array with a single hidden <audio> bound to mediaStream.

Remote agent:
- Drop PipeWire/wpctl detection path; plughw: works directly.
- Drop killStaleAudioToneProcesses pkill workaround; the (cmd, cancel, done) trio collapses to a single *exec.Cmd field with Start/Kill/Wait.

E2E:
- ra-audio.spec.ts: drop attachAudioDiagnostics scaffold and openReadyPage duplicate. Spec is now linear: setup → wait for track → diff stats → tone.

Net: ~355 LOC removed.
…split The previous codex split routed audio start through onCurrentSessionConnected, gated on session == currentSession. But currentSession is assigned by the caller (web.go, cloud.go) AFTER ExchangeOffer returns, while OnICEConnectionStateChange can fire from inside ExchangeOffer or shortly after — racing the assignment. When the race hits, the equality check fails, the callback is skipped, and audio never starts. Pass session into the callback directly so the per-session setup uses the session in hand, not whatever currentSession happens to point to at that instant. Keep stopVideoSleepModeTicker on the count-edge (still only on first-session) and let onSessionConnected handle the rest unconditionally.
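The race described above can be sketched in a few lines of Go. This is a minimal illustration, not the code from webrtc.go: the Session type, the racyOnConnected name, and the audioStarted field are hypothetical stand-ins; only currentSession and onSessionConnected are names taken from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// Session is a hypothetical stand-in for the real WebRTC session type.
type Session struct {
	ID           string
	audioStarted bool
}

var (
	mu             sync.Mutex
	currentSession *Session // assigned by the caller only after ExchangeOffer returns
)

// racyOnConnected mirrors the old shape: it compares against the global, so
// it silently does nothing when the ICE callback fires before the caller
// has assigned currentSession.
func racyOnConnected(s *Session) {
	mu.Lock()
	defer mu.Unlock()
	if s != currentSession {
		return // race hit: audio never starts for this session
	}
	s.audioStarted = true
}

// onSessionConnected mirrors the new shape: per-session setup uses the
// session it was handed, regardless of what currentSession points to.
func onSessionConnected(s *Session) {
	s.audioStarted = true
}

func main() {
	a := &Session{ID: "a"}
	racyOnConnected(a) // callback fires before currentSession is assigned

	b := &Session{ID: "b"}
	onSessionConnected(b) // same timing, but no dependency on the global

	fmt.Println(a.audioStarted, b.audioStarted) // prints "false true"
}
```

The design point is that anything scoped to one session should take that session as a parameter; only the count-edge work (first session connected) needs to look at shared state.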
Chrome's adaptive receive-side jitter buffer grows under stress (e.g. playing a video on the controlled machine) and does not reliably shrink back; the Connection Stats "Playback Delay" graph used to climb to ~300 ms and stay there until the page was reloaded. The trigger is the USB UAC1 audio path, not video motion per se — once real audio starts flowing, Chrome's AV-sync layer pulls the video jitter buffer up to whatever the audio path settles at, and the ratchet locks in. Receiver-side hints (jitterBufferTarget, playoutDelayHint, setMinimumJitterBufferDelay) cap the steady state but don't release a buffer that has already grown. Fix: register the WebRTC playout-delay RTP header extension on both audio and video and stamp min=max=0 on every outgoing packet via a pion interceptor. Chrome treats this as an authoritative override of its adaptive logic and keeps both buffers at the decoder floor through and after stress, with no peer-connection rebuild needed. Test: drive the host display with a real audio+video file via gst-launch playbin (audio routed through PipeWire to the USB UAC1 sink) and assert receive-side video delay stays bounded both during and after playback.
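For reference, the playout-delay extension payload is three bytes: a 12-bit minimum and a 12-bit maximum, each in units of 10 ms, so min=max=0 is simply three zero bytes. The sketch below shows only that encoding. The function name marshalPlayoutDelay is hypothetical, and the actual interceptor in this PR may build the payload differently (with pion, the bytes would typically be attached to each packet through the RTP header's SetExtension call using the negotiated extension ID).

```go
package main

import (
	"errors"
	"fmt"
)

// PlayoutDelayURI is the extension URI that has to be negotiated in the SDP.
const PlayoutDelayURI = "http://www.webrtc.org/experiments/rtp-hdrext/playout-delay"

// maxDelayUnits is the largest value a 12-bit field can carry (10 ms units).
const maxDelayUnits = 1<<12 - 1

// marshalPlayoutDelay packs min and max (both in 10 ms units) into the
// 3-byte payload: 12 bits of min followed by 12 bits of max.
func marshalPlayoutDelay(minUnits, maxUnits uint16) ([]byte, error) {
	if minUnits > maxDelayUnits || maxUnits > maxDelayUnits {
		return nil, errors.New("playout delay does not fit in 12 bits")
	}
	if minUnits > maxUnits {
		return nil, errors.New("min playout delay exceeds max")
	}
	return []byte{
		byte(minUnits >> 4),
		byte(minUnits&0x0F)<<4 | byte(maxUnits>>8),
		byte(maxUnits),
	}, nil
}

func main() {
	payload, err := marshalPlayoutDelay(0, 0)
	if err != nil {
		panic(err)
	}
	fmt.Println(PlayoutDelayURI)
	fmt.Println(payload) // prints "[0 0 0]"
}
```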
- Drop short-read zero-fill in ALSA reader; return ErrNoAudioData so the capture loop emits no frame for the cycle instead of half-silent audio.
- Replace ErrNoAudioData = io.ErrNoProgress (wrong semantic) with a domain sentinel and remove the unused idleReads debug counter.
- Encoders sum all source samples before one divide: better precision, fewer ops; clampS16 and sampleS16 helpers gone.
- Resolve audio codec inline in runAudioCapture; drop the audioCodecForTrack wrapper. Caller checks AudioTrack != nil so startAudio no longer accepts nil as a stop signal.
- Use C.GoString instead of hand-rolled cString helper.
- Add a why-comment on the separate <audio> element (video stays muted).
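The "sum all source samples before one divide" item can be illustrated as follows. This is a sketch under assumptions, not the encoder code from internal/audio: the function name downmixDecimate is hypothetical, the input is assumed to be interleaved signed 16-bit stereo at 48 kHz, and factor is 3 for 16 kHz output or 6 for 8 kHz output.

```go
package main

import "fmt"

// downmixDecimate averages each group of factor stereo frames into one mono
// sample. All source samples are accumulated in an int32 and divided once,
// so there is a single rounding step per output sample. The mean of int16
// values always fits in an int16, which is why no clamp helper is needed.
func downmixDecimate(in []int16, factor int) []int16 {
	const channels = 2
	if factor <= 0 {
		return nil
	}
	group := factor * channels
	out := make([]int16, 0, len(in)/group)
	for i := 0; i+group <= len(in); i += group {
		var sum int32
		for _, s := range in[i : i+group] {
			sum += int32(s)
		}
		out = append(out, int16(sum/int32(group)))
	}
	return out
}

func main() {
	// Two groups of three stereo frames (48 kHz stereo to 16 kHz mono).
	in := []int16{
		3, 3, 3, 3, 3, 3,
		-6, -6, -6, -6, -6, -6,
	}
	fmt.Println(downmixDecimate(in, 3)) // prints "[3 -6]"
}
```

Plain averaging acts as a boxcar filter, which is a cheap approximation rather than a proper anti-aliasing low-pass; that trade-off is an observation about this sketch, not a claim about the PR's encoders.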
Keep the package-level "why" (Chrome's one-way jitter buffer); drop restate-the-signature comments on Factory, NewFactory, NewInterceptor, and BindLocalStream.
- Lift ensureNoPasswordViaAPI and waitForAudioStream into helpers.ts (the audio spec was inlining both, the latter as a copy of waitForVideoStream).
- ra-audio.spec.ts shrinks from 78 to 55 lines.
- Remove the JETKVM_AUDIO_DEVICE override in the remote agent: it fabricated an AudioDeviceInfo with is_jetkvm=true regardless of what device the env var pointed at, silently lying to the spec's assertion. Audio device discovery via aplay + /proc/asound/.../usbid is reliable; if no JetKVM device is present the spec already skips.
The C-side recovers EPIPE/ESTRPIPE via snd_pcm_recover; the errors that surface to Go (EBADFD, ENODEV, …) usually mean the handle is dead — typically a USB gadget rebuild or host reattach mid-session, which used to leave audio silent until the session disconnected. After 5 consecutive non-idle read errors, close and reopen the capture with exponential backoff (100 ms → 2 s cap). Initial open uses the same helper so we keep retrying instead of giving up if the gadget isn't ready yet. Re-resolves the card each attempt so a USB re-enumeration that shifts the card number is picked up automatically.
Audio is opt-in via device config. New Audio nav entry in Settings sits next to Video, with a single "Enable Audio" item marked Experimental (mirrors HTTPS Mode in Access).

Backend:
- Config.AudioEnabled (default false), persisted to /userdata/kvm_config.json
- getAudioConfig / setAudioConfig JSON-RPC handlers
- webrtc.go: extract attachAudioTrack helper; skip track creation when disabled or when the offer advertises no supported codec. The SDP answer leaves the audio m-line inactive, so flipping the toggle requires a fresh connection (page reload).

Frontend:
- New devices.$id.settings.audio.tsx: fetches state via getAudioConfig, saves via setAudioConfig, optimistic UI with rollback on error.
- devices.$id.tsx always offers audio in the SDP; backend decides.
- en.json + 13 locale files: 5 keys each (audio_*, settings_audio) with proper translations honoring per-language formality.

E2E:
- ra-audio.spec.ts: connect, enable via RPC, reload, verify audio energy. Restores disabled state in a finally block so other specs aren't affected. 9 s on kvm-2 + .180.
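The "skip track creation when disabled or when the offer advertises no supported codec" rule can be sketched as a pure function over the config flag and the offer SDP. This is an illustration only: pickAudioCodec is a hypothetical name, the real attachAudioTrack helper may inspect the offer through pion's parsed types rather than raw text, and the G.722-before-PCMU preference follows the "G.722 / PCMU fallback" wording in the PR description.

```go
package main

import (
	"fmt"
	"strings"
)

// pickAudioCodec returns the codec to use for the outgoing audio track, or
// false when no track should be created. It only looks at a=rtpmap lines
// inside m=audio sections; offers that rely on static payload types without
// an rtpmap line are not handled by this sketch.
func pickAudioCodec(enabled bool, offerSDP string) (string, bool) {
	if !enabled {
		return "", false
	}
	var hasG722, hasPCMU, inAudio bool
	for _, line := range strings.Split(offerSDP, "\n") {
		line = strings.TrimSpace(line) // also strips the trailing \r
		switch {
		case strings.HasPrefix(line, "m="):
			inAudio = strings.HasPrefix(line, "m=audio")
		case inAudio && strings.HasPrefix(line, "a=rtpmap:"):
			fields := strings.Fields(line)
			if len(fields) < 2 {
				continue
			}
			switch strings.ToUpper(fields[1]) {
			case "G722/8000": // G.722 is signaled with an 8000 Hz RTP clock
				hasG722 = true
			case "PCMU/8000":
				hasPCMU = true
			}
		}
	}
	switch {
	case hasG722:
		return "G722", true
	case hasPCMU:
		return "PCMU", true
	}
	return "", false
}

func main() {
	offer := "v=0\r\nm=audio 9 UDP/TLS/RTP/SAVPF 111 9 0\r\n" +
		"a=rtpmap:111 opus/48000/2\r\na=rtpmap:9 G722/8000\r\na=rtpmap:0 PCMU/8000\r\n"

	fmt.Println(pickAudioCodec(true, offer))  // prints "G722 true"
	fmt.Println(pickAudioCodec(false, offer)) // prints " false"
}
```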
Re-negotiation only happens on a fresh WebRTC session, and the autoplay overlay needs a user gesture to play the new audio track. A simple reload covers both — the toggle's user click acts as the gesture, the new offer includes audio, and the overlay surfaces normally. Disable path is unchanged (audio stops naturally on the next connect).
Page header and item description previously said roughly the same thing
in long form. Now: page-level describes the topic ("Stream audio from
the host to your browser"); item-level is a terse one-liner ("Stream
HDMI audio from the host."). Drops the "Requires a fresh connection"
clause — the page auto-reloads on toggle, so it's no longer accurate.
Per-language tone follows I18N_BEST_PRACTICES.md: formal Sie/vous/usted/
вы/chi (de/fr/es/ru/cy), informal du (sv/nb/da), polite です/ます (ja),
infinitive (it), European Portuguese (pt).
Drop the dedicated test_audio_e2e Makefile target and the separate remote-agent-audio Playwright project. The remote-agent project now matches every ra-*.spec.ts under e2e/remote-agent, so make test_e2e runs ra-audio.spec.ts alongside ra-all.spec.ts in the same worker.
Hit a state where pc.getReceivers() showed live video and audio tracks but useRTCStore.mediaStream stayed undefined — the SDP answer arrived without a=msid, so event.streams[0] was undefined and setMediaStream(undefined) left the store empty even though RTP was flowing. Only a hard reload recovered. Now: when the event carries a stream, use it as before. When it doesn't, get-or-create a MediaStream and append the track. Re-using the existing store value across both ontrack invocations keeps audio + video on the same MediaStream so the autoplay/video pipeline downstream is unchanged.
Firefox's soft reload doesn't always tear down the RTCPeerConnection, which leaves the post-reload page in a half-renegotiated state: tracks arrive on receivers but never attach to a MediaStream, so video stays stuck on "Loading…" (or the page falls back to the pre-connect blue background) until a hard refresh. Closing the PC explicitly before reload guarantees a clean start.
The answer SDP from pion omits a=msid for the audio track in some configurations (visible on Firefox: video keeps its msid, audio doesn't). The previous handler called setMediaStream(event.streams[0]) on each ontrack, so:

1. video ontrack → setMediaStream(streamA) [has video]
2. audio ontrack → setMediaStream(streamB) [synthetic, audio only]

streamB replaces streamA, video disappears. Hard refresh only "fixed" it incidentally; the same SDP would break the next negotiation too.

Now: ignore event.streams[0], maintain one canonical MediaStream in the store, and addTrack into it on every ontrack. Browsers render tracks added to a live MediaStream that's already attached to srcObject, so both audio and video stay attached regardless of which order ontrack fires or whether the SDP carried msid.
The backend keeps the m=audio section in the SDP even when audio is disabled (just inactive direction), so Firefox still attaches a muted audio track to the MediaStream. The autoplay <audio> element then triggers Firefox's "block audio" policy on a stream that will never actually play any sound. Fetch getAudioConfig once the RPC channel is up, then conditionally render the <audio> element. No autoplay prompt when audio is off.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 330f617.

Captures audio from the host's HDMI source via a UAC1 USB gadget, encodes it server-side (G.722 / PCMU fallback), and pipes it to the browser over the existing WebRTC session.
Decisions
- Settings → Audio → Enable Audio (badge: Experimental) writes to device config; toggling the switch reloads the page so the next WebRTC handshake renegotiates cleanly. Frontend only renders the <audio> element when the device says audio is on, so Firefox's "block audio" policy doesn't fire on a silent stream.
- Session carries both tracks. Audio is negotiated inside the same offer/answer exchange: no separate signaling path, no second peer connection, no extra dependencies. The video path is untouched when audio is disabled.
- ontrack ignores event.streams[0] and accumulates every incoming track into one stream stored in the RTC store. Robust to answers that omit a=msid for audio (which Firefox would otherwise route into a synthetic per-track stream that wipes out the video).

Mechanisms added
- (internal/audio/). Cgo-loaded libasound via dlopen so the regular Go build works without ALSA headers in the sysroot; periodic blocking reads, no zero-fill on short reads.
- (internal/audio/g722.go, g711.go), browser-negotiated based on the offer SDP. 48 kHz stereo → 8 kHz / 16 kHz mono with a single-divide downsample.
- (internal/playoutdelay/). Stamps min=max=0 on every outgoing RTP packet for both video and audio. Chrome's adaptive jitter buffer is one-way: it grows under jittery input (e.g. variable-size H.264 frames during high-motion host content) and never shrinks back, leaving Playback Delay stuck at hundreds of ms until reload. The extension is an authoritative sender-side override that pins the receiver at the floor; audio is registered too because Chrome's AV-sync layer pulls video up to whatever the audio jitter buffer is.

Tests
End-to-end via Playwright against a real device + remote host: plays a 997 Hz tone on the USB ALSA sink and verifies getStats() inbound audio energy on the browser. Folded into make test_e2e under the existing remote-agent project.

Note
High Risk
Adds a new audio capture/encode pipeline (ALSA via cgo/dlopen) and modifies WebRTC negotiation/session lifecycle, which are latency- and stability-sensitive paths that can impact streaming reliability across browsers and devices.
Overview
Adds experimental, opt-in host audio streaming over the existing WebRTC connection, including device-level config (AudioEnabled) and new JSON-RPC endpoints (getAudioConfig/setAudioConfig) plus a Settings page to toggle it (forces reload for renegotiation).

Implements an ALSA capture stack in internal/audio (cgo libasound loaded via dlopen, plus Go G.711 µ-law and G.722 encoders) and a new audio.go capture loop with backoff/reopen logic and safe teardown (stopAudioIfOwner). WebRTC now always offers an audio m-line, conditionally attaches an outgoing audio track based on config+offer codec, adds a playout-delay RTP header-extension interceptor for both audio and video, and updates the UI to use a separate <audio> element and a canonical MediaStream that accumulates tracks (with mediaStreamTrackVersion for consumers).

Extends the USB gadget defaults to include a UAC1 audio function and expands e2e coverage: the remote-agent can list ALSA devices and play/stop a test tone, while Playwright adds an end-to-end spec that verifies inbound audio stats and improves remote-agent deploy caching.
Reviewed by Cursor Bugbot for commit c443959. Bugbot is set up for automated code reviews on this repo.