Experimental Audio Support #1475

Merged: adamshiervani merged 20 commits into dev from audio-usb on May 22, 2026

@adamshiervani adamshiervani commented May 21, 2026

Captures audio from the host's HDMI source via a UAC1 USB gadget, encodes it server-side (G.722 / PCMU fallback), and pipes it to the browser over the existing WebRTC session.

Decisions

  • Off by default. New Settings → Audio → Enable Audio (badge: Experimental) writes to device config; toggling the switch reloads the page so the next WebRTC handshake renegotiates cleanly. Frontend only renders the <audio> element when the device says audio is on, so Firefox's "block audio" policy doesn't fire on a silent stream.
  • Single pion Session carries both tracks. Audio is negotiated inside the same offer/answer exchange — no separate signaling path, no second peer connection, no extra dependencies. The video path is untouched when audio is disabled.
  • One canonical MediaStream on the client. ontrack ignores event.streams[0] and accumulates every incoming track into one stream stored in the RTC store. Robust to answers that omit a=msid for audio (which Firefox would otherwise route into a synthetic per-track stream that wipes out the video).

Mechanisms added

  • ALSA capture (internal/audio/). Cgo-loaded libasound via dlopen so the regular Go build works without ALSA headers in the sysroot; periodic blocking reads, no zero-fill on short reads.
  • G.722 + G.711 µ-law (PCMU) encoders in pure Go (internal/audio/g722.go, g711.go); the codec is chosen from what the browser's offer SDP advertises. 48 kHz stereo is downsampled to 16 kHz mono for G.722 or 8 kHz mono for PCMU by summing the source samples and dividing once.
  • Reconnect logic. After 5 consecutive non-idle read errors the capture closes and reopens with exponential backoff (100 ms → 2 s cap), re-resolving the ALSA card index on each attempt so USB re-enumeration with a shifted card number is handled for free.
  • WebRTC playout-delay header extension (internal/playoutdelay/). Stamps min=max=0 on every outgoing RTP packet for both video and audio. Chrome's adaptive jitter buffer is one-way — it grows under jittery input (e.g. variable-size H.264 frames during high-motion host content) and never shrinks back, leaving Playback Delay stuck at hundreds of ms until reload. The extension is an authoritative sender-side override that pins the receiver at the floor; audio is registered too because Chrome's AV-sync layer pulls video up to whatever the audio jitter buffer is.
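
As an illustration of the pure-Go G.711 encoder mentioned above, here is a minimal sketch of the standard µ-law companding step for one 16-bit sample. It is not the code in internal/audio/g711.go; the names `encodeMuLaw`, `muLawBias`, and `muLawClip` are assumptions made for this example.

```go
package main

import "fmt"

const (
	muLawBias = 0x84  // 132, added to the magnitude before the segment search
	muLawClip = 32635 // largest magnitude that still fits in 15 bits after the bias
)

// encodeMuLaw converts one signed 16-bit linear PCM sample into an 8-bit
// G.711 µ-law byte (sign bit, 3-bit segment, 4-bit mantissa, all inverted).
// Hypothetical helper for illustration only.
func encodeMuLaw(sample int16) byte {
	v := int(sample)
	sign := 0
	if v < 0 {
		v = -v
		sign = 0x80
	}
	if v > muLawClip {
		v = muLawClip
	}
	v += muLawBias

	// Find the segment: position of the highest set bit among bits 14..8.
	exp := 7
	for mask := 0x4000; exp > 0 && v&mask == 0; mask >>= 1 {
		exp--
	}
	mantissa := (v >> (exp + 3)) & 0x0F
	return ^byte(sign | exp<<4 | mantissa)
}

func main() {
	for _, s := range []int16{0, 1000, -1000, 32767} {
		fmt.Printf("%6d -> 0x%02X\n", s, encodeMuLaw(s))
	}
}
```

Silence (sample 0) encodes to 0xFF and full-scale positive input encodes to 0x80, which is a quick sanity check for any µ-law encoder.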

Tests

End-to-end via Playwright against a real device + remote host: plays a 997 Hz tone on the USB ALSA sink and verifies getStats() inbound audio energy on the browser. Folded into make test_e2e under the existing remote-agent project.


Note

High Risk
Adds a new audio capture/encode pipeline (ALSA via cgo/dlopen) and modifies WebRTC negotiation/session lifecycle, which are latency- and stability-sensitive paths that can impact streaming reliability across browsers and devices.

Overview
Adds experimental, opt-in host audio streaming over the existing WebRTC connection, including device-level config (AudioEnabled) and new JSON-RPC endpoints (getAudioConfig/setAudioConfig) plus a Settings page to toggle it (forces reload for renegotiation).

Implements an ALSA capture stack in internal/audio (cgo libasound loaded via dlopen, plus Go G.711 µ-law and G.722 encoders) and a new audio.go capture loop with backoff/reopen logic and safe teardown (stopAudioIfOwner). WebRTC now always offers an audio m-line, conditionally attaches an outgoing audio track based on config+offer codec, adds a playout-delay RTP header-extension interceptor for both audio and video, and updates the UI to use a separate <audio> element and a canonical MediaStream that accumulates tracks (with mediaStreamTrackVersion for consumers).

Extends the USB gadget defaults to include a UAC1 audio function and expands e2e coverage: the remote-agent can list ALSA devices and play/stop a test tone, while Playwright adds an end-to-end spec that verifies inbound audio stats and improves remote-agent deploy caching.

Reviewed by Cursor Bugbot for commit c443959.

Adds ALSA capture from the USB UAC1 gadget, G.722/PCMU encoding, a WebRTC
audio track, and an e2e remote-agent flow that plays a tone on the remote
host and verifies it reaches the browser.

Snapshot of codex-driven implementation before simplification.

Backend:
- audio.go: drop source rotation + "no-data reopen with next device" loop.
  One source (UAC1Gadget; falls back to hw:1,0 only if sysfs lookup fails).
- internal/audio: remove unused Reader interface and unavailableCapture stub.
  Stub now returns the concrete type with an error.
- webrtc.go: inline single-use resolveAudioCodec helper; DRY video/audio RTCP
  drain into drainRTCP; fold startSessionAudio into the connect callback.

Frontend:
- devices.$id.tsx: drop remoteMediaStreamRef track-merging. Backend tracks
  share stream ID "kvm", so pion delivers them in one MediaStream — just
  assign event.streams[0].
- WebRTCVideo.tsx: replace dynamic per-track <audio> creation + ref array
  with a single hidden <audio> bound to mediaStream.

Remote agent:
- Drop PipeWire/wpctl detection path; plughw: works directly.
- Drop killStaleAudioToneProcesses pkill workaround; the (cmd, cancel, done)
  trio collapses to a single *exec.Cmd field with Start/Kill/Wait.

E2E:
- ra-audio.spec.ts: drop attachAudioDiagnostics scaffold and openReadyPage
  duplicate. Spec is now linear: setup → wait for track → diff stats → tone.

Net: ~355 LOC removed.

…split

The previous codex split routed audio start through onCurrentSessionConnected,
gated on session == currentSession. But currentSession is assigned by the
caller (web.go, cloud.go) AFTER ExchangeOffer returns, while
OnICEConnectionStateChange can fire from inside ExchangeOffer or shortly
after — racing the assignment. When the race hits, the equality check fails,
the callback is skipped, and audio never starts.

Pass session into the callback directly so the per-session setup uses the
session in hand, not whatever currentSession happens to point to at that
instant. Keep stopVideoSleepModeTicker on the count-edge (still only on
first-session) and let onSessionConnected handle the rest unconditionally.
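
The race and its fix can be sketched in a few lines. This is a deterministic simulation, not the code in webrtc.go: `Session`, `connectGated`, and `connectDirect` are hypothetical names, and the "early" connect event is simulated by calling the callback before the global is assigned.

```go
package main

import "fmt"

// Session is a hypothetical stand-in for one WebRTC session.
type Session struct{ ID string }

// currentSession mimics the global that the caller assigns only after the
// offer exchange has returned.
var currentSession *Session

// connectGated is the racy shape: per-session setup runs only if the session
// already equals the global, so an early connect event is silently dropped.
func connectGated(s *Session, setup func(*Session)) {
	if s == currentSession {
		setup(s)
	}
}

// connectDirect is the fixed shape: setup always runs on the session in hand.
func connectDirect(s *Session, setup func(*Session)) {
	setup(s)
}

func main() {
	var started []string
	setup := func(s *Session) { started = append(started, s.ID) }

	s := &Session{ID: "new"}
	// Simulate the ICE "connected" event firing before the caller has
	// assigned currentSession.
	connectGated(s, setup)  // skipped: currentSession is still nil
	connectDirect(s, setup) // runs regardless of the global
	currentSession = s

	fmt.Println(started)
}
```

Only the direct variant records the session, which mirrors why audio never started when the equality check lost the race.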

Chrome's adaptive receive-side jitter buffer grows under stress (e.g.
playing a video on the controlled machine) and does not reliably shrink
back; the Connection Stats "Playback Delay" graph used to climb to
~300 ms and stay there until the page was reloaded.

The trigger is the USB UAC1 audio path, not video motion per se — once
real audio starts flowing, Chrome's AV-sync layer pulls the video
jitter buffer up to whatever the audio path settles at, and the ratchet
locks in. Receiver-side hints (jitterBufferTarget, playoutDelayHint,
setMinimumJitterBufferDelay) cap the steady state but don't release a
buffer that has already grown.

Fix: register the WebRTC playout-delay RTP header extension on both
audio and video and stamp min=max=0 on every outgoing packet via a
pion interceptor. Chrome treats this as an authoritative override of
its adaptive logic and keeps both buffers at the decoder floor through
and after stress, with no peer-connection rebuild needed.
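
For reference, the payload being stamped is small. The sketch below packs the playout-delay extension body as it is publicly documented: two 12-bit fields (min, then max) in units of 10 ms. It is not the code in internal/playoutdelay/; `marshalPlayoutDelay` is a hypothetical helper, and attaching the bytes to an RTP packet through a pion interceptor is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// URI that identifies the playout-delay extension in an SDP a=extmap line.
const playoutDelayURI = "http://www.webrtc.org/experiments/rtp-hdrext/playout-delay"

// Each field is 12 bits wide and counts 10 ms units.
const maxPlayoutDelayUnits = 0xFFF

// marshalPlayoutDelay packs min and max delay (milliseconds) into the 3-byte
// extension payload: 12 bits of min followed by 12 bits of max.
// Hypothetical helper for illustration only.
func marshalPlayoutDelay(minMs, maxMs int) ([3]byte, error) {
	minU, maxU := minMs/10, maxMs/10
	if minMs < 0 || maxU > maxPlayoutDelayUnits || minU > maxU {
		return [3]byte{}, errors.New("playout delay out of range")
	}
	return [3]byte{
		byte(minU >> 4),                    // min bits 11..4
		byte(minU&0x0F)<<4 | byte(maxU>>8), // min bits 3..0, max bits 11..8
		byte(maxU),                         // max bits 7..0
	}, nil
}

func main() {
	payload, err := marshalPlayoutDelay(0, 0)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s -> % X\n", playoutDelayURI, payload)
}
```

With min=max=0 the payload is three zero bytes, so the per-packet cost is the payload plus the header-extension framing.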

Test: drive the host display with a real audio+video file via
gst-launch playbin (audio routed through PipeWire to the USB UAC1
sink) and assert receive-side video delay stays bounded both during
and after playback.

- Drop short-read zero-fill in ALSA reader; return ErrNoAudioData so the
  capture loop emits no frame for the cycle instead of half-silent audio.
- Replace ErrNoAudioData = io.ErrNoProgress (wrong semantics) with a domain
  sentinel and remove the unused idleReads debug counter.
- Encoders sum all source samples before one divide — better precision,
  fewer ops; clampS16 and sampleS16 helpers gone.
- Resolve audio codec inline in runAudioCapture; drop the audioCodecForTrack
  wrapper. Caller checks AudioTrack != nil so startAudio no longer accepts
  nil as a stop signal.
- Use C.GoString instead of hand-rolled cString helper.
- Add a why-comment on the separate <audio> element (video stays muted).
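
The "sum all source samples before one divide" point can be sketched as follows, assuming interleaved 16-bit stereo input at 48 kHz and a decimation factor of 3 (16 kHz, G.722) or 6 (8 kHz, PCMU). `downmixDecimate` is a hypothetical name, and plain block averaging is only a crude low-pass; the real encoders may differ.

```go
package main

import "fmt"

// downmixDecimate turns interleaved stereo int16 samples into mono at
// 1/factor of the input rate. Each output sample is the average of
// factor*2 input samples, computed as one sum followed by one divide.
// Hypothetical helper for illustration only.
func downmixDecimate(in []int16, factor int) []int16 {
	const channels = 2
	if factor <= 0 {
		return nil
	}
	group := factor * channels
	out := make([]int16, 0, len(in)/group)
	for i := 0; i+group <= len(in); i += group {
		sum := 0
		for _, s := range in[i : i+group] {
			sum += int(s)
		}
		// The average of int16 values always fits in int16, so no clamp
		// is needed after the divide.
		out = append(out, int16(sum/group))
	}
	return out
}

func main() {
	stereo48k := []int16{1, 2, 3, 4, 5, 6, 10, 10, 10, 10, 10, 10}
	fmt.Println(downmixDecimate(stereo48k, 3)) // two output samples at 16 kHz
}
```

Accumulating in a wider integer and dividing once avoids the rounding loss of averaging pairs step by step, and it makes a separate clamp helper unnecessary.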

Keep the package-level "why" (Chrome's one-way jitter buffer); drop
restate-the-signature comments on Factory, NewFactory, NewInterceptor,
and BindLocalStream.
- Lift ensureNoPasswordViaAPI and waitForAudioStream into helpers.ts (the
  audio spec was inlining both, the latter as a copy of waitForVideoStream).
- ra-audio.spec.ts shrinks from 78 to 55 lines.
- Remove the JETKVM_AUDIO_DEVICE override in the remote agent: it fabricated
  an AudioDeviceInfo with is_jetkvm=true regardless of what device the env
  var pointed at, silently lying to the spec's assertion. Audio device
  discovery via aplay + /proc/asound/.../usbid is reliable; if no JetKVM
  device is present the spec already skips.

The C-side recovers EPIPE/ESTRPIPE via snd_pcm_recover; the errors that
surface to Go (EBADFD, ENODEV, …) usually mean the handle is dead —
typically a USB gadget rebuild or host reattach mid-session, which used
to leave audio silent until the session disconnected.

After 5 consecutive non-idle read errors, close and reopen the capture
with exponential backoff (100 ms → 2 s cap). Initial open uses the same
helper so we keep retrying instead of giving up if the gadget isn't ready
yet. Re-resolves the card each attempt so a USB re-enumeration that
shifts the card number is picked up automatically.

Audio is opt-in via device config. New Audio nav entry in Settings sits
next to Video, with a single "Enable Audio" item marked Experimental
(mirrors HTTPS Mode in Access).

Backend:
- Config.AudioEnabled (default false), persisted to /userdata/kvm_config.json
- getAudioConfig / setAudioConfig JSON-RPC handlers
- webrtc.go: extract attachAudioTrack helper; skip track creation when
  disabled or when the offer advertises no supported codec. The SDP
  answer leaves the audio m-line inactive, so flipping the toggle
  requires a fresh connection (page reload).
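
The "skip track creation when disabled or when the offer advertises no supported codec" rule could look roughly like this. It is a sketch, not the attachAudioTrack helper: `pickAudioCodec` and `shouldAttachAudio` are hypothetical, and the scan only looks at a=rtpmap lines anywhere in the SDP, so an offer that relies on static payload types without rtpmap entries would not be matched.

```go
package main

import (
	"fmt"
	"strings"
)

// pickAudioCodec scans an offer SDP for a=rtpmap entries and returns the
// preferred supported codec: G722 first, PCMU as the fallback.
// Hypothetical helper for illustration only.
func pickAudioCodec(offerSDP string) (string, bool) {
	hasPCMU := false
	for _, line := range strings.Split(offerSDP, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "a=rtpmap:") {
			continue
		}
		// Expected shape: "a=rtpmap:<pt> <name>/<clock>[/<channels>]".
		fields := strings.Fields(strings.TrimPrefix(line, "a=rtpmap:"))
		if len(fields) < 2 {
			continue
		}
		name := strings.ToUpper(strings.SplitN(fields[1], "/", 2)[0])
		switch name {
		case "G722":
			return "G722", true
		case "PCMU":
			hasPCMU = true
		}
	}
	if hasPCMU {
		return "PCMU", true
	}
	return "", false
}

// shouldAttachAudio combines the device config flag with the codec check.
func shouldAttachAudio(audioEnabled bool, offerSDP string) (string, bool) {
	if !audioEnabled {
		return "", false
	}
	return pickAudioCodec(offerSDP)
}

func main() {
	offer := "m=audio 9 UDP/TLS/RTP/SAVPF 111 9 0\r\n" +
		"a=rtpmap:111 opus/48000/2\r\n" +
		"a=rtpmap:9 G722/8000\r\n" +
		"a=rtpmap:0 PCMU/8000\r\n"
	fmt.Println(shouldAttachAudio(true, offer))
	fmt.Println(shouldAttachAudio(false, offer))
}
```

When neither condition holds, no track is attached and the answer can leave the audio m-line inactive, as described above.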

Frontend:
- New devices.$id.settings.audio.tsx — fetches state via getAudioConfig,
  saves via setAudioConfig, optimistic UI with rollback on error.
- devices.$id.tsx always offers audio in the SDP; backend decides.
- en.json + 13 locale files: 5 keys each (audio_*, settings_audio) with
  proper translations honoring per-language formality.

E2E:
- ra-audio.spec.ts: connect, enable via RPC, reload, verify audio energy.
  Restores disabled state in a finally block so other specs aren't
  affected. 9 s on kvm-2 + .180.

Re-negotiation only happens on a fresh WebRTC session, and the autoplay
overlay needs a user gesture to play the new audio track. A simple reload
covers both — the toggle's user click acts as the gesture, the new offer
includes audio, and the overlay surfaces normally.

Disable path is unchanged (audio stops naturally on the next connect).

Page header and item description previously said roughly the same thing
in long form. Now: page-level describes the topic ("Stream audio from
the host to your browser"); item-level is a terse one-liner ("Stream
HDMI audio from the host."). Drops the "Requires a fresh connection"
clause — the page auto-reloads on toggle, so it's no longer accurate.

Per-language tone follows I18N_BEST_PRACTICES.md: formal Sie/vous/usted/
вы/chi (de/fr/es/ru/cy), informal du (sv/nb/da), polite です/ます (ja),
infinitive (it), European Portuguese (pt).

Drop the dedicated test_audio_e2e Makefile target and the separate
remote-agent-audio Playwright project. The remote-agent project now
matches every ra-*.spec.ts under e2e/remote-agent, so make test_e2e
runs ra-audio.spec.ts alongside ra-all.spec.ts in the same worker.

Hit a state where pc.getReceivers() showed live video and audio tracks
but useRTCStore.mediaStream stayed undefined — the SDP answer arrived
without a=msid, so event.streams[0] was undefined and
setMediaStream(undefined) left the store empty even though RTP was
flowing. Only a hard reload recovered.

Now: when the event carries a stream, use it as before. When it doesn't,
get-or-create a MediaStream and append the track. Re-using the existing
store value across both ontrack invocations keeps audio + video on the
same MediaStream so the autoplay/video pipeline downstream is unchanged.

Firefox's soft reload doesn't always tear down the RTCPeerConnection,
which leaves the post-reload page in a half-renegotiated state: tracks
arrive on receivers but never attach to a MediaStream, so video stays
stuck on "Loading…" (or the page falls back to the pre-connect blue
background) until a hard refresh. Closing the PC explicitly before
reload guarantees a clean start.

The answer SDP from pion omits a=msid for the audio track in some
configurations (visible on Firefox: video keeps its msid, audio doesn't).
The previous handler called setMediaStream(event.streams[0]) on each
ontrack, so:

  1. video ontrack → setMediaStream(streamA)  [has video]
  2. audio ontrack → setMediaStream(streamB)  [synthetic, audio only]

streamB replaces streamA, video disappears. Hard refresh only "fixed" it
incidentally — the same SDP would break the next negotiation too.

Now: ignore event.streams[0], maintain one canonical MediaStream in the
store, and addTrack into it on every ontrack. Browsers render tracks
added to a live MediaStream that's already attached to srcObject, so
both audio and video stay attached regardless of which order ontrack
fires or whether the SDP carried msid.

The backend keeps the m=audio section in the SDP even when audio is
disabled (just inactive direction), so Firefox still attaches a muted
audio track to the MediaStream. The autoplay <audio> element then
triggers Firefox's "block audio" policy on a stream that will never
actually play any sound.

Fetch getAudioConfig once the RPC channel is up, then conditionally
render the <audio> element. No autoplay prompt when audio is off.

@adamshiervani changed the title from "USB audio capture from host HDMI to browser (experimental)" to "Experimental Audio Support" on May 21, 2026
@jetkvm jetkvm deleted a comment from cursor Bot May 22, 2026
@jetkvm jetkvm deleted a comment from cursor Bot May 22, 2026
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 330f617.

@adamshiervani adamshiervani marked this pull request as ready for review May 22, 2026 08:43
@adamshiervani adamshiervani merged commit 90750bf into dev May 22, 2026
5 checks passed
@adamshiervani adamshiervani deleted the audio-usb branch May 22, 2026 08:43