Add handler outcome Prometheus metric for SLO-based kill rate alerting by milos-lk · Pull Request #1230 · livekit/egress

milos-lk · 2026-05-20T21:24:20Z

We currently monitor egress handler kills via a logs that counts killed vs validated log lines. This is noisy at low traffic because small denominators cause the ratio to spike, and it mixes request-time events (validation) with process-exit events (kills), which skews the ratio for long-running egresses.

This PR adds a single prometheus counter, livekit_egress_handler_results_total{type, result}, that tracks terminal handler process outcomes. Every handler that runs records exactly one result when it exits: completed, process_error, aborted, killed_cpu, killed_oom, or killed_shutdown. The type label carries the egress request type for drill-down.

To correctly attribute kill reasons at process exit time, the ProcessManager now maintains a killReasons map. When the monitor triggers a kill (CPU or OOM), it stores the reason before sending the kill signal. When cmd.Wait() returns in the goroutine, it consumes the stored reason via GetKillReason and records the appropriate metric. The consume-on-read design prevents double-counting when ProcessFinished cleans up.

The KillProcess method now takes an explicit reason parameter instead of deriving the reason from error string matching, which keeps the metric labels type-safe and co-located with the constants they reference.

We currently monitor egress handler kills via a logs that counts killed vs validated log lines. This is noisy at low traffic because small denominators cause the ratio to spike, and it mixes request-time events (validation) with process-exit events (kills), which skews the ratio for long-running egresses. This PR adds a single prometheus counter, livekit_egress_handler_results_total{type, result}, that tracks terminal handler process outcomes. Every handler that runs records exactly one result when it exits: completed, process_error, aborted, killed_cpu, killed_oom, or killed_shutdown. The type label carries the egress request type for drill-down. To correctly attribute kill reasons at process exit time, the ProcessManager now maintains a killReasons map. When the monitor triggers a kill (CPU or OOM), it stores the reason before sending the kill signal. When cmd.Wait() returns in the goroutine, it consumes the stored reason via GetKillReason and records the appropriate metric. The consume-on-read design prevents double-counting when ProcessFinished cleans up. The KillProcess method now takes an explicit reason parameter instead of deriving the reason from error string matching, which keeps the metric labels type-safe and co-located with the constants they reference.

biglittlebigben · 2026-05-20T22:54:32Z

 	pm.mu.Lock()
 	h, ok := pm.activeHandlers[egressID]
 	if ok {
+		pm.killReasons[egressID] = stats.ResultAborted


In this code path, it looks like failing to call GetKillReason will cause a leak? This feels a bit fragile. Is it possible to refactor the Abort/createErr flow in a way where the activeHanders entry get removed after we read the kill reason (at what point we should be able to stash killReason in Process). Alternatively is there a way to register a "UpdateKillReason" callback that sets the prometheus value (either sync or in a goroutine) and gets called at the point where we are currently setting the killReason entry?

In this code path, it looks like failing to call GetKillReason will cause a leak?

only if ProcessFinished isn't called but if that happens we are in a deeper kimchi and leaking active handlers map and a lot more of other things. But could probably be made simpler (without the need for a map) if we delay removal from active handlers after the reason is read - will update that.

The AbortProcess path doesn't call ProcessFinished. That's why we currently remove the Process entry from activeHandler in AbortProcess. That the path I think is dangerous.

hm - not sure if this is the case - as far as i can see AbortProcess() is called from StartEgress only after handler is launched at which point it needs to return from cmd.Wait() - and invoke processEnded which calls processFinished - or I missed something?

Right, today, if we call AbortProcess, launchErr is nil and thus we call processEnded after cmd.Wait. I was focusing on the StartEgress implementation only. It is all very fragile though. This also makes me wonder why AbortEgress removes the Provess from activeHandlers.

It was introduced in https://github.com/livekit/egress/pull/769/changes
looking in the code there - cmd.wait would have been executed - removing from the list of active handlers would have been done in process finished as well - so not seeing obvious reason to for immediate removal. Maybe it was defensive. Even if we have some other call in a meanwhile the worst thing that could happen is failing IPC but we handle that gracefully as far as I can see. Maybe @frostbyte73 remembers?

Delayed removal has also one nice side effect - consider this scenario:

--- Goroutine 2 (spawned at end of launchProcess): cmd.Wait() ← blocking, waiting for handler process to exit Goroutine 1 (RPC handler, continuing StartCloudEgress): createErr := <-errChan ← IO create failed → enters case createErr != nil → AbortProcess(egressID, createErr) Inside AbortProcess: pm.mu.Lock() h.killReason = stats.ResultAborted ← reason stored on Process pm.mu.Unlock() ← lock released h.kill(err) ← tries gRPC KillEgress over Unix socket Two sub-scenarios from here: --- A) Handler is still alive: - KillEgress gRPC succeeds → handler shuts down → process exits - h.ipcHandlerClient.Close() closes the connection - Goroutine 2: cmd.Wait() returns → GetKillReason reads "aborted" from h.killReason → records metric → processEnded → ProcessFinished closes IPC (already closed, no-op), deletes from activeHandlers - Clean. B) Handler is already dead (crashed before abort): - h.kill() enters p.closed.Once() → tries KillEgress gRPC → socket is stale → blocks - h.ipcHandlerClient.Close() hasn't been reached yet (goroutine 1 stuck on line above) - Goroutine 2: cmd.Wait() returns immediately (process already dead) - GetKillReason → pm.mu.RLock() → succeeds (goroutine 1 doesn't hold the lock) → reads "aborted" → records metric - processEnded → ProcessFinished → pm.mu.Lock() → succeeds → finds handler in map → calls p.ipcHandlerClient.Close() → this closes the underlying connection → unblocks the stuck gRPC call in goroutine 1 - ProcessFinished deletes from activeHandlers, done - Goroutine 1: h.kill() returns (unblocked by the close) → h.ipcHandlerClient.Close() again → no-op, already closed - Both goroutines complete. No leak.

So scenario B fixes the goroutine leak as far as I can see

Basically with this change, the handler stays in activeHandlers, so ProcessFinished finds it, calls Close(), and unblocks the stuck kill(). I roped in claude to simulate stuck process just to make sure Close() unblocks IPC which it does:

package ipc import ( "context" "net" "os" "path" "testing" "time" "github.com/stretchr/testify/require" "google.golang.org/grpc" "google.golang.org/protobuf/types/known/emptypb" ) type blockingHandler struct { UnimplementedEgressHandlerServer blocked chan struct{} } func (h *blockingHandler) KillEgress(_ context.Context, _ *KillEgressRequest) (*emptypb.Empty, error) { <-h.blocked return &emptypb.Empty{}, nil } func TestCloseUnblocksStuckKillOnStaleSocket(t *testing.T) { tmpDir, err := os.MkdirTemp("/tmp", "ipc-test-") require.NoError(t, err) defer os.RemoveAll(tmpDir) // Start a server and create a client, then stop the server to simulate a dead handler srv := grpc.NewServer() RegisterEgressHandlerServer(srv, &blockingHandler{blocked: make(chan struct{})}) require.NoError(t, StartHandlerListener(srv, tmpDir)) client, err := NewHandlerClient(tmpDir) require.NoError(t, err) srv.GracefulStop() // Re-create the socket file without serving — simulates stale socket from dead process _, err = net.Listen("unix", path.Join(tmpDir, handlerAddress)) require.NoError(t, err) // KillEgress blocks on the stale socket killReturned := make(chan error, 1) go func() { _, err := client.KillEgress(context.Background(), &KillEgressRequest{Error: "test"}) killReturned <- err }() select { case <-killReturned: t.Fatal("KillEgress should be stuck on stale socket") case <-time.After(2 * time.Second): } // Close unblocks it require.NoError(t, client.Close()) select { case err := <-killReturned: require.Error(t, err) case <-time.After(5 * time.Second): t.Fatal("Close() did not unblock the stuck KillEgress call") } }

… for simpler end reson tracking

github-actions · 2026-05-21T22:56:21Z

AV-sync stats summary: view in run #26257363168

## Changelog ### Added - V2 egress API — unified source and output config (StartEgressRequest, MediaSource, Output, StorageConfig), per-participant audio channel routing (AudioRoute), and request-level storage overrides (#1155) - Support for egress auto retry (#1138) - Faster than realtime recording (non-live) (#1192) - Cgroup-aware memory monitoring for admission control and OOM kill (#1118) - Support for mulaw and alaw input codecs (#1105) - Add handler outcome Prometheus metric for SLO-based kill rate alerting (#1230) - Add livekit_load_ratio metric for composite load-based autoscaling (#1234) - Better multi-publisher support (#1214) - Read K8s CPU requests to automatically set GOMAXPROCS for cgroup-aware scheduling (#1204) - Backup storage observability — including storage event IPC for tracking uploads across primary/backup stores (#1120, #1184) - Enable one-shot sender report sync mode for room composite behind config (#1158) - Export InitLogger with configurable service name (#1218) - Allow specifying affinity timeout (#1104) - Instrument all leaky queues (#1116) - Instrument data loss on video leaky queues (#1109) - Add a 1G memory buffer for accepting requests (#1088) - Log rss periodically (#1151) - Log pipeline time to playing (#1103) - Log file upload stats (size, duration) (#1096) - Chrome 146 (#1173) ### Fixed - Fix s3-compat multi-part uploads (#1228) - Fix deadlock when AbortProcess/KillProcess call kill() under pm.mu lock (#1226) - Use statistical cadence check to tolerate WebRTC NetEQ time-stretching (#1223) - Fix MP3 duration metadata and CBR encoding (#1187) - Fix room composite duration to include post-participant silence tail (#1169) - Fix awaitMediaTracks race (#1165) - Fix for x264 encoder errors causing egress failures at the end of the recording (#1095) - Fix for data race inside monitor (#1091) - Fixing one of the causes of pipeline frozen errors (#1129) - Potential fix for rare issue causing tracks not to be recorded (#1130) - Handle EOS when removing source bin (#1093) - Drain appwriter (#1090) - Drain audio tracks before removing their appsource (#1085) - Heal track which enters into continuous flow flushing loop on pushing packets (#1142) - One pacer per audio track (#1235) - Set time provider only after pipeline reaches playing state (#1212) - AbortProcess is not safe to be executed by multiple goroutines (#1208) - Fixing unprotected state write from timer goroutine (#1206) - Disable PTS adjustments on sender reports for track egresses (#1134) - Disconnect from room on failing to await for some track (#1132) - Retry chrome egress navigation on chrome cert verifier change (#1194) - Return a 500 if handler fails to start (#1137) - Make sure all data is read from LK server (#1111) - Always check video dimensions (#1119) - Fail web egress on HTTP 4xx/5xx page load errors (#1106) - Avoid panic if we fail to parse a gst pipeline error (#1086) - Address unsafe int casting (#1126) - Get original room name from Start request (#1189) - Suppress noisy colorimetry warnings (#1163) - Don't log error on manifest upload failure if backup storage wasn't used (#1152) - Don't count room composite SDK source against Pulse limits (#1114) - Use a 10 min deadline for the RPC watchdog (#1207) - Remove enable_room_composite_sdk_source and always enable sdk source when conditions allow it (#1122) - Gstreamer logs based on configured level (#1161) - Reintroduce sdk logs filtering (#1213) - Using variant of OnDisconnected callback which passes a reason (#1221) - AV sync content verification in integration tests (#1215) - Fixing arm64 chrome installer (#1175) - Use separate token with delete permission (#1172) - Update module go.opentelemetry.io/otel to v1.41.0 [SECURITY] (#1201) - Update module github.com/aws/aws-sdk-go-v2/service/s3 to v1.97.3 [SECURITY] (#1178) - Update module github.com/go-jose/go-jose/v4 to v4.1.4 [SECURITY] (#1170) - Update module google.golang.org/grpc to v1.79.3 [SECURITY] (#1153) - Rename IOClient to SessionReporter (#1097) - Move ProcessManager to interface and create a fake implementation (#1147) - Enable staticcheck (#1094)

milos-lk requested a review from a team as a code owner May 20, 2026 21:24

github-actions Bot added the slack-ts:cs-devs:1779312269.587949 label May 20, 2026

biglittlebigben reviewed May 20, 2026

View reviewed changes

Delay removal from active map while process is being aborted - allows…

9e42bfd

… for simpler end reson tracking

biglittlebigben approved these changes May 21, 2026

View reviewed changes

Merge branch 'main' into handler-completed-metric

e47b2ac

milos-lk merged commit 33cc8c4 into main May 21, 2026
25 of 26 checks passed

milos-lk deleted the handler-completed-metric branch May 21, 2026 23:06

milos-lk mentioned this pull request May 28, 2026

Version 1.13.0 #1237

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add handler outcome Prometheus metric for SLO-based kill rate alerting#1230

Add handler outcome Prometheus metric for SLO-based kill rate alerting#1230
milos-lk merged 3 commits into
mainfrom
handler-completed-metric

milos-lk commented May 20, 2026 •

edited

Loading

Uh oh!

biglittlebigben May 20, 2026

Uh oh!

milos-lk May 20, 2026

Uh oh!

biglittlebigben May 21, 2026

Uh oh!

milos-lk May 21, 2026

Uh oh!

biglittlebigben May 21, 2026

Uh oh!

milos-lk May 21, 2026 •

edited

Loading

Uh oh!

milos-lk May 21, 2026

Uh oh!

milos-lk May 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

milos-lk commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

biglittlebigben May 20, 2026

Choose a reason for hiding this comment

Uh oh!

milos-lk May 20, 2026

Choose a reason for hiding this comment

Uh oh!

biglittlebigben May 21, 2026

Choose a reason for hiding this comment

Uh oh!

milos-lk May 21, 2026

Choose a reason for hiding this comment

Uh oh!

biglittlebigben May 21, 2026

Choose a reason for hiding this comment

Uh oh!

milos-lk May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

milos-lk May 21, 2026

Choose a reason for hiding this comment

Uh oh!

milos-lk May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

milos-lk commented May 20, 2026 •

edited

Loading

milos-lk May 21, 2026 •

edited

Loading

milos-lk May 21, 2026 •

edited

Loading