Skip to content

Add handler outcome Prometheus metric for SLO-based kill rate alerting#1230

Merged
milos-lk merged 3 commits into
mainfrom
handler-completed-metric
May 21, 2026
Merged

Add handler outcome Prometheus metric for SLO-based kill rate alerting#1230
milos-lk merged 3 commits into
mainfrom
handler-completed-metric

Conversation

@milos-lk
Copy link
Copy Markdown
Contributor

@milos-lk milos-lk commented May 20, 2026

We currently monitor egress handler kills via a logs that counts killed vs validated log lines. This is noisy at low traffic because small denominators cause the ratio to spike, and it mixes request-time events (validation) with process-exit events (kills), which skews the ratio for long-running egresses.

This PR adds a single prometheus counter, livekit_egress_handler_results_total{type, result}, that tracks terminal handler process outcomes. Every handler that runs records exactly one result when it exits: completed, process_error, aborted, killed_cpu, killed_oom, or killed_shutdown. The type label carries the egress request type for drill-down.

To correctly attribute kill reasons at process exit time, the ProcessManager now maintains a killReasons map. When the monitor triggers a kill (CPU or OOM), it stores the reason before sending the kill signal. When cmd.Wait() returns in the goroutine, it consumes the stored reason via GetKillReason and records the appropriate metric. The consume-on-read design prevents double-counting when ProcessFinished cleans up.

The KillProcess method now takes an explicit reason parameter instead of deriving the reason from error string matching, which keeps the metric labels type-safe and co-located with the constants they reference.

We currently monitor egress handler kills via a logs that counts killed vs  validated log lines. This is noisy at low traffic because small denominators cause the ratio to spike, and it mixes request-time events (validation) with process-exit events  (kills), which skews the ratio for long-running egresses.

This PR adds a single prometheus counter, livekit_egress_handler_results_total{type, result}, that tracks terminal handler process outcomes. Every handler that runs records exactly one result when it exits: completed, process_error, aborted, killed_cpu, killed_oom, or killed_shutdown. The type label carries the egress request type for drill-down.

To correctly attribute kill reasons at process exit time, the ProcessManager now maintains a killReasons map. When the monitor triggers a kill (CPU or OOM), it stores the reason before sending the kill signal. When cmd.Wait() returns in the goroutine, it consumes the stored reason via GetKillReason and records the appropriate metric. The  consume-on-read design prevents double-counting when ProcessFinished cleans up.

The KillProcess method now takes an explicit reason parameter instead of deriving the  reason from error string matching, which keeps the metric labels type-safe and co-located with the constants they reference.
Comment thread pkg/service/process.go Outdated
pm.mu.Lock()
h, ok := pm.activeHandlers[egressID]
if ok {
pm.killReasons[egressID] = stats.ResultAborted
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this code path, it looks like failing to call GetKillReason will cause a leak? This feels a bit fragile. Is it possible to refactor the Abort/createErr flow in a way where the activeHanders entry get removed after we read the kill reason (at what point we should be able to stash killReason in Process). Alternatively is there a way to register a "UpdateKillReason" callback that sets the prometheus value (either sync or in a goroutine) and gets called at the point where we are currently setting the killReason entry?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this code path, it looks like failing to call GetKillReason will cause a leak?

only if ProcessFinished isn't called but if that happens we are in a deeper kimchi and leaking active handlers map and a lot more of other things. But could probably be made simpler (without the need for a map) if we delay removal from active handlers after the reason is read - will update that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AbortProcess path doesn't call ProcessFinished. That's why we currently remove the Process entry from activeHandler in AbortProcess. That the path I think is dangerous.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm - not sure if this is the case - as far as i can see AbortProcess() is called from StartEgress only after handler is launched at which point it needs to return from cmd.Wait() - and invoke processEnded which calls processFinished - or I missed something?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, today, if we call AbortProcess, launchErr is nil and thus we call processEnded after cmd.Wait. I was focusing on the StartEgress implementation only. It is all very fragile though. This also makes me wonder why AbortEgress removes the Provess from activeHandlers.

Copy link
Copy Markdown
Contributor Author

@milos-lk milos-lk May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was introduced in https://github.com/livekit/egress/pull/769/changes
looking in the code there - cmd.wait would have been executed - removing from the list of active handlers would have been done in process finished as well - so not seeing obvious reason to for immediate removal. Maybe it was defensive. Even if we have some other call in a meanwhile the worst thing that could happen is failing IPC but we handle that gracefully as far as I can see. Maybe @frostbyte73 remembers?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delayed removal has also one nice side effect - consider this scenario:

  ---                                                                                     
  Goroutine 2 (spawned at end of launchProcess):                                          
  cmd.Wait()  ← blocking, waiting for handler process to exit                             
                                                                                          
  Goroutine 1 (RPC handler, continuing StartCloudEgress):                                 
  createErr := <-errChan  ← IO create failed                                              
  → enters case createErr != nil            
  → AbortProcess(egressID, createErr)                                                     
                                                                                        
  Inside AbortProcess:                                                                    
  pm.mu.Lock()                                                                          
  h.killReason = stats.ResultAborted   ← reason stored on Process                         
  pm.mu.Unlock()                       ← lock released                                  
                                                                                          
  h.kill(err)                          ← tries gRPC KillEgress over Unix socket           
                                                                                          
  Two sub-scenarios from here:                                                            
                                                                                          
  ---
  A) Handler is still alive:                                                              
  - KillEgress gRPC succeeds → handler shuts down → process exits                       
  - h.ipcHandlerClient.Close() closes the connection             
  - Goroutine 2: cmd.Wait() returns → GetKillReason reads "aborted" from h.killReason →   
  records metric → processEnded → ProcessFinished closes IPC (already closed, no-op),     
  deletes from activeHandlers                                                             
  - Clean.                                                                                
                                                                                          
  B) Handler is already dead (crashed before abort):                                    
  - h.kill() enters p.closed.Once() → tries KillEgress gRPC → socket is stale → blocks    
  - h.ipcHandlerClient.Close() hasn't been reached yet (goroutine 1 stuck on line above)  
  - Goroutine 2: cmd.Wait() returns immediately (process already dead)                    
  - GetKillReason → pm.mu.RLock() → succeeds (goroutine 1 doesn't hold the lock) → reads  
  "aborted" → records metric                                                            
  - processEnded → ProcessFinished → pm.mu.Lock() → succeeds → finds handler in map →     
  calls p.ipcHandlerClient.Close() → this closes the underlying connection → unblocks the 
  stuck gRPC call in goroutine 1                                                          
  - ProcessFinished deletes from activeHandlers, done                                   
  - Goroutine 1: h.kill() returns (unblocked by the close) → h.ipcHandlerClient.Close()   
  again → no-op, already closed                                                         
  - Both goroutines complete. No leak. 

So scenario B fixes the goroutine leak as far as I can see

Copy link
Copy Markdown
Contributor Author

@milos-lk milos-lk May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically with this change, the handler stays in activeHandlers, so ProcessFinished finds it, calls Close(), and unblocks the stuck kill(). I roped in claude to simulate stuck process just to make sure Close() unblocks IPC which it does:

package ipc

import (
	"context"
	"net"
	"os"
	"path"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"google.golang.org/grpc"
	"google.golang.org/protobuf/types/known/emptypb"
)

type blockingHandler struct {
	UnimplementedEgressHandlerServer
	blocked chan struct{}
}

func (h *blockingHandler) KillEgress(_ context.Context, _ *KillEgressRequest) (*emptypb.Empty, error) {
	<-h.blocked
	return &emptypb.Empty{}, nil
}

func TestCloseUnblocksStuckKillOnStaleSocket(t *testing.T) {
	tmpDir, err := os.MkdirTemp("/tmp", "ipc-test-")
	require.NoError(t, err)
	defer os.RemoveAll(tmpDir)

	// Start a server and create a client, then stop the server to simulate a dead handler
	srv := grpc.NewServer()
	RegisterEgressHandlerServer(srv, &blockingHandler{blocked: make(chan struct{})})
	require.NoError(t, StartHandlerListener(srv, tmpDir))

	client, err := NewHandlerClient(tmpDir)
	require.NoError(t, err)

	srv.GracefulStop()

	// Re-create the socket file without serving — simulates stale socket from dead process
	_, err = net.Listen("unix", path.Join(tmpDir, handlerAddress))
	require.NoError(t, err)

	// KillEgress blocks on the stale socket
	killReturned := make(chan error, 1)
	go func() {
		_, err := client.KillEgress(context.Background(), &KillEgressRequest{Error: "test"})
		killReturned <- err
	}()

	select {
	case <-killReturned:
		t.Fatal("KillEgress should be stuck on stale socket")
	case <-time.After(2 * time.Second):
	}

	// Close unblocks it
	require.NoError(t, client.Close())

	select {
	case err := <-killReturned:
		require.Error(t, err)
	case <-time.After(5 * time.Second):
		t.Fatal("Close() did not unblock the stuck KillEgress call")
	}
}

@github-actions
Copy link
Copy Markdown
Contributor

AV-sync stats summary: view in run #26257363168

@milos-lk milos-lk merged commit 33cc8c4 into main May 21, 2026
25 of 26 checks passed
@milos-lk milos-lk deleted the handler-completed-metric branch May 21, 2026 23:06
@milos-lk milos-lk mentioned this pull request May 28, 2026
milos-lk added a commit that referenced this pull request May 28, 2026
## Changelog

  ### Added
  - V2 egress API — unified source and output config (StartEgressRequest, MediaSource, Output,
  StorageConfig), per-participant audio channel routing (AudioRoute), and request-level storage
  overrides (#1155)
  - Support for egress auto retry (#1138)
  - Faster than realtime recording (non-live) (#1192)
  - Cgroup-aware memory monitoring for admission control and OOM kill (#1118)
  - Support for mulaw and alaw input codecs (#1105)
  - Add handler outcome Prometheus metric for SLO-based kill rate alerting (#1230)
  - Add livekit_load_ratio metric for composite load-based autoscaling (#1234)
  - Better multi-publisher support (#1214)
  - Read K8s CPU requests to automatically set GOMAXPROCS for cgroup-aware scheduling (#1204)
  - Backup storage observability — including storage event IPC for tracking uploads across
  primary/backup stores (#1120, #1184)
  - Enable one-shot sender report sync mode for room composite behind config (#1158)
  - Export InitLogger with configurable service name (#1218)
  - Allow specifying affinity timeout (#1104)
  - Instrument all leaky queues (#1116)
  - Instrument data loss on video leaky queues (#1109)
  - Add a 1G memory buffer for accepting requests (#1088)
  - Log rss periodically (#1151)
  - Log pipeline time to playing (#1103)
  - Log file upload stats (size, duration) (#1096)
  - Chrome 146 (#1173)

  ### Fixed
  - Fix s3-compat multi-part uploads (#1228)
  - Fix deadlock when AbortProcess/KillProcess call kill() under pm.mu lock (#1226)
  - Use statistical cadence check to tolerate WebRTC NetEQ time-stretching (#1223)
  - Fix MP3 duration metadata and CBR encoding (#1187)
  - Fix room composite duration to include post-participant silence tail (#1169)
  - Fix awaitMediaTracks race (#1165)
  - Fix for x264 encoder errors causing egress failures at the end of the recording (#1095)
  - Fix for data race inside monitor (#1091)
  - Fixing one of the causes of pipeline frozen errors (#1129)
  - Potential fix for rare issue causing tracks not to be recorded (#1130)
  - Handle EOS when removing source bin (#1093)
  - Drain appwriter (#1090)
  - Drain audio tracks before removing their appsource (#1085)
  - Heal track which enters into continuous flow flushing loop on pushing packets (#1142)
  - One pacer per audio track (#1235)
  - Set time provider only after pipeline reaches playing state (#1212)
  - AbortProcess is not safe to be executed by multiple goroutines (#1208)
  - Fixing unprotected state write from timer goroutine (#1206)
  - Disable PTS adjustments on sender reports for track egresses (#1134)
  - Disconnect from room on failing to await for some track (#1132)
  - Retry chrome egress navigation on chrome cert verifier change (#1194)
  - Return a 500 if handler fails to start (#1137)
  - Make sure all data is read from LK server (#1111)
  - Always check video dimensions (#1119)
  - Fail web egress on HTTP 4xx/5xx page load errors (#1106)
  - Avoid panic if we fail to parse a gst pipeline error (#1086)
  - Address unsafe int casting (#1126)
  - Get original room name from Start request (#1189)
  - Suppress noisy colorimetry warnings (#1163)
  - Don't log error on manifest upload failure if backup storage wasn't used (#1152)
  - Don't count room composite SDK source against Pulse limits (#1114)
  - Use a 10 min deadline for the RPC watchdog (#1207)
  - Remove enable_room_composite_sdk_source and always enable sdk source when conditions allow
  it (#1122)
  - Gstreamer logs based on configured level (#1161)
  - Reintroduce sdk logs filtering (#1213)
  - Using variant of OnDisconnected callback which passes a reason (#1221)
  - AV sync content verification in integration tests (#1215)
  - Fixing arm64 chrome installer (#1175)
  - Use separate token with delete permission (#1172)
  - Update module go.opentelemetry.io/otel to v1.41.0 [SECURITY] (#1201)
  - Update module github.com/aws/aws-sdk-go-v2/service/s3 to v1.97.3 [SECURITY] (#1178)
  - Update module github.com/go-jose/go-jose/v4 to v4.1.4 [SECURITY] (#1170)
  - Update module google.golang.org/grpc to v1.79.3 [SECURITY] (#1153)
  - Rename IOClient to SessionReporter (#1097)
  - Move ProcessManager to interface and create a fake implementation (#1147)
  - Enable staticcheck (#1094)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants