Fix Orchestrator Swaps #2885

Merged
merged 3 commits into from Oct 10, 2023

Contributor

@leszko leszko commented Oct 9, 2023

Related to two Linear Tickets:
fix https://linear.app/livepeer/issue/VID-430/investigate-why-stream-had-highly-variable-transcode-times
fix https://linear.app/livepeer/issue/VID-429/understand-orchestrator-swap

Explanation for https://linear.app/livepeer/issue/VID-430/investigate-why-stream-had-highly-variable-transcode-times
The segments are very short (1s). We calculate the in-memory latency score as transcoding RTT / segment duration, and if this value is greater than 1, we swap Orchestrators. For such short segments, the score is almost always higher than 1. The quick fix is to enforce a minimum segment duration in this calculation. I've set it to 1.5s.

Explanation for https://linear.app/livepeer/issue/VID-429/understand-orchestrator-swap
It's hard to tell why the Orchestrators (Os) were swapped, but I believe it's because a segment was in flight for longer than 1.5s, which causes the Orchestrator swap. This PR does not fix anything with respect to that, but it adds logs that will help analyze similar cases in the future.

@leszko leszko requested a review from thomshutt October 9, 2023 13:51

linear bot commented Oct 9, 2023

VID-430 Investigate why stream had highly variable transcode times

This caused the stream to have over 50 swaps in total.

A portion of the logs can be viewed here:

https://eu-metrics-monitoring.livepeer.live/grafana/explore?orgId=1&panes=%7B%22_cn%22:%7B%22datasource%22:%22P8E80F9AEF21F6940%22,%22queries%22:%5B%7B%22exemplar%22:true,%22expr%22:%22%7Bapp%3D~%5C%22prod-livepeer-broadcaster-.%2B%5C%22%7D%20%7C~%20%5C%22547b9d1d-649f-4e81-bf50-b0345fe214e9%5C%22%20%7C~%20%5C%22took%3D%5C%22%20%7C%20logfmt%22,%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22P8E80F9AEF21F6940%22%7D%7D%5D,%22range%22:%7B%22from%22:%221695799550000%22,%22to%22:%221695800406000%22%7D%7D%7D&schemaVersion=1

The client is using 1s segments, which already makes the window for a successful transcode very tight, but we're also seeing transcode times vary wildly, between 1 and 4 seconds per segment.

VID-429 Understand Orchestrator swap

https://eu-metrics-monitoring.livepeer.live/grafana/explore?orgId=1&panes=%7B%22_cn%22:%7B%22datasource%22:%22P8E80F9AEF21F6940%22,%22queries%22:%5B%7B%22exemplar%22:true,%22expr%22:%22%7Bapp%3D~%5C%22prod-livepeer-broadcaster-.%2B%5C%22%7D%20%7C~%20%5C%2273d7c164-4564-4eca-ae23-58b0a161286f%5C%22%20%22,%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22P8E80F9AEF21F6940%22%7D%7D%5D,%22range%22:%7B%22from%22:%221695803744140%22,%22to%22:%221695803779198%22%7D%7D%7D&schemaVersion=1

It looks like we swapped at segment 1621, but I can't spot an obvious reason why - if we're missing logging that would let us debug this then let's add it.

@leszko leszko requested a review from mjh1 October 9, 2023 13:51
@thomshutt
Contributor

@leszko we should already have the 1.5s floor from #2837, so it'd be good to understand why that isn't working

@leszko
Contributor Author

leszko commented Oct 9, 2023

> @leszko we should already have the 1.5s floor from #2837, so it'd be good to understand why that isn't working

This PR and #2837 are two separate things. I'll try to explain it here.

  • #2837 is about "how long you wait when there is a segment in flight". Let's say you've sent segment 1 to O, then you want to send segment 2, but O hasn't returned the renditions for segment 1 yet. So there is 1 segment in flight. Should you send segment 2 to the same O, or should you swap to a new one? #2837 changes this so that we allow at least 1.5s before we swap to another O.
  • This PR is about the in-memory latency score. Let's say you've sent segment 1 to O, and O has returned the renditions for segment 1. So there are no segments in flight. You want to send segment 2: should you send it to the same O, or should you swap? The in-memory latency score decides that, and we calculate it as RTT / segment duration.

@codecov

codecov bot commented Oct 9, 2023

Codecov Report

Merging #2885 (6454a6a) into master (45a03e7) will increase coverage by 0.04271%.
The diff coverage is 100.00000%.


@@                 Coverage Diff                 @@
##              master       #2885         +/-   ##
===================================================
+ Coverage   56.38155%   56.42426%   +0.04271%     
===================================================
  Files             89          89                 
  Lines          19384       19403         +19     
===================================================
+ Hits           10929       10948         +19     
  Misses          7849        7849                 
  Partials         606         606                 
Files                   Coverage Δ
server/broadcast.go     77.90414% <100.00000%> (+0.29097%) ⬆️
server/segment_rpc.go   78.36257% <100.00000%> (+0.09532%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@thomshutt
Contributor

@leszko gotcha, thanks for the explanation!

@leszko leszko merged commit 32d5d45 into master Oct 10, 2023
18 checks passed
@leszko leszko deleted the rafal/fix-orch-swaps branch October 10, 2023 12:35
eliteprox pushed a commit to eliteprox/go-livepeer that referenced this pull request Feb 21, 2024