Distinguish between fatal and non-fatal retryable errors #1384
Conversation
@@ -127,18 +130,17 @@ func (sv *SegmentVerifier) Verify(params *Params) (*Params, error) {
	// Verification passed successfully, so use this set of params
	return params, nil
}
sv.count++
Curious about this choice - why shouldn't a fatal error count towards the number of retries? This would result in taking more retries than the policy calls for. Under pathological conditions (eg, perhaps there is a bug in how we count pixels), we could be retrying until we run out of orchestrators.
@yondonfu made this decision, perhaps he will be able to explain this better than I would.
why shouldn't a fatal error count towards the number of retries?
I originally thought it made sense to only count non-fatal errors towards the # of retries since we were only appending results that trigger a non-fatal error to sv.results. But, I see now that the # of retries does not need to match up with the length of sv.results and that always incrementing the retry count regardless of the error associated with a result would better match the expected behavior given a policy with a max # of retries. My mistake @mk-livepeer !
On a related note, I also originally thought it made sense to only append results that trigger a non-fatal error to sv.results because these results could still be valid (perhaps they were misclassified by the verifier). However, I now realize that results that trigger ErrPixelMismatch could also be valid - the # of reported pixels might be wrong, but the video/audio content could be fine. The only error that should make a result ineligible to be appended to sv.results is ErrAudioMismatch, because if this error is triggered then we know with certainty that the audio was tampered with, so this result should not be eligible for insertion into a playlist.
So, I think the categorization of errors is actually:
- Fatal: The orchestrator definitely did something wrong
- Usable: The result is still usable
- Non-usable: The result is not usable
- Non-fatal: Unclear if the orchestrator did something wrong
Results that trigger a fatal non-usable error [1] should be excluded from sv.results. Results that trigger a fatal usable error or a non-fatal error can be included in sv.results.
[1] Open to better naming here.
TLDR:
- We should keep a running count of the # of retries that is incremented regardless of the error triggered by a result
- We should exclude a result from sv.results if it triggers an ErrAudioMismatch error
I restored the count field and previous handling such that fatal errors will count towards the retries count.
Your TLDR says ErrAudioMismatch, but the same holds for ErrPixelMismatch, no? I believe the PR in its current state is correct; if not, please clarify.
I restored the count field and previous handling such that fatal errors will count towards the retries count.
👍
Your TLDR says ErrAudioMismatch but the same holds for ErrPixelMismatch, no?
See this section of my previous comment:
However, I now realize that results that trigger ErrPixelMismatch could also be valid - the # of reported pixels might be wrong, but the video/audio content could be fine. The only error that should make a result ineligible to be appended to sv.results is ErrAudioMismatch because if this error is triggered then we know with certainty that the audio was tampered so this result should not be eligible for insertion into a playlist.
I think the following update to the conditional evaluated prior to appending a result to sv.results should work:
if err != ErrAudioMismatch && IsRetryable(err) {
...
sv.results = append(sv.results, r)
}
If we make that change, then we'll no longer be using the IsFatal() function nor considering the Fatal error type at all. Should I go ahead and revert it all?
Done. Also had to change @j0sh 's test to use ErrAudioMismatch instead of ErrPixelMismatch.
We do still ultimately want to distinguish between fatal vs. non-fatal retryable errors because we'll want to re-add Os back to the working set for non-fatal retryable errors after finishing a segment - right now we just remove Os for all retryable errors. As already mentioned, we can handle re-adding Os back to the working set for non-fatal retryable errors outside of this PR though.
I do see that, given the latest changes, the distinction between fatal vs. non-fatal retryable errors wouldn't actually be useful until we do the above. I'm ok with leaving that out of this PR then.
Force-pushed from d44d67f to bb32d4c
Thanks for the review, @j0sh. Those new tests would be interesting, but they're not directly related to this PR, and I believe that they should be handled separately. This PR is for distinguishing between fatal and non-fatal retryable errors, and should stay local to that specific function.
The test describes a bug with the current implementation. This is one of the reasons [1] we're introducing the notion of fatal errors right now, so I think it makes sense to check that the bug doesn't exist anymore. I also thought we had a test that would fail if this was ever fixed (so we could just adjust the test and maintain coverage), but maybe not. Here's a test that passes right now on master, but should fail with this PR:

diff --git a/verification/verify_test.go b/verification/verify_test.go
index b506e713..c556287e 100644
--- a/verification/verify_test.go
+++ b/verification/verify_test.go
@@ -115,6 +115,11 @@ func TestVerify(t *testing.T) {
 	res, err = sv.Verify(&Params{ManifestID: "stu", Results: data})
 	assert.Equal(ErrPixelsAbsent, err)
 	assert.Equal("mno", string(res.ManifestID)) // Still return best result
+	// Bug: higher score but wrong pixels still returns results
+	verifier.results = &Results{Score: 20.0, Pixels: []int64{-1, -2}}
+	res, err = sv.Verify(&Params{ManifestID: "vws", Results: data})
+	assert.Equal(ErrPixelMismatch, err)
+	assert.Equal("vws", string(res.ManifestID))
 	// Check *not* retryable; should never get a result
 	sv = NewSegmentVerifier(&Policy{Verifier: verifier, Retries: 1}) // reset

[1] The other reason being that Os shouldn't be re-added to the working set for fatal errors. But first we need to actually re-add Os to the working set for non-fatal errors. That can be handled outside this PR.
I think this feature #1327 was written to set the stage to be able to address what you're describing in yet another separate ticket. I would prefer to handle that in a separate PR, but if we must include it here, it's fine. In which case, do you think you can help me out a bit with the test cases, as I am not very familiar with the bug that you've described with the current implementation, nor with those four conditions that you mentioned earlier? If you can help out with that, it would be awesome... Otherwise, it might take me some time to fully grok your points :-)
Don't think there should be anything additional to address the issue described in #1096 (comment); this PR (as-is, I think) should fix the bug.
The bug is that a result with incorrect pixel counts could be accepted by the B if the segment passes the verifier with the highest score of all the retried segments. For example, this could happen if an O tries to over-charge a B. But there's really nothing to do to fix this beyond what's already been done; the provided test case should suffice as a sanity check in the codebase. Note that as described, the test will fail if applied to this PR, but it'll pass on current master: the last equality check should be adjusted to "mno", and that'll be the correct behavior.
Like so: 6b3414b?
BTW, thank you for explaining that @j0sh - it all makes sense now :-)
Force-pushed from c28822c to 589ca34
Done, and squashed.
Force-pushed from 589ca34 to 1363e75
I think we also need to update the if statement for the following block of code:
// Check pixel counts
if (err == nil || IsRetryable(err)) && res != nil && params.Results != nil {
	pxls := res.Pixels
	if len(pxls) != len(params.Results.Segments) {
		pxls, err = countPixelParams(params)
	}
	for i := 0; err == nil && i < len(params.Results.Segments) && i < len(pxls); i++ {
		reportedPixels := params.Results.Segments[i].Pixels
		verifiedPixels := pxls[i]
		if reportedPixels != verifiedPixels {
			err = ErrPixelMismatch
		}
	}
}
If the verifier returns ErrAudioMismatch, then we could end up running the pixel checking code snippet above and err could be set to ErrPixelMismatch. As a result, even though the verifier detected tampered audio, the result would still be added to sv.results (eligible for insertion into the playlist) despite the audio being tampered.
We should be able to fix this by updating the if statement to something like:
if (err == nil || (err != ErrAudioMismatch && IsRetryable(err))) && res != nil && params.Results != nil { ... }
Would be good to add a test for this as well.
LGTM let's squash & merge
 // The caller should terminate processing for non-retryable errors
-	if IsRetryable(err) {
+	if !IsFatal(err) && IsRetryable(err) {
So we're only defining ErrAudioMismatch as a fatal error now, which allows results that trigger ErrPixelMismatch to be appended to sv.results - ok, that works for now.
We eventually might need an error type that describes both ErrAudioMismatch and ErrPixelMismatch - this error type would indicate that the orchestrator definitely did something wrong (the broadcaster would not re-add these orchestrators to its working set - see this comment). But, we can address that outside of this PR when we actually handle re-adding orchestrators back to the working set.
Looks like the only remaining requested change can be addressed outside of this PR.
Force-pushed from ad8926d to 22175d7
This wraps Retryable errors with a new fatal error type and adds a check for it just like IsRetryable(), called IsFatal().
- Fatal errors should not be included in the results from verification.Verify()
- Fatal errors should not trigger a sv.count++ in verification.Verify(). Rather than using sv.count, we can rely on the length of the results slice.
Fatal errors include:
- ErrAudioMismatch
- ErrPixelMismatch
Non-fatal:
- ErrTampered
closes #1327