
Distinguish between fatal and non-fatal retryable errors #1384

Merged: 1 commit merged into master on Mar 2, 2020

Conversation

mk-livepeer (Contributor):

This wraps Retryable errors with a new fatal error type and adds a check for it, IsFatal(), analogous to IsRetryable().

  • Fatal errors should not be included in the results from verification.Verify().
  • Fatal errors should not trigger a sv.count++ in verification.Verify().
  • Rather than using sv.count, we can rely on the length of the results slice.

Fatal errors include:

  • ErrAudioMismatch
  • ErrPixelMismatch

Non-fatal:

  • ErrTampered

closes #1327
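
For illustration, the wrapping described above might look roughly like this (a hypothetical sketch; the actual type layout in the PR's verification code may differ):

package verification

import "errors"

// Retryable marks an error for which the segment may be retried with another orchestrator.
type Retryable struct{ error }

// Fatal marks a retryable error whose result must not be used at all.
type Fatal struct{ Retryable }

var (
	ErrAudioMismatch = Fatal{Retryable{errors.New("AudioMismatch")}}
	ErrPixelMismatch = Fatal{Retryable{errors.New("PixelMismatch")}}
	ErrTampered      = Retryable{errors.New("Tampered")}
)

// IsRetryable reports whether the error is wrapped as retryable (fatal or not).
func IsRetryable(err error) bool {
	switch err.(type) {
	case Retryable, Fatal:
		return true
	default:
		return false
	}
}

// IsFatal reports whether the error is a fatal retryable error.
func IsFatal(err error) bool {
	_, ok := err.(Fatal)
	return ok
}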

Review thread on verification/verify.go (outdated, resolved):
@@ -127,18 +130,17 @@ func (sv *SegmentVerifier) Verify(params *Params) (*Params, error) {
// Verification passed successfully, so use this set of params
return params, nil
}
sv.count++
Contributor:

Curious about this choice - why shouldn't a fatal error count towards the number of retries? This would result in taking more retries than the policy calls for. Under pathological conditions (e.g., perhaps there is a bug in how we count pixels), we could be retrying until we run out of orchestrators.

Contributor:

@yondonfu made this decision, perhaps he will be able to explain this better than I would.

Member:

why shouldn't a fatal error count towards the number of retries?

I originally thought it made sense to only count non-fatal errors towards the # of retries since we were only appending results that trigger a non-fatal error to sv.results. But I see now that the # of retries does not need to match up with the length of sv.results, and that always incrementing the retry count regardless of the error associated with a result would better match the expected behavior given a policy with a max # of retries. My mistake, @mk-livepeer!

On a related note, I also originally thought it made sense to only append results that trigger a non-fatal error to sv.results because these results could still be valid (perhaps they were misclassified by the verifier). However, I now realize that results that trigger ErrPixelMismatch could also be valid - the # of reported pixels might be wrong, but the video/audio content could be fine. The only error that should make a result ineligible to be appended to sv.results is ErrAudioMismatch, because if this error is triggered then we know with certainty that the audio was tampered with, so this result should not be eligible for insertion into a playlist.

So, I think the categorization of errors is actually:

  • Fatal: The orchestrator definitely did something wrong
    • Usable: The result is still usable
    • Non-usable: The result is not usable
  • Non-fatal: Unclear if the orchestrator did something wrong

Results that trigger a fatal non-usable error [1] should be excluded from sv.results. Results that trigger a fatal usable error or a non-fatal error can be included in sv.results.

[1] Open to better naming here.

TLDR (a rough code sketch follows below):

  • We should keep a running count of the # of retries that is incremented regardless of the error triggered by a result
  • We should exclude a result from sv.results if it triggers an ErrAudioMismatch error
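
In code, the two points above would look roughly like this (a hypothetical sketch assembled from the snippets quoted in this thread, not the PR's actual diff; r is the result being considered):

sv.count++ // always count the retry, regardless of which error the result triggered
if err != ErrAudioMismatch && IsRetryable(err) {
	// The result may still be usable (e.g. only the reported pixel count was wrong),
	// so keep it as a candidate for insertion into the playlist.
	sv.results = append(sv.results, r)
}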

Contributor:

I restored the count field and previous handling such that fatal errors will count towards the retries count.

Your TLDR says ErrAudioMismatch, but the same holds for ErrPixelMismatch, no? I believe the PR in its current state is correct; if not, please clarify.

Member:

I restored the count field and previous handling such that fatal errors will count towards the retries count.

👍

Your TLDR says ErrAudioMismatch but the same holds for ErrPixelMismatch, no?

See this section of my previous comment:

However, I now realize that results that trigger ErrPixelMismatch could also be valid - the # of reported pixels might be wrong, but the video/audio content could be fine. The only error that should make a result ineligible to be appended to sv.results is ErrAudioMismatch, because if this error is triggered then we know with certainty that the audio was tampered with, so this result should not be eligible for insertion into a playlist.

I think the following update to the conditional evaluated prior to appending a result to sv.results should work:

if err != ErrAudioMismatch && IsRetryable(err) {
    ...
    sv.results = append(sv.results, r)
}

Contributor:

If we make that change, then we'll no longer be using the IsFatal() function or considering the Fatal error type at all. Should I go ahead and revert it all?

Contributor:

Done. I also had to change @j0sh's test to use ErrAudioMismatch instead of ErrPixelMismatch.

yondonfu (Member) commented Feb 21, 2020:

We do still ultimately want to distinguish between fatal vs. non-fatal retryable errors because we'll want to re-add Os to the working set for non-fatal retryable errors after finishing a segment - right now we just remove Os for all retryable errors. As already mentioned, we can handle re-adding Os to the working set for non-fatal retryable errors outside of this PR, though.

I do see that, given the latest changes, the distinction between fatal vs. non-fatal retryable errors wouldn't actually be useful until we do the above. I'm ok with leaving that out of this PR then.

mkrufky (Contributor) commented Feb 14, 2020:

Thanks for the review, @j0sh. Those new tests would be interesting, but they're not directly related to this PR, and I believe they should be handled separately.

This PR is for distinguishing between fatal and non-fatal retryable errors, and should stay local to that specific function.

j0sh (Contributor) commented Feb 14, 2020:

Those new tests would be interesting, but they're not directly related to this PR

The test describes a bug with the current implementation. This is one of the reasons [1] we're introducing the notion of fatal errors right now, so I think it makes sense to check that the bug doesn't exist anymore.

I also thought we had a test that would fail if this was ever fixed (so we could just adjust the test and maintain coverage), but maybe not. Here's a test that passes right now on master, but should fail with this PR:

diff --git a/verification/verify_test.go b/verification/verify_test.go
index b506e713..c556287e 100644
--- a/verification/verify_test.go
+++ b/verification/verify_test.go
@@ -115,6 +115,11 @@ func TestVerify(t *testing.T) {
        res, err = sv.Verify(&Params{ManifestID: "stu", Results: data})
        assert.Equal(ErrPixelsAbsent, err)
        assert.Equal("mno", string(res.ManifestID)) // Still return best result
+       // Bug: higher score but wrong pixels still returns results
+       verifier.results = &Results{Score: 20.0, Pixels: []int64{-1, -2}}
+       res, err = sv.Verify(&Params{ManifestID: "vws", Results: data})
+       assert.Equal(ErrPixelMismatch, err)
+       assert.Equal("vws", string(res.ManifestID))
 
        // Check *not* retryable; should never get a result
        sv = NewSegmentVerifier(&Policy{Verifier: verifier, Retries: 1}) // reset

[1] The other reason being that Os shouldn't be re-added to the working set for fatal errors. But first we need to actually re-add Os to the working set for non-fatal errors. That can be handled outside this PR.

mk-livepeer (Contributor, Author):

I think feature #1327 was written to set the stage for addressing what you're describing in yet another separate ticket.

I would prefer to handle that in a separate PR, but if we must include it here, it's fine.

In that case, do you think you could help me out a bit with the test cases? I am not very familiar with the bug you've described in the current implementation, or with those four conditions you mentioned earlier.

If you can help out with that, it would be awesome... Otherwise, it might take me some time to fully grok your points :-)

j0sh (Contributor) commented Feb 14, 2020:

I think this feature #1327 was written to set the stage to be able to address what you're describing

I don't think anything additional is needed to address the issue described in #1096 (comment); this PR (as-is, I think) should fix the bug.

as I am not very familiar with the bug that you've described with the current implementation nor those four conditions that you mentioned earlier.

The bug is that a result with incorrect pixel counts could be accepted by the B if the segment passes the verifier with the highest score of all the retried segments. For example, this could happen if an O tries to over-charge a B.

But there's really nothing to do to fix this beyond what's already been done; the provided test case should suffice as a sanity check in the codebase. Note that as described, the test will fail if applied to this PR, but it'll pass on current master: the last equality check should be adjusted to "mno", and that'll be the correct behavior.
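
Concretely, that adjustment would just change the last assertion in the diff above to something like the following (illustrative only):

assert.Equal("mno", string(res.ManifestID)) // with this PR, the mismatched "vws" result is no longer returned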

mkrufky (Contributor) commented Feb 14, 2020:

Like so: 6b3414b?

mkrufky (Contributor) commented Feb 14, 2020:

BTW, thank you for explaining that, @j0sh - it all makes sense now :-)

mkrufky (Contributor) commented Feb 14, 2020:

Done, and squashed.


yondonfu (Member) left a comment:

I think we also need to update the if statement for the following block of code:

// Check pixel counts
if (err == nil || IsRetryable(err)) && res != nil && params.Results != nil {
	pxls := res.Pixels
	if len(pxls) != len(params.Results.Segments) {
		pxls, err = countPixelParams(params)
	}
	for i := 0; err == nil && i < len(params.Results.Segments) && i < len(pxls); i++ {
		reportedPixels := params.Results.Segments[i].Pixels
		verifiedPixels := pxls[i]
		if reportedPixels != verifiedPixels {
			err = ErrPixelMismatch
		}
	}
}

If the verifier returns ErrAudioMismatch, then we could end up running the pixel-checking code snippet above, and err could be set to ErrPixelMismatch. As a result, even though the verifier detected tampered audio, the result would still be added to sv.results (eligible for insertion into the playlist) despite the audio being tampered with.

We should be able to fix this by updating the if statement to something like:

if (err == nil || (err != ErrAudioMismatch && IsRetryable(err))) && res != nil && params.Results != nil { ... }

Would be good to add a test for this as well.

yondonfu (Member) left a comment:

LGTM let's squash & merge

 // The caller should terminate processing for non-retryable errors
-if IsRetryable(err) {
+if !IsFatal(err) && IsRetryable(err) {
Member:

So we're only defining ErrAudioMismatch as a fatal error now, which allows results that trigger ErrPixelMismatch to be appended to sv.results - ok, that works for now.

We eventually might need an error type that describes both ErrAudioMismatch and ErrPixelMismatch - this error type would indicate that the orchestrator definitely did something wrong (the broadcaster would not re-add these orchestrators to its working set - see this comment). But, we can address that outside of this PR when we actually handle re-adding orchestrators back to the working set.

yondonfu dismissed kyriediculous's stale review on March 2, 2020 14:25:

Looks like the only remaining requested change can be addressed outside of this PR.

yondonfu force-pushed the mk/fatal-retryable-errors#1327 branch from ad8926d to 22175d7 on March 2, 2020 14:27.
yondonfu merged commit d3f536a into master on Mar 2, 2020.
yondonfu deleted the mk/fatal-retryable-errors#1327 branch on March 2, 2020 14:58.