
Distinguish between fatal and non-fatal retryable errors #1384

Merged: 1 commit merged into master on Mar 2, 2020

Conversation

mk-livepeer (Contributor):

This wraps Retryable errors with a new fatal error type and adds a check for it, IsFatal(), analogous to IsRetryable().

  • Fatal errors should not be included in the results from verification.Verify().
  • Fatal errors should not trigger a sv.count++ in verification.Verify().
  • Rather than using sv.count, we can rely on the length of the results slice.

Fatal errors include:

  • ErrAudioMismatch
  • ErrPixelMismatch

Non-fatal:

  • ErrTampered

closes #1327
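
For illustration, the wrapping described above might look roughly like this (a hypothetical sketch; the actual type layout in the PR's verification code may differ):

package verification

import "errors"

// Retryable marks an error for which the segment may be retried with another orchestrator.
type Retryable struct{ error }

// Fatal marks a retryable error whose result must not be used at all.
type Fatal struct{ Retryable }

var (
	ErrAudioMismatch = Fatal{Retryable{errors.New("AudioMismatch")}}
	ErrPixelMismatch = Fatal{Retryable{errors.New("PixelMismatch")}}
	ErrTampered      = Retryable{errors.New("Tampered")}
)

// IsRetryable reports whether the error is wrapped as retryable (fatal or not).
func IsRetryable(err error) bool {
	switch err.(type) {
	case Retryable, Fatal:
		return true
	default:
		return false
	}
}

// IsFatal reports whether the error is a fatal retryable error.
func IsFatal(err error) bool {
	_, ok := err.(Fatal)
	return ok
}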

Review thread on verification/verify.go (outdated, resolved):
@@ -127,18 +130,17 @@ func (sv *SegmentVerifier) Verify(params *Params) (*Params, error) {
// Verification passed successfully, so use this set of params
return params, nil
}
sv.count++
Contributor:

Curious about this choice - why shouldn't a fatal error count towards the number of retries? This would result in taking more retries than the policy calls for. Under pathological conditions (e.g., perhaps there is a bug in how we count pixels), we could be retrying until we run out of orchestrators.

Contributor:

@yondonfu made this decision, perhaps he will be able to explain this better than I would.

Member:

why shouldn't a fatal error count towards the number of retries?

I originally thought it made sense to only count non-fatal errors towards the # of retries since we were only appending results that trigger a non-fatal error to sv.results. But I see now that the # of retries does not need to match up with the length of sv.results, and that always incrementing the retry count regardless of the error associated with a result would better match the expected behavior given a policy with a max # of retries. My mistake, @mk-livepeer!

On a related note, I also originally thought it made sense to only append results that trigger a non-fatal error to sv.results because these results could still be valid (perhaps they were misclassified by the verifier). However, I now realize that results that trigger ErrPixelMismatch could also be valid - the # of reported pixels might be wrong, but the video/audio content could be fine. The only error that should make a result ineligible to be appended to sv.results is ErrAudioMismatch, because if this error is triggered then we know with certainty that the audio was tampered with, so this result should not be eligible for insertion into a playlist.

So, I think the categorization of errors is actually:

  • Fatal: The orchestrator definitely did something wrong
    • Usable: The result is still usable
    • Non-usable: The result is not usable
  • Non-fatal: Unclear if the orchestrator did something wrong

Results that trigger a fatal non-usable error [1] should be excluded from sv.results. Results that trigger a fatal usable error or a non-fatal error can be included in sv.results.

[1] Open to better naming here.

TLDR (a rough code sketch follows below):

  • We should keep a running count of the # of retries that is incremented regardless of the error triggered by a result
  • We should exclude a result from sv.results if it triggers an ErrAudioMismatch error
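
In code, the two points above would look roughly like this (a hypothetical sketch assembled from the snippets quoted in this thread, not the PR's actual diff; r is the result being considered):

sv.count++ // always count the retry, regardless of which error the result triggered
if err != ErrAudioMismatch && IsRetryable(err) {
	// The result may still be usable (e.g. only the reported pixel count was wrong),
	// so keep it as a candidate for insertion into the playlist.
	sv.results = append(sv.results, r)
}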

Contributor:

I restored the count field and previous handling such that fatal errors will count towards the retries count.

Your TLDR says ErrAudioMismatch, but the same holds for ErrPixelMismatch, no? I believe the PR in its current state is correct; if not, please clarify.

Member:

I restored the count field and previous handling such that fatal errors will count towards the retries count.

👍

Your TLDR says ErrAudioMismatch but the same holds for ErrPixelMismatch, no?

See this section of my previous comment:

However, I now realize that results that trigger ErrPixelMismatch could also be valid - the # of reported pixels might be wrong, but the video/audio content could be fine. The only error that should make a result ineligible to be appended to sv.results is ErrAudioMismatch, because if this error is triggered then we know with certainty that the audio was tampered with, so this result should not be eligible for insertion into a playlist.

I think the following update to the conditional evaluated prior to appending a result to sv.results should work:

if err != ErrAudioMismatch && IsRetryable(err) {
    ...
    sv.results = append(sv.results, r)
}

Contributor:

If we make that change, then we'll no longer be using the IsFatal() function or considering the Fatal error type at all. Should I go ahead and revert it all?

Contributor:

Done. I also had to change @j0sh's test to use ErrAudioMismatch instead of ErrPixelMismatch.

yondonfu (Member) commented Feb 21, 2020:

We do still ultimately want to distinguish between fatal vs. non-fatal retryable errors because we'll want to re-add Os to the working set for non-fatal retryable errors after finishing a segment - right now we just remove Os for all retryable errors. As already mentioned, we can handle re-adding Os to the working set for non-fatal retryable errors outside of this PR, though.

I do see that, given the latest changes, the distinction between fatal vs. non-fatal retryable errors wouldn't actually be useful until we do the above. I'm ok with leaving that out of this PR then.

mkrufky (Contributor) commented Feb 14, 2020:

Thanks for the review, @j0sh. Those new tests would be interesting, but they're not directly related to this PR, and I believe they should be handled separately.

This PR is for distinguishing between fatal and non-fatal retryable errors, and should stay local to that specific function.

j0sh (Contributor) commented Feb 14, 2020:

Those new tests would be interesting, but they're not directly related to this PR

The test describes a bug with the current implementation. This is one of the reasons [1] we're introducing the notion of fatal errors right now, so I think it makes sense to check that the bug doesn't exist anymore.

I also thought we had a test that would fail if this was ever fixed (so we could just adjust the test and maintain coverage), but maybe not. Here's a test that passes right now on master, but should fail with this PR:

diff --git a/verification/verify_test.go b/verification/verify_test.go
index b506e713..c556287e 100644
--- a/verification/verify_test.go
+++ b/verification/verify_test.go
@@ -115,6 +115,11 @@ func TestVerify(t *testing.T) {
        res, err = sv.Verify(&Params{ManifestID: "stu", Results: data})
        assert.Equal(ErrPixelsAbsent, err)
        assert.Equal("mno", string(res.ManifestID)) // Still return best result
+       // Bug: higher score but wrong pixels still returns results
+       verifier.results = &Results{Score: 20.0, Pixels: []int64{-1, -2}}
+       res, err = sv.Verify(&Params{ManifestID: "vws", Results: data})
+       assert.Equal(ErrPixelMismatch, err)
+       assert.Equal("vws", string(res.ManifestID))
 
        // Check *not* retryable; should never get a result
        sv = NewSegmentVerifier(&Policy{Verifier: verifier, Retries: 1}) // reset

[1] The other reason being that Os shouldn't be re-added to the working set for fatal errors. But first we need to actually re-add Os to the working set for non-fatal errors. That can be handled outside this PR.

mk-livepeer (Contributor, Author):

I think feature #1327 was written to set the stage for addressing what you're describing in yet another separate ticket.

I would prefer to handle that in a separate PR, but if we must include it here, it's fine.

In that case, do you think you could help me out a bit with the test cases? I am not very familiar with the bug you've described in the current implementation, or with those four conditions you mentioned earlier.

If you can help out with that, it would be awesome... Otherwise, it might take me some time to fully grok your points :-)

j0sh (Contributor) commented Feb 14, 2020:

I think this feature #1327 was written to set the stage to be able to address what you're describing

I don't think anything additional is needed to address the issue described in #1096 (comment); this PR (as-is, I think) should fix the bug.

as I am not very familiar with the bug that you've described with the current implementation nor those four conditions that you mentioned earlier.

The bug is that a result with incorrect pixel counts could be accepted by the B if the segment passes the verifier with the highest score of all the retried segments. For example, this could happen if an O tries to over-charge a B.

But there's really nothing to do to fix this beyond what's already been done; the provided test case should suffice as a sanity check in the codebase. Note that as described, the test will fail if applied to this PR, but it'll pass on current master: the last equality check should be adjusted to "mno", and that'll be the correct behavior.
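
Concretely, that adjustment would just change the last assertion in the diff above to something like the following (illustrative only):

assert.Equal("mno", string(res.ManifestID)) // with this PR, the mismatched "vws" result is no longer returned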

mkrufky (Contributor) commented Feb 14, 2020:

Like so: 6b3414b?

mkrufky (Contributor) commented Feb 14, 2020:

BTW, thank you for explaining that, @j0sh - it all makes sense now :-)

mkrufky (Contributor) commented Feb 14, 2020:

Done, and squashed.


yondonfu (Member) left a comment:

I think we also need to update the if statement for the following block of code:

// Check pixel counts
if (err == nil || IsRetryable(err)) && res != nil && params.Results != nil {
	pxls := res.Pixels
	if len(pxls) != len(params.Results.Segments) {
		pxls, err = countPixelParams(params)
	}
	for i := 0; err == nil && i < len(params.Results.Segments) && i < len(pxls); i++ {
		reportedPixels := params.Results.Segments[i].Pixels
		verifiedPixels := pxls[i]
		if reportedPixels != verifiedPixels {
			err = ErrPixelMismatch
		}
	}
}

If the verifier returns ErrAudioMismatch, then we could end up running the pixel-checking code snippet above, and err could be set to ErrPixelMismatch. As a result, even though the verifier detected tampered audio, the result would still be added to sv.results (eligible for insertion into the playlist) despite the audio being tampered with.

We should be able to fix this by updating the if statement to something like:

if (err == nil || (err != ErrAudioMismatch && IsRetryable(err))) && res != nil && params.Results != nil { ... }

Would be good to add a test for this as well.

yondonfu (Member) left a comment:

LGTM let's squash & merge

 // The caller should terminate processing for non-retryable errors
-if IsRetryable(err) {
+if !IsFatal(err) && IsRetryable(err) {
Member:

So we're only defining ErrAudioMismatch as a fatal error now, which allows results that trigger ErrPixelMismatch to be appended to sv.results - ok, that works for now.

We eventually might need an error type that describes both ErrAudioMismatch and ErrPixelMismatch - this error type would indicate that the orchestrator definitely did something wrong (the broadcaster would not re-add these orchestrators to its working set - see this comment). But, we can address that outside of this PR when we actually handle re-adding orchestrators back to the working set.

yondonfu dismissed kyriediculous's stale review on March 2, 2020 14:25:

Looks like the only remaining requested change can be addressed outside of this PR.

yondonfu force-pushed the mk/fatal-retryable-errors#1327 branch from ad8926d to 22175d7 on March 2, 2020 14:27.
yondonfu merged commit d3f536a into master on Mar 2, 2020.
yondonfu deleted the mk/fatal-retryable-errors#1327 branch on March 2, 2020 14:58.