Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in reflector not recovering from "Too large resource version"… #92537

Merged

Conversation

wojtek-t
Copy link
Member

Ref #91073

Fix bug in reflector that couldn't recover from "Too large resource version" errors

/kind bug

@wojtek-t wojtek-t added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jun 26, 2020
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 26, 2020
@k8s-ci-robot k8s-ci-robot added area/apiserver kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 26, 2020
@@ -356,10 +356,7 @@ func TestTooLargeResourceVersionList(t *testing.T) {

result := &example.PodList{}
err = cacher.List(context.TODO(), "pods/ns", storage.ListOptions{ResourceVersion: listRV, Predicate: storage.Everything}, result)
if !errors.IsTimeout(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeout error is no longer checked ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's checked as part of IsTooLargeResourceVersion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems a bit strange to me... timeout reason and TooLargeResourceVersion cause seem orthogonal to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted

@fejta-bot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@wojtek-t
Copy link
Member Author

/retest

@tedyu
Copy link
Contributor

tedyu commented Jun 26, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2020
Comment on lines 569 to 571
if !IsTimeout(err) {
return false
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a little strange... why does a CauseTypeResourceVersionTooLarge always have to be a timeout?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is how we create the "TooLargeResourceVersion" error.

Do you suggest dropping it and leaving it rely only on the cause?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually - I played a bit more with the PR and removed this part completely. The cause type being the official type seems to be enough for the purpose of this PR and makes it easier to cherrypick.

Comment on lines 572 to 578
if status := APIStatus(nil); errors.As(err, &status) && status.Status().Details != nil {
for _, cause := range status.Status().Details.Causes {
if cause.Type == metav1.CauseTypeResourceVersionTooLarge {
return true
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be simplified to this:

return HasStatusCause(metav1.CauseTypeResourceVersionTooLarge)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice - didn't know about this function

// NewTooLargeResourceVersionError returns a timeout error with the given retrySeconds for a request for
// a minimum resource version that is larger than the largest currently available resource version for a requested resource.
func NewTooLargeResourceVersionError(minimumResourceVersion, currentRevision uint64, retrySeconds int) error {
err := errors.NewTimeoutError(fmt.Sprintf("Too large resource version: %d, current: %d", minimumResourceVersion, currentRevision), retrySeconds)
err.ErrStatus.Details.Causes = []metav1.StatusCause{{Message: tooLargeResourceVersionCauseMsg}}
err.ErrStatus.Details.Causes = []metav1.StatusCause{{Type: metav1.CauseTypeResourceVersionTooLarge}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also continue setting the message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return err
}

// IsTooLargeResourceVersion returns true if the error is a TooLargeResourceVersion error.
func IsTooLargeResourceVersion(err error) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we want to backport this, I would probably leave this method here and delegate to apierrors.IsTooLargeResourceVersion(err) in old releases, rather than remove an exported method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 876 to 877
// CauseTypeResourceVersionTooLarge is used to report that resource version is coming
// from the future and request cannot be served.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"resource version is coming from the future" is confusing... consider rephrasing to something like

CauseTypeResourceVersionTooLarge is used to report that that requested resource version is newer than the data observed by the API server, so the request cannot be served.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks - done

@@ -288,7 +288,7 @@ func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
}

list, paginatedResult, err = pager.List(context.Background(), options)
if isExpiredError(err) {
if isExpiredError(err) || apierrors.IsTooLargeResourceVersion(err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the docs below, and the docs for setIsLastSyncResourceVersionExpired misleading, since we're setting that in cases other than for expired errors. Should we rename setIsLastSyncResourceVersionExpired to setIsLastSyncResourceVersionUnavailable or something that can make sense for both expired and toolarge errors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense - done

@wojtek-t wojtek-t force-pushed the fix_reflector_not_making_progress branch from b876884 to f65f961 Compare June 30, 2020 06:16
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2020
Copy link
Member Author

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt - thanks for the review; PTAL

Comment on lines 876 to 877
// CauseTypeResourceVersionTooLarge is used to report that resource version is coming
// from the future and request cannot be served.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks - done

Comment on lines 572 to 578
if status := APIStatus(nil); errors.As(err, &status) && status.Status().Details != nil {
for _, cause := range status.Status().Details.Causes {
if cause.Type == metav1.CauseTypeResourceVersionTooLarge {
return true
}
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice - didn't know about this function

// NewTooLargeResourceVersionError returns a timeout error with the given retrySeconds for a request for
// a minimum resource version that is larger than the largest currently available resource version for a requested resource.
func NewTooLargeResourceVersionError(minimumResourceVersion, currentRevision uint64, retrySeconds int) error {
err := errors.NewTimeoutError(fmt.Sprintf("Too large resource version: %d, current: %d", minimumResourceVersion, currentRevision), retrySeconds)
err.ErrStatus.Details.Causes = []metav1.StatusCause{{Message: tooLargeResourceVersionCauseMsg}}
err.ErrStatus.Details.Causes = []metav1.StatusCause{{Type: metav1.CauseTypeResourceVersionTooLarge}}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

return err
}

// IsTooLargeResourceVersion returns true if the error is a TooLargeResourceVersion error.
func IsTooLargeResourceVersion(err error) bool {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -288,7 +288,7 @@ func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
}

list, paginatedResult, err = pager.List(context.Background(), options)
if isExpiredError(err) {
if isExpiredError(err) || apierrors.IsTooLargeResourceVersion(err) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense - done

@@ -356,10 +356,7 @@ func TestTooLargeResourceVersionList(t *testing.T) {

result := &example.PodList{}
err = cacher.List(context.TODO(), "pods/ns", storage.ListOptions{ResourceVersion: listRV, Predicate: storage.Everything}, result)
if !errors.IsTimeout(err) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted

Comment on lines 569 to 571
if !IsTimeout(err) {
return false
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is how we create the "TooLargeResourceVersion" error.

Do you suggest dropping it and leaving it rely only on the cause?

@wojtek-t wojtek-t force-pushed the fix_reflector_not_making_progress branch from f65f961 to 84cb34c Compare June 30, 2020 07:02
@wojtek-t wojtek-t force-pushed the fix_reflector_not_making_progress branch from 84cb34c to 3704174 Compare June 30, 2020 07:12
Copy link
Member Author

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt - PTAL

Comment on lines 569 to 571
if !IsTimeout(err) {
return false
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually - I played a bit more with the PR and removed this part completely. The cause type being the official type seems to be enough for the purpose of this PR and makes it easier to cherrypick.

@wojtek-t
Copy link
Member Author

/retest

@liggitt
Copy link
Member

liggitt commented Jun 30, 2020

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants