
#50102 Task 3: Until, backed by retry watcher #67350

Open · tnozicka wants to merge 3 commits into master from tnozicka:retry-watcher

Conversation

@tnozicka (Contributor) commented Aug 13, 2018

What this PR does / why we need it:
This is a split off from #50102 to go in smaller pieces.

Introduces Until based on RetryWatcher. It can survive closed watches as long as the last read ResourceVersion is still present in etcd.
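
For readers new to the API, here is a minimal usage sketch of what waiting on a condition through the retry-watcher-backed Until might look like. It assumes the k8s.io/client-go/tools/watch package shape this work feeds into; the helper name waitForPodRunning and the pod example are illustrative only and not part of this PR.

package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// waitForPodRunning blocks (possibly for a long time) until the named pod reports Running.
// Because Until is backed by a retry watcher, a closed watch is transparently re-established
// from the last seen ResourceVersion instead of failing the whole wait.
func waitForPodRunning(ctx context.Context, client kubernetes.Interface, namespace, name, resourceVersion string) error {
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "pods", namespace,
		fields.OneTermEqualSelector("metadata.name", name),
	)

	_, err := watchtools.Until(ctx, resourceVersion, lw, func(event watch.Event) (bool, error) {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			return false, fmt.Errorf("unexpected object type %T", event.Object)
		}
		return pod.Status.Phase == corev1.PodRunning, nil
	})
	return err
}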

Fixes: #31345

Requires:

/hold

Release note:

watch.Until now works for long durations.

/priority important-soon
/kind bug
(a bug fix following on from the main PR that this was split off from)

@tnozicka tnozicka changed the title from "[WIP] - #50102 Task 3: Until backed by retry watcher" to "[WIP] - #50102 Task 3: Until, backed by retry watcher" Aug 13, 2018

@tnozicka tnozicka force-pushed the tnozicka:retry-watcher branch from 7ff610e to 09d5b96 Aug 15, 2018

@spiffxp (Member) commented Oct 29, 2018

ping, curious if any progress is being planned for this release cycle

@spiffxp (Member) commented Oct 29, 2018

/skip
trying to clear the Submit Queue status

@tnozicka tnozicka force-pushed the tnozicka:retry-watcher branch from 09d5b96 to 5da92f8 Jan 3, 2019

@tnozicka tnozicka force-pushed the tnozicka:retry-watcher branch 3 times, most recently from 6dd7303 to 363ae00 Jan 3, 2019

@tnozicka tnozicka referenced this pull request Jan 4, 2019

Closed

Fix broken watches #50102

2 of 5 tasks complete

@wojtek-t wojtek-t self-assigned this Jan 4, 2019

@tnozicka (Contributor, Author) commented Jan 4, 2019

/milestone v1.14

@tnozicka (Contributor, Author) commented Feb 12, 2019

/retest

}

// Retry after delay
return false, time.Duration(status.Details.RetryAfterSeconds) * time.Second

@liggitt (Member) commented Feb 12, 2019

is there a reason to limit this to StatusGatewayTimeout, and not to use api/error#SuggestsClientDelay(err) to check this?

@tnozicka (Contributor, Author) commented Feb 12, 2019

thx, I wasn't aware we had a function for it, will fix it

@tnozicka (Contributor, Author) commented Feb 18, 2019

I've made it common for all the failures, but in the end I didn't use SuggestsClientDelay, as it requires a different type, and creating another object embedding that type just to artificially satisfy it didn't feel right.

@liggitt (Member) commented Feb 19, 2019

suggest extracting a helper method to check for delay directly on a status object in the api errors package like this:

// SuggestsClientDelay returns true if this error suggests a client delay as well as the
// suggested seconds to wait, or false if the error does not imply a wait. It does not
// address whether the error *should* be retried, since some errors (like a 3xx) may
// request delay without retry.
func SuggestsClientDelay(err error) (int, bool) {
	if t, ok := err.(APIStatus); ok {
		return StatusSuggestsClientDelay(t.Status())
	}
	return 0, false
}

func StatusSuggestsClientDelay(status metav1.Status) (int, bool) {
	if status.Details == nil {
		return 0, false
	}
	if status.Reason == metav1.StatusReasonServerTimeout {
		// this StatusReason explicitly requests the caller to delay the action
		return int(status.Details.RetryAfterSeconds), true
	}
	if status.Details.RetryAfterSeconds > 0 {
		// If the client requests that we retry after a certain number of seconds
		return int(status.Details.RetryAfterSeconds), true
	}
	return 0, false
}

we really don't want to treat status codes and RetryAfter delays inconsistently

@tnozicka (Contributor, Author) commented Feb 20, 2019

Hm, given we'd just ignore the bool value, that whole function comes down to: give me whatever value is in RetryAfterSeconds.

If that's 0, retry immediately; otherwise wait that long.

What am I missing?

(I guess I can just extract the function anyway to move things forward; the value of that step is not obvious to me.)
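
For illustration only, a minimal sketch of the simplification described above, assuming a metav1.Status is at hand; the helper name retryAfterFromStatus is hypothetical and not something in the PR or in apimachinery.

package example

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// retryAfterFromStatus ignores any "should we delay" boolean and simply returns whatever
// delay is encoded in the status: zero means retry immediately, anything else is how long
// to wait before the next attempt.
func retryAfterFromStatus(status metav1.Status) time.Duration {
	if status.Details == nil {
		return 0
	}
	return time.Duration(status.Details.RetryAfterSeconds) * time.Second
}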

@tnozicka (Contributor, Author) commented Feb 20, 2019

Also, none of the usages of that function actually reads the int value (I wondered why we don't return time.Duration there), so it seems to be used exclusively for deciding whether to retry, given you can't infer from the int value alone whether it was set.

klog.V(4).Info("Stopping RetryWatcher.")
return
case <-time.After(retryAfter):
done, retryAfter = rw.doReceive()

@liggitt (Member) commented Feb 12, 2019

other loops that re-establish watches do so at a 1 second interval. how are we avoiding hot-looping here?

@tnozicka (Contributor, Author) commented Feb 12, 2019

Are those (1s) mostly controller loops? My thinking is: a watch getting closed is a normal occurrence and we should re-establish it as fast as we can. This should behave the same way as a single watch, and re-establishing it shouldn't be observable nor cause artificial delays, at least the first time. Maybe we need to limit consecutive failures, or are we fine leaving this to the client rate limiter?
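
To make the options above concrete, here is a hedged sketch of one of them: enforcing a minimum interval between watch re-establishments so that even the "retry immediately" paths cannot hot-loop. This is illustrative only, not the PR's actual code; runWatchLoop and minRestartDelay are made-up names, and doReceive stands in for the retry watcher's inner receive function quoted elsewhere in this review.

package example

import (
	"context"
	"time"
)

// runWatchLoop keeps re-establishing a watch until doReceive reports it is done.
// Even when doReceive asks for an immediate retry (retryAfter == 0), the loop never
// restarts sooner than minRestartDelay after the previous attempt started.
func runWatchLoop(ctx context.Context, minRestartDelay time.Duration, doReceive func() (done bool, retryAfter time.Duration)) {
	for {
		start := time.Now()

		done, retryAfter := doReceive()
		if done {
			return
		}

		// Honour an explicit server-suggested delay, but enforce the floor as well.
		if floor := minRestartDelay - time.Since(start); retryAfter < floor {
			retryAfter = floor
		}

		select {
		case <-ctx.Done():
			return
		case <-time.After(retryAfter):
		}
	}
}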

@tnozicka (Contributor, Author) commented Feb 18, 2019

delays are now added in the inner function

@liggitt (Member) commented Feb 19, 2019

I still see several places it does return false, 0

@k8s-ci-robot (Contributor) commented Feb 18, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tnozicka
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: wojtek-t

If they are not already assigned, you can assign the PR to them by writing /assign @wojtek-t in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tnozicka tnozicka force-pushed the tnozicka:retry-watcher branch 2 times, most recently from fd95a4f to c974742 Feb 18, 2019


@tnozicka tnozicka force-pushed the tnozicka:retry-watcher branch from c974742 to e11e6aa Feb 18, 2019

@tnozicka (Contributor, Author) commented Feb 18, 2019

@liggitt thanks for your insights, comments addressed, ptal

@tnozicka tnozicka force-pushed the tnozicka:retry-watcher branch 2 times, most recently from feb7c4a to 747313e Feb 18, 2019


case io.EOF:
// watch closed normally
return false, 0

@liggitt (Member) commented Feb 19, 2019

anywhere this does return false, 0, we're still setting ourselves up for a hotloop, right?

@tnozicka (Contributor, Author) commented Feb 19, 2019

I think that on the normal path the WATCH should be restarted immediately.

What counts as the normal path could be the question. If you are unlucky enough to open a connection to an apiserver that is just about to terminate, you likely want to start again with another watch, since that wasn't an error path you'd have to back off on.

If your watch times out, you shouldn't sleep but retry immediately.

I guess we could count consecutive failures before getting to the processing phase, but splitting failures into ones we want to retry immediately and ones we want to apply a delay to seemed good enough for this use.
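
A hedged sketch of the split described above, stitched together from the fragments visible in this review (io.EOF restarting immediately, Status failures honouring RetryAfterSeconds); the function name retryDecision and the exact cases are illustrative, not the PR's final code.

package example

import (
	"io"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// retryDecision maps a watch failure onto the (done, retryAfter) contract used by the
// retry loop: the normal path restarts immediately, failure paths wait first.
func retryDecision(err error, status *metav1.Status) (done bool, retryAfter time.Duration) {
	// Normal path: the watch was closed (e.g. server-side timeout). Re-establishing it
	// immediately should be indistinguishable from one long-lived watch.
	if err == io.EOF {
		return false, 0
	}

	// Failure path: honour whatever delay the server suggested, if any.
	if status != nil && status.Details != nil && status.Details.RetryAfterSeconds > 0 {
		return false, time.Duration(status.Details.RetryAfterSeconds) * time.Second
	}

	// Anything else: apply a small default delay rather than hot-looping.
	return false, 1 * time.Second
}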

_ = rw.send(event)
return true, 0

default:

@liggitt (Member) commented Feb 19, 2019

a default switch case with a fallthrough to specific status codes reads really strangely... is there a clearer way to express this?

@tnozicka (Contributor, Author) commented Feb 19, 2019

moved it to a variable avoiding the fallthrough
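
A generic illustration of that restructuring, with the common delay hoisted into a variable so no case needs fallthrough; this is illustrative only, and statusDelay and the exact cases are not the PR's code.

package example

import (
	"net/http"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// statusDelay computes the retry delay once, up front, so the switch below only has to
// handle the status codes that change the outcome -- no default-with-fallthrough needed.
func statusDelay(status metav1.Status) (done bool, retryAfter time.Duration) {
	delay := time.Duration(0)
	if status.Details != nil {
		delay = time.Duration(status.Details.RetryAfterSeconds) * time.Second
	}

	switch status.Code {
	case http.StatusGone:
		// The requested ResourceVersion is no longer available; retrying cannot help.
		return true, 0
	default:
		// Every other failure retries after the common delay computed above.
		return false, delay
	}
}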

@tnozicka tnozicka force-pushed the tnozicka:retry-watcher branch from 747313e to 763cf3a Feb 19, 2019

tnozicka added some commits Jan 3, 2019

@tnozicka tnozicka force-pushed the tnozicka:retry-watcher branch from 763cf3a to c59bea2 Feb 19, 2019
