Return dynamic RetryAfter header from APF #117547
Conversation
@MikeSpreitzer @tkashem @deads2k - FYI
/triage accepted
/cc @deads2k
/cc @MikeSpreitzer
Why Fibonacci numbers?
Why not an exponential slow-down and linear speed-up, as in TCP window sizes?
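For reference, a minimal sketch of the two policies being contrasted; the step values and the cap below are illustrative, not the ones in this PR:

```go
// Fibonacci-style step table: each delay is the sum of the previous two,
// so the back-off grows sub-exponentially.
var fibonacciSteps = []int64{1, 2, 3, 5, 8, 13}

// TCP-window-style alternative: exponential slow-down (double the delay
// under overload), linear speed-up (shrink it by one second otherwise).
func aimdAdjust(retryAfter int64, overloaded bool) int64 {
	const maxDelay = 16 // hypothetical upper bound
	if overloaded {
		if retryAfter*2 > maxDelay {
			return maxDelay
		}
		return retryAfter * 2
	}
	if retryAfter > 1 {
		return retryAfter - 1
	}
	return 1
}
```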
```go
		workEstimator:   workEstimator,
		droppedRequests: utilflowcontrol.NewDroppedRequestsTracker(),
	}
	return http.HandlerFunc(priorityAndFairnessHandler.Handle)
```
Instead we could just return priorityAndFairnessHandler if its method were named ServeHTTP rather than Handle, right?
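A minimal sketch of the suggestion (method body elided), assuming the struct type carries the same name as in the surrounding code:

```go
// Renaming Handle to ServeHTTP makes the struct satisfy http.Handler,
// so the constructor can return the handler value directly:
func (h *priorityAndFairnessHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// ... body of the former Handle method ...
}

// and then:
//	return priorityAndFairnessHandler
// instead of:
//	return http.HandlerFunc(priorityAndFairnessHandler.Handle)
```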
True - but that wouldn't change much here.
It would make no change at all that matters to the rest of the code, just a simplification here.
```go
	}
}

// recomputeRetryAfter is checking if retryAfter shouldn't be adjusted.
```
This sentence and the others that refer to "retryAfter" are confusing to the reader at this point, who does not see a variable or field named "retryAfter". I suggest adding a comment on droppedRequestsStats.currentRetryAfterStep explaining that retryAfter = retryAfterSteps[droppedRequestsStats.currentRetryAfterStep].
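Something along these lines, as a sketch of the suggested comment (not the PR's final wording):

```go
type droppedRequestsStats struct {
	// currentRetryAfterStep is an index into retryAfterSteps; the
	// effective retryAfter for this priority level is
	// retryAfterSteps[currentRetryAfterStep].
	currentRetryAfterStep int
	// ...
}
```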
staging/src/k8s.io/apiserver/pkg/util/flowcontrol/dropped_requests_tracker.go (resolved)
```go
	history []unixStat

	currentRetryAfterStep int
	retryAfterUpdateUnix  int64
```
Suggested change:
```diff
-	retryAfterUpdateUnix int64
+	// retryAfterUpdateUnix is the time when currentRetryAfterStep was last updated, in seconds since Unix epoch.
+	retryAfterUpdateUnix int64
```
```go
// requests.
func (s *droppedRequestsStats) recomputeRetryAfter(unixTime int64) {
	retryAfter := retryAfterSteps[s.currentRetryAfterStep]
```
This could be written less redundantly.
First, return early if not enough time has passed since retryAfterUpdateUnix.
Otherwise compute the droppedRequests sum, then compare it with the high and low thresholds.
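A rough sketch of that restructuring; the names follow the PR, and the 3x/1x thresholds are taken from the discussion further down, so treat the exact conditions as illustrative:

```go
func (s *droppedRequestsStats) recomputeRetryAfter(unixTime int64) {
	retryAfter := retryAfterSteps[s.currentRetryAfterStep]
	// Early out: don't adjust more often than once per retryAfter window.
	if unixTime-s.retryAfterUpdateUnix < retryAfter {
		return
	}
	// Compute the droppedRequests sum once.
	droppedRequests := int64(0)
	for _, h := range s.history {
		if unixTime-h.unixTime < retryAfter {
			droppedRequests += h.requests
		}
	}
	// Then compare it with the high and low thresholds.
	switch {
	case droppedRequests >= 3*retryAfter && s.currentRetryAfterStep < len(retryAfterSteps)-1:
		s.currentRetryAfterStep++ // overloaded: slow clients down
		s.retryAfterUpdateUnix = unixTime
	case droppedRequests < retryAfter && s.currentRetryAfterStep > 0:
		s.currentRetryAfterStep-- // recovering: speed clients up
		s.retryAfterUpdateUnix = unixTime
	}
}
```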
```go
// maxHistory represents what is the maximum history of dropped
// requests stored per priority level.
```
Suggested change:
```diff
-// maxHistory represents what is the maximum history of dropped
-// requests stored per priority level.
+// maxHistory is the age threshold, in seconds, for discarding dropped requests history for a given priority level.
```
```go
const (
	// maxHistory represents what is the maximum history of dropped
	// requests stored per priority level.
	maxHistory = 30
```
Why keep stuff 29 seconds old when the logic never looks at anything older than 13 seconds?
```go
// for the purpose of adjusting RetryAfter header for newly dropped
// requests to avoid system overload.
type droppedRequestsTracker struct {
	clock clock.Clock
```
You don't need a whole Clock here; a PassiveClock or func() time.Time will do.
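For illustration, assuming the tracker keeps using k8s.io/utils/clock:

```go
import "k8s.io/utils/clock"

type droppedRequestsTracker struct {
	// clock.PassiveClock exposes only Now() and Since(); the full
	// clock.Clock interface additionally carries timer and ticker
	// constructors that this tracker never uses.
	clock clock.PassiveClock
	// ...
}
```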
done
Force-pushed from 467e24e to 60661ae.
@MikeSpreitzer - I added tests, PTAL
```go
for _, h := range s.history {
	if unixTime-h.unixTime < retryAfter {
```
If this slice were traversed in the reverse order then there could be an early out as soon as the time threshold has been passed.
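That is, something like this sketch, assuming s.history is ordered by increasing unixTime:

```go
droppedRequests := int64(0)
for i := len(s.history) - 1; i >= 0; i-- {
	if unixTime-s.history[i].unixTime >= retryAfter {
		break // every earlier entry is older still, so stop scanning
	}
	droppedRequests += s.history[i].requests
}
```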
Fixed.
staging/src/k8s.io/apiserver/pkg/util/flowcontrol/dropped_requests_tracker_test.go (resolved)
```go
steps := []struct {
	secondsElapsed  int
	droppedRequests int
	retryAfter      int64
```
Suggested change:
```diff
-	retryAfter int64
+	// Expected value after the _first_ call to RecordDroppedRequest
+	retryAfter int64
```
done
Force-pushed from 85b1762 to 3c49019.
Comments applied - PTAL
```go
		return
	}

	if droppedRequests < retryAfter {
```
According to the comment currently at lines 121--126, this condition needs && unixTime-s.retryAfterUpdateUnix >= retryAfter.
I consciously decided not to rate-limit decreases (they will kind-of self-rate-limit after bumps anyway).
Fixed the comment.
So that means that both slow-down and speed-up happen at the same rate (amortized over adjacent time), the slow-downs are just more jumpy. The net result is that retryAfter goes up and goes down at one second per second (but the up changes are clumped).
Well - it depends on what exactly is happening. Note that the threshold for increasing retryAfter is much higher (3*retryAfter) than for decreasing it (1*retryAfter), so while going down can potentially be smoother, it generally will take more time.
That factor of 3 is the threshold for making a change, but the change is a factor of 2. So, apart from the fact that the increases are batched, both the ramp up and the ramp down goes --- at a maximum --- at the same amortized rate: 1 second per second.
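To make that concrete for the ramp-up direction (illustrative numbers): with retryAfter = 4, a bump to 8 cannot happen sooner than 4 seconds after the previous update, so the delay grows by 4 seconds over at least 4 seconds of wall time, i.e. at most one second per second.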
```go
// However, given that we didn't report anything for the current second,
// we recompute it based on statistics from the previous one.
s.updateRetryAfterIfNeededLocked(unixTime - 1)
```
Even so, subtracting one here is confusing. Consider an example.
- Suppose a drop happens at T=5.1 seconds, and the previous drop was at 4.something. This code will now trigger.
- updateRetryAfterIfNeededLocked gets called with 4.
- If it makes an update, it sets s.retryAfterUpdateUnix to 4.

I think it makes perfect sense to call updateRetryAfterIfNeededLocked(5) in that case. That is saying to think about what happened up to T=5.0, which is what you mean. More precisely, the logic would be working with half-open intervals: [retryAfterUpdateUnix.0, retryAfterUpdateUnix.0+retryAfter).
```go
// The following table represents the list over time of:
// - seconds elapsed (as computed since the initial time)
// - requests that will be recorded as dropped in a current second
```
I find the current design of the test cases to be harder to think about than I think is necessary. The usual design of a test case is that it says "do this then expect that" (for some this and that). The current design does not follow that pattern. I would find it easier to think about these test cases if the secondsElapsed and droppedRequests prescribed dropping at a steady rate (secondsElapsed/droppedRequests between drops, with no drop at the start and a drop at the end); that would make a coherent "do this" part and it would precede the "expect that" part. (Of course the actual test cases would need some editing to work with this definition.)
I don't think that steady-rate is really needed (it would unnecessarily complicate the test).
But I think the rest of this comment makes sense (i.e. redoing it to be more like "do this, expect that"). Will try to fix that tomorrow.
I fixed the test to actually verify RetryAfter after dropping all requests. This makes it more intuitive and doesn't change much.
PTAL
Force-pushed from 3c49019 to 9bd60c7.
IIUIC this will have an interesting interaction with the "internal requests" we have in client-go, since the timeout of a request accounts for the whole original request (#117313); now the retry times will increase, so most probably it will "retry less".
This is true, but if you're explicitly requesting a client-side timeout for the request, that's actually desired. I don't perceive it as a problem, but rather as a natural consequence.
@MikeSpreitzer - PTAL
Force-pushed from 9bd60c7 to 23ac0fd.
```go
retryAfter := s.retryAfter.Load()

droppedRequests := int64(0)
if len(s.history) > 0 {
```
This conditionality is not needed because len(s.history) - 1 would evaluate to the int -1 in the excluded case, right?
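(In Go, a loop of the form `for i := len(s.history) - 1; i >= 0; i--` never executes its body for an empty slice, since the start index is -1, which is why the guard adds nothing.)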
```go
s.history = append(s.history, unixStat{unixTime: unixTime, requests: count})

startIndex := 0
for ; startIndex < len(s.history) && unixTime-s.history[startIndex].unixTime > maxRetryAfter; startIndex++ {
```
Entries whose age equals or exceeds 2 * s.retryAfter are also never going to be needed.
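Roughly, as a sketch (retryAfter loaded from the atomic as elsewhere in the PR, with 2*retryAfter substituted for the maxRetryAfter bound, and the final slicing assumed from context):

```go
retryAfter := s.retryAfter.Load()
startIndex := 0
// Drop every leading entry that is at least 2*retryAfter seconds old.
for ; startIndex < len(s.history) && unixTime-s.history[startIndex].unixTime >= 2*retryAfter; startIndex++ {
}
s.history = s.history[startIndex:]
```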
```go
	}
	fakeClock.Step(time.Duration(secondsToAdvance) * time.Second)

	// Record only first dropped request and recompute retryAfter.
```
This comment is out of date
/lgtm
This is an improvement. More can be done in follow-on.
LGTM label has been added. Git tree hash: 039423f5b5deaef2ebe2d83e9da0fd531d5a479e
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: MikeSpreitzer, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Ref kubernetes/enhancements#1040
/kind feature
/sig api-machinery
/priority important-longterm