
P&F: Update and cleanup mutating work estimator #105930

Merged: 3 commits into kubernetes:master on Nov 1, 2021

Conversation

wojtek-t (Member)

Release note: NONE

/kind cleanup
/priority important-soon
/sig api-machinery

/assign @MikeSpreitzer @tkashem

@k8s-ci-robot k8s-ci-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Oct 27, 2021
@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 27, 2021
wojtek-t (Member Author)

/retest

fedebongio (Contributor)

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 28, 2021
tkashem (Contributor) left a comment

looks good to me, left a few minor comments

// TODO: As described in the KEP, we should take into account that not all
// events are equal and try to estimate the cost of a single event based on
// some historical data about size of events.
var finalSeats uint
var additionalLatency time.Duration
maxSeats := uint(math.Ceil(float64(watchCount) / watchesPerSeat))
tkashem (Contributor):

nit: maxSeats and maximumSeats sound very similar, can we rename maxSeats?

MikeSpreitzer (Member):

Yeah. This is just an earlier calculation of finalSeats.

wojtek-t (Member Author):

done


// TODO: Make this unconditional after we tune the algorithm better.
// Technically, there is an overhead connected to processing an event after
// the request finishes even if there is a small number of watches.
// However, until we tune the estimation we want to stay on the safe side
// and avoid introducing additional latency for almost every single request.
if watchCount >= watchesPerSeat {
finalSeats = uint(math.Ceil(float64(watchCount) / watchesPerSeat))
additionalLatency = eventAdditionalDuration
tkashem (Contributor) on Oct 29, 2021:

I actually preferred what you had before; this way we can avoid calling SeatsTimesDuration and DurationPerSeat when final seats do not exceed the maximum, or am I missing something here?

if watchCount >= watchesPerSeat {
	finalSeats = uint(math.Ceil(float64(watchCount) / watchesPerSeat))
	additionalLatency = eventAdditionalDuration

	if finalSeats > maximumSeats {
		originalFinalSeatSeconds := SeatsTimesDuration(float64(finalSeats), additionalLatency)

		finalSeats = maximumSeats
		additionalLatency = originalFinalSeatSeconds.DurationPerSeat(float64(finalSeats))
	}
}

MikeSpreitzer (Member):

I agree. And this avoids the poorly-named "maxSeats".

wojtek-t (Member Author):

done
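For illustration, here is a minimal, self-contained sketch of the capping flow discussed in this thread. It assumes the constants mentioned in this PR (watchesPerSeat = 10, eventAdditionalDuration = 5ms) plus an assumed maximumSeats = 10, and it approximates the SeatsTimesDuration/DurationPerSeat helpers with plain duration arithmetic rather than the real fixed-point SeatSeconds type:

package main

import (
	"fmt"
	"math"
	"time"
)

// Constants as discussed in this PR; maximumSeats = 10 is an assumption for this sketch.
const (
	watchesPerSeat          = 10.0
	eventAdditionalDuration = 5 * time.Millisecond
	maximumSeats            = 10
)

// estimateWatchWork mirrors the suggested flow above: derive finalSeats from the
// number of registered watchers and, when it exceeds the cap, keep the total
// seat-seconds constant by stretching additionalLatency instead.
func estimateWatchWork(watchCount int) (finalSeats uint, additionalLatency time.Duration) {
	if float64(watchCount) < watchesPerSeat {
		return 0, 0
	}
	finalSeats = uint(math.Ceil(float64(watchCount) / watchesPerSeat))
	additionalLatency = eventAdditionalDuration

	if finalSeats > maximumSeats {
		// Plain-float stand-in for SeatsTimesDuration followed by DurationPerSeat.
		totalSeatSeconds := float64(finalSeats) * additionalLatency.Seconds()
		finalSeats = maximumSeats
		additionalLatency = time.Duration(totalSeatSeconds / float64(finalSeats) * float64(time.Second))
	}
	return finalSeats, additionalLatency
}

func main() {
	// 1000 watchers: the uncapped estimate would be 100 seats for 5ms (0.5 seat-seconds);
	// after capping it becomes 10 seats for 50ms, preserving the same amount of work.
	seats, latency := estimateWatchWork(1000)
	fmt.Println(seats, latency) // 10 50ms
}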

additionalLatencyExpected: 5 * time.Millisecond,
},
*/
{
tkashem (Contributor):

this is just the uncommented version, right?

wojtek-t (Member Author):

yes - modulo one case that was adjusted to test the capping - this one:
"request verb is create, watches registered, maximum is capped"

MikeSpreitzer (Member) left a comment

This is mainly good. A few minor issues noted inline.


// SeatSeconds is a measure of work, in units of seat-seconds, using a fixed-point representation.
// `SeatSeconds(n)` represents `n/ssScale` seat-seconds.
// The constants `ssScale` and `ssScaleDigits` are private to the implementation here,
MikeSpreitzer (Member):

BTW, I forgot to remove the mention of "ssScaleDigits" from this comment when that constant was removed in the original drafting.

wojtek-t (Member Author):

done
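For context, a rough sketch of what such a fixed-point representation looks like. This is illustrative only, not the actual apiserver code, and the ssScale value chosen here is an assumption:

package main

import (
	"fmt"
	"time"
)

// SeatSeconds is a fixed-point measure of work: SeatSeconds(n) represents
// n/ssScale seat-seconds. The ssScale value below is assumed for this sketch;
// the real constant is private to the apiserver implementation.
type SeatSeconds uint64

const ssScale = 100000000

// SeatsTimesDuration returns the work done by `seats` seats occupied for `duration`.
func SeatsTimesDuration(seats float64, duration time.Duration) SeatSeconds {
	return SeatSeconds(seats * duration.Seconds() * ssScale)
}

// DurationPerSeat answers: for how long must `seats` seats be occupied to add up
// to this much work?
func (ss SeatSeconds) DurationPerSeat(seats float64) time.Duration {
	return time.Duration(float64(ss) / seats / ssScale * float64(time.Second))
}

func main() {
	w := SeatsTimesDuration(100, 5*time.Millisecond) // 0.5 seat-seconds
	fmt.Println(w.DurationPerSeat(10))               // 50ms: the same work spread over 10 seats
}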

@@ -24,27 +24,36 @@ import (
apirequest "k8s.io/apiserver/pkg/endpoints/request"
)

const (
watchesPerSeat = 10.0
eventAdditionalDuration = 5 * time.Millisecond
MikeSpreitzer (Member):

I like "notification" better than "event", since we also have a different thing called "event" and "notification" is a pretty standard and more specific term for what we are talking about here.

wojtek-t (Member Author):

The terminology that is used everywhere is "watch event", not "watch notification".
So we should be consistent with the terminology that is used across the project (even if we believe that notification would be a better one).

MikeSpreitzer (Member):

Thanks, I had forgotten about "watch event".

Comment on lines 81 to 84
// We simply estimate the total amount of final work for this request based
// on the above and rely on potential reshaping of the request if the
// concurrency limit for a given priority level will not allow to run
// request with that many seats.
MikeSpreitzer (Member):

This is no longer true.

wojtek-t (Member Author):

Which one - in my opinion it is true:

  • we estimate the total amount of final work based on this assumption
  • we still rely on potential reshaping if the concurrency limit is smaller than final seats.

MikeSpreitzer (Member):

First, the preceding unchanged sentence is a bit off base. It says "we assume ... is infinitely parallelizable". Was that intended to say that the actual work is all launched at once? Or that we can freely reshape that work as much as we like?

I think a more accurate statement, starting with a revised version of the preceding thus-far-unchanged sentence, would be something like the following.

// As a starting point we assume that the actual work associated with the watchers happens
// in many waiting goroutines that are all resumed at once,
// each such goroutine taking 1/Nth of a seat for M milliseconds.
// We allow the accounting in APF of that work to be reshaped into
// another rectangle of equal area, for practical reasons.

wojtek-t (Member Author):

Slightly changed the version (as the multiple goroutines part is not an assumption - it's a fact).
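Put numerically, the reshaping keeps the area of the work rectangle constant: the stretched additionalLatency is chosen so that maximumSeats times the new latency equals the uncapped finalSeats times eventAdditionalDuration. For example, assuming maximumSeats = 10, an uncapped estimate of 40 seats for 5ms (0.2 seat-seconds) becomes 10 seats for 20ms, still 0.2 seat-seconds.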


// While processing individual events is highly parallelizable,
// our design/implementation have a couple limitations that would make
// it highly inefficient:
MikeSpreitzer (Member):

I see two problems with the wording here.

  1. "it" is supposed to refer to the most recent noun phrase. In the current text that is "a couple limitations", which is not what you mean.
  2. The limitations in the current APF design/implementation do not make the processing ("sending") of notifications inefficient. APF does not change how notifications are sent. The issue here is with how APF manages requests to account for that notification sending.

wojtek-t (Member Author):

reworded

//
// TODO: Confirm that the current cap of maximumSeats allows us to
// achieve the above.
if finalSeats > maximumSeats {
MikeSpreitzer (Member):

The constant maximumSeats was introduced to cap the width coming from the LIST estimator. Perhaps we do not necessarily want the same cap for the mutation-WATCH estimator?

wojtek-t (Member Author):

We may want to tune it, but for now we don't have any better constant so I'm reusing the existing one.

// 2) we are not changing the finalWork estimate - just potentially
// making it much longer. This is fine as long as we will not
// be able to dispatch so many requests that will effectively
// completely overload kube-apiserver (and/or etcd).
MikeSpreitzer (Member):

IIRC there are other concerns as well. I recall network bandwidth being mentioned.

wojtek-t (Member Author):

Yes - the "overloaded" is both in terms of cpu and network bandwidth.

wojtek-t (Member Author):

I also reworded it.

// 1) we reduce the amount of seat-seconds that are "wasted" during
// dispatching and executing initial phase of the request
// 2) we are not changing the finalWork estimate - just potentially
// making it much longer. This is fine as long as we will not
MikeSpreitzer (Member):

It is not "fine". This is abandoning the correspondence between the way the actual work is scheduled --- all the notification work is launched at once --- and how APF is reserving server capacity for that work. The hope here is that the newly relaxed relationship is good enough.

wojtek-t (Member Author):

reworded

wojtek-t (Member Author) left a comment

@MikeSpreitzer @tkashem - thanks for the review; I applied the comments - PTAL

Comment on lines 119 to 120
// 2) we are not changing the finalWork estimate - just potentially
// making it much longer. As long as the maximum seats setting
MikeSpreitzer (Member):

Rather than saying "potentially making it much longer", where "it" grammatically refers to "the finalWork estimate" (which is work not time), I would say something like "reshaping it to be narrower and longer".

wojtek-t (Member Author):

done

wojtek-t (Member Author) left a comment

@MikeSpreitzer - thank you Mike for the review (I felt a bit like I was in an English lesson :) )

PTAL


wojtek-t (Member Author)

/retest

MikeSpreitzer (Member) left a comment

This looks good to me, just a couple of minor comments inline.

additionalLatency = eventAdditionalDuration
finalWork := SeatsTimesDuration(float64(finalSeats), eventAdditionalDuration)

// While processing individual events is highly parallelizable,
MikeSpreitzer (Member):

Same wording issue here. "Parallelizable" means can be parallel. But what we want to say here is that this processing is highly parallel.

wojtek-t (Member Author):

done

finalWork := SeatsTimesDuration(float64(finalSeats), eventAdditionalDuration)

// While processing individual events is highly parallelizable,
// the design/implementation of P&F have a couple limitations that
MikeSpreitzer (Member):

s/have/has/

wojtek-t (Member Author):

done

Comment on lines +78 to +80
// we will work on tuning the algorithm later. Given that the actual work
// associated with processing watch events is happening in multiple
// goroutines (proportional to the number of watchers) that are all
MikeSpreitzer (Member):

This is good.

wojtek-t (Member Author) left a comment

Thanks @MikeSpreitzer - PTAL


MikeSpreitzer (Member) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 1, 2021
k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MikeSpreitzer, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 7669498 into kubernetes:master Nov 1, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Nov 1, 2021
Labels

approved - Indicates a PR has been approved by an approver from all required OWNERS files.
area/apiserver
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/cleanup - Categorizes issue or PR as related to cleaning up code, process, or technical debt.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note-none - Denotes a PR that doesn't merit a release note.
sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.
triage/accepted - Indicates an issue or PR is ready to be actively worked on.
5 participants