
fairqueuing implementation with unit tests #84544

Closed
wants to merge 4 commits

Conversation

@aaron-prindle (Contributor) commented Oct 30, 2019:

What type of PR is this?
/kind feature

What this PR does / why we need it:
Implements a package for the fair queuing algorithm, which will be used in the priority and fairness KEP.

Special notes for your reviewer:
This PR includes work from the feature/rate-limiting-branch from these PRs:
#80786, #81621, #81707, #81788

This PR only adds the fair queuing logic and required libraries. Wiring the fair queuing into a full request manager is not in this PR.
Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md

@k8s-ci-robot added labels release-note-none, kind/feature, size/XXL, cncf-cla: yes, needs-sig, needs-priority, area/apiserver, area/dependency, sig/api-machinery and removed label needs-sig on Oct 30, 2019.
@MikeSpreitzer (Member):
/cc @MikeSpreitzer

@mars1024 (Member):
/cc @mars1024 @yue9944882

@lavalamp (Member) left a comment:

OK I made it through the rest of this file.

func (qs *queueSet) enqueue(request *fq.Request) {
queue := request.Queue
queue.Enqueue(request)
qs.updateQueueVirtualStartTime(request, queue)

Member:
I would have expected this update to happen when we start executing the request, not when we enqueue it?

Member:
It looks like you are saying a pointer to the KEP is not good enough, we have to copy the reasoning from there into the code.

Member:
I expect this can be explained in one sentence and that sentence would be super enlightening. Worst case, the sentence is "there's no possible short explanation of this, see the KEP"; that would tell me that there's something major missing in my mental model of how this code works.

But I expect that a useful sentence like "The virtual clock on a queue starts when the first request is queued (and e.g. not when the request begins to execute) because ____." can be written.

// https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md#dispatching
func (qs *queueSet) updateQueueVirtualStartTime(packet *fq.Request, queue *fq.Queue) {
// When a request arrives to an empty queue with no requests executing:
// len(queue.Requests) == 1 as enqueue has just happened prior (vs == 0)

Member:
Please make the comment explain why this is to be done? ("it's in the KEP" is not an explanation!)

}
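
For orientation, here is a hedged sketch of the rule these two threads are debating, following the KEP's dispatching section: a queue's virtual clock is reset when a request arrives at an empty queue with nothing executing, so that an idle period neither charges nor credits the queue. The qs.virtualTime field name is an assumption, not confirmed from this diff; the other names follow the snippet above.

func (qs *queueSet) updateQueueVirtualStartTime(request *fq.Request, queue *fq.Queue) {
	// The request was enqueued just before this call, so "arrival to an
	// empty queue" shows up as exactly one queued request, none executing.
	if len(queue.Requests) == 1 && queue.RequestsExecuting == 0 {
		// Restart the queue's virtual clock at the current virtual time;
		// otherwise the idle gap would distort the queue's fair share.
		queue.VirtualStart = qs.virtualTime // qs.virtualTime: assumed field
	}
}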

// dequeue dequeues a request from the queueSet
func (qs *queueSet) dequeue() (*fq.Request, bool) {

Member:
+"Locked"?

Member:
I notice that you did not mention this on all internal methods that (a) must be called with the lock held and (b) do not have "Locked" in their name. What is your opinion on how to mark locking constraints on methods?

Contributor Author:
Updated

Member:
The reason I didn't comment everywhere is simple: I reviewed in chunks and especially at first wasn't aware that there was a lock that some things needed to hold. It's best if the convention is universally applied.
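
For illustration, the "Locked" convention under discussion usually looks like the following sketch (the method names and qs.lock field here are hypothetical, not taken from this diff):

// dequeueLocked must only be called with qs.lock held; the "Locked"
// suffix is the marker for that precondition.
func (qs *queueSet) dequeueLocked() (*fq.Request, bool) {
	// ... safe to touch the queues: the caller holds qs.lock ...
	return nil, false
}

// dequeue is the locking wrapper: it acquires the lock itself and then
// delegates to the Locked variant.
func (qs *queueSet) dequeue() (*fq.Request, bool) {
	qs.lock.Lock()
	defer qs.lock.Unlock()
	return qs.dequeueLocked()
}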

if !ok {
return nil, false
}
qs.counter.Add(1)

Member:
Why shouldn't the thread reading from the channel do this?

If the answer is some variation of "locking" then why shouldn't this be done by where queue.RequestsExecuting and qs.numRequestsEnqueued are updated in dequeue?

Member:
By "this" you mean increment the counter? That is for the same reason that the counter is incremented before forking a goroutine: (a) this is the code responsible for making another goroutine active and (b) doing the increment later might be too late.

r.Queue.VirtualStart -= qs.estimatedServiceTime - S

// request has finished, remove from requests executing
r.Queue.RequestsExecuting--

Member:
can we decrement qs.counter here? (or defer such a decrement?)

Member:
No, there is no goroutine ending or going idle here.
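
Spelling out the arithmetic behind the adjustment above (assuming, as in fair queuing generally, that the queue was charged the estimate up front at dispatch):

// At dispatch: charge the estimate G before the true duration is known.
//   queue.VirtualStart += qs.estimatedServiceTime        // +G
// At completion, after measuring the actual service time S:
//   queue.VirtualStart -= qs.estimatedServiceTime - S    // -G+S
// Net charge to the queue's virtual clock: exactly S.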

@lavalamp (Member):
I'm primarily bothered by two things, which may be the same thing:

  • Why not have a goroutine doing the enqueueings and dequeueings instead of tacking that work on at various places in existing goroutines?
  • Can we find more principled ways/places to increment/decrement the various counters? I'm not convinced everything is paired and I'm not sure how to convince myself. (That is, I'm like 80% sure it's proper--I didn't see anything obviously wrong--but I want to be 99.99%+ convinced.)

}
}

// TestNoRestraint should fail because the dummy QueueSet exercises no control

Member:
What exactly should fail? Not the whole test, or we wouldn't check it in?

Member:
This was the first test function written, before we had a real implementation of QueueSet. Now that we do, this test is relatively uninteresting. To be fair, it does test the test.

exerciseQueueSetUniformScenario(t, qs, []uniformClient{
{1001001001, 5, 100, time.Second, time.Second},
}, time.Second*10, true, false, clk, counter)
}
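
The positional literal above is hard to read without the struct definition; its fields are presumably along these lines (an inference from the call site, not confirmed by this diff):

import "time"

// Presumed shape of uniformClient, one entry per simulated client.
type uniformClient struct {
	hash          uint64        // flow identifier, e.g. 1001001001
	nThreads      int           // concurrent threads for this client
	nCalls        int           // calls issued per thread
	execDuration  time.Duration // simulated service time per call
	thinkDuration time.Duration // simulated pause between calls
}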

Member:
Do we test changing config anywhere in this file? That seems important, both adding and removing queues, and whatever else is needed.

I would feel a lot more confident about the locking / counting if we could make a "thrash" test, e.g. use a real clock and simulate a bunch of random requests.

Member:
Yes, a randomized tester with config changes would be a good add. It would not need to use a real clock.

Member:
I think some testing with a real clock adds assurance that there's no deadlocks hiding in the code.

@MikeSpreitzer (Member) commented Nov 12, 2019:

Here is a suggestion of a way to cleanly package up the union-of-unblocks logic. Define an internal (i.e., must be accessed only while holding the QueueSet's lock) abstraction like the following.

type initiallyUnset struct {/* a GoRoutineCounter, a condition variable, a count of waiting goroutines, `isSet bool`, `value interface{}` */}

func (iu *initiallyUnset) set(value interface{}) {/* set the value, signal the CV, add count to GoRoutineCounter, zero the count (only to highlight symmetry) */}

func (iu *initiallyUnset) get() interface{} {/* if not set yet then increment count and decrement GoRoutineCounter then wait on the CV */}
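
A fleshed-out sketch of that suggestion, assuming a GoRoutineCounter with an Add(int) method and a sync.Cond whose Locker is the QueueSet's lock (all identifiers here are illustrative, not the PR's code):

import "sync"

// GoRoutineCounter is the assumed interface: Add adjusts the count of
// active goroutines by the given delta.
type GoRoutineCounter interface{ Add(delta int) }

type initiallyUnset struct {
	counter GoRoutineCounter
	cond    *sync.Cond // cond.L is the QueueSet's lock
	waiting int        // goroutines currently blocked in get
	isSet   bool
	value   interface{}
}

// set stores the value, wakes every waiter, and credits them all back
// to the goroutine counter in one step.
func (iu *initiallyUnset) set(value interface{}) {
	iu.value, iu.isSet = value, true
	iu.cond.Broadcast()
	iu.counter.Add(iu.waiting)
	iu.waiting = 0 // only to highlight the symmetry with get
}

// get blocks until set has run, telling the goroutine counter that this
// goroutine is idle for as long as it waits on the condition variable.
func (iu *initiallyUnset) get() interface{} {
	for !iu.isSet {
		iu.waiting++
		iu.counter.Add(-1) // going idle while waiting
		iu.cond.Wait()
	}
	return iu.value
}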

@BenTheElder (Member):

/uncc
Lavalamp can approve deps and there's a lot of other code still under review.

@MikeSpreitzer (Member):

BTW, since we are not super close on this one yet, I suggest we not do force-push, so that it is easier for reviewers to find the recent deltas.

}
}
klog.V(5).Infof("request timed out after being enqueued\n")
metrics.AddReject(qs.config.Name, "time-out")

Member:
Why don't we need to decrement the counter here? Wouldn't it have been incremented prior to sending something down the dequeue channel?

@k8s-ci-robot added labels sig/cluster-lifecycle, sig/instrumentation on Nov 13, 2019.
@MikeSpreitzer (Member):

I folded the latest changes from here into #85192.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aaron-prindle
To complete the pull request process, please assign liggitt.
You can assign the PR to them by writing /assign @liggitt in a comment when ready.


@k8s-ci-robot added label sig/scheduling on Nov 13, 2019.
@k8s-ci-robot (Contributor):

@aaron-prindle: The following tests failed, say /retest to rerun them all:

Test name                                   Commit   Rerun command
pull-kubernetes-bazel-test                  009a6bc  /test pull-kubernetes-bazel-test
pull-kubernetes-e2e-gce-device-plugin-gpu   009a6bc  /test pull-kubernetes-e2e-gce-device-plugin-gpu
pull-kubernetes-e2e-gce                     009a6bc  /test pull-kubernetes-e2e-gce
pull-kubernetes-verify                      009a6bc  /test pull-kubernetes-verify
pull-kubernetes-e2e-kind                    009a6bc  /test pull-kubernetes-e2e-kind
pull-kubernetes-kubemark-e2e-gce-big        009a6bc  /test pull-kubernetes-kubemark-e2e-gce-big
pull-kubernetes-e2e-gce-100-performance     009a6bc  /test pull-kubernetes-e2e-gce-100-performance


@MikeSpreitzer (Member):

This should be closed; #85192 has merged.
