Extend watch queue with a timeout and size limit #2285
Conversation
Force-pushed 813ed1a to 14f3120.
The vendoring error is something weird that came up through colliding merges that both vendored different things. We fixed it earlier today on master, so if you rebase, it should resolve that problem. I'm uncertain whether
watch/queue/queue.go
Outdated
    "github.com/docker/go-events"
    )

    func ErrQueueFullCloseFailed(err error) error {
This seems to be unused. If it's meant for external users, it seems like a strange function to expose.
watch/queue/queue.go
Outdated
    	return fmt.Errorf("The queue size reached its limit but couldn't be closed: %s", err)
    }

    var ErrQueueFull = fmt.Errorf("The queue size reached its limit and was closed")
How about `queue closed due to size limit`?
watch/sinks.go
Outdated
    events "github.com/docker/go-events"
    )

    var ErrSinkTimeout = fmt.Errorf("Timeout exceeded, tearing down sink")
Lowercase `timeout` in the message.
watch/sinks.go
Outdated
    select {
    case err := <-errChan:
    	return err
    case <-time.After(s.timeout):
I'd suggest using `time.NewTimer` explicitly here instead of `time.After`, so that the timer can be stopped in the common case that the write succeeds before the timer fires. There's no functional difference, but it's a little more optimal from a resource management perspective.
watch/watch.go
Outdated
    }

    // NewQueueWithOpts creates a new Queue using the full set of options.
    func NewQueueWithOpts(opts QueueOpts) *Queue {
To me this is begging for functional options: https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis
watch/watch.go
Outdated
    // WatchWithCtx returns a channel where all items published to the queue will
    // be received. The channel will be closed when the provided context is
    // cancelled.
    func (q *Queue) WatchWithCtx(ctx context.Context) (eventq chan events.Event) {
grpc has a variant of `Dial` which takes a context that they call `DialContext`, so `WatchContext`?
I think we could have something like this in go-events. We may want to back this with a buffered channel, depending on whether or not the runtime does lazy allocation for buffered channels.
watch/watch.go
Outdated
    	q.mu.Unlock()

    	if cancelFunc != nil {
    		cancelFunc()
    	}
    }

    outChan := make(chan events.Event)
    go func() {
I'd rather avoid an extra goroutine for every watch. This will be a large number for swarmkit, and it's already quite hard to read stack dumps. Either it should be opt-in with a special variant of `CallbackWatch`, or we should just return the extra channels and let the caller implement this `select` if it wants to.
    	return nil
    }

    // Full returns a channel that is closed when the queue becomes full for the
What happens when it is no longer full?
Typically, "full" and "empty" act pretty racy for concurrent queues. When would this be used?
If we want to act on full, might want a clamping function.
The channel is closed when a Write causes the queue to reach its limit. The queue can stop being full afterwards, and this channel is not meant as a mechanism for viewing the current fullness state of the queue.
The main use case for this is to notify that at least one `Event` has been dropped, and then it's up to the listener to determine if any action should be taken.
In the case of docker events, all API server implementations can receive a `/events?since` parameter to backfill past events, and the events stream is expected to be reliable in some versions of the CLI (1.13 to 17.03). Therefore, when a slow listener fills up its queue, it's preferred to close their event stream entirely and have them re-establish it with an appropriate `since` parameter, rather than silently dropping events.
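The close-once notification described above can be sketched with plain channels. This is a hedged, single-goroutine illustration of the pattern, not the PR's actual `LimitQueue`; all names are hypothetical.

```go
package main

import "fmt"

// limitQueue is a minimal sketch of a bounded queue that closes its
// full channel exactly once, the first time a write would exceed the
// limit. Not goroutine-safe; the real queue guards this with a mutex.
type limitQueue struct {
	events     chan string
	full       chan struct{}
	fullClosed bool
}

func newLimitQueue(limit int) *limitQueue {
	return &limitQueue{
		events: make(chan string, limit),
		full:   make(chan struct{}),
	}
}

// Write enqueues ev, or signals fullness by closing the full channel
// and reports whether the event was accepted.
func (q *limitQueue) Write(ev string) bool {
	select {
	case q.events <- ev:
		return true
	default:
		if !q.fullClosed {
			q.fullClosed = true
			close(q.full)
		}
		return false
	}
}

func main() {
	q := newLimitQueue(2)
	for _, ev := range []string{"a", "b", "c"} {
		q.Write(ev) // third write overflows the limit of 2
	}
	select {
	case <-q.full:
		fmt.Println("at least one event was dropped; re-establish the stream")
	default:
		fmt.Println("no events dropped")
	}
}
```

A listener would select on the full channel alongside the event channel and tear down or backfill when it fires.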
Couldn't this be handled with a callback on each dropped message? This seems very fragile.
Force-pushed 14f3120 to 1bba22a.
Rebased off of master and added a commit which addresses all review comments up to this point.
    // debug log messages that may be confusing. It is possible that the queue
    // will try to write an event to its destination channel while the queue is
    // being removed from the broadcaster. Since the channel is closed before the
    // queue, there is a narrow window when this is possible. In some event-based
"some event-based systems"?
watch/sinks_test.go
Outdated
    for {
    	<-ch.C
    }
    }()
Looks like this test will leak a goroutine; is `ch.C` ever closed? If not, can you add another channel for this goroutine to `select` on so it will terminate when the test is finished?
watch/sinks_test.go
Outdated
    // Make sure that closing a sink closes the channel
    var errClose error
    go func() {
    	errClose = sink.Close()
Why is this in a separate goroutine?
watch/watch.go
Outdated
    cancelFuncs map[events.Sink]func()

    // closeOutChan indicates whether the watchers' channels should be closed
    // when a watcher queue reaches its limit or when
or when?
watch/watch.go
Outdated
    for _, option := range options {
    	err := option(q)
    	if err != nil {
    		logrus.Warnf("Failed to apply options to queue: %s", err)
I guess you don't want to change the signature of `NewQueue`, so how about a panic here instead?
watch/watch.go
Outdated
    }

    if q.closeOutChan && q.limit == 0 {
    	logrus.Warnf("Unable to create queue with zero size limit and closeOutChan")
Hmm, I know I suggested this, but looking at the code I'm not sure there's any reason it can't be supported. It would just inhibit the optimization of returning `ch.C` (which is fine). Am I missing anything?
@alexmavr Any plans to PR this to go-events? This really belongs there as a primitive.
@stevvooe
watch/watch.go
Outdated
    // NewTimeoutLimitQueue creates a queue with a size limit and a request write
    // timeout.
    func NewTimeoutLimitQueue(timeout time.Duration, limit uint64) *Queue {
Let's instead define functions like:

    func WithTimeout(timeout time.Duration) func(*Queue) error
    func WithLimit(limit uint64) func(*Queue) error
    func WithCloseOutChan(closeOutChan bool) func(*Queue) error

Then you can create a queue with something like:

    NewQueue(WithTimeout(30*time.Second), WithCloseOutChan(true))
It's a bit more flexible that way.
Please sign your commits following these rules:

    $ git clone -b "watch-extensions" git@github.com:alexmavr/swarmkit.git somewhere
    $ cd somewhere
    $ git rebase -i HEAD~842353884240
    # editor opens
    # change each 'pick' to 'edit'
    # save the file and quit
    $ git commit --amend -s --no-edit
    $ git rebase --continue # and repeat the amend for each commit
    $ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.
Codecov Report

    @@            Coverage Diff             @@
    ##           master    #2285      +/-   ##
    ==========================================
    + Coverage   60.96%   61.04%   +0.08%
    ==========================================
      Files         126      128       +2
      Lines       20391    20531     +140
    ==========================================
    + Hits        12431    12534     +103
    - Misses       6589     6632      +43
    + Partials     1371     1365       -6
Force-pushed f043ff7 to 9ce0554.
So, the bounded queue model makes a lot of sense in go-events and I think we can do it with less code (maybe). I understand the time pressure, but I think if we are making a commitment to quality, we should try to land this in the right place, rather than defer to later. Once it is tested and in the code base, there will be little incentive to move it over.
watch/sinks_test.go
Outdated
    go func() {
    	errClose = sink.Close()
    }()
    errClose = sink.Close()
minor: `errClose := sink.Close()` instead of predeclaring
watch/sinks_test.go
Outdated
    <-ch.Done()
    require.NoError(errClose)

    // Close the leaking goroutine
    close(doneChan)
Minor: `defer close(doneChan)` right after the channel is created.
    // If a size of 0 is provided, the LimitQueue is considered limitless.
    type LimitQueue struct {
    	dst    events.Sink
    	events *list.List
I'm not sure how much time you have to make this work, but if you can instrument the entrance and exit of a regular queue, you can avoid having to replicate all of this logic.
    if !eq.fullClosed {
    	eq.fullClosed = true
    	close(eq.full)
    }
I think we can do a callback here. That would avoid spilling the internal channel manipulation outside of the internals.
Make sure to release the locks, then return `ErrQueueFull`. That will allow the writer an out-of-band notification. It also doesn't poison the queue.
I agree that a callback would be a better pattern here. Let's do it this way for the `go-events` followup.
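The drop-callback alternative being agreed on here can be sketched as follows. This is an illustrative pattern under assumed names (`dropQueue`, `onDrop`), not the go-events implementation.

```go
package main

import "fmt"

// dropQueue sketches the callback pattern: instead of closing an
// internal channel, the queue invokes a user-supplied callback for
// each event it has to drop, keeping channel manipulation internal.
type dropQueue struct {
	events chan string
	onDrop func(ev string)
}

func newDropQueue(limit int, onDrop func(string)) *dropQueue {
	return &dropQueue{
		events: make(chan string, limit),
		onDrop: onDrop,
	}
}

// Write enqueues ev; when the queue is full it calls onDrop instead.
// In real code the callback would run with no locks held, so the
// caller can notify out of band without poisoning the queue.
func (q *dropQueue) Write(ev string) {
	select {
	case q.events <- ev:
	default:
		q.onDrop(ev)
	}
}

func main() {
	dropped := 0
	q := newDropQueue(1, func(ev string) { dropped++ })
	q.Write("a")
	q.Write("b") // queue is full; this one hits the callback
	fmt.Println("dropped:", dropped)
}
```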
    }

    if err := eq.dst.Write(event); err != nil {
    	// TODO(aaronl): Dropping events could be bad depending
@aaronlehmann Did we not remove this error message?
It's suppressed by a wrapper sink in swarmkit. Discussion here: docker/go-events#11
watch/queue/queue_test.go
Outdated
    @@ -13,10 +14,13 @@ type mockSink struct {
    	closed   bool
    	holdChan chan struct{}
    	data     []events.Event
    	mutex    *sync.Mutex
just do `mutex sync.Mutex`; then it doesn't need to be initialized.
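The point of the suggestion is that `sync.Mutex`'s zero value is a ready-to-use unlocked mutex, so holding it by value removes the need for a constructor. A small sketch with an illustrative `mockSink` (not the test's actual type):

```go
package main

import (
	"fmt"
	"sync"
)

// mockSink holds its mutex by value: the zero value of sync.Mutex is
// an unlocked mutex, so `var s mockSink` is immediately usable with
// no initialization step.
type mockSink struct {
	mutex sync.Mutex
	data  []string
}

func (s *mockSink) Write(ev string) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.data = append(s.data, ev)
}

func (s *mockSink) Len() int {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	return len(s.data)
}

func main() {
	var s mockSink // no constructor needed
	s.Write("event")
	fmt.Println(s.Len())
}
```

A pointer field (`*sync.Mutex`) would instead panic with a nil dereference if a constructor forgot to set it.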
watch/queue/queue_test.go
Outdated
    }

    func TestLimitQueueNoLimit(t *testing.T) {
    	require := require.New(t)
That's cool; I didn't know about this feature.
watch/watch_test.go
Outdated
    go func() {
    	closed := false
    	for range events {
    		// After receiving the first event, block indefinitely
Comment is outdated
watch/watch_test.go
Outdated
    doneChan = make(chan struct{})
    go func() {
    	for !eventsClosed {
I think the race detector will consider this a data race. If we make it a channel that gets closed, it avoids the problem. I'd kind of like to unify this goroutine with the `select` below anyway. How about something like this:
    timeoutTimer := time.NewTimer(time.Minute)
    defer timeoutTimer.Stop()

    selectLoop:
    	for {
    		select {
    		case <-eventsClosed:
    			break selectLoop
    		case <-time.After(writerSleepDuration):
    			q.Publish("new event")
    		case <-timeoutTimer.C:
    			require.Fail("Timeout exceeded")
    		}
    	}
That's a great way to restructure this, thanks for the insight.
Didn't look into details, but `sync.Once` may help in certain areas...
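One place `sync.Once` could plausibly help is the close-once logic on the full channel: `once.Do` replaces the `fullClosed` flag and is safe to call from multiple goroutines. A hedged sketch with hypothetical names:

```go
package main

import (
	"fmt"
	"sync"
)

// closeNotifier shows sync.Once standing in for a fullClosed bool:
// MarkFull may be called any number of times from any goroutine, and
// the channel is closed exactly once, with no extra flag bookkeeping.
type closeNotifier struct {
	once sync.Once
	full chan struct{}
}

func newCloseNotifier() *closeNotifier {
	return &closeNotifier{full: make(chan struct{})}
}

func (n *closeNotifier) MarkFull() {
	n.once.Do(func() { close(n.full) })
}

func (n *closeNotifier) Full() <-chan struct{} { return n.full }

func main() {
	n := newCloseNotifier()
	n.MarkFull()
	n.MarkFull() // safe: the channel is only closed once
	<-n.Full()   // returns immediately since the channel is closed
	fmt.Println("full signal received")
}
```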
watch/queue/queue_test.go
Outdated
    @@ -42,6 +48,10 @@ func (s *mockSink) Len() int {
    	return len(s.data)
    }

    func (s *mockSink) String() string {
Do you have to take the lock here to make the race detector happy?
Signed-off-by: Alex Mavrogiannis <alex.mavrogiannis@docker.com>
Force-pushed 8dbfddb to 2d31d73.
Addressed final comments and squashed.

LGTM

1 similar comment

LGTM
- moby/swarmkit#2266 (support for templating Node.Hostname in docker executor)
- moby/swarmkit#2281 (change restore action on objects to be update, not delete/create)
- moby/swarmkit#2285 (extend watch queue with timeout and size limit)
- moby/swarmkit#2253 (version-aware failure tracking in the scheduler)
- moby/swarmkit#2275 (update containerd and port executor to container client library)
- moby/swarmkit#2292 (rename some generic resources)
- moby/swarmkit#2300 (limit the size of the external CA response)
- moby/swarmkit#2301 (delete global tasks when the node running them is deleted)

Minor cleanups, dependency bumps, and vendoring:

- moby/swarmkit#2271
- moby/swarmkit#2279
- moby/swarmkit#2283
- moby/swarmkit#2282
- moby/swarmkit#2274
- moby/swarmkit#2296 (dependency bump of etcd, go-winio)

Signed-off-by: Ying Li <ying.li@docker.com>
Upstream-commit: 4509a00
Component: engine
This PR extends the `watch` package with the following items:

- `LimitQueue`. This is near-identical to the implementation of a Queue from https://github.com/docker/go-events/blob/master/queue.go, with the difference that the queue has an upper size limit. When that limit is reached, a channel is closed and it's up to the user to determine the desired behavior from there.
- `TimeoutSink`, which wraps another sink with a timeout. If the timeout is reached, the wrapped sink is closed and an error is returned to the writer.
- A `ChannelSinkGenerator` interface which can be used to configure the sink-chain that `watch.Queue` creates for each watcher.
- New constructors for `watch.Queue`: `NewTimeoutLimitQueue`, or the more generic `NewQueueWithOpts`. Edit: Feature replaced with a functional options constructor.

To avoid disruption to existing users, the existing `NewQueue` constructor will return the previous configuration of sinks.

Unit test times:

cc @aaronlehmann @nishanttotla

Signed-off-by: Alex Mavrogiannis alex.mavrogiannis@docker.com