
Changes to subscriber pattern #828

Merged

Conversation

@maleck13 (Contributor) commented Mar 9, 2018

Describe what this PR does and why we need it:
Updates to how we handle the channels and subscribers based on the proposal outlined as part of
#638

@jmrodri @eriknelson initial WIP work here for early feedback. Not complete yet but also not miles away. Would like any feedback you have to ensure we are still on the same page and I am not heading down any rabbit holes 😄

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 9, 2018
@jmrodri (Contributor) left a comment:

I'm going to mark this as request changes because this is only a first-pass review. I need to look at the engine a little closer; there's something I'm not understanding that I want to examine further. Overall it's a nice job (no pun intended).

@@ -62,8 +62,6 @@ var (
)

const (
// MsgBufferSize - The buffer for the message channel.
MsgBufferSize = 20
Contributor:

So we're not using a buffered channel anymore? I guess if each one has its own channel, it makes sense for it to be a blocking, single-item channel.

Contributor Author:

Yes. Also, we are reading from that channel and handing off to the subscribers async, so we are always ready to read the next message. If we wanted to slow throughput, we could very easily use a sync.WaitGroup in combination with a buffered channel.

Contributor:

I don't see any reason to artificially slow the processing of msgs.

@@ -213,43 +211,43 @@ func CreateApp() App {
validateRegistryNames(app.registry)

log.Debug("Initializing WorkEngine")
app.engine = broker.NewWorkEngine(MsgBufferSize)
stateSubscriber := broker.NewJobStateSubscriber(app.dao)
Contributor:

👍

@@ -1059,13 +1059,6 @@ func (a AnsibleBroker) Unbind(
} else {
log.Warning("Broker configured to *NOT* launch and run APB unbind")
}

err = cleanupUnbind(&bindInstance, &serviceInstance, bindExtCreds, a.dao)
Contributor:

Why is this gone? It was part of the unbind_subscriber before; what is doing this now?

Contributor:

I see it was moved to cleanupAfterUnbind in the job_state_subscriber.

@@ -304,4 +304,12 @@ type SubscriberDAO interface {
SetState(id string, state apb.JobState) (string, error)
GetServiceInstance(id string) (*apb.ServiceInstance, error)
DeleteServiceInstance(id string) error
GetBindInstance(id string) (*apb.BindInstance, error)
DeleteBindInstance(id string) error
SetServiceInstance(id string, serviceInstance *apb.ServiceInstance) error
Contributor:

I'm not a fan of the SubscriberDAO. It feels weird to me. But that's for a different day.

Contributor:

As this is adding more and more methods, why don't we just use DAO, now that DAO is an interface itself?

We either have or easily can create a mock for it as well.

Contributor Author:

Yep, no issue with changing that, as we already have the DAO interface.

Contributor Author:

@shawn-hurley How do you feel about me creating a follow-on issue to remove the subscriber DAO and create the mocks etc.? I don't see an existing mock at the moment, and quite a few tests would need updating. I would like to get this landed and base the apb state work off master with these changes already available.

SetServiceInstance(id string, serviceInstance *apb.ServiceInstance) error
}

// WorkSubscriber - Interface tha wraps the Subscribe method
Contributor:

typo: tha -> that

Contributor:

also it says that wraps the Subscribe method, is that correct?

Contributor Author:

nope comment needs changing 👍

func NewWorkEngine(bufferSize int) *WorkEngine {
return &WorkEngine{topics: make(map[WorkTopic]chan JobMsg), bufsz: bufferSize}
func NewWorkEngine() *WorkEngine {
return &WorkEngine{jobs: make(map[string]chan JobMsg), subscribers: map[WorkTopic][]WorkSubscriber{}}
Contributor:

again, why is the channel a non buffered channel?

@maleck13 (Contributor Author), Mar 12, 2018:

As we are currently handing off the messages immediately and asynchronously, we should always be able to read from and send to the channel, so in this case there is no benefit to a buffered channel.

Where we may benefit from a buffered channel is if we used it in conjunction with a sync.WaitGroup.

Something like:

  wg := sync.WaitGroup{}

  for msg := range engine.jobs[token] {
      wg.Add(len(engine.subscribers[topic]))
      for _, sub := range engine.subscribers[topic] {
          // all subscribers receive msg at the same time; each calls wg.Done() when finished
          go sub.Notify(msg, &wg)
      }
      wg.Wait() // wait for all subscribers to be done
  }

This design would allow each of the subscribers to receive the messages at the same time, but force us to wait until they are all complete before accepting the next message. In this case, having a buffer would be valuable, as it would allow the Job to continue to send messages even if our subscribers were running a little slow.
The question really is whether there is value in waiting for all the subscribers to finish before accepting the next message.
I am leaning towards yes, due to the fact that state will be changed and modified, and there is no guarantee that the order a set of goroutines was created in will also be the order they execute in.

@@ -50,10 +50,33 @@ func (engine *WorkEngine) StartNewAsyncJob(
if token == "" {
token = engine.Token()
}
go work.Run(token, engine.topic(topic))
go engine.start(token, work, topic)
Contributor:

This feels really weird to me. Starting the engine for each job. Feels like the engine should always be running and waiting for new jobs to be added to it to be run.

Contributor:

This could be called startSubscriberListeningAndRun(). I would prefer to keep it shorter, maybe something like startJob?

pkg/mock/dao.go Outdated
@@ -86,6 +86,49 @@ func (mp *SubscriberDAO) GetServiceInstance(id string) (*apb.ServiceInstance, er
return retOb.(*apb.ServiceInstance), mp.Errs["GetServiceInstance"]
}

//DeleteBindInstance mock impl
Contributor:

space after the //


for msg := range engine.jobs[token] {
for _, sub := range engine.subscribers[topic] {
//TODO edge case consider the fact that a subscriber may never exit and so we would leak go routines
go sub.Notify(msg)
Contributor:

In regards to the comment, I do think that having a context timeout makes sense here.

I think a simple implementation would be to wrap this in a func() that waited for a configurable amount of time.

The Notify interface could then have an error return that we would log if something goes wrong; if it returns nil, then everything is all good.

...
for _, sub := range engine.subscribers[topic] {
    c := make(chan error, 1)
    go waitForNotify(engine.duration, sub, msg)
...

func waitForNotify(d time.Duration, w WorkSubscriber, j JobMessage) {
    select {
    case err := w.Notify(j):
        // handle the error, or if nil, return
    case <-d:
    }
}

@maleck13 (Contributor Author), Mar 12, 2018:

Going to chew this one over 🤔 I am not sure you can do the function call this way in the select statement

Contributor:

I think you're right; I think you need to pass a channel to the waitForNotify function and add the timeout bits to a function that gets launched with the goroutine above.

Something like this?
https://play.golang.org/p/pRjyCXfZgSi
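
For reference, a minimal runnable sketch of the pattern being discussed: pass a signal channel to waitForNotify and keep the timeout in the caller's select. (The playground's exact contents aren't reproduced here; the stub subscriber and the 3-second timeout are illustrative assumptions.)

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    type JobMsg struct{ State string }

    type WorkSubscriber interface {
        Notify(msg JobMsg)
    }

    // waitForNotify runs the subscriber's Notify and signals completion on the
    // channel; the timeout lives in the caller's select, not inside a case.
    func waitForNotify(sub WorkSubscriber, msg JobMsg, signal chan<- struct{}) {
        sub.Notify(msg)
        signal <- struct{}{}
    }

    type printSubscriber struct{}

    func (printSubscriber) Notify(msg JobMsg) { fmt.Println("notified:", msg.State) }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        defer cancel()

        // buffered so the sender never blocks if we time out and stop listening
        notifySignal := make(chan struct{}, 1)
        go waitForNotify(printSubscriber{}, JobMsg{State: "succeeded"}, notifySignal)

        select {
        case <-notifySignal:
            fmt.Println("subscriber finished in time")
        case <-ctx.Done():
            fmt.Println("subscriber timed out")
        }
    }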

Contributor Author:

Updated to take into account your feedback, plus a look at adding in the wait group and using context: https://gist.github.com/maleck13/d2c903eac6064422966d64715118345e


@shawn-hurley (Contributor):

I think this is looking really good; it does a good job of reducing complexity!

@maleck13 maleck13 force-pushed the issue-638-subcriber-improvements branch from f7be0e0 to 308ef44 on March 12, 2018 16:16
@maleck13 (Contributor Author):

Ok, so I have updated based on feedback and discussions. It is still WIP, but again I wanted to give an opportunity for feedback.
After much discussion (thanks guys), the design changed to make the start job more complex than in the original commit, but I think it provides a more robust solution.

Main changes
We have introduced a WaitGroup for each message that comes in, along with a context object with timeout.
We now have the following on a per job basis:

msg -> buffered-channel -> fanout-to-subscribers-async -> lock-until-all-subs-done -> read-next-msg

The key concern for me here was to avoid a situation where writes could be executed in the wrong order.
There is no guarantee on the order in which goroutines will execute, so without the synchronization there would be the possibility that a write could happen in the wrong order. Now, however, I do not believe this is the case, as we use the WaitGroup to ensure subscribers are finished, and the order messages come off a channel is FIFO.
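
A minimal sketch of the per-job loop described above, assuming names like jobChannels and subscriberTimeout and a simple Notify signature (approximations drawn from this thread, not the exact PR code):

    package broker

    import (
        "context"
        "sync"
        "time"
    )

    type (
        JobMsg    struct{ State string }
        WorkTopic string
    )

    type WorkSubscriber interface {
        Notify(msg JobMsg)
    }

    type WorkEngine struct {
        subscribers       map[WorkTopic][]WorkSubscriber
        jobChannels       map[string]chan JobMsg // buffered, one per job token
        subscriberTimeout time.Duration
    }

    // startJob drains a job's channel, fans each message out to every subscriber
    // asynchronously, then waits for all of them (or their timeouts) before
    // reading the next message, so state writes stay in FIFO order.
    func (engine *WorkEngine) startJob(token string, topic WorkTopic) {
        for msg := range engine.jobChannels[token] {
            var wg sync.WaitGroup
            for _, sub := range engine.subscribers[topic] {
                wg.Add(1)
                go func(sub WorkSubscriber) {
                    defer wg.Done()
                    ctx, cancel := context.WithTimeout(context.Background(), engine.subscriberTimeout)
                    defer cancel()
                    // buffered so the notifying goroutine never leaks on timeout
                    done := make(chan struct{}, 1)
                    go func() {
                        sub.Notify(msg)
                        done <- struct{}{}
                    }()
                    select {
                    case <-done: // subscriber finished normally
                    case <-ctx.Done(): // subscriber overran its budget; move on
                    }
                }(sub)
            }
            wg.Wait() // lock until all subscribers are done, then read the next msg
        }
    }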

ping @shawn-hurley @jmrodri @eriknelson

Will be adding a number of unit tests here and also looking at a metrics subscriber in the next day or so.

@jmrodri (Contributor) left a comment:

Some minor comment changes, variable name change (or question). Overall I like this updated version better.

// defaultClusterURLPreFix - prefix for the ansible service broker.
defaultClusterURLPreFix = "/ansible-service-broker"
// MsgBufferSize - The buffer for the message channel.
MsgBufferSize = 20
Contributor:

+1 this makes me feel better :)

pkg/app/app.go Outdated
@@ -213,43 +213,43 @@ func CreateApp() App {
validateRegistryNames(app.registry)

log.Debug("Initializing WorkEngine")
stateSubscriber := broker.NewJobStateSubscriber(app.dao)
Contributor:

👍 to the single stateSubscriber.

log.Debugf("JobStateSubscriber Notify : msg state %v ", msg.State)
id := msg.InstanceUUID
if isBinding(msg) {
id = msg.BindingUUID
Contributor:

Can msg.BindingUUID ever be ""? I would probably change isBinding to also ensure BindingUUID is not empty. Otherwise, we would store an empty string as the key. Maybe there is no way this could happen; I'll see how the flow goes.

Contributor Author:

It seems that if it can ever be empty, it should fail before getting here; similar with instanceUUID.
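
For illustration, a sketch of the guard being suggested; the JobMethod field and its constants here are hypothetical stand-ins for the real apb types:

    package broker

    type JobMethod string

    const (
        JobMethodBind   JobMethod = "bind"
        JobMethodUnbind JobMethod = "unbind"
    )

    type JobMsg struct {
        InstanceUUID string
        BindingUUID  string
        Method       JobMethod
    }

    // isBinding reports whether the message belongs to a bind or unbind job;
    // also requiring a non-empty BindingUUID avoids storing state under an empty key.
    func isBinding(msg JobMsg) bool {
        return (msg.Method == JobMethodBind || msg.Method == JobMethodUnbind) &&
            msg.BindingUUID != ""
    }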

topics map[WorkTopic]chan JobMsg
bufsz int
subscribers map[WorkTopic][]WorkSubscriber
jobs map[string]chan JobMsg
Contributor:

jobs is weird because these aren't the jobs. They're the channels used by the jobs for messages. The name implies the job structs are in this map but that is not the case. Calling jobs["token"] will not return me the job I created. Consider renaming it.

Contributor Author:

changed to jobChannels

@@ -50,10 +54,65 @@ func (engine *WorkEngine) StartNewAsyncJob(
if token == "" {
token = engine.Token()
}
go work.Run(token, engine.topic(topic))
go engine.startJob(token, work, topic)
Contributor:

startJob is better and more clear.

Contributor:

👍 I'm not sure it should be a private fn though? I suppose it's fine since we're really the only ones using it, but I thought it was useful to design it from the perspective of "this is a public lib". No requested changes, just spitballing.


go func() {
// listen for a new message for the job keyed to this token and hand off to the subscribers async. Wait for them all to be done before accepting
// the next message
Contributor:

Reformat this comment into a couple of shorter lines; the first one is really long.

}()
// notify the subscriber
go waitForNotify(sub, msg, notifySignal)
//act on whichever happens first the subscriber's notify method completing or the timeout
Contributor:

space after //

ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second) //TODO make configurable
// used to tell us when the subscribers notify method is completed
notifySignal := make(chan struct{})
//If our subscriber times out or returns normally we will always clean up
Contributor:

space after //

@@ -66,7 +125,7 @@ func (engine *WorkEngine) StartNewSyncJob(
token = engine.Token()
}

work.Run(token, engine.topic(topic))
engine.startJob(token, work, topic)
Contributor:

+1

return engine.topics
// GetActiveJobs - Get list of active jobs
func (engine *WorkEngine) GetActiveJobs() map[string]chan JobMsg {
return engine.jobs
Contributor:

is this really active jobs? I would expect it to return an array of jobs.

Contributor Author:

changed to GetActiveJobChannels

@maleck13 maleck13 force-pushed the issue-638-subcriber-improvements branch from 308ef44 to 3213f38 on March 20, 2018 13:23
@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 24, 2018
@maleck13 maleck13 force-pushed the issue-638-subcriber-improvements branch from 3213f38 to 2f8adeb on March 26, 2018 12:23
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 26, 2018
@maleck13 (Contributor Author) commented Mar 27, 2018

@jmrodri I added back the cleanup of unbind state when launch_apb_on_bind is false. I think this will fix the CI failure; at least it does locally when running make ci.
Wondering what you think about moving the unbind cleanup into the DAO layer, as it is essentially a facade around a set of DAO calls?

Maybe

func (d dao) DeleteBinding(bindInstance apb.BindInstance, serviceInstance apb.ServiceInstance) error {
    if err := d.DeleteBindInstance(bindInstance.ID.String()); err != nil {
        return err
    }
    serviceInstance.RemoveBinding(bindInstance.ID)
    if err := d.SetServiceInstance(serviceInstance.ID.String(), &serviceInstance); err != nil {
        return err
    }
    return nil
}

@maleck13 maleck13 force-pushed the issue-638-subcriber-improvements branch from bfcf48a to 6211d29 on March 27, 2018 16:22
@maleck13 (Contributor Author):

Added to DAO; can remove if need be.

@maleck13 maleck13 changed the title [wip] initial changes to subscriber pattern Changes to subscriber pattern Mar 27, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 27, 2018
@maleck13 (Contributor Author):

@shawn-hurley @jmrodri @eriknelson this is no longer WIP; please review when ready.

@@ -79,6 +79,9 @@ type Dao interface {
// DeleteBindInstance - Delete the binding instance for an id in the kvp API.
DeleteBindInstance(string) error

// DeleteBinding - Delete the binding instance and remove the association with the service instance.
DeleteBinding(apb.BindInstance, apb.ServiceInstance) error
Contributor:

👍

@shawn-hurley (Contributor):

@maleck13 If you could rebase this, the build should start passing

@maleck13 maleck13 force-pushed the issue-638-subcriber-improvements branch from 6211d29 to 13be109 on March 29, 2018 07:16
@eriknelson (Contributor) left a comment:

No requests for change; I love how much code this removed, it's a great improvement. Thanks @maleck13

case <-ctx.Done():
return
default:
signal <- struct{}{}
Contributor:

Is this just a trigger? Usually I see bools used.
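
For context, chan struct{} is the idiomatic zero-size signal type in Go: a bool channel would work too, but the empty struct carries no data and makes the "event only" intent explicit. A tiny sketch:

    package main

    import "fmt"

    func main() {
        done := make(chan struct{})
        go func() {
            fmt.Println("working...")
            done <- struct{}{} // the value carries no data; only the event matters
        }()
        <-done // block until signaled
        fmt.Println("done")
    }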

wg.Add(1)
// ensure things don't get locked up.
// Each subscriber has up to the configured amount of time to complete its action
ctx, cancel := context.WithTimeout(context.Background(), engine.subscriberTimeout*time.Second)
Contributor:

This is so much better 👍

@eriknelson eriknelson merged commit a3b1ceb into openshift:master Apr 4, 2018