
Cleanup/simplify shutdown orchestration #97

Open
nsauro opened this issue Jul 13, 2016 · 2 comments

Comments

@nsauro
Contributor

nsauro commented Jul 13, 2016

The orchestration of shutdown between QueueProcessor, Queue, and Workers is a bit duplicative in terms of who is watching whom.

Currently the QueueProcessor receives the initial Shutdown message and tells the Queue to shut down. The Queue then sends the Workers NoWorkLeft, which begins their termination. However, if the Queue still has Work messages queued, it is unclear what happens to them (I think they just get dropped?). The Queue then enters the retiring state, where it waits for a RetiringTimeout message. Once it receives this, it sends more NoWorkLeft messages to the Workers and then shuts itself down. This termination is watched by both the QueueProcessor and the Workers, which then react to it further: the QueueProcessor will send Retire signals to the Workers.
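The shutdown sequence above can be sketched as a small state machine. This is a plain-Scala illustration, not Kanaloa's actual code; the message and state names follow the discussion, everything else is an assumption.

```scala
// Illustrative model of the Queue's shutdown transitions, not Kanaloa's API.
sealed trait Msg
case object Retire extends Msg          // sent to the Queue after Shutdown
case object RetiringTimeout extends Msg // scheduled once retiring begins
case object NoWorkLeft extends Msg      // Queue -> Workers, begins their termination

sealed trait QueueState
case object Running extends QueueState
case object Retiring extends QueueState   // still trying to drain leftover work
case object Terminated extends QueueState // watched by QueueProcessor and Workers

// Transition function for the Queue as described in the issue text.
def step(state: QueueState, msg: Msg): QueueState = (state, msg) match {
  case (Running, Retire)           => Retiring   // send Workers NoWorkLeft, start timer
  case (Retiring, RetiringTimeout) => Terminated // send NoWorkLeft again, stop self
  case (s, _)                      => s          // other messages leave the state alone
}
```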

I don't think there are any real adverse effects here, other than the potential for Work being dropped (which I need to verify), but this could be simplified a bit.

@kailuowang
Member

The Queue tries to finish all work already enqueued; that's the intention.
Upon receiving the Retire, it first does a dispatchWork, which tries to dispatch all the work in the queue.
Once called, dispatchWork will exhaust either the work queue or the worker queue. Thus, if there are still workers left, all the work in the queue has been dispatched and the remaining workers can be retired, so it sends them the NoWorkLeft message. If there is still work left in the queue, there won't be any Worker left to receive that NoWorkLeft message.
Then, in the retiring state, the Queue will try to dispatch multiple times to finish the leftover work.
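That exhaustion invariant can be shown with a minimal sketch: pair queued work with idle workers until one side runs out. The function name echoes dispatchWork from the discussion, but the signature and types here are illustrative assumptions, not Kanaloa's actual API.

```scala
// Pair work items with idle workers; zip stops when either list is exhausted.
def dispatchWork[W, A](work: List[W], idleWorkers: List[A]): (List[(A, W)], List[W], List[A]) = {
  val dispatched = idleWorkers.zip(work)
  val workLeft   = work.drop(dispatched.size)        // leftover work, if workers ran out
  val idleLeft   = idleWorkers.drop(dispatched.size) // leftover workers, if work ran out
  (dispatched, workLeft, idleLeft)
}
```

After every call, at most one of workLeft and idleLeft is non-empty: if workers remain, the queue is drained and they can safely receive NoWorkLeft; if work remains, there is no idle worker left to receive that message.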

Bottom line: ShutdownGracefully does mean shut down gracefully, with no work left behind.
Another factor that adds to the confusion is that other actors such as Worker and QueueProcessor watch the Queue, but that's more a sanity check than part of the shutdown process. If the Queue accidentally dies, there is no reason to keep either the QueueProcessor or the Workers around. The whole system had better be recreated by the supervisor, which is the Dispatcher, although we need some tests around that.

@kailuowang kailuowang modified the milestones: 0.4.0, 1.0-RC1 Jul 20, 2016
@nsauro
Contributor Author

nsauro commented Jul 29, 2016

I've been poking around a bit at this.

I have a basic idea which I just wanted to put down in writing somewhere, so I figured where better than here 😄

So, here's a rundown of things I think we need to address, one way or another:

  • Currently, there is potential for Work loss. I.e., if the Queue receives the final RetiringTimeout and there is still Work in the queue, that Work goes with it. To fix this, we could easily send a message back to the original sender at this point, similar to EnqueueRejected(work). While the message name is slightly off, it might be better to reuse the same public message that is currently used to reject work than to introduce a new one.
  • The above idea, of having Work which was enqueued subsequently get rejected, started to make me think that we might be missing something from acknowledgement. Currently, the sender can know when Work is enqueued or rejected, but the sender never knows when the Work was completed (one way or another). This might be handy to have, since it allows the sender (not the replyTo) to track the status of Work from start to finish and fully react to it. Definitely a thing up for debate. In either event, we would want messages like these to be optional to send.
  • So, the idea behind graceful shutdown is that all work completes and then all components shut themselves down. The only slight wrinkle I see with this is that in the retiring state we are still actively scheduling work (if it is already enqueued). The issue is, we have no control over, or guarantee of, whether work we schedule in this state will complete, because of the RetireTimeout. Figure: if during the retiring state, right at T-1 (before shutdown), a Worker gets new work, sends a message to its Routee, and then the system shuts down, that message is now in limbo (from the point of view of Kanaloa's end user). Did it go through? Did it time out? If this was a transaction of some sort, there is no record of what happened (the replyTo and the sender both get nothing). The thing that rubs me the wrong way here is that during graceful shutdown we can actively be increasing our chances of not having a graceful shutdown (by scheduling more work).

I think the safest thing to do (as things are implemented currently) is, upon retire, to immediately reject all queued messages (similar to how they would need to be rejected in the first point above). This gives the sender the ability to confidently handle work that we had no guarantee of executing, and to react accordingly. To the sender, this is just another rejection event, which it already has to handle.
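The reject-on-retire proposal could look something like the following sketch. EnqueueRejected is the existing message the thread suggests reusing; the Work shape and everything else here are assumptions for illustration, not Kanaloa's actual types.

```scala
// Hypothetical model: on retire, bounce every still-queued item back to its
// original sender as a rejection instead of continuing to dispatch.
case class Work[T](payload: T, sender: String)
case class EnqueueRejected[T](work: Work[T])

def rejectAllOnRetire[T](queued: Vector[Work[T]]): Vector[EnqueueRejected[T]] =
  queued.map(w => EnqueueRejected(w)) // each rejection is routed back to w.sender
```

With this in place, graceful shutdown only has to wait on in-flight work, and every sender sees a rejection event it already knows how to handle.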

The graceful shutdown would then be a signal for "in flight" operations. I.e.: did everything that was in flight succeed, and were all unexecuted queued messages successfully returned to the sender?

I know this is a bit of a departure from what was originally envisioned, but personally, I feel like we need to do our absolute best to make sure Kanaloa never drops work.

One interesting thing to note, and one thing we might want to explore, is that there actually might be diverging behaviors on shutdown between the pulling and pushing models.

One other thing to note, based on our earlier chats about this, is that there might actually be a case for implementing multiple types of shutdown behavior. I think there were many good points brought up about both the current and potential implementations, and they don't need to be mutually exclusive.
