DurableExecutorService re-executes completed tasks in case of node failure. #9965

Closed
shalakhansidmul opened this issue Feb 22, 2017 · 3 comments

@shalakhansidmul

commented Feb 22, 2017

Hello,
I tried the following use case and have a question about how tasks are re-executed after a node failure.

https://github.com/shalakhansidmul/hazelcast_test
In the provided example, there are 4 tasks:
SumTask, PutSumInAMapTask, SquareTask, PrintTheSquareTask.
All are HazelcastInstanceAware.
Order of execution:

Node1 will submit SumTask to the executor.
Just before returning, SumTask will submit PutSumInAMapTask to the executor.
Just before returning, PutSumInAMapTask will submit SquareTask to the executor.
Just before returning, SquareTask will submit PrintTheSquareTask to the executor.
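
For readers without the linked repository at hand, the chaining pattern in the list above can be sketched roughly as follows. This is a minimal sketch, not the reporter's code: a plain java.util.concurrent.ExecutorService stands in for Hazelcast's DurableExecutorService so that the example runs without a cluster, and the class name ChainedTasksSketch and the placeholder task bodies are assumptions.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ChainedTasksSketch {
    // Names mirror the four tasks in the report; the bodies are placeholders.
    static final String[] CHAIN = {
        "SumTask", "PutSumInAMapTask", "SquareTask", "PrintTheSquareTask"
    };

    // Builds the task at position `index`; it submits its successor just before returning.
    static Runnable task(int index, ExecutorService executor,
                         List<String> log, CountDownLatch done) {
        return () -> {
            log.add(CHAIN[index]); // stand-in for the task's real work
            if (index + 1 < CHAIN.length) {
                executor.submit(task(index + 1, executor, log, done));
            }
            done.countDown();
        };
    }

    // Runs the whole chain once and returns the order in which tasks executed.
    static List<String> runChain() {
        // Assumption: a local thread pool replaces the distributed executor.
        ExecutorService executor = Executors.newFixedThreadPool(2);
        List<String> log = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(CHAIN.length);
        try {
            executor.submit(task(0, executor, log, done));
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("chain did not finish in time");
            }
            return log;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // prints [SumTask, PutSumInAMapTask, SquareTask, PrintTheSquareTask]
        System.out.println(runChain());
    }
}
```

Because each task submits its successor only after doing its own work, the chain runs strictly in order even though the pool has more than one thread.
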
There are 4 nodes in my Hazelcast cluster.
Steps:
Start the MasterMember (Node1) and three SlaveMembers (Node2, Node3, Node4).
Observation:
SumTask is submitted from Node1 and executes on Node2.
It submits PutSumInAMapTask and returns.
PutSumInAMapTask happens to execute on Node2 as well.
While PutSumInAMapTask is still running, I kill Node2.
The Hazelcast cluster repartitions the data and re-executes both SumTask and PutSumInAMapTask on Node3.
Why is SumTask executed again even though it had already completed?
Ideally, only PutSumInAMapTask should be resumed or restarted, right?

@jerrinot jerrinot added this to the 3.9 milestone Feb 23, 2017

@tkountis tkountis self-assigned this Feb 27, 2017

tkountis added a commit to tkountis/hazelcast that referenced this issue Mar 1, 2017

Fix re-execution of completed task when killing the owner node
When a task is submitted to a key owner, and the task completes, we replace the runnable
with the result in the internal ring buffer. However, there is no backup-operation to sync
the result with the backup nodes. If the node shuts down gracefully then the replication
will properly handle it, but when the node is killed, the backup nodes still have the task
in their ring-buffer rather than the result, thus, we re-execute it.

Fix hazelcast#9965
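
The cause described in this commit message can be illustrated with a toy model. It is only a sketch under stated assumptions: two HashMaps stand in for the ring buffers of the owner and the backup, and the names BackupSyncSketch, Task, and syncResultToBackup are hypothetical. None of this is Hazelcast's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BackupSyncSketch {
    // Hypothetical task type: records its execution and returns a dummy result.
    static final class Task {
        final String name;
        Task(String name) { this.name = name; }
        Integer call(List<String> log) {
            log.add(name);
            return 42;
        }
    }

    // Returns the list of executions seen when the owner is killed after the task completed.
    static List<String> executionsAfterOwnerKilled(boolean syncResultToBackup) {
        List<String> log = new ArrayList<>();
        Map<Integer, Object> owner = new HashMap<>();   // slot -> Task or result
        Map<Integer, Object> backup = new HashMap<>();

        // Submission: the task is stored on the owner and replicated to the backup.
        Task sum = new Task("SumTask");
        owner.put(0, sum);
        backup.put(0, sum);

        // Completion: the owner replaces the task with its result.
        Integer result = sum.call(log);
        owner.put(0, result);
        if (syncResultToBackup) {
            backup.put(0, result); // the missing step: sync the result to the backup
        }

        // Owner is killed (no graceful replication). The backup is promoted and
        // re-runs every slot that still holds a task instead of a result.
        for (Map.Entry<Integer, Object> slot : backup.entrySet()) {
            if (slot.getValue() instanceof Task) {
                slot.setValue(((Task) slot.getValue()).call(log));
            }
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(executionsAfterOwnerKilled(false)); // prints [SumTask, SumTask]
        System.out.println(executionsAfterOwnerKilled(true));  // prints [SumTask]
    }
}
```

Without the sync step the promoted backup still sees a task in the slot and runs it a second time; with it, the slot already holds the result and nothing is re-executed.
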
@tkountis

Contributor

commented Mar 1, 2017

Hi @shalakhansidmul, thanks for the finding. We have identified the cause and pushed a fix in the PR above. Feel free to check out the code and give it a try.

@shalakhansidmul

Author

commented Mar 2, 2017

Hello @tkountis
I tried the solution you have provided.
It works like a charm :)
Thank you for the quick resolution of my problem. :) great job !
I have two more requirements for the durable executor:
1. Support for submitting a callback along with the task and persisting it, so that the callback survives node failure and task failover.
2. Support for providing a member selector for task execution.

I will file a separate issue for them. If these are also resolved soon, it would be perfect!
Thanks again.

@tkountis

Contributor

commented Mar 2, 2017

Glad that it works for you, @shalakhansidmul. Hopefully we will merge it soon, and it will be available in the next SNAPSHOT and in the upcoming 3.8.1 patch release.

----- Following discussion unrelated to this issue -----

Regarding your feature requests: they introduce new API, which can only be considered for upcoming versions, so you won’t get as fast a resolution as on this one. It will also depend on how useful they are to other users, that is, whether we see enough up-votes.

However, reading through your request, I can offer you an alternative that might make it easier for you to accomplish what you want.

Hazelcast has recently released the ScheduledExecutorService, which also allows scheduling on members. Keep in mind, though, that tasks scheduled on members are not durable in the event of a failure; tasks scheduled on key owners are. For your callback-chaining needs, I would recommend using a StatefulTask on the ScheduledExecutorService and keeping the last completed task in its state. If the task gets migrated due to a node failure, you can check the state and continue from your last known condition. I think it will be a much cleaner approach. Now, you might say that you don’t want to schedule the task but execute it immediately; if you specify 0 as the delay, you will get behaviour similar to the executor. Have a go at it and let us know if the result is any better.
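
The idea of keeping the last completed task in the state and resuming from it after a migration can be sketched as below. This is an illustration under assumptions, not the Hazelcast API itself: the local Snapshotable interface is a hypothetical stand-in whose save/load shape is modelled on StatefulTask, a HashMap plays the role of the replicated snapshot, and the scheduler and the cluster are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StatefulResumeSketch {
    // Hypothetical stand-in for a stateful-task contract (save/load of a snapshot map).
    interface Snapshotable {
        void save(Map<String, Integer> snapshot);
        void load(Map<String, Integer> snapshot);
    }

    // One task that walks through the chain, one step per invocation.
    static final class ResumableChain implements Runnable, Snapshotable {
        static final String[] STEPS = {
            "SumTask", "PutSumInAMapTask", "SquareTask", "PrintTheSquareTask"
        };
        private int lastCompleted = -1;              // index of the last finished step
        final List<String> executed = new ArrayList<>();

        @Override
        public void run() {
            int next = lastCompleted + 1;
            if (next >= STEPS.length) {
                return;                              // whole chain already done
            }
            executed.add(STEPS[next]);               // stand-in for the step's real work
            lastCompleted = next;
        }

        @Override
        public void save(Map<String, Integer> snapshot) {
            snapshot.put("lastCompleted", lastCompleted);
        }

        @Override
        public void load(Map<String, Integer> snapshot) {
            lastCompleted = snapshot.getOrDefault("lastCompleted", -1);
        }
    }

    // Simulates two steps on one node, a failure, and a resume on another node.
    static List<String> stepsRunAfterMigration() {
        Map<String, Integer> snapshot = new HashMap<>();

        ResumableChain onNode2 = new ResumableChain();
        onNode2.run(); onNode2.save(snapshot);       // SumTask completed
        onNode2.run(); onNode2.save(snapshot);       // PutSumInAMapTask completed

        // Node2 is killed; a fresh instance on Node3 restores the saved state.
        ResumableChain onNode3 = new ResumableChain();
        onNode3.load(snapshot);
        onNode3.run();
        onNode3.run();
        return onNode3.executed;
    }

    public static void main(String[] args) {
        System.out.println(stepsRunAfterMigration()); // prints [SquareTask, PrintTheSquareTask]
    }
}
```

Only the steps that had not completed before the failure run on the new node, which is the behaviour the original report asked for.
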

tkountis added a commit that referenced this issue Mar 6, 2017

Fix re-execution of completed task when killing the owner node (#9995)
Fix #9965