Fix rescheduling on stopped tasks after migrating back to origin #10604

tkountis · 2017-05-17T13:16:44Z

Starting a task on a single node and then adding one more member causes some of the tasks to migrate to partitions owned by the new member.
During this process, the container stops the currently running task, leaving it in a cancelled state. If this state doesn't get disposed (e.g. the cluster never gets bigger), then when the new member goes down, the tasks migrate back to the first one. However, since their state there is marked as cancelled, they never get rescheduled.

The fix introduces a stop method, which still cancels the task but doesn't interact with the state.
Fix #10603

Also, minor cleanup and renaming.
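
To make the state problem concrete, here is a minimal sketch. It is not the code from this PR: `MigrationStopSketch`, `Descriptor`, `shouldSchedule()` and `reschedulableAfter()` are hypothetical names, a plain `CompletableFuture` stands in for the real scheduled future, and the rule "only tasks without a result get scheduled again" is an assumption based on the description above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicReference;

final class MigrationStopSketch {

    /** Hypothetical, heavily simplified stand-in for a scheduled task descriptor. */
    static final class Descriptor {
        // A non-null value means the task has a final result (here: "cancelled").
        private final AtomicReference<String> result = new AtomicReference<>();
        private Future<?> future;

        void setFuture(Future<?> future) {
            this.future = future;
        }

        /** User-style cancel: records a result, so the task is treated as done for good. */
        boolean cancel(boolean mayInterrupt) {
            if (!result.compareAndSet(null, "cancelled") || future == null) {
                return false;
            }
            return future.cancel(mayInterrupt);
        }

        /** Migration-style stop: halts the local run but leaves the result unset. */
        void stopForMigration() {
            if (future != null) {
                future.cancel(true);
                future = null;
            }
        }

        /** Assumed rule: only descriptors without a result are scheduled again. */
        boolean shouldSchedule() {
            return result.get() == null;
        }
    }

    /** Stops a running task the old or the new way, then asks whether it may be scheduled again. */
    static boolean reschedulableAfter(boolean useStopForMigration) {
        Descriptor descriptor = new Descriptor();
        descriptor.setFuture(new CompletableFuture<Void>());
        if (useStopForMigration) {
            descriptor.stopForMigration();
        } else {
            descriptor.cancel(true);
        }
        return descriptor.shouldSchedule();
    }

    public static void main(String[] args) {
        System.out.println("after cancel():           " + reschedulableAfter(false));
        System.out.println("after stopForMigration(): " + reschedulableAfter(true));
    }
}
```

In this sketch, stopping through `cancel()` leaves a result behind and blocks rescheduling, while `stopForMigration()` keeps the descriptor eligible when the partition comes back.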

Donnerbart · 2017-05-17T13:34:37Z

hazelcast/src/main/java/com/hazelcast/scheduledexecutor/impl/ScheduledTaskStatisticsImpl.java

-        return "ScheduledTaskStatisticsImpl{ runs=" + runs + ", createdAt="
-                + createdAt + ", firstRunStart=" + firstRunStart + ", lastRunStart=" + lastRunStart + ", lastRunEnd=" + lastRunEnd
-                + ", lastIdleTime=" + lastIdleTime + ", totalRunTime=" + totalRunTime + ", totalIdleTime=" + totalIdleTime + '}';
+        return "ScheduledTaskStatisticsImpl{ runs=" + runs


ScheduledTaskStatisticsImpl{runs= without a space matches the majority of our toString() methods.

Donnerbart · 2017-05-17T13:36:13Z

hazelcast/src/test/java/com/hazelcast/scheduledexecutor/ScheduledExecutorServiceSlowTest.java

+        HazelcastInstance second = factory.newHazelcastInstance();
+        waitAllForSafeState(first, second);
+
+        // Kill the second member, tasks should now get rescheduled back in first member


How can you be sure the task was migrated to the second member in between? assertTrueEventually(new AllTasksRunning(scheduler)); should also pass if there was no migration at all, shouldn't it?

Well, I am not sure. I can add an extra check to see specifically whether there are tasks in the second member, but I tried to keep the taskCount rather high so that at least some will migrate. But yes, it's not deterministic. I will modify it to make it so.

Donnerbart · 2017-05-17T13:39:20Z

hazelcast/src/test/java/com/hazelcast/scheduledexecutor/ScheduledExecutorServiceSlowTest.java

+        HazelcastInstance first = factory.newHazelcastInstance();
+
+        int tasksCount = 1000;
+        final IScheduledExecutorService scheduler = getScheduledExecutor(new HazelcastInstance[] {first }, "scheduler");


I thought there would be more logic inside the getScheduledExecutor() method, but since there isn't, I would just use first.getScheduledExecutorService("scheduler");

It's used in the client tests and it's slightly different there, e.g. ClientScheduledExecutorServiceBasicTest.

I don't want to remove the whole method. But this test looks like it doesn't need it, since it requires constructing an array (which looks more complex than just retrieving the scheduler).

devOpsHazelcast · 2017-05-17T13:43:23Z

Test PASSed.

devOpsHazelcast · 2017-05-17T14:51:12Z

Test PASSed.

ruslan-belinskyy · 2017-08-07T17:32:37Z

Guys,
I'm currently getting:

java.lang.NullPointerException
	at com.hazelcast.scheduledexecutor.impl.ScheduledTaskDescriptor.stopForMigration(ScheduledTaskDescriptor.java:146) ~[hazelcast-all-3.8.3.jar:3.8.3]

When trying to schedule the task:

IScheduledExecutorService service = hazelcastInstance.getScheduledExecutorService(executorName);
service.scheduleAtFixedRate(new HzcastTimerExchangeSender(this.getEndpoint().getEndpointUri(), executorName), endpoint.getDelay(), endpoint.getPeriod(), TimeUnit.MILLISECONDS);

I have looked at the fix you have here; it doesn't look safe and could cause the issue I have.

You had:

try {
   descriptor.cancel(true);
   descriptor.setScheduledFuture(null);
   descriptor.setTaskOwner(false);
} catch (Exception ex) {
   throw rethrow(ex);
}

And now:

try {
    descriptor.stopForMigration();
} catch (Exception ex) {
   throw rethrow(ex);
}

Where descriptor.stopForMigration() has:

void stopForMigration() {
    // Result is not set, allowing task to get re-scheduled, if/when needed.
    this.isTaskOwner = false;
    this.future.cancel(true); // NullPointerException here
    this.future = null;
}

And in the old code:

descriptor.cancel(true);

Which has:

boolean cancel(boolean mayInterrupt)
        throws ExecutionException, InterruptedException {
    if (!resultRef.compareAndSet(null, new ScheduledTaskResult(true)) || future == null) {
        return false;
    }

    return future.cancel(mayInterrupt);
}

This looks way safer.

I get the NullPointerException on 2 machines, both running the code you can see above.
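
For illustration only, the difference between the two shapes can be reproduced outside Hazelcast. The sketch below is an assumption-laden model, not the library's code and not necessarily the change made for the follow-up issue: `NullSafeStopSketch`, `stopUnguarded()`, `stopGuarded()` and `throwsNpe()` are hypothetical names, and `java.util.concurrent.Future` stands in for the real future type. It shows that the quoted `stopForMigration()` fails whenever `future` is still null, while a null check in the spirit of the old `cancel()` does not.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

final class NullSafeStopSketch {

    /** Hypothetical, simplified descriptor holding only the fields relevant here. */
    static final class Descriptor {
        private boolean isTaskOwner = true;
        private Future<?> future; // stays null until something is actually scheduled

        void setFuture(Future<?> future) {
            this.future = future;
        }

        /** Mirrors the quoted stopForMigration(): dereferences future unconditionally. */
        void stopUnguarded() {
            this.isTaskOwner = false;
            this.future.cancel(true); // throws NullPointerException when future is null
            this.future = null;
        }

        /** Hypothetical guarded variant: skips the cancel when there is no future. */
        void stopGuarded() {
            this.isTaskOwner = false;
            if (this.future != null) {
                this.future.cancel(true);
                this.future = null;
            }
        }

        boolean isTaskOwner() {
            return isTaskOwner;
        }
    }

    /** Runs the action and reports whether it failed with a NullPointerException. */
    static boolean throwsNpe(Runnable action) {
        try {
            action.run();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Descriptor neverScheduled = new Descriptor();
        System.out.println("unguarded throws NPE: " + throwsNpe(neverScheduled::stopUnguarded));
        System.out.println("guarded throws NPE:   " + throwsNpe(neverScheduled::stopGuarded));

        Descriptor scheduled = new Descriptor();
        scheduled.setFuture(new CompletableFuture<Void>());
        System.out.println("guarded, with future: " + throwsNpe(scheduled::stopGuarded));
    }
}
```

The guard keeps the intent of the fix (no result is set, so the task can still be rescheduled) while tolerating descriptors that have no running future.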

mmedenjak · 2017-08-07T17:55:45Z

@Batter2014 can you please submit a new issue for this?

ruslan-belinskyy · 2017-08-07T17:56:36Z

Yep, just did that: #11047

tkountis added Team: Core Type: Defect labels May 17, 2017

tkountis added this to the 3.9 milestone May 17, 2017

tkountis self-assigned this May 17, 2017

Donnerbart reviewed May 17, 2017

Fix rescheduling on stopped tasks after migrating back to origin

c094417

tkountis force-pushed the fix/3.9/reschedule_stoped_after_migrating_back_to_origin branch from 1cd9ac8 to c094417 Compare May 17, 2017 14:23

Donnerbart approved these changes May 17, 2017

gurbuzali approved these changes May 22, 2017

tkountis merged commit 273fe67 into hazelcast:master May 22, 2017

tkountis deleted the fix/3.9/reschedule_stoped_after_migrating_back_to_origin branch May 22, 2017 08:35

ruslan-belinskyy mentioned this pull request Aug 7, 2017

[scheduled-executor] ScheduledTaskDescriptor.stopForMigration Nullpointer #11047

Closed

mmedenjak added the Source: Internal PR or issue was opened by an employee label Apr 13, 2020
