Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rescheduling on stopped tasks after migrating back to origin #10604

Conversation

tkountis
Copy link
Contributor

Starting a task on a single node, and then adding one more member, causes some of the tasks to migrate to partitions owned by the latter member.
During this process, the container stops currently running task, leaving it in a cancelled state. If this state, doesn't get disposed (eg. cluster never gets bigger), when the latter member goes down, the tasks will migrate back to the first one. However, since their state there is marked as cancelled, they never get re-scheduled.

The fix, introduces a stop method, which still cancels the task, but doesn't interact with the state.
Fix #10603

Also, minor cleanup and namings.

@tkountis tkountis added this to the 3.9 milestone May 17, 2017
@tkountis tkountis self-assigned this May 17, 2017
return "ScheduledTaskStatisticsImpl{ runs=" + runs + ", createdAt="
+ createdAt + ", firstRunStart=" + firstRunStart + ", lastRunStart=" + lastRunStart + ", lastRunEnd=" + lastRunEnd
+ ", lastIdleTime=" + lastIdleTime + ", totalRunTime=" + totalRunTime + ", totalIdleTime=" + totalIdleTime + '}';
return "ScheduledTaskStatisticsImpl{ runs=" + runs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScheduledTaskStatisticsImpl{runs= without a space matches the majority of our toString() methods.

HazelcastInstance second = factory.newHazelcastInstance();
waitAllForSafeState(first, second);

// Kill the second member, tasks should now get rescheduled back in first member
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can you be sure the task was migrated to the second member in between? assertTrueEventually(new AllTasksRunning(scheduler)); should also pass if there was no migration at all, shouldn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I am not sure, I can add an extra check to see if there are tasks in the second member, specifically, but I tried to keep the taskCount rather high so that at least some will migrate. But yes, its not deterministic. I will modify to make it so.

HazelcastInstance first = factory.newHazelcastInstance();

int tasksCount = 1000;
final IScheduledExecutorService scheduler = getScheduledExecutor(new HazelcastInstance[] {first }, "scheduler");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there would be more logic inside the getScheduledExecutor() method, but since it isn't I would just use first.getScheduledExecutorService("scheduler");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its used in the Client tests and its slightly different there. eg. ClientScheduledExecutorServiceBasicTest

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to remove the whole method. But this test looks like it doesn't need it, since it requires to construct an array (which looks more complex than just retrieving the scheduler).

@devOpsHazelcast
Copy link
Collaborator

Test PASSed.

@tkountis tkountis force-pushed the fix/3.9/reschedule_stoped_after_migrating_back_to_origin branch from 1cd9ac8 to c094417 Compare May 17, 2017 14:23
@devOpsHazelcast
Copy link
Collaborator

Test PASSed.

@tkountis tkountis merged commit 273fe67 into hazelcast:master May 22, 2017
@tkountis tkountis deleted the fix/3.9/reschedule_stoped_after_migrating_back_to_origin branch May 22, 2017 08:35
@ruslan-belinskyy
Copy link

ruslan-belinskyy commented Aug 7, 2017

Guys,
i'm currently getting:

java.lang.NullPointerException
	at com.hazelcast.scheduledexecutor.impl.ScheduledTaskDescriptor.stopForMigration(ScheduledTaskDescriptor.java:146) ~[hazelcast-all-3.8.3.jar:3.8.3]

When trying to schedule the Task:

IScheduledExecutorService service = hazelcastInstance.getScheduledExecutorService(executorName);
service.scheduleAtFixedRate(new HzcastTimerExchangeSender(this.getEndpoint().getEndpointUri(), executorName), endpoint.getDelay(), endpoint.getPeriod(), TimeUnit.MILLISECONDS);

I have looked to the fix you have here and it doesn't look safe and could cause issue i have.

You had:

try {
   descriptor.cancel(true);
   descriptor.setScheduledFuture(null);
   descriptor.setTaskOwner(false);
} catch (Exception ex) {
   throw rethrow(ex);
}

And now:

try {
    descriptor.stopForMigration();
} catch (Exception ex) {
   throw rethrow(ex);
}

Where descriptor.stopForMigration() has:

void stopForMigration() {
 +        // Result is not set, allowing task to get re-scheduled, if/when needed.
 +        this.isTaskOwner = false;
 +        this.future.cancel(true); //Nullpointer here
 +        this.future = null;
 +    }

And in old code :

descriptor.cancel(true);

Which has:

boolean cancel(boolean mayInterrupt)
            throws ExecutionException, InterruptedException {
        if (!resultRef.compareAndSet(null, new ScheduledTaskResult(true)) || future == null) {
            return false;
        }

        return future.cancel(mayInterrupt);
    }

This look way safer.

Nullpointer i have on 2 machines which have code you can see above.

@mmedenjak
Copy link
Contributor

@Batter2014 can you please submit a new issue for this?

@ruslan-belinskyy
Copy link

Yep, just did that: #11047

@mmedenjak mmedenjak added the Source: Internal PR or issue was opened by an employee label Apr 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Source: Internal PR or issue was opened by an employee Team: Core Type: Defect
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scheduled task remains cancelled after migration
6 participants