
Outdated checkpoint-info in clustered environment #87

Closed
PoloShock opened this issue Nov 25, 2016 · 8 comments

@PoloShock

In a clustered environment with a central JDBC repository, outdated reader checkpoint info is loaded when switching executions between instances of the cluster. I will describe the issue by example:

Let there be server A and server B in a cluster:

  1. a job is started on server A
  2. the job runs, executing chunk items and storing reader checkpoint info in the JDBC database
  3. the job is stopped, leaving the checkpoint info at count 100
  4. the job is restarted on server B
  5. server B resumes from the checkpoint info holding count 100
  6. server B is stopped after fifty more chunk items are done, leaving the checkpoint info at 150
  7. the job is restarted on server A
  8. server A resumes from outdated checkpoint info holding a count of 100, which is what it was the last time this server stopped the job
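
To make the sequence concrete, it maps roughly onto these JobOperator calls (a minimal sketch; in reality each call runs on a different cluster node, the job name "chunkJob" is a placeholder, and one would wait for each execution to reach STOPPED before restarting):

```java
import java.util.Properties;

import javax.batch.operations.JobOperator;
import javax.batch.runtime.BatchRuntime;

public class RestartScenarioSketch {
    public static void main(String[] args) {
        final JobOperator op = BatchRuntime.getJobOperator();

        // steps 1-3, on server A: start the job, then stop it; the reader
        // checkpoint persisted in the JDBC repository is at count 100
        final long exec1 = op.start("chunkJob", new Properties());
        op.stop(exec1);

        // steps 4-6, on server B: restart, resume from count 100, stop at count 150
        final long exec2 = op.restart(exec1, new Properties());
        op.stop(exec2);

        // steps 7-8, back on server A: restart again; it should resume from 150,
        // but the stale cached JobExecutionImpl makes it resume from 100
        op.restart(exec2, new Properties());
    }
}
```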

After some debugging, it seems JBeret is caching JobExecutionImpl objects in AbstractPersistentRepository which are not up to date during the job restart process.

I am currently using JBeret 1.2.0.Final.
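
For context, the "count" above is simply the Serializable that the chunk reader returns from checkpointInfo(); the batch runtime persists it with the step execution and passes it back to open() on restart. A minimal sketch of such a reader (the bean name and the plain int counter are placeholders, not the actual reader from my application):

```java
import java.io.Serializable;

import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Named;

@Named("countingReader")
public class CountingReader extends AbstractItemReader {
    private int count;

    @Override
    public void open(final Serializable checkpoint) {
        // on restart, the runtime passes back the last persisted checkpoint;
        // in the scenario above this should be 150, but the stale cache yields 100
        count = checkpoint == null ? 0 : (Integer) checkpoint;
    }

    @Override
    public Object readItem() {
        return count < 200 ? count++ : null;  // returning null ends the step
    }

    @Override
    public Serializable checkpointInfo() {
        // persisted to the JDBC repository at every chunk boundary
        return count;
    }
}
```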

@chengfang chengfang self-assigned this Nov 26, 2016
@chengfang
Contributor

At step 7, you are restarting the job execution (jobExecution2) that was stopped at step 6 on server B. jobExecution2 is different from the one that you started at step 1 on server A, so jobExecution2 is not cached on server A.

I think at step 7, JBeret is getting the right job execution (jobExecution2) from db. But the retrieval of jobExecution2 from db does not include its constituent step executions. So JBeret mistakenly uses the obsolete step execution from jobExecution1.

@chengfang
Contributor

The fix should be something like this:

diff --git a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
index 661ec7f..45ea516 100644
--- a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
+++ b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
@@ -168,7 +168,12 @@ public abstract class AbstractPersistentRepository extends AbstractRepository im
     public StepExecutionImpl findOriginalStepExecutionForRestart(final String stepName,
                                                                  final JobExecutionImpl jobExecutionToRestart,
                                                                  final ClassLoader classLoader) {
-        for (final StepExecution stepExecution : jobExecutionToRestart.getStepExecutions()) {
+        List<StepExecution> stepExecutions = jobExecutionToRestart.getStepExecutions();
+        if (stepExecutions.isEmpty()) {
+            stepExecutions = getStepExecutions(jobExecutionToRestart.getExecutionId(), classLoader);
+        }
+
+        for (final StepExecution stepExecution : stepExecutions) {
             if (stepName.equals(stepExecution.getStepName())) {
                 return (StepExecutionImpl) stepExecution;
             }

The above diff is based on the master branch, so the line numbers may be different for 1.2.0.

@chengfang
Contributor

Can you try the above patch in your app? I'll set up a similar test scenario, but it will not be identical to yours.

Which version of WildFly are you using?

@chengfang
Contributor

Cloned this issue in JIRA: JBERET-285
https://issues.jboss.org/browse/JBERET-285

@chengfang
Contributor

In fact, JdbcRepository (a subclass of AbstractPersistentRepository) already overrides the findOriginalStepExecutionForRestart method to retrieve the original failed/stopped step from the db, but it calls super.findOriginalStepExecutionForRestart first in the hope of finding the right step in the cache to save db access cost. In this case, the cache happens to contain an obsolete step by the same name. I think the following fix should also work, and I will research more to see which one is better.

/Users/cfang/dev/jsr352 > git diff -U10
diff --git a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
index 661ec7f..0ff853f 100644
--- a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
+++ b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
@@ -166,31 +166,13 @@ public abstract class AbstractPersistentRepository extends AbstractRepository im

     @Override
     public StepExecutionImpl findOriginalStepExecutionForRestart(final String stepName,
                                                                  final JobExecutionImpl jobExecutionToRestart,
                                                                  final ClassLoader classLoader) {
         for (final StepExecution stepExecution : jobExecutionToRestart.getStepExecutions()) {
             if (stepName.equals(stepExecution.getStepName())) {
                 return (StepExecutionImpl) stepExecution;
             }
         }
-        StepExecutionImpl result = null;
-        // the same-named StepExecution is not found in the jobExecutionToRestart.  It's still possible the same-named
-        // StepExecution may exit in JobExecution earlier than jobExecutionToRestart for the same JobInstance.
-        final long instanceId = jobExecutionToRestart.getJobInstance().getInstanceId();
-        for (final SoftReference<JobExecutionImpl, Long> e : jobExecutions.values()) {
-            final JobExecutionImpl jobExecutionImpl = e.get();
-            //skip the JobExecution that has already been checked above
-            if (jobExecutionImpl != null && instanceId == jobExecutionImpl.getJobInstance().getInstanceId() &&
-                    jobExecutionImpl.getExecutionId() != jobExecutionToRestart.getExecutionId()) {
-                for (final StepExecution stepExecution : jobExecutionImpl.getStepExecutions()) {
-                    if (stepExecution.getStepName().equals(stepName)) {
-                        if (result == null || result.getStepExecutionId() < stepExecution.getStepExecutionId()) {
-                            result = (StepExecutionImpl) stepExecution;
-                        }
-                    }
-                }
-            }
-        }
-        return result;
+        return null;
     }
 }
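
For anyone following along, the lookup order being described is roughly the following (a paraphrase of the behavior, not the actual JdbcRepository source; selectLatestStepExecutionFromDb is a hypothetical stand-in for the JDBC query the subclass performs):

```java
// In JdbcRepository (paraphrased): consult the in-memory cache first, then the db.
@Override
public StepExecutionImpl findOriginalStepExecutionForRestart(final String stepName,
                                                             final JobExecutionImpl jobExecutionToRestart,
                                                             final ClassLoader classLoader) {
    // with the second patch above, the superclass only checks jobExecutionToRestart
    // itself and no longer scans other cached executions of the same job instance,
    // so a stale same-named step from an earlier execution can no longer be returned
    final StepExecutionImpl cached =
            super.findOriginalStepExecutionForRestart(stepName, jobExecutionToRestart, classLoader);
    if (cached != null) {
        return cached;
    }
    // cache miss: fall back to the database, which always holds the latest checkpoint
    return selectLatestStepExecutionFromDb(stepName, jobExecutionToRestart);  // hypothetical helper
}
```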

@PoloShock
Author

Thank you for your responses; I am happy you identified the source of the issue. However, I am not able to test the patch right now. I am using WildFly 9.0.0.

@chengfang
Contributor

I was able to reproduce the problem using wildfly-jberet-samples/deserialization/, with slight modifications, deployed to two servers: server A and server B.

@chengfang
Contributor

Pushed the fix to the master branch:
77118ef
