
Outdated checkpoint-info in clustered environment #87

Closed
PoloShock opened this issue Nov 25, 2016 · 8 comments

@PoloShock

In a clustered environment with a central JDBC repository, outdated reader checkpoint info is loaded when switching executions between instances of the cluster. I will describe the issue by example:

Let there be server A and server B in a cluster:

  1. a job is started on server A
  2. the job runs, executing chunk items and storing reader checkpoint info in the JDBC database
  3. the job is stopped, leaving the checkpoint info at count 100
  4. the job is restarted on server B
  5. server B resumes from the checkpoint info holding count 100
  6. server B is stopped after fifty more chunk items are done, leaving the checkpoint info at 150
  7. the job is restarted on server A
  8. server A resumes from outdated checkpoint info holding a count of 100, which is what it was the last time this server stopped the job
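
To make the sequence concrete, it maps roughly onto these JobOperator calls (a minimal sketch; in reality each call runs on a different cluster node, the job name "chunkJob" is a placeholder, and one would wait for each execution to reach STOPPED before restarting):

```java
import java.util.Properties;

import javax.batch.operations.JobOperator;
import javax.batch.runtime.BatchRuntime;

public class RestartScenarioSketch {
    public static void main(String[] args) {
        final JobOperator op = BatchRuntime.getJobOperator();

        // steps 1-3, on server A: start the job, then stop it; the reader
        // checkpoint persisted in the JDBC repository is at count 100
        final long exec1 = op.start("chunkJob", new Properties());
        op.stop(exec1);

        // steps 4-6, on server B: restart, resume from count 100, stop at count 150
        final long exec2 = op.restart(exec1, new Properties());
        op.stop(exec2);

        // steps 7-8, back on server A: restart again; it should resume from 150,
        // but the stale cached JobExecutionImpl makes it resume from 100
        op.restart(exec2, new Properties());
    }
}
```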

After some debugging, it seems JBeret is caching JobExecutionImpl objects in AbstractPersistentRepository which are not up to date during the job restart process.

I am currently using JBeret 1.2.0.Final.
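
For context, the "count" above is simply the Serializable that the chunk reader returns from checkpointInfo(); the batch runtime persists it with the step execution and passes it back to open() on restart. A minimal sketch of such a reader (the bean name and the plain int counter are placeholders, not the actual reader from my application):

```java
import java.io.Serializable;

import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Named;

@Named("countingReader")
public class CountingReader extends AbstractItemReader {
    private int count;

    @Override
    public void open(final Serializable checkpoint) {
        // on restart, the runtime passes back the last persisted checkpoint;
        // in the scenario above this should be 150, but the stale cache yields 100
        count = checkpoint == null ? 0 : (Integer) checkpoint;
    }

    @Override
    public Object readItem() {
        return count < 200 ? count++ : null;  // returning null ends the step
    }

    @Override
    public Serializable checkpointInfo() {
        // persisted to the JDBC repository at every chunk boundary
        return count;
    }
}
```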

@chengfang chengfang self-assigned this Nov 26, 2016
@chengfang
Contributor

At step 7, you are restarting the job execution (jobExecution2) that was stopped at step 6 on server B. jobExecution2 is different from the one that you started at step 1 on server A, so jobExecution2 is not cached on server A.

I think at step 7, JBeret is getting the right job execution (jobExecution2) from db. But the retrieval of jobExecution2 from db does not include its constituent step executions. So JBeret mistakenly uses the obsolete step execution from jobExecution1.

@chengfang
Contributor

The fix should be something like this:

diff --git a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
index 661ec7f..45ea516 100644
--- a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
+++ b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
@@ -168,7 +168,12 @@ public abstract class AbstractPersistentRepository extends AbstractRepository im
     public StepExecutionImpl findOriginalStepExecutionForRestart(final String stepName,
                                                                  final JobExecutionImpl jobExecutionToRestart,
                                                                  final ClassLoader classLoader) {
-        for (final StepExecution stepExecution : jobExecutionToRestart.getStepExecutions()) {
+        List<StepExecution> stepExecutions = jobExecutionToRestart.getStepExecutions();
+        if (stepExecutions.isEmpty()) {
+            stepExecutions = getStepExecutions(jobExecutionToRestart.getExecutionId(), classLoader);
+        }
+
+        for (final StepExecution stepExecution : stepExecutions) {
             if (stepName.equals(stepExecution.getStepName())) {
                 return (StepExecutionImpl) stepExecution;
             }

The above diff is based on the master branch, so the line numbers may be different for 1.2.0.

@chengfang
Contributor

Can you try the above patch in your app? I'll set up a similar test scenario, but it will not be identical to yours.

Which version of WildFly are you using?

@chengfang
Contributor

Cloned this issue in JIRA: JBERET-285
https://issues.jboss.org/browse/JBERET-285

@chengfang
Contributor

In fact, JdbcRepository (a subclass of AbstractPersistentRepository) already overrides the findOriginalStepExecutionForRestart method to retrieve the original failed/stopped step from the db, but it calls super.findOriginalStepExecutionForRestart first in the hope of finding the right step in the cache to save db access cost. In this case, the cache happens to contain an obsolete step by the same name. I think the following fix should also work, and I will research more to see which one is better.

/Users/cfang/dev/jsr352 > git diff -U10
diff --git a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
index 661ec7f..0ff853f 100644
--- a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
+++ b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
@@ -166,31 +166,13 @@ public abstract class AbstractPersistentRepository extends AbstractRepository im

     @Override
     public StepExecutionImpl findOriginalStepExecutionForRestart(final String stepName,
                                                                  final JobExecutionImpl jobExecutionToRestart,
                                                                  final ClassLoader classLoader) {
         for (final StepExecution stepExecution : jobExecutionToRestart.getStepExecutions()) {
             if (stepName.equals(stepExecution.getStepName())) {
                 return (StepExecutionImpl) stepExecution;
             }
         }
-        StepExecutionImpl result = null;
-        // the same-named StepExecution is not found in the jobExecutionToRestart.  It's still possible the same-named
-        // StepExecution may exit in JobExecution earlier than jobExecutionToRestart for the same JobInstance.
-        final long instanceId = jobExecutionToRestart.getJobInstance().getInstanceId();
-        for (final SoftReference<JobExecutionImpl, Long> e : jobExecutions.values()) {
-            final JobExecutionImpl jobExecutionImpl = e.get();
-            //skip the JobExecution that has already been checked above
-            if (jobExecutionImpl != null && instanceId == jobExecutionImpl.getJobInstance().getInstanceId() &&
-                    jobExecutionImpl.getExecutionId() != jobExecutionToRestart.getExecutionId()) {
-                for (final StepExecution stepExecution : jobExecutionImpl.getStepExecutions()) {
-                    if (stepExecution.getStepName().equals(stepName)) {
-                        if (result == null || result.getStepExecutionId() < stepExecution.getStepExecutionId()) {
-                            result = (StepExecutionImpl) stepExecution;
-                        }
-                    }
-                }
-            }
-        }
-        return result;
+        return null;
     }
 }
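
For anyone following along, the lookup order being described is roughly the following (a paraphrase of the behavior, not the actual JdbcRepository source; selectLatestStepExecutionFromDb is a hypothetical stand-in for the JDBC query the subclass performs):

```java
// In JdbcRepository (paraphrased): consult the in-memory cache first, then the db.
@Override
public StepExecutionImpl findOriginalStepExecutionForRestart(final String stepName,
                                                             final JobExecutionImpl jobExecutionToRestart,
                                                             final ClassLoader classLoader) {
    // with the second patch above, the superclass only checks jobExecutionToRestart
    // itself and no longer scans other cached executions of the same job instance,
    // so a stale same-named step from an earlier execution can no longer be returned
    final StepExecutionImpl cached =
            super.findOriginalStepExecutionForRestart(stepName, jobExecutionToRestart, classLoader);
    if (cached != null) {
        return cached;
    }
    // cache miss: fall back to the database, which always holds the latest checkpoint
    return selectLatestStepExecutionFromDb(stepName, jobExecutionToRestart);  // hypothetical helper
}
```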

@PoloShock
Author

Thank you for your responses; I am happy you identified the source of the issue. However, I am not able to test the patch right now. I am using WildFly 9.0.0.

@chengfang
Contributor

I was able to reproduce the problem using wildfly-jberet-samples/deserialization/, with slight modifications, deployed to two servers: server A and server B.

@chengfang
Contributor

Pushed the fix to the master branch:
77118ef
