Outdated checkpoint-info in clustered environment #87
Comments
At step 7, you are restarting the job execution (jobExecution2) that was stopped at step 6 on server B. jobExecution2 is different from the one that you started at step 1 on server A, so jobExecution2 is not cached on server A. I think at step 7, JBeret is getting the right job execution (jobExecution2) from the db, but that retrieval does not include its constituent step executions, so JBeret mistakenly uses the obsolete step execution from jobExecution1.
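To make the suspected failure mode concrete, here is a minimal sketch of it. Everything in it is a hypothetical stand-in (`StaleStepSketch`, `JobExec`, `Step`, the checkpoint values 10 and 20); none of it is JBeret's real code. It assumes a lookup that first checks the execution being restarted and then scans a per-server cache for other executions of the same job instance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the suspected bug; not JBeret's real classes.
class StaleStepSketch {
    static final class Step {
        final String name;
        final int checkpoint;
        Step(String name, int checkpoint) { this.name = name; this.checkpoint = checkpoint; }
    }

    static final class JobExec {
        final long id;
        final long instanceId;
        final List<Step> steps = new ArrayList<>(); // empty when loaded without steps
        JobExec(long id, long instanceId) { this.id = id; this.instanceId = instanceId; }
    }

    // Per-server in-memory cache: job execution id -> job execution.
    final Map<Long, JobExec> cache = new HashMap<>();

    // Checks the execution being restarted first, then falls back to the local cache.
    Step findForRestart(String stepName, JobExec toRestart) {
        for (Step s : toRestart.steps) {
            if (s.name.equals(stepName)) return s;
        }
        for (JobExec cached : cache.values()) {
            if (cached.instanceId == toRestart.instanceId && cached.id != toRestart.id) {
                for (Step s : cached.steps) {
                    if (s.name.equals(stepName)) return s;
                }
            }
        }
        return null;
    }

    static int demo() {
        StaleStepSketch serverA = new StaleStepSketch();
        // jobExecution1 ran on server A and stopped at (assumed) checkpoint 10.
        JobExec exec1 = new JobExec(1L, 100L);
        exec1.steps.add(new Step("step1", 10));
        serverA.cache.put(exec1.id, exec1);
        // jobExecution2 ran on server B up to (assumed) checkpoint 20, but
        // server A reloads it from the db without its step executions.
        JobExec exec2 = new JobExec(2L, 100L);
        return serverA.findForRestart("step1", exec2).checkpoint;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 10, the obsolete checkpoint
    }
}
```

In this model the restart on server A picks up checkpoint 10 from the cached jobExecution1 even though the shared store would hold 20 for jobExecution2.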
The fix should be something like this:
diff --git a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
index 661ec7f..45ea516 100644
--- a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
+++ b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
@@ -168,7 +168,12 @@ public abstract class AbstractPersistentRepository extends AbstractRepository im
     public StepExecutionImpl findOriginalStepExecutionForRestart(final String stepName,
                                                                  final JobExecutionImpl jobExecutionToRestart,
                                                                  final ClassLoader classLoader) {
-        for (final StepExecution stepExecution : jobExecutionToRestart.getStepExecutions()) {
+        List<StepExecution> stepExecutions = jobExecutionToRestart.getStepExecutions();
+        if (stepExecutions.isEmpty()) {
+            stepExecutions = getStepExecutions(jobExecutionToRestart.getExecutionId(), classLoader);
+        }
+
+        for (final StepExecution stepExecution : stepExecutions) {
             if (stepName.equals(stepExecution.getStepName())) {
                 return (StepExecutionImpl) stepExecution;
             }
The above diff is based on the master branch, so the line numbers may be different for 1.2.0.
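For anyone who wants to try the idea outside of JBeret, the fallback in the diff can be sketched in isolation as below. This is only an illustration under assumptions: `RestartLookupSketch`, `JobRecord`, `StepRecord` and the `DB` map are hypothetical stand-ins for the job execution, its step executions and the shared JDBC repository, and the checkpoint value 20 is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins; not JBeret's real classes.
class RestartLookupSketch {
    static final class StepRecord {
        final String stepName;
        final int checkpoint;
        StepRecord(String stepName, int checkpoint) {
            this.stepName = stepName;
            this.checkpoint = checkpoint;
        }
    }

    static final class JobRecord {
        final long executionId;
        // May be empty when the job execution was loaded without its steps.
        final List<StepRecord> steps = new ArrayList<>();
        JobRecord(long executionId) { this.executionId = executionId; }
    }

    // Stand-in for the shared store: job execution id -> persisted step executions.
    static final Map<Long, List<StepRecord>> DB = new HashMap<>();

    static StepRecord findOriginalStep(String stepName, JobRecord toRestart) {
        List<StepRecord> steps = toRestart.steps;
        if (steps.isEmpty()) {
            // Same idea as the patch: reload the steps for this execution
            // from the store instead of trusting the empty in-memory list.
            List<StepRecord> stored = DB.get(toRestart.executionId);
            steps = stored != null ? stored : Collections.<StepRecord>emptyList();
        }
        for (StepRecord s : steps) {
            if (stepName.equals(s.stepName)) return s;
        }
        return null;
    }

    public static void main(String[] args) {
        DB.put(2L, Collections.singletonList(new StepRecord("step1", 20)));
        JobRecord exec2 = new JobRecord(2L); // loaded without steps
        System.out.println(findOriginalStep("step1", exec2).checkpoint); // prints 20
    }
}
```

The point of the fallback is that the lookup stays scoped to the execution being restarted, so it never needs to consult another execution's cached steps.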
Can you try the above patch in your app? I'll set up a similar test scenario, but it will not be identical to yours. Which version of WildFly are you using?
Cloned this issue in JIRA: JBERET-285
In fact, JdbcRepository (a subclass of AbstractPersistentRepository) already overrides
/Users/cfang/dev/jsr352 > git diff -U10
diff --git a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
index 661ec7f..0ff853f 100644
--- a/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
+++ b/jberet-core/src/main/java/org/jberet/repository/AbstractPersistentRepository.java
@@ -166,31 +166,13 @@ public abstract class AbstractPersistentRepository extends AbstractRepository im
     @Override
     public StepExecutionImpl findOriginalStepExecutionForRestart(final String stepName,
                                                                  final JobExecutionImpl jobExecutionToRestart,
                                                                  final ClassLoader classLoader) {
         for (final StepExecution stepExecution : jobExecutionToRestart.getStepExecutions()) {
             if (stepName.equals(stepExecution.getStepName())) {
                 return (StepExecutionImpl) stepExecution;
             }
         }
-        StepExecutionImpl result = null;
-        // the same-named StepExecution is not found in the jobExecutionToRestart. It's still possible the same-named
-        // StepExecution may exit in JobExecution earlier than jobExecutionToRestart for the same JobInstance.
-        final long instanceId = jobExecutionToRestart.getJobInstance().getInstanceId();
-        for (final SoftReference<JobExecutionImpl, Long> e : jobExecutions.values()) {
-            final JobExecutionImpl jobExecutionImpl = e.get();
-            //skip the JobExecution that has already been checked above
-            if (jobExecutionImpl != null && instanceId == jobExecutionImpl.getJobInstance().getInstanceId() &&
-                jobExecutionImpl.getExecutionId() != jobExecutionToRestart.getExecutionId()) {
-                for (final StepExecution stepExecution : jobExecutionImpl.getStepExecutions()) {
-                    if (stepExecution.getStepName().equals(stepName)) {
-                        if (result == null || result.getStepExecutionId() < stepExecution.getStepExecutionId()) {
-                            result = (StepExecutionImpl) stepExecution;
-                        }
-                    }
-                }
-            }
-        }
-        return result;
+        return null;
     }
 }
Thank you for your responses. I am happy you identified the source of the issue. However, I am not able to test the patch right now. I am using WildFly 9.0.0.
I was able to reproduce the problem using wildfly-jberet-samples/deserialization/, with slight modifications, deployed to two servers, server A and server B.
Pushed the fix to the master branch:
In a clustered environment with a central JDBC repository, outdated reader checkpoint info is loaded when an execution switches between instances of the cluster. I will describe the issue by example:
Let there be server A and server B in a cluster:
After some debugging, it seems JBeret is caching JobExecutionImpl objects in AbstractPersistentRepository, and these cached objects are not up to date during the job restart process.
I am currently using JBeret 1.2.0 Final.