Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Severe JobRunr Exception #337

Closed
datnguyen293 opened this issue Jan 25, 2022 · 20 comments
Closed

Severe JobRunr Exception #337

datnguyen293 opened this issue Jan 25, 2022 · 20 comments
Labels
wontfix This will not be worked on

Comments

@datnguyen293
Copy link

SevereJobRunrException occurred in BackgroundJobServer 89d7c0b4-19ab-4da1-93bb-5e34e7629426: Could not resolve ConcurrentJobModificationException

Runtime information

  • Timestamp: 2022-01-21T05:41:34.990173023Z
  • Location: BackgroundJobServer 89d7c0b4-19ab-4da1-93bb-5e34e7629426
  • JobRunr Version: 4.0.1
  • StorageProvider: org.jobrunr.storage.sql.mariadb.MariaDbStorageProvider
  • Java Version: 15.0.1
  • Is running from nested jar: false

Background Job Servers

  • BackgroundJobServer id: 89d7c0b4-19ab-4da1-93bb-5e34e7629426
    (workerPoolSize: 4, pollIntervalInSeconds: 5, firstHeartbeat: 2022-01-21T05:40:38.121464Z, lastHeartbeat: 2022-01-21T05:41:33.343126Z)

Diagnostics from exception

Concurrent modified jobs:

Job id: eab73d9e-3c68-4249-852a-c82d14a098e8
Local version: 5; Storage version: 6
Local state: DELETED (at 2022-01-21T05:41:04.576913404Z) ← PROCESSING (at 2022-01-21T05:41:04.537562334Z on BackgroundJobServer 89d7c0b4-19ab-4da1-93bb-5e34e7629426) ← ENQUEUED (at 2022-01-21T05:36:05.000067844Z)
Storage state: DELETED (at 2022-01-21T05:41:01.534887099Z) ← PROCESSING (at 2022-01-21T05:40:59.403639452Z on BackgroundJobServer 89d7c0b4-19ab-4da1-93bb-5e34e7629426) ← ENQUEUED (at 2022-01-21T05:36:05.000067844Z)

Exception

org.jobrunr.SevereJobRunrException: Could not resolve ConcurrentJobModificationException
	at org.jobrunr.server.JobZooKeeper.processJobList(JobZooKeeper.java:214)
	at org.jobrunr.server.JobZooKeeper.updateJobsThatAreBeingProcessed(JobZooKeeper.java:89)
	at org.jobrunr.server.JobZooKeeper.run(JobZooKeeper.java:74)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.runAndReset$$$capture(FutureTask.java:305)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)

SevereJobRunrException occurred in BackgroundJobServer 8ce1ecda-52c8-4034-9d41-8cc6a0a5a1b0: Could not resolve ConcurrentJobModificationException

Runtime information

  • Timestamp: 2022-01-22T03:48:19.624264531Z
  • Location: BackgroundJobServer 8ce1ecda-52c8-4034-9d41-8cc6a0a5a1b0
  • JobRunr Version: 4.0.1
  • StorageProvider: org.jobrunr.storage.sql.mariadb.MariaDbStorageProvider
  • Java Version: 15.0.1
  • Is running from nested jar: false

Background Job Servers

  • BackgroundJobServer id: 8ce1ecda-52c8-4034-9d41-8cc6a0a5a1b0
    (workerPoolSize: 4, pollIntervalInSeconds: 5, firstHeartbeat: 2022-01-22T03:46:05.448662Z, lastHeartbeat: 2022-01-22T03:48:19.568798Z)

Diagnostics from exception

Concurrent modified jobs:

Job id: 8f72af34-9efb-424c-9a51-9b8bdd12c25a
Local version: 12; Storage version: 13
Local state: SCHEDULED (at 2022-01-22T03:48:19.589220618Z) ← FAILED (at 2022-01-22T03:48:19.583438212Z) ← PROCESSING (at 2022-01-22T03:47:26.967061153Z on BackgroundJobServer 8ce1ecda-52c8-4034-9d41-8cc6a0a5a1b0)
Storage state: SUCCEEDED (at 2022-01-22T03:48:19.568197190Z) ← PROCESSING (at 2022-01-22T03:47:26.967061153Z on BackgroundJobServer 8ce1ecda-52c8-4034-9d41-8cc6a0a5a1b0) ← ENQUEUED (at 2022-01-22T03:47:25.748309585Z)

Exception

org.jobrunr.SevereJobRunrException: Could not resolve ConcurrentJobModificationException
	at org.jobrunr.server.JobZooKeeper.processJobList(JobZooKeeper.java:214)
	at org.jobrunr.server.JobZooKeeper.processJobList(JobZooKeeper.java:198)
	at org.jobrunr.server.JobZooKeeper.checkForOrphanedJobs(JobZooKeeper.java:124)
	at org.jobrunr.server.JobZooKeeper.runMasterTasksIfCurrentServerIsMaster(JobZooKeeper.java:96)
	at org.jobrunr.server.JobZooKeeper.run(JobZooKeeper.java:75)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.runAndReset$$$capture(FutureTask.java:305)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
@rdehuyss
Copy link
Contributor

Hi Dat, is it an option to try to use Postgres?

I see this error mostly with MariaDb/MySql and I can't find the root cause of what is going on to be honest.

@tan9
Copy link
Contributor

tan9 commented Jan 25, 2022

@datnguyen293 What is your MariaDB setup? Is it a High Availability configuration?

@datnguyen293
Copy link
Author

Yes I have a semi-sync replication.

@rdehuyss
Copy link
Contributor

Are there multiple instances / regions where JobRunr is connecting to these replicated MariaDB's?

@datnguyen293
Copy link
Author

Hi Ronald, I have 2 instance of JobRunr connected to the same active master instance of MySQL. The standby master instance is read only and currently acting as backup server, not being used by the apps.

@tan9
Copy link
Contributor

tan9 commented Jan 26, 2022

I had a similar situation in a PostgreSQL with Pgpool-II cluster installation. JobZooKeeper popups ConcurrentJobModificationException occasionally and BackgroundJobServer shutdown after five times of this exception. I found the root cause came from line 190:

void processJobList(Supplier<List<Job>> jobListSupplier, Consumer<Job> jobConsumer) {
List<Job> jobs = getJobsToProcess(jobListSupplier);
while (!jobs.isEmpty()) {
processJobList(jobs, jobConsumer);
jobs = getJobsToProcess(jobListSupplier);
}
}

I wonder if the PostgreSQL instance used by JobZooKeep may differ from the one used by some other BackgroundJobServer. And data not been synchronized in the 220ms (03:48:19.589220618Z - 03:48:19.568197190Z) gap between PostgreSQL instances becomes eventually-consistency.

I have no idea how to deal with this issue inside JobRunr, and end up setting up another non-HA database for JobRunr to rely on.

@rdehuyss
Copy link
Contributor

@tan9 - thanks for thinking with us what the reason could be.

I may not have enough knowledge of database replication but I guess JobRunr just connects to the MariaDb/Postgres insurance that is configured in the Datasource - the active master.

Is your idea then that these databases internally read from other instances thus having not the latest data at hand as it is not synced yet?

Are the concurrent modifications completely solved after setting up a non HA database?

@rdehuyss
Copy link
Contributor

rdehuyss commented Jan 26, 2022

Some history: I added this logic (everything related to concurrent job modifications) to make sure that users report errors as it was not easy to get the code right from the first time due to the concurrency (inside one JVM and even multiple if there are multiple background job servers). I also solved some issues thanks to these big reports in the past.

Lately however, I'm not getting a lot of these reports though and in JobRunr 4.0.3 I added the option to set the ConcurrentJobModificationPolicy. It's not in documentation yet as I prefer to understand the root cause where these exceptions are coming from.

However, the ConcurrentJobModificationPolicy allows you to configure what JobRunr should do in case of a concurrent job modification. There are now 2 implementations, the default which is what you already know and a new one: UseStorageProviderJobConcurrentJobModificationResolver. In this case, JobRunr will not complain if there are concurrent job modifications and it will always use the one from storage. Use at your own risk 🤠 but please do report back on how this works out

If @datnguyen293 is sure there are no concurrent job modification errors when using a non HA database (and thus I can rest assured there are no programming errors), the UseStorageProviderJobConcurrentJobModificationResolver might be the solution for HA databases.

But I still don't understand the root cause as to why these exceptions pop up which frustrates me.

@datnguyen293
Copy link
Author

Thanks @rdehuyss.
Is your idea then that these databases internally read from other instances thus having not the latest data at hand as it is not synced yet?
In my case I don't think this would be the issue as currently I'm reading from only one active master instance. The standby won't be touch unless the active master goes down.

I will have a look at the UseStorageProviderJobConcurrentJobModificationResolver later this weekend.

@datnguyen293
Copy link
Author

datnguyen293 commented Jan 26, 2022

Hi @rdehuyss ,
I'm not sure if below issue is related to above. It's occurred with recurring jobs (as discussed via email, I have changed back the poll_interval back to 15 seconds), everytime a recurring job is executed I got one of this issue.

19:09:59.535 [backgroundjob-zookeeper-pool-4-thread-2] ERROR org.jobrunr.jobs.filters.JobPerformingFilters - Error evaluating jobfilter org.jobrunr.jobs.filters.DeleteFilter
java.lang.NullPointerException: null
	at java.time.Instant.plus(Instant.java:787) ~[?:?]
	at org.jobrunr.jobs.filters.DeleteFilter$DeletionPolicy.handle(DeleteFilter.java:91) ~[jobrunr-pro-4.0.1.jar!/:?]
	at org.jobrunr.jobs.filters.DeleteFilter.onStateElection(DeleteFilter.java:29) ~[jobrunr-pro-4.0.1.jar!/:?]
	at org.jobrunr.jobs.filters.JobPerformingFilters.lambda$runOnStateElectionFilter$0(JobPerformingFilters.java:23) ~[jobrunr-pro-4.0.1.jar!/:?]
	at org.jobrunr.jobs.filters.AbstractJobFilters.lambda$catchThrowable$5(AbstractJobFilters.java:95) ~[jobrunr-pro-4.0.1.jar!/:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177) ~[?:?]
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) ~[?:?]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497) ~[?:?]
	at org.jobrunr.jobs.filters.JobPerformingFilters.runOnStateElectionFilter(JobPerformingFilters.java:23) ~[jobrunr-pro-4.0.1.jar!/:?]
	at org.jobrunr.jobs.filters.JobFilterUtils.runOnStateElectionFilter(JobFilterUtils.java:33) ~[jobrunr-pro-4.0.1.jar!/:?]
	at java.util.ArrayList.forEach(ArrayList.java:1541) [?:?]
	at org.jobrunr.jobs.filters.JobFilterUtils.runOnStateElectionFilter(JobFilterUtils.java:41) [jobrunr-pro-4.0.1.jar!/:?]
	at org.jobrunr.server.JobZooKeeper.processJobList(JobZooKeeper.java:207) [jobrunr-pro-4.0.1.jar!/:?]
	at org.jobrunr.server.JobZooKeeper.processJobList(JobZooKeeper.java:198) [jobrunr-pro-4.0.1.jar!/:?]
	at org.jobrunr.server.JobZooKeeper.checkForSucceededJobsThanCanGoToDeletedState(JobZooKeeper.java:130) [jobrunr-pro-4.0.1.jar!/:?]
	at org.jobrunr.server.JobZooKeeper.runMasterTasksIfCurrentServerIsMaster(JobZooKeeper.java:97) [jobrunr-pro-4.0.1.jar!/:?]
	at org.jobrunr.server.JobZooKeeper.run(JobZooKeeper.java:75) [jobrunr-pro-4.0.1.jar!/:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]

Also I don't think this is timeout issue as currently the database (MySQL) is free without much load. I have tested with only 2 recurring jobs which run every 5 and 10 minutes.

@rdehuyss
Copy link
Contributor

Hi Dat,

Can you try JobRunr Pro 4.0.2 for this latest bug report?

Thanks,
Ronald

@datnguyen293
Copy link
Author

datnguyen293 commented Jan 27, 2022 via email

@tan9
Copy link
Contributor

tan9 commented Jan 27, 2022

Are the concurrent modifications completely solved after setting up a non HA database?

@rdehuyss @datnguyen293 I can confirm this issue was gone after we run JobRunr over a single PostgreSQL instance.

The standby master instance is read only and currently acting as backup server, not being used by the apps.

I am not familiar with MySQL HA setup. But if the standby master is configured as a read replica, and query can be executed on the instance, outdated data may be fetched.

The standby won't be touch unless the active master goes down.

@datnguyen293 in what situation the master will be marked as down?

@datnguyen293
Copy link
Author

Hi Dat,

Can you try JobRunr Pro 4.0.2 for this latest bug report?

Thanks, Ronald

I would like to report a bug in upgrading JobRunr Pro from 4.0.1 to 4.0.2. Would have a migration I guess.

09:23:46.431 [backgroundjob-worker-pool-5-thread-1] ERROR org.jobrunr.server.BackgroundJobServer - Unexpected exception running `CheckIfAllJobsExistTask`
org.jobrunr.JobRunrException: JobRunr encountered a problematic exception. Please create a bug report (if possible, provide the code to reproduce this and the stacktrace)
	at org.jobrunr.JobRunrException.shouldNotHappenException(JobRunrException.java:43) ~[jobrunr-pro-4.0.2.jar!/:4.0.2]
	at org.jobrunr.utils.mapper.jackson.JacksonJsonMapper.deserialize(JacksonJsonMapper.java:84) ~[jobrunr-pro-4.0.2.jar!/:4.0.2]
	at org.jobrunr.jobs.mappers.JobMapper.deserializeRecurringJob(JobMapper.java:28) ~[jobrunr-pro-4.0.2.jar!/:4.0.2]
	at org.jobrunr.storage.sql.common.RecurringJobTable.toRecurringJob(RecurringJobTable.java:56) ~[jobrunr-pro-4.0.2.jar!/:4.0.2]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]
	at org.jobrunr.storage.sql.common.db.SqlSpliterator.tryAdvance(SqlSpliterator.java:42) ~[jobrunr-pro-4.0.2.jar!/:4.0.2]
	at java.util.Spliterator.forEachRemaining(Spliterator.java:326) ~[?:?]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) ~[?:?]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) ~[?:?]
	at org.jobrunr.storage.sql.common.RecurringJobTable.selectAll(RecurringJobTable.java:47) ~[jobrunr-pro-4.0.2.jar!/:4.0.2]
	at org.jobrunr.storage.sql.common.DefaultSqlStorageProvider.getRecurringJobs(DefaultSqlStorageProvider.java:325) ~[jobrunr-pro-4.0.2.jar!/:4.0.2]
	at org.jobrunr.storage.ThreadSafeStorageProvider.getRecurringJobs(ThreadSafeStorageProvider.java:175) ~[jobrunr-pro-4.0.2.jar!/:4.0.2]
	at org.jobrunr.server.tasks.CheckIfAllJobsExistTask.getDistinctRecurringJobSignaturesThatDoNotExistAnymore(CheckIfAllJobsExistTask.java:123) ~[jobrunr-pro-4.0.2.jar!/:4.0.2]
	at org.jobrunr.server.tasks.CheckIfAllJobsExistTask.run(CheckIfAllJobsExistTask.java:46) [jobrunr-pro-4.0.2.jar!/:4.0.2]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "cronExpression" (class org.jobrunr.jobs.RecurringJob), not marked as ignorable (12 known properties: "priority", "jobDetails", "serverTag", "version", "jobSignature", "mutex", "zoneId", "jobName", "id", "scheduleExpression", "isPaused", "createdAt"])
 at [Source: (String)"{"version":0,"jobSignature":"com.gobysend.gobyjobs.jobservices.WorkflowStatusChangedJobService.scheduledForTriggerByDate(java.lang.Long)","jobName":"Scheduled triggering by date for workflow 2501","jobDetails":{"className":"com.gobysend.gobyjobs.jobservices.WorkflowStatusChangedJobService","staticFieldName":null,"methodName":"scheduledForTriggerByDate","jobParameters":[{"className":"long","actualClassName":"java.lang.Long","object":2501}],"cacheable":false},"serverTag":"DEFAULT","mutex":null,"pr"[truncated 115 chars]; line: 1, column: 560] (through reference chain: org.jobrunr.jobs.RecurringJob["cronExpression"])

@rdehuyss
Copy link
Contributor

Hi @datnguyen293 , indeed - my mistake. I'm doing a little bit too much for the moment 😞 .

This should have become JobRunr Pro 5 as there is now support for scheduling jobs with an interval and it is not backwards compatible.

I'll release JobRunr Pro 5 soon and keep you posted. Sorry for the inconvenience.

@datnguyen293
Copy link
Author

Thanks Ronald, I think I can reset the job tables in case upgrading to version 5.
At the moment I'm facing an issue when running in HA environment. I mean when having multiple JobRunr Pro instances to the same database. I'm investigating little bit more before submitting another issue.

Thanks for your hard working,
Dat

@rdehuyss
Copy link
Contributor

Are you trying to run JobRunr in HA Active-Active environment? As that won't work due to not having one single source of truth (as the DB is replicated).

@datnguyen293
Copy link
Author

That's exactly what I'm trying to check. I'm going to turn off one instance of database (so we have only one DB node). But as far as my experience, it won't be the issue. The standby DB instance won't be used for query.

@rdehuyss
Copy link
Contributor

Hi @datnguyen293 , were you able to reproduce it with only one DB instance?

@stale
Copy link

stale bot commented Feb 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Feb 12, 2022
@stale stale bot closed this as completed Feb 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

3 participants