Scheduler sometimes misses events #13
Error that appears on console -
If you manually trigger the event via Signals, it works.
This issue occurs quite rarely; however, it has affected all the Scheduler nodes hosted on the same machine, even though they run in different Node hosts.
The fix is to disable then enable the Scheduler node.
This would appear to be related to the underlying thread handling of Java (and therefore Jython), with threads not waking correctly at the scheduled time. I might try running the scheduler on something like a Pi to see if the reduced CPU capability can induce the problem more reliably.
We had this happen to us today on ALL schedules. Since I just moved to WOL, I had to run all the gallery "on" actions manually to get the museum up and running.
If the schedule is missed, why don't we just catch the miss and re-run the schedule? I think the schedule is mission critical.
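The "catch the miss and re-run" idea could plausibly be done with apscheduler's event listeners (apscheduler exposes `add_listener` and an `EVENT_JOB_MISSED` event constant). The sketch below is only an illustration of the callback logic using stand-in classes, not Nodel's or apscheduler's actual code; with a real scheduler you would register the callback via something like `scheduler.add_listener(on_job_missed, EVENT_JOB_MISSED)`.

```python
# Minimal sketch, assuming an apscheduler-2.x-style event object that
# carries the missed job with .func/.args/.kwargs attributes.
class _Job(object):
    def __init__(self, func, args=(), kwargs=None):
        self.func, self.args, self.kwargs = func, args, kwargs or {}

class _MissedEvent(object):
    def __init__(self, job):
        self.job = job

results = []

def on_job_missed(event):
    # Instead of dropping the job, re-run it immediately.
    event.job.func(*event.job.args, **event.job.kwargs)

# Simulate the scheduler reporting a miss:
on_job_missed(_MissedEvent(_Job(lambda: results.append('ran'))))
print(results)  # -> ['ran']
```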
It's very hard to reproduce this on demand without setting up lengthy tests. Regardless, I've still taken a close look at what happens under the hood in the apscheduler Python code as well as the underlying Jython and Java.
There could be many reasons why jobs scheduled well into the future occasionally seem to be "missed". My theories are:
(The common Java primitives
As these start- and end-of-day schedules are obviously not millisecond critical (*), as a workaround I propose we simply disable the "misfire checking" like this:
Current scheduler.py (line 491):

```python
...
if difference > grace_time:
    # Notify listeners about a missed run
    event = JobEvent(EVENT_JOB_MISSED, job, run_time)
    self._notify_listeners(event)
    logger.warning('Run time of job "%s" was missed by %s', job, difference)
else:
    try:
        job.add_instance()
...
```

Proposed scheduler.py:

```python
...
if difference > grace_time:
    # Simply log and continue
    logger.warning('Ignoring lateness - "%s" was late by %s', job, difference)
    if False:
        print 'dead code'
else:
    try:
        job.add_instance()
...
```

(I used dead code to minimise changes to the file.)
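A similar effect might be achievable without patching the file at all, since apscheduler exposes the grace window as a `misfire_grace_time` option. The sketch below only reproduces the comparison logic around line 491 in plain Python to show why raising the grace time suppresses the "missed" path; it is an illustration, not the library's actual code.

```python
from datetime import datetime, timedelta

def is_missed(run_time, now, grace_time_seconds):
    # Mirrors the scheduler.py check: a job counts as "missed" only when
    # it is more than grace_time late.
    difference = now - run_time
    return difference > timedelta(seconds=grace_time_seconds)

run_time = datetime(2018, 1, 1, 9, 0, 0)
late_now = run_time + timedelta(minutes=13)      # e.g. a 13-minute stall

print(is_missed(run_time, late_now, 1))          # tight grace: True (missed)
print(is_missed(run_time, late_now, 24 * 3600))  # day-long grace: False (still runs)
```

With a generous `misfire_grace_time`, a late wake-up falls through to the normal `job.add_instance()` branch instead of the missed-run notification.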
(*) As far as I know, all hardware being equal, this issue is exacerbated the longer the period is between trigger times. Conversely, the smaller the period, the more reliable and precise the timing is.
There are ways to improve the accuracy of apscheduler's scheduling algorithm, but they will involve complex changes to the source code. Until that's done, I'll relabel this issue as 'workaround exists' and keep it open.
Please report back if anyone can help reproduce this on demand. Can I suggest we add the following lines to your node so we have more info to look at when the condition occurs:
Top of scheduler nodes' script.py files:
```python
import logging

logging.basicConfig()
logging.getLogger('apscheduler.scheduler').level = logging.DEBUG
logging.getLogger('apscheduler.threadpool').level = logging.DEBUG
```
Yes, that'll work too, but I never mentioned it because it ends up ignoring the condition without reporting it. I was hoping someone would post more detailed logs (*) when those delays occur.
BTW, I've currently got a few tests in place running on CPU-restricted hardware which should lead to a better workaround or perhaps even a fix.
(*) Can I ask to have this line changed in your node script too so we get accurate timestamps and threading info when things go wrong:
Top of scheduler nodes' script.py files:
Got some good news.
The VM that Nodel runs on was paused briefly yesterday while we performed some server maintenance. This caused the scheduler to miss its schedule and run it 13 minutes late. I think this is because the scheduler isn't based on the wall clock, but on a countdown of seconds from the last time it checked for an event, i.e. it works off a delta to the next event. When the VM was paused, the countdown was paused too, then resumed with the previous number of ticks remaining. There was no catch-up or re-evaluation of the delta to the next event, so the schedule ran late.
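The countdown-vs-clock distinction above can be sketched in plain Python (this is an illustration of the failure mode, not apscheduler's actual code). A single fixed-delta sleep computed up front inherits the full duration of a VM pause, whereas a loop that re-reads the wall clock each iteration is late by at most one slice:

```python
import time
from datetime import datetime, timedelta

def sleep_until(target):
    """Sleep until the wall clock reaches 'target', re-checking the clock.

    Unlike a single time.sleep(delta) computed once, each short slice
    re-reads datetime.now(), so a VM pause delays wake-up by at most
    one slice rather than by the full pause duration.
    """
    while True:
        remaining = (target - datetime.now()).total_seconds()
        if remaining <= 0:
            return
        time.sleep(min(remaining, 1.0))

target = datetime.now() + timedelta(milliseconds=50)
sleep_until(target)
print(datetime.now() >= target)  # -> True
```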
Generally, this appears to be caused by a problem with the underlying apscheduler module which is going to be difficult to fix.
I'm going to close this for now as we are releasing the new scheduler recipe which I believe most people will use instead of this.
In the limited occasions that this scheduler might still be used over the new recipe (like for basic standalone operation), I think the workaround is sufficient.