
Scheduler sometimes misses events #13

Closed
keithvaz opened this Issue Oct 21, 2014 · 7 comments

Contributor

keithvaz commented Oct 21, 2014

Error that appears on the console:
WARNING:apscheduler.scheduler:Run time of job "callevent (trigger: cron[month='*', day='*', day_of_week='*', hour='9', minute='0'], next run at: 2014-10-23 09:00:00)" was missed by 0:00:11.240000

If you manually trigger the event via Signals, it works.

This issue is quite rare in occurrence; however, when it happens it affects all the Scheduler nodes hosted on the same machine, even though they run under different Node hosts.

The current fix is to disable and then re-enable the Scheduler node.

@keithvaz keithvaz added the bug label Oct 21, 2014

Contributor

mcartmel commented Dec 8, 2014

This would appear to be related to the underlying thread handling of Java (and therefore Jython), with threads not waking correctly at the scheduled time. I might try running the scheduler on something like a Pi to see if the reduced CPU capability can induce the problem more reliably.


scienceworld commented Dec 17, 2014

We had this happen to us today on ALL schedules. Since I just moved to WOL, I had to run all the gallery "on" actions manually to get the museum up and running.

If the schedule is missed, why don't we just catch the miss and re-run the schedule? I think the schedule is mission critical.

12-17 09:00:55 - WARNING:apscheduler.scheduler:Run time of job "callevent (trigger: cron[month='*', day='*', day_of_week='1-6', hour='9', minute='0'], next run at: 2014-12-18 09:00:00)" was missed by 0:00:55.710999

Contributor

justparking commented Dec 18, 2014

It's very hard to reproduce this on demand without setting up lengthy tests. Regardless, I've still taken a close look at what happens under the hood in the apscheduler Python code as well as the underlying Jython and Java.

There could be many reasons why jobs scheduled well into the future occasionally appear to be "missed". My theories are:

  1. apscheduler indirectly calls the Condition.awaitNanos(long nanosTimeout) Java method. I've looked at its source code and can see it is implemented in a compound fashion, which means it could theoretically block longer than nanosTimeout and thus affect apscheduler's scheduling algorithm.

(The common Java primitives Thread.sleep(time) or Object.wait(time), on the other hand, are atomic operations, but are useless within apscheduler.)

  2. apscheduler uses its own hand-rolled thread-pooling mechanism, which may behave strangely under a heavy, highly multi-threaded burst of activity.
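Theory 1 can be illustrated without touching apscheduler at all. The following is a minimal pure-Python sketch (not apscheduler's actual code): it blocks on a timed wait that is guaranteed not to return early but, under load, may return late. Under Jython, this kind of wait ultimately goes through the same java.util.concurrent condition machinery mentioned above.

```python
import threading
import time

def timed_wait(timeout):
    """Mimic a scheduler sleeping until its next wakeup and return
    how long we really slept."""
    woke = threading.Event()  # never set, so wait() runs to its timeout
    start = time.time()
    woke.wait(timeout)
    return time.time() - start

elapsed = timed_wait(0.05)
# The wait never returns early, but on a busy or underpowered host it
# can return arbitrarily late - which apscheduler then reports as a miss.
```

Any lateness here is exactly the `was missed by 0:00:11` drift reported in this issue's logs.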

As these start- and end-of-day schedules are obviously not millisecond-critical (*), as a workaround I propose we simply disable the "misfire checking" like this:

Current scheduler.py (line 491)

            ...
            if difference > grace_time:
                # Notify listeners about a missed run
                event = JobEvent(EVENT_JOB_MISSED, job, run_time)
                self._notify_listeners(event)
                logger.warning('Run time of job "%s" was missed by %s',
                               job, difference)
            else:
                try:
                    job.add_instance()
            ...

Workaround:

            ...
            if difference > grace_time:
                # Simply log and continue
                logger.warning('Ignoring lateness - "%s" was late by %s',
                               job, difference)
            if False:
                print 'dead code'
            else:
                try:
                    job.add_instance()
            ...

(I used dead code to minimise changes to the file.)

(*) As far as I know, all hardware being equal, this issue is exacerbated the longer the period is between trigger times. Conversely, the smaller the period, the more reliable and precise the timing is.

There are ways to improve the accuracy of apscheduler's scheduling algorithm, but they will involve complex changes to the source code. Until that's done, I'll relabel this issue as 'workaround exists' and keep it open.

Please report back if anyone can help reproduce this on demand. Can I suggest we add the following lines to your node so we have more info to look at when the condition occurs:

Top of scheduler nodes' script.py files:

...
import logging

logging.basicConfig()

logging.getLogger('apscheduler.scheduler').level = logging.DEBUG
logging.getLogger('apscheduler.threadpool').level = logging.DEBUG
...

@justparking justparking added workaround exists and removed bug labels Dec 18, 2014


scienceworld commented Dec 19, 2014

Thanks! I've copied the debugging lines and the updated if-statement to all of my advanced schedules.

Couldn't we instead/also increase the grace time to something like 300 seconds?
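For context, here's a simplified model of the misfire check quoted earlier from scheduler.py (the function name is illustrative, not apscheduler's API). It shows how a 300-second grace period would have let the 55-second-late run above go ahead instead of being reported as missed:

```python
from datetime import datetime, timedelta

def should_still_run(scheduled, now, grace_seconds):
    # Mirrors the `difference > grace_time` check in scheduler.py:
    # the job runs only if it is no later than the grace period.
    return (now - scheduled) <= timedelta(seconds=grace_seconds)

scheduled = datetime(2014, 12, 17, 9, 0, 0)
now = scheduled + timedelta(seconds=55, milliseconds=711)  # the miss logged above

should_still_run(scheduled, now, 1)    # tight grace: reported as missed
should_still_run(scheduled, now, 300)  # 300 s grace: job still runs
```

The trade-off, as noted in the next comment, is that a large grace period makes the lateness silent rather than reported.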


Contributor

justparking commented Dec 19, 2014

Yes, that'll work too, but I didn't suggest it because it ends up ignoring the condition without reporting it. I was hoping someone would post more detailed logs (*) when those delays occur.

BTW, I've currently got a few tests in place running on CPU-restricted hardware which should lead to a better workaround or perhaps even a fix.

(*) Can I ask to have this line changed in your node script too so we get accurate timestamps and threading info when things go wrong:

Top of scheduler nodes' script.py files - replace the existing plain logging.basicConfig() call with the line below (a second basicConfig() call is a no-op once logging is already configured):

logging.basicConfig(format='%(asctime)-15s.%(msecs).03d %(levelname)-8s %(threadName)-19s %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

Cheers.


scienceworld commented Feb 18, 2015

Got some good news.

The VM that nodel runs on was paused for a while yesterday during some server maintenance. This caused the scheduler to miss its schedule and run it 13 minutes late. I think this is because the scheduler isn't based on the clock, but on a seconds countdown from the last time it checked for an event, working off a delta to the next event. When the VM was paused, the countdown was paused too, and it resumed with the previous number of ticks remaining. There was no catch-up or re-evaluation of the delta to the next event, so the schedule ran late.

02-17 09:14:02 - INFO:apscheduler.scheduler:Scheduler started
02-17 09:14:02 - DEBUG:apscheduler.scheduler:Looking for jobs to run
02-17 09:14:02 - DEBUG:apscheduler.scheduler:Next wakeup is due at 2015-02-17 17:00:00 (in 27957.788001 seconds)
02-17 17:13:25 - DEBUG:apscheduler.scheduler:Looking for jobs to run
02-17 17:13:26 - DEBUG:apscheduler.scheduler:Next wakeup is due at 2015-02-18 09:00:00 (in 56794.083000 seconds)
02-17 17:13:26 - DEBUG:apscheduler.threadpool:Started worker thread
02-17 17:13:26 - WARNING:apscheduler.scheduler:Ignoring lateness - "callevent (trigger: cron[month='*', day='*', day_of_week='1-4', hour='17', minute='0'], next run at: 2015-02-18 17:00:00)" was late by 0:13:26.104000
02-17 17:13:26 - INFO:apscheduler.scheduler:Running job "callevent (trigger: cron[month='*', day='*', day_of_week='1-4', hour='17', minute='0'], next run at: 2015-02-18 17:00:00)" (scheduled at 2015-02-17 17:00:00)
02-17 17:13:26 - calling: AllOff
02-17 17:13:26 - INFO:apscheduler.scheduler:Job "callevent (trigger: cron[month='*', day='*', day_of_week='1-4', hour='17', minute='0'], next run at: 2015-02-18 17:00:00)" executed successfully
02-17 17:13:27 - DEBUG:apscheduler.threadpool:Exiting worker thread
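That "stale delta" theory suggests one possible mitigation (a sketch only, not part of apscheduler): instead of sleeping the whole delta in one go, sleep in short slices and re-derive the remaining time from the wall clock on each pass, so a VM pause costs at most roughly one polling interval.

```python
import time

def wait_until(deadline, poll=1.0):
    """Sleep towards an absolute wall-clock deadline in short slices.

    Re-checking time.time() each pass means that if the clock jumps
    forward (e.g. the VM was paused and resumed), the very next check
    notices and returns, instead of sleeping out a stale delta.
    """
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            return
        time.sleep(min(poll, remaining))

start = time.time()
wait_until(start + 0.2, poll=0.05)
# wakes close to the deadline; after a pause, the slip is bounded by `poll`
```

The cost is one extra wake-up per poll interval, which is why a scheduler might prefer a longer poll combined with clock re-checks rather than a single long sleep.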

Contributor

mcartmel commented Mar 22, 2017

Generally, this appears to be caused by a problem with the underlying apscheduler module, which is going to be difficult to fix.

I'm going to close this for now as we are releasing the new scheduler recipe, which I believe most people will use instead of this one.

On the limited occasions that this scheduler might still be used instead of the new recipe (such as for basic standalone operation), I think the workaround is sufficient.

@mcartmel mcartmel closed this Mar 22, 2017
