Why does job not go from Started Execution to Executed #30

Closed
Juanjoorrego opened this issue Jul 17, 2018 · 11 comments
@Juanjoorrego

Hi there,

I have noticed that some jobs never go from Started Execution to Executed for some reason. There is no clear error; it just happens. Is there a reason for this?

[screenshot: job execution list, 2018-07-16]

@Juanjoorrego
Author

After this happens, jobs that are supposed to run later sometimes show up and sometimes don't.

@AnthonyBRoberts

I'm having a similar issue. Each job seems to run twice: one run never makes it out of the Started Execution state, while the other usually makes it to the Executed state. I have Django running behind Gunicorn, so I had to use this hack to get the scheduler to run only on the master node and not on the workers.

The scheduler is initialized at the bottom of the wsgi.py file. Otherwise, it follows the above StackOverflow method to avoid APScheduler running on every node.
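
For reference, a minimal sketch of this kind of single-process guard, assuming the commonly-suggested socket-binding trick rather than the exact StackOverflow snippet referenced above; the port number and function name are illustrative:

    # wsgi.py (sketch) -- start APScheduler in only one Gunicorn process.
    # The socket bind acts as a cross-process lock: only the first process
    # to bind the port starts the scheduler; the rest silently skip it.
    import socket

    from apscheduler.schedulers.background import BackgroundScheduler

    _scheduler_lock_socket = None  # module-level, so the bind is held open


    def start_scheduler_once(port=47200):  # port is an arbitrary choice
        global _scheduler_lock_socket
        try:
            _scheduler_lock_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            _scheduler_lock_socket.bind(("127.0.0.1", port))
        except OSError:
            return None  # another worker already owns the scheduler
        scheduler = BackgroundScheduler()
        scheduler.start()
        return scheduler


    scheduler = start_scheduler_once()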

@przemyslaw-szurmak

przemyslaw-szurmak commented Nov 22, 2019

Hey @Juanjoorrego @AnthonyBRoberts, have you been able to figure out what the issue was? I'm having exactly the same problem as you...

@sdementen

sdementen commented Apr 11, 2020

Hello, I am also having this issue, both when using the Django runserver locally and when using uWSGI in production: my jobs run as many times as there are processes running Django.

@jarekwg I have what seems to be a fix: adding a lock in the _get_jobs method to avoid a race condition when get_due_jobs is called at the same time from different processes/threads. I have added the transaction.atomic block, the select_for_update() call for the lock, and the update of next_run_time:

    def _get_jobs(self, **filters):
        from django.db import transaction

        with transaction.atomic():
            # select_for_update() locks the matching rows until the end of the
            # transaction, so concurrent schedulers block here instead of
            # fetching the same due jobs.
            job_states = DjangoJob.objects.filter(
                **filters
            ).select_for_update().values_list('id', 'job_state')

            jobs = []
            failed_job_ids = set()
            for job_id, job_state in job_states:
                try:
                    jobs.append(self._reconstitute_job(job_state))
                except Exception:
                    self._logger.exception('Unable to restore job "%s" -- removing it', job_id)
                    failed_job_ids.add(job_id)

            # Set next_run_time to None so that a concurrent call does not
            # reselect the jobs we are about to hand to the scheduler.
            if "next_run_time__lte" in filters:
                job_states.update(next_run_time=None)

            # Remove all the jobs we failed to restore
            if failed_job_ids:
                LOGGER.warning("Remove bad jobs: %s", failed_job_ids)
                DjangoJob.objects.filter(id__in=failed_job_ids).delete()

            def map_jobs(job):
                # deserialize_dt is django_apscheduler's module-level helper
                job.next_run_time = deserialize_dt(job.next_run_time)
                return job

            return list(map(map_jobs, jobs))

Could this be a proper fix, in your view?

@sdementen

This could also fix #12.

@jarekwg
Collaborator

jarekwg commented Apr 13, 2020

Hi @sdementen,
I don't really have a strong opinion on best approaches in this package anymore.
At a high level, I'd say either:

  • lock the rows from being read/updated concurrently (as you've done here).
  • lock the _get_jobs function from being run concurrently (a db-level semaphore would be pretty neat, though I don't know if that would limit support to Postgres or not; see the sketch after this comment).

Either way, I probably won't merge a change like this myself, as I'm too far removed at this point. Would you (or anyone else here) be interested in helping maintain this package? Happy to grant write access.
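
For concreteness, a hedged sketch of that second option using a Postgres advisory lock (which would indeed be Postgres-specific); the lock key and wrapper name are illustrative, not part of either library:

    # Serialize access to the job store across processes with a Postgres
    # advisory lock. pg_advisory_xact_lock() blocks until the lock is free
    # and releases it automatically when the transaction ends.
    from django.db import connection, transaction

    APSCHEDULER_LOCK_KEY = 54321  # arbitrary application-wide lock id


    def get_due_jobs_exclusively(jobstore, now):
        with transaction.atomic():
            with connection.cursor() as cursor:
                cursor.execute(
                    "SELECT pg_advisory_xact_lock(%s)", [APSCHEDULER_LOCK_KEY]
                )
            # get_due_jobs() is the standard APScheduler jobstore method.
            return jobstore.get_due_jobs(now)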

@sdementen

I think locking _get_jobs is not enough, as the update of the jobs is done in a second phase in APScheduler. Between a first call of _get_jobs and a second call, the job update may not have happened yet, leading the second call to return the same jobs as the first; hence the need to mark the job as "read" in _get_jobs (which is what I do by setting job.next_run_time to None).
To your knowledge, is this enough?

I can do a PR and update the tests/docs if needed. As for maintenance, I can help if you want (even though I do not have deep knowledge of APScheduler), but I do not feel I can take it over in the long term (it is not in my core or secondary focus, more my third...).

@jarekwg
Collaborator

jarekwg commented Apr 13, 2020

Hmm, so your idea of somehow marking the DjangoJobs as already queued is sound; however, doing this by wiping the next_run_time field feels like a hack:

  • it overloads the usage of this field.
  • it could accidentally get broken if the next_run_time calculation behaviour were to change in future.
  • it could break/confuse other things that refer to next_run_time, e.g. #68 (the fix for issue #41) assumes that a blank next_run_time implies the Job is paused.

I'm not familiar with the code anymore, but could next_run_time just get bumped to the actual next run time here instead? That'd stop it getting picked up multiple times.
It would probably be prudent to change the function name as well, since _get_jobs implies that it's not writing to the db.
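
For illustration, a minimal sketch of that bump-instead-of-wipe idea, assuming APScheduler 3.x triggers (which expose get_next_fire_time(previous_fire_time, now)); the function name bump_next_run_time and the serialize_dt import location are assumptions:

    # Sketch: advance next_run_time via the job's own trigger instead of
    # wiping it, so the row stops matching next_run_time__lte filters.
    from datetime import datetime, timezone

    from django_apscheduler.util import serialize_dt  # assumed helper location


    def bump_next_run_time(aps_job, django_job):
        now = datetime.now(timezone.utc)
        # get_next_fire_time() is implemented by every APScheduler 3.x trigger.
        next_fire = aps_job.trigger.get_next_fire_time(aps_job.next_run_time, now)
        django_job.next_run_time = serialize_dt(next_fire) if next_fire else None
        django_job.save(update_fields=["next_run_time"])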

@sdementen

Indeed, it felt like a hack to me too.
I also agree that the name is not right, as the function has a side effect. Yet, given that APScheduler first gets the jobs and updates the next run time afterwards (as it assumes a single scheduler), I think we can keep this get-and-update behaviour and rename the function to make it more obvious.
Regarding marking the job as "retrieved for execution", it may be better to add a boolean field, to be even more explicit than mangling the current field. What do you think?

@jarekwg
Collaborator

jarekwg commented Apr 14, 2020

Most job runners have a status field that covers this, plus other aspects of the Job life cycle, e.g. PENDING, QUEUED, RUNNING, SUCCESS, FAILURE.
How are the remaining states managed? By APScheduler? Maybe we could pull them all through to a Job.status field? If that doesn't make sense in this case, then yeah, a bool is_queued or a timestamp queued_at (in case knowing when a job was queued is useful) might be the way to go.
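
A purely illustrative sketch of that Job.status idea; django_apscheduler does not define this model as shown, and the field names (status, queued_at) are assumptions:

    # Hypothetical model extension: an explicit life-cycle status plus a
    # queued_at timestamp, instead of overloading next_run_time.
    from django.db import models

    JOB_STATUS_CHOICES = [
        ("PENDING", "Pending"),
        ("QUEUED", "Queued"),
        ("RUNNING", "Running"),
        ("SUCCESS", "Success"),
        ("FAILURE", "Failure"),
    ]


    class DjangoJob(models.Model):
        name = models.CharField(max_length=255, unique=True)
        next_run_time = models.DateTimeField(null=True, blank=True, db_index=True)
        status = models.CharField(
            max_length=16, choices=JOB_STATUS_CHOICES, default="PENDING"
        )
        queued_at = models.DateTimeField(null=True, blank=True)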

@jcass77
Owner

jcass77 commented Jul 18, 2020

Fixed via #86.

@jcass77 jcass77 closed this as completed Jul 18, 2020