
Jobs terminated by Worker report as "Running" forever #633

Closed
glennmatthews opened this issue Jun 30, 2021 · 7 comments
Labels
type: bug Something isn't working as expected

Comments

@glennmatthews
Contributor

Environment

  • Python version:
  • Nautobot version: 1.0.3

Steps to Reproduce

  1. Run a Job that allocates more memory than is actually available (e.g. with the worker running in a memory-limited Docker container) so that it gets killed by the system.
  2. The Nautobot JobResult gets stuck as "Running" and never shows as completed in the UI.
  3. An admin can access admin/background-tasks/ and see that the RQ task associated with the Job was killed, but this information is not captured in the JobResult or otherwise visible to non-admin users.

Similar behavior is likely if a job runs longer than the configured maximum timeout and is killed by RQ as a result.

I recognize that this specific symptom may change as we move to replace RQ with Celery (#531), but these same sorts of scenarios will likely need to be accounted for with the Celery worker as well.

Expected Behavior

Nautobot needs to be made aware when a worker task fails or aborts, and update the JobResult accordingly. For RQ, the docs describe some possible approaches; I'm sure there are similar options for Celery.
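
A rough sketch of one Celery-side option, using the `task_failure` signal. The `mark_job_result_failed` helper is hypothetical and only illustrates the idea of mapping a failed Celery task back to its JobResult; note the signal fires on unhandled exceptions, not on a SIGKILL of the worker process itself.

```python
from celery.signals import task_failure


def mark_job_result_failed(task_id, reason):
    """Hypothetical helper: look up the JobResult tied to this Celery task id
    and persist a terminal 'failed' status plus the failure reason."""
    ...


@task_failure.connect
def on_task_failure(sender=None, task_id=None, exception=None, **kwargs):
    # Fires when a task raises an unhandled exception inside the worker;
    # it does not fire if the worker process itself is killed.
    mark_job_result_failed(task_id, reason=str(exception))
```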

Observed Behavior

@jathanism jathanism added the type: bug Something isn't working as expected label Jul 1, 2021
@jathanism
Contributor

As RQ is deprecated, we won't need to solve for that.

We addressed the timeout scenario with the Celery workers by implementing soft/hard time limits. But what if a Celery worker dies or disappears entirely? Where in the call flow do we expect the job to get updated by something else (i.e., the connective tissue between the Celery task state and the JobResult object)?
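
For context, a minimal sketch of the soft/hard time limit pair in plain Celery configuration; the values and broker URL are illustrative, not Nautobot's shipped defaults:

```python
from celery import Celery

app = Celery("example", broker="redis://localhost:6379/0")  # illustrative broker URL

# The soft limit raises SoftTimeLimitExceeded inside the running task so it
# can clean up; the hard limit SIGKILLs the worker child process shortly after.
app.conf.task_soft_time_limit = 300  # seconds (example value)
app.conf.task_time_limit = 330       # seconds (example value)
```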

@glennmatthews
Contributor Author

Thinking a bit more about it, this may be something we want to address once #275 / #374 are implemented: have a periodic task that looks for JobResults that are marked as Running but aren't actually associated with any active task, perhaps something like the sketch below?
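
Purely to illustrate that idea, a hedged sketch of such a "reaper" task. The JobResult import path, field names, and status values are assumptions rather than Nautobot's actual schema, and the Celery `inspect()` call is best-effort (it only sees workers that respond):

```python
from celery import current_app, shared_task

from nautobot.extras.models import JobResult  # assumed import path


@shared_task
def reap_orphaned_job_results():
    # Ask every reachable worker which task ids it is currently executing.
    inspect = current_app.control.inspect()
    active = inspect.active() or {}  # {worker_name: [task dicts]}
    active_ids = {task["id"] for tasks in active.values() for task in tasks}

    # "status", "job_id", and the "running"/"failed" values are assumptions.
    for result in JobResult.objects.filter(status="running"):
        if str(result.job_id) not in active_ids:
            result.status = "failed"
            result.save()
```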

@lampwins lampwins added this to the v1.4.0 milestone Apr 8, 2022
@lampwins lampwins modified the milestones: v1.4.0, v2.0.0 Apr 26, 2022
@lampwins
Member

See #765

@bryanculver bryanculver mentioned this issue Jun 13, 2022
@lampwins lampwins changed the title Jobs terminated by RQ report as "Running" forever Jobs terminated by Worker report as "Running" forever Sep 8, 2022
@bryanculver
Member

See #1622, which should solve this.

@jathanism
Contributor

jathanism commented Feb 4, 2023

As a follow-up: even with the work in #1622, when a hard time_limit is reached, Celery sends SIGKILL (-9) to the worker process executing the long-running task, leaving the JobResult in a "hung" state, never receiving a status update, a stored traceback, or any other incremental updates. Internally, a mark_as_failure event is fired on the database backend, but the process is killed before the event can complete and update the backend correctly.

The only real workaround here is making proper use of soft_time_limit on jobs that are known to take a long time and having them catch SoftTimeLimitExceeded and clean up appropriately, as we have recommended in our documentation.

Longer term, there may be a way to solve this through the use of subtasks when a hard time limit is detected, but in the current default implementation this results in the JobResult being out of phase with the task being terminated. As far as I know, this is the only lingering edge case here, and it should be incredibly rare in practice given proper use of soft_time_limit by Job authors.
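
For reference, a minimal sketch of that recommended pattern inside a job body; the work and cleanup functions are hypothetical placeholders, and the surrounding Job plumbing is omitted:

```python
from celery.exceptions import SoftTimeLimitExceeded


def do_long_running_work():
    ...  # hypothetical stand-in for the actual job body


def clean_up_partial_state():
    ...  # hypothetical stand-in for releasing locks, recording progress, etc.


def run_long_job():
    try:
        do_long_running_work()
    except SoftTimeLimitExceeded:
        # Soft limit hit: clean up and fail gracefully so the JobResult records
        # a failure instead of being SIGKILLed mid-write by the hard limit.
        clean_up_partial_state()
        raise
```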

@bryanculver
Member

@jathanism can we close this because of #3085 and #3084?

@jathanism
Contributor

> @jathanism can we close this because of #3085 and #3084?

Yep!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 11, 2023