
Jobs terminated by Worker report as "Running" forever #633

Closed
glennmatthews opened this issue Jun 30, 2021 · 7 comments
Labels
type: bug Something isn't working as expected

Comments

@glennmatthews
Contributor

Environment

  • Python version:
  • Nautobot version: 1.0.3

Steps to Reproduce

  1. Run a Job that allocates more memory than is actually available (e.g. with the worker running in a memory-limited Docker container) so that it gets killed by the system.
  2. The Nautobot JobResult gets stuck as "Running" and never shows as completed in the UI.
  3. An admin can access admin/background-tasks/ and see that the RQ task associated with the Job was killed, but this information is not captured in the JobResult or otherwise visible to non-admin users.

Similar behavior is likely if a job runs longer than the configured maximum timeout and is killed by RQ as a result.

I recognize that this specific symptom may change as we move to replace RQ with Celery (#531), but these same sorts of scenarios will likely need to be accounted for with the Celery worker as well.

Expected Behavior

Nautobot needs to be made aware when a worker task fails or aborts, and update the JobResult accordingly. For RQ, the docs describe some possible approaches; I'm sure there are similar options for Celery.
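
A rough sketch of one Celery-side option, using the `task_failure` signal. The `mark_job_result_failed` helper is hypothetical and only illustrates the idea of mapping a failed Celery task back to its JobResult; note the signal fires on unhandled exceptions, not on a SIGKILL of the worker process itself.

```python
from celery.signals import task_failure


def mark_job_result_failed(task_id, reason):
    """Hypothetical helper: look up the JobResult tied to this Celery task id
    and persist a terminal 'failed' status plus the failure reason."""
    ...


@task_failure.connect
def on_task_failure(sender=None, task_id=None, exception=None, **kwargs):
    # Fires when a task raises an unhandled exception inside the worker;
    # it does not fire if the worker process itself is killed.
    mark_job_result_failed(task_id, reason=str(exception))
```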

Observed Behavior

@jathanism jathanism added the type: bug Something isn't working as expected label Jul 1, 2021
@jathanism
Contributor

As RQ is deprecated, we won't need to solve for that.

We addressed the timeout scenario with the Celery workers by implementing soft/hard time limits. But what if a Celery worker dies or disappears entirely? Where in the call flow do we expect the job to get updated by something else (i.e., the connective tissue between the Celery task state and the JobResult object)?
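
For context, a minimal sketch of the soft/hard time limit pair in plain Celery configuration; the values and broker URL are illustrative, not Nautobot's shipped defaults:

```python
from celery import Celery

app = Celery("example", broker="redis://localhost:6379/0")  # illustrative broker URL

# The soft limit raises SoftTimeLimitExceeded inside the running task so it
# can clean up; the hard limit SIGKILLs the worker child process shortly after.
app.conf.task_soft_time_limit = 300  # seconds (example value)
app.conf.task_time_limit = 330       # seconds (example value)
```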

@glennmatthews
Contributor Author

Thinking a bit more about it, this may be something we want to address once #275 / #374 are implemented: have a periodic task that looks for JobResults that are marked as Running but aren't actually associated with any active task, perhaps something like the sketch below?
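
Purely to illustrate that idea, a hedged sketch of such a "reaper" task. The JobResult import path, field names, and status values are assumptions rather than Nautobot's actual schema, and the Celery `inspect()` call is best-effort (it only sees workers that respond):

```python
from celery import current_app, shared_task

from nautobot.extras.models import JobResult  # assumed import path


@shared_task
def reap_orphaned_job_results():
    # Ask every reachable worker which task ids it is currently executing.
    inspect = current_app.control.inspect()
    active = inspect.active() or {}  # {worker_name: [task dicts]}
    active_ids = {task["id"] for tasks in active.values() for task in tasks}

    # "status", "job_id", and the "running"/"failed" values are assumptions.
    for result in JobResult.objects.filter(status="running"):
        if str(result.job_id) not in active_ids:
            result.status = "failed"
            result.save()
```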

@lampwins lampwins added this to the v1.4.0 milestone Apr 8, 2022
@lampwins lampwins modified the milestones: v1.4.0, v2.0.0 Apr 26, 2022
@lampwins
Member

See #765

@bryanculver bryanculver mentioned this issue Jun 13, 2022
@lampwins lampwins changed the title Jobs terminated by RQ report as "Running" forever Jobs terminated by Worker report as "Running" forever Sep 8, 2022
@bryanculver
Member

See #1622, which should solve this.

@jathanism
Contributor

jathanism commented Feb 4, 2023

As a follow-up: even with the work in #1622, when a hard time_limit is reached, Celery sends SIGKILL (-9) to the worker process executing the long-running task, leaving the JobResult in a "hung" state, never receiving a status update, a stored traceback, or any other incremental updates. Internally, a mark_as_failure event is fired on the database backend, but the process is killed before the event can complete and update the backend correctly.

The only real workaround here is making proper use of soft_time_limit on jobs that are known to take a long time and having them catch SoftTimeLimitExceeded and clean up appropriately, as we have recommended in our documentation.

Longer term, there may be a way to solve this through the use of subtasks when a hard time limit is detected, but in the current default implementation this results in the JobResult being out of phase with the task being terminated. As far as I know, this is the only lingering edge case here, and it should be incredibly rare in practice given proper use of soft_time_limit by Job authors.
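
For reference, a minimal sketch of that recommended pattern inside a job body; the work and cleanup functions are hypothetical placeholders, and the surrounding Job plumbing is omitted:

```python
from celery.exceptions import SoftTimeLimitExceeded


def do_long_running_work():
    ...  # hypothetical stand-in for the actual job body


def clean_up_partial_state():
    ...  # hypothetical stand-in for releasing locks, recording progress, etc.


def run_long_job():
    try:
        do_long_running_work()
    except SoftTimeLimitExceeded:
        # Soft limit hit: clean up and fail gracefully so the JobResult records
        # a failure instead of being SIGKILLed mid-write by the hard limit.
        clean_up_partial_state()
        raise
```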

@bryanculver
Member

@jathanism can we close this because of #3085 and #3084?

@jathanism
Contributor

> @jathanism can we close this because of #3085 and #3084?

Yep!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 11, 2023