Fixes workers crashing on restart edge case #3283
Conversation
|
I just hand tested this. Against the current head of master f223542 I was able to reproduce the deadlocking when killing a worker with sigkill, restarting it, and then sending it new work. With this patch I confirmed that I could not reproduce the issue, and that new work started as expected. |
|
@bmbouter Any objections to the logic of the change? |
server/pulp/server/async/tasks.py
Outdated
| @@ -631,14 +631,18 @@ def _handle_cProfile(self, task_id): | |||
| self.pr.dump_stats("%s/%s" % (profile_directory, task_id)) | |||
|
|
|||
|
|
|||
| def cancel(task_id): | |||
| def cancel(task_id, worker_cleanup=False): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about making this worker_cleanup=True and making the if statement instead be if worker_cleanup.
I think that would be clearer because currently if worker_cleanup is False, it actually does cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point about this being unclear but I disagree that the logic should be changed (the naming, perhaps. I think you parsed it in the opposite way from what I intended).
To clarify, my intent for worker_cleanup was to refer to the context in which the function is being used, and not what we want the cancel() function to do.
This is the exception, not the rule - only one place in the codebase attempts to cancel tasks during worker cleanup, but lots of places in the codebase attempt to cancel tasks during normal operation. If we were to flip this default, we would then need to specify worker_cleanup=False on every call to cancel() but this one, and remember to do so in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes I'm suggestion would keep it as the exception not the rule because in addition to changing the default value from False to True you would also change the s/if not worker_cleanup/if worker_cleanup/
Since the default would be to perform cleanup, we would only need to specify worker_cleanup=False on that one call that wants to modify the "cleanup by default" behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so this is basically switching the semantic meaning of worker_cleanup. That is reasonable, I'll change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly! Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dralley This tested perfectly, and the code looks great except for the default value of worker_cleanup. See the comment I left there. Can that be switched? I can LGTM then.
Thank you so much for working on this! I think the solution you came up with is very elegant.
|
@dralley if you're going to go back and make changes, can I request the changes to server/pulp/server/async/app.py be removed from this PR and be opened in a separate PR? |
|
@daviddavis sure |
Fixes an issue where workers would crash when attempting to revoke() a previously running task on startup. If the user sends SIGKILL to the worker while running a task, the worker will attempt to mark the task as stale when restarted. This causes a hang for yet-unknown reasons. Calling revoke() is not necessary when a dead worker is being cleaned up because all work goes through dedicated queues which disappear when the worker does. We leave revoke() active only when a task is being cancelled outside of worker cleanup. closes #2835 https://pulp.plan.io/issues/2835 modified: server/pulp/server/async/tasks.py modified: server/test/unit/server/async/test_tasks.py
|
@bmbouter @daviddavis changes made |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks exactly correct. Thank you so much @dralley!
Fixes an issue where workers would crash when
attempting to revoke() a previously running task
on startup. If the user sends SIGKILL to the
worker while running a task, the worker will
attempt to mark the task as stale when restarted.
This causes a hang for yet-unknown reasons. Calling
revoke() is not necessary when a dead worker is being
cleaned up because all work goes through dedicated
queues which disappear when the worker does. We
leave revoke() active only when a task is being
cancelled outside of worker cleanup.
closes #2835
https://pulp.plan.io/issues/2835