Skip to content
This repository has been archived by the owner on Dec 7, 2022. It is now read-only.

Fixes workers crashing on restart edge case #3283

Merged
merged 1 commit into from Jan 29, 2018
Merged

Fixes workers crashing on restart edge case #3283

merged 1 commit into from Jan 29, 2018

Conversation

dralley
Copy link
Contributor

@dralley dralley commented Jan 24, 2018

Fixes an issue where workers would crash when
attempting to revoke() a previously running task
on startup. If the user sends SIGKILL to the
worker while running a task, the worker will
attempt to mark the task as stale when restarted.
This causes a hang for yet-unknown reasons. Calling
revoke() is not necessary when a dead worker is being
cleaned up because all work goes through dedicated
queues which disappear when the worker does. We
leave revoke() active only when a task is being
cancelled outside of worker cleanup.

closes #2835
https://pulp.plan.io/issues/2835

@bmbouter
Copy link
Member

I just hand tested this. Against the current head of master f223542 I was able to reproduce the deadlocking when killing a worker with sigkill, restarting it, and then sending it new work. With this patch I confirmed that I could not reproduce the issue, and that new work started as expected.

@dralley
Copy link
Contributor Author

dralley commented Jan 29, 2018

@bmbouter Any objections to the logic of the change?

@@ -631,14 +631,18 @@ def _handle_cProfile(self, task_id):
self.pr.dump_stats("%s/%s" % (profile_directory, task_id))


def cancel(task_id):
def cancel(task_id, worker_cleanup=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about making this worker_cleanup=True and making the if statement instead be if worker_cleanup.

I think that would be clearer because currently if worker_cleanup is False, it actually does cleanup.

Copy link
Contributor Author

@dralley dralley Jan 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point about this being unclear but I disagree that the logic should be changed (the naming, perhaps. I think you parsed it in the opposite way from what I intended).

To clarify, my intent for worker_cleanup was to refer to the context in which the function is being used, and not what we want the cancel() function to do.

This is the exception, not the rule - only one place in the codebase attempts to cancel tasks during worker cleanup, but lots of places in the codebase attempt to cancel tasks during normal operation. If we were to flip this default, we would then need to specify worker_cleanup=False on every call to cancel() but this one, and remember to do so in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes I'm suggestion would keep it as the exception not the rule because in addition to changing the default value from False to True you would also change the s/if not worker_cleanup/if worker_cleanup/

Since the default would be to perform cleanup, we would only need to specify worker_cleanup=False on that one call that wants to modify the "cleanup by default" behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so this is basically switching the semantic meaning of worker_cleanup. That is reasonable, I'll change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly! Thank you.

Copy link
Member

@bmbouter bmbouter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dralley This tested perfectly, and the code looks great except for the default value of worker_cleanup. See the comment I left there. Can that be switched? I can LGTM then.

Thank you so much for working on this! I think the solution you came up with is very elegant.

@daviddavis
Copy link
Contributor

@dralley if you're going to go back and make changes, can I request the changes to server/pulp/server/async/app.py be removed from this PR and be opened in a separate PR?

@dralley
Copy link
Contributor Author

dralley commented Jan 29, 2018

@daviddavis sure

Fixes an issue where workers would crash when
attempting to revoke() a previously running task
on startup. If the user sends SIGKILL to the
worker while running a task, the worker will
attempt to mark the task as stale when restarted.
This causes a hang for yet-unknown reasons. Calling
revoke() is not necessary when a dead worker is being
cleaned up because all work goes through dedicated
queues which disappear when the worker does. We
leave revoke() active only when a task is being
cancelled outside of worker cleanup.

closes #2835
https://pulp.plan.io/issues/2835

	modified:   server/pulp/server/async/tasks.py
	modified:   server/test/unit/server/async/test_tasks.py
@dralley
Copy link
Contributor Author

dralley commented Jan 29, 2018

@bmbouter @daviddavis changes made

Copy link
Member

@bmbouter bmbouter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks exactly correct. Thank you so much @dralley!

@dralley dralley merged commit cf86842 into pulp:master Jan 29, 2018
@dralley dralley deleted the 2835-better-fix branch January 29, 2018 17:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants