Prevent tasking system deadlocks when Redis is restarted/data lost #1058
Conversation
Force-pushed from 695552a to 68a16f6.
Attached issue: https://pulp.plan.io/issues/7912
Force-pushed from 68a16f6 to 3d3f308.
Force-pushed from a1460fc to a273da5.
Is there a 1-to-1 relationship between workers and queues? In that case it should be possible to merge the cleanup into one routine.
Since you no longer enqueue the resource job, you should have a look at this line too:
pulpcore/tasking/util.py, line 49 in 25e92d0:
resource_job = Job(id=str(task_status._resource_job_id), connection=redis_conn)
This resource job is actually the "resource manager job", whereas the one I got rid of was the "release resources job". So unfortunately we still need it, since they do different things.
Force-pushed from ea19612 to e1d076f.
pulpcore/tasking/worker_watcher.py (outdated):
def handle_worker_offline(worker_name):
    """
    This is a generic function for handling workers going offline.
def check_dropped_queues():
I think this name could be clearer. Somehow working in the idea that it also cancels would do it for me.
Maybe check_and_cancel_missing_tasks?
pulpcore/tasking/worker_watcher.py (outdated):
    _delete_worker() task is called to handle any work cleanup associated with a worker going
    offline. Logging at the info level is also done.

    In some situations, such as a restart of Redis, Jobs can be dropped from the Redis
    queues and "forgotten". Therefore the Task will never be marked completed in Pulp.
Another sentence here would be good I think. It could add that it's problematic because tasks that never complete never release their resources and that causes the tasking system to deadlock.
    except NoSuchJobError:
        cancel(task.pk)
# Also go through all of the tasks that were still queued up on the resource manager |
Niiiiice 👍
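The check-and-cancel pattern discussed in the snippets above can be sketched with an in-memory dict standing in for the Redis-backed job store. This is a hypothetical illustration, not pulpcore's actual code: a real implementation would fetch RQ jobs from Redis and catch rq.exceptions.NoSuchJobError, and the helper names below (fetch_job, check_and_cancel_missing_tasks) are illustrative.

```python
# Sketch only: a dict stands in for the Redis-backed job store. A real
# implementation would call rq.job.Job.fetch(job_id, connection=redis_conn)
# and catch rq.exceptions.NoSuchJobError.

class NoSuchJobError(Exception):
    """Raised when a job id is no longer present in the queue backend."""

def fetch_job(job_store, job_id):
    """Stand-in for fetching an RQ Job from Redis by id."""
    try:
        return job_store[job_id]
    except KeyError:
        raise NoSuchJobError(job_id)

def check_and_cancel_missing_tasks(unfinished_tasks, job_store, cancel):
    """Cancel every incomplete task whose backing job was dropped (e.g. by a
    Redis restart), so it releases its resources instead of deadlocking."""
    for task in unfinished_tasks:
        try:
            fetch_job(job_store, task["pk"])
        except NoSuchJobError:
            cancel(task["pk"])

# Example: two incomplete tasks, but only job "a" survived the restart.
cancelled = []
tasks = [{"pk": "a"}, {"pk": "b"}]
jobs = {"a": object()}  # job "b" was lost when Redis restarted
check_and_cancel_missing_tasks(tasks, jobs, cancelled.append)
# cancelled is now ["b"]
```

The point of the pattern is that a task whose job vanished can never finish on its own, so cancelling it is the only way to release its resource locks.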
CHANGES/7912.bugfix (outdated):
@@ -0,0 +1 @@
Provide a mechanism to automatically resolve issues and prevent deadlocks when Redis experiences data loss (such as a restart).
nitpick: add a newline at the end?
I ran some hand tests where I simulated an OOM kill against a long-running task, and it was marked as failed nicely.
I think this PR is excellent. I am only requesting small docstring, function name, and newline changes.
Thank you for making this! It's much simpler and better! 🥇
Force-pushed from e1d076f to 73f01f2.
[noissue]
Move it to an after-task action re: #7912 https://pulp.plan.io/issues/7912
Force-pushed from 73f01f2 to 9fce6fc.
This is a great improvement!
redis_conn = connection.get_redis_connection()

assigned_and_unfinished_tasks = Task.objects.filter(
    state__in=TASK_INCOMPLETE_STATES, worker__in=Worker.objects.online_workers()
@bmbouter Should this be worker__isnull=False? In theory I guess you could unplug a machine, and after a while the worker would no longer be "online", but its task would still be assigned.
Although I suppose the worker watcher should handle that
Yeah, I actually tested this: I pkilled all workers, including parents and their workhorse children. The tasks remained in the running state until the worker watcher timeout occurred, and they were then transitioned to canceled by the resource manager running its heartbeat check.
Thank you!
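The distinction raised in this thread can be illustrated with plain Python. The dict-based task and worker shapes below are illustrative stand-ins; pulpcore itself uses Django querysets with worker__in and worker__isnull lookups.

```python
# Compare the two filters discussed above: worker__in=online_workers()
# skips tasks whose worker has gone offline, while worker__isnull=False
# would still include them.

tasks = [
    {"pk": 1, "state": "running", "worker": "w1"},
    {"pk": 2, "state": "running", "worker": "w2"},
    {"pk": 3, "state": "running", "worker": None},  # never assigned
]
online = {"w1"}  # w2's machine was unplugged and missed its heartbeat

# Analogue of worker__in=Worker.objects.online_workers()
by_online = [t for t in tasks if t["worker"] in online]

# Analogue of worker__isnull=False
by_not_null = [t for t in tasks if t["worker"] is not None]

# by_online catches only task 1; task 2 (offline worker) is excluded and
# is instead cleaned up by the worker watcher's timeout handling.
```

This matches the resolution of the thread: tasks on offline workers are deliberately left out of this query because the worker watcher's timeout path cancels them.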