
Prevent tasking system deadlocks when Redis is restarted/data lost #1058

Merged: 3 commits from dralley:tasking-deadlock into pulp:master on Jan 22, 2021

Conversation

@dralley (Contributor) commented Dec 14, 2020

No description provided.

@dralley force-pushed the tasking-deadlock branch 2 times, most recently from 695552a to 68a16f6 (December 14, 2020 21:47)
@pulpbot (Member) commented Dec 14, 2020

Attached issue: https://pulp.plan.io/issues/7912

@mdellweg (Member) left a comment

Is there a 1-to-1 relationship of workers and queues? In that case it should be possible to merge the cleanup into one routine.

As you do not enqueue the resource job anymore, you should have a look at this line too:

resource_job = Job(id=str(task_status._resource_job_id), connection=redis_conn)

pulpcore/tasking/worker_watcher.py (outdated review comment, resolved)
@dralley (Contributor, Author) commented Dec 15, 2020

This resource job is actually the "resource manager job" whereas the one I got rid of was the "release resources job". So unfortunately we still need it since they do different things.

@dralley force-pushed the tasking-deadlock branch 2 times, most recently from ea19612 to e1d076f (December 15, 2020 14:56)
def handle_worker_offline(worker_name):
"""
This is a generic function for handling workers going offline.
def check_dropped_queues():
A Member commented:

This name could be clearer, I think. Working the idea that it also cancels into the name would do it for me.

A Member replied:

Maybe check_and_cancel_missing_tasks?

_delete_worker() task is called to handle any work cleanup associated with a worker going
offline. Logging at the info level is also done.
In some situations such as a restart of Redis, Jobs can be dropped from the Redis
queues and "forgotten". Therefore the Task will never be marked completed in Pulp.
A Member commented:

Another sentence here would be good, I think. It could add that this is problematic because tasks that never complete never release their resources, which causes the tasking system to deadlock.

except NoSuchJobError:
cancel(task.pk)

# Also go through all of the tasks that were still queued up on the resource manager
A Member commented:

Niiiiice 👍
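
For readers following the thread, here is a minimal sketch of the recovery pattern under discussion, assembled from the hunks above. The Pulp names (Task, Worker.objects.online_workers(), TASK_INCOMPLETE_STATES, cancel) are taken from those hunks, the import paths and the job-id mapping are assumptions, and rq's Job.fetch() raising NoSuchJobError for a missing job is standard RQ behaviour; treat this as an illustration rather than the exact merged code.

from rq.job import Job
from rq.exceptions import NoSuchJobError

from pulpcore.app.models import Task, Worker            # model names as used in the hunks
from pulpcore.constants import TASK_INCOMPLETE_STATES   # import path assumed
from pulpcore.tasking.util import cancel                 # import path assumed


def check_and_cancel_missing_tasks(redis_conn):
    """Cancel Pulp Tasks whose RQ Jobs were dropped from Redis (e.g. after a restart)."""
    assigned_and_unfinished_tasks = Task.objects.filter(
        state__in=TASK_INCOMPLETE_STATES,
        worker__in=Worker.objects.online_workers(),
    )
    for task in assigned_and_unfinished_tasks:
        try:
            # Assumes the RQ job id mirrors the Task pk, as cancel(task.pk) above suggests.
            Job.fetch(str(task.pk), connection=redis_conn)
        except NoSuchJobError:
            # The Job is gone, so the Task would never finish or release its
            # resources; cancel it so the tasking system cannot deadlock.
            cancel(task.pk)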

@@ -0,0 +1 @@
Provide a mechanism to automatically resolve issues and prevent deadlocks when Redis experiences data loss (such as a restart).
A Member commented:

nitpick: add a newline at the end?

@bmbouter (Member) commented:

I ran some hand tests where I simulated an OOM kill against a long-running task, and it was marked as failed nicely.

@bmbouter (Member) left a comment

I think this PR is excellent. I am only requesting very small docstring, function name, and newline changes.

Thank you for making this! It's much simpler and better! 🥇

@dkliban (Member) left a comment

This is a great improvement!

redis_conn = connection.get_redis_connection()

assigned_and_unfinished_tasks = Task.objects.filter(
state__in=TASK_INCOMPLETE_STATES, worker__in=Worker.objects.online_workers()
@dralley (Contributor, Author) commented:

@bmbouter Should this be worker__isnull=False? In theory I guess you could unplug a machine and after a while the worker is no longer "online", but its task would still be assigned.
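
For illustration, the two queryset variants being weighed here differ only in the worker filter; model and constant names are the ones from the hunk above, with assumed import paths:

from pulpcore.app.models import Task, Worker            # as in the hunk above
from pulpcore.constants import TASK_INCOMPLETE_STATES   # import path assumed

# Variant in the PR: only tasks assigned to workers currently considered online.
tasks_on_online_workers = Task.objects.filter(
    state__in=TASK_INCOMPLETE_STATES,
    worker__in=Worker.objects.online_workers(),
)

# Suggested alternative: any incomplete task that has a worker at all, online or not.
tasks_with_any_worker = Task.objects.filter(
    state__in=TASK_INCOMPLETE_STATES,
    worker__isnull=False,
)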

@dralley (Contributor, Author) added:

Although I suppose the worker watcher should handle that

A Member replied:

Yeah, I actually tested this: I pkilled all workers, including parents and their workhorse children. They remained in the running state until the worker watcher timeout occurred, and then they were transitioned to cancelled by the resource manager running its heartbeat check.
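
For context, a hypothetical sketch of the heartbeat-timeout path described above, not pulpcore's actual worker watcher: the last_heartbeat field, the timeout value, and the import path for handle_worker_offline() (the function shown in the hunk earlier) are all assumptions.

from datetime import timedelta

from django.utils import timezone

from pulpcore.app.models import Worker                              # model name as used in this PR
from pulpcore.tasking.worker_watcher import handle_worker_offline   # import path assumed

WORKER_TTL = timedelta(seconds=30)  # hypothetical timeout value


def clean_up_missing_workers():
    """Treat workers whose heartbeat is older than the timeout as offline and clean up."""
    cutoff = timezone.now() - WORKER_TTL
    for worker in Worker.objects.filter(last_heartbeat__lt=cutoff):  # field name assumed
        # handle_worker_offline() cleans up the worker's outstanding work, which is
        # how the tasks observed above end up transitioned to cancelled.
        handle_worker_offline(worker.name)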

@bmbouter (Member) left a comment

Thank you!

@dralley merged commit ac83f33 into pulp:master on Jan 22, 2021
@dralley deleted the tasking-deadlock branch on January 22, 2021 21:27