
Prevent tasking system deadlocks when Redis is restarted/data lost #1058

Merged: 3 commits from dralley:tasking-deadlock into pulp:master on Jan 22, 2021

Conversation

@dralley (Contributor) commented Dec 14, 2020

No description provided.

@dralley force-pushed the tasking-deadlock branch 2 times, most recently from 695552a to 68a16f6 (December 14, 2020 21:47)
@pulpbot (Member) commented Dec 14, 2020

Attached issue: https://pulp.plan.io/issues/7912

@mdellweg (Member) left a comment

Is there a 1-to-1 relationship of workers and queues? In that case it should be possible to merge the cleanup into one routine.

As you do not enqueue the resource job anymore, you should have a look at this line too:

resource_job = Job(id=str(task_status._resource_job_id), connection=redis_conn)

pulpcore/tasking/worker_watcher.py (outdated review comment, resolved)
@dralley (Contributor, Author) commented Dec 15, 2020

This resource job is actually the "resource manager job" whereas the one I got rid of was the "release resources job". So unfortunately we still need it since they do different things.

@dralley force-pushed the tasking-deadlock branch 2 times, most recently from ea19612 to e1d076f (December 15, 2020 14:56)
def handle_worker_offline(worker_name):
"""
This is a generic function for handling workers going offline.
def check_dropped_queues():
A Member commented:

This name could be clearer, I think. Working the idea that it also cancels into the name would do it for me.

A Member replied:

Maybe check_and_cancel_missing_tasks?

_delete_worker() task is called to handle any work cleanup associated with a worker going
offline. Logging at the info level is also done.
In some situations such as a restart of Redis, Jobs can be dropped from the Redis
queues and "forgotten". Therefore the Task will never be marked completed in Pulp.
A Member commented:

Another sentence here would be good, I think. It could add that this is problematic because tasks that never complete never release their resources, which causes the tasking system to deadlock.

except NoSuchJobError:
cancel(task.pk)

# Also go through all of the tasks that were still queued up on the resource manager
A Member commented:

Niiiiice 👍
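
For readers following the thread, here is a minimal sketch of the recovery pattern under discussion, assembled from the hunks above. The Pulp names (Task, Worker.objects.online_workers(), TASK_INCOMPLETE_STATES, cancel) are taken from those hunks, the import paths and the job-id mapping are assumptions, and rq's Job.fetch() raising NoSuchJobError for a missing job is standard RQ behaviour; treat this as an illustration rather than the exact merged code.

from rq.job import Job
from rq.exceptions import NoSuchJobError

from pulpcore.app.models import Task, Worker            # model names as used in the hunks
from pulpcore.constants import TASK_INCOMPLETE_STATES   # import path assumed
from pulpcore.tasking.util import cancel                 # import path assumed


def check_and_cancel_missing_tasks(redis_conn):
    """Cancel Pulp Tasks whose RQ Jobs were dropped from Redis (e.g. after a restart)."""
    assigned_and_unfinished_tasks = Task.objects.filter(
        state__in=TASK_INCOMPLETE_STATES,
        worker__in=Worker.objects.online_workers(),
    )
    for task in assigned_and_unfinished_tasks:
        try:
            # Assumes the RQ job id mirrors the Task pk, as cancel(task.pk) above suggests.
            Job.fetch(str(task.pk), connection=redis_conn)
        except NoSuchJobError:
            # The Job is gone, so the Task would never finish or release its
            # resources; cancel it so the tasking system cannot deadlock.
            cancel(task.pk)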

@@ -0,0 +1 @@
Provide a mechanism to automatically resolve issues and prevent deadlocks when Redis experiences data loss (such as a restart).
A Member commented:

nitpick: add a newline at the end?

@bmbouter (Member) commented:

I ran some hand tests where I simulated an OOM kill against a long-running task, and it was marked as failed nicely.

@bmbouter (Member) left a comment

I think this PR is excellent. I am only requesting very small docstring, function name, and newline changes.

Thank you for making this! It's much simpler and better! 🥇

@dkliban (Member) left a comment

This is a great improvement!

redis_conn = connection.get_redis_connection()

assigned_and_unfinished_tasks = Task.objects.filter(
state__in=TASK_INCOMPLETE_STATES, worker__in=Worker.objects.online_workers()
@dralley (Contributor, Author) commented:

@bmbouter Should this be worker__isnull=False? In theory I guess you could unplug a machine and after a while the worker is no longer "online", but its task would still be assigned.
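
For illustration, the two queryset variants being weighed here differ only in the worker filter; model and constant names are the ones from the hunk above, with assumed import paths:

from pulpcore.app.models import Task, Worker            # as in the hunk above
from pulpcore.constants import TASK_INCOMPLETE_STATES   # import path assumed

# Variant in the PR: only tasks assigned to workers currently considered online.
tasks_on_online_workers = Task.objects.filter(
    state__in=TASK_INCOMPLETE_STATES,
    worker__in=Worker.objects.online_workers(),
)

# Suggested alternative: any incomplete task that has a worker at all, online or not.
tasks_with_any_worker = Task.objects.filter(
    state__in=TASK_INCOMPLETE_STATES,
    worker__isnull=False,
)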

@dralley (Contributor, Author) added:

Although I suppose the worker watcher should handle that

A Member replied:

Yeah, I actually tested this: I pkilled all workers, including parents and their workhorse children. They remained in the running state until the worker watcher timeout occurred, and then they were transitioned to cancelled by the resource manager running its heartbeat check.
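
For context, a hypothetical sketch of the heartbeat-timeout path described above, not pulpcore's actual worker watcher: the last_heartbeat field, the timeout value, and the import path for handle_worker_offline() (the function shown in the hunk earlier) are all assumptions.

from datetime import timedelta

from django.utils import timezone

from pulpcore.app.models import Worker                              # model name as used in this PR
from pulpcore.tasking.worker_watcher import handle_worker_offline   # import path assumed

WORKER_TTL = timedelta(seconds=30)  # hypothetical timeout value


def clean_up_missing_workers():
    """Treat workers whose heartbeat is older than the timeout as offline and clean up."""
    cutoff = timezone.now() - WORKER_TTL
    for worker in Worker.objects.filter(last_heartbeat__lt=cutoff):  # field name assumed
        # handle_worker_offline() cleans up the worker's outstanding work, which is
        # how the tasks observed above end up transitioned to cancelled.
        handle_worker_offline(worker.name)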

@bmbouter (Member) left a comment

Thank you!

@dralley merged commit ac83f33 into pulp:master on Jan 22, 2021
@dralley deleted the tasking-deadlock branch on January 22, 2021 21:27