Summary
When an immediate task exceeds the 5-second IMMEDIATE_TIMEOUT, the log message is
correctly produced:
pulpcore.tasking.tasks:INFO: Immediate task <uuid> timed out after 5 seconds.
However, the task can remain stuck in running state rather than transitioning to
failed. This affects both the PostgreSQL and Redis (WORKER_TYPE=redis) worker paths,
since both share the same _execute_task/_aexecute_task code path.
Root cause
Django's sync_to_async (with thread_sensitive=True) serializes all ORM operations
through the main thread. When asyncio.wait_for cancels the inner coroutine on timeout,
only the awaiting side is cancelled: the database operation that the coroutine already
handed to that thread may still be running, blocking the main thread queue. The
subsequent set_failed call (which needs the same main thread) is therefore delayed
until that operation finishes. During this window the task appears stuck in running.
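The delay can be reproduced outside pulpcore with a minimal sketch. It assumes that a
single-worker ThreadPoolExecutor is a fair stand-in for the thread that
thread_sensitive=True serializes onto; slow_db_call, mark_failed and demo are
hypothetical names for illustration, not pulpcore or asgiref functions.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def slow_db_call(seconds):
    # Stand-in for a blocking ORM operation that cannot be interrupted.
    time.sleep(seconds)
    return "done"


def mark_failed():
    # Stand-in for set_failed; it needs the same single thread.
    return "failed"


async def demo(timeout=0.1, db_seconds=0.5):
    loop = asyncio.get_running_loop()
    timed_out_at = None
    # One worker thread: every job queues behind the previous one.
    with ThreadPoolExecutor(max_workers=1) as single_thread:
        start = time.monotonic()
        try:
            await asyncio.wait_for(
                loop.run_in_executor(single_thread, slow_db_call, db_seconds),
                timeout=timeout,
            )
        except asyncio.TimeoutError:
            # The await was cancelled, but the thread is still sleeping.
            timed_out_at = time.monotonic() - start
        # This job cannot start until slow_db_call has returned.
        state = await loop.run_in_executor(single_thread, mark_failed)
        failed_at = time.monotonic() - start
    return timed_out_at, failed_at, state


if __name__ == "__main__":
    timed_out_at, failed_at, state = asyncio.run(demo())
    print(f"timeout seen after {timed_out_at:.2f}s, {state} written after {failed_at:.2f}s")
```

In this sketch the timeout is reported promptly, but the follow-up job only completes
once the blocked thread is free again, which mirrors the window described above.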
If a cancel request arrives during this delay, set_canceling() and the delayed
set_failed race to UPDATE the task row:
- If set_failed wins: task transitions to failed with the timeout message.
- If set_canceling wins: task transitions to canceling; set_failed then finds
  0 matching rows (state is no longer running) and raises RuntimeError, which
  propagates up uncaught, leaving the task stuck in canceling until a worker
  eventually cleans it up as canceled.
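The two outcomes can be illustrated with a small sketch of a state-guarded UPDATE. This
is not pulpcore's code: it uses an in-memory SQLite table, and the schema, the function
bodies and the race helper are assumptions made only to show how the order of the two
updates decides the result.

```python
import sqlite3


def make_task_db():
    # Hypothetical one-row task table, starting in the running state.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, state TEXT, error TEXT)")
    db.execute("INSERT INTO task (id, state) VALUES (1, 'running')")
    return db


def set_failed(db, task_id, error):
    # Guarded update: only a task that is still running may become failed.
    cur = db.execute(
        "UPDATE task SET state = 'failed', error = ? WHERE id = ? AND state = 'running'",
        (error, task_id),
    )
    if cur.rowcount != 1:
        raise RuntimeError("task is no longer in running state")


def set_canceling(db, task_id):
    # Assumed behavior: move a running task to canceling, otherwise do nothing.
    db.execute(
        "UPDATE task SET state = 'canceling' WHERE id = ? AND state = 'running'",
        (task_id,),
    )


def race(order):
    # Apply the two updates in the given order and report the final state.
    db = make_task_db()
    outcome = "ok"
    for step in order:
        try:
            if step == "set_failed":
                set_failed(db, 1, "timed out after 5 seconds")
            else:
                set_canceling(db, 1)
        except RuntimeError:
            outcome = "RuntimeError"
    state = db.execute("SELECT state FROM task WHERE id = 1").fetchone()[0]
    db.close()
    return state, outcome


if __name__ == "__main__":
    print(race(["set_failed", "set_canceling"]))
    print(race(["set_canceling", "set_failed"]))
```

When set_canceling runs first, the guarded update in set_failed matches no rows, so the
row stays in canceling and the error surfaces as an exception instead of a failed state.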
Expected behavior
The task transitions to failed immediately after the 5-second timeout with an error
describing the timeout.
Actual behavior
The task remains in running state. It may eventually transition to failed (with the
timeout error) after a delay, or to canceled if a cancel request races ahead of the
delayed set_failed.
Environment
- pulpcore version: 3.108.0
- WORKER_TYPE: redis