Fix issues where tasks records remain in the 'waiting' state forever. #712
Still just a partial fix.
Attached issue: https://pulp.plan.io/issues/6449
I'm wondering if the fix would involve some sort of change in this area where the queue is created.
I'm not sure. I think it might be a little racy. If I step through this function [0] with a debugger, it works...
[0] https://github.com/pulp/pulpcore/blob/master/pulpcore/tasking/tasks.py#L51
What about reproducing it like this?
I might know what it is. The task is spawned onto a queue named after the worker it is assigned to. This worker is determined via [...]. If it happens quickly enough, it could try to assign the task to a dead worker and dump the task into that dead worker's queue. That should be easy enough to verify... This would have been introduced when we started using a random(ish?) number component in the worker names.
Confirmed that ^^ is the problem. I halted execution just after the worker was assigned, waited about 15 seconds, and refreshed the record. It went from [...]. So there is a not-insignificant window in which tasks are lost by being assigned to dead workers, and their reserved resources, which will never be released, block subsequent tasks from progressing as well.
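To make the window described above concrete, here is a minimal, self-contained sketch of the race. It is not pulpcore code: `Worker`, `Task`, `HEARTBEAT_TTL`, `pick_worker`, and `dispatch` are hypothetical names, and the 30-second timeout is an assumed value chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

HEARTBEAT_TTL = 30.0  # assumed timeout in seconds, not pulpcore's real setting


@dataclass
class Worker:
    name: str
    last_heartbeat: float  # timestamp of the last heartbeat, in seconds


@dataclass
class Task:
    state: str = "waiting"
    worker: Optional[str] = None


def is_online(worker: Worker, now: float) -> bool:
    """A worker counts as online while its heartbeat is fresher than the TTL."""
    return now - worker.last_heartbeat < HEARTBEAT_TTL


def pick_worker(workers: List[Worker], now: float) -> Optional[Worker]:
    """Return the first worker that looks online at time `now`."""
    return next((w for w in workers if is_online(w, now)), None)


def dispatch(task: Task, worker: Worker, queues: Dict[str, List[Task]]) -> None:
    """Put the task on the queue named after the chosen worker."""
    task.worker = worker.name
    queues[worker.name].append(task)


def simulate_race():
    worker = Worker(name="worker-1234", last_heartbeat=0.0)
    queues: Dict[str, List[Task]] = defaultdict(list)
    task = Task()

    # t=20: the worker still looks online, so it is selected.
    chosen = pick_worker([worker], now=20.0)

    # t=35: by the time the task is enqueued, the heartbeat has expired.
    # Nothing consumes this queue, so the task stays in 'waiting'.
    dispatch(task, chosen, queues)
    return task, worker, queues
```

After `simulate_race()`, the task is still in the `waiting` state and sits on a queue whose worker is no longer online at t=35, which matches the behavior observed with the debugger pause.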
Pulp's worker design is currently supposed to handle this. It's OK for the resource manager to route work to dead workers. It's not ideal, but that's a separate issue. So to me the problem is that the cleanup mechanism isn't working. That code lives here: https://github.com/pulp/pulpcore/blob/master/pulpcore/tasking/services/worker_watcher.py#L53-L170
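For reference, a rough sketch of what such a cleanup pass could look like. This is an illustration under assumptions, not the logic in `worker_watcher.py`: the function name `cleanup_dead_workers`, the `TTL` value, the `"canceled"` state string, and the plain dictionaries standing in for database records are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

TTL = 30.0  # assumed heartbeat timeout in seconds


@dataclass
class Task:
    worker: str
    state: str = "waiting"


def cleanup_dead_workers(
    heartbeats: Dict[str, float],
    tasks: List[Task],
    reservations: Dict[str, Set[str]],
    now: float,
) -> List[str]:
    """Cancel unfinished tasks of expired workers and release their resources.

    `heartbeats` maps worker name to last heartbeat time; `reservations`
    maps worker name to the set of resources it holds (both hypothetical).
    Returns the names of the workers that were cleaned up.
    """
    dead = [name for name, seen in heartbeats.items() if now - seen >= TTL]
    for name in dead:
        for task in tasks:
            if task.worker == name and task.state in ("waiting", "running"):
                task.state = "canceled"
        # Release reserved resources so later tasks are not blocked forever.
        reservations.pop(name, None)
        del heartbeats[name]
    return dead
```

With a pass like this running periodically, a task routed to a dead worker would eventually leave the `waiting` state and its reservations would be freed, which is the behavior the design is meant to guarantee.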
1c77a62 to b9fb363
@bmbouter A little more detail than I mentioned the other day. I've seen 3 different scenarios:
b9fb363 to 738f88b
No description provided.