
Fix issues where task records remain in the 'waiting' state forever. #712

Closed · wants to merge 1 commit (branch: fix-waiting-tasks)

Conversation

dralley (Contributor) commented May 20, 2020

No description provided.

dralley (Contributor, author) commented May 20, 2020

Still just a partial fix.

pulpbot (Member) commented May 20, 2020

Attached issue: https://pulp.plan.io/issues/6449

bmbouter (Member)

I'm wondering if the fix would involve some sort of change in this area where the queue is created.
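
For context, pulpcore's tasking at the time sat on top of RQ and Redis, and a task is routed by enqueueing it onto a queue named after the selected worker. A minimal sketch of that pattern, with illustrative names rather than the actual pulpcore code:

```python
# Illustrative only -- not the actual pulpcore code. It shows the general
# RQ pattern of routing a job by enqueueing it onto a queue that carries
# the chosen worker's name.
from redis import Redis
from rq import Queue


def dispatch_to_worker(worker_name, func, *args, **kwargs):
    """Enqueue `func` onto the RQ queue that only `worker_name` listens on."""
    queue = Queue(name=worker_name, connection=Redis())
    # If the worker behind `worker_name` is already gone, nothing will ever
    # consume this queue and the job sits in Redis indefinitely.
    return queue.enqueue(func, *args, **kwargs)
```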

dralley (Contributor, author) commented May 20, 2020

I'm not sure. I think it might be a little racy. If I step through this function [0] with a debugger, it works...

[0] https://github.com/pulp/pulpcore/blob/master/pulpcore/tasking/tasks.py#L51

bmbouter (Member)

What about reproducing it like this?

  1. Add a long sleep statement to a file sync, like 300 seconds.
  2. Create N+1 repositories to sync into, where N is the number of Pulp workers. With the default of 2 workers, create 3 repos.
  3. Initiate a sync of the N+1 repos. I expect N to start and 1 to be 'waiting'.
  4. sudo pkill -9 -f redis; sudo systemctl start redis
  5. I think the 1 'waiting' task will be stuck.
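
Steps 2 and 3 can be scripted roughly as follows. This is a sketch only, assuming a default pulp_file dev install at localhost:24817 with admin/password credentials and the pulp_file v3 endpoint paths; adjust the host, credentials, and remote URL for your setup.

```python
# Sketch of steps 2-3: create N+1 file repositories and start a sync in each.
# Base URL, credentials, endpoint paths, and the remote URL are assumptions
# about a default pulp_file dev install; adjust for your deployment.
import requests

BASE = "http://localhost:24817"
AUTH = ("admin", "password")
N_WORKERS = 2  # assumption: the default number of Pulp workers

remote = requests.post(
    f"{BASE}/pulp/api/v3/remotes/file/file/",
    json={"name": "repro-remote", "url": "https://example.com/PULP_MANIFEST"},
    auth=AUTH,
).json()

for i in range(N_WORKERS + 1):
    repo = requests.post(
        f"{BASE}/pulp/api/v3/repositories/file/file/",
        json={"name": f"repro-repo-{i}"},
        auth=AUTH,
    ).json()
    sync = requests.post(
        f"{BASE}{repo['pulp_href']}sync/",
        json={"remote": remote["pulp_href"]},
        auth=AUTH,
    ).json()
    # With the 300 s sleep from step 1, N of these tasks run and one waits.
    print(f"repo {i}: spawned task {sync['task']}")
```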

dralley (Contributor, author) commented May 20, 2020

I might know what it is. The task is spawned onto a queue named after the worker it is assigned to. This worker is determined via _acquire_worker with a database query. When Redis goes down, all the workers get killed off (and recreated with different names), but the records of the old workers remain in the database for about 30 seconds until they get cleaned up in the background.

If this happens quickly enough, the resource manager could assign the task to a dead worker and dump it into that worker's queue. That should be easy enough to verify...

This would have been introduced when we started using a random(ish?) number component in the worker names.
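
A minimal, self-contained sketch of that suspected race; the class and function below are simplified stand-ins for the Worker model and the _acquire_worker query, not pulpcore code.

```python
# Simplified model of the suspected race. The worker table stands in for the
# database; "online" is judged only from the record's heartbeat, not from
# whether the process behind it is still alive.
import random
import time
from dataclasses import dataclass, field

WORKER_TTL = 30.0  # seconds a dead worker's record lingers before cleanup


@dataclass
class WorkerRecord:
    name: str
    last_heartbeat: float = field(default_factory=time.monotonic)

    @property
    def online(self) -> bool:
        return time.monotonic() - self.last_heartbeat < WORKER_TTL


worker_table: list[WorkerRecord] = []  # stand-in for the Worker table


def _acquire_worker() -> WorkerRecord:
    """Pick any worker whose record still looks online (simplified)."""
    return random.choice([w for w in worker_table if w.online])


def dispatch(task_name: str) -> None:
    worker = _acquire_worker()  # may return the stale row of a dead worker
    # The task is now bound to a queue named after a worker that no longer
    # exists; the restarted workers came back under different names, so
    # nothing ever consumes the queue and the task stays 'waiting'.
    print(f"enqueueing {task_name} on queue {worker.name!r}")
```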

dralley (Contributor, author) commented May 20, 2020

Confirmed that the above is the problem. I halted execution just after the worker was assigned, waited about 15 seconds, and refreshed the record. It went from online=True && missing=False to the reverse.

So we have a non-trivial window during which tasks are lost by being assigned to dead workers -- and their reserved resources, which will never be released, block subsequent tasks from progressing as well.
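
That check, roughly as it can be run from a Django shell (e.g. pulpcore-manager shell); the online/missing values come from pulpcore's Worker model, but the property names and the placeholder worker name should be verified against the version in use.

```python
# Sketch of the verification: inspect the assigned worker's record before and
# after the ~15 second window. "some-worker-name" is a placeholder for the
# worker the halted task was assigned to.
import time

from pulpcore.app.models import Worker

worker = Worker.objects.get(name="some-worker-name")
print(worker.online, worker.missing)   # True, False right after assignment

time.sleep(15)
worker.refresh_from_db()
print(worker.online, worker.missing)   # now False, True -- the task is orphaned
```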

bmbouter (Member)

Pulp's current worker design is supposed to handle this. It's OK for the resource manager to route work to dead workers; it's not ideal, but that's a separate issue. So to me the problem is that the cleanup mechanism isn't working.

That code lives here: https://github.com/pulp/pulpcore/blob/master/pulpcore/tasking/services/worker_watcher.py#L53-L170
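
Paraphrasing what that cleanup path is expected to do when a worker record goes missing, in a simplified, self-contained form with approximate names rather than the real worker_watcher.py code:

```python
# Simplified sketch of the expected cleanup for a worker whose record has gone
# missing: cancel its unfinished tasks and release its resource reservations.
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    name: str
    state: str = "waiting"          # waiting -> running -> completed/canceled
    reserved_resources: set = field(default_factory=set)


@dataclass
class DeadWorker:
    name: str
    tasks: list = field(default_factory=list)


def cleanup_worker(worker: DeadWorker) -> None:
    """Cancel the dead worker's unfinished tasks and free their reservations."""
    for task in worker.tasks:
        if task.state in ("waiting", "running"):
            task.state = "canceled"
            task.reserved_resources.clear()   # unblock subsequent tasks
    # The race in this PR: a task whose worker assignment is committed *after*
    # this cleanup has already run is never visited by the loop above, so it
    # stays 'waiting' and its reservations are never released.
```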

dralley (Contributor, author) commented May 31, 2020

@bmbouter A little more detail than I mentioned the other day:

I've seen 3 different scenarios:

  • The enqueued task gets assigned to a live worker when the resource manager comes back up, and it completes successfully.
  • The enqueued task is assigned to a dead worker, and the association is committed to the database before the worker is cleaned up; the relationship is followed and the task is cancelled when the worker is marked offline.
    • I've only seen this happen once; I think it's pretty rare for it to work as expected.
  • The enqueued task is assigned to a dead worker, but the association is committed to the database after the worker has already been cleaned up, and the task is left "waiting".
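
One way to close the window in the third scenario would be to commit the association first and then re-check the worker before enqueueing; the sketch below illustrates the idea only and is not necessarily what this PR implemented.

```python
# Sketch of a possible guard against the third scenario: commit the
# task->worker association, then re-check the worker before enqueueing,
# retrying with another worker if it has already gone missing.
def dispatch_with_recheck(task, acquire_worker, commit_assignment, is_alive,
                          enqueue, max_attempts=3):
    """`acquire_worker`, `commit_assignment`, `is_alive`, and `enqueue` are
    assumed hooks into the tasking system; only the ordering matters here."""
    for _ in range(max_attempts):
        worker = acquire_worker()            # DB query; may return a stale row
        commit_assignment(task, worker)      # persist task -> worker first
        if is_alive(worker):
            # Safe: if the worker dies from here on, its cleanup will see the
            # committed association and cancel the task instead of losing it.
            enqueue(worker, task)
            return worker
        # Cleanup already ran for this worker before our commit, so nothing
        # will ever cancel or run the task; pick another worker instead.
    raise RuntimeError("no live worker available")
```

Committing before the liveness check matters: once the association is in the database, any later cleanup of that worker will see the task and cancel it rather than orphan it.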

dralley closed this Sep 4, 2020
dralley deleted the fix-waiting-tasks branch September 4, 2020 18:48