
Detect the "zombie" jobs, and kill them 馃 馃敨 #741

Closed
severo opened this issue Jan 31, 2023 · 20 comments · Fixed by #826
Comments

severo (Collaborator) commented Jan 31, 2023

Sometimes the pods crash:

prod-datasets-server-worker-first-rows-8579994756-vgpkg           0/1     OutOfmemory              0               92m
prod-datasets-server-worker-first-rows-8579994756-vmvk7           0/1     OutOfmemory              0               92m
prod-datasets-server-worker-first-rows-8579994756-vsmmt           1/1     Running                  0               4h27m
prod-datasets-server-worker-first-rows-8579994756-vxtwn           0/1     OutOfmemory              0               3h12m
prod-datasets-server-worker-first-rows-8579994756-vzs6j           0/1     OutOfmemory              0               3h12m
prod-datasets-server-worker-first-rows-8579994756-w2k55           1/1     Running                  2 (3h25m ago)   4h16m
prod-datasets-server-worker-first-rows-8579994756-w5c6m           1/1     Running                  0               4h27m
prod-datasets-server-worker-first-rows-8579994756-w5ks6           1/1     Running                  1 (4h21m ago)   4h27m
prod-datasets-server-worker-first-rows-8579994756-w7ds5           1/1     Running                  0               4h27m
prod-datasets-server-worker-first-rows-8579994756-wbqlq           1/1     Running                  0               4h16m

The job that was running then stays in the "started" status forever; we can call it a zombie.

The issue is that, because of the rules implemented in the queue logic, this can prevent other jobs for the same dataset, or for the same user, from being processed. It even prevents re-running the same job.

Ideally, we should detect that the job has failed, change its status to "error" and put an error response in the cache database.

To implement this, an option proposed by @XciD is:

  • to have a parallel thread (heartbeat) that updates the job in the database every xxx seconds (a rough sketch follows)
  • to have a "garbage collector" loop that looks for zombies and finishes them as described above
severo changed the title from "Detect the "zombie" jobs, and re-run them" to "Detect the "zombie" jobs, and kill them" on Jan 31, 2023
AndreaFrancis (Contributor) commented Feb 7, 2023

Why not handle that validation before the worker loop starts? I mean, before https://github.com/huggingface/datasets-server/blob/main/workers/datasets_based/src/datasets_based/main.py#L15, look for all "started" jobs, cancel them, and create new ones in the queue (like in force-refresh).
With this approach we could avoid the extra parallel validation. I assume that while the service is alive and a job is in "started" status, it is because something is being computed and the job is indeed "alive"; but if the service starts and there are already "started" jobs, they must be dead.

severo (Collaborator, Author) commented Feb 7, 2023

Yes, you're right, but it's only part of the problem. While the service is alive, a lot of pods crash. We can see it in the list of pods: ContainerStatusUnknown, Error, OutOfmemory, OutOfcpu, OOMKilled.
But the app is still waiting for them to finish, and in particular, as the number of concurrent jobs for the same user is limited, no other jobs are started once the limit has been reached.

severo (Collaborator, Author) commented Feb 7, 2023

Also, when a worker starts, others may already be running. We currently have no way to know that.

AndreaFrancis (Contributor) commented:

Sorry, I totally forgot that there are multiple pods for the same service.

AndreaFrancis (Contributor) commented:

I think we can start with the "garbage collector" step first, without implementing the heartbeat yet (a rough sketch follows at the end of this comment):

Garbage collector:

  • A loop would sleep for T1 seconds between passes
  • Each pass would:
    • Read all jobs in the db with "started" status
    • Keep only the jobs whose elapsed time (now - "started_at") > T2
    • Cancel the job
    • Insert a new job in the queue with "waiting" status if there isn't already one

Considerations:

  • Need to define whether this collector will run in only one pod or in many - I think one is enough at the beginning, since we don't have a huge number of "started" jobs (maybe when doing a backfill)
  • Need to define whether we want one collector per worker
  • Need to define T1 - it could be a small value (5 min); at first it is not critical. Once the "heartbeat" is implemented, this value should be consistent with a T3 time interval (the interval at which the heartbeat updates the db).
  • Need to define T2 - at the beginning we should have a value per worker; we could base it on the average duration of succeeded/errored jobs, and it should be an env configuration param (e.g. 3 hrs for /first-rows and 10 min for /config-names). Once the heartbeat is implemented, we should lower this value to one consistent with T3 (the heartbeat time interval).

Cons of initially implementing only the collector step:

  • Less accuracy, since T2 is based on an assumption and would have a big value (for /first-rows, for example)

After the heartbeat step is implemented:

Pros:

  • More accuracy: since T3 is consistent with T1, giving it a small value will let us identify the "zombies" more quickly

Cons:

  • Many modifications to the db per job and worker - would it increase db usage?
  • Would having an extra thread running during each job increase the memory usage?
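
A rough sketch of the collector pass described above, under the same assumptions as the heartbeat sketch (pymongo-style collection, placeholder field names, arbitrary T1/T2 values):

```python
import time
from datetime import datetime, timedelta, timezone

T1_SECONDS = 5 * 60        # sleep between collector passes (assumption)
T2_SECONDS = 3 * 60 * 60   # max age of a "started" job before it is considered a zombie (assumption)


def collect_zombies(jobs_collection) -> None:
    """One pass: cancel "started" jobs older than T2 and re-queue them if needed."""
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=T2_SECONDS)
    for job in jobs_collection.find({"status": "started", "started_at": {"$lt": cutoff}}):
        jobs_collection.update_one({"_id": job["_id"]}, {"$set": {"status": "cancelled"}})
        # insert a new "waiting" job unless an equivalent one is already queued
        waiting_job = {"type": job["type"], "dataset": job["dataset"], "status": "waiting"}
        if jobs_collection.count_documents(waiting_job) == 0:
            jobs_collection.insert_one(waiting_job)


def collector_loop(jobs_collection) -> None:
    while True:
        collect_zombies(jobs_collection)
        time.sleep(T1_SECONDS)
```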

severo (Collaborator, Author) commented Feb 10, 2023

It seems like a good plan to me. WDYT @huggingface/datasets-server?

lhoestq (Member) commented Feb 10, 2023

> Insert a new job in the queue with "waiting" status if there isn't already one

We'll just end up with a queue made of unfeasible jobs, no? We'd probably need to allow a max number of retries (2 or 3).

> Cancel the job

Some operations like /parquet-and-dataset-info write data to Hub repos, so we need to clean it up at some point if the job gets canceled (for this one we just need to delete some files).

Also we'd need a way to communicate to the worker that it has to cancel the job. Right now we don't know which worker is doing what. Instead, maybe the worker that is executing the job can cancel itself if it takes too long? That would mean that the worker loop process should be controlled by another process that is responsible for checking if the job needs to be canceled.

severo (Collaborator, Author) commented Feb 10, 2023

> We'd probably need to allow a max number of retries (2 or 3)

Yes, you're right: we should just insert an error in the cache, with an error code "pod crashed" or something like that. No need to retry, I think.

> Some operations like /parquet-and-dataset-info write data to Hub repos, so we need to clean it up at some point if the job gets canceled (for this one we just need to delete some files).

You're right, but it can be hard, because we're not sure whether the files have been updated by the worker or come from a previous job.

> the worker loop process should be controlled by another process that is responsible for checking if the job needs to be canceled.

Hmmm: the heartbeat thread running alongside the worker could be in charge of stopping the worker when the duration exceeds the limit (timeout).
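
A hypothetical sketch of that idea (not the actual worker code): the side thread both sends heartbeats and enforces a per-job timeout by terminating its own process.

```python
import os
import signal
import threading
import time

MAX_JOB_SECONDS = 3 * 60 * 60  # per-job time limit; value is an assumption


def start_watchdog(send_heartbeat, interval_seconds: int = 60) -> None:
    """Heartbeat + timeout in a single daemon thread running alongside the job."""
    started_at = time.monotonic()

    def run() -> None:
        while True:
            send_heartbeat()  # e.g. update last_heartbeat in the db
            if time.monotonic() - started_at > MAX_JOB_SECONDS:
                # Ask the whole worker process to stop; the queue logic can then
                # mark the job as errored instead of leaving it "started" forever.
                os.kill(os.getpid(), signal.SIGTERM)
                return
            time.sleep(interval_seconds)

    threading.Thread(target=run, daemon=True, name="watchdog").start()
```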

lhoestq (Member) commented Feb 10, 2023

> You're right, but it can be hard, because we're not sure whether the files have been updated by the worker or come from a previous job.

We just need to revert commits until we reach a valid one? Valid commits can be stored in the db when a job succeeds.

AndreaFrancis (Contributor) commented Feb 10, 2023

> Yes, you're right: we should just insert an error in the cache, with an error code "pod crashed" or something like that. No need to retry, I think.

What if a job failed because of a sudden pod crash but it is a small one? I agree with supporting a limited number of retry opportunities; that will also let us identify the conflicting datasets.

> Some operations like /parquet-and-dataset-info write data to Hub repos, so we need to clean it up at some point if the job gets canceled (for this one we just need to delete some files).

We need to implement a "rollback" logic for the parquet-specific case first, since we currently don't have that logic in the cancel-jobs endpoint (https://github.com/huggingface/datasets-server/blob/main/services/admin/src/admin/routes/cancel_jobs.py#L33) or in https://github.com/huggingface/datasets-server/blob/main/libs/libcommon/src/libcommon/queue.py#L433 - maybe we are committing partial parquet files when doing a prod deploy.

lhoestq (Member) commented Feb 10, 2023

Opened #804 regarding rollback for /parquet-and-dataset-info

lhoestq (Member) commented Feb 12, 2023

Btw, I recall that Celery (a task queue) already implements time limits with clean-ups and automatic retries with delays. Is someone familiar with it? It looks interesting.

severo (Collaborator, Author) commented Feb 13, 2023

> Valid commits can be stored in the db when a job succeeds

The commit is stored in the cache, along with the worker version.

severo (Collaborator, Author) commented Feb 13, 2023

> What if a job failed because of a sudden pod crash but it is a small one? I agree with supporting a limited number of retry opportunities; that will also let us identify the conflicting datasets.

Yes, but I think it's better to manage it in another PR, to focus on one feature at a time.

severo (Collaborator, Author) commented Feb 13, 2023

> committing partial parquet files

We use only one commit to send all the parquet files, and I don't know if it's possible for this commit to be "partial"; I don't think so (we use https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.create_commit).
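
For reference, the single-commit pattern with `HfApi.create_commit` looks roughly like this (repo id and file paths are placeholders); since all operations are applied in one commit, the Hub either gets all the parquet files or none of them:

```python
from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()
operations = [
    CommitOperationAdd(path_in_repo="default/train/0000.parquet", path_or_fileobj="/tmp/0000.parquet"),
    CommitOperationAdd(path_in_repo="default/train/0001.parquet", path_or_fileobj="/tmp/0001.parquet"),
]
api.create_commit(
    repo_id="user/dataset",   # placeholder
    repo_type="dataset",
    operations=operations,    # all files land in a single commit
    commit_message="Add parquet files",
)
```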

severo (Collaborator, Author) commented Feb 13, 2023

> Btw, I recall that Celery (a task queue) already implements time limits with clean-ups and automatic retries with delays. Is someone familiar with it? It looks interesting.

I looked at it, along with Dagster and Airflow (even if the aim is not the same). Maybe we could give it a try, but I'm not sure how we could manage the priorities in the queue (various levels of priority: NORMAL/LOW, limiting the number of concurrent jobs for the same user, prioritizing the jobs from users that have no started jobs yet, etc.).
Also: would it be able to work both with Kubernetes and locally without Kubernetes?

lhoestq (Member) commented Feb 13, 2023

There's a priority level you can set for each task in Celery: celery/celery#2635 (comment)

And we can set it based on the user's history, for example.

And I've seen examples of Kubernetes and docker-compose deployments on GitHub - you just need to define the Celery workers, the RabbitMQ queue, etc.
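
For illustration, a hedged sketch of what per-task priority could look like in Celery with a RabbitMQ broker (task name, broker URL, and values are placeholders, not a proposal for the actual codebase):

```python
from celery import Celery

app = Celery("datasets_server_jobs", broker="amqp://localhost")  # placeholder broker URL
app.conf.task_queue_max_priority = 10  # enable RabbitMQ priority queues
app.conf.task_default_priority = 5


@app.task(bind=True, time_limit=3600, max_retries=3)  # hard time limit and retry cap per task
def compute_first_rows(self, dataset: str) -> None:
    ...  # the actual job logic


# e.g. boost a user with no started jobs yet, lower the priority for a backfill
compute_first_rows.apply_async(args=["user/dataset"], priority=9)
```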

severo (Collaborator, Author) commented Feb 13, 2023

OK, it might be a good replacement for part of the code. Should we explore this as part of this issue, or fix the issue first and try to migrate afterward?

lhoestq (Member) commented Feb 13, 2023

I think we're still learning our needs, so for now I'd continue with the current implementation to keep full flexibility. Let's think about it again once we have random access to rows?

lhoestq self-assigned this Feb 14, 2023
lhoestq (Member) commented Feb 17, 2023

Zombie jobs are now killed every 10 minutes if their last heartbeat is more than 5 minutes old (there's one heartbeat every minute).

Next is to put an error response in the cache database when this happens.
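
In other words, the rule is roughly the following (a sketch, not the merged code from #826; field names are assumptions):

```python
from datetime import datetime, timedelta, timezone

ZOMBIE_PASS_EVERY = timedelta(minutes=10)   # how often the zombie killer runs
MAX_HEARTBEAT_AGE = timedelta(minutes=5)    # heartbeats are sent every minute


def is_zombie(job: dict) -> bool:
    """A "started" job whose heartbeat is too old is considered dead and gets killed."""
    last = job.get("last_heartbeat") or job["started_at"]
    return job["status"] == "started" and datetime.now(timezone.utc) - last > MAX_HEARTBEAT_AGE
```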
