
Detect the "zombie" jobs, and kill them 馃 馃敨 #741

Closed
severo opened this issue Jan 31, 2023 · 20 comments · Fixed by #826
Comments

severo (Collaborator) commented Jan 31, 2023

Sometimes the pods crash:

prod-datasets-server-worker-first-rows-8579994756-vgpkg           0/1     OutOfmemory              0               92m
prod-datasets-server-worker-first-rows-8579994756-vmvk7           0/1     OutOfmemory              0               92m
prod-datasets-server-worker-first-rows-8579994756-vsmmt           1/1     Running                  0               4h27m
prod-datasets-server-worker-first-rows-8579994756-vxtwn           0/1     OutOfmemory              0               3h12m
prod-datasets-server-worker-first-rows-8579994756-vzs6j           0/1     OutOfmemory              0               3h12m
prod-datasets-server-worker-first-rows-8579994756-w2k55           1/1     Running                  2 (3h25m ago)   4h16m
prod-datasets-server-worker-first-rows-8579994756-w5c6m           1/1     Running                  0               4h27m
prod-datasets-server-worker-first-rows-8579994756-w5ks6           1/1     Running                  1 (4h21m ago)   4h27m
prod-datasets-server-worker-first-rows-8579994756-w7ds5           1/1     Running                  0               4h27m
prod-datasets-server-worker-first-rows-8579994756-wbqlq           1/1     Running                  0               4h16m

The job that was running then stays in the "started" status forever; we can call it a zombie.

The issue is that, because of the rules implemented in the queue logic, this can prevent other jobs for the same dataset, or for the same user, from being processed. It even prevents re-running the same job.

Ideally, we should detect that the job has failed, change its status to "error" and put an error response in the cache database.

To implement this, an option proposed by @XciD is:

  • to have a parallel thread (heartbeat) that updates the job in the database every xxx seconds (a rough sketch follows)
  • to have a "garbage collector" loop that looks for zombies and finishes them as described above
severo changed the title from "Detect the "zombie" jobs, and re-run them" to "Detect the "zombie" jobs, and kill them" on Jan 31, 2023
AndreaFrancis (Contributor) commented Feb 7, 2023

Why not handle that validation before the worker loop starts? I mean, before https://github.com/huggingface/datasets-server/blob/main/workers/datasets_based/src/datasets_based/main.py#L15, look for all "started" jobs, cancel them, and create new ones in the queue (like in force-refresh).
With this approach we could avoid the extra parallel validation. I assume that while the service is alive and a job is in "started" status, it is because something is being computed and the job is indeed "alive"; but if the service starts and there are already "started" jobs, they must be dead.

severo (Collaborator, Author) commented Feb 7, 2023

Yes, you're right, but it's only part of the problem. While the service is alive, a lot of pods crash. We can see it in the list of pods: ContainerStatusUnknown, Error, OutOfmemory, OutOfcpu, OOMKilled.
But the app is still waiting for them to finish, and in particular, as the number of concurrent jobs for the same user is limited, no other jobs are started once the limit has been reached.

severo (Collaborator, Author) commented Feb 7, 2023

Also, when a worker starts, others may already be running. We currently have no way to know that.

AndreaFrancis (Contributor) commented:

Sorry, I totally forgot that there are multiple pods for the same service.

AndreaFrancis (Contributor) commented:

I think we can start with the "garbage collector" step first, without implementing the heartbeat yet (a rough sketch follows at the end of this comment):

Garbage collector:

  • A loop would sleep for T1 seconds between passes
  • Each pass would:
    • Read all jobs in the db with "started" status
    • Keep only the jobs whose elapsed time (now - "started_at") > T2
    • Cancel the job
    • Insert a new job in the queue with "waiting" status if there isn't already one

Considerations:

  • Need to define whether this collector will run in only one pod or in many - I think one is enough at the beginning, since we don't have a huge number of "started" jobs (maybe when doing a backfill)
  • Need to define whether we want one collector per worker
  • Need to define T1 - it could be a small value (5 min); at first it is not critical. Once the "heartbeat" is implemented, this value should be consistent with a T3 time interval (the interval at which the heartbeat updates the db).
  • Need to define T2 - at the beginning we should have a value per worker; we could base it on the average duration of succeeded/errored jobs, and it should be an env configuration param (e.g. 3 hrs for /first-rows and 10 min for /config-names). Once the heartbeat is implemented, we should lower this value to one consistent with T3 (the heartbeat time interval).

Cons of initially implementing only the collector step:

  • Less accuracy, since T2 is based on an assumption and would have a big value (for /first-rows, for example)

After the heartbeat step is implemented:

Pros:

  • More accuracy: since T3 is consistent with T1, giving it a small value will let us identify the "zombies" more quickly

Cons:

  • Many modifications to the db per job and worker - would it increase db usage?
  • Would having an extra thread running during each job increase the memory usage?
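
A rough sketch of the collector pass described above, under the same assumptions as the heartbeat sketch (pymongo-style collection, placeholder field names, arbitrary T1/T2 values):

```python
import time
from datetime import datetime, timedelta, timezone

T1_SECONDS = 5 * 60        # sleep between collector passes (assumption)
T2_SECONDS = 3 * 60 * 60   # max age of a "started" job before it is considered a zombie (assumption)


def collect_zombies(jobs_collection) -> None:
    """One pass: cancel "started" jobs older than T2 and re-queue them if needed."""
    cutoff = datetime.now(timezone.utc) - timedelta(seconds=T2_SECONDS)
    for job in jobs_collection.find({"status": "started", "started_at": {"$lt": cutoff}}):
        jobs_collection.update_one({"_id": job["_id"]}, {"$set": {"status": "cancelled"}})
        # insert a new "waiting" job unless an equivalent one is already queued
        waiting_job = {"type": job["type"], "dataset": job["dataset"], "status": "waiting"}
        if jobs_collection.count_documents(waiting_job) == 0:
            jobs_collection.insert_one(waiting_job)


def collector_loop(jobs_collection) -> None:
    while True:
        collect_zombies(jobs_collection)
        time.sleep(T1_SECONDS)
```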

severo (Collaborator, Author) commented Feb 10, 2023

It seems like a good plan to me. WDYT @huggingface/datasets-server?

lhoestq (Member) commented Feb 10, 2023

> Insert a new job in the queue with "waiting" status if there isn't already one

We'll just end up with a queue made of unfeasible jobs, no? We'd probably need to allow a max number of retries (2 or 3).

> Cancel the job

Some operations like /parquet-and-dataset-info write data to Hub repos, so we need to clean it up at some point if the job gets canceled (for this one we just need to delete some files).

Also we'd need a way to communicate to the worker that it has to cancel the job. Right now we don't know which worker is doing what. Instead, maybe the worker that is executing the job can cancel itself if it takes too long? That would mean that the worker loop process should be controlled by another process that is responsible for checking if the job needs to be canceled.

severo (Collaborator, Author) commented Feb 10, 2023

> We'd probably need to allow a max number of retries (2 or 3)

Yes, you're right: we should just insert an error in the cache, with an error code "pod crashed" or something like that. No need to retry, I think.

> Some operations like /parquet-and-dataset-info write data to Hub repos, so we need to clean it up at some point if the job gets canceled (for this one we just need to delete some files).

You're right, but it can be hard, because we're not sure whether the files have been updated by the worker or come from a previous job.

> the worker loop process should be controlled by another process that is responsible for checking if the job needs to be canceled.

Hmmm: the heartbeat thread running alongside the worker could be in charge of stopping the worker when the duration exceeds the limit (timeout).
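
A hypothetical sketch of that idea (not the actual worker code): the side thread both sends heartbeats and enforces a per-job timeout by terminating its own process.

```python
import os
import signal
import threading
import time

MAX_JOB_SECONDS = 3 * 60 * 60  # per-job time limit; value is an assumption


def start_watchdog(send_heartbeat, interval_seconds: int = 60) -> None:
    """Heartbeat + timeout in a single daemon thread running alongside the job."""
    started_at = time.monotonic()

    def run() -> None:
        while True:
            send_heartbeat()  # e.g. update last_heartbeat in the db
            if time.monotonic() - started_at > MAX_JOB_SECONDS:
                # Ask the whole worker process to stop; the queue logic can then
                # mark the job as errored instead of leaving it "started" forever.
                os.kill(os.getpid(), signal.SIGTERM)
                return
            time.sleep(interval_seconds)

    threading.Thread(target=run, daemon=True, name="watchdog").start()
```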

lhoestq (Member) commented Feb 10, 2023

> You're right, but it can be hard, because we're not sure whether the files have been updated by the worker or come from a previous job.

We just need to revert commits until we reach a valid one? Valid commits can be stored in the db when a job succeeds.

AndreaFrancis (Contributor) commented Feb 10, 2023

> Yes, you're right: we should just insert an error in the cache, with an error code "pod crashed" or something like that. No need to retry, I think.

What if a job failed because of a sudden pod crash but it is a small one? I agree with supporting a limited number of retry opportunities; that will also let us identify the conflicting datasets.

> Some operations like /parquet-and-dataset-info write data to Hub repos, so we need to clean it up at some point if the job gets canceled (for this one we just need to delete some files).

We need to implement a "rollback" logic for the parquet-specific case first, since we currently don't have that logic in the cancel-jobs endpoint (https://github.com/huggingface/datasets-server/blob/main/services/admin/src/admin/routes/cancel_jobs.py#L33) or in https://github.com/huggingface/datasets-server/blob/main/libs/libcommon/src/libcommon/queue.py#L433 - maybe we are committing partial parquet files when doing a prod deploy.

lhoestq (Member) commented Feb 10, 2023

Opened #804 regarding rollback for /parquet-and-dataset-info

lhoestq (Member) commented Feb 12, 2023

Btw, I recall that Celery (a task queue) already implements time limits with clean-ups and automatic retries with delays. Is someone familiar with it? It looks interesting.

severo (Collaborator, Author) commented Feb 13, 2023

> Valid commits can be stored in the db when a job succeeds

The commit is stored in the cache, along with the worker version.

severo (Collaborator, Author) commented Feb 13, 2023

> What if a job failed because of a sudden pod crash but it is a small one? I agree with supporting a limited number of retry opportunities; that will also let us identify the conflicting datasets.

Yes, but I think it's better to manage it in another PR, to focus on one feature at a time.

severo (Collaborator, Author) commented Feb 13, 2023

> committing partial parquet files

We use only one commit to send all the parquet files, and I don't know if it's possible for this commit to be "partial"; I don't think so (we use https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.create_commit).
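
For reference, the single-commit pattern with `HfApi.create_commit` looks roughly like this (repo id and file paths are placeholders); since all operations are applied in one commit, the Hub either gets all the parquet files or none of them:

```python
from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()
operations = [
    CommitOperationAdd(path_in_repo="default/train/0000.parquet", path_or_fileobj="/tmp/0000.parquet"),
    CommitOperationAdd(path_in_repo="default/train/0001.parquet", path_or_fileobj="/tmp/0001.parquet"),
]
api.create_commit(
    repo_id="user/dataset",   # placeholder
    repo_type="dataset",
    operations=operations,    # all files land in a single commit
    commit_message="Add parquet files",
)
```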

severo (Collaborator, Author) commented Feb 13, 2023

> Btw, I recall that Celery (a task queue) already implements time limits with clean-ups and automatic retries with delays. Is someone familiar with it? It looks interesting.

I looked at it, along with Dagster and Airflow (even if the aim is not the same). Maybe we could give it a try, but I'm not sure how we could manage the priorities in the queue (various levels of priority: NORMAL/LOW, limiting the number of concurrent jobs for the same user, prioritizing the jobs from users that have no started jobs yet, etc.).
Also: would it be able to work both with Kubernetes and locally without Kubernetes?

lhoestq (Member) commented Feb 13, 2023

There's a priority level you can set for each task in Celery: celery/celery#2635 (comment)

And we can set it based on the user's history, for example.

And I've seen examples of Kubernetes and docker-compose deployments on GitHub - you just need to define the Celery workers, the RabbitMQ queue, etc.
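
For illustration, a hedged sketch of what per-task priority could look like in Celery with a RabbitMQ broker (task name, broker URL, and values are placeholders, not a proposal for the actual codebase):

```python
from celery import Celery

app = Celery("datasets_server_jobs", broker="amqp://localhost")  # placeholder broker URL
app.conf.task_queue_max_priority = 10  # enable RabbitMQ priority queues
app.conf.task_default_priority = 5


@app.task(bind=True, time_limit=3600, max_retries=3)  # hard time limit and retry cap per task
def compute_first_rows(self, dataset: str) -> None:
    ...  # the actual job logic


# e.g. boost a user with no started jobs yet, lower the priority for a backfill
compute_first_rows.apply_async(args=["user/dataset"], priority=9)
```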

severo (Collaborator, Author) commented Feb 13, 2023

OK, it might be a good replacement for part of the code. Should we explore this as part of this issue, or fix the issue first and try to migrate afterward?

lhoestq (Member) commented Feb 13, 2023

I think we're still learning our needs, so for now I'd continue with the current implementation to keep full flexibility. Let's think about it again once we have random access to rows?

lhoestq self-assigned this Feb 14, 2023
lhoestq (Member) commented Feb 17, 2023

Zombie jobs are now killed every 10 minutes if their last heartbeat is more than 5 minutes old (there's one heartbeat every minute).

Next is to put an error response in the cache database when this happens.
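
In other words, the rule is roughly the following (a sketch, not the merged code from #826; field names are assumptions):

```python
from datetime import datetime, timedelta, timezone

ZOMBIE_PASS_EVERY = timedelta(minutes=10)   # how often the zombie killer runs
MAX_HEARTBEAT_AGE = timedelta(minutes=5)    # heartbeats are sent every minute


def is_zombie(job: dict) -> bool:
    """A "started" job whose heartbeat is too old is considered dead and gets killed."""
    last = job.get("last_heartbeat") or job["started_at"]
    return job["status"] == "started" and datetime.now(timezone.utc) - last > MAX_HEARTBEAT_AGE
```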
