Add Batch Processing of Jobs (aka Batching) #105

olttwa · 2023-02-17T12:05:42Z

What is Batching?

Sometimes a task involves working on multiple units. Instead of processing these units sequentially, executing them in parallel can speed up the task.
Say the task is to send emails to N Customers, where N is large. Instead of 1 Job sending emails to N Customers sequentially, the task can be sped up by enqueuing N Jobs, each sending the email to 1 Customer. This approach has 1 problem: you cannot track the status of the overall task or be notified of its completion.

Batch Processing is the process of executing N Jobs under 1 umbrella, tracking their status collectively, and being notified when all the Jobs are completed.
Batching helps reduce the time it takes to complete a large Task. Its Callbacks, which can trigger other Tasks when the Batch completes, also help build complex workflows.

Feature Specs

A Batch must move through its states in one direction only. Something like: created -> ready-for-execution -> in-progress -> executed-at-least-once -> successful/partially-successful/failed -> completed
Once a Batch is marked as ready-for-execution, no more Jobs may be added to it. Otherwise, if the existing Jobs finish and the Batch reaches the completed state, adding a Job would force it back to in-progress, which the one-way workflow forbids.
A Batch's status should be trackable via API. The status should return all metadata, like: {state: "in-progress", total: 100, executing: 30, successful: 45, retrying: 15, died: 10, created_at: "2023-feb-01 11:00", updated_at: "2023-feb-01 12:00"}
If a batch is deleted, all Jobs within the batch should be deleted.
A Batch must have a callback to mark its completion. For simplicity's sake, there should be only 1 callback, which signals that all Jobs have reached a state of completion (succeeded, or exhausted retries & died). Callbacks need to be implemented as per the details here: Client-side Callbacks when a Job executes #54
A Batch shouldn't have a deadline/timeout. That is best taken care of by retry-settings at an individual Job-level.
Scheduling a Job within a Batch shouldn't be allowed. If this functionality is needed, it can be achieved by a scheduled Job that creates a Batch of Jobs.
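
To make the one-way workflow concrete, here is a minimal Python sketch of the state rules above. It is an illustration under assumptions, not the proposed implementation: the names `BatchState`, `ALLOWED`, `Batch`, `add_job` and `transition` are hypothetical, and the exact set of states is the "something like" list from this issue.

```python
from enum import Enum


class BatchState(Enum):
    # State names taken from the workflow proposed in this issue.
    CREATED = "created"
    READY = "ready-for-execution"
    IN_PROGRESS = "in-progress"
    EXECUTED_ONCE = "executed-at-least-once"
    SUCCESSFUL = "successful"
    PARTIALLY_SUCCESSFUL = "partially-successful"
    FAILED = "failed"
    COMPLETED = "completed"


# Forward-only transitions: no state ever leads back to an earlier one.
ALLOWED = {
    BatchState.CREATED: {BatchState.READY},
    BatchState.READY: {BatchState.IN_PROGRESS},
    BatchState.IN_PROGRESS: {BatchState.EXECUTED_ONCE},
    BatchState.EXECUTED_ONCE: {
        BatchState.SUCCESSFUL,
        BatchState.PARTIALLY_SUCCESSFUL,
        BatchState.FAILED,
    },
    BatchState.SUCCESSFUL: {BatchState.COMPLETED},
    BatchState.PARTIALLY_SUCCESSFUL: {BatchState.COMPLETED},
    BatchState.FAILED: {BatchState.COMPLETED},
    BatchState.COMPLETED: set(),
}


class Batch:
    """Hypothetical in-memory Batch, used only to illustrate the rules."""

    def __init__(self):
        self.state = BatchState.CREATED
        self.jobs = []

    def add_job(self, job):
        # Jobs may only be added before the Batch is marked ready.
        if self.state is not BatchState.CREATED:
            raise ValueError(f"cannot add Jobs in state {self.state.value}")
        self.jobs.append(job)

    def transition(self, new_state):
        # Reject any move that is not a forward step in the workflow.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
```

With this shape, calling `add_job` after `transition(BatchState.READY)` raises an error, so a completed Batch can never be pulled back to in-progress.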

Nuances

Ordering of Job execution within a Batch cannot be guaranteed, since Jobs will be executed in parallel on different workers.
When a Batch is deleted, all of its Jobs still in the queue will be deleted. However, some Jobs might already be executing, and their deletion cannot be guaranteed. They might fail and continue to be retried until they are dead. To address this, Jobs can check the status of the Batch before executing, but checks performed before every execution can hurt performance. Hence, it's not advisable to enqueue Batches that might need to be deleted midway.
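
The pre-execution check described above could look roughly like the Python sketch below. Everything here is an assumption for illustration: `execute_job` is a hypothetical helper, and a plain dict stands in for whatever store would hold Batch metadata. In a real broker the membership test would be a network round trip per Job, which is the performance cost mentioned above.

```python
def execute_job(batch_store, batch_id, job_fn, *args):
    """Run job_fn only if its Batch still exists in the store.

    batch_store is a dict standing in for a persistent store; a deleted
    Batch is modelled as a missing key.
    """
    if batch_id not in batch_store:
        # The Batch was deleted: skip the Job so it neither fails nor retries.
        return False
    job_fn(*args)
    return True
```

A worker would call `execute_job(store, "batch-1", send_email, customer)` in place of `send_email(customer)`; both `send_email` and `customer` are placeholders.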

Implementation Details

This is a complex feature to build. Some ideas after initial investigation:

A persistent store will be required to store the count of executing Jobs. Hence, this feature can exist for message brokers such as Redis and Postgres, but not for RabbitMQ.
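
As a rough illustration of why stored counts are needed, the Python sketch below tracks Job outcomes per Batch and fires a single completion callback once every Job has either succeeded or died. It is a sketch under stated assumptions: `new_batch` and `record_outcome` are hypothetical names, a dict stands in for the persistent store, and a `threading.Lock` stands in for the atomic updates a real store would have to provide.

```python
import threading

# Stand-in for the atomicity a real persistent store would need to provide.
_lock = threading.Lock()


def new_batch(store, batch_id, total):
    """Register a Batch of `total` Jobs with zeroed outcome counters."""
    store[batch_id] = {"total": total, "successful": 0, "died": 0}


def record_outcome(store, batch_id, outcome, on_complete):
    """Count one finished Job; call on_complete when the last Job finishes.

    outcome must be "successful" or "died" (retries exhausted).
    Returns True if this call completed the Batch.
    """
    if outcome not in ("successful", "died"):
        raise ValueError(f"unknown outcome: {outcome}")
    with _lock:
        counts = store[batch_id]
        counts[outcome] += 1
        finished = counts["successful"] + counts["died"] == counts["total"]
        snapshot = dict(counts)
    if finished:
        # Invoked outside the lock so a slow callback does not block workers.
        on_complete(batch_id, snapshot)
    return finished
```

Because only the call that brings the finished count up to the total sees `finished` as true, the callback runs once per Batch in this sketch.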

olttwa · 2023-02-17T12:06:08Z

cc @bsless

olttwa added the feature label Feb 17, 2023

olttwa mentioned this issue Feb 17, 2023

Client-side Callbacks when a Job executes #54

Closed

olttwa mentioned this issue Sep 2, 2023

Add Batching feature #140

Merged

olttwa closed this as completed in #140 Oct 13, 2023
