Heavy data-processing background jobs in production environment #24

Open
mirekys opened this issue Jan 20, 2021 · 0 comments
mirekys commented Jan 20, 2021

How do we manage long-running background workflows that need to process repository data stored on S3?
Each workflow may consist of multiple tasks, and there is no guarantee that every task will run on the same host or share the same temporary working directory for its intermediate files.

Example use cases:

  • Creating archival packages for LTP (LTP integration #5)
  • Watermarking (restaurovani-test.vscht.cz #20)
    • static watermarks added to images upon upload
    • dynamic watermarks (e.g. with user details + IP + timestamp) added to documents upon download; this actually consists of the following separate tasks run in a workflow pipeline (a sketch of the wiring follows this list):

      ```python
      presentation_watermark_workflow = presentation_workflow_factory(task_list=[
          fetch_record_attachment,
          fetch_record_metadata,
          get_record_watermark_text,
          add_watermark,
          add_titlepage,
      ])
      ```
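
No task runner is prescribed here yet, so the following is only a minimal sketch of how such a pipeline could be wired, assuming Celery as the task queue; the broker URL, the `ctx` dictionary shape and all task bodies are made-up placeholders, not existing project code:

```python
from celery import Celery, chain

# Sketch only: Celery is an assumption, as are the broker URL and all task
# bodies below. Only the wiring of the pipeline is illustrated.
app = Celery("presentation", broker="redis://localhost:6379/0")

@app.task
def fetch_record_attachment(record_id):
    # Download the attachment and return a lightweight reference to it
    # (e.g. an S3 key), never the file contents themselves.
    return {"record_id": record_id, "attachment_key": f"tmp/{record_id}/source.pdf"}

@app.task
def fetch_record_metadata(ctx):
    ctx["metadata"] = {}  # placeholder: look up the record's metadata
    return ctx

@app.task
def get_record_watermark_text(ctx):
    # placeholder: build the dynamic text (user details + IP + timestamp)
    ctx["watermark_text"] = "user / ip / timestamp"
    return ctx

@app.task
def add_watermark(ctx):
    return ctx  # placeholder: render the watermark into the document

@app.task
def add_titlepage(ctx):
    return ctx  # placeholder: prepend the generated title page

def presentation_workflow_factory(task_list):
    # chain() feeds each task's return value in as the next task's first
    # argument, so the tasks can run on different workers.
    return chain(*(task.s() for task in task_list))

presentation_watermark_workflow = presentation_workflow_factory(task_list=[
    fetch_record_attachment,
    fetch_record_metadata,
    get_record_watermark_text,
    add_watermark,
    add_titlepage,
])

# presentation_watermark_workflow.delay("some-record-id") would enqueue the
# whole pipeline for one record.
```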
      

Questions

  • How to access the files to be processed by background tasks? (a staging sketch follows this list)
  • How to pass temporary data between tasks, which may be executed on different hosts or even different clusters?
  • How to clean up temporary workflow files if some kind of shared temporary space is used?
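
One possible answer to the first two questions, sketched under the assumption that tasks exchange only lightweight references (bucket + key) and stage files in and out of an S3-compatible store with boto3; the endpoint URL and scratch bucket name below are hypothetical:

```python
import tempfile

import boto3

# Assumptions: boto3 against the repository's S3-compatible storage; the
# endpoint URL and scratch bucket name are hypothetical examples.
s3 = boto3.client("s3", endpoint_url="https://s3.example.org")
SCRATCH_BUCKET = "workflow-tmp"

def stage_in(key):
    """Download an object to a local temp file and return the local path."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    s3.download_fileobj(SCRATCH_BUCKET, key, tmp)
    tmp.close()
    return tmp.name

def stage_out(local_path, key):
    """Upload a local file and return the reference the next task receives."""
    s3.upload_file(local_path, SCRATCH_BUCKET, key)
    return {"bucket": SCRATCH_BUCKET, "key": key}

# Each task then works in its own host's local /tmp and only ever passes the
# {"bucket": ..., "key": ...} reference on to the next task in the chain.
```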

Possible solutions

  1. Reduce every multi-task workflow to a single task (no need for shared storage, but the whole load of the workflow is concentrated on a single host)
  2. Use S3 for temp storage (slower than a local /tmp); a cleanup sketch follows this list
  3. Execute tasks on a k8s cluster near the S3 storage (Add processing k8s for data-intensive operations #22), which reduces latency and moves load away from the app cluster
  4. Provide some shared temp space that is accessible to all task workers across k8s clusters (not sure how to do this, or whether it is possible)
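
If option 2 is chosen, the cleanup question could be handled by giving each workflow run its own key prefix and deleting everything under it once the run finishes, with a bucket lifecycle expiration rule as a backstop for crashed runs. A minimal sketch, again with made-up endpoint and bucket names:

```python
import uuid

import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")  # hypothetical endpoint
SCRATCH_BUCKET = "workflow-tmp"  # hypothetical scratch bucket

def new_run_prefix():
    """Give every workflow run its own namespace inside the scratch bucket."""
    return f"runs/{uuid.uuid4()}/"

def cleanup_run(prefix):
    """Delete every temporary object a finished (or failed) run left behind."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SCRATCH_BUCKET, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=SCRATCH_BUCKET, Delete={"Objects": objects})
```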
mirekys added the question and deployment labels on Jan 20, 2021
mirekys self-assigned this on Jan 20, 2021
mirekys added this to To do in Development via automation on Jan 20, 2021
mirekys changed the title from "Running heavy data-processing background jobs in production environment" to "Heavy data-processing background jobs in production environment" on Jan 20, 2021