Heavy data-processing background jobs in production environment #24

Open
mirekys opened this issue Jan 20, 2021 · 0 comments
mirekys commented Jan 20, 2021

How do we manage long-running background workflows that need to process repository data stored on S3?
Each workflow may consist of multiple tasks, and there is no guarantee that every task will run on the same host or share the same temporary working directory for its intermediate files.

Example use cases:

  • Creating archival packages for LTP (LTP integration #5)
  • Watermarking (restaurovani-test.vscht.cz #20)
    • static watermarks added to images upon upload
    • dynamic watermarks (e.g. with user details + IP + timestamp) added to documents upon download; this actually consists of the following separate tasks run in a workflow pipeline (a sketch of the wiring follows this list):

      ```python
      presentation_watermark_workflow = presentation_workflow_factory(task_list=[
          fetch_record_attachment,
          fetch_record_metadata,
          get_record_watermark_text,
          add_watermark,
          add_titlepage,
      ])
      ```
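
No task runner is prescribed here yet, so the following is only a minimal sketch of how such a pipeline could be wired, assuming Celery as the task queue; the broker URL, the `ctx` dictionary shape and all task bodies are made-up placeholders, not existing project code:

```python
from celery import Celery, chain

# Sketch only: Celery is an assumption, as are the broker URL and all task
# bodies below. Only the wiring of the pipeline is illustrated.
app = Celery("presentation", broker="redis://localhost:6379/0")

@app.task
def fetch_record_attachment(record_id):
    # Download the attachment and return a lightweight reference to it
    # (e.g. an S3 key), never the file contents themselves.
    return {"record_id": record_id, "attachment_key": f"tmp/{record_id}/source.pdf"}

@app.task
def fetch_record_metadata(ctx):
    ctx["metadata"] = {}  # placeholder: look up the record's metadata
    return ctx

@app.task
def get_record_watermark_text(ctx):
    # placeholder: build the dynamic text (user details + IP + timestamp)
    ctx["watermark_text"] = "user / ip / timestamp"
    return ctx

@app.task
def add_watermark(ctx):
    return ctx  # placeholder: render the watermark into the document

@app.task
def add_titlepage(ctx):
    return ctx  # placeholder: prepend the generated title page

def presentation_workflow_factory(task_list):
    # chain() feeds each task's return value in as the next task's first
    # argument, so the tasks can run on different workers.
    return chain(*(task.s() for task in task_list))

presentation_watermark_workflow = presentation_workflow_factory(task_list=[
    fetch_record_attachment,
    fetch_record_metadata,
    get_record_watermark_text,
    add_watermark,
    add_titlepage,
])

# presentation_watermark_workflow.delay("some-record-id") would enqueue the
# whole pipeline for one record.
```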
      

Questions

  • How to access the files to be processed by background tasks? (a staging sketch follows this list)
  • How to pass temporary data between tasks, which may be executed on different hosts or even different clusters?
  • How to clean up temporary workflow files if some kind of shared temporary space is used?
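
One possible answer to the first two questions, sketched under the assumption that tasks exchange only lightweight references (bucket + key) and stage files in and out of an S3-compatible store with boto3; the endpoint URL and scratch bucket name below are hypothetical:

```python
import tempfile

import boto3

# Assumptions: boto3 against the repository's S3-compatible storage; the
# endpoint URL and scratch bucket name are hypothetical examples.
s3 = boto3.client("s3", endpoint_url="https://s3.example.org")
SCRATCH_BUCKET = "workflow-tmp"

def stage_in(key):
    """Download an object to a local temp file and return the local path."""
    tmp = tempfile.NamedTemporaryFile(delete=False)
    s3.download_fileobj(SCRATCH_BUCKET, key, tmp)
    tmp.close()
    return tmp.name

def stage_out(local_path, key):
    """Upload a local file and return the reference the next task receives."""
    s3.upload_file(local_path, SCRATCH_BUCKET, key)
    return {"bucket": SCRATCH_BUCKET, "key": key}

# Each task then works in its own host's local /tmp and only ever passes the
# {"bucket": ..., "key": ...} reference on to the next task in the chain.
```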

Possible solutions

  1. Reduce every multi-task workflow to a single task (no need for shared storage, but the whole load of the workflow is concentrated on a single host)
  2. Use S3 for temp storage (slower than a local /tmp); a cleanup sketch follows this list
  3. Execute tasks on a k8s cluster near the S3 storage (Add processing k8s for data-intensive operations #22), which reduces latency and moves load away from the app cluster
  4. Provide some shared temp space that is accessible to all task workers across k8s clusters (not sure how to do this, or whether it is possible)
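
If option 2 is chosen, the cleanup question could be handled by giving each workflow run its own key prefix and deleting everything under it once the run finishes, with a bucket lifecycle expiration rule as a backstop for crashed runs. A minimal sketch, again with made-up endpoint and bucket names:

```python
import uuid

import boto3

s3 = boto3.client("s3", endpoint_url="https://s3.example.org")  # hypothetical endpoint
SCRATCH_BUCKET = "workflow-tmp"  # hypothetical scratch bucket

def new_run_prefix():
    """Give every workflow run its own namespace inside the scratch bucket."""
    return f"runs/{uuid.uuid4()}/"

def cleanup_run(prefix):
    """Delete every temporary object a finished (or failed) run left behind."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SCRATCH_BUCKET, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=SCRATCH_BUCKET, Delete={"Objects": objects})
```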
mirekys added the question and deployment labels on Jan 20, 2021
mirekys self-assigned this on Jan 20, 2021
mirekys added this to To do in Development via automation on Jan 20, 2021
mirekys changed the title from "Running heavy data-processing background jobs in production environment" to "Heavy data-processing background jobs in production environment" on Jan 20, 2021