
Max job retained pipelined runs #608

Conversation

@fruttasecca fruttasecca (Member) commented Dec 21, 2021

Description

This PR adds backend (BE) functionality to automatically clean up the oldest pipeline runs of a job. This is done through a new field on the Job model, max_retained_pipeline_runs, which can be specified both when creating a job (POST) and when updating it (PUT). When the field is set, the cleanup logic checks for the oldest job pipeline runs to delete 1) when new pipeline runs are scheduled for the job and 2) when a pipeline run of the job reaches an end state. A pipeline run can only be deleted if it is in an end state (SUCCESS, FAILED, ABORTED).
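For illustration, a minimal sketch of how a client might set this field when creating and updating a job; the base URL, endpoint paths, and surrounding payload fields are assumptions, only max_retained_pipeline_runs itself comes from this PR:

```python
# Illustrative only: base URL, endpoint paths and payload shape are
# assumptions, not the exact orchest-api contract.
import requests

ORCHEST_API = "http://localhost:8080/api"  # assumed base URL

# Create a job that keeps at most 5 finished pipeline runs around.
job_spec = {
    "name": "nightly-job",              # assumed fields
    "pipeline_uuid": "<pipeline-uuid>",
    "project_uuid": "<project-uuid>",
    "max_retained_pipeline_runs": 5,    # the new field from this PR
}
resp = requests.post(f"{ORCHEST_API}/jobs/", json=job_spec)
resp.raise_for_status()
job_uuid = resp.json()["uuid"]          # assumed response shape

# Later, tighten the retention policy of the existing job.
requests.put(
    f"{ORCHEST_API}/jobs/{job_uuid}",
    json={"max_retained_pipeline_runs": 2},
).raise_for_status()
```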

Some details:

  • the deletion is asynchronous with respect to new runs being scheduled or a run reaching an end state, i.e. the cleanup does not happen in the same transaction but after it. This is because the orchest-api does not control the userdir, the orchest-webserver does.
  • each job pipeline run (as before) has a pipeline_run_index, which is used to determine the oldest job pipeline runs to delete, i.e. highest_index - max_retained_pipeline_runs is the index threshold at or below which runs are considered for deletion (see the sketch after this list)
  • among the pipeline runs considered for deletion, only the ones in an end state are actually deleted
  • max_retained_pipeline_runs = 0 is a valid value, i.e. every run will clean itself up after reaching an end state
  • most likely this setting makes sense only for cron jobs, but I have not put a constraint in place; keen to hear your opinion on whether this should be a FE "constraint" or an actual BE constraint
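
A minimal sketch of the selection logic described in the list above, assuming runs are objects with a pipeline_run_index and a status attribute; this is illustrative, not the actual orchest-api implementation:

```python
# Illustrative sketch of the retention rule; not the actual orchest-api code.
END_STATES = {"SUCCESS", "FAILED", "ABORTED"}


def runs_to_delete(runs, max_retained_pipeline_runs):
    """Return the job pipeline runs that are candidates for cleanup."""
    if max_retained_pipeline_runs is None or max_retained_pipeline_runs < 0:
        return []

    highest_index = max((r.pipeline_run_index for r in runs), default=-1)
    # Runs with an index at or below this threshold are old enough to delete.
    threshold = highest_index - max_retained_pipeline_runs

    return [
        r
        for r in runs
        if r.pipeline_run_index <= threshold and r.status in END_STATES
    ]
```

With max_retained_pipeline_runs = 0 the threshold equals the highest index, so every run in an end state (including the latest one) becomes a candidate, matching the behaviour described above.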

Related to #550

@yannickperrenet The deletion of a job pipeline run's directory could have been done in several ways. I chose to preserve the constraint that the orchest-api does not touch the userdir and to reuse the existing orchest-webserver endpoint, so the orchest-api now calls the webserver, which is quite awkward since that call in turn results in a call back to the orchest-api. Keen to hear your opinion :).

⚠️ The PR contains schema changes that will alter your DB if you build & run.

Checklist

  • I have manually tested the application to make sure the changes don’t cause any downstream issues, which includes making sure ./orchest status --ext is not reporting failures when Orchest is running.
  • In case I changed one of the services’ models.py I have performed the appropriate database migrations.

Later used to retrieve job runs to delete when max_retained_pipeline_runs has a valid value.
A job-level property must be used to keep track of how many job pipeline runs have been scheduled, because job pipeline runs can now be deleted and the parameters of a (cron)job can be edited.
For some reason alembic did not like this migration (i.e. the previous version) when creating the database from scratch.
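
A hypothetical sketch of what the Job model additions might look like; aside from max_retained_pipeline_runs, the column names, types, and defaults are assumptions for illustration:

```python
# Hypothetical sketch; only max_retained_pipeline_runs is named in the PR,
# the counter column name, types and defaults are assumptions.
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()


class Job(db.Model):
    __tablename__ = "jobs"

    uuid = db.Column(db.String(36), primary_key=True)

    # A null (or negative) value is assumed to mean "keep every run";
    # 0 means every run cleans up after itself once it reaches an end state.
    max_retained_pipeline_runs = db.Column(db.Integer, nullable=True)

    # Job-level counter of how many pipeline runs have ever been scheduled,
    # so that pipeline_run_index keeps increasing even when old runs are
    # deleted or the (cron)job parameters are edited.
    total_scheduled_pipeline_runs = db.Column(
        db.Integer, nullable=False, server_default="0"
    )
```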
@ricklamers (Member) commented:

Nice job! LGTM

Makes it so that the deletion of a job pipeline run directory is done by
the celery worker rather than the orchest-webserver. This avoids a
bidirectional relationship between the orchest-webserver and the
orchest-api. The logic related to max_retained_pipeline_runs has been
refactored into a 2PF (two-phase function), DeleteNonRetainedJobPipelineRuns.
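
As a rough illustration of the two-phase split mentioned here: database changes happen in a transaction phase, while side effects such as dispatching the celery task that removes the run directory happen in a collateral phase after the commit. The class, method, helper, and task names below are assumptions, not the actual DeleteNonRetainedJobPipelineRuns implementation:

```python
# Illustrative sketch of the two-phase function (2PF) split; names are
# assumptions, not the actual orchest-api code.
class DeleteNonRetainedJobPipelineRunsSketch:
    def __init__(self, session, celery_app):
        self.session = session        # SQLAlchemy session (assumed)
        self.celery_app = celery_app  # Celery application (assumed)
        self._run_uuids_to_delete = []

    def _transaction(self, job_uuid):
        # Phase 1: inside the DB transaction, pick the end-state runs that
        # fall below the retention threshold and delete their records.
        runs = query_non_retained_end_state_runs(self.session, job_uuid)  # assumed helper
        self._run_uuids_to_delete = [run.uuid for run in runs]
        for run in runs:
            self.session.delete(run)

    def _collateral(self):
        # Phase 2: after the transaction has committed, ask the celery
        # worker to remove the run directories from the userdir, so the
        # orchest-api itself never touches the filesystem.
        for run_uuid in self._run_uuids_to_delete:
            self.celery_app.send_task(
                "app.core.tasks.delete_job_pipeline_run_directory",  # assumed task name
                kwargs={"run_uuid": run_uuid},
            )
```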
@yannickperrenet yannickperrenet (Contributor) left a comment

This is awesome! 💯

@fruttasecca fruttasecca merged commit 29438ef into feat/delete-individual-job-pipeline-run Dec 23, 2021
@yannickperrenet yannickperrenet deleted the feat/max-job-retained-pipelined-runs branch January 18, 2022 11:19