
[Epic] support running multiple velero backups/restores concurrently #487

Open
ncdc opened this issue May 14, 2018 · 12 comments
Labels
1.10-candidate-rh (Committed to include by Red Hat for 1.10), Enhancement/User (End-User Enhancement to Velero), Epic, kind/requirement, Performance, Reviewed Q2 2021

Comments

@ncdc
Contributor

ncdc commented May 14, 2018

We need to account for a few things:

  1. If 2 server pods run simultaneously, they both might get the same Backup or Restore in a New state. They would both attempt to change the state to InProgress, and both would likely succeed, resulting in undesirable behavior (see the sketch after this list).
  2. If a Backup or Restore is InProgress and the server terminates for any reason (scaled down, crashes, terminates normally), the replacement server process should ideally pick up whatever was in progress rather than leaving it stuck.
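
A minimal sketch of that race, assuming a controller-runtime client and the Velero v1 Backup API types (illustrative only, not necessarily how the Ark controllers are written today): an unconditioned merge patch carries no resourceVersion precondition, so two servers can both move the same Backup to InProgress, whereas an Update sends the resourceVersion that was read and the losing writer gets a 409 Conflict.

```go
// Sketch only: why two servers can both move the same Backup from New to
// InProgress, and how optimistic concurrency would prevent it. Assumes a
// controller-runtime client and the Velero v1 API types.
package backupclaim

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// racyClaim is roughly what an unguarded controller does: a merge patch has
// no resourceVersion precondition, so if two servers run this concurrently
// against the same Backup, both patches succeed.
func racyClaim(ctx context.Context, c client.Client, backup *velerov1.Backup) error {
	original := backup.DeepCopy()
	backup.Status.Phase = velerov1.BackupPhaseInProgress
	return c.Patch(ctx, backup, client.MergeFrom(original))
}

// guardedClaim uses Update, which sends the resourceVersion we read. If
// another server already moved the Backup to InProgress, the API server
// rejects the second write with a 409 Conflict.
func guardedClaim(ctx context.Context, c client.Client, backup *velerov1.Backup) (bool, error) {
	if backup.Status.Phase != velerov1.BackupPhaseNew {
		return false, nil // already claimed by someone
	}
	backup.Status.Phase = velerov1.BackupPhaseInProgress
	switch err := c.Update(ctx, backup); {
	case apierrors.IsConflict(err):
		return false, nil // lost the race; another server owns it
	case err != nil:
		return false, err
	default:
		return true, nil
	}
}
```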
@rosskukulinski
Contributor

From a usability perspective, I think this is a particularly important issue for us to tackle. There are a number of reasons why two server pods might be running simultaneously, and we need to handle that gracefully.

In addition, there are lots of ways an InProgress backup can get stuck because a server exits. Again, we need to handle this gracefully. The metrics in #84 (a gauge showing the current number of InProgress or Failed backups) may help visualize these issues, but they won't fix them.

@ncdc
Contributor Author

ncdc commented Jun 14, 2018

A backup/restore won't get stuck during normal shutdown operations, because the Ark server waits for all in-progress work to complete before terminating. If that work takes too long and exceeds the pod's deletion grace period, Kubernetes will forcefully kill the container, interrupting the in-progress work before it can finish.

There are, however, plenty of situations where the Ark server could exit while doing work:

  • exceeding the grace period on a normal shutdown
  • OOM killed
  • a bug of some sort that causes a crash

This is definitely something we need to handle.

@rosskukulinski rosskukulinski added Enhancement/Dev Internal or Developer-focused Enhancement to Velero and removed Enhancement labels Jun 25, 2018
@rosskukulinski
Contributor

This needs a quick test (from code) to trigger (sketched below):

  • get a Backup that's New
  • patch it to InProgress
  • patch it to InProgress again (this should fail, since it's already in progress)
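
A hedged sketch of that test, assuming a controller-runtime client pointed at a real API server (e.g. envtest) and the Velero v1 types (names are illustrative): the second transition only fails if the write carries the stale resourceVersion, which is exactly the optimistic-concurrency guard the test should exercise. With an unconditioned merge patch, the second write would succeed, which is the bug.

```go
// Sketch of the quick test above. Assumes the controller claims a Backup via
// Update (so the write carries a resourceVersion precondition) and that this
// runs against a real API server (e.g. envtest).
package backupclaim

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

func verifySecondClaimConflicts(ctx context.Context, c client.Client, key client.ObjectKey) error {
	// 1. Get a Backup that's New.
	var backup velerov1.Backup
	if err := c.Get(ctx, key, &backup); err != nil {
		return err
	}
	if backup.Status.Phase != velerov1.BackupPhaseNew {
		return fmt.Errorf("expected a New backup, got %q", backup.Status.Phase)
	}
	stale := backup.DeepCopy() // keeps the original resourceVersion

	// 2. Patch it to InProgress: the first claim succeeds.
	backup.Status.Phase = velerov1.BackupPhaseInProgress
	if err := c.Update(ctx, &backup); err != nil {
		return err
	}

	// 3. Patch it to InProgress again from the stale copy: this should fail
	//    with a 409 Conflict, since it is already in progress.
	stale.Status.Phase = velerov1.BackupPhaseInProgress
	if err := c.Update(ctx, stale); !apierrors.IsConflict(err) {
		return fmt.Errorf("expected a conflict, got %v", err)
	}
	return nil
}
```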

@ncdc
Contributor Author

ncdc commented Oct 11, 2018

I had a thought about how to implement this.

Each ark server process is assigned a unique identifier: the name of the pod (we can get the value using the downward API and pass it to the ark server as a flag).

Each controller worker is also assigned a unique identifier.

When a new item (backup, restore) is processed by a controller, the first thing the controller attempts to do is set status.arkServerID and status.workerID. Assuming that succeeds without a conflict, the worker can proceed to do its work.

When a worker sees an InProgress item, it checks status.arkServerID:

  • If there is no running pod matching that name, the worker resets the status back to New for reprocessing.
  • If there is a running pod matching that name and it is this ark server, reset the status to New if there are no active workers matching status.workerID.

The controller would also need to add event handlers for pods. Upon a change, we'd want to reevaluate all InProgress items to see if they need to be taken over.

There's probably a lot more to flesh out here, but I wanted to write this down before I forgot it.
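
A rough sketch of the claim-and-takeover check described above, assuming a controller-runtime client. The proposed status.arkServerID / status.workerID fields don't exist in the API, so annotations stand in for them here purely for illustration; the pod name would come in via the downward API as suggested.

```go
// Sketch of the proposed takeover logic. status.arkServerID and
// status.workerID do not exist in the API; annotations stand in for them
// here purely for illustration. Assumes a controller-runtime client.
package backupclaim

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

const (
	serverIDAnnotation = "ark.heptio.com/server-id" // stand-in for status.arkServerID
	workerIDAnnotation = "ark.heptio.com/worker-id" // stand-in for status.workerID
)

// reclaimIfOrphaned resets an InProgress backup to New when the server pod
// that claimed it no longer exists, so a live server can reprocess it. A
// conflict on the write means another server already handled the takeover.
func reclaimIfOrphaned(ctx context.Context, c client.Client, backup *velerov1.Backup, namespace string) error {
	if backup.Status.Phase != velerov1.BackupPhaseInProgress {
		return nil
	}
	serverID := backup.Annotations[serverIDAnnotation] // name of the claiming pod
	if serverID == "" {
		return nil
	}

	// Is there still a pod with that name? (A fuller check would also
	// confirm the pod is Running and that the claiming worker is active.)
	var pod corev1.Pod
	err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: serverID}, &pod)
	if err == nil {
		return nil // the claiming server is still around; leave it alone
	}
	if !apierrors.IsNotFound(err) {
		return err
	}

	// The claiming server is gone: reset the item for reprocessing.
	backup.Status.Phase = velerov1.BackupPhaseNew
	delete(backup.Annotations, serverIDAnnotation)
	delete(backup.Annotations, workerIDAnnotation)
	if err := c.Update(ctx, backup); err != nil && !apierrors.IsConflict(err) {
		return err
	}
	return nil
}
```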

@skriss skriss modified the milestones: v1.0.0, v0.11.0 Nov 28, 2018
@skriss skriss modified the milestones: v0.11.0, v0.12.0 Feb 11, 2019
@skriss skriss modified the milestones: v0.12.0, v1.0.0 Mar 13, 2019
@skriss skriss added Enhancement/User End-User Enhancement to Velero and removed Enhancement/Dev Internal or Developer-focused Enhancement to Velero labels Aug 29, 2019
@skriss skriss changed the title Properly handle multiple workers/controllers processing the same backup/restore [RFE] support running multiple velero backups/restores concurrently Dec 12, 2019
@nrb
Contributor

nrb commented Aug 11, 2020

@xmath279 Thanks for that feedback!

@nrb nrb modified the milestones: v1.5, v1.6 Aug 11, 2020
@nrb nrb modified the milestones: v1.6.0, v1.7.0 Feb 22, 2021
@dsu-igeek
Contributor

This will be done after the design is finished: #2601

@Oblynx

Oblynx commented Oct 27, 2021

This also limits our scaling.

An idea would be to split the Velero servers into shards based on labels.
All that should be needed is for Velero to reconcile backups matching a label selector and ignore the rest. Imagine:

deployment velero1: watches label velero-shard=1
deployment velero2: watches label velero-shard=2

backup1: label velero-shard=1
backup2: label velero-shard=2

There should be no further interaction at the level of the custom resource.
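
A minimal sketch of what that could look like in a controller-runtime based controller, assuming a hypothetical --shard-label-selector flag (nothing like it exists today); each Velero deployment only reconciles Backups whose labels match its own selector.

```go
// Sketch of label-based sharding. Assumes a hypothetical
// --shard-label-selector flag (e.g. "velero-shard=1"); each server only
// reconciles Backups whose labels match its own selector.
package sharding

import (
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// shardPredicate filters watch events so a controller ignores Backups that
// belong to another shard.
func shardPredicate(selector string) (predicate.Predicate, error) {
	sel, err := labels.Parse(selector) // e.g. "velero-shard=1"
	if err != nil {
		return nil, err
	}
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		return sel.Matches(labels.Set(obj.GetLabels()))
	}), nil
}
```

This would be wired into the controller's watch (e.g. via the builder's WithEventFilter), so a backup labelled velero-shard=2 never even reaches velero1's work queue.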

WDYT? cc @nrb

@pupseba

pupseba commented Feb 4, 2022

This is a major drawback for this tool, especially when using the restic integration. A DevOps approach where we provide Velero as a service to different teams, so they can plan their own backup policies for their applications, is chaos without parallelism.

It's not only bad for DevOps: it also negatively affects any RTO, and it creates a lot of uncertainty around RPOs, because you know when you schedule a backup but not when its turn will come in a queue shared across all the application teams in a cluster.

It seems that this feature request is not marked as P1 - Important; maybe this could be reconsidered? @eleanor-millman

@eleanor-millman
Contributor

Thanks for the points. We will be reviewing this in a few weeks when we go through the items still open in the 1.8 project.

@qdupuy

qdupuy commented May 2, 2022

Hello 🖖,

Any news on the feature for running multiple jobs at the same time?

@eleanor-millman
Contributor

Hi @qdupuy, no immediate news. I can tell you that parallelization of Velero (which would probably include this work) is on our radar, but we are first focusing on other work, like adding a data mover to Velero and bringing the CSI plugin to GA.

@eleanor-millman eleanor-millman added the 1.10-candidate The label used for 1.10 planning discussion. label May 25, 2022
@dymurray dymurray added the 1.10-candidate-rh Committed to include by Red Hat for 1.10 label May 31, 2022
@eleanor-millman eleanor-millman removed the 1.10-candidate The label used for 1.10 planning discussion. label Jun 2, 2022