
[Epic] support running multiple velero backups/restores concurrently #487

Open
ncdc opened this issue May 14, 2018 · 12 comments
Labels
1.10-candidate-rh (Committed to include by Red Hat for 1.10), Enhancement/User (End-User Enhancement to Velero), Epic, kind/requirement, Performance, Reviewed Q2 2021

Comments

@ncdc
Contributor

ncdc commented May 14, 2018

We need to account for a few things:

  1. If 2 server pods run simultaneously, they both might get the same Backup or Restore in a New state. They would both attempt to change the state to InProgress, and both would likely succeed, resulting in undesirable behavior (see the sketch after this list).
  2. If a Backup or Restore is InProgress and the server terminates for any reason (scaled down, crashes, terminates normally), the replacement server process should ideally pick up whatever was in progress rather than leaving it stuck.
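
A minimal sketch of that race, assuming a controller-runtime client and the Velero v1 Backup API types (illustrative only, not necessarily how the Ark controllers are written today): an unconditioned merge patch carries no resourceVersion precondition, so two servers can both move the same Backup to InProgress, whereas an Update sends the resourceVersion that was read and the losing writer gets a 409 Conflict.

```go
// Sketch only: why two servers can both move the same Backup from New to
// InProgress, and how optimistic concurrency would prevent it. Assumes a
// controller-runtime client and the Velero v1 API types.
package backupclaim

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// racyClaim is roughly what an unguarded controller does: a merge patch has
// no resourceVersion precondition, so if two servers run this concurrently
// against the same Backup, both patches succeed.
func racyClaim(ctx context.Context, c client.Client, backup *velerov1.Backup) error {
	original := backup.DeepCopy()
	backup.Status.Phase = velerov1.BackupPhaseInProgress
	return c.Patch(ctx, backup, client.MergeFrom(original))
}

// guardedClaim uses Update, which sends the resourceVersion we read. If
// another server already moved the Backup to InProgress, the API server
// rejects the second write with a 409 Conflict.
func guardedClaim(ctx context.Context, c client.Client, backup *velerov1.Backup) (bool, error) {
	if backup.Status.Phase != velerov1.BackupPhaseNew {
		return false, nil // already claimed by someone
	}
	backup.Status.Phase = velerov1.BackupPhaseInProgress
	switch err := c.Update(ctx, backup); {
	case apierrors.IsConflict(err):
		return false, nil // lost the race; another server owns it
	case err != nil:
		return false, err
	default:
		return true, nil
	}
}
```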
@rosskukulinski
Contributor

From a usability perspective, I think this is a particularly important issue for us to tackle. There are a number of reasons why two server pods might be running simultaneously, and we need to handle that gracefully.

In addition, there are lots of ways an InProgress backup can get stuck because a server exits. Again, we need to handle this gracefully. The metrics in #84 (a gauge showing the current number of InProgress or Failed backups) may help visualize these issues, but they won't fix them.

@ncdc
Contributor Author

ncdc commented Jun 14, 2018

A backup/restore won't get stuck during normal shutdown operations, because the Ark server waits for all in-progress work to complete before terminating. If that work takes too long and exceeds the pod's deletion grace period, Kubernetes will forcefully kill the container, interrupting the in-progress work before it can finish.

There are, however, plenty of situations where the Ark server could exit while doing work:

  • exceeding the grace period on a normal shutdown
  • OOM killed
  • a bug of some sort that causes a crash

This is definitely something we need to handle.

@rosskukulinski rosskukulinski added Enhancement/Dev Internal or Developer-focused Enhancement to Velero and removed Enhancement labels Jun 25, 2018
@rosskukulinski
Contributor

This needs a quick test (from code) to trigger (sketched below):

  • get a Backup that's New
  • patch it to InProgress
  • patch it to InProgress again (this should fail, since it's already in progress)
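
A hedged sketch of that test, assuming a controller-runtime client pointed at a real API server (e.g. envtest) and the Velero v1 types (names are illustrative): the second transition only fails if the write carries the stale resourceVersion, which is exactly the optimistic-concurrency guard the test should exercise. With an unconditioned merge patch, the second write would succeed, which is the bug.

```go
// Sketch of the quick test above. Assumes the controller claims a Backup via
// Update (so the write carries a resourceVersion precondition) and that this
// runs against a real API server (e.g. envtest).
package backupclaim

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

func verifySecondClaimConflicts(ctx context.Context, c client.Client, key client.ObjectKey) error {
	// 1. Get a Backup that's New.
	var backup velerov1.Backup
	if err := c.Get(ctx, key, &backup); err != nil {
		return err
	}
	if backup.Status.Phase != velerov1.BackupPhaseNew {
		return fmt.Errorf("expected a New backup, got %q", backup.Status.Phase)
	}
	stale := backup.DeepCopy() // keeps the original resourceVersion

	// 2. Patch it to InProgress: the first claim succeeds.
	backup.Status.Phase = velerov1.BackupPhaseInProgress
	if err := c.Update(ctx, &backup); err != nil {
		return err
	}

	// 3. Patch it to InProgress again from the stale copy: this should fail
	//    with a 409 Conflict, since it is already in progress.
	stale.Status.Phase = velerov1.BackupPhaseInProgress
	if err := c.Update(ctx, stale); !apierrors.IsConflict(err) {
		return fmt.Errorf("expected a conflict, got %v", err)
	}
	return nil
}
```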

@ncdc
Contributor Author

ncdc commented Oct 11, 2018

I had a thought about how to implement this.

Each ark server process is assigned a unique identifier: the name of the pod (we can get the value using the downward API and pass it to the ark server as a flag).

Each controller worker is also assigned a unique identifier.

When a new item (backup, restore) is processed by a controller, the first thing the controller attempts to do is set status.arkServerID and status.workerID. Assuming that succeeds without a conflict, the worker can proceed to do its work.

When a worker sees an InProgress item, it checks status.arkServerID:

  • If there is no running pod matching that name, the worker resets the status back to New for reprocessing.
  • If there is a running pod matching that name and it is this ark server, reset the status to New if there are no active workers matching status.workerID.

The controller would also need to add event handlers for pods. Upon a change, we'd want to reevaluate all InProgress items to see if they need to be taken over.

There's probably a lot more to flesh out here, but I wanted to write this down before I forgot it.
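
A rough sketch of the claim-and-takeover check described above, assuming a controller-runtime client. The proposed status.arkServerID / status.workerID fields don't exist in the API, so annotations stand in for them here purely for illustration; the pod name would come in via the downward API as suggested.

```go
// Sketch of the proposed takeover logic. status.arkServerID and
// status.workerID do not exist in the API; annotations stand in for them
// here purely for illustration. Assumes a controller-runtime client.
package backupclaim

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

const (
	serverIDAnnotation = "ark.heptio.com/server-id" // stand-in for status.arkServerID
	workerIDAnnotation = "ark.heptio.com/worker-id" // stand-in for status.workerID
)

// reclaimIfOrphaned resets an InProgress backup to New when the server pod
// that claimed it no longer exists, so a live server can reprocess it. A
// conflict on the write means another server already handled the takeover.
func reclaimIfOrphaned(ctx context.Context, c client.Client, backup *velerov1.Backup, namespace string) error {
	if backup.Status.Phase != velerov1.BackupPhaseInProgress {
		return nil
	}
	serverID := backup.Annotations[serverIDAnnotation] // name of the claiming pod
	if serverID == "" {
		return nil
	}

	// Is there still a pod with that name? (A fuller check would also
	// confirm the pod is Running and that the claiming worker is active.)
	var pod corev1.Pod
	err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: serverID}, &pod)
	if err == nil {
		return nil // the claiming server is still around; leave it alone
	}
	if !apierrors.IsNotFound(err) {
		return err
	}

	// The claiming server is gone: reset the item for reprocessing.
	backup.Status.Phase = velerov1.BackupPhaseNew
	delete(backup.Annotations, serverIDAnnotation)
	delete(backup.Annotations, workerIDAnnotation)
	if err := c.Update(ctx, backup); err != nil && !apierrors.IsConflict(err) {
		return err
	}
	return nil
}
```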

@skriss skriss modified the milestones: v1.0.0, v0.11.0 Nov 28, 2018
@skriss skriss modified the milestones: v0.11.0, v0.12.0 Feb 11, 2019
@skriss skriss modified the milestones: v0.12.0, v1.0.0 Mar 13, 2019
@skriss skriss added Enhancement/User End-User Enhancement to Velero and removed Enhancement/Dev Internal or Developer-focused Enhancement to Velero labels Aug 29, 2019
@skriss skriss changed the title Properly handle multiple workers/controllers processing the same backup/restore [RFE] support running multiple velero backups/restores concurrently Dec 12, 2019
@nrb
Contributor

nrb commented Aug 11, 2020

@xmath279 Thanks for that feedback!

@nrb nrb modified the milestones: v1.5, v1.6 Aug 11, 2020
@nrb nrb modified the milestones: v1.6.0, v1.7.0 Feb 22, 2021
@dsu-igeek
Contributor

This will be done after the design is finished: #2601

@Oblynx

Oblynx commented Oct 27, 2021

This also limits our scaling.

An idea would be to split the Velero servers into shards based on labels.
All that should be needed is for Velero to reconcile backups matching a label selector and ignore the rest. Imagine:

deployment velero1: watches label velero-shard=1
deployment velero2: watches label velero-shard=2

backup1: label velero-shard=1
backup2: label velero-shard=2

There should be no further interaction at the level of the custom resource.
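
A minimal sketch of what that could look like in a controller-runtime based controller, assuming a hypothetical --shard-label-selector flag (nothing like it exists today); each Velero deployment only reconciles Backups whose labels match its own selector.

```go
// Sketch of label-based sharding. Assumes a hypothetical
// --shard-label-selector flag (e.g. "velero-shard=1"); each server only
// reconciles Backups whose labels match its own selector.
package sharding

import (
	"k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// shardPredicate filters watch events so a controller ignores Backups that
// belong to another shard.
func shardPredicate(selector string) (predicate.Predicate, error) {
	sel, err := labels.Parse(selector) // e.g. "velero-shard=1"
	if err != nil {
		return nil, err
	}
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		return sel.Matches(labels.Set(obj.GetLabels()))
	}), nil
}
```

This would be wired into the controller's watch (e.g. via the builder's WithEventFilter), so a backup labelled velero-shard=2 never even reaches velero1's work queue.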

WDYT? cc @nrb

@pupseba

pupseba commented Feb 4, 2022

This is a major drawback for this tool, especially when using the restic integration. A DevOps approach where we provide Velero as a service to different teams, so they can plan their own backup policies for their applications, is chaos without parallelism.

It's not only bad for DevOps: it also negatively affects any RTO, and it creates a lot of uncertainty around RPOs, because you know when you schedule a backup but not when its turn will come in a queue shared across all the application teams in a cluster.

It seems that this feature request is not marked as P1 - Important; maybe this could be reconsidered? @eleanor-millman

@eleanor-millman
Contributor

Thanks for the points. We will be reviewing this in a few weeks when we go through the items still open in the 1.8 project.

@qdupuy

qdupuy commented May 2, 2022

Hello 🖖,

Any news on the feature for running multiple jobs at the same time?

@eleanor-millman
Contributor

Hi @qdupuy, no immediate news. I can tell you that parallelization of Velero (which would probably include this work) is on our radar, but we are first focusing on other work, like adding a data mover to Velero and bringing the CSI plugin to GA.

@eleanor-millman eleanor-millman added the 1.10-candidate The label used for 1.10 planning discussion. label May 25, 2022
@dymurray dymurray added the 1.10-candidate-rh Committed to include by Red Hat for 1.10 label May 31, 2022
@eleanor-millman eleanor-millman removed the 1.10-candidate The label used for 1.10 planning discussion. label Jun 2, 2022