[Epic] support running multiple velero backups/restores concurrently #487
From a usability perspective, I think this is a particularly important issue for us to tackle. There are a number of reasons why two Server pods might be running simultaneously, and we need to handle that gracefully. In addition, there are lots of ways an InProgress backup can get stuck because a server exits. Again, we need to handle this gracefully. Metrics in #84 (e.g. a gauge displaying the current number of InProgress or Failed backups) may help visualize these issues, but they won't fix them.
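The gauge idea above boils down to counting items per phase. A minimal sketch of that counting logic (Velero's real metrics are exported via Prometheus; `phaseGauges` is an illustrative name, not an actual Velero function):

```go
package main

import "fmt"

// phaseGauges computes the values such gauges would report: the number of
// backups currently in each phase. Sketch only; the real implementation
// would feed these counts into Prometheus gauges.
func phaseGauges(phases []string) map[string]int {
	counts := map[string]int{}
	for _, p := range phases {
		counts[p]++
	}
	return counts
}

func main() {
	// Simulated phases of all Backup objects in the cluster.
	fmt.Println(phaseGauges([]string{"InProgress", "Failed", "InProgress", "Completed"}))
}
```

As the comment notes, a gauge like this surfaces stuck backups but does nothing to recover them.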
A backup/restore won't get stuck during normal shutdown operations because the Ark server waits for all in-progress work to complete before terminating. If that work takes too long and exceeds the pod's deletion grace period, then Kubernetes would forcefully kill the container, and that would interrupt the in-progress work before it had a chance to finish. There are, however, plenty of situations where the Ark server could exit while doing work:
This is definitely something we need to handle.
This needs a quick test (from code) to trigger:
I had a thought about how to implement this. Each ark server process is assigned a unique identifier: the name of its pod (we can get the value via the downward API and pass it to the server). Each controller worker is also assigned a unique identifier. When a new item (backup, restore) is processed by a controller, the first thing the controller attempts to do is record itself as the item's owner. When a worker sees an InProgress item, it checks that recorded owner.
The controller would also need to add event handlers for pods. Upon a change, we'd want to reevaluate all InProgress items to see if they need to be taken over. There's probably a lot more to flesh out here, but I wanted to write this down before I forgot it.
@xmath279 Thanks for that feedback!
Will be done after design is finished - #2601 |
This also limits our scaling. An idea would be to split velero servers into shards based on labels.
There should be no further interaction at the level of the custom resource. WDYT? cc @nrb |
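The label-based sharding idea above could be as simple as hashing a label value to a shard index, with each velero server processing only the items that hash to its shard. A minimal sketch under that assumption (`shardFor` is a hypothetical helper, not part of Velero):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor assigns an item to one of n velero server shards by hashing a
// label value (e.g. the owning team or namespace). Each server would only
// process items whose shard index matches its own, so no coordination is
// needed at the level of the custom resource.
func shardFor(label string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(label))
	return int(h.Sum32()) % n
}

func main() {
	for _, team := range []string{"team-payments", "team-search", "team-ml"} {
		fmt.Printf("%s -> shard %d\n", team, shardFor(team, 3))
	}
}
```

A fixed hash keeps the mapping stable across restarts, though resizing the number of shards would reshuffle assignments; an explicit label-to-shard config is a reasonable alternative.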
This is a major drawback for this tool, especially when using the restic integration. Having a DevOps approach where we provide Velero as a service to different teams, so they can plan their own backup policies for their applications, is chaos without parallelism. This is not only bad for DevOps: it also negatively affects any RTO and creates a lot of uncertainty around RPOs, since you'll know when you schedule a backup but not when its turn comes in a queue shared across all the application teams in a cluster. It seems that this feature request is not marked as P1 - Important; maybe this could be reconsidered? @eleanor-millman
Thanks for the points. We will be reviewing this in a few weeks when we go through the items still open in the 1.8 project. |
Hello 🖖, any news on the feature for running multiple jobs at the same time?
Hi @qdupuy no immediate news. I can tell you that parallelization of Velero (which would probably include this work) is something on our radar, but we first are focusing on other work like adding a data mover to Velero and bringing the CSI plugin to GA. |
We need to account for a few things: