Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OADP-1067: Design for Batching VSM CRs #179

Merged
merged 4 commits into from
Jan 17, 2023
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
62 changes: 62 additions & 0 deletions docs/design/batching.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
# Batching VolumeSnapshotMover Backups and Restores

## Summary
The VolumeSnapshotMover controller currently runs on each `volumeSnapshotBackup`
and `volumeSnapshotRestore` until completed. The problem with this is that as
the number of VolSync processes increases, the time for backup/restore to complete
can become slow, as diminishing returns increases.


## Performance Testing Results
[Performance testing](https://docs.google.com/document/d/1kPKo46_McEkj0Fu9hdkTxiVDnTh00forFQnscTgiEgg/edit?usp=sharing)
eemcmullan marked this conversation as resolved.
Show resolved Hide resolved
found that there is an optimal batch number of 12 for backup. For restore,
the results appear to be linear.
eemcmullan marked this conversation as resolved.
Show resolved Hide resolved


## Motivation
Improve the performance of VolumeSnapshotMover backup and restore by allowing
users to specify a concurrent number of `volumeSnapshotBackups` and
`volumeSnapshotRestores` that can be simultaneously `inProgress` at once.


## Implementation details
Configurable values, `dpa.spec.dataMover.concurrentBackupVolumes` and
`dpa.spec.dataMover.concurrentRestoreVolumes`, can be used to specify
eemcmullan marked this conversation as resolved.
Show resolved Hide resolved
a number of `volumeSnapshotBackup` and `volumeSnapshotRestore` CRs that should
be operated on at once. For example, given 100 PVs, a user may want to use
VolumeSnapshotMover to backup 15 at a time and restore 20 at a time for
improved performance.

To implement this in the the VSM controller, we can keep track of the current
eemcmullan marked this conversation as resolved.
Show resolved Hide resolved
number of `volumeSnapshotBackups` or `volumeSnapshotRestores` that are not yet
completed. At the beginning of the reconcile, a counter can be incremented for
each VSB/VSR that starts.
When this counter is equal to the concurrentVolumes number, then requeue without
starting another VSB/VSR process.
Once a CR status changes to completed, this value will be decremented, and another
`volumeSnapshotBackup` or `volumeSnapshotRestore` can start.

If `dpa.spec.dataMover.concurrentBackupVolumes` or
`dpa.spec.dataMover.concurrentRestoreVolumes` is not set or it is set to zero,
then a default value will be used. This value is TBD.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I'm wondering how this would work if a user wants to set it to zero, meaning "no limit". For example, if the default is 12, then a user setting this to 0 will result in the default of 12 being used. Maybe we don't allow "no limit" -- but practically speaking, the user can still have "no limit" by setting it to 1000 or something similarly huge.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one way to allow unlimited would be to implement the setting as an int pointer rather than an int -- that way we can distinguish between "not present; use default" (i.e. value is nil) and "unlimited" (value is 0; use this rather than default)" Thinking more about it, I'm leaning towards this. If we want "not specified" to mean "use the default" then we want the spec field to be *int64 rather than int64, so that 0 won't be overridden by the default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this approach sounds more reasonable. Will update with these details.


One option for a default value is to allow for unlimted `volumeSnapshotBackups` or
eemcmullan marked this conversation as resolved.
Show resolved Hide resolved
`volumeSnapshotRestores` if it is not set. Another option is to defing constants
such as `DefaultConcurrentBackupVolumes` and `DefaultConcurrentRestoreVolumes`
that can be set to a determined number, such as 0 or 12. More discussion is needed
around this topic.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If 0 means "unlimited" then we don't really need to know which option we're using in the design -- when we decide the default value, then it's just a matter of assigning the const in the golang file. If the default is zero, then it's unlimited -- if it's another number, then it's limited to that.



### DPA config:

```
...
spec:
features:
dataMover:
enable: true
credentialName: <dm-restic-secret-name>
concurrentBackupVolumes: 12
concurrentRestoreVolumes: 50
...
```