
[FEATURE] Introduce faster compression and multiple threads for volume backup & restore #5189

Closed
5 tasks done
derekbit opened this issue Jan 3, 2023 · 14 comments
Assignees
Labels
area/performance System, volume performance area/ui UI related like UI or CLI area/volume-backup-restore Volume backup restore highlight Important feature/issue to highlight kind/feature Feature request, new feature priority/0 Must be fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/doc Require updating the longhorn.io documentation require/lep Require adding/updating enhancement proposal
Milestone

Comments

@derekbit
Member

derekbit commented Jan 3, 2023

Is your improvement request related to a feature? Please describe (👍 if you like this request)

The backup mechanism in Longhorn is not efficient, as investigated in #1409.

Besides the refactoring, this task will cover:

Backup

  • Global backup compression setting
  • Per-volume backup compression setting (volume.spec.backupCompressionMethod)
  • Introduce faster compression algorithms, lz4 or zstd, into Longhorn, while handling backward compatibility with existing backups that use gzip.
    • none: suitable for already-encoded data, e.g. encoded images and video
    • lz4: suitable for text data
    • gzip: slightly higher compression ratio, but slow (not recommended)
  • Concurrent compression and transfer workers
    • Introduce producer-consumer pattern
    • Improve the compression and upload throughput

Restore

  • Concurrent decompression and transfer workers


Additional context

#1409
#3900
#3175
#4379

@derekbit derekbit added the kind/improvement Request for improvement of existing function label Jan 3, 2023
@derekbit derekbit self-assigned this Jan 3, 2023
@innobead innobead changed the title [IMPROVEMENT] Imrpove backup efficiency by faster compression and mutiple threads [IMPROVEMENT] Improve backup efficiency by faster compression and multiple threads Jan 3, 2023
@innobead innobead added priority/0 Must be fixed in this release (managed by PO) area/volume-backup-restore Volume backup restore area/performance System, volume performance labels Jan 3, 2023
@innobead innobead added this to the v1.5.0 milestone Jan 3, 2023
@derekbit
Member Author

derekbit commented Jan 4, 2023

In summary, the backup throughput is increased by 15X when using lz4 and 10 concurrent threads in comparison with the backup in Longhorn v1.4.0.

Setup
Platform: Equinix
Host: Japan-Tokyo/m3.small.x86
CPU: Intel(R) Xeon(R) E-2378G CPU @ 2.80GHz
RAM: 64 GiB
Disk: Micron_5300_MTFD
OS: Ubuntu 22.04.1 LTS(kernel 5.15.0-53-generic)
Kubernetes: v1.23.6+rke2r2
Longhorn: master-branch + backup improvement
Nodes: 3 nodes

Backupstore target: external MinIO S3 (m3.small.x86)

Single-Threaded Backup and Restore
Volume: 50 GiB containing 1 GiB filesystem metadata and 10 GiB random data (3 replicas)
[chart: Single-Threaded Backup and Restore]

Multi-Threaded Backup
[chart: Multi-Threaded Backup]

@innobead
Member

innobead commented Jan 4, 2023

In summary, the backup throughput is increased by 15X when using lz4 and 10 concurrent threads in comparison with the backup in Longhorn v1.4.0.

This is wonderful and awesome 🎆 It can benefit the disaster recovery story, especially RTO.

@innobead innobead added the highlight Important feature/issue to highlight label Jan 4, 2023
@innobead innobead changed the title [IMPROVEMENT] Improve backup efficiency by faster compression and multiple threads [FEATURE] Introduce faster compression and multiple threads for volume backup & restore Jan 4, 2023
@innobead innobead added kind/feature Feature request, new feature and removed kind/improvement Request for improvement of existing function labels Jan 4, 2023
@joshimoo
Contributor

joshimoo commented Jan 5, 2023

Make sure to introduce a backup format version; that way it's easy to implement backward compatibility.
We also want to use much larger blocks in the new version. Consider going from the current 2 MB to 16 MB or 64 MB: this improves compression and further reduces the number of block files, which lowers the cost of S3 lookups. I am thinking 16 MB could be a good middle ground.

This is optional, but for larger volumes it would be nice if we could dynamically switch to 64 MB; perhaps a block size field in the new backup v2 format?
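A versioned backup config carrying a block size, along the lines suggested above, might look like this hypothetical Go sketch. The type, field names, and the 1 TiB threshold are assumptions for illustration, not Longhorn's actual format.

```go
package main

import "fmt"

// Hypothetical v2 backup config: a format version plus a per-backup block
// size lets new clients pick larger blocks while old 2 MiB gzip backups
// remain readable via the version check.
type backupConfig struct {
	Version           int    // 1 = legacy 2 MiB gzip; 2 = configurable
	BlockSize         int64  // bytes; e.g. 2, 16, or 64 MiB
	CompressionMethod string // "none", "lz4", or "gzip"
}

// defaultBlockSize picks a block size from the volume size: larger volumes
// could switch to 64 MiB blocks to cut the number of S3 lookups.
func defaultBlockSize(volumeSize int64) int64 {
	const MiB = 1 << 20
	if volumeSize >= 1<<40 { // >= 1 TiB
		return 64 * MiB
	}
	return 16 * MiB
}

func main() {
	cfg := backupConfig{
		Version:           2,
		BlockSize:         defaultBlockSize(50 << 30), // 50 GiB volume
		CompressionMethod: "lz4",
	}
	fmt.Println(cfg.BlockSize) // prints 16777216
}
```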

@innobead
Member

innobead commented Jan 6, 2023

This feature is straightforward, but it still requires an LEP to make the feature cycle complete.

@derekbit
Member Author

derekbit commented Jan 6, 2023

Continue the test of #5189 (comment)

The restore throughput is increased by 140% when using lz4 and 10 concurrent threads, in comparison with the restore in Longhorn v1.4.0.

[chart: Multi-Threaded Restore]

Note: Restore hit the I/O bound of the MinIO server; the restore throughput saturates at 5 worker threads.

@longhorn-io-github-bot

longhorn-io-github-bot commented Jan 6, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

#5189 (comment)

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

#5201

  • Has the backend code (Manager, Engine, Instance Manager, BackupStore, etc.) been merged (including backport-needed/*)?
    The PR is at

longhorn/backupstore#94
longhorn/longhorn-engine#820
longhorn/longhorn-instance-manager#171
longhorn/longhorn-manager#1670

  • Which areas/issues this PR might have potential impacts on?
    Area: backup/restore
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at Add LEP for Improve Backup and Restore Efficiency using Multiple Threads and Faster Compression Methods #5231

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only a test case skeleton exists without implementation, has an implementation issue been created (including backport-needed/*)?
    The automation skeleton PR is at
    The automation test case PR is at
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces code for backward compatibility: has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@zamazan4ik

Could you please also add zstd to your tests? Thanks!

@derekbit
Member Author

derekbit commented Feb 7, 2023

Could you please also add zstd to your tests? Thanks!

I tested zstd before, and you can refer to #1409 (comment).
If you would like to introduce zstd, please open a feature request. Thank you.

@roger-ryao roger-ryao self-assigned this Feb 9, 2023
@roger-ryao

roger-ryao commented Feb 13, 2023

Verified on master-head 20230213

  • longhorn master-head (b1ed058)
  • longhorn-engine master-head (58d75a7)
  • longhorn-instance-manager master-head (6636abe)
  • longhorn-manager master-head (37500d0)

Pre-requisite
Backupstore target: external MinIO S3

The test steps

  1. Deploy Longhorn master-head
  2. Create one 50GiB volume vol-0 and then write 20GiB data
  3. Get the data md5sum
  4. Setting > General > Backup Compression Method > none
  5. Backup vol-0
  6. Restore vol-0 from the backup.
  7. Create one 50GiB volume vol-2 and then write 20GiB data
  8. Get the data md5sum
  9. Setting > General > Backup Compression Method > lz4
  10. Backup vol-2
  11. Restore vol-2 from the backup.
  12. Create one 50GiB volume vol-3 and then write 20GiB data
  13. Get the data md5sum
  14. Setting > General > Backup Compression Method > gzip
  15. Backup vol-3
  16. Restore vol-3 from the backup.
  17. Download the support bundle and check instance-manager-r-XXXX/replica-manager.log

Result Passed

  1. The backup execution time is as expected.

Screenshot_20230213_205427

Screenshot_20230213_205446

2023-02-13T12:15:39.292064692Z [vol-3-r-00ffea28] time="2023-02-13T12:15:39Z" level=info msg="Starting backup for &{vol-3 53687091200 map[KubernetesStatus:{\"pvName\":\"vol-3\",\"pvStatus\":\"Bound\",\"namespace\":\"default\",\"pvcName\":\"vol-3\",\"lastPVCRefAt\":\"\",\"workloadsStatus\":[{\"podName\":\"ubuntu-mountvol3\",\"podStatus\":\"Running\",\"workloadName\":\"\",\"workloadType\":\"\"}],\"lastPodRefAt\":\"\"} VolumeRecurringJobInfo:{} longhorn.io/volume-access-mode:rwo] 2023-02-13T12:15:39Z   0   none}, snapshot &{f6422380-bbb3-425a-9c95-e1afeb3d9915 2023-02-13T12:15:39Z}, dest s3://backupbucket@us-east-1/" pkg=backup
2023-02-13T12:15:39.300818235Z [vol-3-r-00ffea28] time="2023-02-13T12:15:39Z" level=info msg="Loaded driver for s3://backupbucket@us-east-1/" pkg=s3
2023-02-13T12:15:41.323954291Z [vol-3-r-00ffea28] time="2023-02-13T12:15:41Z" level=info msg="Added backupstore volume vol-3" pkg=backupstore
2023-02-13T12:15:41.498046649Z [vol-3-r-00ffea28] time="2023-02-13T12:15:41Z" level=info msg="Done initiating backup creation, received backupID: backup-743b9e0b2f3b4b84"
2023-02-13T12:15:41.502717391Z [vol-3-r-00ffea28] time="2023-02-13T12:15:41Z" level=info msg="Volume vol-3 Snapshot f6422380-bbb3-425a-9c95-e1afeb3d9915 is consist of 10569 mappings and 10569 blocks"
2023-02-13T12:24:34.147623000Z [vol-3-r-00ffea28] time="2023-02-13T12:24:34Z" level=info msg="Created snapshot changed blocks: 10569 mappings, 10569 blocks and 10026 new blocks" event=backup object=snapshot pkg=backupstore reason=complete snapshot=f6422380-bbb3-425a-9c95-e1afeb3d9915

supportbundle_6472c308-3572-4f24-98f4-402ad495a019_2023-02-13T12-41-39Z.zip

@derekbit
Member Author

Verified on master-head 20230213 [full verification comment quoted above]

Weird. Why are the performances of the three compression methods so close?
Can you try setting the compression method first and then creating a new volume? Then check whether volume.spec.backupCompressionMethod is as expected.
Thank you.

@roger-ryao roger-ryao reopened this Feb 13, 2023
@roger-ryao

roger-ryao commented Feb 14, 2023

Weird. Why are the performances of the three compression methods so close? Can you try setting the compression method first and then creating a new volume? Then check whether volume.spec.backupCompressionMethod is as expected. Thank you.

Hi @derekbit :
I tested it on AWS with instance type t2.xlarge. I set the compression method first and then created the volume; volume.spec.backupCompressionMethod is as expected.

Threads = 1
Screenshot_20230213_234450

Threads = 5
Screenshot_20230213_234429

The test steps

  1. Deploy Longhorn master-head
  2. Setting > General > Backup Compression Method > none
  3. Create one 50GiB volume vol-0 and then write 20GiB data
  4. Backup vol-0
  5. Setting > General > Backup Compression Method > lz4
  6. Create one 50GiB volume vol-2 and then write 20GiB data
  7. Backup vol-2
  8. Setting > General > Backup Compression Method > gzip
  9. Create one 50GiB volume vol-3 and then write 20GiB data
  10. Backup vol-3
  11. Download support bundle and check instance-manager-r-XXXX/replica-manager.log
    supportbundle_6472c308-3572-4f24-98f4-402ad495a019_2023-02-13T15-45-13Z.zip

If we create the volume first and then change the global compression method, would the volume's compression method change when we back it up?

@derekbit
Member Author

If we create the volume first and then change the global compression method, would the volume's compression method change when we back it up?

No. The compression method of a newly created volume is determined by the global compression method at creation time. It cannot be changed afterward.

@derekbit
Member Author

I tested it on AWS with instance type t2.xlarge. I set the compression method first and then created the volume; volume.spec.backupCompressionMethod is as expected.

Yeah, it's expected. Thank you.

@roger-ryao

Hi @derekbit :
I verified it again. The details are below.

Verified on master-head 20230214

  • longhorn master-head (bb1bd7d)
  • longhorn-engine master-head (a26f857)
  • longhorn-instance-manager master-head (18affde)
  • longhorn-manager master-head (37500d0)

Pre-requisite
Backupstore target: external MinIO S3
instance type : c5d.2xlarge

The test steps

  1. Deploy Longhorn master-head
  2. Setting > General > Backup Compression Method > none
  3. Create one 50GiB volume vol-0 and then write 20GiB data
    dd if=/dev/urandom of=/data/20G count=20000 bs=1M conv=fsync status=progress
  4. Get the data md5sum and check vol-0 volume.spec.backupCompressionMethod
    kubectl -n longhorn-system describe volumes.longhorn.io vol-0 |grep "Backup Compression Method"
  5. Backup vol-0
  6. Restore vol-0 from the backup.
  7. Check md5sum of the data in the restored volume
  8. Setting > General > Backup Compression Method > lz4
  9. Create one 50GiB volume vol-2 and then write 20GiB data
    dd if=/dev/urandom of=/data/20G count=20000 bs=1M conv=fsync status=progress
  10. Get the data md5sum and check vol-2 volume.spec.backupCompressionMethod
    kubectl -n longhorn-system describe volumes.longhorn.io vol-2 |grep "Backup Compression Method"
  11. Backup vol-2
  12. Restore vol-2 from the backup.
  13. Check md5sum of the data in the restored volume
  14. Setting > General > Backup Compression Method > gzip
  15. Create one 50GiB volume vol-3 and then write 20GiB data
    dd if=/dev/urandom of=/data/20G count=20000 bs=1M conv=fsync status=progress
  16. Get the data md5sum and check vol-3 volume.spec.backupCompressionMethod
    kubectl -n longhorn-system describe volumes.longhorn.io vol-3 |grep "Backup Compression Method"
  17. Backup vol-3
  18. Restore vol-3 from the backup.
  19. Check md5sum of the data in the restored volume
  20. Download support bundle and check instance-manager-r-XXXX/replica-manager.log

The test results

  1. The volume.spec.backupCompressionMethod is as expected.
    Screenshot_20230214_105921
    Screenshot_20230214_112024
    Screenshot_20230214_114811

  2. The performance of the three compression methods is as expected
    Screenshot_20230214_125140
    supportbundle_50213bd3-9b2e-4bd1-92f5-02761334ffc7_2023-02-14T04-39-25Z.zip
