Refactor to support PyTorch 2.0 and Lightning 2.0 #72

Merged
yqzhishen merged 79 commits into openvpi:refactor-v2 on Apr 4, 2023

Conversation

@hrukalive commented on Mar 24, 2023

  • Use the official PyTorch Lightning 2.0 as the framework, with base_task and acoustic_task adapted to it.
    • Automatic multi-node training (CPU, single-GPU, multi-GPU, etc.).
    • Training with other precision settings, e.g. 'bf16', is now supported.
    • Gradient accumulation works correctly.
    • Checkpointing keeps the K most recent validation checkpoints, and still supports the original permanent checkpoints and checkpoint interval.
    • Re-implemented the batch sampler and its distributed version as subclasses of PyTorch's Sampler class.
      • Sample shuffling with grouping by similar frame count is reimplemented (a sketch follows this list).
    • Main-process checks now use PL's rank_zero utilities.
  • Upgraded to PyTorch 2.0; support for torch.compile is yet to be tested.
    • For compatibility with most deep learning service providers, PyTorch 1.12/1.13 is also supported; however, Lightning must be version 2.0.0.
    • (In scripts/train.py, the environment variable TORCH_CUDNN_V8_API_ENABLED is set to prevent excessive slowdown when using 16-bit precision. If it causes any problems, try commenting it out.)
  • Use HDF5 as the new binarized dataset format to avoid potential file-handle sharing and races when seeking within the file (see the sketch after this list).
  • Updated the data preparation notebook to reflect several PL-related parameters.
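
The grouped shuffling itself is not spelled out here, so as a rough orientation only: the sketch below shows one common way to implement the idea as a subclass of PyTorch's Sampler (quantize each sample's frame count to a grid, then shuffle within each bin every epoch). The class name, the frame_counts argument, and the exact rounding are assumptions for illustration; the PR's actual code is a batch sampler and differs in detail.

```python
# Minimal sketch of grouped shuffling as a Sampler subclass; an illustration of
# the general technique, not the PR's actual implementation. The class name and
# the `frame_counts` argument are assumptions.
import random
from torch.utils.data import Sampler


class SimilarLengthShuffleSampler(Sampler):
    def __init__(self, frame_counts, grid=6, seed=0):
        self.frame_counts = list(frame_counts)  # frames per sample, assumed known from the dataset
        self.grid = grid                        # plays the role of sampler_frame_count_grid
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Call once per epoch so each epoch gets a different shuffle.
        self.epoch = epoch

    def __iter__(self):
        rng = random.Random(self.seed + self.epoch)
        # Quantize lengths to the grid; samples in the same bin count as similar.
        bins = [n // self.grid for n in self.frame_counts]
        indices = list(range(len(self.frame_counts)))
        rng.shuffle(indices)                 # random order first...
        indices.sort(key=lambda i: bins[i])  # ...then group by bin (stable sort keeps
                                             # the shuffled order inside each bin)
        return iter(indices)

    def __len__(self):
        return len(self.frame_counts)
```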
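
On the HDF5 point, the usual way to avoid handle contention with multi-worker DataLoaders is to open the file lazily inside each worker instead of in the parent process. The sketch below only illustrates that pattern; the dataset layout it assumes (an 'items' group keyed by index) is invented and is not the format introduced by this PR.

```python
# Minimal sketch of lazily opening an HDF5 file per DataLoader worker so that
# no file handle is shared across processes. The layout is a made-up assumption.
import h5py
from torch.utils.data import Dataset


class H5BinarizedDataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self._file = None  # opened lazily inside each worker process
        with h5py.File(h5_path, 'r') as f:
            self._length = len(f['items'])

    def _handle(self):
        if self._file is None:  # first access in this process
            self._file = h5py.File(self.h5_path, 'r')
        return self._file

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        item = self._handle()['items'][str(index)]
        # Read every array in the item group into memory.
        return {key: item[key][()] for key in item.keys()}
```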

New parameter explanation:

  • PyTorch Lightning related:
    • Exposed pl_trainer_accelerator, pl_trainer_devices, pl_trainer_num_nodes, pl_trainer_strategy, and pl_trainer_precision; see the PyTorch Lightning 2.0 documentation for how they are used by the Trainer (a rough mapping sketch follows this list).
    • pl_trainer_devices can be:
      • pl_trainer_devices: 'auto': auto-select
      • pl_trainer_devices: 2: use two accelerators, auto-selected
      • pl_trainer_devices: [2, 3]: use accelerators number 2 and 3
    • ddp_backend: choose from 'gloo', 'nccl', or 'nccl_no_p2p'.
  • DataLoader related:
    • sampler_frame_count_grid in config/base.yaml: Random shuffling of samples with similar sizes is now correctly supported. First, each sample's length is rounded to a multiple of sampler_frame_count_grid (default 6); then, within each bin, samples are shuffled every epoch.
    • dataloader_prefetch_factor in config/base.yaml: Setting for PyTorch's DataLoader (refer to PyTorch doc).
  • Effective batch size on Multi-GPU and gradient accumulation:
    • max_tokens and max_sentences always control the batch size on a single device. The effective batch size is the sum across all devices.
    • accumulate_grad_batches lets you accumulate gradients over multiple batches before each optimizer step, effectively increasing the batch size.
    • Combining them, for example (suppose max_sentences is dominant): with 4 GPUs, max_sentences=8, and accumulate_grad_batches=2, the effective batch size is 4*8*2=64.
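
For orientation only, here is a rough sketch of how these keys correspond to PyTorch Lightning 2.0 Trainer and PyTorch DataLoader arguments. The mapping is simplified, and the placeholder dataset and values are assumptions, not the project's actual training code.

```python
# Rough, simplified illustration of how the new config keys map onto PyTorch
# Lightning 2.0 / PyTorch arguments; not the project's actual training code.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

trainer = pl.Trainer(
    accelerator='auto',         # pl_trainer_accelerator (e.g. 'gpu')
    devices='auto',             # pl_trainer_devices: 'auto', an int such as 2, or a list such as [2, 3]
    num_nodes=1,                # pl_trainer_num_nodes
    strategy='auto',            # pl_trainer_strategy; ddp_backend picks 'gloo', 'nccl', or 'nccl_no_p2p'
    precision='bf16',           # pl_trainer_precision
    accumulate_grad_batches=2,  # accumulate_grad_batches
)

# dataloader_prefetch_factor is passed through to PyTorch's DataLoader;
# it only takes effect when num_workers > 0.
dataset = TensorDataset(torch.zeros(16, 1))  # placeholder dataset for illustration
loader = DataLoader(dataset, batch_size=8, num_workers=4, prefetch_factor=2)

# Effective batch size (when max_sentences is the dominant limit):
#   number_of_devices * max_sentences * accumulate_grad_batches
```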

@hrukalive (Author) commented

Performance is tightly linked to the grid resolution when shuffling and sorting samples by similar lengths. When fully sorted, performance does not drop compared to the original codebase.

@yqzhishen changed the base branch from refactor-pl to refactor-v2 on March 30, 2023 at 04:21
@yqzhishen (Member) left a comment

This can serve as a temporary solution for ONNX export. The deployment script itself needs to be refactored to support newer PyTorch versions.

@yqzhishen (Member) commented

Performance issues have been addressed, so I changed the base branch back to refactor-v2.

@yqzhishen merged commit c3b8ac6 into openvpi:refactor-v2 on Apr 4, 2023
@hrukalive deleted the refactor-pl branch on April 8, 2023 at 04:59