Refactor to support PyTorch 2.0 and Lightning 2.0 #72

Merged
yqzhishen merged 79 commits into openvpi:refactor-v2 on Apr 4, 2023

Conversation

@hrukalive commented on Mar 24, 2023

  • Use the official PyTorch Lightning 2.0 as the framework, with base_task and acoustic_task adapted to it.
    • Automatic multi-node training (CPU, single-GPU, multi-GPU, etc.).
    • Training with other precision settings, e.g. 'bf16', is now supported.
    • Gradient accumulation works correctly.
    • Checkpointing keeps the K most recent validation checkpoints, and still supports the original permanent checkpoints and checkpoint interval.
    • Re-implemented the batch sampler and its distributed version as subclasses of PyTorch's Sampler class.
      • Sample shuffling with grouping by similar frame count is reimplemented (a sketch follows this list).
    • Main-process checks now use PL's rank_zero utilities.
  • Upgraded to PyTorch 2.0; support for torch.compile is yet to be tested.
    • For compatibility with most deep learning service providers, PyTorch 1.12/1.13 is also supported; however, Lightning must be version 2.0.0.
    • (In scripts/train.py, the environment variable TORCH_CUDNN_V8_API_ENABLED is set to prevent excessive slowdown when using 16-bit precision. If it causes any problems, try commenting it out.)
  • Use HDF5 as the new binarized dataset format to avoid potential file-handle sharing and races when seeking within the file (see the sketch after this list).
  • Updated the data preparation notebook to reflect several PL-related parameters.
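
The grouped shuffling itself is not spelled out here, so as a rough orientation only: the sketch below shows one common way to implement the idea as a subclass of PyTorch's Sampler (quantize each sample's frame count to a grid, then shuffle within each bin every epoch). The class name, the frame_counts argument, and the exact rounding are assumptions for illustration; the PR's actual code is a batch sampler and differs in detail.

```python
# Minimal sketch of grouped shuffling as a Sampler subclass; an illustration of
# the general technique, not the PR's actual implementation. The class name and
# the `frame_counts` argument are assumptions.
import random
from torch.utils.data import Sampler


class SimilarLengthShuffleSampler(Sampler):
    def __init__(self, frame_counts, grid=6, seed=0):
        self.frame_counts = list(frame_counts)  # frames per sample, assumed known from the dataset
        self.grid = grid                        # plays the role of sampler_frame_count_grid
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Call once per epoch so each epoch gets a different shuffle.
        self.epoch = epoch

    def __iter__(self):
        rng = random.Random(self.seed + self.epoch)
        # Quantize lengths to the grid; samples in the same bin count as similar.
        bins = [n // self.grid for n in self.frame_counts]
        indices = list(range(len(self.frame_counts)))
        rng.shuffle(indices)                 # random order first...
        indices.sort(key=lambda i: bins[i])  # ...then group by bin (stable sort keeps
                                             # the shuffled order inside each bin)
        return iter(indices)

    def __len__(self):
        return len(self.frame_counts)
```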
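
On the HDF5 point, the usual way to avoid handle contention with multi-worker DataLoaders is to open the file lazily inside each worker instead of in the parent process. The sketch below only illustrates that pattern; the dataset layout it assumes (an 'items' group keyed by index) is invented and is not the format introduced by this PR.

```python
# Minimal sketch of lazily opening an HDF5 file per DataLoader worker so that
# no file handle is shared across processes. The layout is a made-up assumption.
import h5py
from torch.utils.data import Dataset


class H5BinarizedDataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self._file = None  # opened lazily inside each worker process
        with h5py.File(h5_path, 'r') as f:
            self._length = len(f['items'])

    def _handle(self):
        if self._file is None:  # first access in this process
            self._file = h5py.File(self.h5_path, 'r')
        return self._file

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        item = self._handle()['items'][str(index)]
        # Read every array in the item group into memory.
        return {key: item[key][()] for key in item.keys()}
```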

New parameter explanation:

  • PyTorch Lightning related:
    • Exposed pl_trainer_accelerator, pl_trainer_devices, pl_trainer_num_nodes, pl_trainer_strategy, and pl_trainer_precision; see the PyTorch Lightning 2.0 documentation for how they are used by the Trainer (a rough mapping sketch follows this list).
    • pl_trainer_devices can be:
      • pl_trainer_devices: 'auto': auto-select
      • pl_trainer_devices: 2: use two accelerators, auto-selected
      • pl_trainer_devices: [2, 3]: use accelerators number 2 and 3
    • ddp_backend: choose from 'gloo', 'nccl', or 'nccl_no_p2p'.
  • DataLoader related:
    • sampler_frame_count_grid in config/base.yaml: Random shuffling of samples with similar sizes is now correctly supported. First, each sample's length is rounded to a multiple of sampler_frame_count_grid (default 6); then, within each bin, samples are shuffled every epoch.
    • dataloader_prefetch_factor in config/base.yaml: Setting for PyTorch's DataLoader (refer to PyTorch doc).
  • Effective batch size on Multi-GPU and gradient accumulation:
    • max_tokens and max_sentences always control the batch size on a single device. The effective batch size is the sum across all devices.
    • accumulate_grad_batches lets you accumulate gradients over multiple batches before each optimizer step, effectively increasing the batch size.
    • Combining them, for example (suppose max_sentences is dominant): with 4 GPUs, max_sentences=8, and accumulate_grad_batches=2, the effective batch size is 4*8*2=64.
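
For orientation only, here is a rough sketch of how these keys correspond to PyTorch Lightning 2.0 Trainer and PyTorch DataLoader arguments. The mapping is simplified, and the placeholder dataset and values are assumptions, not the project's actual training code.

```python
# Rough, simplified illustration of how the new config keys map onto PyTorch
# Lightning 2.0 / PyTorch arguments; not the project's actual training code.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

trainer = pl.Trainer(
    accelerator='auto',         # pl_trainer_accelerator (e.g. 'gpu')
    devices='auto',             # pl_trainer_devices: 'auto', an int such as 2, or a list such as [2, 3]
    num_nodes=1,                # pl_trainer_num_nodes
    strategy='auto',            # pl_trainer_strategy; ddp_backend picks 'gloo', 'nccl', or 'nccl_no_p2p'
    precision='bf16',           # pl_trainer_precision
    accumulate_grad_batches=2,  # accumulate_grad_batches
)

# dataloader_prefetch_factor is passed through to PyTorch's DataLoader;
# it only takes effect when num_workers > 0.
dataset = TensorDataset(torch.zeros(16, 1))  # placeholder dataset for illustration
loader = DataLoader(dataset, batch_size=8, num_workers=4, prefetch_factor=2)

# Effective batch size (when max_sentences is the dominant limit):
#   number_of_devices * max_sentences * accumulate_grad_batches
```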

@hrukalive (Author) commented

Performance is tightly linked to the grid resolution when shuffling and sorting samples by similar lengths. When fully sorted, performance does not drop compared to the original codebase.

@yqzhishen changed the base branch from refactor-pl to refactor-v2 on March 30, 2023 at 04:21
@yqzhishen (Member) left a comment

This can serve as a temporary solution for ONNX export. The deployment script itself needs to be refactored to support newer PyTorch versions.

@yqzhishen (Member) commented

Performance issues have been addressed, so I changed the base branch back to refactor-v2.

@yqzhishen merged commit c3b8ac6 into openvpi:refactor-v2 on Apr 4, 2023
@hrukalive deleted the refactor-pl branch on April 8, 2023 at 04:59