
[TorchAcc][Experimental] Integrate TorchAcc. #647

Merged — 11 commits merged into modelscope:main on Apr 9, 2024

Conversation

baoleai (Collaborator) commented on Apr 2, 2024

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Model or Dataset Support

PR information

TorchAcc is a framework developed by Alibaba PAI to accelerate PyTorch model training. It provides computational acceleration based on compilation optimization and on distributed strategies such as FSDP and TP+SP. This PR uses TorchAcc to accelerate Swift SFT training in both the LoRA and full-parameter scenarios, and adds example scripts for qwen-72b-chat. Users can enable TorchAcc acceleration by setting export USE_TORCHACC=1. This feature is still experimental and currently only available internally.

Experiment results

Tested on 4 * 80G A100 GPUs with qwen-72b-chat LoRA, using the script:
sh examples/pytorch/llm/scripts/qwen_72b_chat/torchacc/lora_fsdp_sft.sh

{"eval_loss": 0.35520425, "eval_acc": 0.87996558, "eval_runtime": 753.7452, "eval_samples_per_second": 0.362, "eval_steps_per_second": 0.004, "epoch": 1.0, "global_step": 1123}
{"train_runtime": 13049.8116, "train_samples_per_second": 2.065, "train_steps_per_second": 0.086, "total_flos": 0.0, "train_loss": 0.36302515, "epoch": 1.0, "global_step": 1123}

NNODES=4 \
NPROC_PER_NODE=8 \
swift sft \
--model_type qwen-72b-chat \
Collaborator: The alignment here needs to be fixed.

Author: fixed

--gradient_accumulation_steps 1 \
--gradient_checkpointing no \
--tuner_backend 'peft' \
--eval_steps 2000000 \
Collaborator: Isn't this eval_steps value too large?

Author: fixed

--save_steps 2000000 \
--logging_steps 10 \
--preprocess_num_proc 1 \
--dataloader_num_workers 0 \
Collaborator: Suggest changing this to 4 to improve processing efficiency.

Author: fixed

@@ -0,0 +1,31 @@
# Experimental environment: 4 * A800
Collaborator: Same as above.

Author: fixed

# Patching below wraps the model and would make these properties wrong, so read them first.
label_names = find_labels(model)
return_loss = can_return_loss(model)
model = ta.patch_qwen_model(model)
Collaborator: Is only qwen supported at the moment? This method name seems a bit model-specific.

Author: This will be addressed in a follow-up PR.

swift/llm/sft.py Outdated
sft_main = get_main(SftArguments, llm_sft)

def get_sft_main(args, llm):
    if use_torchacc():
        logger.warning('TorchAcc is currently only available internally.')
Collaborator: Suggest replacing "internally" with a more concrete description of the scenario; otherwise users will wonder what counts as an internal scenario.

Author: fixed

    return [max_length // 4 * (i + 1) for i in range(4)]


def _get_bucket(bucket_sizes, data_length):
Collaborator: Suggest moving the TorchAcc code into its own separate .py file.

Author: fixed
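
For context, a sketch of what this bucketing pair computes. get_bucket_sizes is taken from the diff above; the _get_bucket body is an assumption, since the diff only shows its signature:

def get_bucket_sizes(max_length):
    # From the diff above: buckets at 1/4, 2/4, 3/4 and 4/4 of max_length.
    return [max_length // 4 * (i + 1) for i in range(4)]

def _get_bucket(bucket_sizes, data_length):
    # Assumed body: pick the smallest bucket that fits the sample, so padded
    # batches come in a few fixed shapes and TorchAcc's XLA compiler only
    # compiles one graph per bucket instead of one per sequence length.
    for size in bucket_sizes:
        if data_length <= size:
            return size
    return bucket_sizes[-1]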

@@ -428,6 +453,32 @@ def data_collator(self,
            loss_scale, batch_first=True, padding_value=0.)
        labels = pad_sequence(labels, batch_first=True, padding_value=-100)

        if use_torchacc():
            rank, _, world_size, _ = get_dist_setting()
Collaborator: Wrap this in a method and move it into the dedicated TorchAcc .py file, so that readers of this code are not confused.

Author: done
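
A hypothetical shape for the extracted helper. The name ta_pad_to_bucket and the padding values are assumptions, and the sketch omits the per-rank batch splitting that the diff's rank/world_size variables suggest:

import torch.nn.functional as F

def ta_pad_to_bucket(input_ids, labels, bucket_sizes):
    # Assumed helper: pad the already right-padded batch up to the nearest
    # bucket length so XLA sees only a few distinct input shapes.
    bucket = _get_bucket(bucket_sizes, input_ids.shape[-1])
    pad = bucket - input_ids.shape[-1]
    input_ids = F.pad(input_ids, (0, pad), value=0)
    labels = F.pad(labels, (0, pad), value=-100)  # -100 is the ignore index
    return input_ids, labels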

        if not use_torchacc():
            return super()._save_tpu(output_dir)

        import torch_xla.core.xla_model as xm
Collaborator: Likewise, suggest wrapping this in its own method and moving it into the separate .py file.

Author: done
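
A minimal sketch of what such a save helper could look like; the function name and file layout are assumptions, and only the torch_xla import comes from the diff:

import os
import torch_xla.core.xla_model as xm

def ta_save_checkpoint(model, output_dir):
    # xm.save moves XLA tensors to CPU and, by default, writes only on the
    # master ordinal, so every rank can call this without clobbering files.
    xm.mark_step()  # flush pending lazy ops before reading the state dict
    xm.save(model.state_dict(), os.path.join(output_dir, 'pytorch_model.bin'))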

            return super().get_train_dataloader()
        else:
            # Patch skip_first_batches for the customized dataloader.
            def acc_skip_first_batches(dataloader, num_batches=0):
Collaborator: Same as above.

Author: done
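
A plausible body for this patch, assuming the custom dataloader only needs already-consumed batches dropped when training resumes (the actual implementation in the PR may differ):

import itertools

def acc_skip_first_batches(dataloader, num_batches=0):
    # Assumed behavior: transformers' stock skip_first_batches expects an
    # accelerate-prepared DataLoader, so the customized TorchAcc loader gets
    # a simple replacement that skips the first num_batches batches.
    return itertools.islice(iter(dataloader), num_batches, None)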

swift/llm/sft.py Outdated
@@ -181,8 +203,21 @@ def llm_sft(args: SftArguments) -> Dict[str, Union[str, Any]]:
    if val_dataset is not None:
        val_dataset = LazyLLMDataset(val_dataset, template)

    bucket_sizes = get_bucket_sizes(
Collaborator: Could there be a problem here?

Author (Apr 8, 2024): deleted.

tastelikefeet merged commit 1e9f8be into modelscope:main on Apr 9, 2024
2 checks passed
tastelikefeet added a commit to tastelikefeet/swift that referenced this pull request Apr 10, 2024
* main:
  update Agent best practice with Modelscope-Agent (modelscope#676)
  [TorchAcc][Experimental] Integrate TorchAcc. (modelscope#647)