Support deepspeed dynamo #2460
Conversation
5f23187 to 2763315 (Compare)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thank you @oraluben for adding the torch compile support for DeepSpeed ✨! It would be great to have related tests in the `tests/deepspeed/deepspeed.py` file.
…rate into support-deepspeed-dynamo
Hello, thank you @oraluben for the changes. Please look at the reply to your comment; the basic runs fail with compile, so we first need to see how to make this feature usable.
Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Thanks! I've committed your dynamo fix, and I'll look at the failed test.
485ce27 to b7c5924 (Compare)
This is ready to be reviewed again :) @pacman100
397e301 to a3ce1df (Compare)
Hello @oraluben, thank you for the updates to the PR; I've left a comment. The PR is almost good to merge. I also spent quite some time testing this. Here is a complete minimal example using DeepSpeed+Dynamo and the env variables you suggested:
- Clone repo https://github.com/sgugger/torchdynamo-tests
- Change the config configs/dynamo_fp16.yaml to include DeepSpeed:
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: 'DEEPSPEED'
downcast_bf16: 'no'
dynamo_backend: INDUCTOR
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
- Run the below command:
TORCHDYNAMO_DEBUG_FUNCTION=forward accelerate launch --config_file configs/dynamo_fp16.yaml scripts/text_classification.py --task_name mrpc --dynamo_backend "inductor" --batch_size 8
Output:
json = {
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none",
            "nvme_path": null
        },
        "offload_param": {
            "device": "none",
            "nvme_path": null
        },
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_clipping": 1.0,
    "compile": {
        "enabled": true,
        "backend": "inductor"
    },
    "steps_per_print": inf,
    "fp16": {
        "enabled": false
    },
    "bf16": {
        "enabled": false
    },
    "zero_allow_untested_optimizer": true
}
...
Training Accuracy for backend inductor at epoch 0: {'accuracy': 0.7355349344978166, 'f1': 0.8169280181371624}
67%|█████████████████████████████████████████████████████████████████████████████████████████▏ | 457/687 [01:49<00:44, 5.15it/s]03/20/2024 07:15:49 - INFO - accelerate.accelerator - The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
67%|█████████████████████████████████████████████████████████████████████████████████████████▎ | 458/687 [01:49<00:44, 5.16it/s]Training Accuracy for backend inductor at epoch 1: {'accuracy': 0.8681768558951966, 'f1': 0.902127659574468}
Training Accuracy for backend inductor at epoch 1: {'accuracy': 0.8681768558951966, 'f1': 0.902127659574468}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 686/687 [02:35<00:00, 5.09it/s]03/20/2024 07:16:36 - INFO - accelerate.accelerator - The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 687/687 [02:35<00:00, 5.11it/s]Training Accuracy for backend inductor at epoch 2: {'accuracy': 0.954967248908297, 'f1': 0.9664838513101768}
Training finished.
First iteration took: 17.61s
Average time after the first iteration: 201.59ms
Training Accuracy for backend inductor at epoch 2: {'accuracy': 0.954967248908297, 'f1': 0.9664838513101768}
Training finished.
First iteration took: 17.84s
Average time after the first iteration: 201.58ms
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING] function: 'forward' (/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:46)
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING] last reason: tensor 'L['input']' requires_grad mismatch. expected requires_grad=1
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING] function: 'forward' (/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:46)
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING] last reason: tensor 'L['input']' requires_grad mismatch. expected requires_grad=1
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
03/20/2024 07:16:45 - INFO - accelerate.accelerator - The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
Evaluation finished.
First iteration took: 7.44s
Average time after the first iteration: 77.50ms
Evaluation finished.
First iteration took: 6.44s
Average time after the first iteration: 119.29ms
Test Accuracy for backend inductor: {'accuracy': 0.8525, 'f1': 0.8952042628774423}
Test Accuracy for backend inductor: {'accuracy': 0.8525, 'f1': 0.8952042628774423}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 687/687 [02:45<00:00, 4.16it/s]
Speed-up:
| Dynamo | FP16 |
|---|---|
| z3+no | 188.64ms/111.11ms |
| z3+inductor | 201.58ms/119.29ms |
Don't see any savings at all 😅.
# dynamo itself has some issues; use the below to only compile `forward` for testing.
# On the deepspeed side, `deepspeed.util.z3_leaf_module.[un]set_z3_leaf_modules` is used for a similar
# case where users want to compile/skip a specific module.
"TORCHDYNAMO_DEBUG_FUNCTION": "forward",
Why is this required? The above comment is unclear. A detailed explanation in this PR and a slightly more detailed comment would help clarify what needs to be done to get DeepSpeed+Compile working.
Dynamo still doesn't support all Python ops and may cause graph breaks or even failures during compile. I didn't dive into the details of the failure; instead I used this whitelist to tell Dynamo to only compile the forward function, which also breaks a module into several functions. This is why you didn't see an improvement when comparing inductor with no dynamo.
Based on my experience using Dynamo with large models, users should specify which module they want to compile as one unit, which can be done with the mentioned DeepSpeed API by modifying user code (see the sketch below). The focus of the test in this PR is on determining whether Dynamo is enabled; I did not evaluate its performance.
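For context, a minimal sketch of the DeepSpeed API referred to above (assuming `set_z3_leaf_modules`/`unset_z3_leaf_modules` are exported from `deepspeed.utils` in the installed DeepSpeed version; the Llama model and layer class are only an example, not part of this PR):

```python
# Sketch: mark every decoder layer as a ZeRO-3 "leaf" module so the Z3 hooks
# stop recursing into it; this lets users choose the granularity at which
# modules are compiled or skipped, per the comment above.
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from deepspeed.utils import set_z3_leaf_modules, unset_z3_leaf_modules

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
set_z3_leaf_modules(model, [LlamaDecoderLayer])  # treat each decoder layer as one unit
# ... then hand the model to Accelerate / deepspeed.initialize as usual;
# unset_z3_leaf_modules(model, [LlamaDecoderLayer]) reverts the marking.
```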
In our internal scenario, Dynamo can deliver a ~10% speedup if each LLamaDecodeLayer is compiled as one unit, without using the env variable (a rough sketch of this follows).
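Not the internal setup itself, but a hedged sketch of what compiling each decoder layer as one unit could look like in user code (a standard 🤗 Transformers Llama model is assumed here purely for illustration):

```python
# Sketch: compile every decoder layer as its own Dynamo graph instead of
# compiling the whole model (or restricting compilation with
# TORCHDYNAMO_DEBUG_FUNCTION as in the test above).
import torch
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
for module in model.modules():
    if isinstance(module, LlamaDecoderLayer):
        # Compiling the bound forward keeps the module object (and any hooks
        # registered on it) intact while giving Dynamo one graph per layer.
        module.forward = torch.compile(module.forward, backend="inductor")
```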
cc @tohtana and @tjruwase in case you have ideas about this and the steps to overcome it.
Hi @oraluben, @pacman100, thank you for your report! Sorry for my late response. We found that the error is caused by the requires_grad mismatch in DeepSpeed's custom Linear function (deepspeed/runtime/zero/linear.py) shown in the recompilation warnings above. I will try to fix this by compiling again when the grad mode is changed.
Hi @oraluben, @pacman100,
I forcibly disabled the custom Linear function and the Z3 hook function, and the above example worked.
That sounds like I'm initializing the config in the wrong place; can you give some advice on the proper way? @pacman100 On the other hand, I'm submitting this patch to torch: pytorch/pytorch#124273. I think it's safe to land this PR if that patch goes into torch.
…rate into support-deepspeed-dynamo
@umchand, FYI
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
not stale, still working on that
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
This PR tries to make 🤗 accelerate/transformers respect microsoft/DeepSpeed#4878.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@pacman100 since it's deepspeed related, and @tohtana since you implemented the deepspeed part.