Support deepspeed dynamo #2460

Closed
oraluben wants to merge 20 commits

Conversation

@oraluben commented Feb 18, 2024

What does this PR do?

This PR tries to support microsoft/DeepSpeed#4878 (DeepSpeed's torch.compile integration) in 🤗 accelerate/transformers.
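
For reference, the DeepSpeed side of this feature is driven by a compile section in the DeepSpeed config; the full config dump later in this thread shows the same shape:

"compile": {
    "enabled": true,
    "backend": "inductor"
}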

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@pacman100 since it's deepspeed related, and @tohtana since you implemented the deepspeed part.

@oraluben marked this pull request as ready for review, February 18, 2024 09:45
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@pacman100 (Contributor) left a comment

Thank you @oraluben for adding torch compile support for DeepSpeed ✨! It would be great to have related tests in the tests/deepspeed/test_deepspeed.py file.

@pacman100 (Contributor)

Hello, an overall comment: I get the error below when I run the following test:

pytest -sv tests/deepspeed/test_deepspeed.py -k test_basic_dynamo_run

[Screenshot: error output, 2024-03-15 1:11 PM]

@pacman100 (Contributor) left a comment

Hello, thank you @oraluben for the changes. Please look at the reply to your comment; the basic run fails with compile, so we first need to see how to make this feature usable.

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
@oraluben (Author)

Thanks! I've committed your dynamo fix, and I'll look at the failed test.

@oraluben (Author)

This is ready to be reviewed again :) @pacman100

@pacman100 (Contributor) left a comment

Hello @oraluben, thank you for the updates to the PR; I left a comment. The PR is almost good to merge. Also, I spent quite some time testing this. Here is a complete minimal example using DeepSpeed + Dynamo and the env variable you suggested:

  1. Clone repo https://github.com/sgugger/torchdynamo-tests
  2. Change the config configs/dynamo_fp16.yaml to include DeepSpeed:
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: 'DEEPSPEED'
downcast_bf16: 'no'
dynamo_backend: INDUCTOR
fsdp_config: {}
gpu_ids: all
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
  3. Run the command below:
TORCHDYNAMO_DEBUG_FUNCTION=forward  accelerate launch --config_file configs/dynamo_fp16.yaml scripts/text_classification.py --task_name mrpc --dynamo_backend "inductor" --batch_size 8

Output:

json = {
    "train_batch_size": 16, 
    "train_micro_batch_size_per_gpu": 8, 
    "gradient_accumulation_steps": 1, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "none", 
            "nvme_path": null
        }, 
        "offload_param": {
            "device": "none", 
            "nvme_path": null
        }, 
        "stage3_gather_16bit_weights_on_model_save": true
    }, 
    "gradient_clipping": 1.0, 
    "compile": {
        "enabled": true, 
        "backend": "inductor"
    }, 
    "steps_per_print": inf, 
    "fp16": {
        "enabled": false
    }, 
    "bf16": {
        "enabled": false
    }, 
    "zero_allow_untested_optimizer": true
}
...

Training Accuracy for backend inductor at epoch 0: {'accuracy': 0.7355349344978166, 'f1': 0.8169280181371624}
 67%|█████████████████████████████████████████████████████████████████████████████████████████▏                                            | 457/687 [01:49<00:44,  5.15it/s]03/20/2024 07:15:49 - INFO - accelerate.accelerator - The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
 67%|█████████████████████████████████████████████████████████████████████████████████████████▎                                            | 458/687 [01:49<00:44,  5.16it/s]Training Accuracy for backend inductor at epoch 1: {'accuracy': 0.8681768558951966, 'f1': 0.902127659574468}
Training Accuracy for backend inductor at epoch 1: {'accuracy': 0.8681768558951966, 'f1': 0.902127659574468}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊| 686/687 [02:35<00:00,  5.09it/s]03/20/2024 07:16:36 - INFO - accelerate.accelerator - The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 687/687 [02:35<00:00,  5.11it/s]Training Accuracy for backend inductor at epoch 2: {'accuracy': 0.954967248908297, 'f1': 0.9664838513101768}
Training finished.
First iteration took: 17.61s
Average time after the first iteration: 201.59msTraining Accuracy for backend inductor at epoch 2: {'accuracy': 0.954967248908297, 'f1': 0.9664838513101768}

Training finished.
First iteration took: 17.84s
Average time after the first iteration: 201.58ms
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING]    function: 'forward' (/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:46)
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING]    last reason: tensor 'L['input']' requires_grad mismatch. expected requires_grad=1
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank1]:[2024-03-20 07:16:42,607] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (8)
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING]    function: 'forward' (/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:46)
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING]    last reason: tensor 'L['input']' requires_grad mismatch. expected requires_grad=1
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank0]:[2024-03-20 07:16:42,628] torch._dynamo.convert_frame: [WARNING] To diagnose recompilation issues, see https://pytorch.org/docs/master/compile/troubleshooting.html.
03/20/2024 07:16:45 - INFO - accelerate.accelerator - The used dataset had no length, returning gathered tensors. You should drop the remainder yourself.
Evaluation finished.
First iteration took: 7.44s
Average time after the first iteration: 77.50ms
Evaluation finished.
First iteration took: 6.44s
Average time after the first iteration: 119.29ms
Test Accuracy for backend inductor: {'accuracy': 0.8525, 'f1': 0.8952042628774423}
Test Accuracy for backend inductor: {'accuracy': 0.8525, 'f1': 0.8952042628774423}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 687/687 [02:45<00:00,  4.16it/s]

Speed-up:

Dynamo          FP16 (train avg / eval avg)
z3 + none       188.64ms / 111.11ms
z3 + inductor   201.58ms / 119.29ms

Don't see any savings at all 😅.
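
As an aside, the cache_size_limit warnings in the log above mean dynamo stopped recompiling forward after hitting 8 cached variants. A minimal debugging workaround (a sketch only; 64 is an arbitrary value, the default printed above is 8):

import torch._dynamo

# Allow more recompiled variants per function; once the limit is hit,
# dynamo falls back to eager for new variants of that frame.
torch._dynamo.config.cache_size_limit = 64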

# dynamo itself still has some issues and can hit graph breaks or outright failures;
# use the variable below to compile only `forward` for testing.
# On the DeepSpeed side, `deepspeed.utils.z3_leaf_module.[un]set_z3_leaf_modules`
# addresses the similar need of letting users compile or skip specific modules.
"TORCHDYNAMO_DEBUG_FUNCTION": "forward",
@pacman100 (Contributor)

Why is this required? The above comment is unclear; a more detailed explanation in this PR, and a slightly more detailed code comment, would help clarify what needs to be done to get DeepSpeed + compile working.

@oraluben (Author) commented Mar 20, 2024

Dynamo still doesn't support every Python op, which can cause graph breaks and even failures during compilation. I didn't dive into the details of this failure; instead I used this whitelist to tell dynamo to compile only the forward function, which also breaks a module into several separately compiled functions. That is why you didn't see any improvement comparing inductor against no dynamo.

Based on my experience using dynamo with large models, users should specify which modules they want compiled as a single unit; that can be done with the DeepSpeed API mentioned above, with some modification to user code (see the sketch below). The tests in this PR focus on determining whether dynamo is enabled; I did not evaluate its performance.
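
For illustration, a minimal sketch of that approach, assuming an already-constructed HF Llama model bound to model (the API lives in deepspeed.utils; LlamaDecoderLayer stands in for whichever block a user wants compiled as a unit):

from deepspeed.utils import set_z3_leaf_modules
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# Mark every LlamaDecoderLayer as a ZeRO-3 "leaf": its parameters are fetched
# as a whole rather than through per-parameter hooks, so each layer's forward
# can be treated as a single compile unit.
set_z3_leaf_modules(model, [LlamaDecoderLayer])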

@oraluben (Author)

In our internal scenario, dynamo delivers a ~10% speedup when each LlamaDecoderLayer is compiled as one unit.
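
A hypothetical sketch of that per-layer setup with plain torch.compile (the model.model.layers path assumes a HF Llama model; attribute names vary by model class):

import torch

# Compile each decoder layer as its own unit; torch.compile on an nn.Module
# returns an OptimizedModule that can be swapped into the ModuleList in place.
for idx, layer in enumerate(model.model.layers):
    model.model.layers[idx] = torch.compile(layer, backend="inductor")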

@pacman100 (Contributor)

Without using the env variable TORCHDYNAMO_DEBUG_FUNCTION=forward, I get the following error:

File "/raid/sourab/transformers/src/transformers/models/bert/modeling_bert.py", line 286, in forward
        mixed_query_layer = self.query(hidden_states)result = forward_call(*args, **kwargs)

  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 655, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 727, in _convert_frame
    return self._call_impl(*args, **kwargs)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 383, in _convert_frame_assert
    compiled_product = _compile(
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 665, in _compile
    result = forward_call(*args, **kwargs)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    raise InternalTorchDynamoError(str(e)).with_traceback(
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 646, in _compile
    return F.linear(input, self.weight, self.bias)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 655, in catch_errors
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    return callback(frame, cache_entry, hooks, frame_state)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 727, in _convert_frame
    r = func(*args, **kwargs)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 626, in compile_inner
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 383, in _convert_frame_assert
    check_fn = CheckFunctionManager(
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/guards.py", line 1011, in __init__
    compiled_product = _compile(
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 665, in _compile
    raise InternalTorchDynamoError(str(e)).with_traceback(
    guard.create(builder)

  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 646, in _compile
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_guards.py", line 246, in create
    return self.create_fn(builder, self)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/guards.py", line 448, in CONSTANT_MATCH
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    val = self.get(guard.name)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/guards.py", line 258, in get
    r = func(*args, **kwargs)
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 626, in compile_inner
    return eval(name, self.scope, CLOSURE_VARS)
  File "<string>", line 1, in <module>
    check_fn = CheckFunctionManager(
  File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/_dynamo/guards.py", line 1011, in __init__
torch._dynamo.exc.InternalTorchDynamoError: type object 'FunctionMeta' has no attribute 'forward'


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

cc @tohtana and @tjruwase in case you have idea about this and the steps to overcome this.

@tohtana commented Mar 27, 2024

Hi @oraluben, @pacman100, thank you for your report! Sorry for the late response.

We found that the error is caused by no_grad during evaluation. Currently we reuse the compiled model even after the grad mode has changed. I thought dynamo automatically recompiles a model when necessary, but it seems that is not always the case.

I will try to fix this by compiling again when the grad mode is changed.
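
For context, a minimal sketch of the pattern involved (a toy model, not the repro above; dynamo guards on the global grad mode, so flipping to no_grad for evaluation should trigger a recompile rather than reuse of the training-time graph):

import torch

model = torch.nn.Linear(8, 8)
compiled = torch.compile(model)

x = torch.randn(4, 8)
compiled(x)            # first compile happens with grad mode enabled

with torch.no_grad():  # grad mode changes: the guards fail here, and dynamo
    compiled(x)        # recompiles instead of reusing the earlier graph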

@tohtana commented Mar 28, 2024

Hi @oraluben, @pacman100,
I found that torch does recompile the model when the grad mode changes. The actual problems are the following two issues.

  1. Custom linear module
    DeepSpeed enables its custom linear module when Z3 is enabled. However, that module does not work with torch.compile, so we disable it when torch.compile is enabled: DeepSpeed disables it in Init(), where it checks whether compile is enabled. We expect enabled in the compile config to be a boolean, but auto is passed to Init() ('compile': {'enabled': 'auto', 'backend': 'auto'}), so DeepSpeed does not disable the custom function.
    It seems auto is set at

    config["compile"] = {"enabled": "auto", "backend": "auto"}

    Is this expected behavior? On the other hand, DeepSpeedEngine receives 'compile': {'enabled': True, 'backend': 'inductor'}. Can we pass the same to Init()?

  2. Z3 hook function
    Dynamo fails to compile one of the functions in the Z3 hooks. We can exclude that function from the compilation target, as in "Disable compile for Z3 hook function" (microsoft/DeepSpeed#5325); we have already excluded some other Z3 hook functions this way.

When I forcibly disabled the custom Linear function and the Z3 hook function, the above example worked.
Can you give us a suggestion for the first issue? If that works, we can merge microsoft/DeepSpeed#5325 for the second one.
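
A hypothetical sketch of the resolution being asked for, assuming the fix belongs wherever accelerate finalizes the config before calling zero.Init() (the function and parameter names are illustrative, not an existing API):

def resolve_compile_config(ds_config, dynamo_backend):
    # Replace the "auto" placeholders so that zero.Init() sees the same
    # concrete values that DeepSpeedEngine later receives.
    compile_cfg = ds_config.get("compile", {})
    if compile_cfg.get("enabled") == "auto":
        compile_cfg["enabled"] = dynamo_backend is not None
    if compile_cfg.get("backend") == "auto":
        compile_cfg["backend"] = (dynamo_backend or "inductor").lower()
    ds_config["compile"] = compile_cfg
    return ds_config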

@oraluben (Author)

> We expect enabled in the compile config is boolean but auto is passed to Init ('compile': {'enabled': 'auto', 'backend': 'auto'}).

That sounds like I'm initializing the config in the wrong place; can you give some advice on the proper way? @pacman100

Separately, I've submitted a patch to torch: pytorch/pytorch#124273. I think it's safe to land this PR once that patch goes into torch.

@tjruwase

@umchand, FYI

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@oraluben (Author)

not stale, still working on that

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions (bot) closed this Jun 30, 2024