Prompt layer-wise recompute when applicable#20126

Merged
pengwa merged 23 commits into main from pengwa/enable_layerwise_automatically
Apr 10, 2024

@pengwa
Contributor

@pengwa pengwa commented Mar 28, 2024

Prompt layer-wise when applicable

When ONNX export fails and we find that the checkpoint function is used, explicitly prompt users to enable layer-wise memory optimization.

  • Use of the checkpoint function is a strong indicator that the model is too large to fit in GPU memory.
  • If we don't override the checkpoint function here, ONNX export will fail in most cases: (1) with older PyTorch versions, we simply throw an exception when handling the gradient checkpoint feature; (2) with newer PyTorch versions, an export failure happens.
  • Neither failure tells users explicitly how to mitigate it. This PR adds that guidance.
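
The idea in the bullets above can be sketched roughly as follows. This is a minimal, framework-free illustration and not the code in this PR: `prompt_on_checkpoint`, `CheckpointExportError`, and the `is_exporting` callback are hypothetical names, and the environment variable `ORTMODULE_MEMORY_OPT_LEVEL=1` is assumed from memory of docs/Memory_Optimizer.md, so check it against that document before relying on it.

```python
import functools

# Assumption: the switch that enables layer-wise recompute; verify the exact
# name and value against docs/Memory_Optimizer.md.
MEMORY_OPT_HINT = "ORTMODULE_MEMORY_OPT_LEVEL=1"


class CheckpointExportError(RuntimeError):
    """Hypothetical error raised when a checkpoint call is hit during export."""


def prompt_on_checkpoint(checkpoint_fn, is_exporting):
    """Wrap a checkpoint function so that export-time calls fail with a hint.

    checkpoint_fn: the original checkpoint function, called as
        checkpoint_fn(run_fn, *args, **kwargs).
    is_exporting: hypothetical zero-argument callable that returns True
        while the model is being exported to ONNX.
    """

    @functools.wraps(checkpoint_fn)
    def wrapper(run_fn, *args, **kwargs):
        if is_exporting():
            # Fail early and tell the user how to mitigate, instead of
            # letting the exporter fail with an unrelated message.
            raise CheckpointExportError(
                f"'{checkpoint_fn.__name__}' was called during ONNX export. "
                "Gradient checkpointing usually means the model does not fit "
                f"in GPU memory; consider setting {MEMORY_OPT_HINT} to enable "
                "layer-wise recompute instead."
            )
        # Outside export, behave exactly like the original function.
        return checkpoint_fn(run_fn, *args, **kwargs)

    return wrapper
```

Outside export the wrapper is transparent; during export the user sees a message that names the mitigation rather than a bare exception or a generic export failure.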

![image](https://github.com/microsoft/onnxruntime/assets/10530022/c0476748-5818-4cc8-b2d6-88c7580fe4da)


@pengwa pengwa added the training label (issues related to ONNX Runtime training; typically submitted using template) Mar 28, 2024
@pengwa pengwa marked this pull request as ready for review March 28, 2024 16:01
@pengwa pengwa changed the title from "Enable layer-wise automatically when applicable" to "Enable layer-wise-recompute automatically when applicable" Mar 28, 2024
Comment thread orttraining/orttraining/python/training/ortmodule/__init__.py Outdated
@pengwa pengwa changed the title from "Enable layer-wise-recompute automatically when applicable" to "Enable layer-wise-recompute automatically" Mar 29, 2024
Comment thread docs/Memory_Optimizer.md Outdated
Comment thread orttraining/orttraining/test/python/orttraining_test_ortmodule_api.py Outdated
@pengwa pengwa changed the title from "Enable layer-wise-recompute automatically" to "Prompt layer-wise when applicable" Apr 9, 2024
@pengwa pengwa changed the title from "Prompt layer-wise when applicable" to "Prompt layer-wise recompute when applicable" Apr 9, 2024
Comment thread docs/Memory_Optimizer.md Outdated
Comment thread orttraining/orttraining/python/training/ortmodule/__init__.py
wschin
wschin previously approved these changes Apr 9, 2024
Contributor

@wschin wschin left a comment

Good effort to consolidate the flags we have.

wschin
wschin previously approved these changes Apr 9, 2024
Contributor

@wschin wschin left a comment

Thanks.

@pengwa pengwa merged commit 280b263 into main Apr 10, 2024
@pengwa pengwa deleted the pengwa/enable_layerwise_automatically branch April 10, 2024 03:50
@pengwa
Contributor Author

pengwa commented Apr 10, 2024

Thanks @wschin @mindest !

TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024
4 participants