Lower GPU memory requirements at ONNX export #1115
Conversation
The documentation is not available anymore as the PR was closed or merged.
if attr_name == "config":
    return super().__getattr__(attr_name)
Why?
I had an infinite recursion error, which makes sense, no? We call `self.config` in the `__getattr__` redefinition.
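The guard above can be illustrated with a minimal, self-contained sketch (the `Base`/`Wrapper` classes here are hypothetical, not the actual classes touched by this PR): without the early return, looking up `config` before it exists would re-enter `__getattr__` through `self.config` and recurse forever.

```python
class Base:
    def __getattr__(self, attr_name):
        raise AttributeError(attr_name)


class Wrapper(Base):
    def __init__(self, config):
        self.config = config

    def __getattr__(self, attr_name):
        # Only called when normal attribute lookup fails. If "config"
        # itself is missing, delegate to the parent class instead of
        # touching self.config, which would call __getattr__ again
        # and cause infinite recursion.
        if attr_name == "config":
            return super().__getattr__(attr_name)
        # Everything else is delegated to the wrapped config object.
        return getattr(self.config, attr_name)


cfg = type("Cfg", (), {"hidden_size": 768})()
w = Wrapper(cfg)
print(w.hidden_size)  # delegated to the wrapped config: 768
```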
Why do we need to use multiprocessing to run the subprocess?
Just trying to understand, everything looks fine!
As far as my understanding goes, `multiprocessing` is good for sharing data among processes and transferring data between them (typically by launching a Python function as a subprocess), while `subprocess` is more suited to launching external commands that are not data-heavy.
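To make that contrast concrete, here is a minimal sketch (the names are illustrative; the `fork` start method is used only so the snippet is self-contained, whereas the PR itself sets `spawn` for CUDA compatibility):

```python
import multiprocessing as mp
import subprocess


def compute(q):
    # multiprocessing target: a Python function; results are Python
    # objects sent back through a shared Queue.
    q.put({"squares": [n * n for n in range(5)]})


# "fork" keeps this demo self-contained; CUDA workloads need "spawn".
ctx = mp.get_context("fork")
q = ctx.Queue()
p = ctx.Process(target=compute, args=(q,))
p.start()
result = q.get()
p.join()

# subprocess: an external command; results come back as text/bytes.
proc = subprocess.run(["echo", "hello"], capture_output=True, text=True)

print(result["squares"], proc.stdout.strip())
```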
Very nice PR @fxmarty 🔥 🚀
This should help a lot with exporting 7B-parameter models on GPU!
@@ -50,6 +52,9 @@
from transformers.modeling_tf_utils import TFPreTrainedModel

mp.set_start_method("spawn", force=True)
Why use `force=True`? This should be called only once, no?
I am getting a `RuntimeError: context has already been set` otherwise.
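That behavior can be reproduced in a few lines (a sketch run in a fresh interpreter; no processes are actually started): `multiprocessing.set_start_method` raises on a second call unless `force=True` is passed.

```python
import multiprocessing as mp

# In a fresh interpreter, the first call succeeds.
mp.set_start_method("spawn")

# A second call without force=True raises RuntimeError.
try:
    mp.set_start_method("spawn")
except RuntimeError as exc:
    err = str(exc)

# force=True overrides the already-set context, which is why the PR
# needs it: the start method may already have been set elsewhere.
mp.set_start_method("spawn", force=True)

print(err)  # context has already been set
```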
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Post-processing (merging of decoders) still uses x2 the model size in RAM, not GPU memory.
May partially fix #1069 #1060 #1055
Fixes as well a bug where ORT inputs generated in
generate_dummy_inputs_for_validation
were always of type fp32, even if the export is in fp16.

ONNX Runtime has the bad habit of not releasing GPU memory (see microsoft/onnxruntime#7463 & microsoft/onnxruntime#11362 & the script below), even when simply exiting a function that initialized an InferenceSession.
Thus, in the ONNX export validation, since we initialize several
InferenceSession
objects (e.g. encoder, decoder), if the export is done on GPU, memory keeps accumulating, which may result in OOM. This PR launches the validation in subprocesses that are killed after each validation, which effectively releases the memory. See the logs below to compare the memory usage (exporting llama-7b in fp16 on a CUDA device, with PyTorch 2.1.0.dev20230615+cu118).
Currently
After this PR
ORT not releasing memory bug reproduction
printing
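The per-validation subprocess pattern described in this PR can be sketched as follows (`_validate` is a hypothetical stand-in for the real ONNX output validation; `fork` is used so the snippet stays self-contained, while the PR uses `spawn` for CUDA safety):

```python
import multiprocessing as mp


def _validate(session_name, queue):
    # Stand-in for the real validation. Everything allocated in this
    # process (including any GPU memory an InferenceSession refuses to
    # release) is reclaimed by the OS when the process exits.
    queue.put((session_name, "ok"))


def validate_in_subprocess(session_name):
    ctx = mp.get_context("fork")  # the PR uses "spawn" for CUDA
    queue = ctx.Queue()
    proc = ctx.Process(target=_validate, args=(session_name, queue))
    proc.start()
    result = queue.get()
    proc.join()  # process exit is what guarantees memory is released
    return result


results = [validate_in_subprocess(name) for name in ("encoder", "decoder")]
print(results)
```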