Lower GPU memory requirements at ONNX export #1115

Merged Jun 19, 2023 (11 commits)

Conversation

fxmarty (Collaborator) commented Jun 16, 2023

Post-processing (merging of the decoders) still uses 2x the model size in RAM, not GPU memory.

May partially fix #1069 #1060 #1055

This also fixes a bug where ORT inputs generated in generate_dummy_inputs_for_validation were always of type fp32, even when the export is in fp16.

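For illustration only (this is not the actual Optimum helper, and the names below are made up), the fix amounts to generating the floating-point dummy inputs in the export dtype rather than hardcoding fp32; a minimal sketch:

import torch

def dummy_inputs_for_validation(batch=2, seq_len=16, hidden=768, dtype=torch.float16, device="cuda"):
    # Sketch only: integer inputs keep their integer dtype, but floating-point
    # inputs must follow the export precision (fp16 here) instead of always fp32.
    return {
        "input_ids": torch.randint(0, 100, (batch, seq_len), device=device),
        "attention_mask": torch.ones(batch, seq_len, dtype=torch.int64, device=device),
        "encoder_hidden_states": torch.rand(batch, seq_len, hidden, device=device).to(dtype),
    }
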
ONNX Runtime has the bad habit of not releasing GPU memory (see microsoft/onnxruntime#7463 & microsoft/onnxruntime#11362 & the script below) when doing

del session
gc.collect()

or when simply exiting a function that initialized an InferenceSession.

Thus, in the ONNX export validation, since we initialize several InferenceSessions (e.g. encoder, decoder), GPU memory keeps accumulating when the export is done on GPU, which may result in OOM.

This PR allows launching the validation in subprocesses, which are killed after each validation and thus actually release the memory. See the logs below to compare memory usage (exporting llama-7b in fp16 on a CUDA device, with pytorch 2.1.0.dev20230615+cu118).
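
As a rough sketch of the idea (the names below are illustrative, not the exact implementation in this PR), the validation can run in a spawned process so that all ONNX Runtime allocations die with the process:

import multiprocessing as mp

def _validate_model(onnx_path, results):
    # Everything ORT allocates on the GPU lives only in this child process.
    import onnxruntime as ort

    session = ort.InferenceSession(onnx_path, providers=["CUDAExecutionProvider"])
    # ... run reference inputs through the session and compare with PyTorch outputs ...
    results.put("ok")

def validate_in_subprocess(onnx_path):
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    process = ctx.Process(target=_validate_model, args=(onnx_path, results))
    process.start()
    outcome = results.get()
    process.join()  # once the child exits, its GPU memory is actually released
    return outcome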

Currently
----------------- Start main_export
RAM: 39519.16 MB
GPU mem: 0.00 MB
----------------- before from_pretrained call
RAM: 39519.74 MB
GPU mem: 0.00 MB
----------------- after from_pretrained call
RAM: 41075.89 MB
GPU mem: 14547.94 MB
----------------- After loading model
RAM: 41071.94 MB
GPU mem: 14547.94 MB
----------------- After loading models_and_onnx_configs, before export
RAM: 41084.33 MB
GPU mem: 14547.94 MB
----------------- Just before onnx_export call
RAM: 41083.15 MB
GPU mem: 14547.94 MB
----------------- Just after onnx_export call
RAM: 43764.74 MB
GPU mem: 15351.15 MB
----------------- Just after save external data call
RAM: 43752.37 MB
GPU mem: 15351.15 MB
----------------- Just before onnx_export call
RAM: 44119.20 MB
GPU mem: 15761.15 MB
----------------- Just after onnx_export call
RAM: 45858.22 MB
GPU mem: 15782.12 MB
----------------- Just after save external data call
RAM: 45847.20 MB
GPU mem: 15782.12 MB
----------------- After export, before post-process
RAM: 45856.85 MB
GPU mem: 15782.12 MB
----------------- After post-process
RAM: 47516.82 MB
GPU mem: 15782.12 MB
----------------- Before nth validation
RAM: 47515.49 MB
GPU mem: 15782.12 MB
----------------- Start of nth model validation
RAM: 47527.54 MB
GPU mem: 15782.12 MB
----------------- Before InferenceSession init
RAM: 47526.32 MB
GPU mem: 15782.12 MB
----------------- After InferenceSession init
RAM: 48015.36 MB
GPU mem: 33012.32 MB
----------------- Before nth validation
RAM: 48057.54 MB
GPU mem: 35233.20 MB
----------------- Start of nth model validation
RAM: 48059.34 MB
GPU mem: 35233.20 MB
----------------- Before InferenceSession init
RAM: 48084.92 MB
GPU mem: 35233.20 MB
----------------- After InferenceSession init
RAM: 48694.38 MB
GPU mem: 52501.15 MB
----------------- After validation
RAM: 48799.00 MB
GPU mem: 55058.63 MB
After this PR
----------------- Start main_export
RAM: 32096.99 MB
GPU mem: 0.00 MB
----------------- before from_pretrained call
RAM: 32109.21 MB
GPU mem: 0.00 MB
----------------- after from_pretrained call
RAM: 35730.13 MB
GPU mem: 14545.85 MB
----------------- After loading model
RAM: 35739.60 MB
GPU mem: 14545.85 MB
----------------- After loading models_and_onnx_configs, before export
RAM: 35787.28 MB
GPU mem: 14545.85 MB
----------------- Just before onnx_export call
RAM: 35795.46 MB
GPU mem: 14545.85 MB
----------------- Just before onnx_export call
RAM: 38835.44 MB
GPU mem: 15761.15 MB
----------------- Just after onnx_export call
RAM: 40598.87 MB
GPU mem: 15782.12 MB
----------------- Just after save external data call
RAM: 40578.62 MB
GPU mem: 15782.12 MB
----------------- After export, before post-process
RAM: 40570.54 MB
GPU mem: 15782.12 MB
----------------- After post-process
RAM: 42374.24 MB
GPU mem: 15782.12 MB
----------------- before ValidationProcess init
RAM: 42377.47 MB
GPU mem: 15782.12 MB
----------------- end of ValidationProcess init
RAM: 42377.54 MB
GPU mem: 15782.12 MB
----------------- start of ValidationProcess run
RAM: 42686.26 MB
GPU mem: 16213.08 MB
----------------- after ValidationProcess join
RAM: 42385.13 MB
GPU mem: 15782.12 MB
----------------- before ValidationProcess init
RAM: 42396.16 MB
GPU mem: 15782.12 MB
----------------- end of ValidationProcess init
RAM: 42395.87 MB
GPU mem: 15782.12 MB
----------------- start of ValidationProcess run
RAM: 42802.97 MB
GPU mem: 16213.08 MB
----------------- after ValidationProcess join
RAM: 42466.02 MB
GPU mem: 15782.12 MB
----------------- After validation
RAM: 42464.28 MB
GPU mem: 15782.12 MB
ORT not releasing memory bug reproduction
import onnxruntime as ort
import gc
import psutil
import subprocess
import torch

def print_memory(prefix):
    # Host RAM currently in use, in MB.
    cpu_ram_mb = (psutil.virtual_memory().total - psutil.virtual_memory().available) / 1000**2

    # GPU memory used on device 0 as reported by nvidia-smi (MiB converted to MB).
    command = "nvidia-smi --query-gpu=memory.used --format=csv --id=0"
    gpu_mem_info = subprocess.check_output(command.split()).decode("ascii").split("\n")[:-1][1:]
    gpu_mem_mb = [int(x.split()[0]) for x in gpu_mem_info][0] * 1.048576

    print("-----------------", prefix, flush=True)
    print(f"RAM: {cpu_ram_mb:.2f} MB", flush=True)
    print(f"GPU mem: {gpu_mem_mb:.2f} MB", flush=True)

print_memory("Before session load")

session = ort.InferenceSession("/path/to/decoder_model.onnx", providers=["CUDAExecutionProvider"])

"""
onnx_inputs = {
    "input_ids": torch.randint(0, 10, (2, 20)).numpy(),
    "attention_mask": torch.ones(2, 20, dtype=torch.int64).numpy()
}

onnx_outputs = session.run(None, onnx_inputs)
"""
del session
gc.collect()

print_memory("after collect")

which prints:

----------------- Before session load
RAM: 6784.98 MB
GPU mem: 5.24 MB
----------------- after collect
RAM: 8804.13 MB
GPU mem: 921.70 MB

fxmarty changed the title from "Lower memory requirements at ONNX export" to "Lower GPU memory requirements at ONNX export" on Jun 16, 2023
HuggingFaceDocBuilderDev commented Jun 16, 2023

The documentation is not available anymore as the PR was closed or merged.

Comment on lines +47 to +49
if attr_name == "config":
    return super().__getattr__(attr_name)

Member

Why?

fxmarty (Collaborator, Author) Jun 16, 2023

I had an infinite recursion error, which makes sense, no? We call self.config in the __getattr__ redefinition.
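
For readers hitting the same pitfall, a standalone toy example (not the Optimum class itself) of why reading self.config inside __getattr__ recurses without such a guard:

from types import SimpleNamespace

class Proxy:
    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __getattr__(self, attr_name):
        # __getattr__ only runs when normal lookup fails. The last line reads
        # self.config, and "config" is not a real attribute of Proxy, so without
        # the early return Python would call __getattr__("config"), which reads
        # self.config again, and so on until RecursionError.
        if attr_name == "config":
            return self.wrapped.config
        return getattr(self.config, attr_name)

model = SimpleNamespace(config=SimpleNamespace(num_attention_heads=12))
print(Proxy(model).num_attention_heads)  # 12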

Member
Why do we need to use multiprocessing to run the subprocess?
Just trying to understand, everything looks fine!

fxmarty (Collaborator, Author)

As far as I understand, multiprocessing is convenient for sharing data among processes and transferring data between them (typically launching a function as a subprocess), while subprocess is better suited to launching commands that are not data-heavy.
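
A minimal contrast between the two (the function and command below are just placeholders):

import multiprocessing as mp
import subprocess

def check(shapes, queue):
    # A Python function run in a child process; Python objects go in and out
    # through the queue (pickled under the hood).
    queue.put({"shapes": shapes, "status": "ok"})

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    worker = ctx.Process(target=check, args=([(2, 20)], queue))
    worker.start()
    print(queue.get())
    worker.join()

    # subprocess launches an external command; exchanging rich Python data
    # would require serializing it yourself (stdout, files, ...).
    print(subprocess.check_output(["echo", "hello"]).decode())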

regisss (Contributor) left a comment
Very nice PR @fxmarty 🔥 🚀
This should help a lot with exporting 7B-parameter models on GPU!

@@ -50,6 +52,9 @@
from transformers.modeling_tf_utils import TFPreTrainedModel


mp.set_start_method("spawn", force=True)
Contributor

Why use force=True? This should be called only once, no?

fxmarty (Collaborator, Author)

I am getting a RuntimeError: context has already been set otherwise
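
That error is easy to reproduce in isolation (a minimal repro, unrelated to the Optimum code path):

import multiprocessing as mp

mp.set_start_method("spawn")
try:
    mp.set_start_method("spawn")          # second call without force
except RuntimeError as err:
    print(err)                            # "context has already been set"

mp.set_start_method("spawn", force=True)  # force=True overrides the existing context

For what it's worth, mp.get_context("spawn") would be a global-state-free alternative, since it returns a spawn context without setting the default start method.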

(A resolved review comment on optimum/exporters/onnx/convert.py, now outdated.)
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Development

Successfully merging this pull request may close these issues.

llama-7b inference report Failed to allocate memory for requested buffer of size 180355072