Lower GPU memory requirements at ONNX export #1115

Merged Jun 19, 2023 (11 commits)

Conversation

fxmarty (Collaborator) commented Jun 16, 2023

Post-processing (merging of the decoders) still uses 2x the model size in RAM, not GPU memory.

May partially fix #1069 #1060 #1055

This also fixes a bug where ORT inputs generated in generate_dummy_inputs_for_validation were always of type fp32, even when the export is in fp16.

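For illustration only (this is not the actual Optimum helper, and the names below are made up), the fix amounts to generating the floating-point dummy inputs in the export dtype rather than hardcoding fp32; a minimal sketch:

import torch

def dummy_inputs_for_validation(batch=2, seq_len=16, hidden=768, dtype=torch.float16, device="cuda"):
    # Sketch only: integer inputs keep their integer dtype, but floating-point
    # inputs must follow the export precision (fp16 here) instead of always fp32.
    return {
        "input_ids": torch.randint(0, 100, (batch, seq_len), device=device),
        "attention_mask": torch.ones(batch, seq_len, dtype=torch.int64, device=device),
        "encoder_hidden_states": torch.rand(batch, seq_len, hidden, device=device).to(dtype),
    }
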
ONNX Runtime has the bad habit of not releasing GPU memory (see microsoft/onnxruntime#7463 & microsoft/onnxruntime#11362 & the script below) when doing

del session
gc.collect()

or when simply exiting a function that initialized an InferenceSession.

Thus, in the ONNX export validation, since we initialize several InferenceSessions (e.g. encoder, decoder), GPU memory keeps accumulating when the export is done on GPU, which may result in OOM.

This PR allows launching the validation in subprocesses, which are killed after each validation and thus actually release the memory. See the logs below to compare memory usage (exporting llama-7b in fp16 on a CUDA device, with pytorch 2.1.0.dev20230615+cu118).
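
As a rough sketch of the idea (the names below are illustrative, not the exact implementation in this PR), the validation can run in a spawned process so that all ONNX Runtime allocations die with the process:

import multiprocessing as mp

def _validate_model(onnx_path, results):
    # Everything ORT allocates on the GPU lives only in this child process.
    import onnxruntime as ort

    session = ort.InferenceSession(onnx_path, providers=["CUDAExecutionProvider"])
    # ... run reference inputs through the session and compare with PyTorch outputs ...
    results.put("ok")

def validate_in_subprocess(onnx_path):
    ctx = mp.get_context("spawn")
    results = ctx.Queue()
    process = ctx.Process(target=_validate_model, args=(onnx_path, results))
    process.start()
    outcome = results.get()
    process.join()  # once the child exits, its GPU memory is actually released
    return outcome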

Currently
----------------- Start main_export
RAM: 39519.16 MB
GPU mem: 0.00 MB
----------------- before from_pretrained call
RAM: 39519.74 MB
GPU mem: 0.00 MB
----------------- after from_pretrained call
RAM: 41075.89 MB
GPU mem: 14547.94 MB
----------------- After loading model
RAM: 41071.94 MB
GPU mem: 14547.94 MB
----------------- After loading models_and_onnx_configs, before export
RAM: 41084.33 MB
GPU mem: 14547.94 MB
----------------- Just before onnx_export call
RAM: 41083.15 MB
GPU mem: 14547.94 MB
----------------- Just after onnx_export call
RAM: 43764.74 MB
GPU mem: 15351.15 MB
----------------- Just after save external data call
RAM: 43752.37 MB
GPU mem: 15351.15 MB
----------------- Just before onnx_export call
RAM: 44119.20 MB
GPU mem: 15761.15 MB
----------------- Just after onnx_export call
RAM: 45858.22 MB
GPU mem: 15782.12 MB
----------------- Just after save external data call
RAM: 45847.20 MB
GPU mem: 15782.12 MB
----------------- After export, before post-process
RAM: 45856.85 MB
GPU mem: 15782.12 MB
----------------- After post-process
RAM: 47516.82 MB
GPU mem: 15782.12 MB
----------------- Before nth validation
RAM: 47515.49 MB
GPU mem: 15782.12 MB
----------------- Start of nth model validation
RAM: 47527.54 MB
GPU mem: 15782.12 MB
----------------- Before InferenceSession init
RAM: 47526.32 MB
GPU mem: 15782.12 MB
----------------- After InferenceSession init
RAM: 48015.36 MB
GPU mem: 33012.32 MB
----------------- Before nth validation
RAM: 48057.54 MB
GPU mem: 35233.20 MB
----------------- Start of nth model validation
RAM: 48059.34 MB
GPU mem: 35233.20 MB
----------------- Before InferenceSession init
RAM: 48084.92 MB
GPU mem: 35233.20 MB
----------------- After InferenceSession init
RAM: 48694.38 MB
GPU mem: 52501.15 MB
----------------- After validation
RAM: 48799.00 MB
GPU mem: 55058.63 MB
After this PR
----------------- Start main_export
RAM: 32096.99 MB
GPU mem: 0.00 MB
----------------- before from_pretrained call
RAM: 32109.21 MB
GPU mem: 0.00 MB
----------------- after from_pretrained call
RAM: 35730.13 MB
GPU mem: 14545.85 MB
----------------- After loading model
RAM: 35739.60 MB
GPU mem: 14545.85 MB
----------------- After loading models_and_onnx_configs, before export
RAM: 35787.28 MB
GPU mem: 14545.85 MB
----------------- Just before onnx_export call
RAM: 35795.46 MB
GPU mem: 14545.85 MB
----------------- Just before onnx_export call
RAM: 38835.44 MB
GPU mem: 15761.15 MB
----------------- Just after onnx_export call
RAM: 40598.87 MB
GPU mem: 15782.12 MB
----------------- Just after save external data call
RAM: 40578.62 MB
GPU mem: 15782.12 MB
----------------- After export, before post-process
RAM: 40570.54 MB
GPU mem: 15782.12 MB
----------------- After post-process
RAM: 42374.24 MB
GPU mem: 15782.12 MB
----------------- before ValidationProcess init
RAM: 42377.47 MB
GPU mem: 15782.12 MB
----------------- end of ValidationProcess init
RAM: 42377.54 MB
GPU mem: 15782.12 MB
----------------- start of ValidationProcess run
RAM: 42686.26 MB
GPU mem: 16213.08 MB
----------------- after ValidationProcess join
RAM: 42385.13 MB
GPU mem: 15782.12 MB
----------------- before ValidationProcess init
RAM: 42396.16 MB
GPU mem: 15782.12 MB
----------------- end of ValidationProcess init
RAM: 42395.87 MB
GPU mem: 15782.12 MB
----------------- start of ValidationProcess run
RAM: 42802.97 MB
GPU mem: 16213.08 MB
----------------- after ValidationProcess join
RAM: 42466.02 MB
GPU mem: 15782.12 MB
----------------- After validation
RAM: 42464.28 MB
GPU mem: 15782.12 MB
ORT not releasing memory bug reproduction
import onnxruntime as ort
import gc
import psutil
import subprocess
import torch

def print_memory(prefix):
    # Host RAM currently in use, in MB.
    cpu_ram_mb = (psutil.virtual_memory().total - psutil.virtual_memory().available) / 1000**2

    # GPU memory used on device 0 as reported by nvidia-smi (MiB converted to MB).
    command = "nvidia-smi --query-gpu=memory.used --format=csv --id=0"
    gpu_mem_info = subprocess.check_output(command.split()).decode("ascii").split("\n")[:-1][1:]
    gpu_mem_mb = [int(x.split()[0]) for x in gpu_mem_info][0] * 1.048576

    print("-----------------", prefix, flush=True)
    print(f"RAM: {cpu_ram_mb:.2f} MB", flush=True)
    print(f"GPU mem: {gpu_mem_mb:.2f} MB", flush=True)

print_memory("Before session load")

session = ort.InferenceSession("/path/to/decoder_model.onnx", providers=["CUDAExecutionProvider"])

"""
onnx_inputs = {
    "input_ids": torch.randint(0, 10, (2, 20)).numpy(),
    "attention_mask": torch.ones(2, 20, dtype=torch.int64).numpy()
}

onnx_outputs = session.run(None, onnx_inputs)
"""
del session
gc.collect()

print_memory("after collect")

which prints:

----------------- Before session load
RAM: 6784.98 MB
GPU mem: 5.24 MB
----------------- after collect
RAM: 8804.13 MB
GPU mem: 921.70 MB

fxmarty changed the title from "Lower memory requirements at ONNX export" to "Lower GPU memory requirements at ONNX export" on Jun 16, 2023
HuggingFaceDocBuilderDev commented Jun 16, 2023

The documentation is not available anymore as the PR was closed or merged.

Comment on lines +47 to +49
if attr_name == "config":
    return super().__getattr__(attr_name)

Member

Why?

fxmarty (Collaborator, Author) Jun 16, 2023

I had an infinite recursion error, which makes sense, no? We call self.config in the __getattr__ redefinition.
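
For readers hitting the same pitfall, a standalone toy example (not the Optimum class itself) of why reading self.config inside __getattr__ recurses without such a guard:

from types import SimpleNamespace

class Proxy:
    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __getattr__(self, attr_name):
        # __getattr__ only runs when normal lookup fails. The last line reads
        # self.config, and "config" is not a real attribute of Proxy, so without
        # the early return Python would call __getattr__("config"), which reads
        # self.config again, and so on until RecursionError.
        if attr_name == "config":
            return self.wrapped.config
        return getattr(self.config, attr_name)

model = SimpleNamespace(config=SimpleNamespace(num_attention_heads=12))
print(Proxy(model).num_attention_heads)  # 12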

Member
Why do we need to use multiprocessing to run the subprocess?
Just trying to understand, everything looks fine!

fxmarty (Collaborator, Author)

As far as I understand, multiprocessing is convenient for sharing data among processes and transferring data between them (typically launching a function as a subprocess), while subprocess is better suited to launching commands that are not data-heavy.
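
A minimal contrast between the two (the function and command below are just placeholders):

import multiprocessing as mp
import subprocess

def check(shapes, queue):
    # A Python function run in a child process; Python objects go in and out
    # through the queue (pickled under the hood).
    queue.put({"shapes": shapes, "status": "ok"})

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    worker = ctx.Process(target=check, args=([(2, 20)], queue))
    worker.start()
    print(queue.get())
    worker.join()

    # subprocess launches an external command; exchanging rich Python data
    # would require serializing it yourself (stdout, files, ...).
    print(subprocess.check_output(["echo", "hello"]).decode())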

regisss (Contributor) left a comment
Very nice PR @fxmarty 🔥 🚀
This should help a lot with exporting 7B-parameter models on GPU!

@@ -50,6 +52,9 @@
from transformers.modeling_tf_utils import TFPreTrainedModel


mp.set_start_method("spawn", force=True)
Contributor

Why use force=True? This should be called only once, no?

fxmarty (Collaborator, Author)

I am getting a RuntimeError: context has already been set otherwise
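
That error is easy to reproduce in isolation (a minimal repro, unrelated to the Optimum code path):

import multiprocessing as mp

mp.set_start_method("spawn")
try:
    mp.set_start_method("spawn")          # second call without force
except RuntimeError as err:
    print(err)                            # "context has already been set"

mp.set_start_method("spawn", force=True)  # force=True overrides the existing context

For what it's worth, mp.get_context("spawn") would be a global-state-free alternative, since it returns a spawn context without setting the default start method.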

(A resolved review comment on optimum/exporters/onnx/convert.py, now outdated.)
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Development

Successfully merging this pull request may close these issues.

llama-7b inference report Failed to allocate memory for requested buffer of size 180355072