
Memory leak when using CLIPTextModel #31439

Closed
2 of 4 tasks
minsuk00 opened this issue Jun 15, 2024 · 7 comments

Comments

@minsuk00

minsuk00 commented Jun 15, 2024

System Info

  • transformers version: 4.26.1
  • Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
  • Python version: 3.11.5
  • Huggingface_hub version: 0.17.3
  • PyTorch version (GPU?): 2.1.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I can't free GPU memory after I use CLIPTextModel.
Also, memory is allocated on another device for some reason.

The problem should be reproducible with the following code snippet:

from transformers import CLIPTextModel
import torch

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:1")
del clip_text_model
torch.cuda.empty_cache()

Expected behavior

  • CLIPTextModel is placed on "cuda:1", but for some reason memory gets allocated on "cuda:0" when I call torch.cuda.empty_cache()
  • Also, memory is still not freed on "cuda:1"
  • As a result, both "cuda:0" and "cuda:1" have some memory allocated

I've also tried running garbage collection and explicitly moving the model to the CPU, but neither works.
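
For reference, a minimal sketch to inspect what PyTorch's caching allocator still holds on each visible device after the cleanup above (this does not show the CUDA context overhead itself, which is only visible in tools like nvidia-smi):

import torch
from transformers import CLIPTextModel

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:1")
del clip_text_model
torch.cuda.empty_cache()

# What the caching allocator still holds on each visible GPU
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}: allocated={torch.cuda.memory_allocated(i)} B, reserved={torch.cuda.memory_reserved(i)} B")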

@younesbelkada
Contributor

Hi @minsuk00
you can also try the release_memory utility method from accelerate.utils - cc @muellerzr
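
A minimal usage sketch of release_memory (the assignment pattern follows the accelerate docs; shown as an illustration, not a confirmed fix for this case):

import torch
from accelerate.utils import release_memory
from transformers import CLIPTextModel

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:1")

# release_memory drops the references, runs gc.collect() and torch.cuda.empty_cache()
clip_text_model = release_memory(clip_text_model)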

@minsuk00
Author

minsuk00 commented Jun 18, 2024

@younesbelkada cc @muellerzr
Thanks for the suggestion, but it doesn't seem to work.
clip_text_model = accelerate.utils.release_memory(clip_text_model)
does not free any GPU memory.

Additionally, calling clip_text_model.cpu() or torch.cuda.empty_cache() simply results in the behavior described above.

@amyeroberts
Collaborator

cc @muellerzr regarding the accelerate behaviour.

Regarding torch.cuda.empty_cache(), it's recommended not to call this function manually; cf. a related issue and this discussion in the PyTorch forum.

@nnilayy
Contributor

nnilayy commented Sep 6, 2024

Hey there @minsuk00, hope you are doing well. (Redirected from #33345 to here.) I see two main concerns raised here regarding VRAM usage. Let's address them one by one.

Small VRAM Usage Spike on 2nd GPU/Non-primary GPUs

When loading a model, PyTorch initializes a CUDA context for managing resources like device memory and kernel execution. This setup includes detecting which GPUs are available. By default, CUDA might allocate a small amount of memory on all available GPUs, not just the one we explicitly specified for the model to be deployed on. This can cause a minor memory spike on non-primary GPUs.

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:1")

The code above places the model on GPU-1. However, without specific settings, CUDA initializes contexts on all GPUs, leading to small memory footprints on each.

To ensure that only GPU-1 is utilized, and to prevent CUDA from initializing on other GPUs, you can set the CUDA_VISIBLE_DEVICES environment variable at the beginning of your script, as follows:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # Only GPU-1 is recognized

This restricts CUDA to seeing and using only the GPU we specify and avoids unnecessary memory allocation on the others.

🤗 Tip: Always define CUDA_VISIBLE_DEVICES before importing torch, to ensure that CUDA configurations are applied correctly:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch

# rest of the code
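
# Optional sanity check (assumes CUDA_VISIBLE_DEVICES was set before importing torch):
# only physical GPU 1 should now be visible, exposed to this process as cuda:0
print(torch.cuda.device_count())      # expected: 1
print(torch.cuda.get_device_name(0))  # name of physical GPU 1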

Residual VRAM Usage After Freeing Up Memory

When we use PyTorch with CUDA to load a model, the framework initializes a CUDA context. The CUDA context can be thought of as a data structure that manages crucial GPU resources such as device memory allocation, kernel execution, and synchronization mechanisms.

So previously, in an attempt to clear the VRAM entirely, we executed the following code:

del clip_text_model
gc.collect()
torch.cuda.empty_cache()

This frees up most of the VRAM used by the model, but it does not dismantle the CUDA context or release the other minor memory allocations made during the model's initialization. As a result, we still observe some residual VRAM usage. The CUDA context, as a persistent entity, maintains the state and resources required for efficient GPU operations across the lifecycle of the program, ensuring the overhead of initializing these resources is not repeatedly incurred.

The good news is that residual VRAM usage is generally consistent across different models, regardless of their type or size.

For example:
(1) For CLIPTextModel: once the model is loaded with the following config, CLIPTextModel consumes 667MB of VRAM on the second GPU at peak. At the end of the script, after deleting the model, running garbage collection, and emptying the cache, we are left with a VRAM usage of 103MB.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import gc
import torch
from transformers import CLIPTextModel

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:1")
del clip_text_model
gc.collect()
torch.cuda.empty_cache()

And similarly:
(2) For Meta-Llama-3.1-8B-Instruct: once the model is loaded with the following config, Meta-Llama-3.1-8B-Instruct consumes 12.2GB of VRAM on the second GPU at peak. At the end of the script, after deleting the model, running garbage collection, and emptying the cache, we are left with a VRAM usage of 103MB.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import gc
import torch
from transformers import AutoModelForCausalLM
checkpoint = "meta-llama/Meta-Llama-3.1-8B-Instruct"
hugging_face_token = os.environ.get("HF_TOKEN")  # assumed: your Hugging Face access token
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    torch_dtype=torch.float16,
    token=hugging_face_token,
)
del model
gc.collect()
torch.cuda.empty_cache()

🌟 Important point to note:
If you ran both the above scripts sequentially, the residual VRAM usage would not sum up to 206MB; it would still be just 103MB. This indicates that the CUDA context, once initialized, does not duplicate its foundational resource allocations when additional models are loaded in the same session.

Therefore, loading multiple models does not cumulatively increase residual VRAM usage beyond what is necessary for the CUDA context and other minimal allocations. After using one model, you can delete it and load another as needed, without worrying that each additional model load carries its own extra residual VRAM overhead. A rough way to measure this yourself is sketched below.
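
The sketch below uses torch.cuda.mem_get_info, which reports free/total memory at the driver level and therefore does include the CUDA context overhead (it assumes no other process is using the GPU, and it uses "cuda:0", which is the correct ordinal once CUDA_VISIBLE_DEVICES="1" is set, as discussed later in this thread):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import gc
import torch
from transformers import CLIPTextModel

def used_mb():
    # Driver-level view of device 0 (the only visible GPU): includes the CUDA context
    free, total = torch.cuda.mem_get_info(0)
    return (total - free) / 1024**2

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:0")
print(f"usage after load: {used_mb():.0f} MB")

del clip_text_model
gc.collect()
torch.cuda.empty_cache()
print(f"residual usage: {used_mb():.0f} MB")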

If you are looking to mitigate even this small residual VRAM usage, the next section discusses techniques to do exactly that. But if you simply want to load another model and continue with your work after the first one, I have briefly highlighted the guidelines in the TLDR & Ending Notes section at the end.

Mitigating Residual VRAM Usage

(1) Restart the IPy-Kernel:

Restarting the Jupyter Notebook kernel clears all defined variables and the CUDA context, effectively freeing up all GPU memory used during the session. This is a straightforward way to reset the memory state of the GPU after model operations.

(2) Using Terminal to Run the Script:

Running your scripts directly through a terminal session instead of a notebook can help manage memory more effectively. Each time the script completes, the Python process terminates, which clears all memory allocations, including the CUDA context.

Here’s how to set up and run your script:

%%writefile model_load_script.py
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import gc
import torch
from transformers import CLIPTextModel

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:1")

# Operations are performed here
# Once completed, clean up the model and VRAM
del clip_text_model
gc.collect()
torch.cuda.empty_cache()

To execute the script, run the following command in your terminal:

python model_load_script.py

(3) Using numba to Reset the GPU:

For situations where restarting the notebook or terminal is not feasible, you can use numba's GPU management features to forcefully reset the GPU. This method clears all memory allocations and should be used with caution as it affects all CUDA kernels and memory on the GPU:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import gc
import torch
from numba import cuda
from transformers import CLIPTextModel

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:1")

# After other operations, clean up the model and VRAM
del clip_text_model
gc.collect()
torch.cuda.empty_cache()
device = cuda.get_current_device()
device.reset()

Note: Use this method at the end of your workflow, as it will disrupt all ongoing GPU processes and reset the state entirely, similar to rebooting the GPU.

🤗 TLDR & Ending Notes

  1. For the first issue, setting CUDA_VISIBLE_DEVICES before importing torch prevents the VRAM spike on non-primary GPUs.

  2. For the second issue:

    • (I) If you just want to remove residual VRAM usage at the end of the script run, you can:

      • (i) Restart the Python Kernel
      • (ii) Use Terminal to run the script
      • (iii) Use numba at the end of the script
    • (II) If you want to continue with other work (e.g. run some function abc()) or load another model_y and no longer need the previously loaded model_x:

del model_x
gc.collect()
torch.cuda.empty_cache()

The code above will clear the VRAM occupied by model_x, which can then be used by model_y or for any other operations that require VRAM.

Hope this clarifies the miscellaneous VRAM spikes and residual VRAM usage 🤗. Let me know if something is unclear.

@SunMarc
Member

SunMarc commented Sep 27, 2024

Hey @minsuk00, does the answer above fix your issue?

@minsuk00
Author

@SunMarc @nnilayy
Hey, apologies for the late response.

1.

For the first problem with the non-primary GPU VRAM spike, setting CUDA_VISIBLE_DEVICES to 1 works well, but only when I set the device to "cuda:0" instead of "cuda:1".

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:0")

Data is correctly allocated at GPU 1 as a result.
Otherwise, when I use .to("cuda:1"), it causes this error:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

2.

I confirmed that the residual VRAM after freeing the memory is indeed shared among multiple models.


Thanks for the help! Really appreciate it.

@SunMarc
Member

SunMarc commented Sep 30, 2024

For the first problem with the error of non-primary GPU VRAM spike, setting the CUDA_VISIBLE_DEVICES to 1 works well, but only when I set the device to "cuda:0" instead of "cuda:1"

clip_text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda:0")
Data is correctly allocated at GPU 1 as a result.
Otherwise, when I use .to("cuda:1"), it causes this error:

Yes, you should do .to("cuda:0"), because setting CUDA_VISIBLE_DEVICES=1 means that you only see GPU number 1, so "cuda:0" targets this GPU, and since there are no other visible GPUs, "cuda:1" is not a valid device ordinal.
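
To make the remapping concrete, a small sketch of the behavior described above:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # only physical GPU 1 is exposed to this process

import torch

x = torch.zeros(1, device="cuda:0")  # allocates on physical GPU 1
# torch.zeros(1, device="cuda:1")    # would fail: CUDA error: invalid device ordinal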

If your issue is fixed, feel free to close the issue. Thanks!
