Added LOWVRAM env variable to text-to-image #36
Conversation
Thanks for looking into this! Left some comments.
@@ -94,6 +94,8 @@ def __init__(self, model_id: str):
         self.ldm = AutoPipelineForText2Image.from_pretrained(model_id, **kwargs).to(
             torch_device
         )
+        if os.environ.get("LOWVRAM"):
+            self.ldm.enable_sequential_cpu_offload()
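One detail worth noting about this gating style: `os.environ.get("LOWVRAM")` is truthy for any non-empty string, so `LOWVRAM=0` or `LOWVRAM=false` would still enable offloading. A minimal sketch of a stricter parse, using a hypothetical helper name (`lowvram_enabled` is not in the PR):

```python
import os


def lowvram_enabled(env=os.environ) -> bool:
    """Hypothetical helper: only explicit truthy values enable LOWVRAM.

    A bare os.environ.get("LOWVRAM") check treats any non-empty string
    (including "0" and "false") as enabling offloading; this restricts
    the flag to a conventional set of truthy spellings.
    """
    return env.get("LOWVRAM", "").strip().lower() in {"1", "true", "yes", "on"}
```

Whether the stricter parse is wanted is a design choice; the simpler truthiness check matches how the PR currently reads the flag.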
@Titan-Node Have you tried model offloading instead of sequential CPU offloading? Would be curious how the VRAM savings + inference speed results compare given that model offloading is supposed to have less of an impact on inference speed.
Interestingly, with model offloading it ended up increasing the GPU memory requirements. I tried it with torch_device set to both GPU and CPU, but model offloading either had no effect or increased GPU RAM while also slowing down inference by 4x.
I also tried model offloading with the text, image, and video pipelines with the same results; I gave up on it once I saw the sequential offloading option. I tried both together, but I believe it threw an error.
Here is a breakdown of some tests I did on an H100.
benchmarks.xlsx
@@ -94,6 +94,8 @@ def __init__(self, model_id: str):
         self.ldm = AutoPipelineForText2Image.from_pretrained(model_id, **kwargs).to(
             torch_device
         )
+        if os.environ.get("LOWVRAM"):
L94 will move the pipeline to the GPU when cuda is available (torch_device gets set to cuda) via the .to() call. Based on the diffusers docs, it looks like you should not move the pipeline to the GPU first if either CPU or model offloading is used. So, if LOWVRAM is enabled you'd probably want to do something like this:
self.ldm = AutoPipelineForText2Image.from_pretrained(model_id, **kwargs)
if os.environ.get("LOWVRAM"):
# Enable CPU or model offloading
else:
self.ldm.to(torch_device)
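One way to make that sketch concrete while keeping it testable is to factor the branching into a small function. This is a hypothetical refactor, not code from the PR; the offloading call shown is diffusers' documented `enable_sequential_cpu_offload()`:

```python
import os


def configure_pipeline(pipeline, torch_device: str, env=os.environ) -> str:
    """Hypothetical wiring of the reviewer's sketch.

    When LOWVRAM is set, the offloading hook manages device placement
    itself, so the pipeline must NOT be moved to the GPU with .to()
    first. Returns the chosen mode so callers (and tests) can inspect it.
    """
    if env.get("LOWVRAM"):
        # Sequential CPU offloading: weights are moved to the GPU one
        # submodule at a time as they are needed during inference.
        pipeline.enable_sequential_cpu_offload()
        return "sequential_cpu_offload"
    pipeline.to(torch_device)
    return "to_device"
```

Because the device logic is isolated, it can be exercised with a stub pipeline object, without loading model weights.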
There doesn't seem to be a performance or RAM difference when putting the .to() call before it, but if the docs say not to then I will make that modification.
I'm also running into issues when using the SFAST flag together with the LOWVRAM flag, although it seems the SFAST flag is not working anymore by default.
Getting:
File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/sfast/jit/overrides.py", line 21, in __torch_function__
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: Cannot copy out of meta tensor; no data!
Either way, I'm going to do some more testing to make sure the LOWVRAM flag is put in the correct place and does not affect the SFAST flag. I'll close this for now and do everything in a single commit.
Ah I wonder if stable-fast only works if the entire pipeline is loaded to the CUDA device.
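That reading is consistent with the traceback: sequential CPU offloading keeps weights on the "meta" device (no backing data) until each layer is needed, so a compiler that tries to copy tensors at trace time has nothing to copy. A hedged sketch of a startup guard that fails fast instead of crashing mid-compile; `check_flag_compatibility` is a hypothetical name, not part of this repo:

```python
import os


def check_flag_compatibility(env=os.environ) -> None:
    """Hypothetical startup guard for mutually exclusive flags.

    If sequential CPU offloading (LOWVRAM) parks weights on the meta
    device, a tracing compiler enabled via SFAST has no tensor data to
    copy, which would surface later as the NotImplementedError above.
    """
    if env.get("LOWVRAM") and env.get("SFAST"):
        raise RuntimeError(
            "LOWVRAM (sequential CPU offload) and SFAST (stable-fast) "
            "cannot be enabled together: offloaded weights live on the "
            "meta device and cannot be traced."
        )
```

Raising at startup makes the incompatibility explicit to operators rather than relying on a cryptic error deep inside the compile path.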
Enables text-to-image sequential CPU offloading: roughly 1.2x savings of VRAM with 7.6x longer inference time.