Added LOWVRAM env variable to text-to-image #36

Closed
wants to merge 1 commit

Conversation

Titan-Node

Enables sequential CPU offloading for text-to-image, giving roughly a 1.2x reduction in VRAM usage at the cost of about 7.6x longer inference time.

yondonfu (Member) left a comment

Thanks for looking into this! Left some comments.

@@ -94,6 +94,8 @@ def __init__(self, model_id: str):
         self.ldm = AutoPipelineForText2Image.from_pretrained(model_id, **kwargs).to(
             torch_device
         )
+        if os.environ.get("LOWVRAM"):
+            self.ldm.enable_sequential_cpu_offload()
yondonfu (Member)

@Titan-Node Have you tried model offloading instead of sequential CPU offloading? Would be curious how the VRAM savings + inference speed results compare given that model offloading is supposed to have less of an impact on inference speed.
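
For reference, diffusers exposes both variants on the pipeline object; a minimal standalone sketch of the comparison (the model id and prompt are placeholders, not from this PR):

import torch
from diffusers import AutoPipelineForText2Image

model_id = "stabilityai/sd-turbo"  # placeholder model

# Sequential CPU offloading (what this PR enables): lowest VRAM, but weights
# are streamed to the GPU submodule by submodule, so inference is much slower.
pipe = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe(prompt="a photo of an astronaut riding a horse")

# Model offloading: moves whole components (text encoder, UNet, VAE) between
# CPU and GPU, which usually costs less inference time but saves less VRAM.
pipe = AutoPipelineForText2Image.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
pipe(prompt="a photo of an astronaut riding a horse")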

Titan-Node (Author)

Interestingly, model offloading ended up increasing the GPU memory requirements. I tried it with torch_device set to both GPU and CPU, but model offloading either had no effect or increased GPU RAM usage while also slowing down inference by 4x.
I also tried model offloading with the text, image, and video pipelines with the same results, and I gave up on it once I saw the sequential offloading option. I tried both together, but I believe it threw an error.

Here is a breakdown of some tests I did on an H100:
benchmarks.xlsx
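
(The spreadsheet itself isn't reproduced here. For context, a rough sketch of how peak VRAM and latency can be measured per configuration, assuming ldm is the loaded pipeline and the prompt is a placeholder:)

import time
import torch

def benchmark(ldm, prompt="a photo of an astronaut riding a horse"):
    # Reset the peak-memory counter so we only measure this call.
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    ldm(prompt=prompt)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"inference: {elapsed:.1f}s, peak VRAM: {peak_gib:.2f} GiB")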

@@ -94,6 +94,8 @@ def __init__(self, model_id: str):
         self.ldm = AutoPipelineForText2Image.from_pretrained(model_id, **kwargs).to(
             torch_device
         )
+        if os.environ.get("LOWVRAM"):
yondonfu (Member)

L94 will move the pipeline to the GPU when cuda is available (torch_device gets set to cuda) via the .to() call. Based on the diffusers docs it looks like you should not move the pipeline to the GPU first if either CPU or model offloading is used. So, if LOWVRAM is enabled you'd probably want to do something like this:

self.ldm = AutoPipelineForText2Image.from_pretrained(model_id, **kwargs)

if os.environ.get("LOWVRAM"):
    # Enable CPU or model offloading
else:
    self.ldm.to(torch_device)
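
Filled in with the offloading call this PR currently uses (model offloading would be the alternative), that branch might look like:

self.ldm = AutoPipelineForText2Image.from_pretrained(model_id, **kwargs)

if os.environ.get("LOWVRAM"):
    # Keep the pipeline on the CPU and stream weights to the GPU as needed.
    self.ldm.enable_sequential_cpu_offload()
else:
    self.ldm.to(torch_device)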

Titan-Node (Author)

There doesn't seem to be a performance or RAM difference when putting the .to() call before it, but if the docs say not to, then I will make that modification.

Titan-Node (Author)

I'm also running into issues using the SFAST flag together with the LOWVRAM flag, although it seems the SFAST flag is not working anymore by default.

Getting:

  File "/root/.pyenv/versions/3.11.8/lib/python3.11/site-packages/sfast/jit/overrides.py", line 21, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: Cannot copy out of meta tensor; no data!

Either way, I'm going to do some more testing to make sure the LOWVRAM flag is put in the correct place and does not affect the SFAST flag. I'll close this for now and do everything in a single commit.
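
A possible way to keep the two flags from colliding while testing is to make them mutually exclusive inside __init__; the compile_pipeline() call below is a hypothetical stand-in for whatever the existing SFAST code path does:

if os.environ.get("LOWVRAM"):
    # Offloading leaves parts of the model as meta/CPU tensors, which is what
    # the sfast trace above appears to be tripping over, so skip compilation.
    self.ldm.enable_sequential_cpu_offload()
else:
    self.ldm.to(torch_device)
    if os.environ.get("SFAST"):
        # Hypothetical helper standing in for the repo's stable-fast step.
        self.ldm = compile_pipeline(self.ldm)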

yondonfu (Member)

Ah I wonder if stable-fast only works if the entire pipeline is loaded to the CUDA device.
