Add GLIGEN implementation #4441

nikhil-masterful · 2023-08-02T22:51:44Z

What does this PR do?

GLIGEN: Open-Set Grounded Text-to-Image Generation (CVPR 2023)
Project page - https://gligen.github.io/
Paper - https://arxiv.org/abs/2301.07093


nikhil-masterful · 2023-08-03T04:37:45Z

@patrickvonplaten @sayakpaul @stevhliu
Could anyone please help me resolve this failed check?

sayakpaul · 2023-08-03T06:58:24Z

Create a virtual Python environment.
Go to your local clone of diffusers.
Run pip install -e .[quality].
Then run make fix-copies.

nikhil-masterful · 2023-08-03T16:51:42Z

@sayakpaul : Thank you. All checks passed.
Tagging @patrickvonplaten @stevhliu as well for the review.

isamu-isozaki · 2023-08-03T17:07:52Z

Very awesome!

patrickvonplaten · 2023-08-04T10:39:50Z

src/diffusers/models/unet_2d_condition.py

@@ -63,6 +63,62 @@ class UNet2DConditionOutput(BaseOutput):
    sample: torch.FloatTensor = None


+class FourierEmbedder(nn.Module):


Can we move this to https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py

patrickvonplaten · 2023-08-04T10:39:59Z

src/diffusers/models/unet_2d_condition.py

+        return torch.stack((x.sin(), x.cos()), dim=-1).permute(0, 1, 3, 4, 2).reshape(*x.shape[:2], -1)
+
+
+class PositionNet(nn.Module):


Can we also move this to https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/embeddings.py
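
For readers following the move to embeddings.py, here is a minimal pure-Python sketch of what the sin/cos stacking in the diff line above appears to compute for a single box. The function name `fourier_embed_box`, the `num_freqs` and `temperature` defaults, and the geometric frequency spacing are assumptions for illustration. The PR's FourierEmbedder works on batched torch tensors and may differ in detail.

```python
import math


def fourier_embed_box(box, num_freqs=8, temperature=100.0):
    """Embed one (xmin, ymin, xmax, ymax) box with sin/cos features.

    Hypothetical helper: num_freqs, temperature and the frequency
    spacing are illustrative assumptions, not values from the PR.
    """
    # Geometrically spaced frequencies: temperature ** (k / num_freqs).
    freqs = [temperature ** (k / num_freqs) for k in range(num_freqs)]

    features = []
    # Assuming the tensor carries a trailing frequency axis, the
    # stack(sin, cos) -> permute(0, 1, 3, 4, 2) -> reshape chain flattens
    # with frequency slowest, then sin/cos, then the box coordinate.
    for freq in freqs:
        for fn in (math.sin, math.cos):
            for coord in box:
                features.append(fn(freq * coord))
    return features


embedding = fourier_embed_box((0.1, 0.2, 0.6, 0.9))
print(len(embedding))  # num_freqs * 2 * 4 values per box
```

Each box ends up with `num_freqs * 2 * 4` features, which is why the embedder's output width depends only on the number of frequencies.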

patrickvonplaten · 2023-08-04T10:43:10Z

src/diffusers/models/unet_2d_condition.py

@@ -202,6 +258,7 @@ def __init__(
        conv_in_kernel: int = 3,
        conv_out_kernel: int = 3,
        projection_class_embeddings_input_dim: Optional[int] = None,
+        use_gated_attention: bool = False,


Suggested change

-        use_gated_attention: bool = False,
+        attention_type: str = "default",  # gated

When introducing new config variables let's make sure we can extend them going forward. Using a string-type variable would be nice
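
To illustrate why a string-typed option extends more easily than a boolean, here is a small sketch of resolving `attention_type` from a config dictionary. The helper `resolve_attention_type`, the `SUPPORTED_ATTENTION_TYPES` tuple, and the fallback to a legacy `use_gated_attention` flag are all hypothetical and not part of this PR.

```python
# Hypothetical registry: supporting a new attention variant later only
# means appending a name here, without adding another boolean flag.
SUPPORTED_ATTENTION_TYPES = ("default", "gated")


def resolve_attention_type(config):
    """Return the attention type named by a config dict.

    Illustrative only: the legacy-flag fallback is an assumption about
    how older configs could be kept loadable, not behavior of diffusers.
    """
    if "attention_type" in config:
        attention_type = config["attention_type"]
    elif config.get("use_gated_attention", False):
        # Map the old boolean onto the new string value.
        attention_type = "gated"
    else:
        attention_type = "default"

    if attention_type not in SUPPORTED_ATTENTION_TYPES:
        raise ValueError(
            f"Unknown attention_type {attention_type!r}; "
            f"expected one of {SUPPORTED_ATTENTION_TYPES}."
        )
    return attention_type


print(resolve_attention_type({"attention_type": "gated"}))
print(resolve_attention_type({"use_gated_attention": True}))
```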

@sayakpaul to fix this comment, I had to create a new repo for the weights so that I could modify the unet/config file.
The model weights are an exact copy of the original weights.

We should then submit PRs to the original model repository and tag the authors there.

I can do that, but it will break their fork of diffusers. I'm not sure if they would prefer that

So, IIUC the existing checkpoints from the gligen organization won't work with the current implementation that is being added in the PR?

I completely agree with you. I've reached out to their author haotian-liu at liuhaotian.cn@gmail.com to check out this PR, but haven't heard back from them.

Maybe open a discussion on their model repository? Feel free to tag me.

Opened a discussion here

Alright thanks much!

Meanwhile, I think we can knock off the other pending comments.

Thank you. Will fix the pending comments today

src/diffusers/models/transformer_2d.py

src/diffusers/models/attention.py

patrickvonplaten

Looks good to me in general! @sayakpaul @yiyixuxu do you want to give this a pass?

sayakpaul · 2023-08-04T14:39:21Z

Yes, I will.

examples/gligen/generation_text_box.py

examples/gligen/inpainting_text_box.py

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_gligen.py

sayakpaul · 2023-08-04T17:57:22Z

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_gligen.py

+
+class StableDiffusionGLIGENPipeline(DiffusionPipeline):
+    r"""
+    Pipeline for text-to-image generation using Stable Diffusion.


Needs to change.

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_gligen.py

nikhil-masterful · 2023-08-11T19:00:23Z

@yiyixuxu Thanks for the review. I fixed all your comments and @sayakpaul comments as well.
Working on adding a test right now.

sayakpaul · 2023-08-12T17:15:58Z

@nikhil-masterful until we hear back from the GLIGEN authors, I think it's okay to have all the checkpoints under your HF profile with the latest configs. WDYT?

nikhil-masterful · 2023-08-12T19:32:57Z

Agreed

nikhil-masterful · 2023-08-13T06:25:49Z

Did a longer pass and it's looking amazing.

I think we're yet to add tests, no? Let's add some tests here. Completely okay to just add fast tests for now.

@yiyixuxu could you give this a check as well?

@sayakpaul @yiyixuxu : Added a FastTest as requested.

sayakpaul

Looking fantastic! Thanks so much for iterating! I guess the only remaining items are:

Let me know if anything is unclear.

nikhil-masterful · 2023-08-14T03:04:14Z

Looking fantastic! Thanks so much for iterating! I guess the only remainings are:

https://github.com/huggingface/diffusers/pull/4441/files#r1289496701

Add GLIGEN implementation #4441 (comment)

Let me know if anything is unclear.

@sayakpaul :

I've updated docstring with both the cases. Case 1 : img2img, Case 2 : text2img
Add GLIGEN implementation #4441 (comment) - Is this about waiting to hear back from the GLIGEN authors? It would be really helpful for my company if we could merge GLIGEN soon; we are waiting on it so that we can run it straight from diffusers.

sayakpaul · 2023-08-14T04:14:43Z

I've updated docstring with both the cases. Case 1 : img2img, Case 2 : text2img

Works perfect! @stevhliu is this how you would have expected to see multiple example use cases for a pipeline to be included in the corresponding doc?

#4441 (comment) - Is this about waiting to hear back from GLIGEN authors ? It'll be really helpful for my company if we can merge GLIGEN soon, we are waiting on it so that we can run it straight from diffusers .

This is about having clones of the original GLIGEN checkpoints under your HF profile with the new configuration change. This will allow users to use all the available GLIGEN checkpoints directly from diffusers, right?

Also, instead of doing this, could we update the examples to something like so?

from diffusers.utils import make_image_grid

images = pipe(
    prompt=prompt,
    num_images_per_prompt=1,
    gligen_phrases=phrases,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=1,
    num_inference_steps=50,
).images

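# rows * cols must equal len(images); with a single prompt and
# num_images_per_prompt=1 that is a 1 x 1 grid.
make_image_grid(images, rows=1, cols=1, resize=256)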

It's simpler and doesn't make use of additional dependencies like torchvision. make_image_grid() utility was recently introduced and here's the official doc: https://huggingface.co/docs/diffusers/main/en/api/utilities#diffusers.utils.make_image_grid.

docs/source/en/api/pipelines/stable_diffusion/gligen.md

stevhliu · 2023-08-14T19:55:41Z

Works perfect! @stevhliu is this how you would have expected to see multiple example use cases for a pipeline to be included in the corresponding doc?

Yeah since there aren't separate pipelines (for example, StableDiffusionGLIGENImg2Img), this'll work just fine!

yiyixuxu

thanks for iterating! It looks much better now.
I left some more comments here :)

tests/pipelines/pipeline_params.py

tests/pipelines/stable_diffusion/test_stable_diffusion_gligen.py

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_gligen.py

yiyixuxu · 2023-08-14T19:04:11Z

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_gligen.py

+        width = width or self.unet.config.sample_size * self.vae_scale_factor
+
+        # 1. Check inputs. Raise error if not correct
+        self.check_inputs(


Can we move the logic that checks these three parameters into this function instead?

`gligen_phrases`, `gligen_boxes`, `gligen_inpaint_image`

Moved the gligen_phrases and gligen_boxes checks to check_inputs.
gligen_inpaint_image is always valid because None is also a valid input for the text2img case.
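
As a rough illustration of the kind of validation discussed here, the sketch below checks that phrases and boxes line up. The standalone function `check_gligen_inputs` is hypothetical, and the coordinate-ordering check is an added assumption. The pipeline's actual check_inputs may validate different conditions.

```python
def check_gligen_inputs(gligen_phrases, gligen_boxes):
    """Validate grounding inputs before they reach the UNet.

    Hypothetical standalone helper for illustration only.
    """
    if len(gligen_phrases) != len(gligen_boxes):
        raise ValueError(
            "gligen_phrases and gligen_boxes must have the same length, "
            f"got {len(gligen_phrases)} phrases and {len(gligen_boxes)} boxes."
        )
    for index, box in enumerate(gligen_boxes):
        # Each box is (xmin, ymin, xmax, ymax), as in the pipeline comments.
        if len(box) != 4:
            raise ValueError(f"Box {index} must have 4 values, got {len(box)}.")
        xmin, ymin, xmax, ymax = box
        # Assumption: a box with min > max is treated as malformed.
        if xmin > xmax or ymin > ymax:
            raise ValueError(f"Box {index} has min > max: {box}.")


check_gligen_inputs(
    ["a cat", "a dog"],
    [[0.1, 0.2, 0.4, 0.6], [0.5, 0.3, 0.9, 0.8]],
)
```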

yiyixuxu · 2023-08-14T19:19:29Z

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_gligen.py

+            dtype = self.text_encoder.dtype
+            # For each entity, described in phrases, is denoted with a bounding box,
+            # we represent the location information as (xmin,ymin,xmax,ymax)
+            boxes = torch.zeros(max_objs, 4, device=device, dtype=dtype)
+            boxes[:n_objs] = torch.tensor(_boxes[:n_objs])
+            text_embeddings = torch.zeros(max_objs, self.unet.cross_attention_dim, device=device, dtype=dtype)
+            text_embeddings[:n_objs] = _text_embeddings[:n_objs]
+            # Generate a mask for each object that is entity described by phrases
+            masks = torch.zeros(max_objs, device=device, dtype=dtype)
+            masks[:n_objs] = 1


Two questions here:

Is there any reason _text_embeddings and gligen_boxes would not have length n_objs? I made my suggestion on the assumption that both already have length n_objs, but let me know if that's not the case.

Do boxes and text_embeddings have to be a fixed size of max_objs? Can't they just match the number of objects we passed, so we don't have to fill the rest of the tensors with zeros?

Suggested change

-            dtype = self.text_encoder.dtype
-            # For each entity, described in phrases, is denoted with a bounding box,
-            # we represent the location information as (xmin,ymin,xmax,ymax)
-            boxes = torch.zeros(max_objs, 4, device=device, dtype=dtype)
-            boxes[:n_objs] = torch.tensor(_boxes[:n_objs])
-            text_embeddings = torch.zeros(max_objs, self.unet.cross_attention_dim, device=device, dtype=dtype)
-            text_embeddings[:n_objs] = _text_embeddings[:n_objs]
-            # Generate a mask for each object that is entity described by phrases
-            masks = torch.zeros(max_objs, device=device, dtype=dtype)
-            masks[:n_objs] = 1
+            # For each entity, described in phrases, is denoted with a bounding box,
+            # we represent the location information as (xmin,ymin,xmax,ymax)
+            boxes = torch.zeros(max_objs, 4, device=device, dtype=self.text_encoder.dtype)
+            boxes[:n_objs] = torch.tensor(gligen_boxes)
+            text_embeddings = torch.zeros(max_objs, self.unet.cross_attention_dim, device=device, dtype=self.text_encoder.dtype)
+            text_embeddings[:n_objs] = _text_embeddings
+            # Generate a mask for each object that is entity described by phrases
+            masks = torch.zeros(max_objs, device=device, dtype=self.text_encoder.dtype)
+            masks[:n_objs] = 1

Changed _text_embeddings and gligen_boxes to have length n_objs.

Do boxes and text_embeddings have to be a fixed size of max_objs? Yes, that's how the GLIGEN authors intended it. It would be good to keep it that way for now.
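
The fixed-size layout can be pictured with plain lists: every input is padded to max_objs slots, and a mask records which slots hold real objects. This is a sketch under assumptions. The helper `pad_grounding_inputs` is hypothetical, it truncates extra objects beyond max_objs (an assumption), and the pipeline itself uses torch tensors rather than lists.

```python
def pad_grounding_inputs(boxes, embeddings, max_objs, embed_dim):
    """Pad boxes and embeddings to max_objs slots and build a 0/1 mask.

    Hypothetical list-based stand-in for the tensor code in the diff.
    """
    # Assumption: objects beyond max_objs are dropped.
    n_objs = min(len(boxes), max_objs)

    padded_boxes = [[0.0] * 4 for _ in range(max_objs)]
    padded_embeddings = [[0.0] * embed_dim for _ in range(max_objs)]
    masks = [0.0] * max_objs

    for i in range(n_objs):
        padded_boxes[i] = list(boxes[i])
        padded_embeddings[i] = list(embeddings[i])
        masks[i] = 1.0  # slot i holds a real object

    return padded_boxes, padded_embeddings, masks


boxes, embeddings, masks = pad_grounding_inputs(
    boxes=[[0.1, 0.2, 0.3, 0.4]],
    embeddings=[[1.0, 2.0]],
    max_objs=3,
    embed_dim=2,
)
print(masks)
```

With a constant max_objs, downstream layers always see the same input shape, and the mask lets them ignore the zero-filled slots.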

src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_gligen.py

nikhil-masterful · 2023-08-15T04:48:52Z

@sayakpaul

Yes, my HF profile with the new configuration will allow users to use all the available GLIGEN checkpoints directly from diffusers.
Regarding the example in the docstring, I got rid of torchvision and just used the output PIL image, as @yiyixuxu recommended in this comment.

sayakpaul · 2023-08-15T04:56:14Z

@nikhil-masterful I only see two checkpoints here: https://huggingface.co/masterful. Are these the only two supported from GLIGEN officially?

If so, yeah, then that's checked.

@yiyixuxu could you give this a final look?

nikhil-masterful · 2023-08-15T05:23:05Z

Yes, those were the only two supported from GLIGEN officially

nikhil-masterful · 2023-08-15T05:24:00Z

@sayakpaul @yiyixuxu Thanks for reviewing this. I've fixed all the outstanding comments. Please let me know if I missed anything

yiyixuxu

Looking great to me! thanks!

nikhil-masterful · 2023-08-16T01:56:36Z

@sayakpaul if it looks good, can we merge please?

sayakpaul

Thanks so much for iterating!

nikhil-masterful · 2023-08-16T04:16:22Z

Thanks for helping me make this contribution. It was a great experience.

@sayakpaul I would like to continue contributing to diffusers. Could you please direct me to any outstanding bug or feature that I can work on? I can iterate on things faster.

sayakpaul · 2023-08-16T04:51:03Z

Thanks so much for being willing to do that! I would redirect you to our issues thread and see what interests you and we can take it from there.

* Add GLIGEN implementation
* GLIGEN: Fix code quality check failures
* GLIGEN: Fix Import block un-sorted or un-formatted failures
* GLIGEN: Fix check_repository_consistency failures
* GLIGEN: Add 'PositionNet' to versatile_diffusion/modeling_text_unet.py
* GLIGEN: check_repository_consistency: fix 'copy does not match' error
* GLIGEN: Fix review comments (1)
* GLIGEN: Fix E721 Do not compare types, use `isinstance()` failures
* GLIGEN : Ensure _encode_prompt() copy matches to StableDiffusionPipeline
* GLIGEN: Fix ruff E721 failure in unidiffuser/test_unidiffuser.py
* GLIGEN: doc_builder: restyle pipeline_stable_diffusion_gligen.py
* GIGLEN: reset files unrelated to gligen
* GLIGEN: Fix documentation comments (1)
* GLIGEN: Fix review comments (2)
* GLIGEN: Added FastTest
* GLIGEN: Fix review comments (3)

Add GLIGEN implementation

7749b7c

nikhil-masterful added 4 commits August 2, 2023 16:38

GLIGEN: Fix code quality check failures

2c63533

GLIGEN: Fix Import block un-sorted or un-formatted failures

3a71fa3

GLIGEN: Fix check_repository_consistency failures

a7d798f

GLIGEN: Add 'PositionNet' to versatile_diffusion/modeling_text_unet.py

db78984

nikhil-masterful marked this pull request as draft August 3, 2023 20:08

nikhil-masterful marked this pull request as ready for review August 3, 2023 20:09

GLIGEN: check_repository_consistency: fix 'copy does not match' error

ee273a3
