add Paella (Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces) #2058

aengusng8 · 2023-01-21T18:37:17Z

Hi, @patrickvonplaten. This is the draft PR that we recently mentioned on Discord. I am about 90% done porting Paella to our library, and I need your help to finish it.

What is done?

The "Pipeline" and "Scheduler" are ready to run; see my Kaggle notebook: https://www.kaggle.com/code/aengusng/notebookd7ca68b633/notebook
Note: in Colab, run this on CPU only; in Kaggle, it runs on either CPU or GPU.

Current bottleneck problems?

I have a few questions that I would appreciate your help with:

Should I use layers, blocks, or models in the diffusers\src\diffusers\models folder to replace some parts of the original Paella model class, or should I keep the original Paella model class unchanged?
Can the code contain additional libraries such as einops, rudalle, and open_clip_torch, since they are part of the author's code?
Their vqvae is initialized from rudalle.get_vae, and their text_encoder and tokenizer are initialized from open_clip. How should I save and upload the model classes/configurations of the vqvae, text_encoder, and tokenizer, which live outside Diffusers (like this: https://huggingface.co/CompVis/stable-diffusion-v1-4)?

What is next?

Once this bottleneck is resolved, I plan to add three more pipelines: outpainting, image variation, and image interpolation.
Lastly, I will include some tests.

Update: I closed this PR because comparing internal and external models takes time and deliberation.

HuggingFaceDocBuilderDev · 2023-01-21T18:42:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

patrickvonplaten · 2023-01-23T07:32:24Z

Hey @aengusng8,

Super cool! This already looks great :-) Please ping me or @pcuenca if you'd like a review.

aengusng8 · 2023-02-01T02:41:30Z

src/diffusers/pipelines/paella/pipeline_paella.py

+        self.c_r = c_r
+        self.down_levels = down_levels
+        self.up_levels = up_levels
+        c_levels = [c_hidden // (2**i) for i in reversed(range(len(down_levels)))]
+        self.embedding = nn.Embedding(num_vec_classes, c_levels[0])
+
+        # DOWN BLOCKS
+        self.down_blocks = nn.ModuleList()
+        for i, num_blocks in enumerate(down_levels):
+            blocks = []
+            if i > 0:
+                blocks.append(nn.Conv2d(c_levels[i - 1], c_levels[i], kernel_size=4, stride=2, padding=1))
+            for _ in range(num_blocks):
+                block = ResBlock(c_levels[i], c_levels[i] * 4, c_clip + c_r)
+                block.channelwise[-1].weight.data *= np.sqrt(1 / sum(down_levels))
+                blocks.append(block)
+            self.down_blocks.append(nn.ModuleList(blocks))
+
+        # UP BLOCKS
+        self.up_blocks = nn.ModuleList()
+        for i, num_blocks in enumerate(up_levels):
+            blocks = []
+            for j in range(num_blocks):
+                block = ResBlock(
+                    c_levels[len(c_levels) - 1 - i],
+                    c_levels[len(c_levels) - 1 - i] * 4,
+                    c_clip + c_r,
+                    c_levels[len(c_levels) - 1 - i] if (j == 0 and i > 0) else 0,
+                )
+                block.channelwise[-1].weight.data *= np.sqrt(1 / sum(up_levels))
+                blocks.append(block)
+            if i < len(up_levels) - 1:
+                blocks.append(
+                    nn.ConvTranspose2d(
+                        c_levels[len(c_levels) - 1 - i],
+                        c_levels[len(c_levels) - 2 - i],
+                        kernel_size=4,
+                        stride=2,
+                        padding=1,
+                    )
+                )
+            self.up_blocks.append(nn.ModuleList(blocks))
+
+        self.clf = nn.Conv2d(c_levels[0], num_vec_classes, kernel_size=1)
+
+    def gamma(self, r):
+        return (r * torch.pi / 2).cos()
+
+    def gen_r_embedding(self, r, max_positions=10000):
+        dtype = r.dtype
+        r = self.gamma(r) * max_positions
+        half_dim = self.c_r // 2
+        emb = math.log(max_positions) / (half_dim - 1)
+        emb = torch.arange(half_dim, device=r.device).float().mul(-emb).exp()
+        emb = r[:, None] * emb[None, :]
+        emb = torch.cat([emb.sin(), emb.cos()], dim=1)
+        if self.c_r % 2 == 1:  # zero pad
+            emb = nn.functional.pad(emb, (0, 1), mode="constant")
+        return emb.to(dtype)
+
+    def _down_encode_(self, x, s):
+        level_outputs = []
+        for i, blocks in enumerate(self.down_blocks):
+            for block in blocks:
+                if isinstance(block, ResBlock):
+                    # s_level = s[:, 0]
+                    # s = s[:, 1:]
+                    x = block(x, s)
+                else:
+                    x = block(x)
+            level_outputs.insert(0, x)
+        return level_outputs
+
+    def _up_decode(self, level_outputs, s):
+        x = level_outputs[0]
+        for i, blocks in enumerate(self.up_blocks):
+            for j, block in enumerate(blocks):
+                if isinstance(block, ResBlock):
+                    # s_level = s[:, 0]
+                    # s = s[:, 1:]
+                    if i > 0 and j == 0:
+                        x = block(x, s, level_outputs[i])
+                    else:
+                        x = block(x, s)
+                else:
+                    x = block(x)
+        return x
+
+    def forward(self, x, c, r):  # r is a uniform value between 0 and 1
+        r_embed = self.gen_r_embedding(r)
+        x = self.embedding(x).permute(0, 3, 1, 2)
+        if len(c.shape) == 2:
+            s = torch.cat([c, r_embed], dim=-1)[:, :, None, None]
+        else:
+            r_embed = r_embed[:, :, None, None].expand(-1, -1, c.size(2), c.size(3))
+            s = torch.cat([c, r_embed], dim=1)
+        level_outputs = self._down_encode_(x, s)
+        x = self._up_decode(level_outputs, s)
+        x = self.clf(x)
+        return x
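
To make two pieces of the hunk above easier to follow, here is a minimal, dependency-free sketch of (a) how `c_levels` assigns a channel width to each resolution level and (b) how `gamma` and `gen_r_embedding` turn a noise level `r` into a sinusoidal vector. It is an illustration only, not the PR's code: it handles a single scalar `r` with plain Python lists instead of a batched tensor, the function names `channel_levels` and `r_embedding` are made up for this sketch, and the values `c_hidden=1024`, `down_levels=[4, 8, 16]`, and `c_r=64` are assumptions chosen for the example.

```python
import math


def channel_levels(c_hidden, down_levels):
    """Channel width per level, shallowest first (mirrors `c_levels` in the hunk)."""
    n = len(down_levels)
    return [c_hidden // (2 ** i) for i in reversed(range(n))]


def gamma(r):
    """Cosine schedule: maps r = 0 to 1 and r = 1 to (approximately) 0."""
    return math.cos(r * math.pi / 2)


def r_embedding(r, c_r, max_positions=10000):
    """Sinusoidal embedding of one noise level r, as a list of length c_r.

    Like the original, this needs c_r >= 4 so that half_dim - 1 is nonzero.
    """
    pos = gamma(r) * max_positions
    half_dim = c_r // 2
    step = math.log(max_positions) / (half_dim - 1)
    # Geometric frequency ladder from 1 down to 1 / max_positions.
    freqs = [math.exp(-step * k) for k in range(half_dim)]
    emb = [math.sin(pos * f) for f in freqs] + [math.cos(pos * f) for f in freqs]
    if c_r % 2 == 1:
        emb.append(0.0)  # zero pad odd widths, as in the hunk
    return emb


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    print(channel_levels(1024, [4, 8, 16]))
    print(len(r_embedding(0.3, 64)))
```

In the hunk, the resulting vector is concatenated with the CLIP embedding `c` to form the conditioning `s` that every `ResBlock` receives.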


Hi @pcuenca (cc @patrickvonplaten), should I use layers, blocks, or models in the diffusers\src\diffusers\models folder to replace some parts of the original Paella model class, or should I keep the original Paella model class unchanged?

patrickvonplaten · 2023-02-03

Hey @aengusng8,

No worries! Thanks a lot for working on this :-)

It would be amazing if you could try to "mold" your code into the existing UNet2DConditionModel class:

diffusers/src/diffusers/models/unet_2d_condition.py

Line 53 in 2f9a70a

class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):

Given that Paella uses a text-conditioned UNet, it should fit.

Also we've just added a design philosophy that might help: https://huggingface.co/docs/diffusers/main/en/conceptual/philosophy

So it would be super cool if you could gauge whether it's possible to "force" the whole modeling code into UNet2DConditionModel. Feel free to design your own new UNet up and down classes.

github-actions · 2023-02-28T15:03:26Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

dblunk88 · 2023-04-16T21:37:54Z

Any updates?

Is this based on https://github.com/dome272/Paella ?

aengusng8 · 2023-04-17T04:29:12Z

Hi @dblunk88,

No updates.
Yes, but this is based on the old Paella (Paella was updated recently).

aengusng8 added 2 commits January 21, 2023 23:24

Initial commit

c3071fb

add gamma function

8cbb831

patrickvonplaten assigned pcuenca and patrickvonplaten Jan 23, 2023

github-actions bot added the stale Issues that haven't received updates label Feb 28, 2023

github-actions bot closed this Mar 8, 2023

patrickvonplaten mentioned this pull request Apr 17, 2023

Paella v3 #3134

Closed

2 tasks
