add Paella (Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces) #2058

Closed
wants to merge 2 commits into from

@aengusng8 aengusng8 commented Jan 21, 2023

Hi, @patrickvonplaten. This is the draft PR we recently discussed on Discord. I am about 90% done porting Paella to our library, and I need your help to finish it.

What is done?

The "Pipeline" and "Scheduler" are ready to run; see my Kaggle notebook: https://www.kaggle.com/code/aengusng/notebookd7ca68b633/notebook
Note: run this on CPU only in Colab, or on CPU/GPU in Kaggle.

Current blockers?

I have a few questions that I would appreciate your help with:

  1. Should I use layers, blocks, or models in the diffusers\src\diffusers\models folder to replace some parts of the original Paella model class, or should I keep the original Paella model class unchanged?
  2. Can the code contain additional libraries such as einops, rudalle, and open_clip_torch, since they are part of the author's code?
  3. Their vqvae is initialized from rudalle.get_vae, and their text_encoder and tokenizer are initialized from open_clip. How should I save and upload the model classes/configurations of the vqvae, text_encoder, and tokenizer, which live outside Diffusers (like this: https://huggingface.co/CompVis/stable-diffusion-v1-4)?

What is next?

  1. Once these blockers are resolved, I plan to add three more pipelines: outpainting, image variation, and image interpolation.
  2. Lastly, I will include some tests.

Updated: Closed this PR because comparing internal and external models takes time and deliberation.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@patrickvonplaten

Hey @aengusng8,

Super cool! This already looks great :-) Please ping me or @pcuenca if you'd like a review.

Comment on lines +227 to +399
        self.c_r = c_r
        self.down_levels = down_levels
        self.up_levels = up_levels
        c_levels = [c_hidden // (2**i) for i in reversed(range(len(down_levels)))]
        self.embedding = nn.Embedding(num_vec_classes, c_levels[0])

        # DOWN BLOCKS
        self.down_blocks = nn.ModuleList()
        for i, num_blocks in enumerate(down_levels):
            blocks = []
            if i > 0:
                blocks.append(nn.Conv2d(c_levels[i - 1], c_levels[i], kernel_size=4, stride=2, padding=1))
            for _ in range(num_blocks):
                block = ResBlock(c_levels[i], c_levels[i] * 4, c_clip + c_r)
                block.channelwise[-1].weight.data *= np.sqrt(1 / sum(down_levels))
                blocks.append(block)
            self.down_blocks.append(nn.ModuleList(blocks))

        # UP BLOCKS
        self.up_blocks = nn.ModuleList()
        for i, num_blocks in enumerate(up_levels):
            blocks = []
            for j in range(num_blocks):
                block = ResBlock(
                    c_levels[len(c_levels) - 1 - i],
                    c_levels[len(c_levels) - 1 - i] * 4,
                    c_clip + c_r,
                    c_levels[len(c_levels) - 1 - i] if (j == 0 and i > 0) else 0,
                )
                block.channelwise[-1].weight.data *= np.sqrt(1 / sum(up_levels))
                blocks.append(block)
            if i < len(up_levels) - 1:
                blocks.append(
                    nn.ConvTranspose2d(
                        c_levels[len(c_levels) - 1 - i],
                        c_levels[len(c_levels) - 2 - i],
                        kernel_size=4,
                        stride=2,
                        padding=1,
                    )
                )
            self.up_blocks.append(nn.ModuleList(blocks))

        self.clf = nn.Conv2d(c_levels[0], num_vec_classes, kernel_size=1)

    def gamma(self, r):
        return (r * torch.pi / 2).cos()

    def gen_r_embedding(self, r, max_positions=10000):
        dtype = r.dtype
        r = self.gamma(r) * max_positions
        half_dim = self.c_r // 2
        emb = math.log(max_positions) / (half_dim - 1)
        emb = torch.arange(half_dim, device=r.device).float().mul(-emb).exp()
        emb = r[:, None] * emb[None, :]
        emb = torch.cat([emb.sin(), emb.cos()], dim=1)
        if self.c_r % 2 == 1:  # zero pad
            emb = nn.functional.pad(emb, (0, 1), mode="constant")
        return emb.to(dtype)

    def _down_encode_(self, x, s):
        level_outputs = []
        for i, blocks in enumerate(self.down_blocks):
            for block in blocks:
                if isinstance(block, ResBlock):
                    # s_level = s[:, 0]
                    # s = s[:, 1:]
                    x = block(x, s)
                else:
                    x = block(x)
            level_outputs.insert(0, x)
        return level_outputs

    def _up_decode(self, level_outputs, s):
        x = level_outputs[0]
        for i, blocks in enumerate(self.up_blocks):
            for j, block in enumerate(blocks):
                if isinstance(block, ResBlock):
                    # s_level = s[:, 0]
                    # s = s[:, 1:]
                    if i > 0 and j == 0:
                        x = block(x, s, level_outputs[i])
                    else:
                        x = block(x, s)
                else:
                    x = block(x)
        return x

    def forward(self, x, c, r):  # r is a uniform value between 0 and 1
        r_embed = self.gen_r_embedding(r)
        x = self.embedding(x).permute(0, 3, 1, 2)
        if len(c.shape) == 2:
            s = torch.cat([c, r_embed], dim=-1)[:, :, None, None]
        else:
            r_embed = r_embed[:, :, None, None].expand(-1, -1, c.size(2), c.size(3))
            s = torch.cat([c, r_embed], dim=1)
        level_outputs = self._down_encode_(x, s)
        x = self._up_decode(level_outputs, s)
        x = self.clf(x)
        return x
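
For readers following the excerpt, the noise-level embedding computed by `gamma` and `gen_r_embedding` can be sketched in plain Python for a single scalar `r`. This is only an illustrative mirror of the excerpt, not code from the PR: the function name `r_embedding` and the default `c_r=8` are assumptions chosen for the example, and like the original it needs `c_r >= 4` so that `half_dim - 1` is non-zero.

```python
import math


def gamma(r):
    # Cosine schedule from the excerpt: gamma(r) = cos(r * pi / 2).
    return math.cos(r * math.pi / 2)


def r_embedding(r, c_r=8, max_positions=10000):
    """Plain-Python mirror of gen_r_embedding for one scalar r in [0, 1].

    `r_embedding` and the default c_r are hypothetical names/values used
    only for this sketch.
    """
    scaled = gamma(r) * max_positions
    half_dim = c_r // 2
    step = math.log(max_positions) / (half_dim - 1)
    # Geometrically spaced frequencies: exp(-step * k) for k = 0..half_dim-1.
    freqs = [math.exp(-step * k) for k in range(half_dim)]
    # All sines first, then all cosines, matching torch.cat([sin, cos], dim=1).
    emb = [math.sin(scaled * f) for f in freqs] + [math.cos(scaled * f) for f in freqs]
    if c_r % 2 == 1:
        emb.append(0.0)  # zero pad to reach an odd c_r
    return emb


if __name__ == "__main__":
    print(len(r_embedding(0.3)), len(r_embedding(0.3, c_r=9)))
```

The embedding length always equals `c_r`, which is why the conditioning tensor `s` in `forward` has `c_clip + c_r` channels.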

@aengusng8 aengusng8 Feb 1, 2023

Hi @pcuenca (cc @patrickvonplaten), should I use layers, blocks, or models in the diffusers\src\diffusers\models folder to replace some parts of the original Paella model class, or should I keep the original Paella model class unchanged?

Hey @aengusng8,

No worries! Thanks a lot for working on this :-)

It would be amazing if you could try to "mold" your code into the existing UNet2DConditionModel class:

class UNet2DConditionModel(ModelMixin, ConfigMixin, UNet2DConditionLoadersMixin):
Given that Paella uses a text-conditioned UNet, it should fit.

Also we've just added a design philosophy that might help: https://huggingface.co/docs/diffusers/main/en/conceptual/philosophy

So it would be super cool if you could gauge whether it's possible to "force" the whole modeling code into UNet2DConditionModel. Feel free to design your own new UNet up and down classes.
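
One detail any custom up/down block would presumably need to preserve is how the excerpted code derives channel widths and pairs skip connections. The sketch below traces that bookkeeping in plain Python, with level indices standing in for tensors; the helper names `channel_levels` and `trace_skips` and the example values (`c_hidden=1280`, `down_levels=[4, 8, 16]`, `up_levels=[16, 8, 4]`) are illustrative assumptions, not values confirmed by this PR.

```python
def channel_levels(c_hidden, down_levels):
    # Same expression as in the excerpt: widths grow with depth,
    # and the deepest level has c_hidden channels.
    return [c_hidden // (2 ** i) for i in reversed(range(len(down_levels)))]


def trace_skips(down_levels, up_levels):
    """Return (up_level, down_level) pairs that share a skip connection.

    Mirrors _down_encode_ / _up_decode from the excerpt, but stores the
    down-level index in place of the feature map.
    """
    level_outputs = []
    for down_idx in range(len(down_levels)):
        # insert(0, ...) puts the deepest level first.
        level_outputs.insert(0, down_idx)
    pairs = []
    for up_idx in range(len(up_levels)):
        # Up level 0 starts from level_outputs[0] and takes no skip;
        # later levels receive level_outputs[up_idx] in their first ResBlock.
        if up_idx > 0:
            pairs.append((up_idx, level_outputs[up_idx]))
    return pairs


if __name__ == "__main__":
    print(channel_levels(1280, [4, 8, 16]))       # [320, 640, 1280]
    print(trace_skips([4, 8, 16], [16, 8, 4]))    # [(1, 1), (2, 0)]
```

With three levels, up level 1 receives the output of down level 1 and up level 2 receives the output of down level 0, so each skip matches the channel width `c_levels[len(c_levels) - 1 - i]` used when the up-path ResBlocks are built.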

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Feb 28, 2023
@github-actions github-actions bot closed this Mar 8, 2023
@dblunk88

Any updates?

Is this based on https://github.com/dome272/Paella ?

@aengusng8

Hi @dblunk88,

  1. No
  2. Yes, but it is based on an older version of Paella (Paella was recently updated).

@patrickvonplaten patrickvonplaten mentioned this pull request Apr 17, 2023
2 tasks