
[Community Pipeline] UnCLIP image / text interpolations #1869

Closed
2 tasks done
patrickvonplaten opened this issue Dec 30, 2022 · 49 comments

Labels: stale (Issues that haven't received updates)

Comments

@patrickvonplaten
Contributor

Model/Pipeline/Scheduler description

Copied from #1858:

UnCLIP / Karlo: https://huggingface.co/spaces/kakaobrain/karlo gives some very nice and precise results when doing image generation and can strongly outperform Stable Diffusion in some cases - see:
https://www.reddit.com/r/StableDiffusion/comments/zshufz/karlo_the_first_large_scale_open_source_dalle_2/

Another extremely interesting aspect of DALL-E 2 is its ability to interpolate between text and/or image embeddings. See e.g. section 3 of the DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf. This PR now allows passing text embeddings and image embeddings directly, which should enable those tasks!

I think we could create a super cool community pipeline. The pipeline could automatically create interpolations between two text prompts, and similarly we could create one to do interpolations between two images.

In terms of design, the following would make sense to stay as efficient as possible:

    1. The user passes two text prompts and a num_interpolations input.
    2. The pipeline then embeds those two text prompts into the text embeddings x_0 and x_N, and the num_interpolations intermediate embeddings x_1, x_2, ..., x_N-1 are created using the slerp function (see the sketch below).
    3. We then have num_interpolations + 2 text embeddings that should be passed in a batch through the model to create a nice interpolation of images.
    4. It'd be important to make use of enable_cpu_offload() to save memory.
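
For reference, a minimal slerp sketch along the lines of what such a pipeline could use (PyTorch; the function name and signature are illustrative, not an existing diffusers API):

import torch

def slerp(val, v0, v1):
    # Spherical interpolation between two embedding tensors.
    # val: interpolation weight in [0, 1]; v0, v1: tensors of the same shape.
    v0_unit = v0 / torch.norm(v0, dim=-1, keepdim=True)
    v1_unit = v1 / torch.norm(v1, dim=-1, keepdim=True)
    dot = (v0_unit * v1_unit).sum(-1).clamp(-0.9995, 0.9995)  # avoid acos blow-up
    theta = torch.acos(dot)
    sin_theta = torch.sin(theta)
    s0 = torch.sin((1.0 - val) * theta) / sin_theta
    s1 = torch.sin(val * theta) / sin_theta
    return s0.unsqueeze(-1) * v0 + s1.unsqueeze(-1) * v1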

It's probably easier to start with the UnCLIPImageInterpolationPipeline since image embeddings are just a single 1-d vector, whereas for text embeddings two latent vectors are used.

Would be more than happy to help if someone is interested in giving this a try - I think it'll make for some super cool demos.

Open source status

  • The model implementation is available
  • The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

No response

@Abhinay1997
Contributor

Abhinay1997 commented Jan 1, 2023

Hi @patrickvonplaten and @williamberman, are you working on this or can I pick this up?

@patrickvonplaten
Contributor Author

patrickvonplaten commented Jan 1, 2023

@Abhinay1997 feel free to pick it up! We're more than happy to help if needed ☺️

@Abhinay1997
Contributor

This makes sense at the top level. But just so that my understanding is correct, we want a community pipeline that can interpolate between prompts/images like the StableDiffusionInterpolation, but using the UnCLIPPipeline (a.k.a. DALL-E 2).

The interpolation also makes sense: we generate the embeddings for the two prompts, say p1 and p2. They would correspond to x_0 and x_N of the interpolation sequence, and using slerp I would interpolate between them for N outputs in total for a pair of prompts/images.

@patrickvonplaten
Contributor Author

That's exactly right @Abhinay1997 :-)

@Abhinay1997
Contributor

@patrickvonplaten sorry it took so long.

The UnCLIPTextInterpolation pipeline is actually more straightforward imo. For the UnCLIPImageInterpolation pipeline, wouldn't we need the CLIP model that was used for training the UnCLIPPipeline?

Because I see the CLIP model in the original codebase but not in the Hugging Face Hub model.

I am working on the text interpolation right now. ETA: 24th Jan.

@patrickvonplaten
Contributor Author

cc @williamberman here

@williamberman
Contributor

Thanks for your work @Abhinay1997!

We do still use CLIP for the unCLIP pipeline(s). Note that we just import it from transformers.

In the text-to-image pipeline, we use the text encoder for encoding the prompt:

text_encoder: CLIPTextModelWithProjection
tokenizer: CLIPTokenizer

In the image variation pipeline, we use the image encoder for encoding the input image and the text encoder for encoding an empty prompt. Note that we also optionally allow directly passing image embeddings to the pipeline to skip encoding an input image.

text_encoder: CLIPTextModelWithProjection
tokenizer: CLIPTokenizer
feature_extractor: CLIPFeatureExtractor
image_encoder: CLIPVisionModelWithProjection

The image interpolation pipeline should work similarly to the image variation pipeline in that it should be passed either the two sets of images, which would be encoded via CLIP, or the two sets of pre-encoded latents. Note that, similarly to the image variation pipeline, we would have to encode the empty text prompt for the image interpolation pipeline.
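
To make that concrete, a rough sketch of the encode-then-interpolate step (the checkpoint name and helper function below are illustrative assumptions, not the final pipeline API; the real pipeline would reuse the image encoder shipped with the karlo image-variations weights and a slerp helper like the one sketched earlier in the thread):

import torch
from transformers import CLIPFeatureExtractor, CLIPVisionModelWithProjection

feature_extractor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-large-patch14")
image_encoder = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def interpolated_image_embeds(image_a, image_b, num_interpolations, slerp):
    # Encode both PIL images into CLIP image embeddings of shape (2, embed_dim).
    inputs = feature_extractor(images=[image_a, image_b], return_tensors="pt")
    embeds = image_encoder(**inputs).image_embeds
    # The two endpoints plus num_interpolations intermediate embeddings.
    weights = torch.linspace(0.0, 1.0, num_interpolations + 2)
    return torch.stack([slerp(w, embeds[0], embeds[1]) for w in weights])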

LMK if that makes sense!

@Abhinay1997
Contributor

Hi @williamberman thank you for the details. It makes sense. I was under the impression that I needed to use the actual CLIP checkpoint that the UnCLIP model learns to invert its decoder over. So I got confused.

Will share the work in progress notebooks soon.

@Abhinay1997
Contributor

Abhinay1997 commented Jan 31, 2023

Hi @williamberman, can you review the UnCLIPTextInterpolation notebook when you have time?

Questions:-

  1. When prompts have different lengths, what attention mask should be used for the intermediate interpolation steps?
  2. Will I be able to instantiate this community pipeline using DiffusionPipeline(custom_pipeline='....'), given that I am inheriting from both DiffusionPipeline [community pipeline requirement] and UnCLIPPipeline [to be able to use the UnCLIP modules]?

@williamberman
Contributor

williamberman commented Feb 1, 2023

Great work so far @Abhinay1997 !

Hi @williamberman, can you review the UnCLIPTextInterpolation notebook when you have time?

Questions:-

  1. When prompts have different lengths, what attention mask should be used for the intermediate interpolation steps?

That's a good point, and I don't know off the top of my head or from googling around. I would recommend for now just using the mask of the longer prompt. cc @patrickvonplaten
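
One simple way to express that (a runnable sketch; the masks below are illustrative placeholders for the tokenizer's 0/1 attention masks, both padded to the model's max length):

import torch

# Illustrative masks for two prompts of length 5 and 8, padded to 77 tokens.
mask_a = torch.zeros(77, dtype=torch.long); mask_a[:5] = 1
mask_b = torch.zeros(77, dtype=torch.long); mask_b[:8] = 1

# The elementwise max is equivalent to the mask of the longer prompt when both
# prompts are right-padded, as the CLIP tokenizer does by default.
interp_mask = torch.maximum(mask_a, mask_b)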

  2. Will I be able to instantiate this community pipeline using DiffusionPipeline(custom_pipeline='....'), given that I am inheriting from both DiffusionPipeline [community pipeline requirement] and UnCLIPPipeline [to be able to use the UnCLIP modules]?

We actually want all pipelines to be completely independent so please do not inherit from the UnCLIPPipeline :)

@Abhinay1997
Contributor

@williamberman, just to clarify, can I still import UnCLIPPipeline inside the methods and use it for generation?

@williamberman
Contributor

Nope!

We want all pipelines to be as self-contained as possible. If any methods are exactly the same, we have the # Copied from mechanism (which we should document a bit better), which lets you copy and paste the method and keep the two methods in sync in CI.

We've articulated some of our rationale in the philosophy doc

https://github.com/huggingface/diffusers/blob/main/docs/source/en/conceptual/philosophy.mdx#pipelines
https://github.com/huggingface/diffusers/blob/main/docs/source/en/conceptual/philosophy.mdx#tweakable-contributor-friendly-over-abstraction
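
For reference, the annotation looks roughly like this (the source path below is just an example of the pattern, not necessarily the method this pipeline would copy); the repo's copy-checking CI then keeps the pasted body in sync with the original:

# Copied from diffusers.pipelines.unclip.pipeline_unclip.UnCLIPPipeline._encode_prompt
def _encode_prompt(self, prompt, device, num_images_per_prompt, do_classifier_free_guidance):
    ...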

@Abhinay1997
Contributor

Thanks for the clarification @williamberman. Will update and make the PR for it soon.

@Abhinay1997
Contributor

@williamberman @patrickvonplaten Please find the PR for UnCLIPTextInterpolation: #2257

Also, what about the interpolation attention_masks? Any other thoughts on it? Using max for now, as suggested.

P.S.: Planning to complete UnCLIPImageInterpolation this week.

@williamberman
Contributor

With the text interpolation pipeline merged, the image interpolation pipeline is still up for grabs!

@Abhinay1997
Contributor

Abhinay1997 commented Feb 13, 2023

@williamberman I was planning to start on image interpolation too. Would that be okay?

@williamberman
Contributor

Yes please! You're on a roll :)

@Abhinay1997
Contributor

Abhinay1997 commented Feb 14, 2023

So, I was re-reading the DALL-E 2 paper and found that their text interpolation is a little more complicated, in that they interpolate on image embeddings using a normalised difference of the text embeddings of the two prompts. This produces much better results than my implementation. I'll update the text interpolation pipeline once the image interpolation is done.

@Abhinay1997
Contributor

Image interpolation is looking good. I'm getting results in line with DALL-E 2.

Notebook: https://colab.research.google.com/drive/1eN-oy3N6amFT48hhxvv02Ad5798FDvHd?usp=sharing

Results:-
[Image: starry_to_dog]

Inputs:-
[Image: starry_night]

[Image: dogs]

Will open a PR tomorrow.

@patrickvonplaten
Contributor Author

Wow that's super cool 🔥

@patrickvonplaten
Contributor Author

Very much looking forward to the PR! Let's maybe also try to make a cool Space about this @apolinario @AK391 @osanseviero

@AK391

AK391 commented Feb 16, 2023

Hi @Abhinay1997, awesome work on the community pipeline! I opened a request for a community Space for it in our Discord: https://discord.com/channels/879548962464493619/1075849794519572490/1075849794519572490. You can join here: https://discord.gg/pEYnj5ZW, check out the event by taking the #collaborate role, and write under one of the paper posts in the #making-demos forum.

@Abhinay1997
Contributor

Opened the PR for UnCLIPImageInterpolation: #2400

@williamberman @patrickvonplaten

@Abhinay1997
Contributor

Abhinay1997 commented Mar 3, 2023

While #2400 is under review, I wanted to share the basic outline for the UnCLIP text diff flow:

  1. Take the original image x0 and generate the inverted noise xT using DDIM Inversion and the image_embeddings z_img_0
  2. Given a target prompt p_target and a caption for the original image p_start, compute text_embeddings z_txt_start and z_txt_target
  3. Compute the text diff embeddings, z_txt_diff = norm_diff(z_txt_start, z_txt_target)
  4. Compute the intermediate embedding z_inter = slerp(interp_value, z_img_0, z_txt_diff) where interp_value is linearly spaced in the interval [0.25,0.5] (from the Dall E 2 paper)
  5. Use the intermediate embeddings to generate the text diff images.
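
A rough sketch of steps 2-4 under those assumptions (norm_diff and the other names simply follow the outline above, not an existing API; slerp is a spherical interpolation helper like the one discussed earlier in the thread):

import torch

def norm_diff(z_txt_start, z_txt_target):
    # Step 3: normalised difference of the two text embeddings.
    diff = z_txt_target - z_txt_start
    return diff / torch.norm(diff, dim=-1, keepdim=True)

def text_diff_embeddings(z_img_0, z_txt_start, z_txt_target, slerp, num_frames=10):
    # Step 4: slerp between the image embedding and the text diff direction,
    # with interp_value linearly spaced in [0.25, 0.5] as in the DALL-E 2 paper.
    z_txt_diff = norm_diff(z_txt_start, z_txt_target)
    interp_values = torch.linspace(0.25, 0.5, num_frames)
    return [slerp(v, z_img_0, z_txt_diff) for v in interp_values]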

@Abhinay1997
Contributor

UnCLIP Image Interpolation demo space is up and running at https://huggingface.co/spaces/NagaSaiAbhinay/UnCLIP_Image_Interpolation_Demo

Do check it out!

@osanseviero
Member

Very cool Space 🔥

@patrickvonplaten
Contributor Author

Super cool space @Abhinay1997 - shared it on Reddit as well :-)

@Abhinay1997
Contributor

Thanks @patrickvonplaten, @osanseviero !

@sayakpaul
Member

Can we close this issue now?

@Abhinay1997
Contributor

There is the unCLIP text diff that is part of the interpolations, @sayakpaul. Let me work on the PR for this. We can close it this week 🙂

While #2400 is under review, I wanted to share the basic outline for the UnCLIP text diff flow:

  1. Take the original image x0 and generate the inverted noise xT using DDIM Inversion and the image_embeddings z_img_0
  2. Given a target prompt p_target and a caption for the original image p_start, compute text_embeddings z_txt_start and z_txt_target
  3. Compute the text diff embeddings, z_txt_diff = norm_diff(z_txt_start, z_txt_target)
  4. Compute the intermediate embedding z_inter = slerp(interp_value, z_img_0, z_txt_diff) where interp_value is linearly spaced in the interval [0.25,0.5] (from the Dall E 2 paper)
  5. Use the intermediate embeddings to generate the text diff images.

@sayakpaul
Member

I didn't follow it, sorry. We can do it one by one if you want to since you're already tackling #2195. Totally fine with that.

@Abhinay1997
Contributor

I meant there is one more interpolation that we can work on and the flow is all planned out. So there should be no surprises and we can close it this week.

As for #2195, I fixed the issue. See my comments there. I need to update the branch to resolve conflicts with main.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale (Issues that haven't received updates) label on Apr 13, 2023
@Abhinay1997
Contributor

Team, the pipeline is in the works. Will make the PR this week :)

@github-actions

github-actions bot commented May 8, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@tikitong

There is unclip text diff that is part of the interpolations @sayakpaul. Let me work on the pr for this. We can close it this week 🙂

While #2400 is under review, I wanted to share the basic outline for the UnCLIP text diff flow:

  1. Take the original image x0 and generate the inverted noise xT using DDIM Inversion and the image_embeddings z_img_0
  2. Given a target prompt p_target and a caption for the original image p_start, compute text_embeddings z_txt_start and z_txt_target
  3. Compute the text diff embeddings, z_txt_diff = norm_diff(z_txt_start, z_txt_target)
  4. Compute the intermediate embedding z_inter = slerp(interp_value, z_img_0, z_txt_diff) where interp_value is linearly spaced in the interval [0.25,0.5] (from the Dall E 2 paper)
  5. Use the intermediate embeddings to generate the text diff images.

Hi, thanks for this very interesting work! @Abhinay1997 @patrickvonplaten, are UnCLIPImageInterpolation and UnCLIPTextInterpolationPipeline directly available from diffusers? Also, has the text diff method been implemented?

@Abhinay1997
Contributor

@tikitong Thanks for the interest in this!

You can use them like this:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("kakaobrain/karlo-v1-alpha-image-variations", torch_dtype=torch.float16, custom_pipeline='unclip_image_interpolation')
pipe = DiffusionPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16, custom_pipeline='unclip_text_interpolation')

There are spaces for both UnCLIPTextInterpolation and UnCLIPImageInterpolation.

https://huggingface.co/spaces/NagaSaiAbhinay/UnCLIP_Image_Interpolation_Demo
https://huggingface.co/spaces/NagaSaiAbhinay/unclip_text_interpolation_demo

I'm working on something else, so I didn't have time for the text diff. But I'll update here once I'm done. Hopefully this weekend :)

@tikitong

@Abhinay1997 thank you very much for your quick answer!

thanks also for the two commands with the custom_pipeline argument.

I'm looking forward to the text diff!

If I understand correctly, the DDIM inversion can be done just before this line, right?

beta_prod_t = 1 - alpha_prod_t

by doing alpha_prod_t, alpha_prod_t_prev = alpha_prod_t_prev, alpha_prod_t

In the two notebooks you shared (UnCLIPTextInterpolation and UnCLIPImageInterpolation), could it be done just before the calculation of super_res_latents = self.super_res_scheduler.step(..?

Or, in fact, by doing t, prev_timestep = prev_timestep, t before calling self.super_res_scheduler.step(..?

@Abhinay1997
Contributor

@tikitong,

Hugging Face has a DDIM inversion scheduler, so I was actually planning to use it similarly to how the SDPix2PixZero pipeline does DDIM inversion to get the inverted latents (original noise). But I'll have to look into it in more detail.

You can take a look at a pipeline that uses DDIM Inversion here
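
In the meantime, a generic DDIM-inversion sketch (assuming an epsilon-predicting UNet and a DDIM-style scheduler exposing alphas_cumprod; the names and conditioning are illustrative, and the UnCLIP decoder's inputs would differ):

import torch

@torch.no_grad()
def ddim_invert(unet, scheduler, latents, cond, num_steps=50):
    # Deterministic DDIM inversion: walk the clean latents forward in noise level,
    # so that running the usual DDIM sampler from the result approximately
    # reproduces the original latents.
    scheduler.set_timesteps(num_steps)
    timesteps = list(reversed(scheduler.timesteps))  # low t -> high t
    x = latents
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        eps = unet(x, t, encoder_hidden_states=cond).sample
        a_t = scheduler.alphas_cumprod[t]
        a_next = scheduler.alphas_cumprod[t_next]
        # Predicted x0 at step t, then re-noise it to the higher noise level t_next.
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps
    return x  # approximation of the "original noise" x_T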

Maybe I'll be able to better answer your question once I get down into the details.

@patrickvonplaten, what do you think?

@patrickvonplaten
Contributor Author

Sorry, what exactly is the question here? 😅

I think the community pipelines work very well, no?

@tikitong

@Abhinay1997 I looked for the DDIM inversion scheduler, DDIMInverseScheduler. Yes it should work with it.
@patrickvonplaten Yes totally, I was just wondering how to get the inverted latent xT from z_img_0.

@Abhinay1997
Contributor

@tikitong there is an example of DDIM inversion using a DDIM Scheduler here

@tikitong

tikitong commented May 21, 2023

@Abhinay1997 thanks to your link and to this one, I was able to get closer to the solution. I would be very interested in your opinion on my results!

The results below are for the following features:
an adult lion -> a lion cub

I set the interp_value as follows:
interp_value = torch.linspace(0.25, 0.65, steps=10)

I calculated z_txt_diff = norm_diff(z_txt_start, z_txt_target)
as follows:
z_txt_diff = (z_txt_target - z_txt_start) / torch.norm(z_txt_target - z_txt_start, dim=1, keepdim=True)

Going from left to right, the first image is generated directly from z_img_0, and the second is the reconstruction of z_img_0 with xT fixed (called noise).

As you can see, it seems to work, but the adult lion never turns into a real lion cub, even at 0.65. In the DALL-E 2 paper it happens at 0.5.

I have several questions about the results :
First, can you confirm that for the DDIM inversion to obtain xT, scheduler.timesteps should be reversed?

Secondly, is it normal that the image generated from z_img_0 (with random noise) is different from the second one (with fixed xT)? I expected to get exactly the same image.

my parameters are:

num_steps = 10
batch_size = 1
guidance_scale = 7
prior_cf_scale = 4
prior_steps = "25"

For the generation at each interp_value, as for the generation of z_img_0 and noise, I left the timesteps at 1000. I get [901 801 701 601 501 401 301 201 101 1].

For the xT generation I set the timesteps to 50 and invert them. I get [1]. Above that, I noticed that the results vary much more.

[Screenshot: interpolation results, 2023-05-21]

@Abhinay1997
Contributor

Abhinay1997 commented May 21, 2023

@tikitong

About the timesteps: if you're using the DDIMInverseScheduler, they would go from 0 to 1000, I'd assume. See here; there is no reversal of the timesteps.

Can you confirm you're using spherical interpolation to calculate z_inter, i.e. using the formula z_inter = slerp(interp_value, z_img_0, z_txt_diff)? You can find a working slerp function here.

As for reproducing the original image from z_img_0: noise latents generated from DDIM inversion cannot exactly reproduce the original image. They are very close but not an exact match. We can discuss this further if you're interested.

The paper mentions interp_value = torch.linspace(0.25, 0.5, steps=10), but to be honest, I wonder if it'll actually work, because we stop the interpolation basically midway between the two images. Can you try interpolating up to 1?

If there's a colab notebook, I'd be happy to have a look. :)

@tikitong

tikitong commented May 23, 2023

@Abhinay1997 thank you for these details, and sorry for the delay; I had to finish another piece of work.
Yes, I use spherical interpolation, the formula you gave.
When I interpolate up to 1, the image is deformed and no longer looks like a lion.
Of course, I will make a notebook; I just need time to adapt and format it, and I'll try to do it ASAP.

@Abhinay1997
Contributor

@tikitong sorry for the late reply. I tried it out and am getting similar results. My implementation fails to reconstruct the original image too, i.e. there is no difference between using random noise and the original noise. I'm trying to figure out where the issue lies. Will update here if I make any progress :)

@Abhinay1997
Contributor

[Images: original_cat, reconstructed_cat]

I made some progress on the DDIM inversion. I still can't get an exact reconstruction of the original image, but I'm getting there.

Left: Original image.
Right: Reconstructed using DDIM Inversion on UnCLIP Decoder.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
