
What is the scaling_factor? #3

Closed · sayakpaul opened this issue Jul 31, 2023 · 6 comments

@sayakpaul

We have latent_shift and latent_magnitude values here:

https://github.com/madebyollin/taesd/blob/main/taesd.py#L44C1-L45C23

But is there a scaling_factor as well, or is it effectively just 1.0?

I mean the scaling_factor as used in https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/models/autoencoder_kl.py#L61.
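
For context, here is a minimal sketch of how scaling_factor is conventionally used around AutoencoderKL in diffusers (the pipelines handle this internally; the snippet is illustrative only):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="vae", torch_dtype=torch.float16
).to("cuda")

# Dummy image in [-1, 1], just to show the round trip.
image = torch.rand(1, 3, 512, 512, dtype=torch.float16, device="cuda") * 2 - 1

# Encoded latents are multiplied by scaling_factor before being handed to the UNet...
latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

# ...and divided by it again right before decoding back to an image in roughly [-1, 1].
decoded = vae.decode(latents / vae.config.scaling_factor).sample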

@madebyollin (Owner)

There is no scaling_factor for TAESD - TAESD directly converts SD(XL) latents into RGB images in [0, 1] (see the usage in the example notebook). So if you need to specify a value, you can probably set it to 1.0.

(The latent_shift and latent_magnitude values in taesd.py are only relevant if you want to store latents into RGBA PNG files - sorry for the confusion.)
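
As an aside, here is a rough sketch of what those constants are for: packing raw latents into [0, 1] so they can be stored as 8-bit RGBA PNGs. The helper names below are illustrative; see taesd.py for the actual implementation and current values.

import torch

latent_magnitude = 3   # assumed value from taesd.py at the time of writing
latent_shift = 0.5     # assumed value from taesd.py at the time of writing

def pack_latents(x: torch.Tensor) -> torch.Tensor:
    # Raw latents -> [0, 1], suitable for saving as an 8-bit RGBA PNG.
    return x.div(2 * latent_magnitude).add(latent_shift).clamp(0, 1)

def unpack_latents(x: torch.Tensor) -> torch.Tensor:
    # [0, 1] values read back from a PNG -> approximate raw latents.
    return x.sub(latent_shift).mul(2 * latent_magnitude)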

@sayakpaul (Author) commented Jul 31, 2023

Thanks for your reply!

I am trying to integrate your work into diffusers so that users can use it very easily (crediting this repository, of course).

With the following code (diffusers was installed using pip install git+https://github.com/huggingface/diffusers@feat/tiny-autoenc):

import torch
from diffusers import DiffusionPipeline, TinyAutoencoder

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
pipe.vae = TinyAutoencoder.from_pretrained("sayakpaul/taesd-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0).images[0]
image

I am getting:

[generated image attached]

Is the quality somewhat expected?

To give you some more context, here's what we do in the standard pipeline settings.

After we get the latents from the UNet:

  1. We first decode them: https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L697.
  2. We then run them through the postprocessor: https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L708.
  3. We then denormalize the image (see the sketch after this list): https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/image_processor.py#L240.
  4. And finally generate the PIL image.
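
For reference, the denormalization in step 3 boils down to something like this (a simplified sketch of what the image processor does, assuming the decoder output is in [-1, 1]):

import torch

def denormalize(images: torch.Tensor) -> torch.Tensor:
    # Map decoder output from [-1, 1] to [0, 1] before converting to PIL.
    return (images / 2 + 0.5).clamp(0, 1)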

From your example notebook, comparing this line:

res_taesd = taesd_dec(latents).cpu().permute(0, 2, 3, 1).float().clamp(0, 1).numpy()

to the corresponding step in diffusers, it seems that the additional (images / 2 + 0.5) denormalization is not required for the tiny autoencoder?

Would be amazing to get your thoughts here.

@sayakpaul (Author) commented Jul 31, 2023

> to the corresponding step in diffusers, it seems that the additional (images / 2 + 0.5) denormalization is not required for the tiny autoencoder?

That indeed seems to be the case.

When I do:

import PIL.Image

pipe.vae = TinyAutoencoder.from_pretrained(
    "sayakpaul/taesd-diffusers", torch_dtype=torch.float16
).to("cuda")
latents = pipe(
    prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0,
    generator=torch.manual_seed(0), output_type="latent"
).images

decoded_image = pipe.vae.decode(
    latents / pipe.vae.config.scaling_factor, return_dict=False
)[0]
decoded_image = decoded_image.permute(0, 2, 3, 1).float().clamp(0, 1).cpu().detach().numpy().squeeze(0)

PIL.Image.fromarray((decoded_image * 255).round().astype("uint8"))

With this, I am getting:

[generated image attached]

@sayakpaul (Author)

When I use the original VAE, I get:

from diffusers import AutoencoderKL

original_vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="vae", torch_dtype=torch.float16
).to("cuda")
pipe.vae = original_vae

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(
    prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0,
    generator=torch.manual_seed(0)
).images[0]
image

[generated image attached]

@sayakpaul (Author)

Closing the issue.

@madebyollin (Owner) commented Jul 31, 2023

Yup, TAESD directly predicts values in [0, 1], so you don't need the additional denormalization step (though clamping is still recommended). The image here looks correct to me 👍
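
To summarize in code, a sketch of the two post-UNet decode paths (function names are illustrative, and this assumes the tiny autoencoder's decode API mirrors AutoencoderKL's):

import torch

def decode_with_kl_vae(vae, latents: torch.Tensor) -> torch.Tensor:
    # Standard AutoencoderKL: output is roughly in [-1, 1], so denormalize afterwards.
    image = vae.decode(latents / vae.config.scaling_factor).sample
    return (image / 2 + 0.5).clamp(0, 1)

def decode_with_taesd(tiny_vae, latents: torch.Tensor) -> torch.Tensor:
    # TAESD: the decoder already predicts [0, 1], so only clamping is needed
    # (its scaling_factor, if one must be registered, would be 1.0).
    return tiny_vae.decode(latents).sample.clamp(0, 1)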
