
What is the scaling_factor? #3

Closed · sayakpaul opened this issue Jul 31, 2023 · 6 comments

@sayakpaul

We have latent_shift and latent_magnitude values here:

https://github.com/madebyollin/taesd/blob/main/taesd.py#L44C1-L45C23

But is there a scaling_factor as well, or is it effectively just 1.0?

I mean the scaling_factor as used in https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/models/autoencoder_kl.py#L61.
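
For context, here is a minimal sketch of how scaling_factor is conventionally used around AutoencoderKL in diffusers (the pipelines handle this internally; the snippet is illustrative only):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="vae", torch_dtype=torch.float16
).to("cuda")

# Dummy image in [-1, 1], just to show the round trip.
image = torch.rand(1, 3, 512, 512, dtype=torch.float16, device="cuda") * 2 - 1

# Encoded latents are multiplied by scaling_factor before being handed to the UNet...
latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

# ...and divided by it again right before decoding back to an image in roughly [-1, 1].
decoded = vae.decode(latents / vae.config.scaling_factor).sample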

@madebyollin (Owner)

There is no scaling_factor for TAESD - TAESD directly converts SD(XL) latents into RGB images in [0, 1] (see the usage in the example notebook). So if you need to specify a value, you can probably set it to 1.0.

(The latent_shift and latent_magnitude values in taesd.py are only relevant if you want to store latents into RGBA PNG files - sorry for the confusion.)
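
As an aside, here is a rough sketch of what those constants are for: packing raw latents into [0, 1] so they can be stored as 8-bit RGBA PNGs. The helper names below are illustrative; see taesd.py for the actual implementation and current values.

import torch

latent_magnitude = 3   # assumed value from taesd.py at the time of writing
latent_shift = 0.5     # assumed value from taesd.py at the time of writing

def pack_latents(x: torch.Tensor) -> torch.Tensor:
    # Raw latents -> [0, 1], suitable for saving as an 8-bit RGBA PNG.
    return x.div(2 * latent_magnitude).add(latent_shift).clamp(0, 1)

def unpack_latents(x: torch.Tensor) -> torch.Tensor:
    # [0, 1] values read back from a PNG -> approximate raw latents.
    return x.sub(latent_shift).mul(2 * latent_magnitude)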

@sayakpaul (Author) commented Jul 31, 2023

Thanks for your reply!

I am trying to integrate your work into diffusers so that users can use it very easily (crediting this repository, of course).

With the following code (diffusers was installed using pip install git+https://github.com/huggingface/diffusers@feat/tiny-autoenc):

import torch
from diffusers import DiffusionPipeline, TinyAutoencoder

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
pipe.vae = TinyAutoencoder.from_pretrained("sayakpaul/taesd-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0).images[0]
image

I am getting:

[generated image attached]

Is the quality somewhat expected?

To give you some more context, here's what we do in the standard pipeline settings.

After we get the latents from the UNet:

  1. We first decode them: https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L697.
  2. We then run them through the postprocessor: https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L708.
  3. We then denormalize the image (see the sketch after this list): https://github.com/huggingface/diffusers/blob/ea5b0575f8f91b76f32fb6f6930c0bc30e42865e/src/diffusers/image_processor.py#L240.
  4. And finally generate the PIL image.
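
For reference, the denormalization in step 3 boils down to something like this (a simplified sketch of what the image processor does, assuming the decoder output is in [-1, 1]):

import torch

def denormalize(images: torch.Tensor) -> torch.Tensor:
    # Map decoder output from [-1, 1] to [0, 1] before converting to PIL.
    return (images / 2 + 0.5).clamp(0, 1)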

From your example notebook, comparing this line:

res_taesd = taesd_dec(latents).cpu().permute(0, 2, 3, 1).float().clamp(0, 1).numpy()

to the corresponding step in diffusers, it seems that the additional (images / 2 + 0.5) denormalization is not required for the tiny autoencoder?

Would be amazing to get your thoughts here.

@sayakpaul (Author) commented Jul 31, 2023

> to the corresponding step in diffusers, it seems that the additional (images / 2 + 0.5) denormalization is not required for the tiny autoencoder?

That indeed seems to be the case.

When I do:

import PIL.Image

pipe.vae = TinyAutoencoder.from_pretrained(
    "sayakpaul/taesd-diffusers", torch_dtype=torch.float16
).to("cuda")
latents = pipe(
    prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0,
    generator=torch.manual_seed(0), output_type="latent"
).images

decoded_image = pipe.vae.decode(
    latents / pipe.vae.config.scaling_factor, return_dict=False
)[0]
decoded_image = decoded_image.permute(0, 2, 3, 1).float().clamp(0, 1).cpu().detach().numpy().squeeze(0)

PIL.Image.fromarray((decoded_image * 255).round().astype("uint8"))

With this, I am getting:

[generated image attached]

@sayakpaul (Author)

When I use the original VAE, I get:

from diffusers import AutoencoderKL

original_vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", subfolder="vae", torch_dtype=torch.float16
).to("cuda")
pipe.vae = original_vae

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(
    prompt, num_inference_steps=25, height=512, width=512, guidance_scale=3.0,
    generator=torch.manual_seed(0)
).images[0]
image

[generated image attached]

@sayakpaul (Author)

Closing the issue.

@madebyollin (Owner) commented Jul 31, 2023

Yup, TAESD directly predicts values in [0, 1], so you don't need the additional denormalization step (though clamping is still recommended). The image here looks correct to me 👍
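
To summarize in code, a sketch of the two post-UNet decode paths (function names are illustrative, and this assumes the tiny autoencoder's decode API mirrors AutoencoderKL's):

import torch

def decode_with_kl_vae(vae, latents: torch.Tensor) -> torch.Tensor:
    # Standard AutoencoderKL: output is roughly in [-1, 1], so denormalize afterwards.
    image = vae.decode(latents / vae.config.scaling_factor).sample
    return (image / 2 + 0.5).clamp(0, 1)

def decode_with_taesd(tiny_vae, latents: torch.Tensor) -> torch.Tensor:
    # TAESD: the decoder already predicts [0, 1], so only clamping is needed
    # (its scaling_factor, if one must be registered, would be 1.0).
    return tiny_vae.decode(latents).sample.clamp(0, 1)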
