When I use CLIPGuidedStableDiffusion based on stable-diffusion-2, the generated result is incredibly bad, all mosaic. #1422

ScottishFold007 · 2022-11-25T14:55:59Z

Describe the bug

When I use CLIPGuidedStableDiffusion based on stable-diffusion-2, the generated result is incredibly bad, all mosaic.

Reproduction

from diffusers import DiffusionPipeline
from transformers import CLIPFeatureExtractor, CLIPModel
import torch

model_id = r"C:\Users\admin\Desktop\stable_diffusion模型合辑\stabilityai_stable-diffusion-2"
feature_extractor = CLIPFeatureExtractor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
clip_model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", torch_dtype=torch.float16)

guided_pipeline = DiffusionPipeline.from_pretrained(
    model_id,
    custom_pipeline="clip_guided_stable_diffusion",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    revision="fp16",
    torch_dtype=torch.float16,
)
guided_pipeline.enable_attention_slicing()
guided_pipeline = guided_pipeline.to("cuda")

prompt = "fantasy book cover, full moon, fantasy forest landscape, golden vector elements, fantasy magic, dark light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Albert Bierstadt, masterpiece"

generator = torch.Generator(device="cuda").manual_seed(0)
images = []
for i in range(4):
    image = guided_pipeline(
        prompt,
        num_inference_steps=50,
        guidance_scale=7.5,
        clip_guidance_scale=100,
        num_cutouts=4,
        use_cutouts=False,
        generator=generator,
    ).images[0]
    images.append(image)

# save images locally

for i, img in enumerate(images):
    img.save(f"./clip_guided_sd/image_{i}.png")

Logs

No response

System Info

diffusers==0.9.0.dev0

akirchmeyer · 2022-11-28T20:01:41Z

Maybe related: #1429 (comment)
I used to get "mosaic" images similar to what you describe while training a dreambooth model on SD v2, and this solved it for me.

ScottishFold007 · 2022-11-29T15:30:00Z

> Maybe related: #1429 (comment) I used to get "mosaic" images similar to what you describe while training a dreambooth model on SD v2, and this solved it for me.

I tried this solution, but it did not fix the problem. Maybe the cause is somewhere in cond_fn:
@torch.enable_grad()
def cond_fn(
    self,
    latents,
    timestep,
    index,
    text_embeddings,
    noise_pred_original,
    text_embeddings_clip,
    clip_guidance_scale,
    num_cutouts,
    use_cutouts=True,
):
    latents = latents.detach().requires_grad_()

    if isinstance(self.scheduler, (LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler)):
        sigma = self.scheduler.sigmas[index]
        # the model input needs to be scaled to match the continuous ODE formulation in K-LMS
        latent_model_input = latents / ((sigma**2 + 1) ** 0.5)
    else:
        latent_model_input = latents

    # predict the noise residual
    noise_pred = self.unet(latent_model_input, timestep, encoder_hidden_states=text_embeddings).sample

    if isinstance(self.scheduler, (PNDMScheduler, DDIMScheduler)):
        alpha_prod_t = self.scheduler.alphas_cumprod[timestep]
        beta_prod_t = 1 - alpha_prod_t
        # compute predicted original sample from predicted noise also called
        # "predicted x_0" of formula (12) from https://arxiv.org/pdf/2010.02502.pdf
        pred_original_sample = (latents - beta_prod_t ** (0.5) * noise_pred) / alpha_prod_t ** (0.5)

        fac = torch.sqrt(beta_prod_t)
        sample = pred_original_sample * (fac) + latents * (1 - fac)
    elif isinstance(self.scheduler, (LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler)):
        sigma = self.scheduler.sigmas[index]
        sample = latents - sigma * noise_pred
    else:
        raise ValueError(f"scheduler type {type(self.scheduler)} not supported")

    # undo the latent scaling factor, then decode to image space in [0, 1]
    sample = 1 / 0.18215 * sample
    image = self.vae.decode(sample).sample
    image = (image / 2 + 0.5).clamp(0, 1)

    if use_cutouts:
        image = self.make_cutouts(image, num_cutouts)
    else:
        # use the pipeline's own feature extractor rather than a module-level global
        image = transforms.Resize(self.feature_extractor.size["shortest_edge"])(image)
    image = self.normalize(image).to(latents.dtype)

    image_embeddings_clip = self.clip_model.get_image_features(image)
    image_embeddings_clip = image_embeddings_clip / image_embeddings_clip.norm(p=2, dim=-1, keepdim=True)

    if use_cutouts:
        dists = spherical_dist_loss(image_embeddings_clip, text_embeddings_clip)
        dists = dists.view([num_cutouts, sample.shape[0], -1])
        loss = dists.sum(2).mean(0).sum() * clip_guidance_scale
    else:
        loss = spherical_dist_loss(image_embeddings_clip, text_embeddings_clip).mean() * clip_guidance_scale

    grads = -torch.autograd.grad(loss, latents)[0]

    # EulerDiscreteScheduler is included here to match the sigma-based branches above;
    # without it the else branch would reference an undefined beta_prod_t
    if isinstance(self.scheduler, (LMSDiscreteScheduler, EulerDiscreteScheduler, EulerAncestralDiscreteScheduler)):
        latents = latents.detach() + grads * (sigma**2)
        noise_pred = noise_pred_original
    else:
        noise_pred = noise_pred_original - torch.sqrt(beta_prod_t) * grads
    return noise_pred, latents
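
For readers following the math in cond_fn, here is a minimal scalar sketch of the two "predicted x_0" expressions it uses. It only illustrates that both the alpha-based branch (PNDM/DDIM) and the sigma-based branch (LMS/Euler) recover the clean sample when the UNet output really is the added noise. The helper names (add_noise, pred_x0_from_eps, pred_x0_from_sigma, pred_x0_from_v) are hypothetical and not part of diffusers. The v-parameterization case is included only for contrast and is an assumption: nothing in this thread establishes which output type the checkpoint uses or that this is the cause of the mosaic images.

```python
import math


def add_noise(x0, eps, alpha_prod_t):
    """Forward process: x_t = sqrt(alpha) * x0 + sqrt(1 - alpha) * eps."""
    return math.sqrt(alpha_prod_t) * x0 + math.sqrt(1 - alpha_prod_t) * eps


def pred_x0_from_eps(latents, noise_pred, alpha_prod_t):
    """Same expression as pred_original_sample in cond_fn (DDIM formula 12)."""
    beta_prod_t = 1 - alpha_prod_t
    return (latents - beta_prod_t ** 0.5 * noise_pred) / alpha_prod_t ** 0.5


def pred_x0_from_sigma(latents_sigma, noise_pred, sigma):
    """Same expression as the sigma-based branch: latents - sigma * noise_pred."""
    return latents_sigma - sigma * noise_pred


def pred_x0_from_v(latents, v_pred, alpha_prod_t):
    """ASSUMPTION, not in the posted code: x0 when the model output is
    v = sqrt(alpha) * eps - sqrt(1 - alpha) * x0 instead of eps."""
    return math.sqrt(alpha_prod_t) * latents - math.sqrt(1 - alpha_prod_t) * v_pred


# Arbitrary scalar stand-ins for the latent tensors.
x0, eps, alpha_prod_t = 0.8, -0.3, 0.6
x_t = add_noise(x0, eps, alpha_prod_t)

# Sigma-space latents; sigma and alpha are related by sigma^2 + 1 = 1 / alpha.
sigma = math.sqrt((1 - alpha_prod_t) / alpha_prod_t)
x_sigma = x0 + sigma * eps

# Both branches recover x0 when the model output is the noise eps.
recovered_alpha = pred_x0_from_eps(x_t, eps, alpha_prod_t)
recovered_sigma = pred_x0_from_sigma(x_sigma, eps, sigma)

# If the model output were v, the eps formula would no longer recover x0.
v = math.sqrt(alpha_prod_t) * eps - math.sqrt(1 - alpha_prod_t) * x0
wrong = pred_x0_from_eps(x_t, v, alpha_prod_t)
right = pred_x0_from_v(x_t, v, alpha_prod_t)
```

In this sketch recovered_alpha, recovered_sigma and right all equal x0 up to floating-point error, while wrong does not. One thing worth checking in a real run is therefore whether the loaded checkpoint's UNet output matches what cond_fn assumes.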

patrickvonplaten · 2022-12-01T16:26:23Z

Hey @ScottishFold007,

I think we currently sadly don't have the time to look into this problem. We would love to review a PR for the clip guided pipeline though that would make it work with v2.

github-actions · 2022-12-26T15:03:18Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ScottishFold007 added the "bug" label (Something isn't working) on Nov 25, 2022

github-actions bot added the "stale" label (Issues that haven't received updates) on Dec 26, 2022

github-actions bot closed this as completed Jan 3, 2023
