<a href="https://colab.research.google.com/github/wandb/examples/blob/master/colabs/diffusers/sdxl-text-to-image.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Stable-Diffusion XL 1.0 using 🤗 Diffusers

This notebook demonstrates the following:
- Performing text-conditional image generation using [🤗 Diffusers](https://huggingface.co/docs/diffusers).
- Using the Stable Diffusion XL Refiner pipeline to further refine the outputs of the base model.
- Managing image generation experiments using [Weights & Biases](http://wandb.ai/geekyrakshit).
- Logging the prompts and generated images to [Weights & Biases](http://wandb.ai/geekyrakshit) for visualization.

## Installing the Dependencies

In [None]:
# !pip install -qq diffusers["torch"] transformers wandb

In [None]:
import torch
import wandb
from diffusers import DiffusionPipeline, EulerDiscreteScheduler

## Experiment Management using Weights & Biases

Managing our image generation experiments is crucial for reproducibility. Hence, we sync all the experiment configs with our Weights & Biases run, from the prompts to the refinement technique and the scheduler configuration.

In [None]:
# initialize a wandb run
wandb.init(project="stable-diffusion-xl", entity="mratanusarkar", job_type="text-to-image", save_code=True)

# define experiment configs
config = wandb.config
config.stable_diffusion_checkpoint = "stabilityai/stable-diffusion-xl-base-1.0"
config.refiner_checkpoint = "stabilityai/stable-diffusion-xl-refiner-1.0"
config.offload_to_cpu = True
config.compile_model = False
config.prompt_1 = "A photo of a muscular lady, wearing a helmet and a latex suit. She is riding a futuristic bike with bright headlights. The background is of a brightly lit city with neon lights and traffic headlights causing motion blur, cyberpunk aesthetic, realistic, 8k"
config.prompt_2 = "" # Leave blank if you want both text encoders to use the same prompt
config.negative_prompt_1 = "painting, nature, static, sd character, low quality, low resolution, greyscale, monochrome, nose, cropped, lowres, jpeg artifacts, deformed iris, deformed pupils, bad eyes, semi-realistic worst quality, bad lips, deformed mouth, deformed face, deformed fingers, standing still, posing"
config.negative_prompt_2 = ""
config.seed = 42
config.use_ensemble_of_experts = True
config.num_inference_steps = 100
config.num_refinement_steps = 100
config.high_noise_fraction = 0.8 # Set explicitly only if config.use_ensemble_of_experts is True
config.scheduler_kwargs = {
    "beta_end": 0.012,
    "beta_schedule": "scaled_linear", # one of ["linear", "scaled_linear"]
    "beta_start": 0.00085,
    "interpolation_type": "linear", # one of ["linear", "log_linear"]
    "num_train_timesteps": 1000,
    "prediction_type": "epsilon", # one of ["epsilon", "sample", "v_prediction"]
    "steps_offset": 1,
    "timestep_spacing": "leading", # one of ["linspace", "leading", "trailing"]
    "trained_betas": None,
    "use_karras_sigmas": False,
}
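
The `beta_schedule` entry above controls how the noise variances are spaced between `beta_start` and `beta_end`. As a minimal sketch in plain Python, assuming that `"scaled_linear"` interpolates linearly in the square root of beta and then squares the result, the two schedules can be compared as follows. The helper names `linear_betas` and `scaled_linear_betas` are hypothetical and only serve this illustration; they are not part of `diffusers`.

```python
def linear_betas(beta_start, beta_end, num_train_timesteps):
    """Evenly spaced betas between beta_start and beta_end (illustrative helper)."""
    step = (beta_end - beta_start) / (num_train_timesteps - 1)
    return [beta_start + i * step for i in range(num_train_timesteps)]


def scaled_linear_betas(beta_start, beta_end, num_train_timesteps):
    """Betas spaced evenly in sqrt(beta), then squared (illustrative helper)."""
    lo, hi = beta_start ** 0.5, beta_end ** 0.5
    step = (hi - lo) / (num_train_timesteps - 1)
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


# Same values as in config.scheduler_kwargs
linear = linear_betas(0.00085, 0.012, 1000)
scaled = scaled_linear_betas(0.00085, 0.012, 1000)

# Both schedules share their endpoints; the scaled variant stays lower in between
print(linear[0], linear[500], linear[-1])
print(scaled[0], scaled[500], scaled[-1])
```

Both schedules start and end at the same values, but the scaled variant adds noise more slowly in the early timesteps.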

We can make the experiment deterministic based on the seed specified in the experiment configs.

In [None]:
if config.seed is not None:
    generator = [torch.Generator(device="cuda").manual_seed(config.seed)]
else:
    generator = [torch.Generator(device="cuda")]
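
To see why seeding the generator makes a run repeatable, here is a small sketch that draws random tensors from separately created generators. It falls back to the CPU when no GPU is available, which is an assumption made only so the snippet runs anywhere; `seeded_noise` is a hypothetical helper and the tensor shape is arbitrary.

```python
import torch


def seeded_noise(seed, shape, device):
    """Draw a random tensor from a fresh generator seeded with `seed` (illustrative helper)."""
    gen = torch.Generator(device=device).manual_seed(seed)
    return torch.randn(shape, generator=gen, device=device)


# Assumption: use the GPU if present, otherwise run the demo on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
shape = (1, 4, 8, 8)  # arbitrary small shape, not the real SDXL latent size

noise_a = seeded_noise(42, shape, device)
noise_b = seeded_noise(42, shape, device)
noise_c = seeded_noise(43, shape, device)

same_seed_matches = torch.equal(noise_a, noise_b)
different_seed_matches = torch.equal(noise_a, noise_c)
print(same_seed_matches, different_seed_matches)
```

Two generators created with the same seed on the same device produce identical tensors, which is what lets a pipeline call start from the same initial noise every time.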

## Creating the Diffusion Pipelines

For performing text-conditional image generation, we use the `diffusers` library to define the diffusion pipelines corresponding to the base SDXL model and the SDXL refinement model.

In [None]:
# Define base model
pipe = DiffusionPipeline.from_pretrained(
    config.stable_diffusion_checkpoint,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
    scheduler=EulerDiscreteScheduler(**config.scheduler_kwargs),
)

# Offload to CPU in case of OOM
if config.offload_to_cpu:
    pipe.enable_model_cpu_offload()
else:
    pipe.to("cuda")

# Compile model using `torch.compile`, this might give a significant speedup
if config.compile_model:
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

In [None]:
# # Define refiner model
# refiner = DiffusionPipeline.from_pretrained(
#     config.refiner_checkpoint,
#     text_encoder_2=pipe.text_encoder_2,
#     vae=pipe.vae,
#     torch_dtype=torch.float16,
#     use_safetensors=True,
#     variant="fp16",
#     scheduler=EulerDiscreteScheduler(**config.scheduler_kwargs),
# )

# # Offload to CPU in case of OOM
# if config.offload_to_cpu:
#     refiner.enable_model_cpu_offload()
# else:
#     refiner.to("cuda")

# # Compile model using `torch.compile`, this might give a significant speedup
# if config.compile_model:
#     refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)

We now define a utility function to postprocess the latents obtained from the base model.

In [None]:
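def postprocess_latent(latent):
    """Decode the latents held in a pipeline output (`latent.images`) into a PIL image."""
    # Undo the VAE scaling applied to the latents before decoding them to pixel space
    vae_output = pipe.vae.decode(
        latent.images / pipe.vae.config.scaling_factor, return_dict=False
    )[0].detach()
    # Convert the decoded tensor to PIL and return the first image of the batch
    return pipe.image_processor.postprocess(vae_output, output_type="pil")[0]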

## Text-to-Image Generation

Now, we pass the prompts and the negative prompts to the base model and then pass the output to the refiner for further refinement. To learn more about the different refinement techniques that can be used with SDXL, see the [`diffusers` docs](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#refining-the-image-output).
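
When `config.use_ensemble_of_experts` is enabled, `config.high_noise_fraction` decides where the base model hands over to the refiner. The sketch below is only a rough illustration of that split. `split_steps` is a hypothetical helper, not part of `diffusers`, and it assumes both models are called with the same number of inference steps. The library works out the cutoff in terms of training timesteps, so the actual counts may differ slightly.

```python
def split_steps(num_inference_steps, high_noise_fraction):
    """Approximate how many denoising steps the base and refiner models each run."""
    if not 0.0 < high_noise_fraction < 1.0:
        raise ValueError("high_noise_fraction must lie strictly between 0 and 1")
    # The base model handles the first (high-noise) part of the schedule
    base_steps = round(num_inference_steps * high_noise_fraction)
    # The refiner handles the remaining (low-noise) part
    refiner_steps = num_inference_steps - base_steps
    return base_steps, refiner_steps


# Values from the experiment config: 100 steps, high_noise_fraction of 0.8
base_steps, refiner_steps = split_steps(100, 0.8)
print(base_steps, refiner_steps)
```

With the values used in this notebook, the base model covers roughly the first 80 steps and the refiner the last 20.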

In [None]:
if config.use_ensemble_of_experts:
    latent = pipe(
        prompt=config.prompt_1 if config.prompt_1 != "" else None,
        prompt_2=config.prompt_2 if config.prompt_2 != "" else None,
        negative_prompt=config.negative_prompt_1 if config.negative_prompt_1 != "" else None,
        negative_prompt_2=config.negative_prompt_2 if config.negative_prompt_2 != "" else None,
        output_type="latent",
        num_inference_steps=config.num_inference_steps,
        denoising_end=config.high_noise_fraction,
        generator=generator,
    )
else:
    latent = pipe(
        prompt=config.prompt_1 if config.prompt_1 != "" else None,
        prompt_2=config.prompt_2 if config.prompt_2 != "" else None,
        negative_prompt=config.negative_prompt_1 if config.negative_prompt_1 != "" else None,
        negative_prompt_2=config.negative_prompt_2 if config.negative_prompt_2 != "" else None,
        output_type="latent",
        num_inference_steps=config.num_inference_steps,
        generator=generator,
    )
unrefined_image = postprocess_latent(latent)
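
The `... if ... != "" else None` expression is repeated for every prompt argument in the cell above. One possible way to factor it out is sketched below; `none_if_blank` and `build_prompt_kwargs` are hypothetical helpers, not part of `diffusers`. Note one deliberate difference from the cell above: this sketch also treats whitespace-only strings as blank.

```python
def none_if_blank(text):
    """Return None for empty or whitespace-only strings, otherwise the text unchanged."""
    if text is None or text.strip() == "":
        return None
    return text


def build_prompt_kwargs(prompt_1, prompt_2, negative_prompt_1, negative_prompt_2):
    """Collect the four prompt arguments, mapping blank strings to None."""
    return {
        "prompt": none_if_blank(prompt_1),
        "prompt_2": none_if_blank(prompt_2),
        "negative_prompt": none_if_blank(negative_prompt_1),
        "negative_prompt_2": none_if_blank(negative_prompt_2),
    }


# Example with a blank second prompt and a blank second negative prompt
prompt_kwargs = build_prompt_kwargs("a photo of a bike", "", "low quality", "")
print(prompt_kwargs)
```

The resulting dictionary could then be unpacked into the pipeline call, for example `pipe(**prompt_kwargs, output_type="latent", generator=generator)`.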

In [None]:
# if config.use_ensemble_of_experts:
#     refined_image = refiner(
#         prompt=config.prompt_1 if config.prompt_1 != "" else None,
#         prompt_2=config.prompt_2 if config.prompt_2 != "" else None,
#         negative_prompt=config.negative_prompt_1 if config.negative_prompt_1 != "" else None,
#         negative_prompt_2=config.negative_prompt_2 if config.negative_prompt_2 != "" else None,
#         image=latent.images,
#         num_inference_steps=config.num_refinement_steps,
#         denoising_start=config.high_noise_fraction,
#         generator=generator,
#     ).images[0]
# else:
#     refined_image = refiner(
#         prompt=config.prompt_1 if config.prompt_1 != "" else None,
#         prompt_2=config.prompt_2 if config.prompt_2 != "" else None,
#         negative_prompt=config.negative_prompt_1 if config.negative_prompt_1 != "" else None,
#         negative_prompt_2=config.negative_prompt_2 if config.negative_prompt_2 != "" else None,
#         image=latent.images[0][None, :],
#         generator=generator,
#     ).images[0]

In [None]:
unrefined_image

## Logging the Images to Weights & Biases

Now, we log the images to Weights & Biases. This enables us to:

- Visualize our generations
- Examine the generated images across different experiments
- Ensure reproducibility of the experiments

In [None]:
# # Create a [wandb table](https://docs.wandb.ai/guides/tables)
# table = wandb.Table(columns=[
#     "Prompt-1",
#     "Prompt-2",
#     "Negative-Prompt-1",
#     "Negative-Prompt-2",
#     "Unrefined-Image",
#     "Refined-Image",
#     "Use-Ensemble-of-Experts",
# ])

# unrefined_image = wandb.Image(unrefined_image)
# refined_image = unrefined_image
# # wandb.Image(refined_image)

# # Add the images to the table
# table.add_data(
#     config.prompt_1,
#     config.prompt_2,
#     config.negative_prompt_1,
#     config.negative_prompt_2,
#     unrefined_image,
#     refined_image,
#     config.use_ensemble_of_experts,
# )

# # Log the images and table to wandb
# wandb.log({
#     "Unrefined-Image": unrefined_image,
#     "Refined-Image": refined_image,
#     "Text-to-Image": table
# })

# # finish the experiment
# wandb.finish()

Here's how you can examine your generations across multiple experiments 👇

![](https://i.imgur.com/zNynGye.png)

Here's how you can manage your prompts and your generations across experiments 👇

![](https://i.imgur.com/JVEXkx0.png)