This notebook intends to show how to use the developed API to construct a `StableDiffusion` object and generate an image from a textual prompt.

---

We started by noticing that, in the reference codebase, running any part of the stable diffusion model required initializing and loading the entire `LatentDiffusion` module, which in turn loads the stable diffusion model checkpoints.

The `LatentDiffusion` module is composed mainly of three submodules: `UNetModel`, `Autoencoder` and `CLIPTextEmbedder`. It makes sense to use these submodules individually, so we wanted to make the pipeline more modular and let each one run on its own. For instance, the `CLIPTextEmbedder` turns textual prompts into tensors in an embedding space. It is perfectly reasonable to embed a number of prompts first and then run the rest of the pipeline over those embeddings; the heavier parts of the `LatentDiffusion` model, such as the UNet and the autoencoder, aren't needed for that step. Hence the first goal was to support standalone use of every submodule that can sensibly be used on its own.

The second goal was to avoid downloads at runtime altogether, while minimizing the use of external libraries. These downloads were caused mainly by the `transformers` library, which was being used to download and load the `openai/clip-vit-large-patch14` model and construct the text embedder.

The third goal was to add support for `.safetensors`. This was quite troublesome, because `torch` still doesn't support the format natively and the saving and loading process itself is different. To begin with, `safetensors` serializes only at the tensor level, so you can only save and load dictionaries of tensors, which are in general weights or state dicts. With pickle, you can save the Python object itself, so that when you load it *you already have an instance of it*. That matters because *you need an instance of the model you are trying to load weights into*. If you don't have the code for the object (in general, the code for an `nn.Module` whose state dict is compatible with the one you are loading), you won't be able to use `.safetensors`. And we didn't have the code for the submodels of the `CLIPTextEmbedder`, since they came from the `transformers` library.

In [None]:
# Check if we're on Google Colab to clone and change dir into the repo
if 'google.colab' in str(get_ipython()):
  !git clone https://github.com/kk-digital/kcg-ml-sd1p4
  %cd kcg-ml-sd1p4

In [None]:
!pip install -r requirements.txt

In [None]:
!python3 ./download_models.py

In [None]:
!python3 ./process_models.py

In [3]:
import os
import sys

base_directory = "./"
sys.path.insert(0, base_directory)
print(os.path.abspath(base_directory))

import json
import torch
import configparser
import safetensors
from stable_diffusion import StableDiffusion
from stable_diffusion.utils_backend import *
from stable_diffusion.utils_image import *
from stable_diffusion.utils_model import *
from stable_diffusion.utils_logger import *
from stable_diffusion.model.clip_image_encoder import CLIPImageEncoder

from stable_diffusion.constants import IODirectoryTree

device = get_device()
to_pil = lambda image: ToPILImage()(torch.clamp((image + 1.0) / 2.0, min=0.0, max=1.0))

/devbox/kcg-ml-sd1p4


In [None]:
base_dir = os.getcwd()
sys.path.insert(0, base_dir)

import configparser
config = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
config.read(os.path.join(base_dir, "config.ini"))
config['BASE']['BASE_DIRECTORY'] = base_dir
config["BASE"].get('base_io_directory')

batch_size = 1
pt = IODirectoryTree(base_io_directory_prefix = config["BASE"].get('base_io_directory_prefix'), base_directory=base_dir)

In [None]:
pt.create_directory_tree_folders()
pt

We are using `transformers` for the CLIP models.

On a first run, since the required model isn't in the cache yet, the next cell would normally download the pretrained tokenizer from `openai/clip-vit-large-patch14` on Hugging Face.

In [5]:
from transformers import CLIPTokenizer

# tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14')

Instead, we keep the tokenizer files (they are very light) in our repo, so we load them from disk with:

In [6]:
tokenizer = CLIPTokenizer.from_pretrained(pt.tokenizer_path, local_files_only=True)

In [7]:
# this is how you save it
# sd_savepath = os.path.join(pt.sd_model_dir, "clip_")
# tokenizer.save_pretrained(sd_savepath+"tokenizer", safe_serialization=True)

Here again, if the required configuration file weren't in the cache, the next cell would normally download the `CLIPTextModel` config file from `openai/clip-vit-large-patch14` on Hugging Face. We need it to initialize an empty `CLIPTextModel` object.

In [8]:
from transformers import CLIPTextConfig, CLIPTextModel

#fetch config file from huggingface and save it to the model folder
# config = CLIPTextConfig.from_pretrained("openai/clip-vit-large-patch14")
# config.save_pretrained(pt.text_model_path)
# config

We also have that config file in our repo, so we can load it from disk.

In [None]:
config = CLIPTextConfig.from_pretrained(pt.text_model_path, local_files_only=True)
# config = CLIPTextConfig.from_pretrained('../input/model/clip/text_embedder/text_model/config.json')
config

Then we can finally instantiate a `CLIPTextModel`:

In [10]:
text_model = CLIPTextModel(config)

In [None]:
get_memory_status(device)

In [None]:
text_model.to(device)

In [None]:
get_memory_status(device)

In [15]:
text_model.save_pretrained(pt.text_model_path, safe_serialization=True)

In [16]:
# test load
# text_model = CLIPTextModel.from_pretrained(pt.text_model_path, local_files_only=True, use_safetensors=True).eval().to(device)

Now we can finally instantiate a text embedder without loading any pretrained weights.

In [17]:
from stable_diffusion.model.clip_text_embedder import CLIPTextEmbedder

In [19]:
text_embedder = CLIPTextEmbedder(pt, device=device, tokenizer = tokenizer, transformer=text_model)

In [None]:
text_embedder.to(text_embedder.device)

Naturally, at this point we should already be able to embed a prompt, albeit badly: the `CLIPTextModel` was built from the configuration alone, so its weights are randomly initialized rather than pretrained.

In [None]:
text_embedder('A great sword')

If we hadn't created the submodel instances first, we would have, instead:

In [None]:
not_text_embedder = CLIPTextEmbedder(pt, device=device, tokenizer = None, transformer= None)

In [None]:
not_text_embedder.to(not_text_embedder.device)

And, obviously, our forward wouldn't work:

In [None]:
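try:
  not_text_embedder('A great sword')
except Exception as error:
  # expected to fail: this embedder has neither a tokenizer nor a transformer
  print(f"Forward pass failed: {error!r}")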

Let's redo the text embedder, but now loading the saved submodels.

In [None]:
text_embedder = CLIPTextEmbedder(pt, device=device, tokenizer = None, transformer= None)

In [None]:
# still empty
text_embedder

In [None]:
text_embedder.load_submodels(tokenizer_path = pt.tokenizer_path, transformer_path = pt.text_model_path)

In [None]:
# we could also save our submodels to disk for later use
# text_embedder.save_submodels(tokenizer_path=pt.tokenizer_path, text_model_path=pt.text_model_path)

Now we need to create instances of the two other submodules, `UNetModel` and `Autoencoder`. These should be easier to initialize, since we have their `nn.Module` classes defined and can avoid `transformers` entirely.

The `Autoencoder` is itself composed of two submodules that are useful individually, `Encoder` and `Decoder`. Let's start by instantiating them.

In [32]:
# from stable_diffusion.utils.model import initialize_encoder
from stable_diffusion.model.vae import Encoder

In [None]:
encoder = Encoder(device=device)

In [35]:
# from stable_diffusion.utils.model import initialize_decoder
from stable_diffusion.model.vae import Decoder

In [None]:
# decoder = initialize_decoder(device=device)
decoder = Decoder(device=device)

In [38]:
# from stable_diffusion.utils.model import initialize_autoencoder
from stable_diffusion.model.vae import Autoencoder

In [None]:
# autoencoder = initialize_autoencoder(device=device, encoder=encoder, decoder=decoder)
autoencoder = Autoencoder(device=device, encoder=encoder, decoder=decoder)

Okay, now we have an untrained autoencoder. Now we just need the UNet.

In [41]:
from stable_diffusion.model.unet import UNetModel
# from stable_diffusion.utils.model import initialize_unet

In [42]:
unet_model = UNetModel(device=device)

In [None]:
get_memory_status(device)

Now we need to build a model with the same structure as the checkpoint we are going to use (by default, `runwayml/stable-diffusion-v1-5`), so that the weights get mapped properly. This model is called `LatentDiffusion`. There is also an `initialize_latent_diffusion` function, which I will omit here since it is a bit longer than the others.

In [44]:
from stable_diffusion import LatentDiffusion
# from stable_diffusion.utils.model import initialize_latent_diffusion

In [45]:
latent_diffusion = LatentDiffusion(
                            autoencoder=autoencoder,
                            clip_embedder=text_embedder,
                            unet_model=unet_model,
                            device=device
                            )

In [46]:
# import the submodule explicitly so that `safetensors.torch.load_file` is available
import safetensors.torch

In [None]:
with section(f"stable diffusion checkpoint loading, from {pt.checkpoint_path}"):
    stable_diffusion_checkpoint = safetensors.torch.load_file(pt.checkpoint_path, device="cpu")

Now we push the weights into the model:

In [None]:
with section('model state loading'):
    missing_keys, extra_keys = latent_diffusion.load_state_dict(stable_diffusion_checkpoint, strict=False)

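It's common for some weights not to map perfectly. With `strict=False`, `load_state_dict` doesn't raise on such mismatches; it returns the keys the model expected but didn't find (missing) and the keys in the checkpoint that the model has no place for (extra).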

In [None]:
print(extra_keys)
print(len(extra_keys))
print(missing_keys)
print(len(missing_keys))
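To see what those two lists mean in isolation, here is a small sketch on a toy module. The `Tiny` class and its key names are made up for illustration and are unrelated to the actual checkpoint.

```python
import torch
from torch import nn


class Tiny(nn.Module):
    """Toy module with two layers, `a` and `b` (hypothetical names)."""

    def __init__(self):
        super().__init__()
        self.a = nn.Linear(2, 2)
        self.b = nn.Linear(2, 2)


model = Tiny()

# This checkpoint covers layer `a`, lacks layer `b`, and carries a stray `c`.
checkpoint = {
    "a.weight": torch.zeros(2, 2),
    "a.bias": torch.zeros(2),
    "c.weight": torch.ones(2, 2),
}

# strict=False reports mismatches instead of raising an error.
missing_keys, unexpected_keys = model.load_state_dict(checkpoint, strict=False)

print(missing_keys)     # keys the model has but the checkpoint lacks
print(unexpected_keys)  # keys the checkpoint has but the model lacks
```

Layer `a` is loaded normally, while `b` keeps its random initialization, which is why it is worth inspecting both lists after a non-strict load.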

But now we have a fully loaded latent diffusion model. To actually perform 'stable diffusion', which is itself a kind of latent diffusion, we need yet another class, `StableDiffusion`. Roughly speaking, the `StableDiffusion` class uses the `LatentDiffusion` model in a specific way to denoise a random sample from the latent space. It does that through a diffusion process, hence 'latent diffusion'. What defines this process, i.e., how the `LatentDiffusion` model is used to denoise the random sample, is a sampler, and that is what the `StableDiffusion` class adds. Besides that, it provides a unified interface for inference.
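To make the role of a sampler more concrete, here is a minimal sketch of a deterministic DDIM-style denoising loop. It is not the sampler used in this repo: the `ddim_sample` helper, the linear beta schedule and `toy_eps_model` are all assumptions made for illustration, and a real run would call the UNet (conditioned on the prompt embedding) where the toy noise predictor is called.

```python
import torch


def ddim_sample(eps_model, shape, alphas_cumprod, n_steps, seed=0):
    """Deterministic DDIM-style sampling (eta = 0), starting from pure noise."""
    generator = torch.Generator().manual_seed(seed)
    x = torch.randn(shape, generator=generator)
    # Evenly spaced timesteps, from the noisiest one down to 0.
    timesteps = torch.linspace(len(alphas_cumprod) - 1, 0, n_steps).long()
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        # After the last step no noise is left, so the cumulative alpha is 1.
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < n_steps else torch.tensor(1.0)
        eps = eps_model(x, t)  # predicted noise at this step
        # Estimate the clean sample, then re-noise it to the previous level.
        x0_pred = (x - torch.sqrt(1 - a_t) * eps) / torch.sqrt(a_t)
        x = torch.sqrt(a_prev) * x0_pred + torch.sqrt(1 - a_prev) * eps
    return x


# Assumed linear beta schedule with 1000 training steps (illustrative values).
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def toy_eps_model(x, t):
    # Placeholder for the UNet: always predicts zero noise.
    return torch.zeros_like(x)


latent = ddim_sample(toy_eps_model, (1, 4, 8, 8), alphas_cumprod, n_steps=20, seed=1)
print(latent.shape)
```

The loop only ever asks the model for a noise prediction; everything else (the schedule, the number of steps, the update rule) belongs to the sampler, which is the part a class like `StableDiffusion` supplies on top of the model.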

In [51]:
from stable_diffusion import StableDiffusion

In [None]:
stable_diffusion = StableDiffusion(device=device, model = latent_diffusion, ddim_steps = 20)

In [54]:
prompt = 'A cat'

In [None]:
with section('sampling...'):
    image_tensor = stable_diffusion.generate_images(prompt = prompt, seed = 1)

In [None]:
to_pil(image_tensor.squeeze())

Let's finish this notebook by saving all the relevant submodels to disk, with their weights loaded. In effect, we are breaking the `v1-5...` checkpoint, one big file, into one checkpoint per model, so that the weights it contained can be loaded more modularly. We will start part 2 by reassembling a `StableDiffusion` instance from the checkpoints of the saved models, instead of loading the checkpoint for the whole `LatentDiffusion` model.

In [None]:
# first stage model is the autoencoder; let's save its submodels
stable_diffusion.model.first_stage_model.save_submodels(encoder_path = pt.encoder_path, decoder_path = pt.decoder_path)

In [58]:
# the autoencoder itself also has parameters, so we also need to save it; but let's unload its submodels first
stable_diffusion.model.first_stage_model.unload_submodels()

In [None]:
# now save the unloaded autoencoder
stable_diffusion.model.first_stage_model.save(autoencoder_path=pt.autoencoder_path)

In [None]:
# cond stage is the conditioning stage: the CLIPTextEmbedder model. let's save its submodels too
stable_diffusion.model.cond_stage_model.save_submodels(tokenizer_path = pt.tokenizer_path, transformer_path = pt.text_model_path)

In [61]:
stable_diffusion.model.cond_stage_model.unload_submodels()

In [None]:
# save the UNet model
stable_diffusion.model.model.diffusion_model.save(unet_path=pt.unet_path)

In [None]:
# `LatentDiffusion` also has parameters, so we should save it as well, but only after unloading the submodels.
stable_diffusion.model.unload_submodels()

In [None]:
# save the unloaded latent diffusion model
stable_diffusion.model.save(latent_diffusion_path=pt.latent_diffusion_path)
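The idea behind the cells above, one big checkpoint split into one file per submodel, can also be pictured at the state-dict level. The sketch below is not how the repo's `save_submodels` methods work; `split_state_dict` is a hypothetical helper, the prefixes are assumed from the attribute names used above, and the keys and string values are placeholders standing in for real tensors.

```python
def split_state_dict(state_dict, prefixes):
    """Group the entries of a flat state dict by key prefix.

    `prefixes` maps a group name to the prefix its keys start with.
    Keys matching no prefix end up in the "rest" group.
    """
    parts = {name: {} for name in prefixes}
    parts["rest"] = {}
    for key, value in state_dict.items():
        for name, prefix in prefixes.items():
            if key.startswith(prefix):
                # Strip the prefix so the submodule can load the key directly.
                parts[name][key[len(prefix):]] = value
                break
        else:
            parts["rest"][key] = value
    return parts


# Placeholder checkpoint: strings stand in for tensors, keys are illustrative.
checkpoint = {
    "model.diffusion_model.out.weight": "unet tensor",
    "first_stage_model.encoder.weight": "vae tensor",
    "cond_stage_model.transformer.weight": "clip tensor",
    "betas": "schedule tensor",
}

parts = split_state_dict(
    checkpoint,
    {
        "unet": "model.diffusion_model.",
        "autoencoder": "first_stage_model.",
        "text_embedder": "cond_stage_model.",
    },
)

for name, part in parts.items():
    print(name, sorted(part))
```

Each group could then be written to its own file and loaded into the matching submodule without touching the others.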

In [69]:
# Delete stable diffusion object
del stable_diffusion

Now, in part 2, let's rebuild a `StableDiffusion` class, with the saved submodels.