# Diffusion Pipeline

## I. Stable diffusion

Every pipeline derives from DiffusionPipeline, and a single pipeline can handle one or several tasks. The pipelines and their tasks are described at https://huggingface.co/docs/diffusers/v0.29.2/en/api/pipelines/overview#diffusers.
For some tasks, AutoPipeline can load the model without us knowing which specific pipeline class to use.
Otherwise, to choose a pipeline we have to look at either the documentation or the source code.
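
As a rough illustration of the task-based lookup, the sketch below maps a task name to one of the AutoPipeline classes that diffusers exports. The names `AUTO_PIPELINE_CLASSES` and `load_for_task` are hypothetical helpers introduced here, not part of the library, and the import is done lazily only so that the mapping can be inspected without loading diffusers.

```python
# Task name -> name of the matching AutoPipeline class exported by diffusers.
AUTO_PIPELINE_CLASSES = {
    "text2image": "AutoPipelineForText2Image",
    "image2image": "AutoPipelineForImage2Image",
    "inpainting": "AutoPipelineForInpainting",
}


def load_for_task(task, repo_id, **kwargs):
    """Load `repo_id` with the AutoPipeline class that matches `task` (hypothetical helper)."""
    if task not in AUTO_PIPELINE_CLASSES:
        raise ValueError(
            f"Unknown task {task!r}; expected one of {sorted(AUTO_PIPELINE_CLASSES)}"
        )
    import diffusers  # lazy import: only needed when a model is actually loaded

    auto_cls = getattr(diffusers, AUTO_PIPELINE_CLASSES[task])
    return auto_cls.from_pretrained(repo_id, **kwargs)
```

Calling `load_for_task("text2image", "CompVis/stable-diffusion-v1-4", use_safetensors=True)` should resolve to the same kind of pipeline object as a direct AutoPipelineForText2Image call.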

In [15]:
# DiffusionPipeline

# It loads a StableDiffusionPipeline.
# It contains:
#   - feature_extractor: preprocesses generated images for the safety checker
#   - safety_checker: flags generated images that may be unsafe
#   - scheduler: controls the denoising schedule (PNDMScheduler here)
#   - text_encoder: encodes the tokenized prompt
#   - tokenizer: tokenizes the prompt
#   - unet: estimates the noise at each time step
#   - vae: encodes images into latent space and decodes latents back to images

# The class behind each component can be seen in the pipe object.
# Note: image_encoder is not used in this pipe.

# Each component is loaded from a subfolder of the repository.

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_safetensors=True)
pipe

Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

StableDiffusionPipeline {
  "_class_name": "StableDiffusionPipeline",
  "_diffusers_version": "0.29.2",
  "_name_or_path": "CompVis/stable-diffusion-v1-4",
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "image_encoder": [
    null,
    null
  ],
  "requires_safety_checker": true,
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}

In [16]:
# The same pipeline can be loaded with a task-specific AutoPipeline class.
# We get the same pipeline as before.

from diffusers import AutoPipelineForText2Image

pipe_auto = AutoPipelineForText2Image.from_pretrained("CompVis/stable-diffusion-v1-4", use_safetensors=True)
pipe_auto

Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

StableDiffusionPipeline {
  "_class_name": "StableDiffusionPipeline",
  "_diffusers_version": "0.29.2",
  "_name_or_path": "CompVis/stable-diffusion-v1-4",
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "image_encoder": [
    null,
    null
  ],
  "requires_safety_checker": true,
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}

In [17]:
# access a component as an attribute of the pipeline

pipe_auto.tokenizer

CLIPTokenizer(name_or_path='/home/niuniu/.cache/huggingface/hub/models--CompVis--stable-diffusion-v1-4/snapshots/133a221b8aa7292a167afc5127cb63fb5005638b/tokenizer', vocab_size=49408, model_max_length=77, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	49406: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	49407: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
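
To see all components at once rather than one attribute at a time, a small helper can walk the pipeline's `components` dictionary. This is a minimal sketch: `describe_components` is a hypothetical name, and it assumes that optional components which were not loaded appear as `None`.

```python
def describe_components(pipe):
    """Return {component name: class name} for a loaded pipeline (hypothetical helper)."""
    return {
        name: type(component).__name__
        for name, component in pipe.components.items()
        if component is not None  # assumption: unused optional components are None
    }
```

For the pipeline above, `describe_components(pipe_auto)` should list entries such as the tokenizer, text encoder, UNet and VAE with their class names.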

In [19]:
# load a model stored as a single checkpoint file

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned.ckpt"
)
pipe

v1-5-pruned.ckpt:   0%|          | 0.00/7.70G [00:00<?, ?B/s]

Error while downloading from https://cdn-lfs.huggingface.co/repos/6b/20/6b201da5f0f5c60524535ebb7deac2eef68605655d3bbacfee9cce0087f3b3f5/e1441589a6f3c5a53f5f54d0975a18a7feb7cdf0b0dee276dfc3331ae376a053?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27v1-5-pruned.ckpt%3B+filename%3D%22v1-5-pruned.ckpt%22%3B&Expires=1720798478&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcyMDc5ODQ3OH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy82Yi8yMC82YjIwMWRhNWYwZjVjNjA1MjQ1MzVlYmI3ZGVhYzJlZWY2ODYwNTY1NWQzYmJhY2ZlZTljY2UwMDg3ZjNiM2Y1L2UxNDQxNTg5YTZmM2M1YTUzZjVmNTRkMDk3NWExOGE3ZmViN2NkZjBiMGRlZTI3NmRmYzMzMzFhZTM3NmEwNTM%7EcmVzcG9uc2UtY29udGVudC1kaXNwb3NpdGlvbj0qIn1dfQ__&Signature=fVbqd-cpb9hVzCAJvOG%7EsInp8V0AWGS52QSkrpdTwlEQO4O5via1k2L544IkgcPFK8F78UQo3-hM63vXYGYcQhhwVuZvZjhPz47PM-2sjmmhy5lKs4uHPZq4yF7-CLrbXqWAtNNN6C8sy-E2HgqUoVhbqAIPbJCAraeSydbakSPvJHDRi3GLRLwcFx08N7hKtfwLm7OaX5mkbDKooMXBbaGRTdCpk3P5gBLEnpit-PXzGcIniW-Nzr

v1-5-pruned.ckpt:  82%|########2 | 6.33G/7.70G [00:00<?, ?B/s]

OSError: Unable to load weights from checkpoint file for '/home/niuniu/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/1d0c4ebf6ff58a5caecab40fa1406526bca4b5b9/v1-5-pruned.ckpt' at '/home/niuniu/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/1d0c4ebf6ff58a5caecab40fa1406526bca4b5b9/v1-5-pruned.ckpt'. 
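
The log stops at 82% before the OSError, which suggests (but does not prove) that the cached checkpoint is truncated. One possible way to retry is to force a fresh download of the file and load the pipeline from the local copy. This is only a sketch under that assumption: `load_single_file_fresh` is a hypothetical helper, and both `huggingface_hub` and `diffusers` are assumed to be installed.

```python
def load_single_file_fresh(repo_id, filename):
    """Re-download one checkpoint file, then build the pipeline from the local copy."""
    # Lazy imports: both packages are assumed to be installed.
    from huggingface_hub import hf_hub_download
    from diffusers import StableDiffusionPipeline

    # force_download=True ignores any cached (possibly truncated) copy of the file.
    local_path = hf_hub_download(
        repo_id=repo_id, filename=filename, force_download=True
    )
    # from_single_file also accepts a path to a local checkpoint.
    return StableDiffusionPipeline.from_single_file(local_path)
```

For the cell above this would be called as `load_single_file_fresh("runwayml/stable-diffusion-v1-5", "v1-5-pruned.ckpt")`.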

## II. Checkpoint

A checkpoint can come in several variants:

- precision: fp32 by default, or half precision (fp16). The fp16 variant can't be used for training or on a CPU. However, we could fine-tune in half precision with some modifications (see half-precision training for transformers).
- non-exponential moving average (non-EMA) weights: these should not be used for inference.

In [13]:
# show the precision of the default model

for name, param in pipe_auto.unet.named_parameters():
    print(name, param.dtype)

conv_in.weight torch.float32
conv_in.bias torch.float32
time_embedding.linear_1.weight torch.float32
time_embedding.linear_1.bias torch.float32
time_embedding.linear_2.weight torch.float32
time_embedding.linear_2.bias torch.float32
down_blocks.0.attentions.0.norm.weight torch.float32
down_blocks.0.attentions.0.norm.bias torch.float32
down_blocks.0.attentions.0.proj_in.weight torch.float32
down_blocks.0.attentions.0.proj_in.bias torch.float32
down_blocks.0.attentions.0.transformer_blocks.0.norm1.weight torch.float32
down_blocks.0.attentions.0.transformer_blocks.0.norm1.bias torch.float32
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight torch.float32
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight torch.float32
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_v.weight torch.float32
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0.weight torch.float32
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0.bias torch.float32


In [14]:
# Two parameters control the precision:
# - "variant" selects which set of weight files is loaded (here the fp16 files).
#   If torch_dtype is not set, the loaded weights are converted to the
#   default precision, fp32.
# - "torch_dtype" is the dtype the weights are converted to after loading.
#   If the variant is fp32 and torch_dtype is fp16, the loaded model is
#   converted to fp16.

import torch

pipe = AutoPipelineForText2Image.from_pretrained("CompVis/stable-diffusion-v1-4", variant="fp16", torch_dtype=torch.half, use_safetensors=True)
for name, param in pipe.unet.named_parameters():
    print(name, param.dtype)

Fetching 16 files:   0%|          | 0/16 [00:00<?, ?it/s]

model.fp16.safetensors:   0%|          | 0.00/608M [00:00<?, ?B/s]

diffusion_pytorch_model.fp16.safetensors:   0%|          | 0.00/1.72G [00:00<?, ?B/s]

model.fp16.safetensors:   0%|          | 0.00/246M [00:00<?, ?B/s]

diffusion_pytorch_model.fp16.safetensors:   0%|          | 0.00/167M [00:00<?, ?B/s]

Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

conv_in.weight torch.float16
conv_in.bias torch.float16
time_embedding.linear_1.weight torch.float16
time_embedding.linear_1.bias torch.float16
time_embedding.linear_2.weight torch.float16
time_embedding.linear_2.bias torch.float16
down_blocks.0.attentions.0.norm.weight torch.float16
down_blocks.0.attentions.0.norm.bias torch.float16
down_blocks.0.attentions.0.proj_in.weight torch.float16
down_blocks.0.attentions.0.proj_in.bias torch.float16
down_blocks.0.attentions.0.transformer_blocks.0.norm1.weight torch.float16
down_blocks.0.attentions.0.transformer_blocks.0.norm1.bias torch.float16
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight torch.float16
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight torch.float16
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_v.weight torch.float16
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0.weight torch.float16
down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0.bias torch.float16
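
Printing every parameter gets long for a full UNet. A more compact check is to count parameters per dtype. The helper below is a sketch; `dtype_summary` is a hypothetical name, and it relies only on the module exposing `parameters()`, as any `torch.nn.Module` does.

```python
from collections import Counter


def dtype_summary(module):
    """Count how many parameter tensors a module holds per dtype (hypothetical helper)."""
    return Counter(str(param.dtype) for param in module.parameters())
```

With the pipeline loaded above, `dtype_summary(pipe.unet)` should contain a single key, `torch.float16`, if every weight was converted.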


## III. Community pipelines

Community pipelines are variant implementations of the original papers.
All community pipelines are listed at https://github.com/huggingface/diffusers/tree/main/examples/community.
A community pipeline can live either on the HF Hub or in the Git repository.
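
A community pipeline is typically loaded through the `custom_pipeline` argument of `DiffusionPipeline.from_pretrained`, which takes either a Hub repository id or the file name of a pipeline in the community folder. The wrapper below is a minimal sketch; `load_community_pipeline` is a hypothetical name, and the pipeline name in the usage note is only an example.

```python
def load_community_pipeline(repo_id, custom_pipeline, **kwargs):
    """Load the weights of `repo_id` into a community pipeline (hypothetical helper)."""
    from diffusers import DiffusionPipeline  # lazy import: diffusers assumed installed

    # custom_pipeline: a Hub repo id, or the file name (without .py) of a
    # pipeline under examples/community in the diffusers Git repository.
    return DiffusionPipeline.from_pretrained(
        repo_id, custom_pipeline=custom_pipeline, **kwargs
    )
```

For instance, `load_community_pipeline("CompVis/stable-diffusion-v1-4", "lpw_stable_diffusion")` would request the long-prompt-weighting pipeline from the community folder, assuming it is still listed there under that name.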