# Quantize Stable Diffusion with quanto

_Authored by: [Thomas Liang](https://github.com/thliang01)_


- Stable Diffusion models are text-to-image generative models: they take a text prompt and produce a matching image. Quantization, and Post Training Quantization (PTQ) in particular, stores the weights at lower precision after training, which reduces model size and can speed up inference, making the models easier to deploy. This notebook quantizes the SDXL UNet with quanto and compares execution time, memory, and image quality scores against the fp16 baseline.
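To make the idea of weight quantization concrete, here is a minimal pure-Python sketch of symmetric int8 quantization of a small list of weights. It is an illustration only: the function names and sample values are made up for this example, and quanto's actual scheme (for instance how it chooses scales) may differ.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.05, 0.9]  # illustrative values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("scale:", scale)
print("max abs error:", max_error)
```

Each weight now needs one byte plus a shared scale instead of two bytes in fp16, at the cost of a small rounding error.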

## Install and setup

In [None]:
! pip install --upgrade diffusers accelerate transformers safetensors datasets quanto
! pip install -q numpy Pillow torchmetrics[image] torch-fidelity

### Import modules

In [None]:
import torch
import numpy as np
import os

import time
from time import perf_counter

from PIL import Image
from IPython import display as IPdisplay
from tqdm.auto import tqdm

from diffusers import DiffusionPipeline
from diffusers import DDIMScheduler
from diffusers import AutoencoderKL
from transformers import logging

logging.set_verbosity_error()

# Check CUDA is available
print(torch.cuda.is_available())

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

## Base Model

In [None]:
model_name_or_path = "stabilityai/stable-diffusion-xl-base-1.0"
scheduler = DDIMScheduler.from_pretrained(model_name_or_path, subfolder="scheduler")
num_inference_steps = 50
height = 512
width = 512
generator = torch.manual_seed(42)

vae = AutoencoderKL.from_pretrained(
    'madebyollin/sdxl-vae-fp16-fix',
    use_safetensors=True,
    torch_dtype=torch.float16,
).to(device)

# height, width, generator and num_inference_steps are call-time arguments
# of the pipeline, not loading options, so they are passed when generating
# images rather than to from_pretrained().
pipeline = DiffusionPipeline.from_pretrained(
    model_name_or_path,
    scheduler=scheduler,
    torch_dtype=torch.float16,
    variant="fp16",
    vae=vae,
    use_safetensors=True,
).to(device)

### Prompts and seeds

In [None]:
queue = []

# Photorealistic portrait (Portrait)
queue.extend([{
  'prompt': '3/4 shot, candid photograph of a beautiful 30 year old redhead woman with messy dark hair, peacefully sleeping in her bed, night, dark, light from window, dark shadows, masterpiece, uhd, moody',
  'seed': 877866767,
}])

# Creative interior image (Interior)
queue.extend([{
  'prompt': 'futuristic living room with big windows, brown sofas, coffee table, plants, cyberpunk city, concept art, earthy colors',
  'seed': 5567822442,
}])

# Macro photography (Macro)
queue.extend([{
  'prompt': 'macro shot of a bee collecting nectar from lavender flowers',
  'seed': 2257899409,
}])

# Rendered 3D image (3D)
queue.extend([{
  'prompt': '3d rendered isometric fiji island beach, 3d tile, polygon, cartoony, mobile game',
  'seed': 987865634,
}])

### Display images, memory, and execution time

#### Display Single image

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"
start = time.time()
images = pipeline(prompt).images[0]
end = time.time()
mem_bytes = torch.cuda.max_memory_allocated()
images

In [None]:
print(f"Execution time: {(end - start):.3f} sec")
print(f"Memory: {mem_bytes/(10**6):.3f} MB")
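Note that `torch.cuda.max_memory_allocated()` reports the peak since the start of the process or the last reset, so a value read later in the notebook also reflects everything allocated earlier. The following is a minimal sketch of a helper that resets the peak before each measurement. The `measure` name and the `results` dictionary are hypothetical, and the workload in the usage lines is only a stand-in for `pipeline(prompt)`.

```python
import time
from contextlib import contextmanager

try:  # torch is optional here so the sketch also runs without a GPU stack
    import torch
except ImportError:
    torch = None


@contextmanager
def measure(label, results):
    """Record wall-clock time and, when CUDA is available, peak GPU memory."""
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()  # start a fresh peak for this block
        torch.cuda.synchronize()
    start = time.perf_counter()
    try:
        yield
    finally:
        if use_cuda:
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
        results[label] = {
            "seconds": time.perf_counter() - start,
            "peak_mb": torch.cuda.max_memory_allocated() / 10**6 if use_cuda else None,
        }


results = {}
with measure("demo", results):
    sum(i * i for i in range(100_000))  # stand-in for pipeline(prompt)
print(results)
```

Wrapping the baseline and the quantized runs in the same helper would give both measurements the same starting point.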

#### Display Multi images

In [None]:
# Create a generator
generator = torch.Generator(device='cuda')

# Start a loop to process prompts one by one
for i, generation in enumerate(queue, start=1):

  # We start the counter
  image_start = perf_counter()

  # Assign the seed to the generator
  generator.manual_seed(generation['seed'])

  # Create the image
  image = pipeline(
    prompt=generation['prompt'],
    generator=generator,
  ).images[0]

  # Save the image
  image.save(f'image_{i}.png')

  # We stop the counter and save the result
  generation['total_time'] = perf_counter() - image_start

# Print the generation time of each image
images_totals = ', '.join(map(lambda generation: str(round(generation['total_time'], 1)), queue))
print('Image time:', images_totals)

# Print the average time
images_average = round(sum(generation['total_time'] for generation in queue) / len(queue), 1)
print('Average image time:', images_average)

# Print the Max. memory used
max_memory = round(torch.cuda.max_memory_allocated(device='cuda') / 1000000000, 2)
print('Max. memory used:', max_memory, 'GB')

In [None]:
# Show the image_1
from PIL import Image

img1 = Image.open("image_1.png")
img1.show()

In [None]:
# Show the image_2
from PIL import Image

img2 = Image.open("image_2.png")
img2.show()

In [None]:
# Show the image_3
from PIL import Image

img3 = Image.open("image_3.png")
img3.show()

In [None]:
# Show the image_4
from PIL import Image

img4 = Image.open("image_4.png")
img4.show()

### Evaluating Diffusion Models (default)

We will evaluate the generated images with two metrics:

* CLIP score
* PickScore

#### CLIP score

CLIP, which stands for Contrastive Language-Image Pre-training, is a model developed by OpenAI that learns visual concepts from natural language descriptions. It is trained on a variety of (image, text) pairs and can predict the most relevant text snippet given an image, similar to the zero-shot capabilities of GPT-2 and GPT-3.

The CLIP model consists of a text encoder and an image encoder. These encoders transform the input data (text or image) into a shared multimodal embedding space. The goal of the model is to maximize the cosine similarity between the embeddings of matching image-text pairs while minimizing the cosine similarity between the embeddings of mismatching pairs. This is achieved through a contrastive objective.

The CLIP score, in this context, is the cosine similarity between the image and text embeddings. A higher CLIP score indicates a better match between the image and the text.
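As a rough illustration of the definition above, the sketch below computes a CLIP-style score from two embedding vectors in pure Python. The torchmetrics metric used later defines the score as `max(100 * cosine_similarity, 0)`. The three-dimensional vectors here are toy values, not real CLIP embeddings, so the resulting numbers are not comparable to the scores reported in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score_from_embeddings(image_emb, text_emb):
    """CLIP score as max(100 * cosine similarity, 0)."""
    return max(100.0 * cosine_similarity(image_emb, text_emb), 0.0)


# Toy embeddings (assumptions for illustration only)
image_emb = [0.2, 0.9, 0.1]
matching_text_emb = [0.25, 0.8, 0.05]
unrelated_text_emb = [-0.9, 0.1, 0.3]

matching_score = clip_score_from_embeddings(image_emb, matching_text_emb)
unrelated_score = clip_score_from_embeddings(image_emb, unrelated_text_emb)
print("matching:", round(matching_score, 2))
print("unrelated:", round(unrelated_score, 2))
```

The clamp at zero means a text embedding pointing away from the image embedding scores 0 rather than a negative value.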

In [None]:
prompts = [
    "a photo of an astronaut riding a horse on mars",
    "A high tech solarpunk utopia in the Amazon rainforest",
    "A pikachu fine dining with a view to the Eiffel Tower",
    "A mecha robot in a favela in expressionist style",
    "an insect robot preparing a delicious meal",
    "A small cabin on top of a snowy mountain in the style of Disney, artstation",
]

images = pipeline(prompts, num_images_per_prompt=1, output_type="np", height = height, width = width).images

print(images.shape)
# (6, 512, 512, 3)

In [None]:
from torchmetrics.functional.multimodal import clip_score
from functools import partial

clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16")

def calculate_clip_score(images, prompts):
    images_int = (images * 255).astype("uint8")
    clip_score = clip_score_fn(torch.from_numpy(images_int).permute(0, 3, 1, 2), prompts).detach()
    return round(float(clip_score), 4)

sd_clip_score = calculate_clip_score(images, prompts)
print(f"CLIP score: {sd_clip_score}")

#### PickScore


PickScore comes from the Pick-a-Pic paper (see References). Its abstract, paraphrased:

Large datasets of human preferences over text-to-image outputs are typically available only to companies, which keeps them out of public reach. To address this, the authors built a web application where text-to-image users generate images and state which ones they prefer. That application was used to create Pick-a-Pic, a large, publicly accessible dataset of text-to-image prompts and real user preferences over generated images.

The authors used this dataset to train a CLIP-based scoring function, PickScore, which performs strongly at predicting human preferences. They also tested PickScore for model evaluation and found that it aligns better with human rankings than other automatic evaluation metrics.
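PickScore returns one score per image for a given prompt, and only the relative order of the scores matters when comparing images. One common way to read them is to apply a softmax, which turns the scores into preference probabilities. The sketch below does this in pure Python; the function name and the score values are hypothetical.

```python
import math


def preference_probabilities(scores):
    """Softmax over per-image scores for a single prompt."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [22.1, 20.9, 21.5]  # hypothetical PickScore values for three images
probs = preference_probabilities(scores)
best = probs.index(max(probs))
print("probabilities:", [round(p, 3) for p in probs])
print("preferred image index:", best)
```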

In [None]:
# import
from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch

# load model
device = "cuda"
processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = "yuvalkirstain/PickScore_v1"
processor = AutoProcessor.from_pretrained(processor_name_or_path)
model = AutoModel.from_pretrained(model_pretrained_name_or_path).eval().to(device)

In [None]:
# Score function adapted from their docs
def get_scores(prompt, images):
    
    # preprocess
    image_inputs = processor(
        images=images,
        padding=True,
        truncation=True,
        max_length=77,
        return_tensors="pt",
    ).to(device)
    
    text_inputs = processor(
        text=prompt,
        padding=True,
        truncation=True,
        max_length=77,
        return_tensors="pt",
    ).to(device)


    with torch.no_grad():
        # embed
        image_embs = model.get_image_features(**image_inputs)
        image_embs = image_embs / torch.norm(image_embs, dim=-1, keepdim=True)
    
        text_embs = model.get_text_features(**text_inputs)
        text_embs = text_embs / torch.norm(text_embs, dim=-1, keepdim=True)
    
        # score
        scores = model.logit_scale.exp() * (text_embs @ image_embs.T)[0]
       
    return scores.cpu().tolist()

In [None]:
get_scores("a photo of an astronaut riding a horse on mars", images)

In [None]:
get_scores("a photo of a pretty flower", images)

In [None]:
from datasets import load_dataset
pap = load_dataset("yuvalkirstain/pickapic_v1_no_images")
prompts = pap['validation_unique']['caption']
prompts[:10]

#### Measuring the effect of CFG_Scale on Score

In [None]:
import matplotlib.pyplot as plt
from IPython.display import clear_output

average_scores = []
cfg_scales = [2, 12, 30]
for cfg_scale in cfg_scales:
    scores = []
    for i, prompt in enumerate(prompts[:5]):
        print(f"Scale {cfg_scale}, prompt {i}")
        generator = torch.Generator(device=device).manual_seed(42)  # Fixed seed for reproducibility
        im = pipeline(prompt, num_inference_steps=50, 
                  generator=generator, guidance_scale=cfg_scale).images[0]
        scores.append(get_scores(prompt, im)[0])
        clear_output(wait=True)
    average_scores.append(sum(scores)/len(scores))

plt.plot(cfg_scales, average_scores)

#### Using A Score Model for Re-Ranking

In [None]:
def generate_good_image(prompt):
    images = []
    # Generate 2 candidate images: one with the pipeline's default guidance scale and one with guidance_scale=5
    images += pipeline(prompt, num_inference_steps=50, num_images_per_prompt=1,height = height, width = width).images
    images += pipeline(prompt, num_inference_steps=50, num_images_per_prompt=1,height = height, width = width, guidance_scale=5).images 
    # Score them and pick the best one
    scores = get_scores(prompt, images)
    best_image = images[scores.index(max(scores))]
    return best_image

generate_good_image("a photo of an astronaut riding a horse on mars")
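The re-ranking step itself is independent of the diffusion pipeline: score every candidate, then sort. Here is a minimal sketch with a pluggable scoring function. The `rerank` helper is hypothetical, and the usage lines use strings and `len` as stand-ins for images and PickScore.

```python
def rerank(candidates, score_fn):
    """Return (score, candidate) pairs sorted from best to worst."""
    scored = [(score_fn(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical stand-ins: strings instead of images, length instead of PickScore
candidates = ["img_a", "img_bbb", "img_cc"]
ranked = rerank(candidates, score_fn=len)
best_score, best_candidate = ranked[0]
print("ranking:", [c for _, c in ranked])
print("best:", best_candidate, "with score", best_score)
```

Keeping the full ranking rather than only the maximum makes it easy to inspect the runners-up when the top scores are close.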

## Quantize Stable Diffusion with quanto

Let's apply Post Training Quantization (PTQ) to our base model.

In [None]:
from quanto import quantize, freeze, qint8
import torch

model = "stabilityai/stable-diffusion-xl-base-1.0"

print(model)

In [None]:
def PTQ(torch_dtype, unet_dtype=None, device="cuda"):
    """Load SDXL and, if unet_dtype is given, quantize the UNet weights with quanto."""
    vae = AutoencoderKL.from_pretrained(
        'madebyollin/sdxl-vae-fp16-fix',
        torch_dtype=torch.float16,
        use_safetensors=True,
    ).to(device)

    # height, width, generator and num_inference_steps are call-time
    # arguments, so they are not passed to from_pretrained().
    pipe = DiffusionPipeline.from_pretrained(
        model,
        torch_dtype=torch_dtype,
        vae=vae,
        scheduler=scheduler,
        use_safetensors=True,
    ).to(device)

    if unet_dtype:
        # quantize() swaps the UNet's layers for quantized versions;
        # freeze() then replaces the float weights with quantized ones.
        quantize(pipe.unet, weights=unet_dtype)
        freeze(pipe.unet)

    pipe.set_progress_bar_config(disable=True)
    return pipe

In [None]:
qpipe = PTQ(torch_dtype=torch.float16, unet_dtype=qint8)
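For a rough sense of what int8 weights can save, the sketch below estimates weight storage from a parameter count and the bytes per element of each dtype. The parameter count is an assumption (roughly 2.6 billion for the SDXL UNet, as reported by its authors; treat it as approximate). The estimate covers weights only, not activations, quantization scales, or the other pipeline components.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 10**9


unet_params = 2_600_000_000  # assumption: approximate SDXL UNet parameter count

for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: {weight_memory_gb(unet_params, dtype):.1f} GB")
```

Peak memory measured at inference time can differ substantially from this estimate, since it also depends on activations and on whatever else is resident on the GPU.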

### Prompts and seeds

In [None]:
queue = []

# Photorealistic portrait (Portrait)
queue.extend([{
  'prompt': '3/4 shot, candid photograph of a beautiful 30 year old redhead woman with messy dark hair, peacefully sleeping in her bed, night, dark, light from window, dark shadows, masterpiece, uhd, moody',
  'seed': 877866767,
}])

# Creative interior image (Interior)
queue.extend([{
  'prompt': 'futuristic living room with big windows, brown sofas, coffee table, plants, cyberpunk city, concept art, earthy colors',
  'seed': 5567822442,
}])

# Macro photography (Macro)
queue.extend([{
  'prompt': 'macro shot of a bee collecting nectar from lavender flowers',
  'seed': 2257899409,
}])

# Rendered 3D image (3D)
queue.extend([{
  'prompt': '3d rendered isometric fiji island beach, 3d tile, polygon, cartoony, mobile game',
  'seed': 987865634,
}])

### Display images, memory, and execution time

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"
start = time.time()
images = qpipe(prompt).images[0]
end = time.time()
mem_bytes = torch.cuda.max_memory_allocated()
images

In [None]:
print(f"Execution time: {(end - start):.3f} sec")
print(f"Memory: {mem_bytes/(10**6):.3f} MB")

#### Display Multi images

In [None]:
# Create a generator
generator = torch.Generator(device='cuda')

# Start a loop to process prompts one by one
for i, generation in enumerate(queue, start=1):

  # We start the counter
  image_start = perf_counter()

  # Assign the seed to the generator
  generator.manual_seed(generation['seed'])

  # Create the image
  image = qpipe(
    prompt=generation['prompt'],
    generator=generator,
  ).images[0]

  # Save the image
  image.save(f'q_image_{i}.png')

  # We stop the counter and save the result
  generation['total_time'] = perf_counter() - image_start

# Print the generation time of each image
images_totals = ', '.join(map(lambda generation: str(round(generation['total_time'], 1)), queue))
print('Image time:', images_totals)

# Print the average time
images_average = round(sum(generation['total_time'] for generation in queue) / len(queue), 1)
print('Average image time:', images_average)

# Print the Max. memory used
max_memory = round(torch.cuda.max_memory_allocated(device='cuda') / 1000000000, 2)
print('Max. memory used:', max_memory, 'GB')

In [None]:
# Show the q_image_1
from PIL import Image

img1 = Image.open("q_image_1.png")
img1.show()

In [None]:
# Show the q_image_2
from PIL import Image

img2 = Image.open("q_image_2.png")
img2.show()

In [None]:
# Show the q_image_3
from PIL import Image

img3 = Image.open("q_image_3.png")
img3.show()

In [None]:
# Show the q_image_4
from PIL import Image

img4 = Image.open("q_image_4.png")
img4.show()

### Evaluating Diffusion Models After Post Training Quantization

* CLIP score
* PickScore

#### CLIP score

In [None]:
prompts = [
    "a photo of an astronaut riding a horse on mars",
    "A high tech solarpunk utopia in the Amazon rainforest",
    "A pikachu fine dining with a view to the Eiffel Tower",
    "A mecha robot in a favela in expressionist style",
    "an insect robot preparing a delicious meal",
    "A small cabin on top of a snowy mountain in the style of Disney, artstation",
]

images = qpipe(prompts, num_images_per_prompt=1, output_type="np", height = height, width = width).images

print(images.shape)
# (6, 512, 512, 3)

In [None]:
from torchmetrics.functional.multimodal import clip_score
from functools import partial

clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16")

def calculate_clip_score(images, prompts):
    images_int = (images * 255).astype("uint8")
    clip_score = clip_score_fn(torch.from_numpy(images_int).permute(0, 3, 1, 2), prompts).detach()
    return round(float(clip_score), 4)

sd_clip_score = calculate_clip_score(images, prompts)
print(f"CLIP score: {sd_clip_score}")

#### PickScore

In [None]:
# import
from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch

# load model
device = "cuda"
processor_name_or_path = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
model_pretrained_name_or_path = "yuvalkirstain/PickScore_v1"
processor = AutoProcessor.from_pretrained(processor_name_or_path)
model = AutoModel.from_pretrained(model_pretrained_name_or_path).eval().to(device)

In [None]:
# Score function adapted from their docs
def get_scores(prompt, images):
    
    # preprocess
    image_inputs = processor(
        images=images,
        padding=True,
        truncation=True,
        max_length=77,
        return_tensors="pt",
    ).to(device)
    
    text_inputs = processor(
        text=prompt,
        padding=True,
        truncation=True,
        max_length=77,
        return_tensors="pt",
    ).to(device)


    with torch.no_grad():
        # embed
        image_embs = model.get_image_features(**image_inputs)
        image_embs = image_embs / torch.norm(image_embs, dim=-1, keepdim=True)
    
        text_embs = model.get_text_features(**text_inputs)
        text_embs = text_embs / torch.norm(text_embs, dim=-1, keepdim=True)
    
        # score
        scores = model.logit_scale.exp() * (text_embs @ image_embs.T)[0]
       
    return scores.cpu().tolist()

In [None]:
get_scores("a photo of an astronaut riding a horse on mars", images)

In [None]:
get_scores("a photo of a pretty flower", images)

In [None]:
from datasets import load_dataset
pap = load_dataset("yuvalkirstain/pickapic_v1_no_images")
prompts = pap['validation_unique']['caption']
prompts[:10]

#### Measuring the effect of CFG_Scale on Score

In [None]:
import matplotlib.pyplot as plt
from IPython.display import clear_output

average_scores = []
cfg_scales = [2, 9, 12, 30]
for cfg_scale in cfg_scales:
    scores = []
    for i, prompt in enumerate(prompts[:5]):
        print(f"Scale {cfg_scale}, prompt {i}")
        generator = torch.Generator(device=device).manual_seed(42)  # Fixed seed for reproducibility
        im = qpipe(prompt, num_inference_steps=50, 
                  generator=generator, guidance_scale=cfg_scale).images[0]
        scores.append(get_scores(prompt, im)[0])
        clear_output(wait=True)
    average_scores.append(sum(scores)/len(scores))

plt.plot(cfg_scales, average_scores)

#### Using A Score Model for Re-Ranking

In [None]:
def generate_good_image(prompt):
    images = []
    # Generate 2 candidate images: one with the pipeline's default guidance scale and one with guidance_scale=5
    images += qpipe(prompt, num_inference_steps=50, num_images_per_prompt=1,height = height, width = width).images
    images += qpipe(prompt, num_inference_steps=50, num_images_per_prompt=1,height = height, width = width, guidance_scale=5).images 
    # Score them and pick the best one
    scores = get_scores(prompt, images)
    best_image = images[scores.index(max(scores))]
    return best_image

generate_good_image("a photo of an astronaut riding a horse on mars")

## Conclusion 

|  | Memory (GB) | Execution time (sec) |
| :-----| ----: | :----: |
| SDXL | 11.24 | 19.318 |
| PTQ SDXL | 15.8 | 23.6 |

|  | CLIP score | PickScore |
| :-----| ----: | :----: |
| SDXL | 31.4234 | 22.10 |
| PTQ SDXL | 34.1379 | 22.1 |
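To compare the two rows on a common footing, the sketch below converts the memory figures to the same unit and computes the relative change of each metric. The values are copied from the tables above; positive numbers mean the quantized run is higher. Since the rows are reported with different precision and units, treat the comparison as rough.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Values copied from the tables above (memory converted from MB to GB)
baseline = {"memory_gb": 11238.718 / 1000, "time_sec": 19.318, "clip_score": 31.4234}
quantized = {"memory_gb": 15.8, "time_sec": 23.6, "clip_score": 34.1379}

changes = {
    key: round(percent_change(baseline[key], quantized[key]), 1)
    for key in baseline
}
for key, value in changes.items():
    print(f"{key}: {value:+.1f}%")
```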

## References

* [Diffusers](https://huggingface.co/docs/diffusers/)
* [Evaluating Diffusion Models](https://huggingface.co/docs/diffusers/conceptual/evaluation#text-guided-image-generation)
* [Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation](https://arxiv.org/abs/2305.01569)
