# Image generation with DeepFloyd IF and OpenVINO™

DeepFloyd IF is an advanced open-source text-to-image model that delivers remarkable photorealism and language comprehension. DeepFloyd IF consists of a frozen text encoder and three cascaded pixel diffusion modules: a base model that creates 64x64 px images based on text prompts and two super-resolution models, each designed to generate images with increasing resolution: 256x256 px and 1024x1024 px. All stages of the model employ a frozen text encoder, built on the T5 transformer, to derive text embeddings, which are then passed to a UNet architecture enhanced with cross-attention and attention pooling.
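
The cascade described above can be summarised as a small table of stages. The sketch below is only an illustration of the resolutions named in the text; the `Stage` class and the helper functions are hypothetical and not part of the DeepFloyd IF or Diffusers APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    input_size: Optional[int]  # None for the base stage, which starts from text only
    output_size: int


# Resolutions as described in the text above; the stage names are illustrative.
CASCADE = [
    Stage("base", None, 64),
    Stage("super-resolution 1", 64, 256),
    Stage("super-resolution 2", 256, 1024),
]


def upscale_factors(stages):
    """Return the per-stage upscaling factor for every stage that takes an image."""
    return [s.output_size // s.input_size for s in stages if s.input_size is not None]


def is_chained(stages):
    """Check that each stage consumes exactly what the previous stage produced."""
    return all(nxt.input_size == prev.output_size for prev, nxt in zip(stages, stages[1:]))


print(upscale_factors(CASCADE))  # [4, 4]
print(is_chained(CASCADE))       # True
```

Each super-resolution stage therefore enlarges its input by a factor of four along each side.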

![deepfloyd_if_scheme](https://github.com/deep-floyd/IF/raw/develop/pics/deepfloyd_if_scheme.jpg)


## Prerequisites
Install required packages.

In [None]:
%%bash
# Set up requirements

pip install --upgrade pip
pip install deepfloyd_if==1.0.2rc0
pip install xformers==0.0.16
pip install git+https://github.com/openai/CLIP.git --no-deps
pip install huggingface_hub
pip install --upgrade diffusers accelerate transformers safetensors
pip install openvino-dev==2023.0.0.dev20230407

In [None]:
import gc
import os
from pathlib import Path

from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil
from openvino.runtime import Core, serialize
from openvino.tools import mo
import torch

In [None]:
device = 'cpu'
output_dtype = torch.float32
compress_to_fp16 = False

models_dir = Path('./models')
models_dir.mkdir(exist_ok=True)

encoder_ir_path = models_dir / 'encoder_ir.xml'
first_stage_unet_ir_path = models_dir / 'unet_ir_I_l.xml'
second_stage_unet_ir_path = models_dir / 'unet_ir_II_m.xml'

### Authentication
In order to access IF checkpoints, users need to provide an authentication token. To generate a token, follow the link displayed in the cell output.

In [None]:
from huggingface_hub import login

login()

## DeepFloyd IF in Diffusers library
To work with IF by DeepFloyd Lab, we will use the Hugging Face Diffusers library. Diffusers exposes the `DiffusionPipeline` class for experimenting with diffusion models. The code below demonstrates how to create a `DiffusionPipeline` for each of the first two IF stages using the IF configs:

In [None]:
%%time

# Downloading the model weights may take some time. The approximate total checkpoint size is 27 GB.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-L-v1.0",
    variant="fp32",
    torch_dtype=output_dtype
)

stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-M-v1.0",
    text_encoder=None,
    variant="fp32",
    torch_dtype=output_dtype
)

## Convert models to OpenVINO Intermediate representation (IR) format
The OpenVINO Model Optimizer enables direct conversion of PyTorch models. We will use the `mo.convert_model` method to obtain OpenVINO IR versions of the models. It requires a model object, input data for model tracing, and other relevant parameters. The `use_legacy_frontend=True` parameter instructs the Model Optimizer to use the ONNX model format as an intermediate step instead of the PyTorch JIT compiler, which is not optimal in this case.

The pipeline consists of three important parts:

 - A Text Encoder that translates user prompts to vectors in the latent space that the Diffusion model can understand.
 - A Stage 1 U-Net for step-by-step denoising of the latent image representation.
 - A Stage 2 U-Net that takes the low-resolution output from the previous step and the latent representations to upscale the resulting image.
 
Let us convert each part.

In [None]:
# This context manager will keep intermediate ONNX model weights out of the working directory.

import contextlib
import tempfile

@contextlib.contextmanager
def temp_dir():
    cwd = Path.cwd()
    with tempfile.TemporaryDirectory() as tmp_dir_name:
        os.chdir(tmp_dir_name)
        try:
            yield
        finally:
            # Restore the working directory even if the conversion fails.
            os.chdir(cwd)
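
A minimal sketch of how a context manager of this kind behaves, using only the standard library. The name `scratch_dir` and the file name `intermediate.onnx` are hypothetical stand-ins; the empty file only plays the role of the intermediate weights written during conversion.

```python
import contextlib
import os
import tempfile
from pathlib import Path


@contextlib.contextmanager
def scratch_dir():
    """Switch into a fresh temporary directory and always switch back."""
    original = Path.cwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            yield Path(tmp)
        finally:
            # Leave the directory before it is deleted, even on errors.
            os.chdir(original)


before = Path.cwd()
with scratch_dir() as tmp:
    # Relative paths now resolve inside the temporary directory.
    Path("intermediate.onnx").write_bytes(b"")
    created = tmp / "intermediate.onnx"
    existed_inside = created.exists()
    inside_matches = Path.cwd().resolve() == tmp.resolve()
after = Path.cwd()

print(existed_inside, created.exists(), before == after)
```

The file exists only while the block is active, and the working directory is the same before and after it.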

## Convert Text Encoder

The text encoder is responsible for converting the input prompt, such as "ultra close-up color photo portrait of rainbow owl with deer horns in the woods" into an embedding space that can be fed to the next stage's U-Net. Typically, it is a transformer-based encoder that maps a sequence of input tokens to a sequence of text embeddings.

The input for the text encoder consists of a tensor `input_ids`, which contains token indices from the text processed by the tokenizer and padded to the maximum length accepted by the model, and `attention_mask`, which marks relevant tokens with 1s and padded tokens with 0s.
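
The relationship between the two inputs can be sketched in plain Python. This is an illustration only: the helper `pad_tokens` and the token ids are hypothetical, the maximum length of 77 is taken from the example inputs used for conversion in this notebook, and a padding id of 0 is assumed. In the real pipeline the tokenizer produces both tensors.

```python
def pad_tokens(token_ids, max_length=77, pad_id=0):
    """Pad (or truncate) token ids to max_length and build the matching mask."""
    token_ids = list(token_ids)[:max_length]
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token ids standing in for a tokenized prompt.
input_ids, attention_mask = pad_tokens([71, 3, 9, 1])

print(len(input_ids), len(attention_mask))  # 77 77
print(sum(attention_mask))                  # 4
```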

In [None]:
%%time

if not encoder_ir_path.exists():
    # Define example inputs for model conversion
    example_inputs = {
        'input_ids': torch.ones((1, 77), dtype=torch.long),
        'attention_mask': torch.ones((1, 77), dtype=torch.long)
    }
    
    with temp_dir():
        encoder_ir = mo.convert_model(
            stage_1.text_encoder,
            example_input=example_inputs,
            input_shape=[[-1, 77], [-1, 77]],
            compress_to_fp16=compress_to_fp16,
            progress=True,
            onnx_opset_version=14,
            use_legacy_frontend=True
        )

    serialize(encoder_ir, encoder_ir_path)
    del encoder_ir
    
del stage_1.text_encoder
gc.collect();

## Convert the first Pixel Diffusion module's UNet

The U-Net model gradually denoises the latent image representation, guided by the text encoder hidden state.

The U-Net model has three inputs:

 - `sample` - the latent image sample from the previous step. The generation process has not started yet, so you will use random noise.
 - `timestep` - the current scheduler step.
 - `encoder_hidden_states` - the hidden state of the text encoder.

The model predicts the sample state for the next step.

The first Diffusion module in the cascade generates 64x64 pixel low resolution images.
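
The conversion cell that follows passes example tensors together with an `input_shape` specification in which `-1` marks a dynamic dimension. The sketch below shows how the two relate, using shapes only. The helper `matches` is hypothetical and is not an OpenVINO function.

```python
def matches(shape, spec):
    """Return True if a concrete shape satisfies a spec where -1 means 'any size'."""
    return len(shape) == len(spec) and all(
        s == -1 or s == d for d, s in zip(shape, spec)
    )


# Shape specification and example shapes as used for the first-stage U-Net.
SPEC = {
    "sample": [-1, 3, -1, -1],
    "timestep": [1],
    "encoder_hidden_states": [-1, 77, 4096],
}
EXAMPLE = {
    "sample": (2, 3, 64, 64),
    "timestep": (1,),
    "encoder_hidden_states": (2, 77, 4096),
}

all_match = all(matches(EXAMPLE[name], SPEC[name]) for name in SPEC)
print(all_match)  # True

# A sample with four channels would not fit the declared specification.
print(matches((2, 4, 64, 64), SPEC["sample"]))  # False
```

Leaving the batch and spatial dimensions of `sample` dynamic lets the converted model accept other batch sizes and resolutions, while the channel count stays fixed at 3.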

In [None]:
%%time

if not first_stage_unet_ir_path.exists():
    example_inputs = {
        'sample': torch.rand((2, 3, 64, 64), device=device, dtype=output_dtype),
        'timestep': torch.tensor([500], device=device, dtype=output_dtype),
        'encoder_hidden_states': torch.rand((2, 77, 4096), device=device, dtype=output_dtype),
    }

    with temp_dir():
        unet_1_ir = mo.convert_model(
            stage_1.unet,
            example_input=example_inputs,
            compress_to_fp16=compress_to_fp16,
            input_shape=[[-1, 3, -1, -1], [1], [-1, 77, 4096]],
            progress=True,
            onnx_opset_version=14,
            use_legacy_frontend=True
        )

    serialize(unet_1_ir, first_stage_unet_ir_path)
    
    del unet_1_ir

stage_1_config = stage_1.unet.config
del stage_1.unet
gc.collect();

## Convert the second pixel diffusion module

The second Diffusion module in the cascade generates 256x256 pixel images.
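
The example `sample` used for conversion below has 6 channels rather than 3. A plausible reading, stated here as an assumption about the Diffusers super-resolution pipeline rather than something this notebook spells out, is that the noisy 256x256 sample is concatenated along the channel axis with the upscaled output of the first stage. The sketch below works on shapes only, and both helpers are hypothetical.

```python
def upscaled_shape(shape, factor):
    """Shape of an NCHW tensor after spatial upscaling by an integer factor."""
    n, c, h, w = shape
    return (n, c, h * factor, w * factor)


def concat_channels(a, b):
    """Shape of two NCHW tensors concatenated along the channel axis."""
    if (a[0], a[2], a[3]) != (b[0], b[2], b[3]):
        raise ValueError("batch and spatial dimensions must agree")
    return (a[0], a[1] + b[1], a[2], a[3])


stage_1_output = (2, 3, 64, 64)   # low-resolution image from the first stage
noisy_sample = (2, 3, 256, 256)   # sample being denoised by the second stage

unet_input = concat_channels(noisy_sample, upscaled_shape(stage_1_output, 4))
print(unet_input)  # (2, 6, 256, 256)
```

The additional `class_labels` input in the conversion cell has one entry per batch element. It is assumed here to carry the noise level applied to the upscaled image.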

In [None]:
%%time

if not second_stage_unet_ir_path.exists():
    example_inputs = {
        'sample': torch.rand((2, 6, 256, 256), device=device, dtype=output_dtype),
        'timestep': torch.tensor([500], device=device, dtype=output_dtype),
        'encoder_hidden_states': torch.rand((2, 77, 4096), device=device, dtype=output_dtype),
        'class_labels': torch.tensor([250, 250])
    }

    with temp_dir():
        unet_2_ir = mo.convert_model(
            stage_2.unet,
            example_input=example_inputs,
            compress_to_fp16=compress_to_fp16,
            input_shape=[[-1, 6, 256, 256], [1], [-1, 77, 4096], [-1]],
            progress=True,
            onnx_opset_version=14,
            use_legacy_frontend=True
        )

    serialize(unet_2_ir, second_stage_unet_ir_path)
    
    del unet_2_ir
    
stage_2_config = stage_2.unet.config
del stage_2.unet
gc.collect();

## Prepare Inference pipeline

The original pipeline from the source repository will be reused in this example. To achieve this, adapter classes were created that let OpenVINO models replace the PyTorch models and integrate seamlessly into the pipeline.
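
The adapter idea can be sketched without OpenVINO. In the illustration below, `FakeCompiledModel` is a hypothetical stand-in for a compiled model: it takes a list of inputs and returns a mapping keyed by output. `UnetAdapter` wraps it so that a caller can use it like a U-Net that returns an object with a `sample` attribute. Neither class belongs to any library.

```python
from types import SimpleNamespace


class FakeCompiledModel:
    """Stand-in for a compiled model: list of inputs in, mapping of outputs out."""

    outputs = ["out"]

    def __call__(self, inputs):
        sample = inputs[0]
        # Dummy computation in place of real inference.
        return {"out": [2 * x for x in sample]}


class UnetAdapter:
    """Expose a compiled model through the call signature the pipeline expects."""

    def __init__(self, compiled_model, config):
        self.compiled_model = compiled_model
        self.config = config  # the pipeline may read settings from the U-Net config

    def __call__(self, sample, timestep, encoder_hidden_states=None):
        # Flatten positional and keyword arguments into one ordered input list.
        result = self.compiled_model([sample, [timestep], encoder_hidden_states])
        # Return an object with a `.sample` attribute, like the wrapped model would.
        return SimpleNamespace(sample=result[self.compiled_model.outputs[0]])


unet = UnetAdapter(FakeCompiledModel(), config={"in_channels": 3})
out = unet([1.0, 2.0], 500, encoder_hidden_states=[0.0])
print(out.sample)  # [2.0, 4.0]
```

Because the pipeline only relies on the call signature and the returned attribute, it does not need to know which backend runs the model.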

In [None]:
core = Core()

In [None]:
class TextEncoder:
    def __init__(self, ir_path, dtype=output_dtype):
        self.ir_path = ir_path 
        self.dtype = dtype
        
    def __call__(self, *args, **kwargs):
        self.encoder_openvino = core.compile_model(self.ir_path, "CPU")
        try:
            result = self.encoder_openvino(list(args) + list(kwargs.values()))
            result_numpy = result[self.encoder_openvino.outputs[0]]
        finally:
            del self.encoder_openvino
            gc.collect()
        return [torch.tensor(result_numpy, dtype=self.dtype)]

In [None]:
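class UnetFirstStage:
    """Adapter that lets the Stage 1 pipeline call the OpenVINO U-Net like the PyTorch one."""
    def __init__(self, unet_ir_path, config, dtype=output_dtype):
        self.unet_openvino = core.compile_model(unet_ir_path, "CPU")
        self.config = config
        self.dtype = dtype
        
    def __call__(self, *args, **kwargs):
        # Positional args are (sample, timestep); the text embeddings arrive as a keyword.
        parameters = [*args, kwargs['encoder_hidden_states']]
        parameters[1] = torch.tensor(parameters[1])
        result = self.unet_openvino(parameters)
        result_numpy = result[self.unet_openvino.outputs[0]]
        # Mimic the PyTorch U-Net output, which exposes the prediction as `.sample`.
        class UnetOutput:
            sample = torch.tensor(result_numpy, dtype=self.dtype)
        return UnetOutput
    
class UnetSecondStage:
    """Adapter for the Stage 2 U-Net, which additionally receives `class_labels`."""
    def __init__(self, unet_ir_path, config, dtype=output_dtype):
        self.unet_openvino = core.compile_model(unet_ir_path, "CPU")
        self.config = config
        self.dtype = dtype
        
    def __call__(self, *args, **kwargs):
        # Input order must match the order used during conversion.
        parameters = [*args, kwargs['encoder_hidden_states'], kwargs['class_labels']]
        parameters[1] = torch.tensor(parameters[1])
        result = self.unet_openvino(parameters)
        result_numpy = result[self.unet_openvino.outputs[0]]
        # Mimic the PyTorch U-Net output, which exposes the prediction as `.sample`.
        class UnetOutput:
            sample = torch.tensor(result_numpy, dtype=self.dtype)
        return UnetOutput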

## Run Text-to-Image generation

Now, we can set a text prompt for image generation and execute the inference pipeline. Optionally, you can also modify the random generator seed for latent state initialization and adjust the number of images to be generated for the given prompt.

In [None]:
%%time

stage_1.text_encoder = TextEncoder(encoder_ir_path)

In [None]:
%%time

prompt = 'ultra close-up color photo portrait of rainbow owl with deer horns in the woods'
count = 1

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt, num_images_per_prompt=count)

In [None]:
%%time

stage_1.unet = UnetFirstStage(first_stage_unet_ir_path, stage_1_config)

In [None]:
%%time

generator = torch.manual_seed(142)

image = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt").images
pt_to_pil(image)[0]

In [None]:
%%time

stage_2.unet = UnetSecondStage(second_stage_unet_ir_path, stage_2_config)

In [None]:
%%time

image = stage_2(
    image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
for i, im in enumerate(pt_to_pil(image)):
    im.save(f"./if_stage_II_ov_{i}.png")
pt_to_pil(image)[0]