# Image generation with DeepFloyd IF and OpenVINO™

DeepFloyd IF is an advanced open-source text-to-image model that delivers remarkable photorealism and language comprehension. DeepFloyd IF consists of a frozen text encoder and three cascaded pixel diffusion modules: a base model that creates 64x64 px images based on text prompts and two super-resolution models, each designed to generate images with increasing resolution: 256x256 px and 1024x1024 px. All stages of the model employ a frozen text encoder, built on the T5 transformer, to derive text embeddings, which are then passed to a UNet architecture enhanced with cross-attention and attention pooling.

![deepfloyd_if_scheme](https://github.com/deep-floyd/IF/raw/develop/pics/deepfloyd_if_scheme.jpg)
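
The cascade described above can be summarised numerically. The snippet below is a minimal sketch and not part of the DeepFloyd IF API: `STAGE_RESOLUTIONS` and both helper functions are hypothetical names, and the values are the three output sizes listed in the introduction. It shows that each super-resolution stage multiplies the side length by 4, which means 16 times more pixels per stage.

```python
# Output side lengths (in pixels) of the three cascaded stages, as listed above.
# STAGE_RESOLUTIONS is a hypothetical name used only for this illustration.
STAGE_RESOLUTIONS = [64, 256, 1024]


def upscale_factors(resolutions):
    """Return the side-length ratio between each pair of consecutive stages."""
    return [nxt // prev for prev, nxt in zip(resolutions, resolutions[1:])]


def pixel_counts(resolutions):
    """Return the number of pixels of a square image for each stage."""
    return [side * side for side in resolutions]


factors = upscale_factors(STAGE_RESOLUTIONS)
pixels = pixel_counts(STAGE_RESOLUTIONS)
print(factors)  # [4, 4]
print(pixels)   # [4096, 65536, 1048576]
```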


## Prerequisites
Install the required packages. The commands in the cells below are commented out; uncomment and run the ones you need.

In [None]:

# conda deactivate && conda remove -n if --all -y || 1
# conda create -n if python=3.9 -y
# conda activate if

In [1]:
# Set up requirements

# pip install --upgrade pip
# pip install deepfloyd_if==1.0.2rc0
# pip install xformers==0.0.16
# pip install git+https://github.com/openai/CLIP.git --no-deps
# pip install huggingface_hub
# pip install --upgrade diffusers accelerate transformers safetensors
# pip install openvino-dev==2023.0.0.dev20230407
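
Before moving on, it can help to confirm which of these packages are actually installed. The following is a minimal sketch using only the standard library; `REQUIRED_PACKAGES` and `installed_versions` are hypothetical names, and the list assumes the distribution names match the names used in the pip commands above. CLIP is installed from Git, so it is left out here because its distribution name is not stated in this notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip commands above (an assumption:
# the installed distribution names match the names passed to pip).
REQUIRED_PACKAGES = [
    "deepfloyd_if", "xformers", "huggingface_hub", "diffusers",
    "accelerate", "transformers", "safetensors", "openvino-dev",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED_PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```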

In [1]:
import gc
import os
from pathlib import Path

# Memory efficient attention is not supported by ONNX
os.environ['FORCE_MEM_EFFICIENT_ATTN'] = "0"

from deepfloyd_if.modules import IFStageI, IFStageII, StableStageIII
from deepfloyd_if.modules.t5 import T5Embedder
from deepfloyd_if.pipelines import dream
from openvino.runtime import Core, serialize
from openvino.tools import mo
import torch

FORCE_MEM_EFFICIENT_ATTN= 0 @UNET:QKVATTENTION


2023-05-05 20:30:47.419592: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


In [2]:
device = 'cpu'
encoder_ir_path = 'encoder_ir.xml'
first_stage_unet_ir_path = './unet_ir.xml'

In [None]:
from huggingface_hub import login

login()

## Convert text encoder

### Initialize PyTorch model

Downloading the model weights may take some time. The checkpoint is approximately 20 GB.
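
Given the size of the download, checking the free disk space first can save a failed run. This is a minimal sketch: `has_free_space` is a hypothetical helper, and the path `"."` is an assumption that should be replaced with the directory where the weights are cached.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# 20 GB is the approximate checkpoint size quoted above; "." is a placeholder
# for the actual cache directory.
if not has_free_space(".", 20):
    print("Warning: less than 20 GB free; the download may fail.")
```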

In [3]:
%%time

t5 = T5Embedder(device=device, torch_dtype=torch.float32)



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 597 ms, sys: 6.72 s, total: 7.32 s
Wall time: 10 s


In [4]:
%%time
# Top memory consumption during this cell execution is 45GB

# Define example inputs for model conversion
example_inputs = {
    'input_ids': torch.ones((1, 77), dtype=torch.long),
    'attention_mask': torch.ones((1, 77), dtype=torch.long)
}

encoder_ir = mo.convert_model(
    t5.model,
    example_input=example_inputs,
    input_shape=[[-1, 77], [-1, 77]],
    compress_to_fp16=False,
    progress=True,
    onnx_opset_version=14,
    use_legacy_frontend=True
)

serialize(encoder_ir, str(encoder_ir_path))

del t5.model
del encoder_ir
gc.collect();



Progress: [....................] 100.00% doneCPU times: user 3min 10s, sys: 2min 1s, total: 5min 11s
Wall time: 9min 3s
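
When `serialize` is called with only an `.xml` path, it stores the weights in a `.bin` file with the same name next to it. After a conversion this long, it is worth confirming that both files exist before deleting the PyTorch model. The helper below is a minimal sketch; `ir_files_present` is a hypothetical name and not an OpenVINO function.

```python
from pathlib import Path


def ir_files_present(xml_path):
    """Return True if both the .xml file and its sibling .bin file exist."""
    xml_file = Path(xml_path)
    bin_file = xml_file.with_suffix(".bin")
    return xml_file.is_file() and bin_file.is_file()


# 'encoder_ir.xml' is the path used for the text encoder in this notebook.
print(ir_files_present("encoder_ir.xml"))
```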


## Convert the first pixel diffusion module

### Initialize PyTorch model

The first-stage UNet is loaded in half precision, so it must be converted on a CUDA device: the CPU backend library does not implement one of the required operations for fp16 tensors (`"cos_vml_cpu" not implemented for 'Half'`).

Alternatively, the same model can be downloaded in fp32 precision through the Diffusers API. The code below converts that model, although in my experience the conversion never completed:
```
from diffusers import DiffusionPipeline
from openvino.tools import mo
import torch

stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp32", torch_dtype=torch.float32)

example_inputs = {
    'sample': torch.rand((2, 3, 64, 64), device=device, dtype=torch.float32),
    'timestep': torch.rand((2), device=device, dtype=torch.float32),
    'encoder_hidden_states': torch.rand((2, 77, 4096), device=device, dtype=torch.float32),
}

unet_1_ir = mo.convert_model(
    stage_1.unet,
    example_input=example_inputs,
    compress_to_fp16=False,
    input_shape=[[-1, 3, -1, -1], [-1,], [-1, 77, 4096]],
    progress=True,
    onnx_opset_version=14,
    use_legacy_frontend=True
)
```

In [4]:
%%time

# The model must be loaded on CUDA: "cos_vml_cpu" is not implemented for 'Half' on CPU
device = 'cuda'
if_I = IFStageI("IF-I-M-v1.0", device=device)
# if_I.model.to(torch.float32)

CPU times: user 1.29 s, sys: 717 ms, total: 2 s
Wall time: 2.6 s


In [8]:
%%time

example_inputs = {
    'x': torch.rand((2, 3, 64, 64), device=device, dtype=torch.float32),
    'timesteps': torch.rand((2), device=device, dtype=torch.float32),
    'text_emb': torch.rand((2, 77, 4096), device=device, dtype=torch.float32),
}

unet_1_ir = mo.convert_model(
    if_I.model,
    example_input=example_inputs,
    compress_to_fp16=False,
    input_shape=[[-1, 3, -1, -1], [-1,], [-1, 77, 4096]],
    progress=True,
    onnx_opset_version=14,
    use_legacy_frontend=True
)

device = 'cpu'
if_I.device = device


serialize(unet_1_ir, first_stage_unet_ir_path)
del unet_1_ir
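
In the `input_shape` argument above, `-1` marks a dynamic dimension, so the converted UNet accepts any batch size and any spatial size while the channel count and the text embedding shape stay fixed. The sketch below illustrates that rule in plain Python; `matches_partial_shape` and `UNET_INPUT_SHAPES` are hypothetical names, not OpenVINO functions, and OpenVINO performs its own shape checks at inference time.

```python
def matches_partial_shape(shape, partial_shape):
    """Check a concrete shape against a template where -1 means 'any size'."""
    if len(shape) != len(partial_shape):
        return False
    return all(p == -1 or s == p for s, p in zip(shape, partial_shape))


# Templates copied from the `input_shape` argument of the UNet conversion.
UNET_INPUT_SHAPES = [[-1, 3, -1, -1], [-1], [-1, 77, 4096]]

print(matches_partial_shape((2, 3, 64, 64), UNET_INPUT_SHAPES[0]))  # True
print(matches_partial_shape((2, 77, 1024), UNET_INPUT_SHAPES[2]))   # False
```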

## Prepare Inference pipeline

This example reuses the original pipeline from the source repository. To make that possible, adapter classes wrap the OpenVINO models so that they can replace the PyTorch models and plug into the pipeline unchanged.
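
The adapter idea can be illustrated without OpenVINO. The sketch below is a simplified stand-in and not the classes used in this notebook: `CallableAdapter` and `fake_backend` are hypothetical names, and the backend is assumed to be any callable that takes a list of inputs and returns a list of outputs.

```python
class CallableAdapter:
    """Wrap a backend callable so it can be used where a model object is expected.

    `backend` is a hypothetical stand-in for a compiled model: any callable
    that takes a list of inputs and returns a list of outputs.
    """

    def __init__(self, backend, output_key=None):
        self.backend = backend
        self.output_key = output_key

    def __call__(self, *args, **kwargs):
        # Flatten positional and keyword inputs into one ordered list.
        inputs = [*args, *kwargs.values()]
        first_output = self.backend(inputs)[0]
        # Optionally mimic models that return a dict of named outputs.
        if self.output_key is not None:
            return {self.output_key: first_output}
        return first_output


def fake_backend(inputs):
    """Toy backend: returns the total length of all inputs as its only output."""
    return [sum(len(x) for x in inputs)]


encoder = CallableAdapter(fake_backend, output_key="last_hidden_state")
print(encoder(input_ids=[1, 2, 3], attention_mask=[1, 1, 1]))
# {'last_hidden_state': 6}
```

The calling code only sees an object that can be called like the original model, so the backend can be swapped without touching the pipeline.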

In [5]:
core = Core()

In [16]:
class TextEncoder:
    def __init__(self, encoder_ir_path, dtype=torch.float16):
        self.encoder_ir_path = encoder_ir_path
        self.dtype = dtype

    def __call__(self, *args, **kwargs):
        print("ENCODER CALL")
        # Compile the model on each call and release it afterwards,
        # so that it does not occupy memory between calls
        self.encoder_openvino = core.compile_model(self.encoder_ir_path, "CPU")

        # Pass all inputs to the compiled model as a single ordered list
        result = self.encoder_openvino([*args, *kwargs.values()])
        result_numpy = result[self.encoder_openvino.outputs[0]]

        del self.encoder_openvino
        gc.collect()
        return {'last_hidden_state': torch.tensor(result_numpy, dtype=self.dtype)}

In [17]:
class UnetFirstStage:
    def __init__(self, unet_ir_path, dtype=torch.float16):
        self.unet_openvino = core.compile_model(unet_ir_path, "CPU")
        self.dtype = dtype
        
    def __call__(self, *args, **kwargs):
        parameters = [*args, kwargs['text_emb']]
        # [t.cpu() for t in parameters]
        result = self.unet_openvino(parameters)
        result_numpy = result[self.unet_openvino.outputs[0]]
        return torch.tensor(result_numpy, dtype=self.dtype)

In [18]:
%%time

t5.device = 'cpu'
if_I.device = 'cpu'

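# Drop the PyTorch models if they are still attached
# (t5.model was already deleted after the encoder conversion)
for wrapper in (t5, if_I):
    if hasattr(wrapper, 'model'):
        del wrapper.model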
gc.collect();

t5.model = TextEncoder(encoder_ir_path)
if_I.model = UnetFirstStage(first_stage_unet_ir_path)

CPU times: user 2.83 s, sys: 1.87 s, total: 4.7 s
Wall time: 4.03 s


## Run the dream pipeline

In [None]:
prompt = 'a photo of hamster with sign that says "it runs on OpenVINO" styled as a soviet cartoon'
style_prompt = 'in rage meme style'

count = 1

result = dream(
    t5=t5, if_I=if_I, #if_II=if_I,
    prompt=[f'{style_prompt}, {prompt}']*count,
    seed=16,
    if_I_kwargs={
        "guidance_scale": 7.0,
        "sample_timestep_respacing": "smart100",
    },
#     if_II_kwargs={
#         "guidance_scale": 4.0,
#         "sample_timestep_respacing": "smart50",
#     },
)
if_I.show(result['I'], size=3)
# if_I.show(result['II'], size=6)
# if_I.show(result['III'], size=14)

ENCODER CALL


  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]
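
The `prompt` argument passed to `dream` is a list with one entry per image, built by prefixing the style to the prompt and repeating the result `count` times. The sketch below only shows that list construction; it uses `count = 2` instead of the notebook's `1` to make the repetition visible.

```python
prompt = 'a photo of hamster with sign that says "it runs on OpenVINO" styled as a soviet cartoon'
style_prompt = 'in rage meme style'
count = 2  # the notebook uses 1; 2 is chosen here for illustration only

# One identical text prompt per requested image.
prompts = [f'{style_prompt}, {prompt}'] * count
for index, text in enumerate(prompts):
    print(index, text)
```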