# Virtual Try-On with CatVTON and OpenVINO

### Abstract
Virtual try-on methods based on diffusion models achieve realistic try-on effects but replicate the backbone network as a ReferenceNet or leverage additional image encoders to process condition inputs, resulting in high training and inference costs. [In this work](http://arxiv.org/abs/2407.15886), the authors rethink the necessity of ReferenceNet and image encoders and innovate the interaction between garment and person, proposing CatVTON, a simple and efficient virtual try-on diffusion model.
It facilitates the seamless transfer of in-shop or worn garments of arbitrary categories to target persons by simply
concatenating them in spatial dimensions as inputs. The efficiency of the model is demonstrated in three aspects: 
 1. Lightweight network. Only the original diffusion modules are used, without additional network modules. The text encoder and cross attentions for text injection in the backbone are removed, further reducing the parameters by 167.02M.
 2. Parameter-efficient training. We identified the try-on relevant modules through experiments and achieved high-quality try-on effects by training only 49.57M parameters (∼5.51% of the backbone network’s parameters). 
 3. Simplified inference. CatVTON eliminates all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input, requiring only garment reference, target person image, and mask for the virtual try-on process.

Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results with fewer prerequisites and trainable parameters than baseline methods. Furthermore, CatVTON shows good generalization in in-the-wild scenarios despite using open-source datasets with only 73K samples.


Teaser image from [CatVTON GitHub](https://github.com/Zheng-Chong/CatVTON)
![teaser](https://github.com/Zheng-Chong/CatVTON/blob/edited/resource/img/teaser.jpg?raw=true)

In this tutorial we consider how to convert, optimize and run this model using OpenVINO.


#### Table of contents:

- [Prerequisites](#Prerequisites)
- [Convert the model to OpenVINO IR](#Convert-the-model-to-OpenVINO-IR)
- [Compiling models](#Compiling-models)
- [Interactive inference](#Interactive-inference)


### Installation Instructions

This is a self-contained example that relies solely on its own code.

We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/README.md#-installation-guide).



## Prerequisites
[back to top ⬆️](#Table-of-contents:)

In [31]:
%pip install -q "openvino>=2024.4"
%pip install -q "torch>=2.1" "diffusers>=0.29.1" torchvision opencv_python --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -q fvcore "pillow" "tqdm" "gradio>=4.36" "omegaconf==2.4.0.dev3" av pycocotools cloudpickle scipy accelerate "transformers>=4.27.3"

Note: you may need to restart the kernel to use updated packages.


In [32]:
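import os
import sys
from pathlib import Path

import requests


catvton_path = Path("CatVTON")

# Clone the CatVTON repository once; later runs reuse the local copy.
if not catvton_path.exists():
    exit_code = os.system("git clone https://github.com/Zheng-Chong/CatVTON.git")
    if exit_code != 0:
        raise Exception("Failed to clone the repository!")

# Make the repository's `model` package importable.
sys.path.insert(0, str(catvton_path))

# Download the shared helper module that provides the device widget used later.
r = requests.get(
    url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py",
)
with open("notebook_utils.py", "w") as f:
    f.write(r.text)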

## Convert the model to OpenVINO IR
[back to top ⬆️](#Table-of-contents:)

In [33]:
from pathlib import Path


MODEL_DIR = Path("models")
VAE_ENCODER_PATH = MODEL_DIR / "vae_encoder.xml"
VAE_DECODER_PATH = MODEL_DIR / "vae_decoder.xml"
UNET_PATH = MODEL_DIR / "unet.xml"

In [34]:
import os

from diffusers.image_processor import VaeImageProcessor
from huggingface_hub import snapshot_download
import yaml

from model.cloth_masker import AutoMasker
from model.pipeline import CatVTONPipeline


resume_path = "zhengchong/CatVTON"
base_model_path = "booksforcharlie/stable-diffusion-inpainting"
repo_path = snapshot_download(repo_id=resume_path)
output_dir = "output"


pipeline = CatVTONPipeline(base_ckpt=base_model_path, attn_ckpt=repo_path, attn_ckpt_version="mix", use_tf32=True, device="cpu")

# fix default config to use cpu
with open(f"{repo_path}/DensePose/densepose_rcnn_R_50_FPN_s1x.yaml", "r") as fp:
    data = yaml.safe_load(fp)

data["MODEL"].update({"DEVICE": "cpu"})

with open(f"{repo_path}/DensePose/densepose_rcnn_R_50_FPN_s1x.yaml", "w") as fp:
    yaml.safe_dump(data, fp)


mask_processor = VaeImageProcessor(vae_scale_factor=8, do_normalize=False, do_binarize=True, do_convert_grayscale=True)
automasker = AutoMasker(
    densepose_ckpt=os.path.join(repo_path, "DensePose"),
    schp_ckpt=os.path.join(repo_path, "SCHP"),
    device="cpu",
)

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: 
```
pip install accelerate
```
.
An error occurred while trying to fetch booksforcharlie/stable-diffusion-inpainting: booksforcharlie/stable-diffusion-inpainting does not appear to have a file named diffusion_pytorch_model.safetensors.
Defaulting to unsafe serialization. Pass `allow_pickle=False` to raise an error instead.
  state_dict = torch.load(ckpt_path, map_location='cpu')['state_dict']


Let's define the conversion function for PyTorch modules. We use the `ov.convert_model` function to obtain an OpenVINO Intermediate Representation (IR) object and the `ov.save_model` function to save it as an XML file.

In [35]:
import torch
import openvino as ov


def convert(model: torch.nn.Module, xml_path: str, example_input):
    xml_path = Path(xml_path)
    if not xml_path.exists():
        xml_path.parent.mkdir(parents=True, exist_ok=True)
        model.eval()
        with torch.no_grad():
            converted_model = ov.convert_model(model, example_input=example_input)
        ov.save_model(converted_model, xml_path)

        # cleanup memory
        torch._C._jit_clear_class_registry()
        torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()
        torch.jit._state._clear_class_state()
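
`ov.save_model` writes the IR as a pair of files: the `.xml` topology and a `.bin` weights file next to it. The `convert` function above only checks for the `.xml` file. A minimal sketch of a stricter check follows; `ir_is_complete` is a hypothetical helper name used for illustration and is not part of OpenVINO or of this notebook.

```python
from pathlib import Path


def ir_is_complete(xml_path) -> bool:
    """Return True only if both the .xml file and its sibling .bin file exist.

    Hypothetical helper for illustration; not part of OpenVINO.
    """
    xml_path = Path(xml_path)
    bin_path = xml_path.with_suffix(".bin")  # weights are stored next to the .xml
    return xml_path.is_file() and bin_path.is_file()
```

Using this in place of `xml_path.exists()` would trigger a fresh conversion when the weights file is missing, for example after a partially copied `models` directory.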

In [36]:
class VaeEncoder(torch.nn.Module):
    def __init__(self, vae):
        super().__init__()
        self.vae = vae

    def forward(self, x):
        return self.vae.encode(x).latent_dist.sample()


convert(VaeEncoder(pipeline.vae), VAE_ENCODER_PATH, torch.zeros(1, 3, 1024, 768))

  assert hidden_states.shape[1] == self.channels
  assert hidden_states.shape[1] == self.channels
	%2495 : Float(1, 4, 128, 96, strides=[49152, 12288, 96, 1], requires_grad=0, device=cpu) = aten::randn(%2489, %2490, %2491, %2492, %2493, %2494) # /home/maleksandr/test_notebooks/pixtral/openvino_notebooks/notebooks/pixtral/venv/lib/python3.10/site-packages/diffusers/utils/torch_utils.py:81:0
This may cause errors in trace checking. To disable trace checking, pass check_trace=False to torch.jit.trace()
  _check_trace(
Tensor-likes are not close!

Mismatched elements: 27151 / 49152 (55.2%)
Greatest absolute difference: 0.0009250640869140625 at index (0, 2, 0, 82) (up to 1e-05 allowed)
Greatest relative difference: 0.0008534972047966543 at index (0, 2, 0, 82) (up to 1e-05 allowed)
  _check_trace(
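
The trace-check mismatch above is most likely harmless: the traced graph contains an `aten::randn` call because `latent_dist.sample()` draws fresh noise on every run, so two executions of the same module are not expected to match exactly. The toy sketch below illustrates the effect in plain Python. `ToyGaussian` is a made-up one-dimensional stand-in, not the diffusers class, and the mean and log-variance values are invented for the example.

```python
import math
import random


class ToyGaussian:
    """Toy 1-D stand-in for a diagonal Gaussian latent distribution."""

    def __init__(self, mean, logvar):
        self.mean = mean
        self.std = math.exp(0.5 * logvar)

    def sample(self, rng):
        # Reparameterised draw: mean + std * eps, with eps ~ N(0, 1).
        return self.mean + self.std * rng.gauss(0.0, 1.0)

    def mode(self):
        # The mode of a Gaussian is its mean, so no randomness is involved.
        return self.mean


dist = ToyGaussian(mean=0.5, logvar=-14.0)  # tiny variance; values are made up
rng = random.Random(0)
a, b = dist.sample(rng), dist.sample(rng)
print(a != b, abs(a - b) < 0.01)  # two draws differ, but only slightly
print(dist.mode() == dist.mode())  # the mode is reproducible
```

With a small standard deviation the two draws stay close to each other, which is consistent with the small absolute differences reported in the warning.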


In [37]:
class VaeDecoder(torch.nn.Module):
    def __init__(self, vae):
        super().__init__()
        self.vae = vae

    def forward(self, latents):
        return self.vae.decode(latents)


convert(VaeDecoder(pipeline.vae), VAE_DECODER_PATH, torch.zeros(1, 4, 128, 96))

  assert hidden_states.shape[1] == self.channels
  if hidden_states.shape[0] >= 64:


In [38]:
class UNetWrapper(torch.nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample=None, timestep=None, encoder_hidden_states=None, return_dict=None):
        result = self.unet(sample=sample, timestep=timestep, encoder_hidden_states=encoder_hidden_states, return_dict=False)
        return result


inpainting_latent_model_input = torch.zeros(2, 9, 256, 96)
timestep = torch.tensor(0)
encoder_hidden_states = torch.zeros(2, 1, 768)
example_input = (inpainting_latent_model_input, timestep, encoder_hidden_states)

convert(UNetWrapper(pipeline.unet), UNET_PATH, example_input)

  if dim % default_overall_up_factor != 0:
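
The example input shape `(2, 9, 256, 96)` follows from the numbers used in the cells above. The sketch below reproduces that arithmetic in plain Python. It assumes the usual Stable Diffusion inpainting channel layout (4 noisy-latent channels, 1 mask channel, 4 masked-image latent channels) and that person and garment latents are stacked along the height axis, which is consistent with the 256×96 example input but is not stated explicitly in this notebook. The batch size of 2 is simply taken from the example input. `unet_input_shape` is a hypothetical helper.

```python
def unet_input_shape(height=1024, width=768, vae_scale_factor=8, batch=2):
    """Compute the UNet sample shape for a person/garment pair (illustrative)."""
    latent_h = height // vae_scale_factor  # 1024 / 8 = 128
    latent_w = width // vae_scale_factor   # 768 / 8 = 96
    # Assumption: person and garment latents are concatenated along height.
    concat_h = 2 * latent_h
    # Assumed inpainting layout: noisy latents + mask + masked-image latents.
    channels = 4 + 1 + 4
    return (batch, channels, concat_h, latent_w)


print(unet_input_shape())  # (2, 9, 256, 96)
```

The same arithmetic explains the `(1, 4, 128, 96)` example input of the VAE decoder: one un-concatenated latent with 4 channels.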


## Compiling models
[back to top ⬆️](#Table-of-contents:)

Select the device for running inference with OpenVINO from the dropdown list.

In [39]:
core = ov.Core()

from notebook_utils import device_widget

device = device_widget()

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

In [40]:
compiled_unet = core.compile_model(UNET_PATH, device.value)
compiled_vae_encoder = core.compile_model(VAE_ENCODER_PATH, device.value)
compiled_vae_decoder = core.compile_model(VAE_DECODER_PATH, device.value)

Let's create callable wrapper classes for the compiled models so that they can be used by the original pipeline. Note that all wrapper classes return `torch.Tensor`s instead of `np.array`s. We then insert the wrapper instances into the pipeline.

In [41]:
from collections import namedtuple


class VAEWrapper(torch.nn.Module):
    def __init__(self, vae_encoder, vae_decoder, config):
        super().__init__()
        self.vae_encoder = vae_encoder
        self.vae_decoder = vae_decoder
        self.device = "cpu"
        self.dtype = torch.float32
        self.config = config

    def encode(self, pixel_values):
        outs = self.vae_encoder(pixel_values)
        outs = torch.from_numpy(outs[0])
        result = namedtuple("VAE", "latent_dist")(namedtuple("Sample", "sample")(lambda: outs))
        return result

    def decode(self, latents):
        outs = self.vae_decoder(latents)
        outs = namedtuple("VAE", "sample")(torch.from_numpy(outs[0]))
        return outs


class ConvUnetWrapper(torch.nn.Module):
    def __init__(self, unet):
        super().__init__()
        self.unet = unet

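    def forward(self, sample, timestep, encoder_hidden_states=None, **kwargs):
        # `encoder_hidden_states` and any extra kwargs are accepted for interface
        # compatibility but are not forwarded to the compiled model.
        outputs = self.unet(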
            {
                "sample": sample,
                "timestep": timestep,
            },
        )

        return [torch.from_numpy(outputs[0])]


pipeline.vae = VAEWrapper(compiled_vae_encoder, compiled_vae_decoder, pipeline.vae.config)
pipeline.unet = ConvUnetWrapper(compiled_unet)
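
The wrappers above rely on duck typing: `VAEWrapper.encode` only has to return an object on which `.latent_dist.sample()` can be called, mirroring the call chain used in `VaeEncoder.forward`. Because the converted encoder already performs the sampling, `sample` just hands back the precomputed result. A minimal, torch-free sketch of the same trick, with a plain list standing in for the latent tensor and `fake_encode` as a hypothetical name:

```python
from collections import namedtuple

# Same nested-namedtuple structure as in VAEWrapper.encode.
EncoderOutput = namedtuple("EncoderOutput", "latent_dist")
LatentDist = namedtuple("LatentDist", "sample")


def fake_encode(latents):
    """Wrap precomputed latents so that .latent_dist.sample() returns them."""
    # `sample` is stored as a callable because the caller invokes it.
    return EncoderOutput(LatentDist(lambda: latents))


latents = [0.1, 0.2, 0.3]  # stand-in for a tensor
result = fake_encode(latents)
assert result.latent_dist.sample() is latents
```

The design keeps the original pipeline code untouched: only the objects assigned to `pipeline.vae` and `pipeline.unet` change.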

## Interactive inference
[back to top ⬆️](#Table-of-contents:)

Run the cell below to launch the interactive Gradio demo with the OpenVINO-backed pipeline.

In [None]:
from gradio_helper import make_demo

demo = make_demo(pipeline, mask_processor, automasker, output_dir)
try:
    demo.launch(debug=True)
except Exception:
    demo.launch(debug=True, share=True)

* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
100%|███████████████████████████████████████████████████████████████████████████████████| 50/50 [28:13<00:00, 33.88s/it]
