# Convert and Optimize RVM (Robust Video Matting) with OpenVINO™

The RVM algorithm is specifically designed for robust human video matting. Unlike existing neural models that process frames as independent images, RVM uses a recurrent neural network to process videos with temporal memory. RVM can perform matting in real time on any video without additional inputs.  
More details about its implementation can be found in the original [paper](https://arxiv.org/abs/2108.11515) and [repository](https://github.com/PeterL1n/RobustVideoMatting).

This tutorial gives step-by-step instructions on how to run and optimize the PyTorch RVM model with OpenVINO. It consists of the following steps:
- Prepare PyTorch model and videos
- Validate original model
- Convert PyTorch model to ONNX
- Convert ONNX model to OpenVINO IR
- Validate converted model

## Get PyTorch model

Generally, PyTorch models represent an instance of the [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) class, initialized by a state dictionary with model weights.

We will use the RVM MobileNetV3 model, which is available in this [repo](https://github.com/PeterL1n/RobustVideoMatting).

In this case, the model creators provide a tool for converting the RVM model to ONNX, so we don't need to perform the export manually.

## Prerequisites

In [None]:
import sys
from pathlib import Path

sys.path.append("../utils")
from notebook_utils import download_file

In [None]:
# Clone RVM repo
if not Path('RobustVideoMatting').exists():
    !git clone https://github.com/PeterL1n/RobustVideoMatting.git
%cd RobustVideoMatting

In [None]:
# Download pretrained model weights
MODEL_LINK = "https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_mobilenetv3.pth"
VIDEO_LINK = "https://drive.google.com/uc?id=1I0v72-hNlK1hm9q1OwyaATUYApXpotS6"
MODEL_DIR = Path("../model/")
VIDEO_DIR = Path("../video/")
MODEL_DIR.mkdir(exist_ok=True)
VIDEO_DIR.mkdir(exist_ok=True)

download_file(MODEL_LINK, directory=MODEL_DIR, show_progress=True)
download_file(VIDEO_LINK, directory=VIDEO_DIR, show_progress=True)

Before running inference, we need to install the required dependencies.

In [None]:
!pip install av==8.0.3 pims==0.5 torchvision==0.10.0

## Check model inference

The `inference.py` script runs PyTorch model inference and saves the result as a video. This takes some time, depending on the performance of your device.  

In [None]:
# visualize the original video

from IPython.display import HTML
import base64
import io
video = io.open('../video/asianboss2.mp4', 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))
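
The embedding pattern used in the cell above (read the file, base64-encode it, place it in a `data:` URI) can be wrapped in a small helper so that it is not retyped for every video. This is a minimal sketch; `video_to_html` is a hypothetical name introduced here for illustration and is not part of the RVM repository or `notebook_utils`.

```python
import base64
from pathlib import Path


def video_to_html(path, mime="video/mp4"):
    """Return an HTML <video> snippet with the file embedded as a base64 data URI."""
    # Read the whole file as bytes and encode it so it can live inside the HTML string.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return (
        '<video alt="test" controls>'
        f'<source src="data:{mime};base64,{encoded}" type="{mime}" />'
        "</video>"
    )
```

In a notebook cell the result could be displayed with `HTML(data=video_to_html('../video/asianboss2.mp4'))`. Because the whole file is embedded in the page, this approach is best suited to short clips.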

The most important parameters:
* `--variant` - type of backbone network
* `--checkpoint` - path to the model weights checkpoint
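
For scripted runs, the same invocation can be assembled as an argument list instead of a shell string. The sketch below uses only the flags that appear in this tutorial's inference command; `build_inference_command` is a hypothetical helper name, not something shipped with RVM.

```python
import shlex
import sys


def build_inference_command(variant, checkpoint, input_source, output_composition,
                            device="cpu", output_type="video"):
    """Assemble the argument list for inference.py using the flags from this tutorial."""
    return [
        sys.executable, "inference.py",
        "--variant", variant,
        "--checkpoint", str(checkpoint),
        "--device", device,
        "--input-source", str(input_source),
        "--output-type", output_type,
        "--output-composition", str(output_composition),
    ]


cmd = build_inference_command(
    variant="mobilenetv3",
    checkpoint="../model/rvm_mobilenetv3.pth",
    input_source="../video/asianboss2.mp4",
    output_composition="../video/rvm_pth.mp4",
)
# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from inside the `RobustVideoMatting` directory would run the script without going through a shell.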

In [None]:
# perform inference
!python inference.py --variant mobilenetv3 --checkpoint ../model/rvm_mobilenetv3.pth --device cpu --input-source "../video/asianboss2.mp4" --output-type video --output-composition "../video/rvm_pth.mp4"

In [None]:
# visualize inference result

video = io.open('../video/rvm_pth.mp4', 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))

## Export to ONNX
To export the model to ONNX format, we will use the export script from the [onnx branch](https://github.com/PeterL1n/RobustVideoMatting/tree/onnx) of the RVM repository. Let's clone it.

In [None]:
if not Path('onnx').exists():
    !git clone https://github.com/PeterL1n/RobustVideoMatting.git  -b onnx onnx

Then we can execute `onnx/export_onnx.py` to convert the PyTorch checkpoint into an ONNX file. In our testing, the export script only worked with float32 precision, so `--precision` should be set to `float32`.

In [None]:
!python ./onnx/export_onnx.py --model-variant mobilenetv3 --checkpoint ../model/rvm_mobilenetv3.pth --precision float32 --opset 12 --device cpu --output ../model/rvm_mobilenetv3.onnx

### Convert ONNX Model to OpenVINO Intermediate Representation (IR)
While ONNX models are directly supported by OpenVINO runtime, it can be useful to convert them to IR format to take advantage of OpenVINO optimization tools and features.

The `mo.convert_model` Python function converts a model using OpenVINO Model Optimizer.  
The converted model can then be saved as OpenVINO IR with `serialize()`:

In [None]:
from openvino.tools import mo
from openvino.runtime import serialize

model = mo.convert_model('../model/rvm_mobilenetv3.onnx')
# serialize model for saving IR
serialize(model, '../model/rvm_mobilenetv3.xml')

## Verify model inference

To test the converted model, we create an inference pipeline similar to `inference.py`. The original PyTorch model expects `torch.Tensor` inputs, which cannot be passed directly to the OpenVINO model, so the frames are converted to NumPy arrays first. In addition, the OpenVINO model requires its inputs, including their dimensions, to be initialized explicitly.

Our pipeline consists of a preprocessing step, inference with the OpenVINO model, and post-processing of the results to produce the matting video.
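
The part of this pipeline that differs most from single-image inference is the recurrent state: the states returned for one frame are passed back in as inputs for the next. The sketch below shows only that control flow. `toy_step` is a made-up stand-in for the model call and does not reproduce what RVM computes.

```python
import numpy as np


def toy_step(src, rec):
    """Stand-in for one inference call: returns an output and four updated states."""
    # Each state becomes a blend of its previous value and the mean of the frame.
    # This is NOT the RVM computation; it only mimics "state in, state out".
    frame_mean = src.mean(dtype=np.float32)
    new_rec = [0.5 * r + 0.5 * frame_mean for r in rec]
    output = src - new_rec[0]
    return output, new_rec


# The recurrent inputs start as 1x1x1x1 zero tensors.
rec = [np.zeros([1, 1, 1, 1], dtype=np.float32) for _ in range(4)]

# Three synthetic "frames" in NCHW layout.
frames = [np.full([1, 3, 4, 4], v, dtype=np.float32) for v in (0.2, 0.4, 0.6)]
for src in frames:
    # States produced for frame t are fed back in for frame t + 1.
    output, rec = toy_step(src, rec)
```

Because each call depends on the states from the previous one, frames have to be processed in order.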

### Load the network

In [None]:
from openvino.runtime import Core

core = Core()
# read converted model
model = core.read_model('../model/rvm_mobilenetv3.xml')
# load model on CPU device
compiled_model = core.compile_model(model, 'CPU')
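
Before building the pipeline, it can help to list the names, shapes and element types of the model inputs, since the input dictionary has to match them. The helper below is a sketch: `describe_ports` is a hypothetical name, and it assumes each port exposes `any_name`, `partial_shape` and `element_type`, as the port objects in `openvino.runtime` do.

```python
def describe_ports(ports):
    """Return (name, shape, element type) string tuples for model inputs or outputs."""
    return [
        (port.any_name, str(port.partial_shape), str(port.element_type))
        for port in ports
    ]


# In the notebook, with `compiled_model` from the cell above:
# for name, shape, dtype in describe_ports(compiled_model.inputs):
#     print(f"{name}: shape={shape}, type={dtype}")
```

The same function can be applied to `compiled_model.outputs` to check the order of the output tensors.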

### Processing

In [None]:
import torch
import os
import av
from torch.utils.data import DataLoader
from torchvision import transforms
from typing import Optional, Tuple
from tqdm.auto import tqdm

from inference_utils import VideoReader, VideoWriter
from openvino.runtime import Core
import numpy as np


def write_numpy(writer, frames):
    writer.stream.width = frames.shape[3]
    writer.stream.height = frames.shape[2]
    frames *= 255
    frames = frames.transpose(0, 2, 3, 1).astype(np.uint8)
    for t in range(frames.shape[0]):
        frame = frames[t]
        frame = av.VideoFrame.from_ndarray(frame, format='rgb24')
        writer.container.mux(writer.stream.encode(frame))


def convert_video(model,
                  input_source: str,
                  input_resize: Optional[Tuple[int, int]] = None,
                  downsample_ratio: Optional[float] = None,
                  output_type: str = 'video',
                  output_composition: Optional[str] = None,
                  output_alpha: Optional[str] = None,
                  output_foreground: Optional[str] = None,
                  output_video_mbps: Optional[float] = None,
                  seq_chunk: int = 1,
                  num_workers: int = 0,
                  progress: bool = True,
                  device: Optional[str] = None,
                  dtype: Optional[torch.dtype] = None):

    # Initialize transform
    if input_resize is not None:
        transform = transforms.Compose(
            [transforms.Resize(input_resize[::-1]),
             transforms.ToTensor()])
    else:
        transform = transforms.ToTensor()

    # Initialize reader
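    # Only video files are supported here; fail early with a clear message otherwise
    if not os.path.isfile(input_source):
        raise FileNotFoundError(f'Input video not found: {input_source}')
    source = VideoReader(input_source, transform)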
    reader = DataLoader(source,
                        batch_size=seq_chunk,
                        pin_memory=True,
                        num_workers=num_workers)

    # Initialize writers
    if output_type == 'video':
        frame_rate = source.frame_rate if isinstance(source,
                                                     VideoReader) else 30
        output_video_mbps = 1 if output_video_mbps is None else output_video_mbps
        if output_composition is not None:
            writer_com = VideoWriter(path=output_composition,
                                     frame_rate=frame_rate,
                                     bit_rate=int(output_video_mbps * 1000000))

    # Inference
    # See https://github.com/PeterL1n/RobustVideoMatting/blob/master/documentation/inference.md
    if (output_composition is not None) and (output_type == 'video'):
        bgr = np.reshape(
            np.array([120, 255, 155], dtype=np.float32) / 255, [1, 3, 1, 1])
    try:
        bar = tqdm(total=len(source), disable=not progress, dynamic_ncols=True)
        # Initial recurrent states; dtype matches the float32 precision of the exported model
        rec = [np.zeros([1, 1, 1, 1], dtype=np.float32)] * 4
        for src in reader:

            if downsample_ratio is None:
                downsample_ratio = np.asarray(
                    [min(512 / max(*src.shape[2:]), 1)], dtype=np.float32)
            src = np.array(src, dtype=np.float32)

            inputs = {
                "src": src,
                "downsample_ratio": downsample_ratio,
                "r1i": rec[0],
                "r2i": rec[1],
                "r3i": rec[2],
                "r4i": rec[3]
            }

            request = model.create_infer_request()
            request.infer(inputs=inputs)
            fgr = request.get_output_tensor(0).data  # foreground, e.g. 1,3,1080,1920
            pha = request.get_output_tensor(1).data  # alpha matte, e.g. 1,1,1080,1920
            rec[0] = request.get_output_tensor(2).data  # updated recurrent states
            rec[1] = request.get_output_tensor(3).data
            rec[2] = request.get_output_tensor(4).data
            rec[3] = request.get_output_tensor(5).data
            if output_composition is not None:
                if output_type == 'video':
                    com = fgr * pha + bgr * (1 - pha)
                write_numpy(writer_com, com)
            bar.update(1)

    finally:
        # Clean up
        if output_composition is not None:
            writer_com.close()
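
The post-processing in `convert_video` and `write_numpy` comes down to two array operations: alpha compositing over a solid background, then conversion from float NCHW to uint8 NHWC. The sketch below runs both on tiny synthetic arrays that stand in for the model outputs. The shapes and values are made up for illustration.

```python
import numpy as np

# Synthetic stand-ins for the model outputs (NCHW, values in [0, 1]).
fgr = np.full([1, 3, 2, 2], 0.8, dtype=np.float32)  # foreground colour
pha = np.zeros([1, 1, 2, 2], dtype=np.float32)      # alpha matte
pha[0, 0, 0, :] = 1.0                               # top row is fully foreground

# Same green background as in convert_video, shaped to broadcast over H and W.
bgr = np.reshape(np.array([120, 255, 155], dtype=np.float32) / 255, [1, 3, 1, 1])

# Where pha is 1 the foreground is kept; where it is 0 the background shows through.
com = fgr * pha + bgr * (1 - pha)

# Convert to the layout used for encoding: NHWC, uint8 in [0, 255].
frames = (com * 255).transpose(0, 2, 3, 1).astype(np.uint8)
```

Note that `astype(np.uint8)` truncates rather than rounds, so values may end up one step below the nearest integer.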

In [None]:
convert_video(compiled_model,
              input_source='../video/asianboss2.mp4',
              output_type='video',
              output_composition='../video/res_ov.mp4',
              device='cpu')
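
The call above leaves `downsample_ratio` unset, so `convert_video` derives it from the frame size so that the longer side is scaled to at most 512 pixels. The same rule as a standalone function (`auto_downsample_ratio` is a name introduced here for illustration):

```python
def auto_downsample_ratio(height, width, target=512):
    """Scale factor that brings the longer side down to `target` pixels, capped at 1."""
    return min(target / max(height, width), 1)


# A 1080p frame is downsampled by 512 / 1920.
print(auto_downsample_ratio(1080, 1920))
# A frame that is already small keeps a ratio of 1.
print(auto_downsample_ratio(360, 480))
```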

Finally, let's look at the inference result of the OpenVINO model.

In [None]:
# visualize results
ov_res_video = io.open('../video/res_ov.mp4', 'r+b').read()
encoded = base64.b64encode(ov_res_video)
HTML(data='''<video alt="test" controls>
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii')))