# Convert models to OpenVINO Intermediate Representation (IR) format

The OpenVINO model conversion API enables direct conversion of PyTorch models. We will use the `openvino.convert_model` method to obtain OpenVINO IR versions of the models. The method requires a model object and an example input for model tracing. Under the hood, the converter uses the PyTorch JIT compiler to build a frozen model graph.

The pipeline consists of three important parts:

 - The [T5 text encoder](https://huggingface.co/google/flan-t5-base), which translates user prompts into vectors in a latent space that the next model, the MusicGen decoder, can use.
 - The [MusicGen Language Model](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenForCausalLM), which auto-regressively generates audio tokens (codes).
 - The [EnCodec model](https://huggingface.co/facebook/encodec_24khz), which decodes the audio waveform from the audio tokens predicted by the MusicGen Language Model. We will use only its decoder part.

### Dynamic shapes
The text encoder accepts prompts of varying length. The MusicGen model is auto-regressive: it takes in the encoded prompt along with the previously generated tokens, a sequence that grows with each successive pass. Both are examples of dynamic shapes. The NPU currently does not support dynamic shapes, and you may also want to convert dynamic shapes to static ones for performance reasons. So as we convert these models, we will also show strategies for converting dynamic shapes to static ones.
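
To make the idea concrete, the sketch below marks which dimensions of an input vary across auto-regressive passes. It is purely illustrative: `infer_dynamic_dims` is a hypothetical helper, not part of OpenVINO, and the `(batch size, sequence length)` layout of the shapes is an assumption made for this example.

```python
def infer_dynamic_dims(observed_shapes):
    """Return a shape template where dimensions that vary are marked '?'.

    Hypothetical helper for illustration only; all shapes must have the same rank.
    """
    first = observed_shapes[0]
    template = []
    for axis, size in enumerate(first):
        varies = any(shape[axis] != size for shape in observed_shapes)
        template.append("?" if varies else size)
    return template


# Auto-regressive decoding: the sequence axis grows by one token per pass,
# while the batch axis stays at 1 (assumed layout: batch size, sequence length).
shapes_per_pass = [(1, step + 1) for step in range(4)]
print(infer_dynamic_dims(shapes_per_pass))  # [1, '?']
```

A dimension that keeps the same size on every pass can be fixed in a static model; a dimension that varies needs one of the strategies discussed later in this section.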

Let us convert each model step by step.

In [None]:
from pathlib import Path
import openvino as ov

import torch
from torch.jit import TracerWarning
from transformers import AutoProcessor, MusicgenProcessor, MusicgenForConditionalGeneration
import gc
import warnings
# Ignore tracing warnings
warnings.filterwarnings("ignore", category=TracerWarning)

models_dir = Path("./models")
original_model_dir = models_dir / "musicgen-small"
processor_dir = models_dir / "musicgen-small-processor"
t5_dynamic_ir_path = models_dir / "t5.xml"
t5_static_ir_path = models_dir / "t5_static.xml"
musicgen_0_ir_path = models_dir / "mg_0.xml"
musicgen_ir_path = models_dir / "mg.xml"
audio_decoder_dynamic_ir_path = models_dir / "encodec.xml"

core = ov.Core()

loading_kwargs = {}
# Needed when the transformers version is >= 4.40.0; set unconditionally here
loading_kwargs["attn_implementation"] = "eager"
  
# Load the pipeline
model = MusicgenForConditionalGeneration.from_pretrained(original_model_dir, torchscript=True, return_dict=False, **loading_kwargs)
# tokenizer = AutoTokenizer.from_pretrained("t5-small")  # or use another specific tokenizer if necessary
processor = MusicgenProcessor.from_pretrained(processor_dir)

Let's test the pipeline. 

In [None]:
from IPython.display import Audio

sample_length = 8  # seconds

frame_rate = model.config.audio_encoder.frame_rate
n_tokens = sample_length * frame_rate + 3
print(f"Each second of output music requires generating {frame_rate} tokens")
sampling_rate = model.config.audio_encoder.sampling_rate
print("Sampling rate is", sampling_rate, "Hz")

model.to("cpu")
model.eval();

inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
)
# Test the pipeline using the above prompt. Generate 8 seconds (the model generates 50 tokens for each second)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=n_tokens)

Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)

### 1. Convert Text Encoder

The text encoder converts the input prompt, such as "90s rock song with loud guitars and heavy drums", into embeddings that can be fed to the next model. Typically, such an encoder is transformer-based and maps a sequence of input tokens to a sequence of text embeddings.

The input for the text encoder consists of a tensor `input_ids`, which contains the token indices produced by the tokenizer, and an `attention_mask`. We will ignore the `attention_mask`: we process one prompt at a time, so this vector consists only of ones.

We use the OpenVINO model conversion API (`ov.convert_model`) below to convert the PyTorch model to the OpenVINO Intermediate Representation (IR) format. First we convert with dynamically shaped inputs, then read that model back to learn which input dimensions are dynamic.

In [None]:
t5_ov = ov.convert_model(model.text_encoder, example_input={"input_ids": inputs["input_ids"]})
ov.save_model(t5_ov, t5_dynamic_ir_path)

t5_ov = core.read_model(t5_dynamic_ir_path)
print(f"Input shapes: {t5_ov.inputs}")

The `[?,?]` indicates that both dimensions are dynamic. 

In [None]:
print("input_ids shape = ", inputs.input_ids.shape)
inputs

So the shapes of both inputs correspond to (batch size, sequence length), and both dimensions are dynamic. A couple of common approaches to converting models with dynamically shaped inputs to statically shaped ones are:
1. Set a fixed maximum size and add padding. The processor already does this when `padding=True` and a batch contains prompts of different lengths: shorter prompts are padded with 0's to match the longest one. We can take this approach with our sequence length. Let's set it to 100 tokens. Anything under 100 tokens will be padded with 0's; inputs longer than 100 tokens would have to be truncated.
2. Create multiple models with different statically shaped inputs. This approach might be more practical for the batch size dimension, where we could create different models for different batch sizes. In this case, we will just create one model with a batch size of 1.
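
The first approach can be sketched in plain Python. This is a minimal illustration, not the processor's actual implementation: `pad_or_truncate` is a hypothetical helper, the padding id of 0 follows the padding described above, and cutting off over-long prompts is just one possible policy (a real tokenizer would also need to keep its end-of-sequence token).

```python
MAX_LEN = 100  # static sequence length chosen in the text
PAD_ID = 0     # assumption: padding token id is 0, as in the padding described above


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (ids, mask), each with exactly max_len entries.

    Hypothetical helper: shorter inputs are right-padded with pad_id,
    longer inputs are cut off at max_len. The mask is 1 for real tokens
    and 0 for padding.
    """
    ids = list(token_ids)[:max_len]
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Made-up token ids, used only to show the resulting lengths.
ids, mask = pad_or_truncate([3, 17, 42, 1])
print(len(ids), sum(mask))  # 100 4
```

The mask is returned only to show which positions hold real tokens; the converted text encoder in this notebook takes `input_ids` alone.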

In [None]:
t5_input_layer = t5_ov.input(0)
t5_output_layer = t5_ov.output(0)
t5_ov.reshape({t5_input_layer.any_name: ov.PartialShape([1, 100])})
t5_ov.validate_nodes_and_infer_types()
ov.save_model(t5_ov, t5_static_ir_path)
print(f"Static input shapes: {t5_ov.inputs}")
del t5_ov
gc.collect()

### 2. Convert MusicGen Language Model

We skip this step in this workshop. To learn more about converting this model for OpenVINO, see the OpenVINO notebook [Controllable Music Generation with MusicGen and OpenVINO](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/music-generation).

### 3. Convert Audio Decoder

The audio decoder, which is part of the EnCodec model, recovers the audio waveform from the audio tokens predicted by the MusicGen decoder. To learn more about the model, please refer to the corresponding [OpenVINO example](../encodec-audio-compression).

In [None]:
class AudioDecoder(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, output_ids):
        return self.model.decode(output_ids, [None])

audio_decoder_input = {"output_ids": torch.ones((1, 1, 4, n_tokens - 3), dtype=torch.int64)}

with torch.no_grad():
    audio_decoder_ov = ov.convert_model(AudioDecoder(model.audio_encoder), example_input=audio_decoder_input)
ov.save_model(audio_decoder_ov, audio_decoder_dynamic_ir_path)

print(f"Audio Decoder Inputs:\n{audio_decoder_ov.inputs}")
shapes = audio_decoder_input["output_ids"].shape
print(f"output_ids shape = {shapes}")

All the input dimensions are dynamic. Let's address each:
* Batch size. We can set this to 1.
* Number of channels. Our model is mono, so we can also set this to 1.
* Number of codebooks. For our model, this is 4.
* Sequence length (in tokens). This will vary. Remember, this model generates 50 tokens per second of audio. The two approaches we discussed when converting the T5 model were to set a maximum length and apply padding, or to generate multiple versions of the model. Applying padding here is not practical, since we would have to generate extra silence and then trim it. Instead, we will generate multiple models, for outputs of 5s, 10s, and 20s.
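
The per-duration variants described above can be summarized as a small lookup. This is a sketch under assumptions: `decoder_variant` is a hypothetical helper, the 50 tokens per second and the 4 codebooks come from the text, and the file naming mirrors the conversion cell in this section.

```python
FRAME_RATE = 50                    # tokens per second of audio, as stated in the text
NUM_CODEBOOKS = 4                  # codebooks used by this model, as stated in the text
SUPPORTED_LENGTHS_S = (5, 10, 20)  # durations we create static decoders for


def decoder_variant(length_s):
    """Return the IR path and static input shape for one supported duration."""
    if length_s not in SUPPORTED_LENGTHS_S:
        raise ValueError(
            f"No static decoder for {length_s}s; choose one of {SUPPORTED_LENGTHS_S}"
        )
    n_tokens = length_s * FRAME_RATE
    return {
        "ir_path": f"models/encodec_{length_s}s_ir.xml",
        # (batch size, channels, codebooks, sequence length in tokens)
        "shape": [1, 1, NUM_CODEBOOKS, n_tokens],
    }


for length_s in SUPPORTED_LENGTHS_S:
    print(decoder_variant(length_s))
```

Raising an error for unsupported durations keeps the trade-off visible: each static model serves exactly one output length.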

In [None]:
audio_decoder_input_layer = audio_decoder_ov.input(0)

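# Save one statically shaped decoder per supported output duration (in seconds)
for length in [5, 10, 20]:
    n_tokens = length * frame_rate  # sequence length in tokens for this duration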
    audio_decoder_ov.reshape({audio_decoder_input_layer.any_name: ov.PartialShape([1, 1, 4, n_tokens])})
    audio_decoder_ov.validate_nodes_and_infer_types()
    ir_path = f"{models_dir}/encodec_{length}s_ir.xml"
    print(f"Saving to {ir_path}")
    ov.save_model(audio_decoder_ov, ir_path)

del audio_decoder_ov
gc.collect()