# Music Generation with Auto-regressive Transformer Model

This module is based on the OpenVINO notebook [Controllable Music Generation with MusicGen and OpenVINO™](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/music-generation)

If you are running this on your own, not as part of a workshop, install packages using `requirements.txt` and run the `setup.py` script before using this notebook.

Start with the `05a_convert_models.ipynb` notebook. It converts the text encoder and audio decoder models to use static input shapes, which is required for running them on the NPU. It also shows different approaches for converting to static shapes.

This notebook then loads the MusicGen pipeline, tests it so you can see how it works, and replaces the original text encoder and audio decoder models with the statically shaped versions so they can run on the NPU. The device settings are hard-coded; feel free to try different devices. The core MusicGen "mg" model was not converted to static shapes, to keep things simple for this module.

### MusicGen

MusicGen is a single-stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text prompt is passed to a text encoder model (T5) to obtain a sequence of hidden-state representations. These hidden states are fed to MusicGen, which predicts discrete audio tokens (audio codes). Finally, the audio tokens are decoded using an audio compression model (EnCodec) to recover the audio waveform.

![pipeline](https://user-images.githubusercontent.com/76463150/260439306-81c81c8d-1f9c-41d0-b881-9491766def8e.png)

[The MusicGen model](https://arxiv.org/abs/2306.05284) does not require a self-supervised semantic representation of the text/audio prompts; it operates over several streams of compressed discrete music representation with efficient token interleaving patterns, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g. hierarchically or upsampling). Unlike prior models addressing music generation, it is able to generate all the codebooks in a single forward pass.
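
As an illustration of token interleaving, the sketch below builds the "delay" pattern described in the MusicGen paper, in which codebook `k` is shifted right by `k` steps so that a single decoding step can emit one token per codebook. The helper name `apply_delay_pattern`, the pad value `-1`, and the toy sizes are assumptions made for this example; the Transformers implementation has its own internal routine.

```python
import numpy as np

def apply_delay_pattern(codes, pad_id=-1):
    """Shift codebook k right by k steps and fill the gaps with pad_id."""
    codes = np.asarray(codes)
    num_codebooks, num_frames = codes.shape
    # The delayed sequence needs (num_codebooks - 1) extra steps.
    delayed = np.full(
        (num_codebooks, num_frames + num_codebooks - 1), pad_id, dtype=codes.dtype
    )
    for k in range(num_codebooks):
        delayed[k, k : k + num_frames] = codes[k]
    return delayed

toy_codes = np.arange(12).reshape(4, 3)  # hypothetical: 4 codebooks, 3 audio frames
delayed_codes = apply_delay_pattern(toy_codes)
print(delayed_codes)
```

With four codebooks the delayed sequence is three steps longer than the number of audio frames, which is one plausible reading of the `+ 3` added to the token count later in this notebook.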

We will use a model implementation from the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library. 

### Imports

In [None]:
from functools import partial
import gc
from pathlib import Path

import numpy as np
import torch
from transformers import MusicgenProcessor, MusicgenForConditionalGeneration

import mg_utils as utils

from torch.jit import TracerWarning
import warnings
# Ignore tracing warnings
warnings.filterwarnings("ignore", category=TracerWarning)

models_dir = Path("./models")
original_model_path = models_dir / "musicgen-small"
processor_path = models_dir / "musicgen-small-processor"
t5_dynamic_ir_path = models_dir / "t5.xml"
t5_static_ir_path = models_dir / "t5_static.xml"
musicgen_0_ir_path = models_dir / "mg_0.xml"
musicgen_ir_path = models_dir / "mg.xml"
audio_decoder_dynamic_ir_path = models_dir / "encodec.xml"
max_prompt_length = 100 # tokens

### Original Pipeline Inference

Text preprocessing prepares the text prompt to be fed into the model; the `processor` object abstracts this step for us. Tokenization is performed under the hood: it assigns token IDs to the words, where each ID is simply the index of an entry in the model vocabulary. The model works on these IDs rather than on the raw text.
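
To make the idea of token IDs concrete, here is a toy sketch with a made-up five-entry vocabulary. The vocabulary, the `toy_tokenize` helper, and the whitespace splitting are assumptions for illustration only; the real tokenizer behind `processor` uses a much larger vocabulary and its own splitting rules.

```python
# Hypothetical toy vocabulary: each entry maps a word to its index (token ID).
toy_vocab = {"<pad>": 0, "<unk>": 1, "pop": 2, "track": 3, "synth": 4}

def toy_tokenize(text, vocab=toy_vocab):
    """Split on whitespace and look up each word; unknown words map to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

toy_ids = toy_tokenize("Pop track with synth")
print(toy_ids)  # [2, 3, 1, 4]
```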

In [None]:
from IPython.display import Audio
import time

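import transformers
from packaging import version

loading_kwargs = {}
# Request the eager attention implementation only when transformers >= 4.40.0
if version.parse(transformers.__version__) >= version.parse("4.40.0"):
    loading_kwargs["attn_implementation"] = "eager"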

# Load the pipeline
model = MusicgenForConditionalGeneration.from_pretrained(original_model_path, torchscript=True, return_dict=False, **loading_kwargs)
processor = MusicgenProcessor.from_pretrained(processor_path)

sample_length = 10  # seconds

frame_rate = model.config.audio_encoder.frame_rate
n_tokens = sample_length * frame_rate + 3
print(f"Each second of output music requires generating {frame_rate} tokens")
sampling_rate = model.config.audio_encoder.sampling_rate
print("Audio sampling rate is", sampling_rate, "Hz")

model.to("cpu")
model.eval();

inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
)
start_time = time.time()
# Test the pipeline using the above prompt. Generate `sample_length` seconds of audio (the model generates `frame_rate` tokens for each second)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=n_tokens)
print(f"Generation time: {time.time() - start_time:.1f} s")
generated_music = audio_values[0].cpu().numpy()

Audio(generated_music, rate=sampling_rate)
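
The relationship between clip length, token count, and sample count used in the cell above can be checked with plain arithmetic. The values 50 and 32000 below are assumed stand-ins for `frame_rate` and `sampling_rate`, and `token_and_sample_counts` is a hypothetical helper; read the real values from `model.config.audio_encoder`.

```python
def token_and_sample_counts(seconds, tokens_per_second, samples_per_second, extra_tokens=3):
    """Return the token budget and the approximate waveform length for a clip."""
    n_tokens_needed = seconds * tokens_per_second + extra_tokens
    n_samples_expected = seconds * samples_per_second
    return n_tokens_needed, n_samples_expected

tokens_needed, samples_expected = token_and_sample_counts(
    10, tokens_per_second=50, samples_per_second=32000
)
print(tokens_needed, samples_expected)  # 503 320000
```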

## Create a spectrogram of the generated music

In [None]:
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

# Compute the Short-Time Fourier Transform (STFT)
frequencies, times, Sxx = spectrogram(generated_music, fs=sampling_rate)

# Squeeze the Sxx array to ensure it's 2D
Sxx = np.squeeze(Sxx)

plt.figure(figsize=(10, 6))
# A small offset avoids log10(0) in silent bins
plt.pcolormesh(times, frequencies, 10 * np.log10(Sxx + 1e-12), shading='gouraud')
plt.colorbar(label='Intensity (dB)')
plt.title('STFT-based Spectrogram')
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [s]')
plt.tight_layout()
plt.show()
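
To see what a spectrogram computes without depending on the generated clip, the following sketch builds a power spectrogram with NumPy alone and checks it on a synthetic 440 Hz tone. The frame size, hop size, Hann window, and helper name `simple_spectrogram` are illustrative assumptions and do not mirror the defaults of `scipy.signal.spectrogram`.

```python
import numpy as np

def simple_spectrogram(signal, sample_rate, frame_size=1024, hop=512):
    """Return (frequencies, times, power) for a 1-D signal using a Hann-windowed FFT."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    # Slice the signal into overlapping frames and apply the window to each one.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_size] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, frame_size // 2 + 1)
    frequencies = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_size / 2) / sample_rate
    return frequencies, times, power.T  # transpose to (frequency, time)

tone_rate = 32000  # Hz, assumed for this example
tone_time = np.arange(tone_rate) / tone_rate  # one second of sample times
tone = np.sin(2 * np.pi * 440.0 * tone_time)
freqs, frame_times, power = simple_spectrogram(tone, tone_rate)
peak_hz = freqs[np.argmax(power.mean(axis=1))]
print(f"Strongest frequency bin: {peak_hz:.1f} Hz")
```

The peak lands in the bin closest to 440 Hz; the bin spacing is `sample_rate / frame_size`, so a longer frame gives finer frequency resolution at the cost of coarser time resolution.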

## Embedding the converted models into the original pipeline
### Adapt OpenVINO models to the original pipeline

Here we create wrapper classes for the OpenVINO models that we want to embed in the original inference pipeline.
Some things to consider when adapting an OV model:
 - Make sure that parameters passed by the original pipeline are forwarded to the compiled OV model properly. Sometimes the OV model uses only some of the input arguments and ignores the rest; sometimes you need to convert an argument to another data type or unwrap data structures such as tuples or dictionaries.
 - Make sure that the wrapper class returns results to the pipeline in the expected format. In the example below you can see how we pack OV model outputs into special classes declared in the HF repo.
 - Pay attention to the method that the original pipeline uses to call the model; it may not be the `forward` method! Refer to `AudioDecoderWrapper` to see how we wrap OV model inference into the `decode` method.

Note that for this notebook, we will defer embedding of the audio_decoder to the generate() function, since the user interface will provide a choice of the length of the output.
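
The considerations listed above can be sketched with a stand-in model. This is a minimal illustration under stated assumptions: `DecodeWrapperSketch`, the `fake_compiled` callable, and the `audio_values` attribute are hypothetical names chosen for the example, and the sketch does not reproduce `utils.AudioDecoderWrapper`, whose source is not shown here.

```python
from types import SimpleNamespace

import numpy as np

class DecodeWrapperSketch:
    """Hypothetical adapter that exposes a compiled model through a `decode` method."""

    def __init__(self, compiled_model, config):
        self.compiled_model = compiled_model  # any callable that takes a NumPy array
        self.config = config  # kept so the pipeline can still read model settings

    def decode(self, audio_codes, audio_scales=None, return_dict=False):
        # Arguments the compiled model does not use are accepted but ignored.
        codes = np.asarray(audio_codes, dtype=np.int64)  # convert to the expected dtype
        outputs = self.compiled_model(codes)  # run inference
        # Pack the raw output into an object with the attribute the caller reads.
        return SimpleNamespace(audio_values=outputs[0])

def fake_compiled(codes):
    """Stand-in for a compiled model: returns a list with one zero-filled waveform."""
    return [np.zeros((1, 1, codes.shape[-1] * 4), dtype=np.float32)]

wrapper = DecodeWrapperSketch(fake_compiled, config=SimpleNamespace(sampling_rate=32000))
result = wrapper.decode([[[1, 2, 3]]])
print(result.audio_values.shape)  # (1, 1, 12)
```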

In [None]:
t5_device = "CPU"
audio_decoder_device = "HETERO:GPU,NPU"

text_encode_ov = utils.TextEncoderWrapper(t5_static_ir_path, model.text_encoder.config, t5_device)

del model.text_encoder
gc.collect()

model.text_encoder = text_encode_ov

## Try out the converted pipeline

We can now infer the pipeline backed by OpenVINO models. Note that with statically-shaped models, we need to explicitly set the padding to our max token length.
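
The effect of padding to a fixed length can be sketched without a tokenizer: every prompt is truncated or right-padded to the same length, so the text encoder always receives one input shape. The helper name `pad_to_static_length`, the pad ID `0`, and the toy token IDs are assumptions for this example.

```python
import numpy as np

def pad_to_static_length(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token IDs to max_length and build the attention mask."""
    ids = list(token_ids)[:max_length]
    n_padding = max_length - len(ids)
    input_ids = np.array([ids + [pad_id] * n_padding], dtype=np.int64)
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = np.array([[1] * len(ids) + [0] * n_padding], dtype=np.int64)
    return input_ids, attention_mask

static_ids, static_mask = pad_to_static_length([2, 3, 1, 4], max_length=8)
print(static_ids.shape, static_mask.tolist())  # (1, 8) [[1, 1, 1, 1, 0, 0, 0, 0]]
```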

The demo app below is built with the [Gradio](https://www.gradio.app/docs/interface) package.

In [None]:
def generate(prompt, output_length):
    global model, processor, max_prompt_length, sampling_rate, frame_rate

    n_tokens_to_generate = output_length * frame_rate + 3
    
    inputs = processor(text=[prompt],
        padding="max_length",
        max_length=max_prompt_length,
        truncation=True,
        return_tensors="pt",
    )

    # Plug in the appropriate audio_decoder model based on the desired output length
    audio_decoder_static_ir_path = f"{models_dir}/encodec_{output_length}s_ir.xml"
    audio_encoder_ov = utils.AudioDecoderWrapper(
        audio_decoder_static_ir_path, 
        model.audio_encoder.config,
        audio_decoder_device,
    )
    del model.audio_encoder
    gc.collect()
    model.audio_encoder = audio_encoder_ov
    
    audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=n_tokens_to_generate, use_cache=True)
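    # Convert the output to the 16-bit integer format expected by the Gradio Audio player,
    # clamping first so that full-scale samples cannot overflow int16
    waveform = (audio_values[0].cpu().squeeze() * 2**15).clamp(-(2**15), 2**15 - 1)
    return (sampling_rate, waveform.numpy().astype(np.int16))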
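
The float-to-int16 conversion at the end of `generate` can be tried in isolation. The sketch below uses a hypothetical helper, `to_pcm16`, and clips before casting as a precaution against samples at or beyond full scale, which would otherwise fall outside the int16 range.

```python
import numpy as np

def to_pcm16(waveform):
    """Scale a float waveform in [-1, 1] to int16, clipping values that would overflow."""
    scaled = np.asarray(waveform, dtype=np.float32) * 2**15
    return np.clip(scaled, -(2**15), 2**15 - 1).astype(np.int16)

pcm = to_pcm16([0.0, 0.5, -1.0, 1.0, 1.2])
print(pcm.tolist())  # [0, 16384, -32768, 32767, 32767]
```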

In [None]:
demo = utils.build_gr_blocks(generate)
demo.launch()