# Audio-language assistant with Qwen2Audio and OpenVINO

Qwen2-Audio is a new series of Qwen large audio-language models. It accepts various audio signal inputs and can either analyze the audio or respond directly in text to spoken instructions. The model supports two distinct audio interaction modes:
* **voice chat**: users can freely engage in voice interactions with Qwen2-Audio without text input;
* **audio analysis**: users can provide audio together with text instructions for analysis during the interaction.
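The two modes differ only in what the user turn contains. The sketch below builds both kinds of turn in the message format used later in this tutorial; `build_user_turn` is a hypothetical helper and the URL is a placeholder, so treat it as an illustration rather than part of the model's API.

```python
from typing import Optional


def build_user_turn(audio_url: str, instruction: Optional[str] = None) -> dict:
    """Build one user message (hypothetical helper, for illustration only).

    Audio alone is a voice-chat request; audio plus text is an
    audio-analysis request.
    """
    content = [{"type": "audio", "audio_url": audio_url}]
    if instruction:
        content.append({"type": "text", "text": instruction})
    return {"role": "user", "content": content}


# Placeholder URL, not a real audio file.
clip = "https://example.com/clip.wav"

voice_chat = [build_user_turn(clip)]
audio_analysis = [build_user_turn(clip, "What does the person say?")]

print([part["type"] for part in voice_chat[0]["content"]])
print([part["type"] for part in audio_analysis[0]["content"]])
```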

More details about the model can be found in the [model card](https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct), [blog](https://qwenlm.github.io/blog/qwen2-audio/), [original repository](https://github.com/QwenLM/Qwen2-Audio) and [technical report](https://www.arxiv.org/abs/2407.10759).

In this tutorial we show how to convert and optimize the Qwen2-Audio model to create a multimodal chatbot. We also demonstrate how to apply the stateful transformation to the LLM part, along with model optimization techniques such as weight compression using [NNCF](https://github.com/openvinotoolkit/nncf).
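The idea behind the stateful transformation is that the key-value cache of the language model stops being an explicit input and output of every decoding step and is kept inside the model as internal state. The toy sketch below shows the difference in plain Python; `stateless_step` and `StatefulDecoder` are hypothetical names, and the "attention" is only a sum, so this illustrates the calling pattern and not the OpenVINO implementation.

```python
from typing import List, Tuple


def stateless_step(token: int, cache: List[int]) -> Tuple[int, List[int]]:
    """Toy decoder step: the caller passes the cache in and gets a new one back."""
    new_cache = cache + [token]
    # Stand-in for attention over all cached tokens.
    return sum(new_cache), new_cache


class StatefulDecoder:
    """Toy decoder that keeps the cache as internal state between calls."""

    def __init__(self) -> None:
        self._cache: List[int] = []

    def step(self, token: int) -> int:
        self._cache.append(token)
        return sum(self._cache)

    def reset(self) -> None:
        """Clear the state before starting a new, unrelated sequence."""
        self._cache.clear()


tokens = [3, 1, 4]

# Stateless: the cache travels through the caller on every step.
cache: List[int] = []
stateless_outputs = []
for t in tokens:
    out, cache = stateless_step(t, cache)
    stateless_outputs.append(out)

# Stateful: the caller only passes the new token.
decoder = StatefulDecoder()
stateful_outputs = [decoder.step(t) for t in tokens]

print(stateless_outputs == stateful_outputs)
```

Both variants compute the same values; the stateful one simply avoids moving the growing cache across the model boundary on every step.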

## Prerequisites

In [None]:
%pip install -q "git+https://github.com/huggingface/transformers.git" "torch>=2.1" "librosa"  "gradio>=4.36" "modelscope-studio>=0.4.2" --extra-index-url https://download.pytorch.org/whl/cpu
%pip install -qU "openvino>=2024.3.0" "nncf>=2.12.0"

In [2]:
from pathlib import Path
import requests

# Download the helper modules once; Path.write_text closes the file for us.
if not Path("ov_qwen2_audio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/qwen2-audio/ov_qwen2_audio_helper.py")
    Path("ov_qwen2_audio_helper.py").write_text(r.text)

if not Path("notebook_utils.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/notebook_utils.py")
    Path("notebook_utils.py").write_text(r.text)
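Note that `requests.get` does not raise on HTTP error statuses by default, so the cell above would save an error page as a `.py` file if a URL were wrong. A minimal sketch of a more defensive download step follows, assuming a hypothetical helper named `fetch_if_missing` built on the standard library, whose `urlopen` raises on responses such as 404. The demo reads from a local `file://` URL so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_if_missing(url: str, dest: Path) -> Path:
    """Download `url` to `dest` unless `dest` already exists (hypothetical helper).

    urlopen raises urllib.error.HTTPError for responses such as 404,
    so a failed download is never written to disk.
    """
    if dest.exists():
        return dest
    with urllib.request.urlopen(url) as response:
        data = response.read()
    dest.write_bytes(data)
    return dest


# Self-contained demo using a local file:// URL instead of the network.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.py"
    source.write_text("print('hello')\n")
    target = fetch_if_missing(source.as_uri(), Path(tmp) / "copy.py")
    copied_text = target.read_text()

print(copied_text)
```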

## Convert and Optimize model

In [3]:
pt_model_id = "Qwen/Qwen2-Audio-7B-Instruct"

model_dir = Path(pt_model_id.split("/")[-1])

In [4]:
from ov_qwen2_audio_helper import convert_qwen2audio_model

# Uncomment the line below to see the model conversion code
# convert_qwen2audio_model??

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino




In [5]:
import nncf

compression_configuration = {
    "mode": nncf.CompressWeightsMode.INT4_ASYM,
    "group_size": 128,
    "ratio": 1.0,
}

convert_qwen2audio_model(pt_model_id, model_dir, compression_configuration)

✅ Qwen/Qwen2-Audio-7B-Instruct model already converted. You can find results in Qwen2-Audio-7B-Instruct
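In the configuration above, `INT4_ASYM` selects asymmetric 4-bit quantization, `group_size` is the number of weights that share one scale and zero point, and `ratio=1.0` asks NNCF to compress all eligible layers to 4 bits. The sketch below shows the underlying arithmetic on a tiny list of weights. It is a simplified illustration, not NNCF's implementation: the function names are hypothetical, and the storage estimate assumes a 16-bit scale and a 4-bit zero point per group.

```python
from typing import List, Tuple

LEVELS = 15  # 4 bits give integer codes 0..15

Group = Tuple[List[int], float, int]  # (codes, scale, zero_point)


def quantize_group(weights: List[float]) -> Group:
    """Asymmetric 4-bit quantization of one group of weights."""
    # Include 0.0 in the range so the zero point is itself a valid 4-bit code.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / LEVELS if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [min(LEVELS, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(group: Group) -> List[float]:
    codes, scale, zero_point = group
    return [(c - zero_point) * scale for c in codes]


def compress(weights: List[float], group_size: int) -> List[Group]:
    """Quantize consecutive groups of `group_size` weights independently."""
    return [quantize_group(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]


def decompress(groups: List[Group]) -> List[float]:
    return [w for g in groups for w in dequantize_group(g)]


def bits_per_weight(group_size: int, scale_bits: int = 16, zp_bits: int = 4) -> float:
    """Average storage per weight, assuming one scale and zero point per group."""
    return 4 + (scale_bits + zp_bits) / group_size


weights = [0.12, -0.40, 0.33, 0.05, 1.50, -2.00, 0.75, 0.10]
groups = compress(weights, group_size=4)  # the tutorial uses 128; 4 keeps the demo small
restored = decompress(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"max round-trip error: {max_error:.4f}")
print(f"bits per weight at group_size=128: {bits_per_weight(128)}")
```

Smaller groups track the local weight range more closely and reduce the round-trip error, at the cost of storing more scales and zero points.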


## Prepare model inference pipeline

In [6]:
from ov_qwen2_audio_helper import OVQwen2AudioForConditionalGeneration

# Uncomment the line below to see the model inference class code
# OVQwen2AudioForConditionalGeneration??

In [7]:
from notebook_utils import device_widget

device = device_widget(default="AUTO", exclude=["NPU"])

device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
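The dropdown lists the devices OpenVINO can see on this machine, minus the excluded ones, plus the `AUTO` virtual device. A rough sketch of that selection logic is shown below; `device_options` is a hypothetical function and not the actual implementation of `device_widget`. On a real machine the device list would come from `openvino.Core().available_devices`; here it is hard-coded so the snippet runs without OpenVINO installed.

```python
from typing import Iterable, List, Sequence


def device_options(available: Iterable[str], exclude: Sequence[str] = ()) -> List[str]:
    """Return selectable devices: available ones minus exclusions, plus AUTO."""
    options = [d for d in available if d not in exclude]
    return options + ["AUTO"]


# On a real machine the list would come from OpenVINO:
#   import openvino as ov
#   available = ov.Core().available_devices
available = ["CPU", "NPU"]  # hard-coded stand-in so the sketch runs anywhere

options = device_options(available, exclude=["NPU"])
print(options)  # ['CPU', 'AUTO']
```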

In [8]:
ov_model = OVQwen2AudioForConditionalGeneration(model_dir, device.value)

## Run model inference

In [9]:
from transformers import AutoProcessor, TextStreamer
import librosa
import IPython.display as ipd


processor = AutoProcessor.from_pretrained(model_dir)

audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"
audio_file = Path(audio_url.split("/")[-1])

if not audio_file.exists():
    r = requests.get(audio_url)
    with audio_file.open("wb") as f:
        f.write(r.content)

question = "What does the person say?"

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": audio_url},
        {"type": "text", "text": question},
    ]},
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = [librosa.load(audio_file, sr=processor.feature_extractor.sampling_rate)[0]]

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
print("Question:")
print(question)
display(ipd.Audio(audio_file))
print("Answer:")

generate_ids = ov_model.generate(**inputs, max_new_tokens=50, streamer=TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True))

It is strongly recommended to pass the `sampling_rate` argument to this function. Failing to do so can result in silent errors that might be hard to debug.


Question:
What does the person say?


Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.


Answer:
The person says: 'Mister Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'
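In the cell above, `apply_chat_template` turns the `conversation` list into a single prompt string, and `add_generation_prompt=True` appends the opening of an assistant turn so that the model continues with its answer. The sketch below approximates that rendering in plain Python. The special-token strings follow the ChatML convention used by Qwen models and are assumptions here; the authoritative template is the one stored with the processor, and `render_chat` is a hypothetical function.

```python
from typing import Dict, List


def render_chat(messages: List[Dict], add_generation_prompt: bool = True) -> str:
    """Approximate a ChatML-style prompt with numbered audio placeholders."""
    parts = []
    audio_count = 0
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content)
        else:
            for item in content:
                if item.get("type") == "audio":
                    # Assumed placeholder; the processor later expands it
                    # to match the encoded audio features.
                    audio_count += 1
                    parts.append(f"Audio {audio_count}: <|audio_bos|><|AUDIO|><|audio_eos|>\n")
                elif item.get("type") == "text":
                    parts.append(item["text"])
        parts.append("<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so generation starts with the answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://example.com/clip.flac"},  # placeholder URL
        {"type": "text", "text": "What does the person say?"},
    ]},
]

prompt = render_chat(demo_conversation)
print(prompt)
```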


## Interactive Demo

In [10]:
if not Path("gradio_helper.py").exists():
    r = requests.get(url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/notebooks/qwen2-audio/gradio_helper.py")
    Path("gradio_helper.py").write_text(r.text)

In [None]:
from gradio_helper import make_demo


demo = make_demo(ov_model, processor)

try:
    demo.launch(debug=True)
except Exception:
    demo.launch(debug=True, share=True)
# If you are launching remotely, specify server_name and server_port, for example:
# demo.launch(server_name="your server name", server_port=<server port as int>)
# Read more in the docs: https://gradio.app/docs/