# SpeechBrain Emotion Recognition with OpenVINO

[SpeechBrain](https://github.com/speechbrain/speechbrain) is an open-source PyTorch toolkit that accelerates Conversational AI development, i.e., the technology behind speech assistants, chatbots, and large language models. 

Learn more in the [GitHub repo](https://github.com/speechbrain/speechbrain) and the [paper](https://arxiv.org/pdf/2106.04624).

This notebook tutorial demonstrates how to optimize a SpeechBrain emotion recognition model and run inference on it with OpenVINO.

#### Table of contents:

- [Installations](#Installations)
- [Imports](#Imports)
- [Prepare base model](#Prepare-base-model)
- [Initialize model](#Initialize-model)
- [PyTorch inference](#PyTorch-inference)
- [SpeechBrain model optimization with Intel OpenVINO](#SpeechBrain-model-optimization-with-Intel-OpenVINO)
    - [Step 1: Prepare input tensor](#Step-1:-Prepare-input-tensor)
    - [Step 2: Convert model to OpenVINO IR](#Step-2:-Convert-model-to-OpenVINO-IR)
    - [Step 3: OpenVINO model inference](#Step-3:-OpenVINO-model-inference)



### Installations
[back to top ⬆️](#Table-of-contents:)

In [1]:
%pip install speechbrain
%pip install "torch>=1.9.0" "torchaudio>=1.9.0" --index-url https://download.pytorch.org/whl/cpu
%pip install "transformers>=4.30.0" "huggingface_hub>=0.8.0" "SoundFile"
%pip install -q "openvino>=2024.1.0" "nncf>=2.10.0" 

Note: you may need to restart the kernel to use updated packages.
Looking in indexes: https://download.pytorch.org/whl/cpu
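
After restarting the kernel, it can help to confirm which versions were actually installed. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, not part of any of the packages above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages installed by the cell above
required = ["speechbrain", "torch", "torchaudio", "transformers", "openvino", "nncf"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'not installed'}")
```

A package reported as `not installed` usually means the kernel was not restarted or the install ran in a different environment.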


### Imports
[back to top ⬆️](#Table-of-contents:)

In [2]:
import torch, torchaudio
from speechbrain.inference.interfaces import foreign_class
from speechbrain.inference.interfaces import Pretrained

import openvino as ov
import numpy as np
import time

torchvision is not available - cannot save figures


### Prepare base model
[back to top ⬆️](#Table-of-contents:)

In [3]:
classifier = foreign_class(
    source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
    pymodule_file="custom_interface.py",
    classname="CustomEncoderWav2vec2Classifier"
)




Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
speechbrain.lobes.models.huggingface_transformers.huggingface - Wav2Vec2Model is frozen.



### Initialize model
[back to top ⬆️](#Table-of-contents:)

In [4]:
# wav2vec2 torch model
torch_model = classifier.mods["wav2vec2"].model

### PyTorch inference 
[back to top ⬆️](#Table-of-contents:)

In [5]:
out_prob, score, index, text_lab = classifier.classify_file("speechbrain/emotion-recognition-wav2vec2-IEMOCAP/anger.wav")
print(f"Emotion Recognition with SpeechBrain PyTorch model: {text_lab}")

Emotion Recognition with SpeechBrain PyTorch model: ['ang']
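
The four values returned by `classify_file` are related in a simple way: `out_prob` holds the per-class scores, `score` and `index` are the maximum and its position, and `text_lab` is the decoded label. The sketch below reproduces that relationship in plain Python with made-up logits. It assumes that `softmax` yields plain probabilities; the label order and the full emotion names are also assumptions for illustration, since the classifier's own label encoder defines the real mapping and only `ang` is confirmed by the output above.

```python
import math

# Assumed label order and full names, for illustration only; the real
# mapping lives in the classifier's label encoder.
LABELS = ["neu", "ang", "hap", "sad"]
FULL_NAMES = {"neu": "neutral", "ang": "angry", "hap": "happy", "sad": "sad"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return (probabilities, top score, top index, label)."""
    out_prob = softmax(logits)
    index = max(range(len(out_prob)), key=out_prob.__getitem__)
    return out_prob, out_prob[index], index, LABELS[index]


# Hypothetical logits, not produced by the model
out_prob, score, index, text_lab = classify([0.2, 3.1, -0.5, 0.4])
print(text_lab, FULL_NAMES[text_lab], round(score, 3))
```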


## SpeechBrain model optimization with Intel OpenVINO
[back to top ⬆️](#Table-of-contents:)

### Step 1: Prepare input tensor
[back to top ⬆️](#Table-of-contents:)

In [6]:
# Using sample audio file
signals = []
batch_size = 1
signal, sr = torchaudio.load("./anger.wav", channels_first=False)
norm_audio = classifier.audio_normalizer(signal, sr)
signals.append(norm_audio)

sequence_length = norm_audio.shape[-1]

wavs = torch.stack(signals, dim=0)
wav_len = torch.tensor([sequence_length] * batch_size).unsqueeze(0)
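
Step 1 handles a single file. For several files of different duration, SpeechBrain interfaces generally expect a zero-padded batch together with lengths expressed relative to the longest signal, so the longest item has length 1.0. The sketch below shows that bookkeeping in plain Python; `pad_batch` is a hypothetical helper, and the short lists stand in for normalized waveforms.

```python
def pad_batch(signals, pad_value=0.0):
    """Right-pad 1-D signals to a common length.

    Returns (padded, rel_lens), where rel_lens[i] = len(signals[i]) / max_len.
    """
    max_len = max((len(s) for s in signals), default=0)
    if max_len == 0:
        raise ValueError("need at least one non-empty signal")
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in signals]
    rel_lens = [len(s) / max_len for s in signals]
    return padded, rel_lens


# Toy stand-ins for normalized waveforms of different duration
batch, rel_lens = pad_batch([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]])
print(batch)
print(rel_lens)
```

With PyTorch, `torch.tensor(batch)` and `torch.tensor(rel_lens)` would then play the roles of `wavs` and `wav_len`. Under this convention, a single-file batch has relative length 1.0.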

### Step 2: Convert model to OpenVINO IR
[back to top ⬆️](#Table-of-contents:)

In [7]:
# Convert the PyTorch model to an OpenVINO model, using the sample audio as the example input
input_tensor = wavs.float()
ov_model = ov.convert_model(torch_model, example_input=input_tensor)

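
`ov.convert_model` returns an in-memory model, so the conversion is repeated on every run unless the result is saved. One possible caching pattern, sketched below, writes the IR once with `ov.save_model` and reloads it with `Core.read_model` afterwards. The helper names `ir_files` and `load_or_convert`, the `convert_fn` callback, and the file name are hypothetical; `openvino` is imported lazily inside the function, so the path logic runs even where it is not installed.

```python
from pathlib import Path


def ir_files(xml_path):
    """Return the (.xml, .bin) pair that makes up an OpenVINO IR on disk."""
    xml_path = Path(xml_path)
    return xml_path, xml_path.with_suffix(".bin")


def load_or_convert(xml_path, convert_fn):
    """Read a cached IR if both files exist; otherwise convert and save it.

    convert_fn is a hypothetical zero-argument callable returning an ov.Model,
    e.g. lambda: ov.convert_model(torch_model, example_input=input_tensor).
    """
    import openvino as ov  # imported lazily; assumes openvino is installed

    xml_file, bin_file = ir_files(xml_path)
    if xml_file.exists() and bin_file.exists():
        return ov.Core().read_model(str(xml_file))
    model = convert_fn()
    xml_file.parent.mkdir(parents=True, exist_ok=True)
    ov.save_model(model, str(xml_file))
    return model


# Hypothetical output location for the converted wav2vec2 encoder
xml_file, bin_file = ir_files("model/wav2vec2_emotion.xml")
print(xml_file, bin_file)
```

Note that `ov.save_model` compresses weights to FP16 by default; passing `compress_to_fp16=False` keeps the original precision.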


### Step 3: OpenVINO model inference
[back to top ⬆️](#Table-of-contents:)

In [8]:
# Sample configuration: target device and performance hint
target_device = "CPU"  # default
opts = {"PERFORMANCE_HINT": "LATENCY"}

core = ov.Core()
# The device is passed as its own argument; config holds only plugin properties
compiled_model = core.compile_model(ov_model, device_name=target_device, config=opts)
output = compiled_model.outputs[0]

# Perform model inference
output_tensor = compiled_model(wavs)[output]
output_tensor = torch.from_numpy(output_tensor)

# output post-processing 
outputs = classifier.mods.avg_pool(output_tensor, wav_len)
outputs = outputs.view(outputs.shape[0], -1)
outputs = classifier.mods.output_mlp(outputs).squeeze(1)
ov_out_prob = classifier.hparams.softmax(outputs)
score, index = torch.max(ov_out_prob, dim=-1)
text_lab = classifier.hparams.label_encoder.decode_torch(index)

print(f"Emotion Recognition with OpenVINO Model: {text_lab}")


Emotion Recognition with OpenVINO Model: ['ang']
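
Both models predict `ang` for the sample file, so a natural next check is latency. The sketch below is one simple way to measure it with the standard library; `time_inference` is a hypothetical helper, and the workload is a stand-in so that the block runs without the models loaded.

```python
import statistics
import time


def time_inference(fn, n_runs=20, warmup=3):
    """Call fn repeatedly and return (median, best) latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not measured
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), min(timings)


# Stand-in workload; replace with a call to the model being measured
median_ms, best_ms = time_inference(lambda: sum(range(10_000)))
print(f"median: {median_ms:.3f} ms, best: {best_ms:.3f} ms")
```

In this notebook, the callables could be `lambda: torch_model(input_tensor)` and `lambda: compiled_model(wavs)`. That comparison covers only the wav2vec2 encoder, since pooling and the output MLP run in PyTorch in both cases.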
