# SpeechBrain Emotion Recognition with OpenVINO

<div class="alert alert-block alert-danger"> <b>Important note:</b> This notebook requires Python >= 3.9. Please make sure that your environment fulfills this requirement before running it. </div>

[SpeechBrain](https://github.com/speechbrain/speechbrain) is an open-source PyTorch toolkit that accelerates Conversational AI development, i.e., the technology behind speech assistants, chatbots, and large language models. 

Learn more in the [GitHub repo](https://github.com/speechbrain/speechbrain) and the [paper](https://arxiv.org/pdf/2106.04624).

This notebook tutorial demonstrates optimization and inference of the SpeechBrain emotion recognition model with OpenVINO.

#### Table of contents:

- [Installations](#Installations)
- [Imports](#Imports)
- [Prepare base model](#Prepare-base-model)
- [Initialize model](#Initialize-model)
- [PyTorch inference](#PyTorch-inference)
- [SpeechBrain model optimization with Intel OpenVINO](#SpeechBrain-model-optimization-with-Intel-OpenVINO)
    - [Step 1: Prepare input tensor](#Step-1:-Prepare-input-tensor)
    - [Step 2: Convert model to OpenVINO IR](#Step-2:-Convert-model-to-OpenVINO-IR)
    - [Step 3: OpenVINO model inference](#Step-3:-OpenVINO-model-inference)



### Installations
[back to top ⬆️](#Table-of-contents:)

In [1]:
%pip install speechbrain --extra-index-url https://download.pytorch.org/whl/cpu
%pip install --upgrade --force-reinstall torch torchaudio --index-url https://download.pytorch.org/whl/cpu
%pip install "transformers>=4.30.0" "huggingface_hub>=0.8.0" "SoundFile"
%pip install -q "openvino>=2024.1.0"

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu
Note: you may need to restart the kernel to use updated packages.

### Imports
[back to top ⬆️](#Table-of-contents:)

In [2]:
import torch
import torchaudio
from speechbrain.inference.interfaces import foreign_class

import openvino as ov

torchvision is not available - cannot save figures


### Prepare base model
[back to top ⬆️](#Table-of-contents:)

The `foreign_class` function in SpeechBrain is a utility for loading and using custom PyTorch models within the SpeechBrain ecosystem. It integrates external or custom-built models into SpeechBrain's inference pipeline without modifying the core SpeechBrain codebase. It takes the following arguments:

1. `source`: the location of the pre-trained model checkpoint. Here, `"speechbrain/emotion-recognition-wav2vec2-IEMOCAP"` refers to a pre-trained model checkpoint available on the Hugging Face Hub.
2. `pymodule_file`: the path to a Python file containing the definition of the custom PyTorch model class. Here, `"custom_interface.py"` is the file that defines the `CustomEncoderWav2vec2Classifier` class.
3. `classname`: the name of the custom class defined in `pymodule_file`. Here, `"CustomEncoderWav2vec2Classifier"` is the class that extends SpeechBrain's `Pretrained` class and implements the methods needed for inference.

In [3]:
classifier = foreign_class(
    source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP", pymodule_file="custom_interface.py", classname="CustomEncoderWav2vec2Classifier"
)



Some weights of Wav2Vec2Model were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
speechbrain.lobes.models.huggingface_transformers.huggingface - Wav2Vec2Model is frozen.



### Initialize model
[back to top ⬆️](#Table-of-contents:)

In [4]:
# wav2vec2 torch model
torch_model = classifier.mods["wav2vec2"].model

### PyTorch inference 
[back to top ⬆️](#Table-of-contents:)

Perform emotion recognition on the sample audio file. The `classify_file` method returns four values:

1. `out_prob`: tensor or list containing the predicted probabilities or log probabilities for each emotion class.
2. `score`: scalar value representing the predicted probability or log probability of the most likely emotion class.
3. `index`: integer index of the most likely emotion class in `out_prob`.
4. `text_lab`: string or list of strings containing the textual label of the predicted emotion class (anger, happiness, sadness, or neutrality). The model reports abbreviated labels, for example `'ang'` for anger.

In [5]:
out_prob, score, index, text_lab = classifier.classify_file("speechbrain/emotion-recognition-wav2vec2-IEMOCAP/anger.wav")
print(f"Emotion Recognition with SpeechBrain PyTorch model: {text_lab}")

Emotion Recognition with SpeechBrain PyTorch model: ['ang']
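
To make the relationship between the four return values concrete, here is a minimal sketch in plain Python. It is not SpeechBrain's implementation: the `decode` helper, the example logits, and the label order in `LABELS` are assumptions made for illustration. The real label order comes from the model's label encoder.

```python
import math

# Hypothetical label order, for illustration only; the real mapping
# is stored in the classifier's label encoder.
LABELS = ["neu", "ang", "hap", "sad"]


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, labels=LABELS):
    """Return (out_prob, score, index, text_lab) for one utterance."""
    out_prob = softmax(logits)
    index = max(range(len(out_prob)), key=out_prob.__getitem__)
    score = out_prob[index]
    text_lab = [labels[index]]
    return out_prob, score, index, text_lab


# Made-up logits in which the second class dominates
out_prob, score, index, text_lab = decode([0.1, 2.3, -0.5, 0.4])
print(text_lab, index)
```

In this sketch `score` is simply `out_prob[index]`, and `text_lab` is the label stored at position `index`.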


## SpeechBrain model optimization with Intel OpenVINO
[back to top ⬆️](#Table-of-contents:)

### Step 1: Prepare input tensor
[back to top ⬆️](#Table-of-contents:)

In [6]:
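# Load the sample audio file; channels_first=False gives a [time, channels] tensor
signals = []
batch_size = 1
signal, sr = torchaudio.load("./anger.wav", channels_first=False)
# Apply the classifier's own audio normalizer before stacking
norm_audio = classifier.audio_normalizer(signal, sr)
signals.append(norm_audio)

# Number of samples in the normalized signal
sequence_length = norm_audio.shape[-1]

# Stack the signals into a single batch tensor
wavs = torch.stack(signals, dim=0)
wav_len = torch.tensor([sequence_length] * batch_size).unsqueeze(0)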
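
The cell above builds a batch containing a single signal, so no padding is needed. With several signals of different lengths, they would have to be padded to a common length before stacking. The sketch below shows the idea in plain Python. The `pad_batch` helper is hypothetical, and returning lengths as fractions of the longest signal is an assumption based on how SpeechBrain interfaces commonly express lengths. It is not part of this notebook's pipeline.

```python
def pad_batch(signals):
    """Zero-pad 1-D signals (lists of floats) to the length of the longest one.

    Returns the padded batch and each signal's length as a fraction
    of the longest signal.
    """
    max_len = max(len(s) for s in signals)
    padded = [list(s) + [0.0] * (max_len - len(s)) for s in signals]
    rel_lens = [len(s) / max_len for s in signals]
    return padded, rel_lens


# Two toy "waveforms" of different length
batch, rel_lens = pad_batch([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]])
print(len(batch), len(batch[0]), rel_lens)
```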

### Step 2: Convert model to OpenVINO IR
[back to top ⬆️](#Table-of-contents:)

In [7]:
# Convert the PyTorch model to an OpenVINO model, using the prepared waveform as example input
input_tensor = wavs.float()
ov_model = ov.convert_model(torch_model, example_input=input_tensor)



### Step 3: OpenVINO model inference
[back to top ⬆️](#Table-of-contents:)

In [8]:
# Sample configuration parameters; the target device is set to CPU by default
target_device = "CPU"
opts = {"PERFORMANCE_HINT": "LATENCY"}

core = ov.Core()
# The device is passed as its own argument; `config` holds only device properties
compiled_model = core.compile_model(ov_model, device_name=target_device, config=opts)

# Perform model inference
output_tensor = compiled_model(wavs)[0]
output_tensor = torch.from_numpy(output_tensor)

# output post-processing
outputs = classifier.mods.avg_pool(output_tensor, wav_len)
outputs = outputs.view(outputs.shape[0], -1)
outputs = classifier.mods.output_mlp(outputs).squeeze(1)
ov_out_prob = classifier.hparams.softmax(outputs)
score, index = torch.max(ov_out_prob, dim=-1)
text_lab = classifier.hparams.label_encoder.decode_torch(index)

print(f"Emotion Recognition with OpenVINO Model: {text_lab}")

Emotion Recognition with OpenVINO Model: ['ang']
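
The post-processing above first reduces the frame-level encoder output to one vector per utterance before classification. The sketch below illustrates that pooling step in plain Python, assuming `avg_pool` behaves as mean pooling over time. This is an assumption about the checkpoint's configuration that the notebook does not state. The `mean_pool` helper and its `rel_len` parameter are hypothetical.

```python
def mean_pool(frames, rel_len=1.0):
    """Average frame-level feature vectors over time.

    frames:  list of T feature vectors, each a list of floats
    rel_len: fraction of the T frames that hold real (unpadded) data
    """
    valid = max(1, round(rel_len * len(frames)))
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames[:valid]) / valid for d in range(dim)]


# Four frames of 2-D features; the last frame stands in for padding
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
pooled = mean_pool(frames, rel_len=0.75)
print(pooled)
```

Passing `rel_len=0.75` averages only the first three frames, so the padded frame does not dilute the result.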
