# Speech to Text with OpenVINO

This tutorial demonstrates speech-to-text recognition with OpenVINO.

For this tutorial, we use the [quartznet 15x5](https://docs.openvino.ai/2021.4/omz_models_model_quartznet_15x5_en.html) model. QuartzNet performs automatic speech recognition. Its design is based on the Jasper architecture, which is a convolutional model trained with Connectionist Temporal Classification (CTC) loss. The model is available from [Open Model Zoo](https://github.com/openvinotoolkit/open_model_zoo/).


## Imports

In [None]:
import librosa
import librosa.display
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal
from openvino.inference_engine import IECore
from pathlib import Path

## Settings

In this part, set up all variables used in the notebook.

In [None]:
model_folder = "model"
download_folder = "output"
data_folder = "data"

precision = "FP16"
model_name = "quartznet-15x5-en"

## Download and Convert Public Model
If it is your first run, models will be downloaded and converted here. It may take a few minutes. We use `omz_downloader` and `omz_converter`, which are command-line tools from the `openvino-dev` package.


### Download Model

`omz_downloader` automatically creates a directory structure and downloads the selected model. This step is skipped if the model is already downloaded. The selected model comes from the public directory, which means it must be converted into Intermediate Representation (IR).

In [None]:
# Check if model is already downloaded in download directory
path_to_model_weights = Path(f'{download_folder}/public/{model_name}/models')
downloaded_model_file = list(path_to_model_weights.glob('*.pth'))

if not path_to_model_weights.is_dir() or len(downloaded_model_file) == 0:
    download_command = f"omz_downloader --name {model_name} --output_dir {download_folder} --precision {precision}"
    ! $download_command

### Convert Model

`omz_converter` converts the pre-trained PyTorch model to ONNX format, and then converts the ONNX model to OpenVINO IR format. A single call to `omz_converter` handles both stages.

In [None]:
# Check if model is already converted in model directory
path_to_converted_weights = Path(f'{model_folder}/public/{model_name}/{precision}/{model_name}.bin')

if not path_to_converted_weights.is_file():
    convert_command = f"omz_converter --name {model_name} --precisions {precision} --download_dir {download_folder} --output_dir {model_folder}"
    ! $convert_command
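
As a rough sketch of where the converted files are expected to land, the snippet below rebuilds the IR paths from the settings used in this notebook. The helper name `ir_paths` is hypothetical; the directory layout is the one that the path checks in this notebook already assume.

```python
from pathlib import Path
from typing import Tuple


def ir_paths(model_folder: str, model_name: str, precision: str) -> Tuple[Path, Path]:
    """Return the expected (.xml, .bin) paths of the converted IR model."""
    # Layout assumed from this notebook: <model_folder>/public/<name>/<precision>/
    base = Path(model_folder) / "public" / model_name / precision
    return base / f"{model_name}.xml", base / f"{model_name}.bin"


xml_path, bin_path = ir_paths("model", "quartznet-15x5-en", "FP16")
print(xml_path)
print(bin_path)
```

Both files sit in the same directory, so checking for the `.bin` file is enough to decide whether conversion can be skipped.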

## Audio Processing

Now that the model is converted, load an audio file. 

### Defining constants

First, locate an audio file and define the alphabet used by the model. This tutorial uses the Latin alphabet, beginning with a space symbol and ending with a blank symbol. Here the blank symbol is `~`, but any other character could be used.

In [None]:
audio_file_name = "edge_to_cloud.ogg"
alphabet = " abcdefghijklmnopqrstuvwxyz'~"
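
To make the role of the alphabet concrete, the short sketch below shows how each character corresponds to a class index and where the blank symbol sits. The name `char_to_index` is introduced here for illustration only.

```python
alphabet = " abcdefghijklmnopqrstuvwxyz'~"

# The blank symbol is the last character, so its index is len(alphabet) - 1.
blank_id = len(alphabet) - 1

# Map each character to its position in the alphabet string.
char_to_index = {char: index for index, char in enumerate(alphabet)}

print(len(alphabet))        # 29 classes in total
print(blank_id)             # 28
print(char_to_index["a"])   # 1, because the space symbol takes index 0
```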

### Available Audio Formats

There are multiple audio formats that can be used with the model. 

**List of supported audio formats:** 

AIFF, AU, AVR, CAF, FLAC, HTK, SVX, MAT4, MAT5, MPC2K, OGG, PAF, PVF, RAW, RF64, SD2, SDS, IRCAM, VOC, W64, WAV, NIST, WAVEX, WVE, XI
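
A minimal sketch of an extension check against the list above could look like this. The helper `has_supported_extension` is hypothetical, and it assumes that the file extension equals the format name (ignoring case), which does not hold for every format.

```python
from pathlib import Path

# Format names copied from the list above.
SUPPORTED_FORMATS = {
    "AIFF", "AU", "AVR", "CAF", "FLAC", "HTK", "SVX", "MAT4", "MAT5", "MPC2K",
    "OGG", "PAF", "PVF", "RAW", "RF64", "SD2", "SDS", "IRCAM", "VOC", "W64",
    "WAV", "NIST", "WAVEX", "WVE", "XI",
}


def has_supported_extension(file_name: str) -> bool:
    """Assumption: the extension matches the format name, ignoring case."""
    extension = Path(file_name).suffix.lstrip(".").upper()
    return extension in SUPPORTED_FORMATS


print(has_supported_extension("edge_to_cloud.ogg"))
print(has_supported_extension("speech.mp3"))
```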

### Load Audio File

After checking the file extension, load the file. As an additional parameter, pass `sr`, which stands for sampling rate. The model supports files with a sampling rate of 16 kHz.

In [None]:
audio, sampling_rate = librosa.load(path=f'{data_folder}/{audio_file_name}', sr=16000)

You can play your audio file.

In [None]:
ipd.Audio(audio, rate=sampling_rate)

### Visualise Audio File

You can visualize the audio file as a waveform plot and as a spectrogram.

In [None]:
plt.figure()
librosa.display.waveplot(y=audio, sr=sampling_rate, max_points=50000.0, x_axis='time', offset=0.0, max_sr=1000);
plt.show()
specto_audio = librosa.stft(audio)
specto_audio = librosa.amplitude_to_db(np.abs(specto_audio), ref=np.max)
print(specto_audio.shape)
librosa.display.specshow(specto_audio, sr=sampling_rate, x_axis='time', y_axis='hz');

### Change Type of Data

The file loaded in the previous step may contain `float` data with values in the range -1 to 1. To generate viable input, multiply each value by the maximum value of `int16` and convert the result to the `int16` type.

In [None]:
if max(np.abs(audio)) <= 1:
    audio = (audio * (2**15 - 1))
audio = audio.astype(np.int16)
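
The scaling can be checked on a few toy samples. This is only an illustration with made-up values, not part of the pipeline.

```python
import numpy as np

# Toy samples in the float range [-1, 1] (illustrative values only).
samples = np.array([0.0, 0.5, -1.0, 1.0], dtype=np.float32)

# Scale by the largest int16 value, then cast; the cast truncates toward zero.
int16_max = 2**15 - 1
scaled = (samples * int16_max).astype(np.int16)

print(scaled.dtype)
print(scaled.tolist())
```

Multiplying by `2**15 - 1` rather than `2**15` keeps a sample of exactly 1.0 inside the `int16` range.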

### Convert Audio to Mel Spectrum

Next, convert the pre-processed audio to a [Mel Spectrum](https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53). To learn more about why we do this, see [this article](https://towardsdatascience.com/audio-deep-learning-made-simple-part-2-why-mel-spectrograms-perform-better-aad889a93505).

In [None]:
def audio_to_mel(audio, sampling_rate):
    assert sampling_rate == 16000, "Only 16 kHz audio supported"
    preemph = 0.97
    preemphased = np.concatenate([audio[:1], audio[1:] - preemph * audio[:-1].astype(np.float32)])

    # Calculate window length
    win_length = round(sampling_rate * 0.02)

    # Based on previously calculated window length run short-time Fourier transform
    spec = np.abs(librosa.core.spectrum.stft(preemphased, n_fft=512, hop_length=round(sampling_rate * 0.01),
                  win_length=win_length, center=True, window=scipy.signal.windows.hann(win_length), pad_mode='reflect'))

    # Create mel filter-bank, produce transformation matrix to project current values onto Mel-frequency bins
    mel_basis = librosa.filters.mel(sr=sampling_rate, n_fft=512, n_mels=64, fmin=0.0, fmax=8000.0, htk=False)
    return mel_basis, spec


def mel_to_input(mel_basis, spec, padding=16):
    # Convert to logarithmic scale
    log_melspectrum = np.log(np.dot(mel_basis, np.power(spec, 2)) + 2 ** -24)

    # Normalize output
    normalized = (log_melspectrum - log_melspectrum.mean(1)[:, None]) / (log_melspectrum.std(1)[:, None] + 1e-5)

    # Calculate padding
    remainder = normalized.shape[1] % padding
    if remainder != 0:
        return np.pad(normalized, ((0, 0), (0, padding - remainder)))[None]
    return normalized[None]
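
The arithmetic behind the window, hop, and padding values can be sketched in plain Python. The helper `padded_frame_count` and the one-second clip are illustrative assumptions; the frame count relies on the behaviour of a centered STFT, which yields `1 + n_samples // hop_length` frames.

```python
SAMPLING_RATE = 16000

win_length = round(SAMPLING_RATE * 0.02)   # 20 ms window
hop_length = round(SAMPLING_RATE * 0.01)   # 10 ms hop
n_fft = 512


def padded_frame_count(n_samples: int, padding: int = 16) -> int:
    """Frames after padding the time axis up to a multiple of `padding`."""
    # With center=True, the STFT yields 1 + n_samples // hop_length frames.
    n_frames = 1 + n_samples // hop_length
    remainder = n_frames % padding
    return n_frames if remainder == 0 else n_frames + padding - remainder


one_second = SAMPLING_RATE  # hypothetical one-second clip
print(win_length, hop_length, n_fft // 2 + 1)   # 320 160 257
print(1 + one_second // hop_length)             # 101 frames before padding
print(padded_frame_count(one_second))           # 112 frames after padding
```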

### Run Conversion from Audio to Mel Format

In this step, convert the loaded audio to the [Mel scale](https://en.wikipedia.org/wiki/Mel_scale).

In [None]:
mel_basis, spec = audio_to_mel(audio=audio.flatten(), sampling_rate=sampling_rate)
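
The first operation inside `audio_to_mel` is a pre-emphasis filter: the first sample is kept, and every later sample has 0.97 times its predecessor subtracted. A plain-Python sketch of that step, using a hypothetical helper `preemphasize` and made-up sample values, shows its effect on slow and fast signals.

```python
def preemphasize(samples, coefficient=0.97):
    """Return y[0] = x[0] and y[t] = x[t] - coefficient * x[t - 1] for t >= 1."""
    if not samples:
        return []
    filtered = [float(samples[0])]
    for previous, current in zip(samples, samples[1:]):
        filtered.append(current - coefficient * previous)
    return filtered


# A constant signal is almost cancelled; an alternating one is amplified.
constant = preemphasize([100, 100, 100])
alternating = preemphasize([100, -100, 100])
print(constant)
print(alternating)
```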

### Visualise Mel Spectrogram

To learn more about Mel spectrograms, follow this [link](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0). The first image shows the magnitude spectrogram returned by `audio_to_mel` on a logarithmic frequency axis, and the second shows the filter bank used to convert Hz to Mels.

In [None]:
librosa.display.specshow(data=spec, sr=sampling_rate, x_axis='time', y_axis='log');
plt.show();
librosa.display.specshow(data=mel_basis, sr=sampling_rate, x_axis='linear');
plt.ylabel('Mel filter');

### Adjust Mel scale to Input

Before reading the network, convert the spectrogram into the input format that the model expects.

In [None]:
audio = mel_to_input(mel_basis=mel_basis, spec=spec)

## Load Model

Now, read and load the network.

In [None]:
ie = IECore()

You can choose which device runs the network. This notebook loads the model on the CPU by default. You can select a device manually (CPU, GPU, MYRIAD, etc.) or let the engine choose the best available device (AUTO).

To list all available devices that you can use, run `print(ie.available_devices)`.

In [None]:
print(ie.available_devices)

To change the device used for your network, set the `device_name` argument of `load_network` to one of the values printed by the cell above.

In [None]:
net = ie.read_network(
    model=f"{model_folder}/public/{model_name}/{precision}/{model_name}.xml"
)
net.reshape({next(iter(net.input_info)): audio.shape})
exec_net = ie.load_network(network=net, device_name="CPU")
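
One possible way to pick a value for `device_name` is a small fallback helper like the sketch below. `pick_device` is a hypothetical function, not part of OpenVINO, and it assumes that the entries of `ie.available_devices` match the preferred names exactly.

```python
def pick_device(available_devices, preferred=("GPU", "CPU")):
    """Return the first preferred device that is available (hypothetical helper)."""
    for device in preferred:
        if device in available_devices:
            return device
    raise RuntimeError(f"None of {preferred} found in {available_devices}")


# Example lists standing in for the value of ie.available_devices.
print(pick_device(["CPU"]))
print(pick_device(["CPU", "GPU"]))
```

The returned string could then be passed as `device_name` when loading the network.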

### Do Inference

Everything is set up. Now the only thing remaining is passing input to the previously loaded network and running inference.

In [None]:
input_layer_ir = next(iter(exec_net.input_info))

character_probabilities = exec_net.infer({input_layer_ir: audio}).values()

### Read Output

After inference, read the output. The default output of `quartznet 15x5` (output name `output`, shape 1x64x29) contains per-frame probabilities (after LogSoftmax) for every symbol in the alphabet. The output data format is BxNxC, where:

* B - batch size
* N - number of audio frames
* C - alphabet size, including the Connectionist Temporal Classification (CTC) blank symbol

Convert this output to a more human-readable format. To do this, pick the symbol with the highest probability for each frame. Then, following [Connectionist Temporal Classification Decoding](https://towardsdatascience.com/beam-search-decoding-in-ctc-trained-neural-networks-5a889a3d85a7), collapse consecutive repeated symbols in the resulting list of indexes and remove all the blanks.

The last step is looking up the symbol for each remaining index in the alphabet.

In [None]:
character_probabilities = next(iter(character_probabilities))

# Remove unnecessary dimension
character_probabilities = np.squeeze(character_probabilities)

# Run argmax to pick the most probable symbols
character_probabilities = np.argmax(character_probabilities, axis=1)

### Implementation of Decoding

To decode the output explained above, we need a [Connectionist Temporal Classification (CTC) decode](https://towardsdatascience.com/beam-search-decoding-in-ctc-trained-neural-networks-5a889a3d85a7) function. This solution removes consecutive repeated letters and blank symbols from the output.

In [None]:
def ctc_greedy_decode(predictions):
    previous_letter_id = blank_id = len(alphabet) - 1
    transcription = list()
    for letter_index in predictions:
        if previous_letter_id != letter_index != blank_id:
            transcription.append(alphabet[letter_index])
        previous_letter_id = letter_index
    return ''.join(transcription)
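
To see the collapse rule on a toy input, here is a self-contained sketch that performs the same three steps (argmax, collapse repeats, drop blanks) with `itertools.groupby`. The names `greedy_decode` and `one_hot_frame` and the frame scores are made up for illustration.

```python
from itertools import groupby

alphabet = " abcdefghijklmnopqrstuvwxyz'~"
blank_id = len(alphabet) - 1


def greedy_decode(frame_scores):
    """Argmax per frame, collapse repeated indexes, then drop blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    collapsed = [index for index, _ in groupby(best)]
    return "".join(alphabet[i] for i in collapsed if i != blank_id)


def one_hot_frame(char):
    """Toy frame whose highest score sits on `char` (illustrative data only)."""
    row = [-10.0] * len(alphabet)
    row[alphabet.index(char)] = 0.0
    return row


# Frames spell "hh~ell~loo"; the blank between the two l's keeps both of them.
frames = [one_hot_frame(c) for c in "hh~ell~loo"]
print(greedy_decode(frames))  # hello
```

Without the blank between them, the repeated `l` frames would collapse into a single letter, which is why the blank symbol is needed.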

### Run Decoding and Print Output

In [None]:
transcription = ctc_greedy_decode(character_probabilities)
print(transcription)