# Speech-to-Text RNNT PyTorch Multilanguage

Encoder model + RNNT loss using PyTorch

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [malaya-speech/example/stt-transducer-model-pt-multilanguage](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-transducer-model-pt-multilanguage).
    
</div>

<div class="alert alert-warning">

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
    
</div>

In [1]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''

In [2]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

`pyaudio` is not available, `malaya_speech.streaming.pyaudio` is not able to use.


In [3]:
import logging

logging.basicConfig(level=logging.INFO)

### List available RNNT models

In [4]:
malaya_speech.stt.transducer.available_pt_transformer()

INFO:malaya_speech.stt:for `malay-fleur102` language, tested on FLEURS102 `ms_my` test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `malay-malaya` language, tested on malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt
INFO:malaya_speech.stt:for `singlish` language, tested on IMDA malaya-speech test set, https://github.com/huseinzol05/malaya-speech/tree/master/pretrained-model/prepare-stt


| Model | Size (MB) | malay-malaya | malay-fleur102 | Language | singlish |
|---|---|---|---|---|---|
| mesolitica/conformer-tiny | 38.5 | {'WER': 0.17341180814, 'CER': 0.05957485024} | {'WER': 0.19524478979, 'CER': 0.0830808938} | [malay] | |
| mesolitica/conformer-base | 121.0 | {'WER': 0.122076123261, 'CER': 0.03879606324} | {'WER': 0.1326737206665, 'CER': 0.05032914857} | [malay] | |
| mesolitica/conformer-medium | 243.0 | {'WER': 0.1054817492564, 'CER': 0.0313518992842} | {'WER': 0.1172708897486, 'CER': 0.0431050488} | [malay] | |
| mesolitica/emformer-base | 162.0 | {'WER': 0.175762423786, 'CER': 0.06233919000537} | {'WER': 0.18303839134, 'CER': 0.0773853362} | [malay] | |
| mesolitica/conformer-base-singlish | 121.0 | | | [singlish] | {'WER': 0.06517537334361, 'CER': 0.03265430876} |
| mesolitica/conformer-medium-mixed | 243.0 | {'WER': 0.111166517935, 'CER': 0.03410958328} | {'WER': 0.108354748, 'CER': 0.037785722} | [malay, singlish] | {'WER': 0.091969755225, 'CER': 0.044627194623} |
| mesolitica/conformer-medium-mixed-augmented | 243.0 | {'WER': 0.111166517935, 'CER': 0.03410958328} | {'WER': 0.108354748, 'CER': 0.037785722} | [malay, singlish] | {'WER': 0.091969755225, 'CER': 0.044627194623} |
| mesolitica/conformer-large-mixed-augmented | 413.0 | {'WER': 0.111166517935, 'CER': 0.03410958328} | {'WER': 0.108354748, 'CER': 0.037785722} | [malay, singlish] | {'WER': 0.091969755225, 'CER': 0.044627194623} |
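
The WER and CER columns are edit-distance rates: the number of word (or character) insertions, deletions and substitutions needed to turn the model output into the reference, divided by the reference length. The sketch below shows one plain way to compute them. It is not the evaluation script used for the table, and the `reference` / `hypothesis` strings are made up for illustration; any text normalization applied in the real evaluation is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical pair, for illustration only (not taken from any test set).
reference = 'nama saya husein'
hypothesis = 'nama saya hussein'
print(wer(reference, hypothesis))
print(cer(reference, hypothesis))
```

A single misspelled word counts as one full word error but only as one character error, which is why CER is usually much lower than WER in the table.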


In [5]:
malaya_speech.stt.google_accuracy

{'malay-malaya': {'WER': 0.16477548774, 'CER': 0.05973209121},
 'malay-fleur102': {'WER': 0.109588779, 'CER': 0.047891527},
 'singlish': {'WER': 0.4941349, 'CER': 0.3026296}}
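
To put these Google STT numbers next to the pretrained models, the sketch below computes the relative WER reduction of `mesolitica/conformer-medium-mixed` over Google STT on each test set. The values are copied from the two outputs above; the helper name `relative_wer_reduction` is made up for this example.

```python
# WER values copied from the outputs above.
model_wer = {
    'malay-malaya': 0.111166517935,
    'malay-fleur102': 0.108354748,
    'singlish': 0.091969755225,
}
google_wer = {
    'malay-malaya': 0.16477548774,
    'malay-fleur102': 0.109588779,
    'singlish': 0.4941349,
}


def relative_wer_reduction(baseline, candidate):
    """Fraction by which `candidate` lowers WER relative to `baseline`."""
    return (baseline - candidate) / baseline


reductions = {
    test_set: relative_wer_reduction(google_wer[test_set], model_wer[test_set])
    for test_set in google_wer
}
for test_set, reduction in reductions.items():
    print(f'{test_set}: {reduction:.1%} lower WER than Google STT')
```

A positive value means the pretrained model has the lower WER on that test set.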

### Load RNNT model

```python
def pt_transformer(
    model: str = 'mesolitica/conformer-base',
    **kwargs,
):
    """
    Load Encoder-Transducer ASR model using Pytorch.

    Parameters
    ----------
    model : str, optional (default='mesolitica/conformer-base')
        Check available models at `malaya_speech.stt.transducer.available_pt_transformer()`.

    Returns
    -------
    result : malaya_speech.torch_model.torchaudio.Conformer class
    """
```

In [6]:
model_mixed = malaya_speech.stt.transducer.pt_transformer(model = 'mesolitica/conformer-medium-mixed')

INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-mixed/model.pt
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-mixed/malay-stt.model
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-mixed/malay-stats.json


In [7]:
model_mixed_augmented = malaya_speech.stt.transducer.pt_transformer(model = 'mesolitica/conformer-medium-mixed-augmented')

INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-mixed-augmented/model.pt
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-mixed-augmented/malay-stt.model
INFO:malaya_boilerplate.huggingface:downloading frozen mesolitica/conformer-medium-mixed-augmented/malay-stats.json


### Load sample

In [8]:
from datasets import Audio

sr = 16000
audio = Audio(sampling_rate=sr)

In [9]:
y, _ = malaya_speech.load('speech/example-speaker/husein-zolkepli.wav')
y1 = audio.decode_example(audio.encode_example('speech/example-speaker/husein-zolkepli-mixed-1.mp3'))['array']
y2 = audio.decode_example(audio.encode_example('speech/example-speaker/husein-zolkepli-mixed-2.mp3'))['array']
singlish0, _ = malaya_speech.load('speech/singlish/singlish0.wav')
singlish1, _ = malaya_speech.load('speech/singlish/singlish1.wav')

In [10]:
import IPython.display as ipd

ipd.Audio(y, rate = sr)

In [11]:
ipd.Audio(y1, rate = sr)

In [12]:
ipd.Audio(y2, rate = sr)

In [13]:
ipd.Audio(singlish0, rate = sr)

In [14]:
ipd.Audio(singlish1, rate = sr)

### Predict using beam decoder

```python
def beam_decoder(self, inputs, beam_width: int = 20):
    """
    Transcribe inputs using beam decoder.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    beam_width: int, optional (default=20)
        beam size for beam decoder.

    Returns
    -------
    result: List[str]
    """
```
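
`beam_width` trades decoding time against search breadth. A small helper like the one below can show how the transcripts and wall time change across widths. It is only a sketch: `compare_beam_widths` is a hypothetical name, and the only thing it assumes about the model is the `beam_decoder(inputs, beam_width=...)` signature documented above.

```python
import time


def compare_beam_widths(model, inputs, beam_widths=(1, 5, 20)):
    """Decode `inputs` once per beam width; record transcripts and wall time."""
    results = {}
    for width in beam_widths:
        start = time.perf_counter()
        transcripts = model.beam_decoder(inputs, beam_width=width)
        results[width] = {
            'transcripts': transcripts,
            'seconds': time.perf_counter() - start,
        }
    return results
```

With the objects loaded earlier, a call such as `compare_beam_widths(model_mixed, [y, y1])` would return one entry per beam width.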

In [15]:
%%time

model_mixed.beam_decoder([y, y1, y2, singlish0, singlish1])

CPU times: user 43.4 s, sys: 1.02 s, total: 44.4 s
Wall time: 3.88 s


['testing nama saya husin bin zulkifli',
 'hello nama saya mesin i hate fish but the lightry chicken thank you',
 'hari ini saya nak cakap tentang harian saya sampai is good setting sebab bermutu is good market arfafani alat tepi hujan suka mainan di ruang',
 'and then see how they roll it in film okay actually',
 'then you tell to your eyes']

In [16]:
%%time

model_mixed_augmented.beam_decoder([y, y1, y2, singlish0, singlish1])

CPU times: user 39.3 s, sys: 357 ms, total: 39.6 s
Wall time: 3.44 s


['testing nama saya hussein bin zulcaffly',
 'hello nama saya send i hate fish but i like chicken thank you',
 'ini saya nak cakap tentang harian saya something is good something is bad but most of the day is good my kids are funny i like to play with them saya suka mainan di ruang',
 'and then see how they roll it in film okay actually',
 'i tech to your eyes']

As you can see, `model_mixed_augmented` decodes the multilanguage samples much better. This is because:

- `model_mixed` supports both Malay and Singlish, but it was trained on monolanguage samples only, so it transcribes poorly when a single audio sample contains more than one language.

- `model_mixed_augmented` was trained on the augmented noisy-join dataset, https://huggingface.co/datasets/mesolitica/noisy-join-mixed-asr, so it can transcribe an audio sample that mixes languages.
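
To make the idea of a "noisy join" concrete, here is a minimal sketch of how monolanguage clips could be concatenated into one mixed-language training sample with background noise added. This is an assumption for illustration, not the recipe used to build the dataset linked above; the function name, the silence gap, and the white-noise SNR are all hypothetical choices.

```python
import numpy as np


def join_with_noise(clips, sample_rate=16000, gap_seconds=0.2, snr_db=20.0, seed=0):
    """Concatenate clips with short silences, then add white noise at `snr_db`.

    Illustrative only: the real dataset may be built differently.
    """
    gap = np.zeros(int(sample_rate * gap_seconds), dtype=np.float32)
    pieces = []
    for i, clip in enumerate(clips):
        if i > 0:
            pieces.append(gap)  # silence between consecutive clips
        pieces.append(np.asarray(clip, dtype=np.float32))
    joined = np.concatenate(pieces)

    # Scale the noise so that signal power / noise power matches snr_db.
    # The signal power is measured over the joined clip, silences included.
    signal_power = float(np.mean(joined ** 2))
    noise_power = signal_power / (10 ** (snr_db / 10))
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=joined.shape)
    return (joined + noise).astype(np.float32)
```

For example, `join_with_noise([y, singlish0])` would produce a Malay clip followed by a Singlish one, with the matching training transcript being the two transcripts joined in the same order.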

### Compare with Google STT

In [17]:
# Note: this rebinds `sr`, which held the sample rate (16000) in the earlier cells.
import speech_recognition as sr

r = sr.Recognizer()

In [24]:
import soundfile as sf

sf.write('test-mixed1.wav', y1, 16000)
sf.write('test-mixed2.wav', y2, 16000)

In [26]:
with sr.AudioFile('speech/example-speaker/husein-zolkepli.wav') as source:
    a = r.record(source)

text = r.recognize_google(a, language = 'ms')
text

'testing Nama saya Hussein bin Zulkifli'

In [22]:
with sr.AudioFile('test-mixed1.wav') as source:
    a = r.record(source)

text = r.recognize_google(a, language = 'ms')
text

'Helo nama saya Hussein Aidil Hafiz lagi pun Thank you'

In [25]:
with sr.AudioFile('test-mixed2.wav') as source:
    a = r.record(source)

text = r.recognize_google(a, language = 'ms')
text

'sains nak cakap dengan angah harian saya macam saya juga nak tengok cepat sebab musuh boleh diskaun semakin Zam pahala kita hujan Saya suka main dia orang'

Google STT transcribes the Malay-only sample well, but its output for the two mixed-language samples is poor.