# Speech-to-Text RNNT

Encoder model + RNNT loss

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [malaya-speech/example/stt-transducer-model](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-transducer-model).
    
</div>

<div class="alert alert-warning">

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
    
</div>

<div class="alert alert-warning">

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).
    
</div>

In [1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

### List available RNNT models

In [2]:
malaya_speech.stt.available_transducer()

| Model | Size (MB) | Quantized Size (MB) | WER | CER |
|---|---|---|---|---|
| small-conformer | 49.2 | 18.1 | 0.23728 | 0.08524 |
| conformer | 120.0 | 32.7 | 0.2442 | 0.091 |
| large-conformer | 399.0 | 103.0 | 0.239 | 0.0812 |
| alconformer | 33.2 | 10.5 | 0.30567 | 0.12267 |
| large-alconformer | 33.2 | 10.5 | 0.30567 | 0.12267 |


For WER and CER, lower is better. For comparison, here is the accuracy of Google Speech:

In [5]:
malaya_speech.stt.google_accuracy

{'WER': 0.1427,
 'CER': 0.04682,
 'last update': '2021-03-11',
 'library use': 'https://pypi.org/project/SpeechRecognition/',
 'notebook link': 'https://github.com/huseinzol05/malaya-speech/blob/master/data/semisupervised-audiobook/benchmark-google-speech-malaya-speech-test-dataset.ipynb'}
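
WER and CER are both edit-distance ratios, which is why lower is better. The block below is a minimal sketch of how such scores are commonly computed, not the benchmark code behind the numbers above. The hypothesis string is made up for illustration, and counting spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = 'nama saya shafiqah idayu'
hypothesis = 'nama saya shafiqa idayu'  # hypothetical output with one wrong word

print('WER:', wer(reference, hypothesis))
print('CER:', cer(reference, hypothesis))
```

One misspelled word out of four counts as a full word error for WER, while CER only counts the single missing character, so CER is usually the smaller of the two.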

### Load RNNT model

```python
def deep_transducer(
    model: str = 'conformer', quantized: bool = False, **kwargs
):
    """
    Load Encoder-Transducer ASR model.

    Parameters
    ----------
    model : str, optional (default='conformer')
        Model architecture supported. Allowed values:

        * ``'small-conformer'`` - SMALL size Google Conformer, https://arxiv.org/pdf/2005.08100.pdf
        * ``'conformer'`` - BASE size Google Conformer, https://arxiv.org/pdf/2005.08100.pdf
        * ``'large-conformer'`` - LARGE size Google Conformer, https://arxiv.org/pdf/2005.08100.pdf
        * ``'alconformer'`` - BASE size A-Lite Google Conformer.
        * ``'large-alconformer'`` - LARGE size A-Lite Google Conformer.
        
    quantized : bool, optional (default=False)
        if True, will load 8-bit quantized model.
        Quantized model is not necessarily faster, it depends entirely on the machine.

    Returns
    -------
    result : malaya_speech.model.tf.Transducer class
    """
```

In [19]:
model = malaya_speech.stt.deep_transducer(model = 'small-conformer')

### Load Quantized deep model

To load the 8-bit quantized model, simply pass `quantized = True`; the default is `False`.

Expect a slight accuracy drop from the quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

In [20]:
quantized_model = malaya_speech.stt.deep_transducer(model = 'small-conformer', quantized = True)
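
To see where the slight accuracy drop comes from, here is a minimal sketch of linear 8-bit quantization on a handful of made-up weights. It is an illustration only: the actual quantization scheme used for the malaya-speech models is not described here, and the function names and weight values are hypothetical.

```python
def quantize_uint8(values):
    """Map floats onto integer codes 0..255 with a linear scale and offset."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255
    if scale == 0:
        scale = 1.0  # all values identical, any scale works
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the 8-bit codes."""
    return [c * scale + lo for c in codes]


weights = [-0.42, -0.1, 0.0, 0.07, 0.31, 0.9]  # made-up float32-style weights
codes, scale, lo = quantize_uint8(weights)
restored = dequantize(codes, scale, lo)

# Rounding to the nearest code loses at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print('codes:', codes)
print('max rounding error:', max_error, '(half step is', scale / 2, ')')
```

Each weight is stored in one byte instead of four, at the cost of a small rounding error per weight. Whether inference is faster depends on how well the machine handles 8-bit arithmetic.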



### Load sample

In [26]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
record1, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-36-06_294832.wav')
record2, sr = malaya_speech.load('speech/record/savewav_2020-11-26_22-40-56_929661.wav')
shafiqah_idayu, sr = malaya_speech.load('speech/example-speaker/shafiqah-idayu.wav')
mas_aisyah, sr = malaya_speech.load('speech/example-speaker/mas-aisyah.wav')
khalil, sr = malaya_speech.load('speech/example-speaker/khalil-nooh.wav')

In [10]:
import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)

As we can hear, the speaker speaks in the Kedahan dialect mixed with some Arabic words. Let's see how good our model is.

In [11]:
ipd.Audio(record1, rate = sr)

In [12]:
ipd.Audio(record2, rate = sr)

In [27]:
ipd.Audio(shafiqah_idayu, rate = sr)

In [28]:
ipd.Audio(mas_aisyah, rate = sr)

In [30]:
ipd.Audio(khalil, rate = sr)

### Predict

We can choose between two decoders:

1. `greedy` decoder.
2. `beam` decoder; `beam_size` defaults to 5, feel free to change it.

```python
def predict(
    self, inputs, decoder: str = 'greedy', beam_size: int = 5, **kwargs
):
    """
    Transcribe inputs, will return list of strings.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    decoder: str, optional (default='greedy')
        decoder mode, allowed values:

        * ``'greedy'`` - will call self.greedy_decoder
        * ``'beam'`` - will call self.beam_decoder
    beam_size: int, optional (default=5)
        beam size for beam decoder.

    Returns
    -------
    result: List[str]
    """
```

### Greedy decoder

The greedy decoder can utilize batch processing and is faster than the beam decoder.

```python
def greedy_decoder(self, inputs):
    """
    Transcribe inputs, will return list of strings.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].

    Returns
    -------
    result: List[str]
    """
```

In [36]:
%%time

model.greedy_decoder([ceramah, record1, record2, shafiqah_idayu, mas_aisyah, khalil])

CPU times: user 2.25 s, sys: 1.58 s, total: 3.83 s
Wall time: 820 ms


['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ini',
 'helo nama saya bersin saya tak suka mandi ke tak saya masam',
 'helo nama saya hussein saya suka mandi saya mandi titik hari',
 'nama saya shafiqah idayu',
 'sebut perkataan angka',
 'tolong sebut anti kata']

In [37]:
%%time

quantized_model.greedy_decoder([ceramah, record1, record2, shafiqah_idayu, mas_aisyah, khalil])

CPU times: user 2.33 s, sys: 1.64 s, total: 3.96 s
Wall time: 812 ms


['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ini',
 'helo nama saya bersin saya tak suka mandi ke tak saya masam',
 'helo nama saya hussein saya suka mandi saya mandi titik hari',
 'nama saya shafiqah idayu',
 'sebut perkataan angka',
 'tolong sebut anti kata']

### Beam decoder

To get better results, use the beam decoder with a suitable beam size.

```python
def beam_decoder(self, inputs, beam_size: int = 5):
    """
    Transcribe inputs, will return list of strings.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.Frame].
    beam_size: int, optional (default=5)
        beam size for beam decoder.

    Returns
    -------
    result: List[str]
    """
```
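
The intuition for why a beam can beat greedy decoding, and why it costs more, can be sketched with a toy example. This is generic beam search over a made-up table of next-token probabilities. It is not the RNNT beam search (which also has to handle blank symbols and audio frames) and not the malaya-speech implementation.

```python
import math

# Hypothetical next-token probabilities, conditioned on the previous token.
PROBS = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a': {'x': 0.55, 'y': 0.45},
    'b': {'x': 0.9, 'y': 0.1},
}


def greedy(probs, steps):
    """Keep only the single most likely token at every step."""
    seq, score = ['<s>'], 0.0
    for _ in range(steps):
        token, p = max(probs[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(token)
        score += math.log(p)
    return seq[1:], score


def beam(probs, steps, beam_size):
    """Keep the `beam_size` best partial sequences at every step."""
    beams = [(['<s>'], 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + [token], score + math.log(p))
            for seq, score in beams
            for token, p in probs[seq[-1]].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    best_seq, best_score = beams[0]
    return best_seq[1:], best_score


greedy_seq, greedy_score = greedy(PROBS, steps=2)
beam_seq, beam_score = beam(PROBS, steps=2, beam_size=2)
print('greedy:', greedy_seq, math.exp(greedy_score))
print('beam  :', beam_seq, math.exp(beam_score))
```

Greedy commits to `'a'` at the first step and cannot recover, while the beam keeps `'b'` alive and finds a more probable sequence overall. The work per step grows with the number of hypotheses kept, which is why larger beam sizes take longer.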

In [33]:
%%time

model.beam_decoder([ceramah, record1, record2, shafiqah_idayu, mas_aisyah, khalil], beam_size = 3)

CPU times: user 17.1 s, sys: 2.13 s, total: 19.2 s
Wall time: 14.2 s


['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ini',
 'helo nama saya bersin saya tak suka mandi ke tak saya masam',
 'helo nama saya hussein saya suka mandi saya mandi titik hari',
 'nama saya shafiqah idayu',
 'sebut perkataan angka',
 'tolong sebut anti kata']

In [34]:
%%time

model.beam_decoder([ceramah, record1, record2, shafiqah_idayu, mas_aisyah, khalil], beam_size = 5)

CPU times: user 26.9 s, sys: 2.24 s, total: 29.1 s
Wall time: 23.3 s


['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ini',
 'helo nama saya bersin saya tak suka mandi ke tak saya masam',
 'helo nama saya hussein saya suka mandi saya mandi titik hari',
 'nama saya shafiqah idayu',
 'sebut perkataan angka',
 'tolong sebut anti kata']

In [35]:
%%time

model.beam_decoder([ceramah, record1, record2, shafiqah_idayu, mas_aisyah, khalil], beam_size = 10)

CPU times: user 53.9 s, sys: 3.01 s, total: 56.9 s
Wall time: 48.1 s


['jadi dalam perjalanan ini dunia yang susah ini ketika nabi mengajar muaz bin jabal tadi ni allah ini',
 'helo nama saya bersin saya tak suka mandi ke tak saya masam',
 'helo nama saya hussein saya suka mandi saya mandi titik hari',
 'nama saya shafiqah idayu',
 'sebut perkataan angka',
 'tolong sebut anti kata']

**The RNNT beam decoder cannot utilise batch processing; if you feed it a batch, it processes the inputs one by one.**
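
To put the cost in perspective, the wall times reported in the cells above can be compared directly. The numbers below are copied from those outputs; they are specific to the machine that ran this notebook, so treat the ratios as a rough indication only.

```python
# Wall times (seconds) copied from the notebook outputs above, same six clips.
greedy_wall_s = 0.820
beam_wall_s = {3: 14.2, 5: 23.3, 10: 48.1}  # beam_size -> wall time

# How many times slower each beam run was than the batched greedy run.
slowdown = {size: t / greedy_wall_s for size, t in beam_wall_s.items()}

for size, ratio in slowdown.items():
    print(f'beam_size={size}: {ratio:.1f}x slower than greedy')
```

On these six clips every beam size returned the same transcripts as the greedy decoder, so it is worth checking on your own data whether the extra time actually buys better results.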

### Predict timestamp

To find out when the speaker utters each subword, use `predict_timestamp`:

```python
def predict_timestamp(self, input):
    """
    Transcribe input and get timestamps; only the greedy decoder is supported.

    Parameters
    ----------
    input: np.array
        np.array or malaya_speech.model.frame.Frame.

    Returns
    -------
    result: List[Tuple[str, float]]
    """
```

In [39]:
%%time

model.predict_timestamp(shafiqah_idayu)

CPU times: user 177 ms, sys: 40.2 ms, total: 217 ms
Wall time: 89.8 ms


[('nam', 0.07),
 ('a_', 0.14),
 ('say', 0.16000001),
 ('a_', 0.24000001),
 ('sh', 0.33),
 ('af', 0.35000002),
 (b'i', 0.38000003),
 (b'q', 0.39000002),
 ('ah_', 0.43),
 ('ida', 0.45000002),
 ('yu', 0.52000004)]

In [41]:
%%time

quantized_model.predict_timestamp(shafiqah_idayu)

CPU times: user 187 ms, sys: 42.6 ms, total: 230 ms
Wall time: 91.2 ms


[('nam', 0.07),
 ('a_', 0.14),
 ('say', 0.16000001),
 ('a_', 0.24000001),
 ('sh', 0.33),
 ('af', 0.35000002),
 (b'i', 0.38000003),
 (b'q', 0.39000002),
 ('ah_', 0.43),
 ('ida', 0.45000002),
 ('yu', 0.52000004)]
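
Subword timestamps are often easier to use once merged into words. The helper below is a hypothetical sketch, not part of malaya-speech. It assumes, based on the output above, that a trailing `_` marks the end of a word and that some pieces may come back as `bytes`. Timestamps are kept exactly as returned.

```python
def merge_subwords(pieces, boundary='_'):
    """Merge (subword, timestamp) pairs into (word, start timestamp) pairs."""
    words, current, start = [], '', None
    for piece, timestamp in pieces:
        if isinstance(piece, bytes):
            piece = piece.decode('utf-8')  # some pieces come back as bytes
        if start is None:
            start = timestamp  # first subword of a new word
        if piece.endswith(boundary):
            # Assumed word boundary: close the current word.
            words.append((current + piece[:-len(boundary)], start))
            current, start = '', None
        else:
            current += piece
    if current:
        words.append((current, start))  # last word has no trailing marker
    return words


# Output of predict_timestamp copied from the cell above.
pieces = [
    ('nam', 0.07), ('a_', 0.14), ('say', 0.16000001), ('a_', 0.24000001),
    ('sh', 0.33), ('af', 0.35000002), (b'i', 0.38000003), (b'q', 0.39000002),
    ('ah_', 0.43), ('ida', 0.45000002), ('yu', 0.52000004),
]

words = merge_subwords(pieces)
for word, start in words:
    print(word, start)
```

Each word keeps the timestamp of its first subword, which is enough to align the transcript with the audio at word level.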