# Speech-to-Text CTC

Encoder model + CTC loss

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [malaya-speech/example/stt-ctc-model](https://github.com/huseinzol05/malaya-speech/tree/master/example/stt-ctc-model).
    
</div>

<div class="alert alert-warning">

This module is not language independent, so it is not safe to use on other languages. The pretrained models were trained on hyperlocal languages.
    
</div>

<div class="alert alert-warning">

This is an application of malaya-speech Pipeline, read more about malaya-speech Pipeline at [malaya-speech/example/pipeline](https://github.com/huseinzol05/malaya-speech/tree/master/example/pipeline).
    
</div>

In [1]:
import malaya_speech
import numpy as np
from malaya_speech import Pipeline

### List available CTC models

In [2]:
malaya_speech.stt.available_ctc()

| Model | Size (MB) | Quantized Size (MB) | WER | CER |
|---|---|---|---|---|
| quartznet | 77.2 | 20.2 | 0.0 | 0.0 |
| mini-jasper | 97.8 | 20.2 | 0.0 | 0.0 |
| jasper | 97.8 | 20.2 | 0.0 | 0.0 |
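The WER and CER columns are word and character error rates. The notebook does not show how they were computed, so the following is only a minimal sketch of the usual definition: edit distance between reference and hypothesis, divided by the reference length. The helper names `edit_distance`, `wer`, and `cer` are illustrative and are not part of malaya_speech.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

Given a ground-truth transcript for one of the samples, `wer(truth, prediction)` and `cer(truth, prediction)` give a quick way to compare the decoders used later in this tutorial.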


### Load CTC model

In [3]:
model = malaya_speech.stt.deep_ctc(model = 'quartznet')






### Load Quantized deep model

To load an 8-bit quantized model, pass `quantized = True`; the default is `False`.

Expect a slight accuracy drop from a quantized model. It is also not necessarily faster than the normal 32-bit float model; that depends entirely on the machine.

In [4]:
quantized_model = malaya_speech.stt.deep_ctc(model = 'quartznet', quantized = True)
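Because any speed-up from quantization depends on the machine, it is worth measuring rather than assuming. The sketch below times any callable with `time.perf_counter`; the helper name `mean_latency` and its parameters are hypothetical and not part of malaya_speech.

```python
import time


def mean_latency(predict_fn, inputs, repeats: int = 5) -> float:
    """Return the mean wall-clock time in seconds of predict_fn(inputs)."""
    predict_fn(inputs)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        predict_fn(inputs)
    return (time.perf_counter() - start) / repeats
```

Once both models and an audio sample are loaded, comparing `mean_latency(model.predict, [ceramah])` with `mean_latency(quantized_model.predict, [ceramah])` shows which variant is faster on the current hardware.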



### Load sample

In [5]:
ceramah, sr = malaya_speech.load('speech/khutbah/wadi-annuar.wav')
podcast1, sr = malaya_speech.load('speech/podcast/manglish-1.wav')
podcast2, sr = malaya_speech.load('speech/podcast/manglish-3.wav')

In [6]:
import IPython.display as ipd

ipd.Audio(ceramah, rate = sr)

As we can hear, the speaker uses a Kedahan dialect mixed with some Arabic words. Let's see how well our model handles it.

In [7]:
ipd.Audio(podcast1, rate = sr)

In [8]:
ipd.Audio(podcast2, rate = sr)

### Predict using default CTC

The default CTC decoder is from TensorFlow. We can choose between:

1. `greedy` decoder, which automatically sets `beam_size` to 1.
2. `beam` decoder, where `beam_size` defaults to 100; feel free to change it.

```python
def predict(
    self, inputs, decoder: str = 'beam', beam_size: int = 100, **kwargs
):
    """
    Transcribe inputs, will return list of strings.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.FRAME].
    decoder: str, optional (default='beam')
        decoder mode, allowed values:

        * ``'greedy'`` - greedy decoder.
        * ``'beam'`` - beam decoder.
    beam_size: int, optional (default=100)
        beam size for beam decoder.

    Returns
    -------
    result: List[str]
    """
```
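To make the `greedy` option concrete, here is a minimal sketch of what greedy CTC decoding does for a single utterance: pick the best label in each frame, merge consecutive repeats, then drop blanks. It is an illustration only, not the TensorFlow or malaya_speech implementation. The function name `ctc_greedy_decode`, the `vocab` argument, and the assumption that the blank symbol is the last index are all hypothetical.

```python
def ctc_greedy_decode(frame_scores, vocab: str) -> str:
    """Greedy CTC decoding for one utterance.

    frame_scores: sequence of per-frame score rows, each of length len(vocab) + 1.
    vocab: string whose i-th character is the label for index i.
    The blank symbol is assumed to be the last index, len(vocab).
    """
    blank = len(vocab)
    chars = []
    prev = blank
    for frame in frame_scores:
        # Index of the highest-scoring label in this frame.
        idx = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the label changes and is not the blank.
        if idx != prev and idx != blank:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)
```

A beam decoder keeps several candidate label sequences per frame instead of only the single best one, which is why `beam_size = 1` reduces to the greedy case.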

In [9]:
model.predict([ceramah, podcast1, podcast2])

["jadi dalam perjalangan ini binia yang ueuah ini ketika nabi mengajar moaubin jabat tadi ni alahma' ai ni",
 '720 i had bsoght the iphsne 1 fs',
 'be inupired by utof like that and rather than a ysotober']

In [10]:
quantized_model.predict([ceramah, podcast1, podcast2])

["tadi dalam perjalangan ini binia yang ueuah ini ketika dabi mengajar moaubin jabat tadi ni alahma' ai ni",
 '720 i had bsoght the iphsne10 fsr',
 'be inupired by utof like that and rather than a ysotober']

The results are not that good, so let's try the CTC model with a language model.

### Predict using LM CTC

```python
def predict_lm(self, inputs, lm, beam_size: int = 100, **kwargs):
    """
    Transcribe inputs using Beam Search + LM, will return list of strings.
    This method cannot use batch decoding; instead it loops and decodes each element in turn.

    Parameters
    ----------
    inputs: List[np.array]
        List[np.array] or List[malaya_speech.model.frame.FRAME].
    lm: ctc_decoders.Scorer
        Returned from `malaya_speech.stt.language_model()`.
    beam_size: int, optional (default=100)
        beam size for beam decoder.


    Returns
    -------
    result: List[str]
    """
```
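The idea behind beam search with a language model can be sketched as rescoring: each candidate transcript gets its acoustic (CTC) log-probability plus a weighted language-model log-probability and a word-count bonus. This is a common formulation and only an assumption about how the scorer combines the terms; the names `rescore`, `toy_lm`, `alpha`, and `beta`, and the toy unigram table, are made up for illustration.

```python
import math


def rescore(candidates, lm_logprob, alpha: float = 0.5, beta: float = 1.0) -> str:
    """Return the candidate text with the best combined score.

    candidates: list of (text, acoustic_logprob) pairs, e.g. beam search outputs.
    lm_logprob: callable mapping a text to its language-model log-probability.
    alpha: weight of the language model; beta: bonus per word.
    """
    def total(item):
        text, acoustic = item
        return acoustic + alpha * lm_logprob(text) + beta * len(text.split())

    return max(candidates, key=total)[0]


# Toy unigram "language model" with made-up probabilities, for illustration only.
UNIGRAM = {"the": 0.3, "cat": 0.1, "sat": 0.1}


def toy_lm(text: str) -> float:
    # Words missing from the table get a small floor probability.
    return sum(math.log(UNIGRAM.get(word, 1e-6)) for word in text.split())
```

With `alpha = 0` and `beta = 0` the acoustic score alone decides. Raising `alpha` lets the language model pull the result toward spellings it has seen, which matches the kind of corrections visible in the outputs below.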

In [11]:
lm = malaya_speech.stt.language_model(model = 'malaya-speech-wikipedia')

In [12]:
model.predict_lm([ceramah, podcast1, podcast2], lm)

['jadi dalam perjalanan ini ini yang uoah ini ketika nabi mengajar mobin cabar tadi ni alamanni',
 '720 i had bot he iphone 11',
 'be in upi my utor like that and rather than a soto']

In [13]:
quantized_model.predict_lm([ceramah, podcast1, podcast2], lm)

['jadi dalam perjalanan ini ini yang uoah ini ketika nabi mengajar mokibabak tadi ni alamanni',
 '720 i had bot the iphone 11 ',
 'be in upi my utor like that and rather than a soto']