<a href="https://colab.research.google.com/github/scgupta/ml4devs-notebooks/blob/master/speech/asr/deepspeech/mozilla_deepspeech_api_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mozilla DeepSpeech API Exploration

Mozilla released [DeepSpeech 0.8.2](https://github.com/mozilla/DeepSpeech/releases/tag/v0.8.2) with [APIs in C, Java, .NET, Python, and JavaScript](https://deepspeech.readthedocs.io/en/v0.8.2/Python-API.html).

From the Colab menu, select **Runtime** > **Change runtime type** and verify that the runtime is set to Python 3. Select GPU as the hardware accelerator if you want to try the GPU version.

Install DeepSpeech with pip (use `deepspeech-gpu==0.8.2` instead if the Colab runtime has a GPU):


In [None]:
!python --version

In [None]:
!pip install deepspeech==0.8.2

## Download Models and Audio Files

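Mozilla has released models for US English, which we will use in this code lab. Note that the cells below download the model files published with the v0.8.1 release and store them in a directory named `deepspeech-0.8.2-models`.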

1. **Download the models:**


In [None]:
!mkdir deepspeech-0.8.2-models

In [None]:
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/deepspeech-0.8.1-models.pbmm
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/deepspeech-0.8.1-models.scorer

In [None]:
!mv deepspeech-0.8.1-models.pbmm deepspeech-0.8.1-models.scorer deepspeech-0.8.2-models/

In [None]:
!ls -l deepspeech-0.8.2-models/

2. **Download audio data files**

In [None]:
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/audio-0.8.2.tar.gz

3. **Extract audio files**

In [None]:
!tar -xvzf audio-0.8.2.tar.gz

In [None]:
!ls -l ./audio/

4. **Test that it all works**

In [None]:
!deepspeech --model deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer --audio ./audio/2830-3980-0043.wav

In [None]:
!deepspeech --model deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer --audio ./audio/4507-16021-0012.wav

In [None]:
!deepspeech --model deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer --audio ./audio/8455-210777-0068.wav

In [None]:
!deepspeech --help

Examine the output of the three transcription commands above (the cells before `deepspeech --help`), and you will see the results *“experience proof less”*, *“why should one halt on the way”*, and *“your power is sufficient i said”* respectively. You are all set.

# DeepSpeech API

1.   **Import deepspeech**

In [None]:
import deepspeech

2. **Create a model**

In [None]:
model_file_path = 'deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm'
model = deepspeech.Model(model_file_path)

3. **Add scorer and other parameters**

In [None]:
scorer_file_path = 'deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer'
model.enableExternalScorer(scorer_file_path)

lm_alpha = 0.75
lm_beta = 1.85
model.setScorerAlphaBeta(lm_alpha, lm_beta)

beam_width = 500
model.setBeamWidth(beam_width)
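
The configuration above can be wrapped in a small helper so that every later example shares one setup path. This is only a sketch: `configure_model` and its parameter names are hypothetical and not part of the DeepSpeech package. It calls only the three `Model` methods used in the cell above, and it enables the scorer before setting alpha and beta.

```python
def configure_model(model, scorer_path=None, lm_alpha=None, lm_beta=None, beam_width=None):
    """Apply optional decoder settings to an already created model object.

    Hypothetical helper, not part of DeepSpeech. `model` is assumed to expose
    enableExternalScorer, setScorerAlphaBeta and setBeamWidth, as used above.
    """
    if scorer_path is not None:
        model.enableExternalScorer(scorer_path)
    # Alpha and beta are set as a pair, so skip the call unless both are given.
    if lm_alpha is not None and lm_beta is not None:
        model.setScorerAlphaBeta(lm_alpha, lm_beta)
    if beam_width is not None:
        model.setBeamWidth(beam_width)
    return model
```

In this notebook it could be called as `configure_model(model, scorer_file_path, 0.75, 1.85, 500)`.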

## Batch API

1.   **Read an input wav file**


In [None]:
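import wave

filename = 'audio/8455-210777-0068.wav'
# Open the WAV file read-only; the with-block closes it automatically.
with wave.open(filename, 'rb') as w:
    rate = w.getframerate()        # sample rate in Hz
    frames = w.getnframes()        # number of audio frames
    buffer = w.readframes(frames)  # raw PCM data as bytes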

Check the sample rate and the buffer type:

In [None]:
print(rate)
print(model.sampleRate())
print(str(type(buffer)))

As you can see, the sample rate of the wav file is 16000 Hz, the same as the model’s sample rate. However, the buffer is a `bytes` object, whereas the DeepSpeech model expects an array of 16-bit integers.
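
For other audio files it can help to check the format before handing the data to the model. The sketch below uses only the standard `wave` module. The function names are hypothetical, the default of 16000 Hz is taken from the rate printed above, and the mono check is an assumption that goes beyond what this notebook states.

```python
import wave

def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate_hz) for a WAV file."""
    with wave.open(path, 'rb') as w:
        return w.getnchannels(), w.getsampwidth(), w.getframerate()

def is_model_compatible(path, expected_rate=16000):
    """True if the file is mono, 16-bit PCM at the expected sample rate.

    expected_rate defaults to 16000 Hz; in this notebook you could pass
    model.sampleRate() instead. The mono requirement is an assumption.
    """
    channels, sample_width, rate = describe_wav(path)
    return channels == 1 and sample_width == 2 and rate == expected_rate
```

A call such as `is_model_compatible(filename, model.sampleRate())` would then guard the conversion step that follows.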

2.  **Convert byte array buffer to int16 array**

In [None]:
import numpy as np
data16 = np.frombuffer(buffer, dtype=np.int16)
print(str(type(data16)))
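
To see what the conversion does, here is a small standalone sketch on synthetic data rather than on the audio file. Each 16-bit sample occupies two bytes, so the array has half as many elements as the buffer has bytes. The explicit little-endian dtype `'<i2'` is used here for illustration only; the cell above uses `np.int16`.

```python
import struct
import numpy as np

# Three 16-bit samples packed as little-endian signed integers (6 bytes in total).
raw = struct.pack('<3h', 0, 1000, -1000)

# '<i2' means little-endian, 2-byte signed integer.
samples = np.frombuffer(raw, dtype='<i2')

print(len(raw), len(samples))  # 6 bytes become 3 samples
print(samples.tolist())        # [0, 1000, -1000]
```

`np.frombuffer` returns a read-only view over the bytes, so no copy of the audio data is made.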

3.  **Run speech-to-text in batch mode to get the text**

In [None]:
text = model.stt(data16)
print(text)
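
The three batch steps can be gathered into one function. This is a minimal sketch: `transcribe_wav` is a hypothetical helper, and `model` is assumed to behave like the `deepspeech.Model` created earlier, of which only `sampleRate()` and `stt()` are used.

```python
import wave
import numpy as np

def transcribe_wav(model, path):
    """Read a 16-bit PCM WAV file and return model.stt() on its samples.

    Hypothetical helper; raises ValueError if the file's sample rate does
    not match the rate reported by the model.
    """
    with wave.open(path, 'rb') as w:
        rate = w.getframerate()
        raw = w.readframes(w.getnframes())
    if rate != model.sampleRate():
        raise ValueError(f'{path} is {rate} Hz, model expects {model.sampleRate()} Hz')
    # Reinterpret the raw bytes as 16-bit samples before transcription.
    return model.stt(np.frombuffer(raw, dtype=np.int16))
```

With the objects defined in this notebook, `transcribe_wav(model, 'audio/8455-210777-0068.wav')` would run the same steps as the cells above.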

## Streaming API

Now let’s accomplish the same using the streaming API. It consists of three steps: open a session, feed data, and close the session.

1.  **Open a streaming session**

In [None]:
ds_stream = model.createStream()

2.  **Repeatedly feed chunks of speech buffer, and get interim results if desired**

In [None]:
buffer_len = len(buffer)
offset = 0
batch_size = 16384  # chunk size in bytes (8192 int16 samples)
text = ''
while offset < buffer_len:
    end_offset = offset + batch_size
    chunk = buffer[offset:end_offset]  # slicing past the end is safe in Python
    data16 = np.frombuffer(chunk, dtype=np.int16)
    ds_stream.feedAudioContent(data16)
    text = ds_stream.intermediateDecode()  # transcript of the audio fed so far
    print(text)
    offset = end_offset
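
The chunking logic in the loop above can also be written as a generator, which keeps the offset bookkeeping out of the feeding loop. This is a sketch with a hypothetical name, `iter_int16_chunks`; the default of 16384 bytes mirrors `batch_size` above. It requires an even chunk size so that no 16-bit sample is split across two chunks.

```python
import numpy as np

def iter_int16_chunks(raw, chunk_bytes=16384):
    """Yield successive int16 arrays covering `raw`, a bytes object of 16-bit PCM."""
    if chunk_bytes <= 0 or chunk_bytes % 2:
        raise ValueError('chunk_bytes must be a positive even number')
    # Ignore a trailing odd byte, if any, since it cannot form a full sample.
    usable = len(raw) - (len(raw) % 2)
    for offset in range(0, usable, chunk_bytes):
        end = min(offset + chunk_bytes, usable)
        yield np.frombuffer(raw[offset:end], dtype=np.int16)
```

In this notebook, `for chunk in iter_int16_chunks(buffer): ds_stream.feedAudioContent(chunk)` would feed the same data as the `while` loop.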

3.  **Close stream and get the final result**

In [None]:
text = ds_stream.finishStream()
print(text)

Verify that the output is the same as the batch API output: "your power is sufficient i said".

# Recap

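DeepSpeech has two modes: batch and streaming. In both, the first step is to create a model object. Then either call `stt()` on the whole audio buffer, or create a stream with `createStream()`, feed it chunks with `feedAudioContent()`, and call `finishStream()` to get the transcript.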
<br/>

---
&copy; 2020 Satish Chandra Gupta