<a href="https://colab.research.google.com/github/scgupta/yearn2learn/blob/master/speech/asr/deepspeech/mozilla_deepspeech_api_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mozilla DeepSpeech API Exploration

Mozilla released [DeepSpeech 0.8.2](https://github.com/mozilla/DeepSpeech/releases/tag/v0.8.2) with [APIs in C, Java, .NET, Python, and JavaScript](https://deepspeech.readthedocs.io/en/v0.8.2/Python-API.html).

From the Colab menu, select **Runtime** > **Change runtime type** and verify that the runtime is set to Python 3. Select GPU if you want to try the GPU version.

You can install DeepSpeech with pip (use `deepspeech-gpu==0.8.2` instead if you are using a GPU in the Colab runtime):


In [1]:
!python --version

Python 3.6.9


In [2]:
!pip install deepspeech==0.8.2

Collecting deepspeech==0.8.2
  Downloading https://files.pythonhosted.org/packages/43/ff/f17ff70af03d27afb749f866cab2e6f5def29e02d5aa2762afc68ea92eab/deepspeech-0.8.2-cp36-cp36m-manylinux1_x86_64.whl (8.3MB)
     |████████████████████████████████| 8.3MB 7.1MB/s
Installing collected packages: deepspeech
Successfully installed deepspeech-0.8.2


## Download Models and Audio Files

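Mozilla has released models for US English. We will use those in this code lab. Note that the commands below download the v0.8.1 model files and store them in a directory named `deepspeech-0.8.2-models`.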

1. **Download the models:**


In [3]:
!mkdir deepspeech-0.8.2-models

In [6]:
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/deepspeech-0.8.1-models.pbmm
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/deepspeech-0.8.1-models.scorer

--2020-08-24 16:42:09--  https://github.com/mozilla/DeepSpeech/releases/download/v0.8.1/deepspeech-0.8.1-models.pbmm
Resolving github.com (github.com)... 140.82.118.4
Connecting to github.com (github.com)|140.82.118.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/mozilla/STT/releases/download/v0.8.1/deepspeech-0.8.1-models.pbmm [following]
--2020-08-24 16:42:09--  https://github.com/mozilla/STT/releases/download/v0.8.1/deepspeech-0.8.1-models.pbmm
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/60273704/ae836480-db4e-11ea-9bf7-2d4ba9ea96f3?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200824%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200824T164209Z&X-Amz-Expires=300&X-Amz-Signature=53a83f7eaa1805af19f42192917a8cc264f10f6aafa08c5114fbad6793defb5b&X-Amz-SignedHeaders=host&a

In [7]:
!mv deepspeech-0.8.1-models.pbmm deepspeech-0.8.1-models.scorer deepspeech-0.8.2-models/

In [8]:
!ls -l deepspeech-0.8.2-models/

total 1115520
-rw-r--r-- 1 root root 188915984 Aug 10 19:15 deepspeech-0.8.1-models.pbmm
-rw-r--r-- 1 root root 953363776 Aug 10 19:17 deepspeech-0.8.1-models.scorer


2. **Download audio data files**

In [9]:
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.8.2/audio-0.8.2.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   140  100   140    0     0    933      0 --:--:-- --:--:-- --:--:--   933
100   629  100   629    0     0   1947      0 --:--:-- --:--:-- --:--:--  1947
100  194k  100  194k    0     0   189k      0  0:00:01  0:00:01 --:--:-- 1488k


4. **Unzip audio files**

In [10]:
!tar -xvzf audio-0.8.2.tar.gz

._audio
audio/
audio/._2830-3980-0043.wav
audio/2830-3980-0043.wav
audio/._Attribution.txt
audio/Attribution.txt
audio/._4507-16021-0012.wav
audio/4507-16021-0012.wav
audio/._8455-210777-0068.wav
audio/8455-210777-0068.wav
audio/._License.txt
audio/License.txt


In [11]:
!ls -l ./audio/

total 260
-rw-r--r-- 1 501 staff 63244 Nov 18  2017 2830-3980-0043.wav
-rw-r--r-- 1 501 staff 87564 Nov 18  2017 4507-16021-0012.wav
-rw-r--r-- 1 501 staff 82924 Nov 18  2017 8455-210777-0068.wav
-rw-r--r-- 1 501 staff   340 May 14  2018 Attribution.txt
-rw-r--r-- 1 501 staff 18652 May 12  2018 License.txt


5. **Test that it all works**

In [12]:
!deepspeech --model deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer --audio ./audio/2830-3980-0043.wav

Loading model from file deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm
TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
2020-08-24 16:43:12.104037: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.00996s.
Loading scorer from files deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer
Loaded scorer in 0.000286s.
Running inference.
experience proves this
Inference took 0.768s for 1.975s audio file.


In [13]:
!deepspeech --model deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer --audio ./audio/4507-16021-0012.wav

Loading model from file deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm
TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
2020-08-24 16:43:24.364009: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.0125s.
Loading scorer from files deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer
Loaded scorer in 0.000213s.
Running inference.
why should one hall on the way
Inference took 1.052s for 2.735s audio file.


In [14]:
!deepspeech --model deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer --audio ./audio/8455-210777-0068.wav

Loading model from file deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm
TensorFlow: v2.2.0-24-g1c1b2b9
DeepSpeech: v0.8.2-0-g02e4c76
2020-08-24 16:43:47.018632: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.00975s.
Loading scorer from files deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer
Loaded scorer in 0.000193s.
Running inference.
your power is sufficient i said
Inference took 1.040s for 2.590s audio file.


In [15]:
!deepspeech --help

usage: deepspeech [-h] --model MODEL [--scorer SCORER] --audio AUDIO
                  [--beam_width BEAM_WIDTH] [--lm_alpha LM_ALPHA]
                  [--lm_beta LM_BETA] [--version] [--extended] [--json]
                  [--candidate_transcripts CANDIDATE_TRANSCRIPTS]

Running DeepSpeech inference.

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Path to the model (protocol buffer binary file)
  --scorer SCORER       Path to the external scorer file
  --audio AUDIO         Path to the audio file to run (WAV format)
  --beam_width BEAM_WIDTH
                        Beam width for the CTC decoder
  --lm_alpha LM_ALPHA   Language model weight (lm_alpha). If not specified,
                        use default from the scorer package.
  --lm_beta LM_BETA     Word insertion bonus (lm_beta). If not specified, use
                        default from the scorer package.
  --version             Print version and exits
  --extended          

Examine the output of the last three `deepspeech` commands, and you will see the results *“experience proves this”*, *“why should one hall on the way”* (the model hears “hall” where “halt” is the likely word), and *“your power is sufficient i said”* respectively. You are all set.
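
The timing lines in the logs above ("Inference took ... for ... audio file") can be turned into a real-time factor, that is, inference time divided by audio duration. The sketch below is only an illustration: the numbers are copied from the three runs in this notebook, and `real_time_factor` is a hypothetical helper name, not part of DeepSpeech. A value below 1 means transcription ran faster than the audio plays.

```python
# (inference seconds, audio seconds) copied from the three CLI runs above
runs = {
    '2830-3980-0043.wav': (0.768, 1.975),
    '4507-16021-0012.wav': (1.052, 2.735),
    '8455-210777-0068.wav': (1.040, 2.590),
}


def real_time_factor(inference_s, audio_s):
    """Hypothetical helper: inference time divided by audio duration."""
    return inference_s / audio_s


for name, (inference_s, audio_s) in runs.items():
    print(f'{name}: RTF = {real_time_factor(inference_s, audio_s):.2f}')
```

Timings will differ between runs and machines, so treat these figures as a snapshot of this particular Colab session.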

# DeepSpeech API

1.   **Import deepspeech**

In [16]:
import deepspeech

2. **Create a model**

In [17]:
model_file_path = 'deepspeech-0.8.2-models/deepspeech-0.8.1-models.pbmm'
model = deepspeech.Model(model_file_path)

3. **Add scorer and other parameters**

In [18]:
scorer_file_path = 'deepspeech-0.8.2-models/deepspeech-0.8.1-models.scorer'
model.enableExternalScorer(scorer_file_path)

lm_alpha = 0.75
lm_beta = 1.85
model.setScorerAlphaBeta(lm_alpha, lm_beta)

beam_width = 500
model.setBeamWidth(beam_width)

0

## Batch API

1.   **Read an input wav file**


In [19]:
import wave

filename = 'audio/8455-210777-0068.wav'
# Read the whole file into memory; the context manager closes it afterwards.
with wave.open(filename, 'rb') as w:
    rate = w.getframerate()
    frames = w.getnframes()
    buffer = w.readframes(frames)

Check the sample rate and the buffer type:

In [20]:
print(rate)
print(model.sampleRate())
print(str(type(buffer)))

16000
16000
<class 'bytes'>


As you can see, the sample rate of the WAV file is 16000 Hz, the same as the model’s sample rate. However, the buffer is a `bytes` object, whereas the DeepSpeech model expects an array of 16-bit integers.

2.  **Convert byte array buffer to int16 array**

In [21]:
import numpy as np
data16 = np.frombuffer(buffer, dtype=np.int16)
print(str(type(data16)))

<class 'numpy.ndarray'>


3.  **Run speech-to-text in batch mode to get the text**

In [22]:
text = model.stt(data16)
print(text)

your power is sufficient i said
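
The batch example above works because the sample file already matches the model's sample rate and uses 16-bit samples. For your own recordings it may help to check this before calling `stt()`. The following is a minimal sketch using only the standard `wave` module. `describe_wav` is a hypothetical helper name, and the mono requirement is an assumption based on DeepSpeech's documented input format rather than something this notebook demonstrates. The demo runs on a synthetic in-memory WAV so that it does not depend on the downloaded audio.

```python
import io
import wave


def describe_wav(source, expected_rate=16000):
    """Return basic WAV properties and whether they look like 16-bit mono audio.

    `source` can be a file path or a binary file-like object.
    """
    with wave.open(source, 'rb') as w:
        channels = w.getnchannels()
        sample_width = w.getsampwidth()  # bytes per sample; 2 means 16-bit
        rate = w.getframerate()
        frames = w.getnframes()
    return {
        'channels': channels,
        'sample_width': sample_width,
        'rate': rate,
        'duration_s': frames / rate,
        # Assumption: the model wants mono, 16-bit samples at expected_rate.
        'compatible': channels == 1 and sample_width == 2 and rate == expected_rate,
    }


# Demo: one second of 16-bit mono silence at 16 kHz, written to memory.
demo = io.BytesIO()
with wave.open(demo, 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b'\x00\x00' * 16000)
demo.seek(0)

info = describe_wav(demo)
print(info)
```

In this notebook you could call `describe_wav(filename, expected_rate=model.sampleRate())` so that the expected rate comes from the loaded model instead of a hard-coded value.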


## Streaming API

Now let’s accomplish the same using the streaming API. It consists of three steps: open a session, feed data, and close the session.

1.  **Open a streaming session**

In [23]:
ds_stream = model.createStream()

2.  **Repeatedly feed chunks of speech buffer, and get interim results if desired**

In [24]:
buffer_len = len(buffer)
offset = 0
batch_size = 16384
text = ''
while offset < buffer_len:
    end_offset = offset + batch_size
    chunk = buffer[offset:end_offset]
    data16 = np.frombuffer(chunk, dtype=np.int16)
    ds_stream.feedAudioContent(data16)
    text = ds_stream.intermediateDecode()
    print(text)
    offset = end_offset



your power i
your power is suffi
your power is sufficient i said
your power is sufficient i said


3.  **Close stream and get the final result**

In [25]:
text = ds_stream.finishStream()
print(text)

your power is sufficient i said


Verify that the output is the same as the batch API output: "your power is sufficient i said".
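
The chunking logic in the streaming loop above can be factored into a small generator. This is a sketch, not part of the DeepSpeech API: `iter_int16_chunks` is a hypothetical helper, and the zero-filled buffer only stands in for real audio. It keeps every chunk a whole number of 16-bit samples, since `np.frombuffer(chunk, dtype=np.int16)` raises an error when the chunk length is not a multiple of 2 bytes.

```python
def iter_int16_chunks(buffer, chunk_bytes=16384, sample_width=2):
    """Yield consecutive slices of `buffer` that contain only whole samples."""
    if chunk_bytes <= 0 or chunk_bytes % sample_width != 0:
        raise ValueError('chunk_bytes must be a positive multiple of sample_width')
    # Ignore a trailing partial sample, if any, so every chunk converts cleanly.
    usable = len(buffer) - (len(buffer) % sample_width)
    for offset in range(0, usable, chunk_bytes):
        yield buffer[offset:min(offset + chunk_bytes, usable)]


# Demo with 40000 zero bytes standing in for the audio buffer.
fake_buffer = bytes(40000)
sizes = [len(chunk) for chunk in iter_int16_chunks(fake_buffer)]
print(sizes)  # [16384, 16384, 7232]
```

In the loop above, each yielded chunk would be converted with `np.frombuffer(chunk, dtype=np.int16)` and passed to `ds_stream.feedAudioContent()`, exactly as the original cell does.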

# Recap

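DeepSpeech has two modes: batch and streaming. In both, the first step is to create a model object. Then either call `stt()` for batch transcription, or open a stream with `createStream()`, feed it audio with `feedAudioContent()`, and call `finishStream()` to get the text.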
<br/>

---
&copy; 2020 Satish Chandra Gupta