<a href="https://colab.research.google.com/github/ml4devs/ml4devs-notebooks/blob/master/speech/asr/deepspeech/mozilla_deepspeech_api_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center>How to Build Python Transcriber Using Mozilla DeepSpeech</center></h1>

<p><center>
<address>&copy; Satish Chandra Gupta<br/>
LinkedIn: <a href="https://www.linkedin.com/in/scgupta/">scgupta</a>,
Twitter: <a href="https://twitter.com/scgupta">scgupta</a>
</address> 
</center></p>

---

Blog post: [How to Build Python Transcriber Using Mozilla Deepspeech](https://www.ml4devs.com/articles/how-to-build-python-transcriber-using-mozilla-deepspeech/)

Update: [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech) is no longer maintained. Its new home is [Coqui STT](https://github.com/coqui-ai/STT), which has the same [APIs in C, Java, .NET, Python, and JavaScript](https://stt.readthedocs.io/) (it appears that the team has moved too). This notebook is tested with [Coqui STT 1.4.0](https://github.com/coqui-ai/STT/releases/tag/v1.4.0).

From the Colab menu, select **Runtime** > **Change runtime type**, verify that it is set to Python 3, and select GPU if you want to try out the GPU version.

You can [pip-install Coqui STT](https://pypi.org/project/stt/):

In [1]:
!python --version

Python 3.7.15


In [2]:
!pip install stt==1.4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Download Models and Audio Files

Coqui has released models for US English; we will use those in this code lab.

1. **Download the models:**
Models can be downloaded from the [Coqui Model repository](https://coqui.ai/models). This notebook uses [English STT v1.0.0 (Large Vocabulary)](https://coqui.ai/english/coqui/v1.0.0-large-vocab).

In [3]:
!mkdir coqui-stt-1.0.0-models

In [4]:
!wget https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/model.tflite

--2022-11-01 08:48:10--  https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/model.tflite
Resolving coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)... 54.70.21.136, 35.155.221.103
Connecting to coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)|54.70.21.136|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/model.tflite [following]
--2022-11-01 08:48:10--  https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/model.tflite
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/351871871/e6d0f95f-97dc-43ac-ac08-38660209ebbc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221101%2Fus-east-1%2Fs3%2Faws4_r

In [5]:
!wget https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/large_vocabulary.scorer

--2022-11-01 08:48:11--  https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/large_vocabulary.scorer
Resolving coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)... 54.70.21.136, 35.155.221.103
Connecting to coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)|54.70.21.136|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/large_vocabulary.scorer [following]
--2022-11-01 08:48:11--  https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/large_vocabulary.scorer
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/351871871/1df256c5-336b-424b-b7b9-a33d8262eb24?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F2

In [6]:
!wget https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/alphabet.txt

--2022-11-01 08:48:13--  https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/alphabet.txt
Resolving coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)... 54.70.21.136, 35.155.221.103
Connecting to coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)|54.70.21.136|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/alphabet.txt [following]
--2022-11-01 08:48:13--  https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/alphabet.txt
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/351871871/17a8ffed-fd5a-4225-bb12-884c66c87c62?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221101%2Fus-east-1%2Fs3%2Faws4_r

In [7]:
!wget https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/MODEL_CARD

--2022-11-01 08:48:14--  https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/MODEL_CARD
Resolving coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)... 54.70.21.136, 35.155.221.103
Connecting to coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)|54.70.21.136|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/MODEL_CARD [following]
--2022-11-01 08:48:14--  https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/MODEL_CARD
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/351871871/b03c95a9-30e2-420d-b07e-413b44525bf0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221101%2Fus-east-1%2Fs3%2Faws4_request

In [8]:
!wget https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/LOG_TESTING

--2022-11-01 08:48:15--  https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/LOG_TESTING
Resolving coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)... 54.70.21.136, 35.155.221.103
Connecting to coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)|54.70.21.136|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/LOG_TESTING [following]
--2022-11-01 08:48:15--  https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/LOG_TESTING
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/351871871/f33b2c5a-c27e-47b1-9870-4f2a190a4a83?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221101%2Fus-east-1%2Fs3%2Faws4_requ

In [9]:
!wget https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/LICENSE

--2022-11-01 08:48:15--  https://coqui.gateway.scarf.sh/english/coqui/v1.0.0-large-vocab/LICENSE
Resolving coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)... 54.70.21.136, 35.155.221.103
Connecting to coqui.gateway.scarf.sh (coqui.gateway.scarf.sh)|54.70.21.136|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/LICENSE [following]
--2022-11-01 08:48:16--  https://github.com/coqui-ai/STT-models/releases/download/english/coqui/v1.0.0-large-vocab/LICENSE
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/351871871/dc69c571-83ca-48c1-9b31-408e9be73bc1?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20221101%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Da

In [10]:
!mv model.tflite large_vocabulary.scorer alphabet.txt MODEL_CARD LOG_TESTING LICENSE coqui-stt-1.0.0-models/

In [11]:
!ls -l coqui-stt-1.0.0-models

total 175816
-rw-r--r-- 1 root root       329 Dec  7  2021 alphabet.txt
-rw-r--r-- 1 root root 132644544 Dec  7  2021 large_vocabulary.scorer
-rw-r--r-- 1 root root     11358 Dec  7  2021 LICENSE
-rw-r--r-- 1 root root     25391 Dec  7  2021 LOG_TESTING
-rw-r--r-- 1 root root      4244 Dec  7  2021 MODEL_CARD
-rw-r--r-- 1 root root  47332120 Dec  7  2021 model.tflite
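Before running inference, it can help to confirm that the downloads actually landed in the model directory. The following is a minimal sketch using only the standard library; `missing_model_files` and `REQUIRED_FILES` are hypothetical names introduced here for illustration, and the choice of which files count as "required" is an assumption based on the `stt` commands used later in this notebook.

```python
from pathlib import Path

# File names taken from the download cells above. Treating only these two as
# required is an assumption: they are the ones passed to `stt` later on.
REQUIRED_FILES = ("model.tflite", "large_vocabulary.scorer")

def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent or empty in model_dir."""
    model_dir = Path(model_dir)
    missing = []
    for name in required:
        path = model_dir / name
        # A zero-byte file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing

print(missing_model_files("coqui-stt-1.0.0-models"))
```

An empty list means both files are present and non-empty.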


2. **Download audio data files**

In [12]:
!curl -LO https://github.com/coqui-ai/STT/releases/download/v1.4.0/audio-1.4.0.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  193k  100  193k    0     0   609k      0 --:--:-- --:--:-- --:--:--  609k


3. **Extract audio files**

In [13]:
!tar -xvzf audio-1.4.0.tar.gz

._audio
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
audio/
audio/._2830-3980-0043.wav
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
audio/2830-3980-0043.wav
audio/._Attribution.txt
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
audio/Attribution.txt
audio/._4507-16021-0012.wav
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
audio/4507-16021-0012.wav
audio/._8455-210777-0068.wav
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
audio/8455-210777-0068.wav
audio/._License.txt
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
audio/License.txt


In [14]:
!ls -l ./audio/

total 260
-rw-r--r-- 1 501 staff 63244 Nov 18  2017 2830-3980-0043.wav
-rw-r--r-- 1 501 staff 87564 Nov 18  2017 4507-16021-0012.wav
-rw-r--r-- 1 501 staff 82924 Nov 18  2017 8455-210777-0068.wav
-rw-r--r-- 1 501 staff   340 May 14  2018 Attribution.txt
-rw-r--r-- 1 501 staff 18652 May 12  2018 License.txt


4. **Test that it all works**

In [15]:
!stt --help

usage: stt [-h] --model MODEL [--scorer SCORER] --audio AUDIO
           [--beam_width BEAM_WIDTH] [--lm_alpha LM_ALPHA] [--lm_beta LM_BETA]
           [--version] [--extended] [--json]
           [--candidate_transcripts CANDIDATE_TRANSCRIPTS]
           [--hot_words HOT_WORDS]

Running Coqui STT inference.

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Path to the model (protocol buffer binary file)
  --scorer SCORER       Path to the external scorer file
  --audio AUDIO         Path to the audio file to run (WAV format)
  --beam_width BEAM_WIDTH
                        Beam width for the CTC decoder
  --lm_alpha LM_ALPHA   Language model weight (lm_alpha). If not specified,
                        use default from the scorer package.
  --lm_beta LM_BETA     Word insertion bonus (lm_beta). If not specified, use
                        default from the scorer package.
  --version             Print version and exits
  --extended    

In [16]:
!stt --model coqui-stt-1.0.0-models/model.tflite --scorer coqui-stt-1.0.0-models/large_vocabulary.scorer --audio ./audio/2830-3980-0043.wav

Loading model from file coqui-stt-1.0.0-models/model.tflite
TensorFlow: v2.9.1-11-gf8242ebc005
 Coqui STT: v1.4.0-0-gfcec06bd
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Loaded model in 0.00167s.
Loading scorer from files coqui-stt-1.0.0-models/large_vocabulary.scorer
Loaded scorer in 0.000265s.
Running inference.
experience proves this
Inference took 0.795s for 1.975s audio file.


In [17]:
!stt --model coqui-stt-1.0.0-models/model.tflite --scorer coqui-stt-1.0.0-models/large_vocabulary.scorer --audio ./audio/4507-16021-0012.wav

Loading model from file coqui-stt-1.0.0-models/model.tflite
TensorFlow: v2.9.1-11-gf8242ebc005
 Coqui STT: v1.4.0-0-gfcec06bd
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Loaded model in 0.00149s.
Loading scorer from files coqui-stt-1.0.0-models/large_vocabulary.scorer
Loaded scorer in 0.000223s.
Running inference.
why should one halt on the way
Inference took 0.910s for 2.735s audio file.


In [18]:
!stt --model coqui-stt-1.0.0-models/model.tflite --scorer coqui-stt-1.0.0-models/large_vocabulary.scorer --audio ./audio/8455-210777-0068.wav

Loading model from file coqui-stt-1.0.0-models/model.tflite
TensorFlow: v2.9.1-11-gf8242ebc005
 Coqui STT: v1.4.0-0-gfcec06bd
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Loaded model in 0.00148s.
Loading scorer from files coqui-stt-1.0.0-models/large_vocabulary.scorer
Loaded scorer in 0.000231s.
Running inference.
your power is sufficient i said
Inference took 0.889s for 2.590s audio file.


Examine the output of the last three commands, and you will see the results “experience proves this”, “why should one halt on the way”, and “your power is sufficient i said”, respectively. You are all set.
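Each run also reports how long inference took relative to the audio length. One way to compare runs is the real-time factor (inference time divided by audio duration), where a value below 1 means transcription ran faster than the audio plays. The sketch below parses the three timing lines printed above; `real_time_factor` and the regular expression are illustrative helpers, not part of the `stt` tool, and they assume the log line keeps this exact wording.

```python
import re

# Timing lines copied from the three `stt` runs above.
log_lines = [
    "Inference took 0.795s for 1.975s audio file.",
    "Inference took 0.910s for 2.735s audio file.",
    "Inference took 0.889s for 2.590s audio file.",
]

# Assumes the log wording stays exactly as shown in the output above.
TIMING_RE = re.compile(r"Inference took ([\d.]+)s for ([\d.]+)s audio file")

def real_time_factor(line):
    """Return inference_time / audio_duration, or None if the line has no timing."""
    match = TIMING_RE.search(line)
    if match is None:
        return None
    inference_s, audio_s = map(float, match.groups())
    return inference_s / audio_s

for line in log_lines:
    print(f"{real_time_factor(line):.2f}")
```

All three runs above come out below 1, that is, faster than real time on this Colab CPU.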

If you want the word-level breakdown with timestamps, use the `--json` flag:

In [19]:
!stt --json --model coqui-stt-1.0.0-models/model.tflite --scorer coqui-stt-1.0.0-models/large_vocabulary.scorer --audio ./audio/8455-210777-0068.wav 

Loading model from file coqui-stt-1.0.0-models/model.tflite
TensorFlow: v2.9.1-11-gf8242ebc005
 Coqui STT: v1.4.0-0-gfcec06bd
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Loaded model in 0.00152s.
Loading scorer from files coqui-stt-1.0.0-models/large_vocabulary.scorer
Loaded scorer in 0.000266s.
Running inference.
{
  "transcripts": [
    {
      "confidence": -31.462177276611328,
      "words": [
        {
          "word": "your",
          "start_time": 0.72,
          "duration": 0.2
        },
        {
          "word": "power",
          "start_time": 0.98,
          "duration": 0.2
        },
        {
          "word": "is",
          "start_time": 1.28,
          "duration": 0.1
        },
        {
          "word": "sufficient",
          "start_time": 1.44,
          "duration": 0.36
        },
        {
          "word": "i",
          "start_time": 1.92,
          "duration": 0.12
        },
        {
          "word": "said",
          "start_time": 2.1,
   
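The JSON output can be post-processed with the standard `json` module. The sketch below turns each word entry into a (word, start, end) tuple; it uses an excerpt of the first three words from the output above, and it assumes the JSON text has already been separated from the log lines that the command prints before it. `word_timings` is a hypothetical helper name.

```python
import json

# Excerpt shaped like the `stt --json` output above (first three words only).
sample_output = """
{
  "transcripts": [
    {
      "confidence": -31.462177276611328,
      "words": [
        {"word": "your", "start_time": 0.72, "duration": 0.2},
        {"word": "power", "start_time": 0.98, "duration": 0.2},
        {"word": "is", "start_time": 1.28, "duration": 0.1}
      ]
    }
  ]
}
"""

def word_timings(json_text):
    """Return (word, start, end) tuples for the first candidate transcript."""
    result = json.loads(json_text)
    best = result["transcripts"][0]
    return [
        # Round the end time to hide floating-point noise from the addition.
        (w["word"], w["start_time"], round(w["start_time"] + w["duration"], 2))
        for w in best["words"]
    ]

for word, start, end in word_timings(sample_output):
    print(f"{start:5.2f}s - {end:5.2f}s  {word}")
```

Tuples like these are a convenient starting point for subtitles or for aligning the transcript with the audio.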

# DeepSpeech API

1.   **Import the `stt` package**

In [20]:
import stt

2. **Create a model**

In [21]:
model_file_path = 'coqui-stt-1.0.0-models/model.tflite'
model = stt.Model(model_file_path)

3. **Add scorer and other parameters**

In [22]:
scorer_file_path = 'coqui-stt-1.0.0-models/large_vocabulary.scorer'
model.enableExternalScorer(scorer_file_path)

lm_alpha = 0.75
lm_beta = 1.85
model.setScorerAlphaBeta(lm_alpha, lm_beta)

beam_width = 500
model.setBeamWidth(beam_width)

0

## Batch API

1.   **Read an input wav file**


In [23]:
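import wave
filename = 'audio/8455-210777-0068.wav'
# The context manager closes the file once the frames are read
with wave.open(filename, 'rb') as w:
    rate = w.getframerate()        # samples per second
    frames = w.getnframes()        # total number of audio frames
    buffer = w.readframes(frames)  # raw audio as bytes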

Check the sample rate and buffer type:

In [24]:
print(rate)
print(model.sampleRate())
print(str(type(buffer)))

16000
16000
<class 'bytes'>


As you can see, the sample rate of the wav file is 16000 Hz, the same as the model’s sample rate. But the buffer is a byte array, whereas the DeepSpeech model expects a 16-bit int array.
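The same check can be wrapped in a small helper that runs before any audio is handed to the model. This is a sketch under stated assumptions: `describe_wav` is a hypothetical name, and treating "mono, 16-bit, 16000 Hz" as the compatible format is an assumption that extends the sample-rate and sample-type checks above with a channel count. The demo builds a silent WAV in memory so it runs without the downloaded files.

```python
import io
import wave

def describe_wav(wav_file, expected_rate=16000):
    """Return (is_compatible, duration_seconds) for a WAV path or file object.

    "Compatible" here means mono, 16-bit samples at expected_rate. The mono
    requirement is an assumption, not something verified in this notebook.
    """
    with wave.open(wav_file, "rb") as w:
        channels = w.getnchannels()
        sample_width = w.getsampwidth()  # bytes per sample; 2 means 16-bit
        rate = w.getframerate()
        frames = w.getnframes()
    compatible = channels == 1 and sample_width == 2 and rate == expected_rate
    return compatible, frames / rate

# Build a one-second silent WAV in memory so the sketch runs without any files.
demo = io.BytesIO()
with wave.open(demo, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
demo.seek(0)

print(describe_wav(demo))
```

In the notebook, the call would take a path instead, for example `describe_wav('audio/8455-210777-0068.wav', model.sampleRate())`.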

2.  **Convert byte array buffer to int16 array**

In [25]:
import numpy as np
data16 = np.frombuffer(buffer, dtype=np.int16)
print(str(type(data16)))

<class 'numpy.ndarray'>

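To see what this conversion means at the byte level, here is a standard-library sketch that decodes pairs of bytes as little-endian signed 16-bit integers, which is the sample layout of 16-bit PCM WAV data. It is for illustration only and is not a replacement for `np.frombuffer`, which does the same reinterpretation without copying (using the machine's native byte order). `bytes_to_int16` is a hypothetical helper name.

```python
import struct

def bytes_to_int16(raw):
    """Decode little-endian signed 16-bit samples from a bytes object."""
    if len(raw) % 2:
        raise ValueError("buffer length must be a multiple of 2 bytes")
    # "<" = little-endian, "h" = signed 16-bit; one "h" per two bytes.
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))

# Three hand-made samples: 1, -1, and the most negative 16-bit value.
raw = b"\x01\x00\xff\xff\x00\x80"
print(bytes_to_int16(raw))
```

Every two bytes become one sample, so the int16 array has half as many elements as the byte buffer has bytes.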

3.  **Run speech-to-text in batch mode to get the text**

In [26]:
text = model.stt(data16)
print(text)

your power is sufficient i said


## Streaming API

Now let’s accomplish the same using the streaming API. It consists of three steps: open a session, feed data, and close the session.

1.  **Open a streaming session**

In [27]:
stt_stream = model.createStream()

2.  **Repeatedly feed chunks of speech buffer, and get interim results if desired**

In [28]:
buffer_len = len(buffer)
offset = 0
batch_size = 16384  # bytes per chunk; even, so no 16-bit sample is split
text = ''
while offset < buffer_len:
    end_offset = offset + batch_size
    chunk = buffer[offset:end_offset]
    data16 = np.frombuffer(chunk, dtype=np.int16)
    stt_stream.feedAudioContent(data16)
    # Interim transcript of everything fed so far
    text = stt_stream.intermediateDecode()
    print(text)
    offset = end_offset



your power 
your power is suff
your power is sufficient i said
your power is sufficient i said


3.  **Close stream and get the final result**

In [29]:
text = stt_stream.finishStream()
print(text)

your power is sufficient i said


Verify that the output is the same as the batch API output: “your power is sufficient i said”.
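The offset bookkeeping in the feeding loop can be factored into a generator. The sketch below is one possible way to do that, not part of the Coqui STT API; `iter_chunks` is a hypothetical helper, and the buffer is a stand-in of zero bytes so the example runs without audio files or a model.

```python
def iter_chunks(buffer, chunk_bytes=16384, sample_width=2):
    """Yield consecutive slices of buffer, each at most chunk_bytes long."""
    # Requiring a multiple of sample_width keeps every chunk boundary on a
    # sample boundary, so no 16-bit sample is split across two chunks.
    if chunk_bytes <= 0 or chunk_bytes % sample_width:
        raise ValueError("chunk_bytes must be a positive multiple of sample_width")
    for offset in range(0, len(buffer), chunk_bytes):
        yield buffer[offset:offset + chunk_bytes]

# Stand-in for an audio buffer: 40000 bytes of silence (arbitrary size).
fake_buffer = bytes(40000)
sizes = [len(chunk) for chunk in iter_chunks(fake_buffer)]
print(sizes)
```

In the notebook, each yielded chunk would be converted with `np.frombuffer(chunk, dtype=np.int16)` and passed to `stt_stream.feedAudioContent()`, exactly as in the loop above.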

# Recap

DeepSpeech has two modes: batch and streaming. The first step is to create a model object. Then call either `stt()` for batch mode or `feedAudioContent()` for streaming mode to transcribe audio to text.

---
<p>Copyright &copy; 2020 - 2022 <a href="https://www.linkedin.com/in/scgupta">Satish Chandra Gupta</a>.</p>
<img src="https://licensebuttons.net/l/by-nc-sa/3.0/88x31.png" align="left"/> <p>&nbsp;<a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA 4.0 International</a> License.</p>