# DeepSpeech
### DeepSpeech is an open-source speech-to-text engine built on an RNN model. A pre-trained English model is available and can be downloaded by following the instructions below.

In [1]:
# Install the GPU-accelerated version of DeepSpeech.
# Note: Make sure a supported GPU and the CUDA dependencies are installed on Linux.
!pip3 install deepspeech-gpu
# If the GPU version is not supported, install the CPU version instead:
# !pip3 install deepspeech

Collecting deepspeech-gpu
  Downloading deepspeech_gpu-0.9.3-cp37-cp37m-manylinux1_x86_64.whl (22.3 MB)
     |████████████████████████████████| 22.3 MB 46.5 MB/s
Installing collected packages: deepspeech-gpu
Successfully installed deepspeech-gpu-0.9.3
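
The choice between the two packages can be sketched in Python. This is a minimal sketch, not part of the original notebook: the helper name `pick_deepspeech_package` is hypothetical, and it assumes that finding `nvidia-smi` on the PATH is a rough proxy for a usable NVIDIA GPU. It does not check the CUDA version that `deepspeech-gpu` needs.

```python
import shutil


def pick_deepspeech_package() -> str:
    """Return the pip package name to install (hypothetical helper)."""
    # Assumption: nvidia-smi on the PATH suggests an NVIDIA driver is present.
    # This does not verify that the required CUDA libraries are installed.
    return "deepspeech-gpu" if shutil.which("nvidia-smi") else "deepspeech"


print(pick_deepspeech_package())
```

The printed name can then be passed to `pip3 install` in the cell above.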


In [2]:
# Download pre-trained English model files
# Files ending in .pbmm are compatible with clients and language bindings built against the standard TensorFlow runtime.
# Files ending in .scorer are external scorers (language models) that are used at inference time in conjunction with an acoustic model (.pbmm or .tflite file) to produce transcriptions.
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   634  100   634    0     0   2419      0 --:--:-- --:--:-- --:--:--  2419
100  180M  100  180M    0     0  13.8M      0  0:00:12  0:00:12 --:--:-- 15.1M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   667  100   667    0     0   3420      0 --:--:-- --:--:-- --:--:--  3420
100  909M  100  909M    0     0  24.5M      0  0:00:36  0:00:36 --:--:--  113M


In [3]:
# Download an example data set
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
!tar xvf audio-0.9.3.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   655  100   655    0     0   3164      0 --:--:-- --:--:-- --:--:--  3164
100  194k  100  194k    0     0   459k      0 --:--:-- --:--:-- --:--:-- 1034k
._audio
audio/
audio/._2830-3980-0043.wav
audio/2830-3980-0043.wav
audio/._Attribution.txt
audio/Attribution.txt
audio/._4507-16021-0012.wav
audio/4507-16021-0012.wav
audio/._8455-210777-0068.wav
audio/8455-210777-0068.wav
audio/._License.txt
audio/License.txt


In [4]:
# Transcribe an audio file
!deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/8455-210777-0068.wav

2021-11-13 01:16:27.356765: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Loading model from file deepspeech-0.9.3-models.pbmm
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
2021-11-13 01:16:27.524912: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-13 01:16:27.527257: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-11-13 01:16:27.585908: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-13 01:16:27.586827: I tensorflo
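
The same transcription can also be run from Python rather than through the command-line client. The following is a sketch under stated assumptions: the helper names `read_pcm16` and `transcribe` are hypothetical, and the code expects the `deepspeech` (or `deepspeech-gpu`) 0.9.x package and NumPy to be installed. It rejects audio that is not mono 16-bit PCM at the model's sample rate instead of resampling it.

```python
import wave


def read_pcm16(path):
    """Return (sample_rate, raw_bytes) for a mono 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        return wav.getframerate(), wav.readframes(wav.getnframes())


def transcribe(audio_path, model_path, scorer_path=None):
    """Transcribe one WAV file with the deepspeech Python package."""
    # Imported lazily so read_pcm16 stays usable without these packages.
    import numpy as np
    from deepspeech import Model

    model = Model(model_path)
    if scorer_path:
        model.enableExternalScorer(scorer_path)
    rate, pcm = read_pcm16(audio_path)
    if rate != model.sampleRate():
        raise ValueError(f"audio is {rate} Hz, model expects {model.sampleRate()} Hz")
    # The model consumes the samples as a 16-bit integer array.
    return model.stt(np.frombuffer(pcm, dtype=np.int16))
```

With the files downloaded above, a call would look like `transcribe("audio/8455-210777-0068.wav", "deepspeech-0.9.3-models.pbmm", "deepspeech-0.9.3-models.scorer")`.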

In [5]:
# Mount Google Drive
from google.colab import drive
import os
drive.mount('/content/drive')
checkpoints = '/content/drive/MyDrive/colab_files/'
if not os.path.exists(checkpoints):
    os.makedirs(checkpoints)

Mounted at /content/drive


# Vad_transcriber
### Vad_transcriber transcribes long WAV files. It takes a WAV file of any duration, uses the WebRTC Voice Activity Detector (VAD) to split it into smaller chunks, and saves a consolidated transcript.
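
The idea of splitting audio at non-speech frames can be illustrated with a simplified sketch. This is not the splitting logic used by the example scripts: the function names `frames` and `voiced_segments` are hypothetical, there is no padding or smoothing around segment boundaries, and the speech detector is passed in as a plain callable.

```python
def frames(pcm, sample_rate=16000, frame_ms=30):
    """Yield fixed-length frames of 16-bit mono PCM; a short tail is dropped."""
    step = sample_rate * frame_ms // 1000 * 2  # bytes per frame, 2 bytes per sample
    for start in range(0, len(pcm) - step + 1, step):
        yield pcm[start:start + step]


def voiced_segments(pcm, is_speech, sample_rate=16000, frame_ms=30):
    """Join runs of consecutive speech frames into byte segments."""
    segments, current = [], []
    for frame in frames(pcm, sample_rate, frame_ms):
        if is_speech(frame):
            current.append(frame)
        elif current:
            # A non-speech frame closes the segment collected so far.
            segments.append(b"".join(current))
            current = []
    if current:
        segments.append(b"".join(current))
    return segments
```

With the `webrtcvad` package installed, the detector could be supplied as `vad = webrtcvad.Vad(1)` followed by `voiced_segments(pcm, lambda f: vad.is_speech(f, 16000))`. The VAD accepts frames of 10, 20, or 30 ms, which is why the sketch defaults to 30 ms.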

In [6]:
# Install required packages
!sudo apt install sox

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3
Suggested packages:
  file libsox-fmt-all
The following NEW packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3 sox
0 upgraded, 8 newly installed, 0 to remove and 37 not upgraded.
Need to get 760 kB of archives.
After this operation, 6,717 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrnb0 amd64 0.1.3-2.1 [92.0 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libopencore-amrwb0 amd64 0.1.3-2.1 [45.8 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 libmagic-mgc amd64 1:5.32-2ubuntu0.4 [184 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-updates/main a

In [17]:
!pip3 install webrtcvad

Collecting webrtcvad
  Downloading webrtcvad-2.0.10.tar.gz (66 kB)
[?25l[K     |█████                           | 10 kB 24.7 MB/s eta 0:00:01[K     |██████████                      | 20 kB 8.7 MB/s eta 0:00:01[K     |██████████████▉                 | 30 kB 5.9 MB/s eta 0:00:01[K     |███████████████████▉            | 40 kB 5.6 MB/s eta 0:00:01[K     |████████████████████████▊       | 51 kB 2.4 MB/s eta 0:00:01[K     |█████████████████████████████▊  | 61 kB 2.5 MB/s eta 0:00:01[K     |████████████████████████████████| 66 kB 1.6 MB/s 
[?25hBuilding wheels for collected packages: webrtcvad
  Building wheel for webrtcvad (setup.py) ... [?25l[?25hdone
  Created wheel for webrtcvad: filename=webrtcvad-2.0.10-cp37-cp37m-linux_x86_64.whl size=72375 sha256=a14fac50efff79a4ed375eed63da7c1ee0ce69c4201146bb7263d21d55e509d2
  Stored in directory: /root/.cache/pip/wheels/11/f9/67/a3158d131f57e1c0a7d8d966a707d4a2fb27567a4fe47723ad
Successfully built webrtcvad
Installing collected pa

In [7]:
# Download DeepSpeech examples - VAD_transcriber
!git clone "https://github.com/mozilla/DeepSpeech-examples.git"

Cloning into 'DeepSpeech-examples'...
remote: Enumerating objects: 1193, done.
remote: Counting objects: 100% (187/187), done.
remote: Compressing objects: 100% (169/169), done.
remote: Total 1193 (delta 67), reused 74 (delta 8), pack-reused 1006
Receiving objects: 100% (1193/1193), 1.04 MiB | 5.50 MiB/s, done.
Resolving deltas: 100% (551/551), done.


In [None]:
!ls DeepSpeech-examples/vad_transcriber

audioTranscript_cmd.py	README.md	  test.sh      wavTranscriber.py
audioTranscript_gui.py	requirements.txt  wavSplit.py


In [11]:
# Convert .mp3 to .wav
# The DeepSpeech model is trained on 16 kHz mono audio, so the input file must match that format (here: 16-bit PCM, 1 channel, 16000 Hz)
!ffmpeg -i drive/MyDrive/colab_files/wa_senate_bfst_2021_0119.mp3 -acodec pcm_s16le -ac 1 -ar 16000 wa_senate_bfst_2021_0119.wav

ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/usr --extra-version=0ubuntu0.2 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lib
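
After the conversion, it can help to confirm that the resulting file really has the expected format. The following is a small sketch using only the standard library. The helper names `wav_info` and `is_deepspeech_ready` are hypothetical, and the target of 1 channel, 16000 Hz, and 2 bytes per sample is taken from the ffmpeg flags used above.

```python
import wave


def wav_info(path):
    """Read basic format information from a WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def is_deepspeech_ready(info):
    """True if the file is mono, 16 kHz, 16-bit, as produced by the ffmpeg call."""
    return (info["channels"], info["sample_rate"], info["sample_width_bytes"]) == (1, 16000, 2)
```

For the file created above, `is_deepspeech_ready(wav_info("wa_senate_bfst_2021_0119.wav"))` should return `True` if the conversion succeeded.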

In [18]:
# The command-line tool processes a WAV file of any duration and returns a transcript, which is saved in the same directory as the input audio file.
# It also gives you control over the aggressiveness of the VAD: set the mode to an integer between 0 and 3, where 0 is the least aggressive about filtering out non-speech and 3 is the most aggressive.
import time
start = time.time()
!python3 DeepSpeech-examples/vad_transcriber/audioTranscript_cmd.py --aggressive 1 --audio wa_senate_bfst_2021_0119.wav --model ./
end = time.time()
print("Total time: {:.2f}".format(end-start))

2021-11-13 01:28:35.771954: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
DEBUG:root:Transcribing audio file @ wa_senate_bfst_2021_0119.wav
DEBUG:root:Found Model: ./deepspeech-0.9.3-models.pbmm
DEBUG:root:Found scorer: ./deepspeech-0.9.3-models.scorer
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
2021-11-13 01:28:35.784891: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-13 01:28:35.786107: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-11-13 01:28:35.811267: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had nega
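
The timing pattern in the cell above can also be written without the notebook `!` syntax, which makes it easier to repeat a run with different settings. This is a sketch: the helper name `timed_run` is hypothetical, and the command it runs here is a trivial placeholder rather than the transcriber.

```python
import subprocess
import sys
import time


def timed_run(cmd):
    """Run a command given as an argument list; return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, check=False)
    return result.returncode, time.perf_counter() - start


# Placeholder command; replace it with the transcriber invocation.
code, elapsed = timed_run([sys.executable, "-c", "pass"])
print(f"exit code {code}, total time: {elapsed:.2f} s")
```

For the transcriber, the argument list would be `["python3", "DeepSpeech-examples/vad_transcriber/audioTranscript_cmd.py", "--aggressive", "1", "--audio", "wa_senate_bfst_2021_0119.wav", "--model", "./"]`, with `--aggressive` varied between 0 and 3 to compare runs.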

In [19]:
# Save the transcript to Google Drive
!cp wa_senate_bfst_2021_0119.txt drive/MyDrive/colab_files/