# **Class Project: Bilingual Speech Recognition for Personal Assistants**

**Project Member:**  
Tharnarch Thoranisttakul (Omz), Student ID: 63340500025  
FIBO, KMUTT

As of April 8, 2023, the current stable release of DeepSpeech is 0.9.3, so this project uses DeepSpeech 0.9.3.

## **References:**

https://www.section.io/engineering-education/speech-to-text-transcription-model-using-deep-speech/  
https://deepspeech.readthedocs.io/en/latest/Python-API.html

Import necessary packages

In [1]:
from deepspeech import Model
from datasets import load_dataset, config

import numpy as np
import os
import wave
# from pathlib import Path

from IPython.display import Audio

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))



[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]



Downloading the models and creating alphabet.txt

In [2]:
# DeepSpeech 0.9.3
# Model, Scorer and Alphabet paths
model_file_path = 'models/deepspeech-0.9.3-models.pbmm'
scorer_file_path = 'models/deepspeech-0.9.3-models.scorer'

if not os.path.exists(model_file_path):
    # Acoustic Model
    !wget -P models https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
if not os.path.exists(scorer_file_path):
    # Language Model
    !wget -P models https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

In [3]:
# Install DeepSpeech 0.9.3 using pip
!pip install deepspeech==0.9.3



In [4]:
alphabet_path = 'models/alphabet.txt'

withTH = False

enAlpList = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
             'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
thAlpList = ['ก', 'ข', 'ฃ', 'ค', 'ฅ', 'ฆ', 'ง', 'จ', 'ฉ', 'ช', 'ซ', 'ฌ', 'ญ', 'ฎ', 'ฏ', 'ฐ', 'ฑ', 'ฒ', 'ณ', 'ด', 'ต', 'ถ', 'ท', 'ธ', 'น', 'บ',
             'ป', 'ผ', 'ฝ', 'พ', 'ฟ', 'ภ', 'ม', 'ย', 'ร', 'ล', 'ว', 'ศ', 'ษ', 'ส', 'ห', 'ฬ', 'อ', 'ฮ', 'ฯ', 'ะ', 'ั', 'า', 'ำ', 'ิ', 'ี', 'ึ', 'ื', 'ุ', 'ู', 'ฺ',
             'เ', 'แ', 'โ', 'ใ', 'ไ', 'ๅ', 'ๆ', '็', '่', '้', '๊', '๋', '์', 'ํ', '๎']
# Punctuation marks are included in both the English-only and the bilingual alphabet
punctuation = ["'", '"', ',', '.', '?', '!']
sortedAlpList = sorted(enAlpList + (thAlpList if withTH else []) + punctuation)

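# Regenerate alphabet.txt on every run so that it always matches the settings above.
# UTF-8 is set explicitly so that Thai characters are written correctly on any platform.
with open(alphabet_path, 'w', encoding='utf-8') as f: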
    for i in sortedAlpList:
        f.write(i + '\n')
    f.write(' ')
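
As a sanity check, the alphabet file can be written and read back to confirm that every label survives the round trip. The sketch below is a minimal illustration and not part of the DeepSpeech API: `write_alphabet` and `read_alphabet` are hypothetical helper names, and the reader assumes that lines beginning with `#` are comments, as in the alphabet files shipped with DeepSpeech.

```python
import os
import tempfile


def write_alphabet(chars, path):
    """Write one label per line, sorted, with the space label on the last line."""
    labels = sorted(set(chars) - {" "})
    with open(path, "w", encoding="utf-8") as f:
        for ch in labels:
            f.write(ch + "\n")
        f.write(" ")  # space label, no trailing newline (as in the cell above)
    return labels + [" "]


def read_alphabet(path):
    """Read labels back, skipping empty lines and '#' comment lines (assumed format)."""
    labels = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label = line.rstrip("\n")
            if label and not label.startswith("#"):
                labels.append(label)
    return labels


# Demo on a temporary file with a tiny, made-up character set
demo_path = os.path.join(tempfile.mkdtemp(), "alphabet.txt")
written = write_alphabet(["b", "a", "'", "ก"], demo_path)
assert read_alphabet(demo_path) == written
```

Deduplicating through a set also guards against a character being listed twice, which would otherwise give two labels for the same symbol.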

Load Train Dataset

In [6]:
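# Login to HuggingFace. The token is read from an environment variable
# (assumed here to be named HF_TOKEN) instead of being hard-coded in the notebook.
!huggingface-cli login --token=$HF_TOKEN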

# Set download path and cache path
config.DOWNLOADED_DATASETS_PATH = "/media/omzlette/2ndSSD/CommonVoice_Corpus/data"
config.HF_CACHE_HOME = os.path.expanduser("~/BSR-Project/data")
config.HF_DATASETS_CACHE = os.path.join(config.HF_CACHE_HOME, "datasets")
config.HF_METRICS_CACHE = os.path.join(config.HF_CACHE_HOME, "metrics")
config.HF_MODULES_CACHE = os.path.join(config.HF_CACHE_HOME, "modules")

# Note: despite the variable name, this loads Common Voice 12.0 (not 13.0)
en_cv13 = load_dataset("mozilla-foundation/common_voice_12_0", "en", split='train')
# th_cv13 = load_dataset("mozilla-foundation/common_voice_13_0", "th", split="train")

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid.
Your token has been saved to /home/omzlette/.cache/huggingface/token
Login successful



Downloading and preparing dataset common_voice_12_0/en to /home/omzlette/BSR-Project/data/datasets/mozilla-foundation___common_voice_12_0/en/12.0.0/dd534e3c6006ee4b577c176df4a8ef23bced8b3150a3b64d2d0a7a5e3f942efb...



Generating train split: 0 examples [00:00, ? examples/s]
Reading metadata...: 986897it [00:04, 214928.95it/s]

Generating validation split: 0 examples [00:00, ? examples/s]
Reading metadata...: 16365it [00:00, 181616.52it/s]

Generating test split: 0 examples [00:00, ? examples/s]
Reading metadata...: 16365it [00:00, 202732.60it/s]

Generating other split: 0 examples [00:00, ? examples/s]
Reading metadata...: 283258it [00:01, 201709.16it/s]

Generating invalidated split: 0 examples [00:00, ? examples/s]
Reading metadata...: 259242it [00:01, 208463.22it/s]

Dataset common_voice_12_0 downloaded and prepared to /home/omzlette/BSR-Project/data/datasets/mozilla-foundation___common_voice_12_0/en/12.0.0/dd534e3c6006ee4b577c176df4a8ef23bced8b3150a3b64d2d0a7a5e3f942efb. Subsequent calls will reuse this data.


In [None]:
en_cv13.features.keys()

In [None]:
en_cv13[40000:40010]["audio"]
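
Each decoded `audio` entry from the `datasets` library is a dictionary holding a floating-point sample array and its sampling rate, whereas `model.stt` expects 16-bit integer samples. The sketch below shows one way to do the float-to-PCM conversion using only the standard library; `float_to_pcm16` is a hypothetical helper name, and scaling by 32767 is one common convention rather than the only one. Resampling is a separate step that this sketch does not cover (the `datasets` library can do it with `cast_column("audio", Audio(sampling_rate=16000))`).

```python
from array import array


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM, clipping out-of-range values."""
    pcm = array("h")  # "h" = signed 16-bit integers
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))  # clip to the valid range
        pcm.append(int(round(x * 32767)))
    return pcm


print(float_to_pcm16([0.0, 1.0, -1.0, 2.0]).tolist())
```

In the notebook itself the same conversion would more naturally be done with NumPy on the whole array at once; the loop form is used here only to keep the example dependency-free.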

In [None]:
import soundfile as sf

testpath = '/media/omzlette/2ndSSD/CommonVoice_Corpus/data/extracted/260d00f3d8ffb4fd721297c4898d4c3aef3e3b2f98b992536e0e52d27bd94d90/common_voice_en_24735189.mp3' 

with sf.SoundFile(testpath) as f:
    duration = len(f) / f.samplerate
    print(duration)
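
The same duration calculation (number of frames divided by sample rate) can be done for WAV files with the standard-library `wave` module, which is what the transcription cells later in this notebook read. This is a minimal sketch; `wav_duration_seconds` is a hypothetical helper name and the demo file is synthetic silence, not Common Voice data.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the clip length in seconds: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Demo: one second of silence at 16 kHz, mono, 16-bit
demo_wav = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_wav, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(wav_duration_seconds(demo_wav))
```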

Initialize hyperparameter variables

In [None]:
# Hyperparameter variables
# According to the DeepSpeech documentation, a larger beam width
# generates better results at the cost of decoding time.
beam_width = 100
lm_alpha = 0.75  # language model weight
lm_beta = 1.85   # word insertion weight

In [None]:
# Optimizing lm_alpha and lm_beta is described in the Scorer documentation:
# https://deepspeech.readthedocs.io/en/v0.9.3/Scorer.html

# Load model into memory
model = Model(model_file_path)
model.enableExternalScorer(scorer_file_path)

# Set hyperparameters
model.setScorerAlphaBeta(lm_alpha, lm_beta)
model.setBeamWidth(beam_width)
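
One simple way to compare candidate values of `lm_alpha` and `lm_beta` is to measure the word error rate (WER) on a small labelled set for each combination and keep the best pair. The sketch below is a plain grid search under stated assumptions, not the optimization procedure from the DeepSpeech documentation: `word_error_rate` and `grid_search` are hypothetical helpers, and `transcribe_fn(audio, alpha, beta)` stands in for a function that would call `model.setScorerAlphaBeta(alpha, beta)` followed by `model.stt(audio)`.

```python
import itertools


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def grid_search(transcribe_fn, samples, alphas, betas):
    """Return (mean_wer, alpha, beta) for the pair with the lowest mean WER.

    samples is a non-empty list of (audio, reference_text) pairs.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    best = None
    for alpha, beta in itertools.product(alphas, betas):
        total = sum(word_error_rate(ref, transcribe_fn(audio, alpha, beta))
                    for audio, ref in samples)
        mean_wer = total / len(samples)
        if best is None or mean_wer < best[0]:
            best = (mean_wer, alpha, beta)
    return best
```

A grid search is easy to reason about but scales poorly; with more than a handful of candidate values, a smarter search strategy would likely be preferable.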

In [None]:
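# NOTE: the inference-only `deepspeech.Model` class does not expose a `train()`
# method in the documented 0.9.3 Python API, so the call below fails with an
# AttributeError. Training or fine-tuning is done with the separate DeepSpeech
# training code, not through this class.
# model.train()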

In [None]:
def process_audio(audio_file):
    # Read audio file
    with wave.open(audio_file, 'rb') as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        buffer = wav.readframes(frames)

    # Process audio file
    data16 = np.frombuffer(buffer, dtype=np.int16)
    return data16, rate
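
Because `process_audio` interprets the raw frames as 16-bit integers, a file with a different sample width or more than one channel would be decoded incorrectly without any error. The sketch below checks a WAV header before decoding; `wav_format_problems` is a hypothetical helper, and the 16 kHz default is an assumption based on the released 0.9.3 English models.

```python
import os
import tempfile
import wave


def wav_format_problems(path, expected_rate=16000):
    """Return a list of format mismatches (empty if the file looks usable)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {wav.getframerate()} Hz")
    return problems


# Demo: a stereo 44.1 kHz file fails two of the three checks
demo_wav = os.path.join(tempfile.mkdtemp(), "stereo.wav")
with wave.open(demo_wav, "wb") as wav:
    wav.setnchannels(2)
    wav.setsampwidth(2)
    wav.setframerate(44100)
    wav.writeframes(b"\x00\x00" * 2 * 10)

print(wav_format_problems(demo_wav))
```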

In [None]:
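def transcribe(audio_file):
    # Process audio file
    data16, rate = process_audio(audio_file)

    # The model only gives meaningful results at its own sample rate
    if rate != model.sampleRate():
        raise ValueError(f"Expected {model.sampleRate()} Hz audio, got {rate} Hz")

    # Transcribe audio file
    return model.stt(data16)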

In [None]:
transcribe("data/cv-valid-train/sample-000000.wav")
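
`model.stt` transcribes a complete recording in one call. For a personal assistant, audio usually arrives in pieces, and the DeepSpeech Python API also offers a streaming interface (`createStream`, `feedAudioContent`, `finishStream`). The sketch below shows how the chunking could look; `iter_chunks` and `transcribe_streaming` are hypothetical helpers, the chunk size of 4096 samples is an arbitrary choice, and `samples` is assumed to be the int16 array returned by `process_audio`.

```python
def iter_chunks(samples, chunk_size):
    """Yield consecutive slices of at most chunk_size samples."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def transcribe_streaming(model, samples, chunk_size=4096):
    """Feed samples to a DeepSpeech stream chunk by chunk and return the final text."""
    stream = model.createStream()
    for chunk in iter_chunks(samples, chunk_size):
        stream.feedAudioContent(chunk)
    return stream.finishStream()
```

With the objects from this notebook, a call would look like `transcribe_streaming(model, data16)`, where `data16` comes from `process_audio`.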