<a href="https://colab.research.google.com/gist/jmrf/7711d1f833e49ba27e85c12edc316123/stt-exploratory-telebot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Offline STT exploration

We explore 3 different options:

 - [pykaldi]()
 - Facebook's [wav2letter](https://github.com/flashlight/wav2letter/)
 - OpenAI's [whisper](https://github.com/openai/whisper)


## Setup

In [9]:
# System common deps
!apt-get install -qq \
    sox \
    mediainfo

# Python common deps
!pip install -qq -U pip
!pip install -qq ffmpeg-python sox


## Helpers

In [14]:
import os
import glob
import signal
import tempfile

from contextlib import contextmanager
from subprocess import Popen, PIPE, check_output


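@contextmanager
def timeout(duration: int):
    def timeout_handler(signum, frame):
        raise TimeoutError(f"Block timed out after {duration} seconds")

    # SIGALRM is POSIX-only and must be installed from the main thread
    previous = signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(duration)
    try:
        yield
    finally:
        # Cancel the pending alarm and restore the previous handler
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)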


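def create_process(cmd):
    # `cmd` is a full shell command line, so pass it as a string with shell=True.
    # os.setsid starts the child in its own session (POSIX only).
    process = Popen(cmd,
                    stdin=PIPE, stdout=PIPE, stderr=PIPE,
                    shell=True, preexec_fn=os.setsid)
    return process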


def read_current_output(process):
    stt_symbol = "|P|:"
    word_separator_symbol = "|"

    # communicate() drains stdout and stderr together; reading them line by
    # line in turn can block forever when one of the pipes stays silent.
    stdout, _stderr = process.communicate()

    transcripts = []
    for line in stdout.decode().splitlines():
        if line.startswith(stt_symbol):
            chunks = line.replace(stt_symbol, "").split(word_separator_symbol)
            words = " ".join([w.strip().replace(" ", "") for w in chunks])
            transcripts.append(words)

    return transcripts
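
To make the parsing in `read_current_output` concrete, here is a minimal sketch that applies the same string handling to a single prediction line. The sample line is hypothetical (an assumption about what `fl_asr_test --show` prints for letter tokens, not captured from a real run), and `parse_prediction` is an illustrative name rather than part of flashlight. Unlike the helper above, it also drops empty chunks.

```python
STT_SYMBOL = "|P|:"   # prefix of prediction lines, as used in read_current_output
WORD_SEPARATOR = "|"  # token that separates words


def parse_prediction(line: str) -> str:
    """Turn one '|P|:' line into a plain sentence (illustrative helper)."""
    body = line.replace(STT_SYMBOL, "")
    # Each chunk between separators is one word spelled as spaced letters
    words = [chunk.strip().replace(" ", "") for chunk in body.split(WORD_SEPARATOR)]
    # Skip empty chunks, e.g. the one left by a trailing separator
    return " ".join(w for w in words if w)


# Hypothetical sample line, for illustration only
sample = "|P|: t h i s | i s | a | t e s t"
print(parse_prediction(sample))
```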

## 🤎 PyKaldi

### Setup

In [None]:
!apt-get install -qq -y --no-install-recommends \
    autoconf \
    automake \
    cmake \
    curl \
    gfortran \
    g++

!pip install -U -qq pip setuptools
!pip install -qq \
    'coloredlogs==15.0.1' \
    'numpy==1.21.4' \
    'pyaudio==0.2.11' \
    'PyYAML==6.0' \
    'rich==10.15.2' \
    'samplerate==0.1.0' \
    'scipy==1.7.3' \
    'git+https://github.com/wkentaro/gdown.git@v4.2.0#egg=gdown'

In [None]:
%%bash

mkdir -p models && cd models/

# English model
MODEL_FILE=en_160k_nnet3chain_tdnn1f_2048_sp_bi.tar.bz2
if [ ! -f $MODEL_FILE ]; then
    wget http://ltdata1.informatik.uni-hamburg.de/pykaldi/$MODEL_FILE
    tar xvfj $MODEL_FILE
    rm $MODEL_FILE
fi

cd -

## 🌊 Wav2Letter

We use Facebook's [wav2letter](https://github.com/flashlight/wav2letter/tree/main/recipes/mling_pl) and pre-trained models. wav2letter has been consolidated into [flashlight/app/asr](https://github.com/flashlight/flashlight/tree/main/flashlight/app/asr) which requires this [flashlight commit](https://github.com/flashlight/flashlight/tree/8f7af9ec1188bfd7050c47abfac528d21650890f).

> 🤓 [wav2vec-unsupervised-speech-recognition blog post](https://ai.facebook.com/blog/wav2vec-unsupervised-speech-recognition-without-supervision)

> 💡 [Install and inference colab example](https://github.com/flashlight/wav2letter/blob/main/recipes/mling_pl/mling_model.ipynb)


### Setup

In [None]:
# First, choose backend to build with
MODEL = "W2L"
backend = 'CUDA' #@param ["CPU", "CUDA"]

#### Compile

In [None]:
# Clone Flashlight
!git clone https://github.com/flashlight/flashlight.git

# install all dependencies for colab notebook
!source flashlight/scripts/colab/colab_install_deps.sh

In [None]:
# export necessary env variables
%env MKLROOT=/opt/intel/mkl
%env ArrayFire_DIR=/opt/arrayfire/share/ArrayFire/cmake
%env DNNL_DIR=/opt/dnnl/dnnl_lnx_2.0.0_cpu_iomp/lib/cmake/dnnl

if backend == "CUDA":
  # Total time: ~13 minutes
  !cd flashlight && git checkout d2e1924cb2a2b32b48cc326bb7e332ca3ea54f67 && mkdir -p build && cd build && \
  cmake .. -DCMAKE_BUILD_TYPE=Release \
           -DFL_BUILD_TESTS=OFF \
           -DFL_BUILD_EXAMPLES=OFF \
           -DFL_BUILD_APP_ASR=ON && \
  make -j$(nproc)

elif backend == "CPU":
  # Total time: ~14 minutes
  !cd flashlight && git checkout d2e1924cb2a2b32b48cc326bb7e332ca3ea54f67 && mkdir -p build && cd build && \
  cmake .. -DFL_BACKEND=CPU \
           -DCMAKE_BUILD_TYPE=Release \
           -DFL_BUILD_TESTS=OFF \
           -DFL_BUILD_EXAMPLES=OFF \
           -DFL_BUILD_APP_ASR=ON && \
  make -j$(nproc)
  
else:
  raise ValueError(f"Unknown backend {backend}")

env: MKLROOT=/opt/intel/mkl
env: ArrayFire_DIR=/opt/arrayfire/share/ArrayFire/cmake
env: DNNL_DIR=/opt/dnnl/dnnl_lnx_2.0.0_cpu_iomp/lib/cmake/dnnl
Note: checking out 'd2e1924cb2a2b32b48cc326bb7e332ca3ea54f67'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at d2e1924c Tensor any and all (#685)
-- The CXX compiler identification is GNU 7.5.0
-- The C compiler identification is GNU 7.5.0
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile f

In [None]:
%cd /content/flashlight/build
# !wget https://raw.githubusercontent.com/flashlight/wav2letter/49087d575ddf77aa5a99a01fee980fc00e92c802/recipes/mling_pl/model_with_externally_controlled_reshaping_big_lid.cpp
# !mv model_with_externally_controlled_reshaping_big_lid.cpp mling.cpp
!wget https://raw.githubusercontent.com/flashlight/wav2letter/main/recipes/mling_pl/mling_large.cpp

# !cmake .. -DFL_PLUGIN_MODULE_SRC_PATH=mling.cpp
!cmake .. -DFL_PLUGIN_MODULE_SRC_PATH=mling_large.cpp
!make
%cd -

/content/flashlight/build
--2022-03-20 01:53:21--  https://raw.githubusercontent.com/flashlight/wav2letter/main/recipes/mling_pl/mling_large.cpp
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4340 (4.2K) [text/plain]
Saving to: ‘mling_large.cpp’


2022-03-20 01:53:21 (43.3 MB/s) - ‘mling_large.cpp’ saved [4340/4340]

-- -rdynamic supported.
-- Will build flashlight libraries.
-- MKL_THREADING = OMP
-- Checking for [mkl_intel_lp64 - mkl_gnu_thread - mkl_core - gomp - pthread - m - dl]
--   Library mkl_intel_lp64: /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so
--   Library mkl_gnu_thread: /opt/intel/mkl/lib/intel64/libmkl_gnu_thread.so
--   Library mkl_core: /opt/intel/mkl/lib/intel64/libmkl_core.so
--   Library gomp: -fopenmp
--   Library pthread: /usr/lib/

#### Pre-compiled

If we have a pre-compiled flashlight, we only need to set the env variables and install the system deps.

In [None]:
# Fetch a pre-compiled flashlight from GDrive
from google.colab import drive

drive.mount('/gdrive')

!cp '/gdrive/MyDrive/Colab Notebooks/STT-artifacts/$backend-flashlight.tar.gz' /content/

Mounted at /gdrive


In [None]:
# extract the pre-compiled flashlight
!tar xzf $backend-flashlight.tar.gz

# set env. vars
%env MKLROOT=/opt/intel/mkl
%env ArrayFire_DIR=/opt/arrayfire/share/ArrayFire/cmake
%env DNNL_DIR=/opt/dnnl/dnnl_lnx_2.0.0_cpu_iomp/lib/cmake/dnnl

# install system deps
!source flashlight/scripts/colab/colab_install_deps.sh

### Models 

In [None]:
MODELS_DIR = "models/wav2vec"

!mkdir -p $MODELS_DIR

# Download the model checkpoint
!wget "https://dl.fbaipublicfiles.com/wav2letter/mling_pl/checkpoint_cv_finetune.bin" -P $MODELS_DIR -qq

# Download the tokens
!wget "https://dl.fbaipublicfiles.com/wav2letter/mling_pl/tokens-all.lst" -P $MODELS_DIR -qq

### Helpers

In [None]:
cmd = """
./flashlight/build/bin/asr/fl_asr_test \
    --test={audio_list} \
    --am={audio_model} \
    --tokens={tokens} \
    --arch={arch} \
    --lexicon={lexicon} \
    --datadir=''  \
    --emission_dir=''  \
    --show
"""


def run_inference(
    audio_fpath, 
    am_fpath="./models/wav2vec/checkpoint_cv_finetune.bin",
    tokens_fpath="./models/wav2vec/tokens-all.lst", 
    arch="./flashlight/build/mling_large.so",
    lexicon_fpath="./lexicon.txt"
):
    with tempfile.NamedTemporaryFile(mode='w', suffix='.lst') as f:
        duration = float(check_output("soxi -D " + audio_fpath, shell=True))
        f.write("%d %s %s\n" % (0, audio_fpath, duration))
        f.flush()  # 📣 important: flush the buffer so the next process can read the first line!

        _cmd = cmd.format(
            audio_list=f.name,
            audio_model=am_fpath,
            tokens=tokens_fpath,
            arch=arch,
            lexicon=lexicon_fpath
        )
        proc = create_process(_cmd)
        return read_current_output(proc)
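
The list file passed to `--test` holds one line per sample with an id, the audio path and the duration in seconds, as written by `run_inference` above. As a sketch, the duration can also be read with Python's standard `wave` module instead of shelling out to `soxi`. This assumes uncompressed PCM WAV input, and `wav_duration` and `list_file_line` are hypothetical helper names.

```python
import os
import tempfile
import wave


def wav_duration(path: str) -> float:
    """Duration in seconds of a PCM WAV file (frames / sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def list_file_line(sample_id: int, path: str) -> str:
    """One line of the list file: '<id> <path> <duration>'."""
    return f"{sample_id} {path} {wav_duration(path)}\n"


# Demo: write one second of 16 kHz mono silence and describe it
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(list_file_line(0, demo_path), end="")
```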


### Audio recording

In [None]:
from flashlight.scripts.colab.record import record_audio

audio_name = "/content/test_audio"
audio_fpath = f"{audio_name}.wav"
record_audio(audio_name)

with open("audio.lst", "w") as f:
    duration = float(check_output("soxi -D " + audio_fpath, shell=True))
    f.write("%d %s %s\n" % (0, audio_fpath, duration))

### Inference

In [None]:
# Create a dummy lexicon (it is not used when we decode greedily):
!echo 'a a |' > lexicon.txt

In [None]:
# Wrapped in python helpers
for transcript in run_inference("/content/test_audio.wav"):
    print(transcript)

This is a ie song sanet.


In [None]:
# Directly from the command line
!/content/flashlight/build/bin/asr/fl_asr_test \
    --test=audio.lst \
    --am=/content/models/wav2vec/checkpoint_cv_finetune.bin \
    --tokens=/content/models/wav2vec/tokens-all.lst \
    --arch=flashlight/build/mling_large.so \
    --lexicon=lexicon.txt \
    --datadir=''  \
    --emission_dir='' \
    --show
    # --logtostderr=1 \
    # --minloglevel=0

## 🤫 OpenAI whisper

This section uses [OpenAI's whisper](https://github.com/openai/whisper) model.

This model presents several advantages over the previous approaches:

 - multi-language
 - multi-task: it detects the spoken language and can translate speech directly into English
 - no-speech detection
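
The three points above can be sketched against whisper's Python API. This is a minimal sketch, assuming `model` is the object returned by `whisper.load_model` and that the result dict carries the `language`, `text` and per-segment `no_speech_prob` fields. `transcribe_with_details` and `summarize` are hypothetical helper names, and the 0.6 cut-off is only an illustrative threshold. The demo runs on a hand-written stand-in dict, so no model download is needed.

```python
def transcribe_with_details(model, audio_path: str, translate: bool = False) -> dict:
    """Run whisper and return its raw result dict.

    `model` is assumed to come from `whisper.load_model(...)`.
    task="translate" asks for an English translation instead of a transcript.
    """
    task = "translate" if translate else "transcribe"
    return model.transcribe(audio_path, task=task)


def summarize(result: dict, no_speech_threshold: float = 0.6) -> dict:
    """Collect the detected language, the text and the segments that look like speech."""
    spoken = [
        seg["text"].strip()
        for seg in result.get("segments", [])
        if seg.get("no_speech_prob", 0.0) < no_speech_threshold
    ]
    return {
        "language": result.get("language"),
        "text": result.get("text", "").strip(),
        "spoken_segments": spoken,
    }


# Hand-written stand-in for a whisper result, for illustration only
fake_result = {
    "language": "es",
    "text": " Hola mundo.",
    "segments": [
        {"text": " Hola mundo.", "no_speech_prob": 0.02},
        {"text": " ...", "no_speech_prob": 0.93},  # likely silence or noise
    ],
}
print(summarize(fake_result))
```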



### Setup

In [None]:
MODEL = "whisper"
!apt install -q ffmpeg
!pip install -q git+https://github.com/openai/whisper.git 

### Models


In [5]:
import whisper


model_name = 'medium' #@param ["tiny", "base", "small", "medium", "large"]

print(f"Loading whisper model '{model_name}'")
model = whisper.load_model(model_name)

Loading whisper model 'medium'


In [6]:
def run_inference(mp3_file: str):
    # `model` is the module-level whisper model loaded above
    res = model.transcribe(mp3_file)
    return res["text"]

## 🙌 BONUS: Telegram Bot 🤖

We run a simple Telegram bot as a PoC of STT as a service via audio messages using [pyTelegramBotAPI](https://github.com/eternnoir/pyTelegramBotAPI).

In [None]:
!pip install -qq -U \
    pyTelegramBotAPI \
    rich

In [None]:
import datetime as dt
import telebot
import requests

from rich import print as pprint


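# Read the token from the environment instead of hard-coding the secret
# (TELEGRAM_BOT_TOKEN is an assumed variable name; `os` is imported in Helpers)
BOT_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]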

bot = telebot.TeleBot(BOT_TOKEN, parse_mode="MARKDOWN")


def handle_audio_message(message):

    now = "  ".join(dt.datetime.now().isoformat().split(".")[0].split("T"))

    if message.content_type == "voice":
        msg = f"👂 Received a {message.voice.duration}s voice note. Transcribing..."
        print(msg)
        ack_reply = bot.send_message(message.chat.id, msg)
        file_info = bot.get_file(message.voice.file_id)
    else:
        # Only voice notes are supported for now: reply and stop here,
        # otherwise `ack_reply` would be undefined further down.
        bot.reply_to(message, "😓 Sorry can't handle audio clips yet...")
        return
    
    try:
        # Fetch the audio file    
        audio_file = requests.get(
            f'https://api.telegram.org/file/bot{BOT_TOKEN}/{file_info.file_path}'
        )

        with tempfile.NamedTemporaryFile(mode='wb', suffix='.ogg') as f:
            # write audio to disk and flush so ffmpeg sees the whole file
            in_file = f.name
            f.write(audio_file.content)
            f.flush()

            if MODEL == "W2L":
                # Convert to wav (-y overwrites a stale output file)
                out_file = f.name.replace(".ogg", ".wav")
                create_process(
                    f'ffmpeg -y -i {in_file} -acodec pcm_s16le -ar 16000 {out_file}'
                ).communicate()  # drains the pipes while waiting for ffmpeg
            elif MODEL == "whisper":
                # Convert to mp3
                out_file = f.name.replace(".ogg", ".mp3")
                create_process(
                    f'ffmpeg -y -i {in_file} {out_file}'
                ).communicate()

            # transcribe
            transcript = run_inference(out_file)
            if isinstance(transcript, list):
                transcript = "\n".join(transcript)

            text = f"**{now}**\n\n" + transcript

            # Delete ack message and send transcript as a reply
            bot.delete_message(message.chat.id, ack_reply.id)
            bot.reply_to(message, text)

    except Exception as e:
        print(f"🚨 Error! {e}")
        bot.reply_to(message, f"🚨 Error! {e}")


@bot.message_handler(commands=['start', 'help'])
def send_welcome(message):
    bot.reply_to(message, "Hey, let's start. What are your thoughts?")


@bot.message_handler(func=lambda message: True)
def echo_all(message):
    bot.reply_to(message, message.text)


@bot.message_handler(content_types=['audio', 'voice'])
def handle_docs_audio(message):
    handle_audio_message(message)


# getMe
me = bot.get_me()
print(f"Running bot with ID: {me.id} | Name: {me.username}")

# Run polling
bot.infinity_polling()

Running bot with ID: 5191934564 | Name: pensabox_bot
👂 Received a 2s voice note. Transcribing...
👂 Received a 3s voice note. Transcribing...
👂 Received a 3s voice note. Transcribing...
👂 Received a 9s voice note. Transcribing...
👂 Received a 13s voice note. Transcribing...


In [None]:
%%bash
for i in *.ogg; do
    ffmpeg -i "$i" -acodec pcm_s16le "${i%ogg}wav"
done
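
The same batch conversion can be sketched in Python with an argument list, which sidesteps shell quoting for file names with spaces. This is a minimal sketch under a few assumptions: `ffmpeg_to_wav_cmd` and `convert_all` are hypothetical helper names, the 16 kHz sample rate is borrowed from the bot's W2L branch, and `-y` is added so existing outputs are overwritten without a prompt.

```python
import shutil
import subprocess
from pathlib import Path


def ffmpeg_to_wav_cmd(src: Path, sample_rate: int = 16000) -> list:
    """Build the ffmpeg argument list for a 16-bit PCM WAV conversion."""
    dst = src.with_suffix(".wav")
    return [
        "ffmpeg", "-y",            # overwrite existing output without prompting
        "-i", str(src),
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # output sample rate in Hz
        str(dst),
    ]


def convert_all(folder: str = ".") -> list:
    """Convert every .ogg file in `folder`; returns the commands that ran."""
    if shutil.which("ffmpeg") is None:
        return []  # ffmpeg is not installed, nothing to do
    ran = []
    for src in sorted(Path(folder).glob("*.ogg")):
        cmd = ffmpeg_to_wav_cmd(src)
        subprocess.run(cmd, check=True, capture_output=True)
        ran.append(cmd)
    return ran


# Show the command for a hypothetical file name without running ffmpeg
print(ffmpeg_to_wav_cmd(Path("voice_note.ogg")))
```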