# Long-form audio transcription using Citrinet
Modified from:
- https://colab.research.google.com/gist/titu1994/a44fffd459236988ee52079ff8be1d2e/long-audio-transcription-citrinet.ipynb?pli=1#scrollTo=rZITgro3DC_v
- https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html
- [Citrinet: Closing the Gap between Non-Autoregressive and Autoregressive End-to-End Models for Automatic Speech Recognition](https://arxiv.org/abs/2104.01721)

> Long-form audio transcription is an interesting application of ASR. Generally, models are trained on short audio clips of 15-20 seconds. If an ASR model is compatible with streaming inference, it can then be evaluated on audio clips much longer than the training duration.

> Generally, streaming inference incurs a small increase in WER due to the loss of long-term context. Think of it this way: if a streaming model has a context window of a few seconds of audio, then even when it streams audio clips several minutes long, later transcriptions have lost some of the prior context.
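
To make the context-loss point concrete, here is a minimal sketch of how a long clip might be cut into fixed windows for chunked inference. The function name `chunk_boundaries` and the window and overlap values are illustrative assumptions, not part of NeMo. Each window would be transcribed on its own, so the model never sees audio outside that window.

```python
def chunk_boundaries(total_seconds, window_seconds, overlap_seconds=0.0):
    """Return (start, end) times of fixed-size windows covering a clip."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= overlap_seconds < window_seconds:
        raise ValueError("overlap_seconds must be in [0, window_seconds)")

    hop = window_seconds - overlap_seconds  # distance between window starts
    chunks = []
    start = 0.0
    while start < total_seconds:
        end = min(start + window_seconds, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break  # the last window already reaches the end of the clip
        start += hop
    return chunks


# Hypothetical example: a 60-second clip, 20-second windows, 5 seconds of overlap.
chunks = chunk_boundaries(60.0, 20.0, overlap_seconds=5.0)
print(chunks)
```

The overlap is one common way to soften boundary effects, but it does not restore context from minutes earlier, which is the loss described above.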

> In this demo, we consider the naive case of long-form audio transcription and ask: in offline mode (i.e. when the model is given the entire audio sequence at once), what is the maximum duration of audio that it can transcribe?

> For the purposes of this demo, we will test the limits of Citrinet models [(arxiv)](https://arxiv.org/abs/2104.01721), which are purely convolutional ASR models.

> Unlike general attention-based models, convolutional models don't have a quadratic cost to their context window, but they also miss out on the global context offered by the attention mechanism. Citrinet instead attains relatively long context by replacing attention with [Squeeze-and-Excitation modules](https://arxiv.org/abs/1709.01507) between its blocks.

## Install NeMo
- https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/starthere/intro.html#installation

In [1]:
# !apt-get update && apt-get install -y libsndfile1 ffmpeg
# !pip install Cython
# !pip install nemo_toolkit[all]

In [2]:
# BRANCH = 'main'
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH #egg=nemo_toolkit[all]
# !pip install inflect
# print("Finished installing nemo !")

In [3]:
import nemo.collections.asr as nemo_asr


## Download audio from YouTube

> In order to make the task slightly more difficult, we will attempt to transcribe an entire podcast at once. 

> Why a podcast? Podcasts are generally long verbal discussions between one or more people on a specific topic, the domain of discussion is unlikely to match the model's training corpus (unless the training corpus is vast), and they may include background audio or sponsorship information in between the discussion.

**Below, please give your permission to download the audio clip and the transcript from the Screaming in the Cloud podcast mentioned above.**

In [4]:
# !pip install ffmpeg-python
# !pip install ffmpeg
# !pip install ffprobe

In [5]:
import os
from os.path import exists as path_exists

# Create the output directory for transcripts if it does not exist yet
os.makedirs('transcripts', exist_ok=True)

In [6]:
YouTubeID = 'gFFLJaQbLCM' 
OutputFile = 'test_audio_youtube.m4a'

In [7]:
if not path_exists(OutputFile):
    !youtube-dl -o $OutputFile $YouTubeID --extract-audio --restrict-filenames -f 'bestaudio[ext=m4a]'

[youtube] ns42RRd9BIg: Downloading webpage
[download] data/raw_audio.mp3 has already been downloaded
[download] 100% of 12.27MiB
[ffmpeg] Post-process file data/raw_audio.mp3 exists, skipping


## Preprocess Audio

> We now have the raw audio file (in mp3 format) from the podcast. To make this audio file compatible with the model (mono-channel, 16 kHz audio), we will use FFmpeg to preprocess this file.

In [8]:
import sys
import glob 
import subprocess

def transcode(input_dir, output_format, sample_rate, skip, duration):
    files = glob.glob(os.path.join(input_dir, "*.*"))

    # Filter out additional directories
    files = [f for f in files if not os.path.isdir(f)]

    output_dir = os.path.join(input_dir, "processed")

    if not os.path.exists(output_dir):
        print(f"Output directory {output_dir} does not exist, creating ...")
        os.makedirs(output_dir)

    for filepath in files:
        output_filename = os.path.basename(filepath)
        output_filename = os.path.splitext(output_filename)[0]

        output_filename = f"{output_filename}_processed.{output_format}"

        args = [
            'ffmpeg',
            '-i',
            str(filepath),
            '-ar',
            str(sample_rate),
            '-ac',
            str(1),
            '-y'
        ]

        if skip is not None:
            args.extend(['-ss', str(skip)])

        if duration is not None:
            # '-t' limits the output duration; '-to' would set an end position instead
            args.extend(['-t', str(duration)])

        args.append(os.path.join(output_dir, output_filename))
        # Pass the argument list directly so that paths containing spaces survive
        subprocess.run(args, check=True)

    print("\n")
    print(f"Finished transcoding {len(files)} audio files")


In [9]:
transcode(
        input_dir="./data/",
        output_format="wav",
        sample_rate=16000,
        skip=None,
        duration=None,
    )

ffmpeg version 4.4.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 9.4.0 (GCC)
  configuration: --prefix=/home/conda/feedstock_root/build_artifacts/ffmpeg_1636205340875/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac --cc=/home/conda/feedstock_root/build_artifacts/ffmpeg_1636205340875/_build_env/bin/x86_64-conda-linux-gnu-cc --disable-doc --disable-openssl --enable-avresample --enable-demuxer=dash --enable-gnutls --enable-gpl --enable-hardcoded-tables --enable-libfreetype --enable-libopenh264 --enable-vaapi --enable-libx264 --enable-libx265 --enable-libaom --enable-libsvtav1 --enable-libxml2 --enable-libvpx --enable-pic --enable-pthreads --enable-shared --disable-static --enable-version3 --enable-zlib --enable-libmp3lame --pkg-config=/home/conda/feedstock_root/build_artifacts/ffmpeg_1636205340875/_build_env/bin/pkg-config
  lib
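
As a quick sanity check after transcoding, the header of the processed file can be inspected with Python's standard `wave` module. This is a minimal sketch that assumes ffmpeg wrote a standard PCM WAV file. The helper names `describe_wav` and `is_model_ready` are made up for illustration.

```python
import wave


def describe_wav(path):
    """Read channel count, sample rate and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / float(sample_rate)
    return channels, sample_rate, duration


def is_model_ready(path, expected_rate=16000):
    """Return True if the file is mono and sampled at the expected rate."""
    channels, sample_rate, _ = describe_wav(path)
    return channels == 1 and sample_rate == expected_rate
```

For the file produced above, the check would be `is_model_ready("data/processed/raw_audio_processed.wav")`.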

# Transcribe the processed audio file

Now that we have a "ground truth" text transcript we can compare against, let's actually transcribe the podcast with a model!

## Helper methods

We define a few helper methods to enable automatic mixed precision if it is available on the Colab GPU (if a GPU is being used at all).

In [10]:
import contextlib
import torch

In [11]:
# Helper for torch amp autocast
if torch.cuda.is_available():
    autocast = torch.cuda.amp.autocast
else:
    @contextlib.contextmanager
    def autocast():
        print("AMP was not available, using FP32!")
        yield

In [12]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## Instantiate a model
- https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/results.html

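A small model such as Citrinet 256 is a good starting point, since it offers good transcription accuracy with only about 10 M parameters. Note that the cell below loads the Spanish `stt_es_citrinet_512` checkpoint rather than one of the English models listed here.

**Feel free to change to the medium and larger sized models!**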

 - small = "stt_en_citrinet_256" (9.8 M parameters)
 - medium = "stt_en_citrinet_512" (38 M parameters)
 - large = "stt_en_citrinet_1024" (142 M parameters)

In [40]:
# model = nemo_asr.models.EncDecCTCModel.from_pretrained("stt_es_quartznet15x5", map_location=device)

In [31]:
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_es_citrinet_512", map_location=device)

In [32]:
model = model.to(device)

## Transcribe audio

Here, we simply call the model's "transcribe()" method, which does offline transcription of a provided list of audio clips.

In [33]:
%%time

audio_path = "data/processed/raw_audio_processed.wav"
transcribed_filepath = "transcripts/normalized/transcribed_speech.txt"

if os.path.exists(transcribed_filepath):
    print(f"File already exists, delete {transcribed_filepath} manually before re-transcribing the audio !")
    # Reuse the saved transcript so that `transcript` is defined for the cells below
    with open(transcribed_filepath, 'r', encoding='utf-8') as f:
        transcript = f.read().rstrip("\n")

else:
    with autocast():
        transcript = model.transcribe([audio_path], batch_size=1)[0]

File already exists, delete transcripts/normalized/transcribed_speech.txt manually before re-transcribing the audio !
CPU times: user 519 µs, sys: 185 µs, total: 704 µs
Wall time: 451 µs


## Write transcription

In [34]:
# Create the output directory (and any parents) if it is missing
os.makedirs(os.path.dirname(transcribed_filepath), exist_ok=True)

In [35]:
with open(transcribed_filepath, 'w', encoding='utf-8') as f:
    f.write(f"{transcript}\n")

In [36]:
assert False

AssertionError: 

# Compute accuracy of transcription

Now that we have the model's transcribed result, we compare the WER and CER against the "ground truth" transcription that we preprocessed earlier.

In [37]:
model_transcript = transcribed_filepath

In [38]:
# Read the transcript back with the same encoding it was written in
with open(model_transcript, 'r', encoding='utf-8') as f:
    transcription_txt = f.readlines()
    transcription_txt = [text.replace("\n", "") for text in transcription_txt]

In [39]:
transcription_txt

['sois maría esperanza caszúna soy unndrés malhamut y esto es agora ocaste conversación política el diario haz buen día andrés como estás todo la me festejando qué estamos festejando es la democracia no como lo vemos al lo cortemos una oportunidad para pelearnos soy bueno a la democracia yo la veo bien vos como la vez lo que me interesa es pensar qué cosas logramos y qué cosas sin embargo parecen ser un bulímita no si puedo citar a raúl alfonsín con la democracia se come y con la democracia se educca no se con la democracia somos un montón de cosas buenas pero hay ciertas cosas que no parece que logramos poder hacer buen punto de comienzo si tengo de sientetillarlo diré que la democracia evita que nos matemos pero no garantiza que vibamos bien y esto lo que tenemos después de cuanto vamos treinta y ocho años treinta ocho años en que dejamo de matarnos por razones políticas y no solamente esto quiero ir un poco más allá porque la socie argentina sigue siendo una de las menos violentas d

-----
The model did fairly well, considering it wasn't trained on any corpus with technical terms (the training corpus consists only of publicly available speech datasets). Furthermore, the ground truth preprocessing is not sufficient in some cases, but for a demonstration it's a reasonable effort.
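
For reference, WER and CER are word-level and character-level edit distances divided by the length of the reference. The following is a minimal pure-Python sketch of that computation, not the evaluation code used for this notebook. The function names and the example strings are made up for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1       # reference item missing from hypothesis
            insertion = current[j - 1] + 1   # extra item in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up strings for illustration only.
reference = "the model transcribes long audio"
hypothesis = "the model transcribe long audio"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
print(f"CER: {char_error_rate(reference, hypothesis):.3f}")
```

Both texts should be normalized the same way (casing, punctuation, numerals) before comparison, otherwise formatting differences are counted as errors.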

# [Extra] Seeking the upper limit of audio sequence length

So we were able to transcribe a podcast clip of more than 12 minutes and obtain a moderately accurate transcript. While this was great for a first effort, it raises the question:

**Given 16 GB of memory on a GPU, what is the upper bound of audio duration that can be transcribed by a Citrinet model in a single forward pass?**

In [23]:
import librosa
import datetime
import math
import gc

original_duration = librosa.get_duration(filename=audio_path)
print("Original audio duration :", datetime.timedelta(seconds=original_duration))

Original audio duration : 0:12:36.459687


In order to extend the audio duration, we will concatenate the same audio clip multiple times, and then trim off any excess duration from the clip as needed.

For convenience, we provide a scalar multiplier to the original audio duration.
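
As a worked example of this arithmetic, the sketch below starts from the duration printed above (0:12:36.459687) and a multiplier of 3.5. The variable names are illustrative.

```python
import datetime
import math

# Duration reported for the processed clip above (0:12:36.459687), in seconds.
original_duration = 12 * 60 + 36.459687
num_repeats = 3.5  # scalar multiplier applied to the original duration

target_duration = original_duration * num_repeats
copies_needed = math.ceil(num_repeats)  # whole copies to concatenate
excess = copies_needed * original_duration - target_duration  # amount to trim

print("Target duration :", datetime.timedelta(seconds=target_duration))
print("Copies needed   :", copies_needed)
print("Excess to trim  :", datetime.timedelta(seconds=excess))
```

With a multiplier of 3.5, four copies are concatenated and half of one copy is trimmed off the end.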

In [24]:
# concatenate the file multiple times
NUM_REPEATS = 3.5
new_duration = original_duration * NUM_REPEATS

# write a temp file
with open('audio_repeat.txt', 'w') as f:
    for _ in range(int(math.ceil(NUM_REPEATS))):
        f.write(f"file {audio_path}\n")

Duplicate the audio several times, then trim off the required duration from the concatenated audio clip.

In [25]:
repeated_audio_path = "data/processed/concatenated_audio.wav"

!ffmpeg -y -f concat -i audio_repeat.txt -c copy -t {new_duration} {repeated_audio_path}
print("Finished repeating audio file!")


In [26]:
original_duration = librosa.get_duration(filename=audio_path)
repeated_duration = librosa.get_duration(filename=repeated_audio_path)

print("Original audio duration :", datetime.timedelta(seconds=original_duration))
print("Repeated audio duration :", datetime.timedelta(seconds=repeated_duration))

Original audio duration : 0:12:36.459687
Repeated audio duration : 0:44:07.619062


Attempt to transcribe it (Note this may OOM!)

In [27]:
# Clear up memory
torch.cuda.empty_cache()
gc.collect()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# device = 'cpu'  # You can transcribe even longer samples on the CPU, though it will take much longer !
model = model.to(device)

In [28]:
%%time

with autocast():
    transcript_repeated = model.transcribe([repeated_audio_path], batch_size=1)[0]
    del transcript_repeated

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

CPU times: user 2.16 s, sys: 1.47 s, total: 3.62 s
Wall time: 3.76 s
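
Instead of hand-editing the multiplier, the upper limit could be searched for by bisection. The sketch below is hypothetical: `find_max_duration` and `fake_fits` are made-up names, and `fake_fits` merely stands in for a real attempt. In practice that attempt would build a clip of the given length, call `model.transcribe` on it, catch the out-of-memory error, and report whether the call succeeded.

```python
def find_max_duration(fits, low, high, tolerance=1.0):
    """Bisect for the longest duration (in seconds) for which fits() is True.

    Assumes fits(low) is True, fits(high) is False, and that fitting is
    monotonic: if a duration fits, every shorter duration fits as well.
    """
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if fits(middle):
            low = middle   # middle fits, so the limit is at least this long
        else:
            high = middle  # middle fails, so the limit is shorter than this
    return low


# Stand-in for a real attempt: pretends anything up to two hours fits in memory.
def fake_fits(duration_seconds):
    return duration_seconds <= 7200.0


best = find_max_duration(fake_fits, low=600.0, high=36000.0, tolerance=1.0)
print(f"Longest duration that fit: about {best:.0f} s")
```

Bisection needs only about log2((high - low) / tolerance) attempts, which matters when each attempt is a full forward pass over a long clip.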


Given a large amount of GPU memory, the Citrinet model can transcribe long audio segments efficiently, without the need for streaming inference.

This is possible for a simple reason: no attention mechanism is used, and the Squeeze-and-Excitation mechanism does not have quadratic memory requirements yet still provides reasonable global context information.
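
A back-of-the-envelope comparison illustrates the scaling argument. The sketch below contrasts the size of a single T x T attention score matrix with the size of a squeezed SE context vector, which holds one value per channel regardless of T. The frame rate (12.5 encoder frames per second, i.e. 10 ms features with 8x downsampling), the channel width of 512, and the 2-byte values are assumptions chosen for illustration, not measurements from this notebook.

```python
def attention_score_bytes(num_frames, bytes_per_value=2):
    """Memory for one T x T attention score matrix (one head, one layer)."""
    return num_frames * num_frames * bytes_per_value


def se_context_bytes(num_channels, bytes_per_value=2):
    """Memory for one squeezed SE context vector: one value per channel."""
    return num_channels * bytes_per_value


FRAMES_PER_SECOND = 12.5  # assumption: 10 ms features with 8x downsampling
NUM_CHANNELS = 512        # assumption: channel width of one layer

for minutes in (1, 10, 44):
    frames = int(minutes * 60 * FRAMES_PER_SECOND)
    attention_mib = attention_score_bytes(frames) / 2**20
    se_kib = se_context_bytes(NUM_CHANNELS) / 2**10
    print(f"{minutes:>3} min: {frames:>6} frames, "
          f"attention scores ~{attention_mib:,.1f} MiB, "
          f"SE context ~{se_kib:.1f} KiB")
```

The convolutional activations themselves still grow linearly with the number of frames, so memory is not constant in duration. The point is only that nothing in the network grows quadratically.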