# A tutorial on obtaining accurate speech-to-text alignment for long audio and noisy text

This tutorial consists of two parts. 
- The first part corresponds to the paper *Less Peaky And More Accurate CTC Forced Alignment by Label Priors*, published at ICASSP 2024. We will demonstrate how to obtain more accurate speech-to-text alignment than with a standard CTC model.
- In the second part, we will provide a robust PyTorch-based speech-to-text alignment library to align long audio and noisy text. For example, we will align the whole book, [Walden by Henry David Thoreau](https://www.gutenberg.org/cache/epub/205/pg205-images.html) (of 115K words), with one chapter of its audiobook (of 30 minutes in this demo, or even longer) in the [LibriVox project](https://librivox.org/walden-by-henry-david-thoreau/).

## Part 1: obtaining more accurate CTC alignment by label priors

## Part 2: obtaining robust alignment for long audio and noisy text

In Part 1, we performed forced alignment at the utterance level. In practice, we rarely have a short audio segment (e.g., 10 seconds) paired with its exact, verbatim transcription, as in a laboratory setting (e.g., the [LIBRISPEECH](https://www.openslr.org/12) corpus). Instead, the audio comes in long form (e.g., a one-hour mp3 recording of speech), and the transcription of the whole recording can be noisy and non-verbatim, so it may not exactly match what was spoken. To use such raw speech data for machine learning, we usually need to segment it into a corpus of short audio clips. In other applications, we simply want to align as much of the long audio and text as possible. In this tutorial, we will provide a Python library to support such use cases.

Here, we are facing two challenges:
- The audio is long, so it may not be possible to process it as a whole because of, e.g., limited CPU/GPU memory.
- The transcript is noisy. It can be a partial transcript with some missing words. It may have significant errors. It may also contain extra content that is not spoken in the audio (e.g., because the corresponding audio is corrupted). It can be a combination of all these cases. Thus, the conventional, basic forced alignment algorithm can produce very poor alignments, as it assumes that the audio and text match exactly.
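
To make the first challenge concrete, here is a minimal sketch of one common way to bound memory use: cutting a long recording into fixed-size, overlapping chunks that can be processed one at a time. This is an illustration only and not necessarily how the library handles long audio. The helper name `split_into_chunks`, the 16 kHz sample rate, the 30-second chunk size, and the 2-second overlap are all assumptions chosen for the example.

```python
def split_into_chunks(num_samples, chunk_size, overlap):
    """Return (start, end) sample ranges that cover [0, num_samples).

    Consecutive ranges share `overlap` samples. Hypothetical helper,
    written for illustration only.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if num_samples <= 0:
        return []
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, num_samples)  # last chunk may be shorter
        chunks.append((start, end))
        if end == num_samples:
            break
        start += step
    return chunks


SAMPLE_RATE = 16_000                   # assumed sample rate (Hz)
num_samples = 30 * 60 * SAMPLE_RATE    # a 30-minute recording
chunks = split_into_chunks(
    num_samples,
    chunk_size=30 * SAMPLE_RATE,       # assumed 30-second chunks
    overlap=2 * SAMPLE_RATE,           # assumed 2-second overlap
)
print(len(chunks), chunks[0], chunks[-1])
```

The overlap gives each chunk some context around its boundaries, at the cost of processing a small fraction of the audio twice.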

There are a few existing solutions:
- [Kaldi](https://ieeexplore.ieee.org/document/8268956), [Gentle](https://github.com/lowerquality/gentle) and [this work](https://ieeexplore.ieee.org/document/7404861) employ a weighted finite state transducer (WFST) framework to model the noisy texts. 
- [WhisperX](https://github.com/m-bain/whisperX) uses an attention mechanism to propose rough timestamps for uniformly segmented audio. Then, it performs phone-level or word-level forced alignment with an external aligner.
- [MMS](https://arxiv.org/abs/2305.13516) uses a special `<star>` token to handle missing words in the transcript.
- [SailAlign](https://www.semanticscholar.org/paper/SailAlign%3A-Robust-long-speech-text-alignment-Katsamanis-Georgiou/0b7f86429641b188cc62ec32eee590e8795a3d02) iteratively identifies reliable regions and then narrows down to align the remaining unaligned regions.

This tutorial is based on WFST and thus falls into the first category. Our implementation is based on PyTorch. Any CTC model in PyTorch can be equipped with our library to become a robust aligner.

### Install dependencies

For WFST, our library depends on [k2](https://github.com/k2-fsa/k2/), a PyTorch-based WFST library.

In [None]:
# Check the Python, PyTorch, and CUDA versions
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.version.cuda}")
!python --version

In [None]:
# The "+cuda11.7.torch2.0.1" suffix must match the CUDA and PyTorch versions printed above
!pip install k2==1.24.4.dev20240223+cuda11.7.torch2.0.1 -f https://k2-fsa.github.io/k2/cuda.html
!pip install cmudict g2p_en
!pip install git+https://github.com/huangruizhe/lis.git
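
The pinned k2 version string appears to encode the k2 release together with the CUDA and PyTorch versions it was built for. As a rough sketch, the requirement string for a different environment could be assembled as below. The helper name `k2_requirement` is hypothetical, and the pattern `k2==<release>+cuda<cuda>.torch<torch>` (or `+cpu.torch<torch>` for CPU-only builds) is an assumption inferred from the pinned version. Whether a wheel actually exists for a given combination still has to be checked on the k2 wheel index.

```python
def k2_requirement(k2_release, torch_version, cuda_version=None):
    """Build a pip requirement string for a k2 wheel (hypothetical helper).

    torch_version: e.g. torch.__version__, which may end in "+cu117"
    cuda_version:  e.g. torch.version.cuda, or None for a CPU-only build
    """
    # Drop any local suffix such as "+cu117" and keep only "2.0.1"
    torch_release = torch_version.split("+")[0]
    backend = f"cuda{cuda_version}" if cuda_version else "cpu"
    return f"k2=={k2_release}+{backend}.torch{torch_release}"


gpu_spec = k2_requirement("1.24.4.dev20240223", "2.0.1+cu117", "11.7")
cpu_spec = k2_requirement("1.24.4.dev20240223", "2.0.1", None)
print(gpu_spec)
print(cpu_spec)
```

In a notebook, the two version arguments could be taken from `torch.__version__` and `torch.version.cuda`.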

In [None]:
!git clone xxx
import sys
sys.path.append('xxx')

### Prepare long audio and noisy text

We will demonstrate aligning the whole book, [Walden by Henry David Thoreau](https://www.gutenberg.org/cache/epub/205/pg205-images.html) (of 115K words), with one chapter of its audiobook (of 30 minutes) in the [LibriVox project](https://librivox.org/walden-by-henry-david-thoreau/).

In [None]:
# Download the whole book
import requests
from bs4 import BeautifulSoup

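url = "https://www.gutenberg.org/cache/epub/205/pg205-images.html"
response = requests.get(url, timeout=60)
response.raise_for_status()  # Stop early if the download failed

# Strip the HTML markup and keep only the visible text
soup = BeautifulSoup(response.text, "html.parser")
text = soup.get_text()

# Normalize Windows-style line endings
text = text.replace("\r\n", "\n")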

In [None]:
# Download a chapter of the audio book
!wget https://ia800707.us.archive.org/20/items/walden_librivox/walden_c07.mp3

In [None]:
# Play the audio
from IPython.display import Audio
Audio("walden_c07.mp3")

In [None]:
# Preview the transcript
print(text[:1000])

In [None]:
# Preview 1,000 characters starting at character offset 271400
print(text[271400:271400 + 1000])

As we can see above, the audio contains a header, "This is a LibriVox recording ...", that does not appear in the text. On the other hand, as we have downloaded the whole book, the text contains a lot of extra content that is not spoken in the audio. Standard forced alignment, which assumes that the audio and text match exactly, will therefore not work in this case.
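
The mismatch can be illustrated with a text-only analogy. The sketch below uses Python's standard `difflib.SequenceMatcher` to compare a made-up "spoken" word sequence with a made-up "written" one. It is not speech-to-text alignment and not the WFST method used in this tutorial. Both word lists are invented for the example.

```python
from difflib import SequenceMatcher

# Toy word sequences, made up for illustration (not taken from the real data).
# "spoken" starts with a header that the text lacks; "written" has extra
# text before and after the part that is actually spoken.
spoken = "this is a librivox recording the bean field meanwhile my beans".split()
written = (
    "economy when i wrote the following pages "
    "the bean field meanwhile my beans the village"
).split()

# autojunk=False disables the popularity heuristic, so every word is considered.
matcher = SequenceMatcher(a=spoken, b=written, autojunk=False)

# Show which spans agree ("equal") and which do not line up.
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:8s} spoken={spoken[i1:i2]} written={written[j1:j2]}")

# Collect the spoken words that found a counterpart in the written text.
matched = [
    spoken[i]
    for block in matcher.get_matching_blocks()
    for i in range(block.a, block.a + block.size)
]
print(matched)
```

Only the shared middle span is matched. The unspoken text and the untranscribed header fall outside it, and a robust aligner needs to tolerate exactly this kind of mismatch.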

### Use WFST to represent the text

### Handle long audio