# OWSM-CTC with CTCSegmentation for Irish

> "tl;dr: OWSM-CTC is good enough for alignment for Irish"

author: Jim O' Regan, KTH, Sweden

- branch: master
- toc: false
- categories: [owsm, ctc, alignment, irish]

In [2]:
import requests
from bs4 import BeautifulSoup

def get_page_text_and_audio(url, poetry=True):
    req = requests.get(url)
    if req.status_code != 200:
        return None
    soup = BeautifulSoup(req.text, 'html.parser')

    page_text = soup.find("div", {"class": "page-text"})

    audio_file = ""
    audio = page_text.find("audio")
    if audio is not None:
        source = audio.find("source")
        if source is not None:
            audio_file = "https://www.leighleat.com" + source["src"]
        # Drop the player element so its contents don't leak into the page text
        audio.decompose()

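    # Poetry pages are kept one verse line per text line. Other layouts have no
    # special handling yet, so they get the same plain-text extraction.
    out_text = page_text.text.strip()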

    return out_text, audio_file

In [3]:
page_text, audio_url = get_page_text_and_audio("https://www.leighleat.com/poems/26")

In [4]:
page_text

'Damhán Alla\nDamhán alla\nDamhán alla\nAr an mballa\nAr an mballa\nTháinig éan\nTháinig éan\nÓ mo léan\nÓ mo léan!'

In [5]:
audio_file = audio_url.split("/")[-1]
!wget {audio_url} -O {audio_file}

--2025-02-20 16:05:09--  https://www.leighleat.com/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaHBBdmdNIiwiZXhwIjpudWxsLCJwdXIiOiJibG9iX2lkIn19--1e2441aa5cfdfdc2ed88fafc4a1ed354739f6af6/damhan%20alla.mp3
Resolving www.leighleat.com (www.leighleat.com)... 15.197.149.68, 76.223.57.73, 3.33.241.96, ...
Connecting to www.leighleat.com (www.leighleat.com)|15.197.149.68|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://leigh-leat-sadhbh.s3.eu-west-1.amazonaws.com/cirikxcsa8kh3jlojzq05x6lsc7z?response-content-disposition=attachment%3B%20filename%3D%22damhan%20alla.mp3%22%3B%20filename%2A%3DUTF-8%27%27damhan%2520alla.mp3&response-content-type=audio%2Fmpeg&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIASIHEIVVZTT46GREG%2F20250220%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20250220T150510Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=185addc24ad6eaee980041d300b9a7c6acea913289b8388536792238e4aff2f6 [following]
--202

In [6]:
audio_file

'damhan%20alla.mp3'
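The file name keeps the URL's percent-encoding (`%20` for the space). If a readable name is preferred, `urllib.parse.unquote` from the standard library can decode it. This is only a sketch: the rest of the notebook keeps the encoded name, a name containing a space would need quoting in the shell commands that follow, and the `local_name` helper and the example URL are hypothetical.

```python
from urllib.parse import unquote, urlsplit

def local_name(url: str) -> str:
    """Hypothetical helper: last path component of a URL, percent-decoded."""
    path = urlsplit(url).path
    return unquote(path.rsplit("/", 1)[-1])

# Example URL made up for illustration, not the real download link
print(local_name("https://example.org/blobs/damhan%20alla.mp3"))  # damhan alla.mp3
```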

In [None]:
%%capture
wav_file = audio_file.replace(".mp3", ".wav")
!ffmpeg -i {audio_file} -acodec pcm_s16le -ac 1 -ar 16000 {wav_file}

ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab
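The ffmpeg options ask for 16-bit PCM (`-acodec pcm_s16le`), one channel (`-ac 1`) and a 16 kHz sample rate (`-ar 16000`). A small sketch for checking that the converted file really has those properties, using only the standard-library `wave` module; the helper names are hypothetical and not part of the original notebook.

```python
import wave

def describe_wav(path):
    """Return (channels, sample_rate, bytes_per_sample) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth()

def is_16k_mono_pcm16(path):
    """True if the file is mono, 16 kHz, 16-bit, matching the ffmpeg settings."""
    return describe_wav(path) == (1, 16000, 2)
```

Calling `is_16k_mono_pcm16(wav_file)` after the conversion should return `True` if ffmpeg ran as intended.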

This was written for Colab, but the free GPU there doesn't have enough RAM for `CTCSegmentation`.

In [8]:
#%%capture
%pip install git+https://github.com/pyf98/espnet@owsm-ctc
%pip install espnet_model_zoo flash-attn

Collecting git+https://github.com/pyf98/espnet@owsm-ctc
  Cloning https://github.com/pyf98/espnet (to revision owsm-ctc) to /tmp/pip-req-build-_5_2mo6d
  Running command git clone --filter=blob:none --quiet https://github.com/pyf98/espnet /tmp/pip-req-build-_5_2mo6d
  Running command git checkout -b owsm-ctc --track origin/owsm-ctc
  Switched to a new branch 'owsm-ctc'
  Branch 'owsm-ctc' set up to track remote branch 'owsm-ctc' from 'origin'.
  Resolved https://github.com/pyf98/espnet to commit b4aa13f55b9a058e41fe97a9daac21ea8e8c8f83
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Note: you may need to restart the kernel to use updated packages.
Collecting flash-attn
  Using cached flash_attn-2.7.4.post1.tar.gz (6.0 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: [1

Not a single word of the transcription below was correct.

In [15]:
import soundfile as sf
import numpy as np
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


s2t = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.1_1B",
    device="cpu",       # I am too poor to have a good gpu on my home pc sadg
    generate_interctc_outputs=False,
    lang_sym='<gle>',
    task_sym='<asr>',
)

speech, rate = sf.read(wav_file)

speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)


Fetching 42 files: 100%|██████████| 42/42 [00:00<00:00, 347457.14it/s]
  with autocast(False):


('<gle><asr> Dbhháanlabhháanla do bháálla ar an malala ar an malalaái na gcéan thinigh gcéan ó maléonómléan.', ['<gle>', '<asr>', '▁D', 'bh', 'há', 'an', 'la', 'bh', 'há', 'an', 'la', '▁do', '▁b', 'há', 'ál', 'la', '▁ar', '▁an', '▁mala', 'la', '▁ar', '▁an', '▁mala', 'la', 'ái', '▁na', '▁g', 'cé', 'an', '▁th', 'in', 'igh', '▁g', 'cé', 'an', '▁', 'ó', '▁ma', 'lé', 'on', 'ó', 'm', 'lé', 'an', '.'], [41, 155, 1210, 29650, 25679, 378, 459, 29650, 25679, 378, 459, 260, 1046, 25679, 7651, 459, 1354, 235, 10656, 459, 1354, 235, 10656, 459, 22351, 373, 1476, 6769, 378, 3028, 328, 15634, 1476, 6769, 378, 181, 1055, 714, 3924, 441, 1055, 232, 3924, 378, 184], 'Dbhháanlabhháanla do bháálla ar an malala ar an malalaái na gcéan thinigh gcéan ó maléonómléan.', None)
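The call to `librosa.util.fix_length` pads the waveform with trailing zeros (or truncates it) to exactly 16000 × 30 = 480000 samples, that is, 30 seconds at 16 kHz, presumably because the model expects fixed-length input. A plain-Python sketch of that behaviour for a one-dimensional signal, using a hypothetical `pad_or_trim` helper rather than librosa itself:

```python
def pad_or_trim(samples, size):
    """Hypothetical stand-in for librosa.util.fix_length on a 1-D sequence."""
    samples = list(samples)
    if len(samples) >= size:
        return samples[:size]                        # too long: cut the tail
    return samples + [0.0] * (size - len(samples))   # too short: append silence

rate = 16000
target = rate * 30      # 30 seconds at 16 kHz
print(target)           # 480000
print(len(pad_or_trim([0.1, -0.2, 0.3], target)))  # 480000
```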


In [10]:
utt_text = [f"utt{x} {y}" for x, y in enumerate(page_text.split("\n"), start=1)]
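With `kaldi_style_text=True` (used further down), each line of text starts with an utterance ID, followed by a space and the transcript. Applied to the first three lines of the poem, the comprehension above produces the lines shown by this sketch. The blank-line filter and the `to_kaldi_lines` name are additions for illustration, on the assumption that empty utterances are unwanted.

```python
sample_text = "Damhán Alla\nDamhán alla\nDamhán alla"  # first three lines of page_text

def to_kaldi_lines(text):
    """Number each non-empty line as 'uttN <text>' (ID, space, transcript)."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return [f"utt{i} {line}" for i, line in enumerate(lines, start=1)]

for line in to_kaldi_lines(sample_text):
    print(line)
```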

In [16]:
!apt install git-lfs

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?


In [None]:
!git lfs install

Git LFS initialized.


In [17]:
!git clone https://huggingface.co/pyf98/owsm_ctc_v3.1_1B

Cloning into 'owsm_ctc_v3.1_1B'...
remote: Enumerating objects: 109, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 109 (delta 16), reused 0 (delta 0), pack-reused 82 (from 1)
Receiving objects: 100% (109/109), 1.51 MiB | 1.88 MiB/s, done.
Resolving deltas: 100% (47/47), done.
Filtering content: 100% (4/4), 7.49 GiB | 11.90 MiB/s, done.
Encountered 2 file(s) that may not have been copied correctly on Windows:
	exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_raw_bpe50000/valid.total_count.ave_5best.till40epoch.pth
	exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_raw_bpe50000/valid.total_count.ave_5best.till45epoch.pth

See: `git lfs help smudge` for more details.


In [18]:
!ln -sd owsm_ctc_v3.1_1B/data/
!ln -sd owsm_ctc_v3.1_1B/exp/

In [19]:
import soundfile as sf
speech, rate = sf.read(wav_file)

In [None]:
from espnet2.bin.s2t_ctc_align import CTCSegmentation

aligner = CTCSegmentation(
    s2t_model_file="exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_raw_bpe50000/valid.total_count.ave_5best.till45epoch.pth",
    fs=16000,
    ngpu=0,           # :,(
    batch_size=16,    # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="fixed",
    samples_to_frames_ratio=1280,   # 80ms time shift; don't change as it depends on the pre-trained model
    lang_sym="<gle>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
    frames_per_sec=12.5,    # 80ms time shift; don't change as it depends on the pre-trained model
)

print(f"speech duration: {len(speech) / rate : .2f} seconds")

segments = aligner(speech, utt_text)



CUDA: registered at /dev/null:173 [kernel]
Meta: registered at /dev/null:198 [kernel]
BackendSelect: fallthrough registered at /pytorch/aten/src/ATen/core/BackendSelectFallbackKernel.cpp:3 [backend fallback]
Python: registered at /pytorch/aten/src/ATen/core/PythonFallbackKernel.cpp:194 [backend fallback]
FuncTorchDynamicLayerBackMode: registered at /pytorch/aten/src/ATen/functorch/DynamicLayer.cpp:503 [backend fallback]
Functionalize: registered at /pytorch/aten/src/ATen/FunctionalizeFallbackKernel.cpp:349 [backend fallback]
Named: registered at /pytorch/aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
Conjugate: registered at /pytorch/aten/src/ATen/ConjugateFallback.cpp:17 [backend fallback]
Negative: registered at /pytorch/aten/src/ATen/native/NegateFallback.cpp:18 [backend fallback]
ZeroTensor: registered at /pytorch/aten/src/ATen/ZeroTensorFallback.cpp:86 [backend fallback]
ADInplaceOrView: fallthrough registered at /pytorch/aten/src/ATen/core/VariableFallbackKernel.

speech duration:  14.66 seconds
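The two "don't change" parameters describe the same quantity from two directions: at 16 kHz, 1280 samples per frame is 80 ms per frame, or 12.5 frames per second. A quick check of that arithmetic, plus an approximate frame count for this recording (derived here from the 14.66 s printed above, not taken from the aligner):

```python
fs = 16000                       # sample rate in Hz
samples_to_frames_ratio = 1280   # samples per encoder frame

frame_shift = samples_to_frames_ratio / fs      # seconds per frame
frames_per_sec = fs / samples_to_frames_ratio   # frames per second

print(frame_shift)      # 0.08
print(frames_per_sec)   # 12.5

duration = 14.66                          # seconds, as printed above
print(round(duration * frames_per_sec))   # approximate number of frames
```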




In [22]:
for segment in str(segments).split("\n"):
    parts = segment.split(" ")
    print(" ".join(parts[0:5]))

utt1 utt 0.28 1.24 -1.2299
utt2 utt 3.18 4.04 -0.8518
utt3 utt 4.14 5.00 -1.3033
utt4 utt 5.18 6.12 -1.4109
utt5 utt 6.14 7.16 -1.6551
utt6 utt 7.50 8.68 -1.0598
utt7 utt 8.94 10.12 -0.9344
utt8 utt 10.46 11.96 -0.6785
utt9 utt 12.54 14.68 -0.8216
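Each printed line holds the utterance ID, a name, the start and end times in seconds, and a score. I take the score to be a log-probability-style confidence where lower means less certain, which is an assumption rather than something the output states. A sketch that parses a few of the lines above into dictionaries so durations and the weakest segment are easy to inspect; `parse_segments` is a hypothetical helper.

```python
sample = """utt1 utt 0.28 1.24 -1.2299
utt2 utt 3.18 4.04 -0.8518
utt8 utt 10.46 11.96 -0.6785"""

def parse_segments(text):
    """Parse 'id name start end score [text...]' lines into dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split(" ")
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        rows.append({
            "id": parts[0],
            "start": float(parts[2]),
            "end": float(parts[3]),
            "score": float(parts[4]),
        })
    return rows

rows = parse_segments(sample)
for row in rows:
    print(row["id"], f"{row['end'] - row['start']:.2f}s", row["score"])

# The segment with the lowest score is the first candidate for manual checking
worst = min(rows, key=lambda r: r["score"])
print("lowest score:", worst["id"])
```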



In [None]:
def segments_to_audacity(segments, filename):
    txt_segments = str(segments).split("\n")
    with open(filename, "w") as outf:
        for segment in txt_segments:
            if segment == "":
                continue
            parts = segment.split(" ")
            start = parts[2]
            end = parts[3]
            text = " ".join(parts[5:])
            outparts = "\t".join([start, end, text])
            outf.write(outparts + "\n")

In [None]:
segments_to_audacity(segments, wav_file.replace(".wav", ".tsv"))

Labels adjusted in Audacity: the timings aren't perfect.

In [None]:
!cat 'damhan%20alla.txt' |awk -F'\t' '{print $1 "\t" $2}'

cat: damhan%20alla.txt: No such file or directory
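The `cat` fails because no `.txt` file with that name exists in the working directory (the earlier cell wrote a `.tsv`). A sketch of the same column extraction in Python, with a hypothetical `label_times` helper that reads a tab-separated label file and returns only the start and end times. The path is an assumption and should point to wherever the adjusted labels were exported.

```python
import csv
import os

def label_times(path):
    """Return (start, end) string pairs from a tab-separated label file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [(row[0], row[1]) for row in reader if len(row) >= 2]

label_path = "damhan%20alla.txt"   # assumed location of the adjusted labels
if os.path.exists(label_path):
    for start, end in label_times(label_path):
        print(f"{start}\t{end}")
else:
    print(f"{label_path} not found")
```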
