MMS forced alignment backend #1185

flyingleafe · 2023-10-13T09:56:43Z

Note: The MMS forced aligner takes romanized text as input and uses uroman for romanization. uroman is written in Perl. I had to make a wrapper package for it to avoid awkwardly downloading the original Perl scripts directly from somewhere. The wrapper package still calls perl in a subprocess, though, which adds significant overhead. I wonder if porting uroman fully to Python is a worthwhile effort.

Note #2: I did not change the default bundle_name for backwards-compatibility.

TODO list:

Provide a second backend for forced alignment (MMS_FA)
Word segmentation for spaceless languages (Chinese, Japanese, Thai, etc.)
Make alignment items contain original words from the supervision text (not normalized/romanized words)
Support parallel execution

flyingleafe · 2023-10-16T07:55:04Z

@desh2608 @pzelasko seems to be ready for review.

Caveats:

@rilshok's parallel processing logic in VAD is too good not to reuse, but copy-pasting is evil; it is possible to generalize the approach and factor it out, but I'm not sure that fits into the scope of this PR;
When dealing with word tokenization in spaceless languages, I gave it little thought and simply followed this guide; since it has no information on Southeast Asian spaceless languages other than Thai (namely Burmese, Lao, Khmer, and probably a ton of other smaller languages of the region), I just ignored those for the time being lol.
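To make the tokenization caveat concrete, here is a minimal sketch of per-language tokenizer dispatch with a whitespace fallback. Everything in it is an assumption for illustration: the `TOKENIZERS` registry, the function names, and especially `char_tokenize`, which is a toy stand-in for a real word segmentation library and not what this PR uses.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    """Default for languages that separate words with spaces."""
    return text.split()


def char_tokenize(text: str) -> List[str]:
    """Toy stand-in for a real segmenter: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry keyed by language code. A real setup would plug in
# proper word segmentation libraries here instead of char_tokenize.
TOKENIZERS: Dict[str, Tokenizer] = {
    "zh": char_tokenize,
    "ja": char_tokenize,
    "th": char_tokenize,
}


def tokenize(text: str, language: str = "") -> List[str]:
    """Pick a tokenizer by language code, falling back to whitespace splitting."""
    return TOKENIZERS.get(language, whitespace_tokenize)(text)
```

Languages without a registered tokenizer (currently the situation for Lao) silently fall back to whitespace splitting, which is the "ignored for the time being" behavior described above.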

pzelasko · 2023-10-16T22:58:00Z

Thanks! I'll review this more carefully later. I think it's a good idea to separate out the parallelization logic to lhotse/parallel.py if you can manage to do it.

rilshok · 2023-10-18T13:39:23Z

@pzelasko I can move my parallel processing class into a separate module. Tell me where it should go, and I'll set aside time this weekend to open a new PR. We can discuss my proposed changes in that PR and add the functionality to lhotse.

flyingleafe · 2023-10-20T12:01:35Z

@pzelasko I did the refactor, please take a look.

flyingleafe · 2023-10-20T12:21:54Z

Also added the word tokenization libraries for Myanmar and Khmer. To the best of my knowledge, Lao is the only remaining major language that does not use spaces to divide text into words. I could not find a ready-to-use library for Lao word tokenization, even though local researchers have done some work on automating it. I guess it will be up to Lao contributors to add a Python library when the need arises.

pzelasko

In addition to the comments below, it looks like this workflow depends on the language_data package; we should check somewhere whether it's importable and prompt the user to install it if not.
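One possible shape for such a check, as a sketch only: the helper names below are hypothetical, and the install hint assumes the pip package name matches the module name.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def require_module(name: str) -> None:
    """Raise a helpful ImportError when an optional dependency is missing."""
    if not is_module_available(name):
        # Assumption: the package is installable under the same name.
        raise ImportError(
            f"This workflow requires the '{name}' package. "
            f"Please install it, e.g. with: pip install {name}"
        )
```

Calling `require_module("language_data")` at the start of the workflow would turn a late failure into an actionable message.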

Also, I'm not sure the language auto-detection works correctly: I tried to align mini LibriSpeech as a test and got the following:

lhotse workflows align-with-torchaudio -n MMS_FA -d mps libri-train-5.jsonl.gz aligned-mms.jsonl.gz
Aligning:   0%|          | 0/1519 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/bin/lhotse", line 33, in <module>
    sys.exit(load_entry_point('lhotse', 'console_scripts', 'lhotse')())
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/pzelasko/meaning/lhotse/lhotse/bin/modes/workflows.py", line 168, in align_with_torchaudio
    for cut in tqdm(
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/pzelasko/meaning/lhotse/lhotse/parallel.py", line 115, in __call__
    yield runner(item, **kwargs)
  File "/Users/pzelasko/meaning/lhotse/lhotse/workflows/forced_alignment/base.py", line 54, in __call__
    self.normalize_text(sup.text, language=sup.language)
  File "/Users/pzelasko/meaning/lhotse/lhotse/workflows/forced_alignment/mms_aligner.py", line 48, in normalize_text
    romanized_words = self._uroman(sep.join(orig_words), language=language).split(
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/uroman/__init__.py", line 14, in uroman
    language = Language.get(language).to_alpha3()
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/langcodes/__init__.py", line 304, in get
    components = parse_tag(tag)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/langcodes/tag_parser.py", line 212, in parse_tag
    subtag_error(subtags[0], 'a language code')
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/langcodes/tag_parser.py", line 422, in subtag_error
    raise LanguageTagError(f"Expected {expected}, got {subtag!r}")
langcodes.tag_parser.LanguageTagError: Expected a language code, got 'english'
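The traceback shows a full language name ('english') reaching code that expects a language tag. A minimal sketch of normalizing the field before romanization follows; the lookup tables are tiny, illustrative, and hypothetical (not taken from langcodes, uroman, or this PR), and the function name is made up.

```python
from typing import Optional

# Tiny illustrative tables; NOT exhaustive and not copied from any library.
NAME_TO_ALPHA3 = {"english": "eng", "german": "deu", "polish": "pol"}
ALPHA2_TO_ALPHA3 = {"en": "eng", "de": "deu", "pl": "pol"}


def to_alpha3(language: Optional[str]) -> Optional[str]:
    """Map a language name or tag to a three-letter code, or None if unset."""
    if language is None:
        return None
    key = language.strip().lower()
    # Full names such as "English", as found in the LibriSpeech supervisions.
    if key in NAME_TO_ALPHA3:
        return NAME_TO_ALPHA3[key]
    # Drop region or script subtags, e.g. "en-US" or "en_US" -> "en".
    primary = key.replace("_", "-").split("-")[0]
    if primary in ALPHA2_TO_ALPHA3:
        return ALPHA2_TO_ALPHA3[primary]
    # Assume any other three-letter alphabetic value is already a valid code.
    if len(primary) == 3 and primary.isalpha():
        return primary
    raise ValueError(f"Cannot interpret language value: {language!r}")
```

A real fix would more likely rely on a maintained language database than on hand-written tables; the sketch only shows where the normalization step sits.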

pzelasko · 2023-10-23T00:28:25Z

lhotse/parallel.py

+    """
+    A class which uses ProcessPoolExecutor to parallelize the execution of a callable class.
+    The instances of the runner class are instantiated separately in each worker process.
+    """


Could you add a usage example to this docstring? This API is not self-explanatory to me.

pzelasko · 2023-10-23T00:31:52Z

lhotse/workflows/forced_alignment/mms_aligner.py

+
+
+class MMSForcedAligner(ForcedAligner):
+    def __init__(self, bundle_name: str, device: str = "cpu"):


I'd remove bundle_name from the parameter list and hardcode self.bundle_name = "MMS_FA"; this will also allow removing the assertion below.

I'd also consider an option check_language: bool = True that warns users when the language field is missing from the supervisions (the message should also mention how to disable the warning).
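A rough sketch of what such a flag could do, assuming a supervision object with `id` and `language` attributes. The `Supervision` dataclass here is a minimal hypothetical stand-in, and the helper name and message wording are illustrative.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class Supervision:
    """Minimal hypothetical stand-in for a supervision segment."""

    id: str
    text: str
    language: Optional[str] = None


def warn_if_language_missing(sup: Supervision, check_language: bool = True) -> bool:
    """Warn about a missing language field; return True if a warning was issued."""
    if check_language and sup.language is None:
        warnings.warn(
            f"Supervision '{sup.id}' has no language set; romanization may be "
            "less accurate. Pass check_language=False to silence this warning."
        )
        return True
    return False
```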

pzelasko · 2023-10-23T00:38:20Z

lhotse/workflows/forced_alignment/workflow.py

+    - https://pytorch.org/audio/stable/pipelines.html
+
+    :param cuts: input CutSet.
+    :param bundle_name: name of the selected pretrained model from torchaudio.


Can you add documentation about MMS here and in the CLI (lhotse/bin/modes/workflows.py under align_with_torchaudio)? Ideally a few words about how to enable it.

pzelasko · 2023-10-23T00:47:30Z

lhotse/workflows/forced_alignment/workflow.py

+    bundle_name: str = "WAV2VEC2_ASR_BASE_960H",
+    device: str = "cpu",
+    normalize_text: bool = True,
+    num_jobs: int = 1,


This param needs to be exposed in the CLI (ideally also add a check that when num_jobs > 1, device == 'cpu')
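The suggested check could look roughly like the sketch below. The function name is hypothetical, and raising `ValueError` is an assumption; a CLI would probably surface the problem through its own error mechanism.

```python
def validate_parallel_args(num_jobs: int, device: str) -> None:
    """Reject argument combinations that parallel alignment cannot support."""
    if num_jobs < 1:
        raise ValueError(f"num_jobs must be >= 1, got {num_jobs}")
    # Multiple worker processes are only allowed together with the CPU device.
    if num_jobs > 1 and device != "cpu":
        raise ValueError(
            f"num_jobs={num_jobs} requires device='cpu', got device='{device}'"
        )
```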

pzelasko · 2023-10-23T00:49:15Z

lhotse/workflows/forced_alignment/base.py

+                pre_alignment = self.align(audio, transcript)
+            except FailedToAlign:
+                logging.info(
+                    f"Failed to align supervision '{sup.id}' for cut '{cut.id}'. Writing it without alignment."


It would be great to write out the original exception details here as well. For example, I tried to turn on MPS on macOS and got only a generic "failed to align", while it worked fine on CPU; I wouldn't have known why if I hadn't already suspected it.
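One way to keep the underlying error visible is exception chaining (`raise ... from ...`) plus logging the cause. The sketch below is illustrative only: `align` is a stand-in that fails on any non-CPU device, and `describe_failure` and `align_or_skip` are hypothetical helpers, not functions from this PR.

```python
import logging
from typing import List, Optional, Tuple


class FailedToAlign(RuntimeError):
    """Raised when a supervision cannot be aligned."""


def align(device: str) -> List[Tuple[str, float, float]]:
    """Stand-in aligner: pretends every non-CPU device is unsupported."""
    try:
        if device != "cpu":
            raise NotImplementedError(f"operator not supported on device '{device}'")
        return [("hello", 0.0, 0.5)]
    except NotImplementedError as e:
        # Chain the original error so callers can still inspect it.
        raise FailedToAlign("alignment failed") from e


def describe_failure(exc: BaseException) -> str:
    """Format an exception together with its direct cause, if any."""
    cause = exc.__cause__
    if cause is None:
        return repr(exc)
    return f"{exc!r} (caused by {cause!r})"


def align_or_skip(
    sup_id: str, cut_id: str, device: str
) -> Optional[List[Tuple[str, float, float]]]:
    """Align if possible; otherwise log the reason and return None."""
    try:
        return align(device)
    except FailedToAlign as e:
        logging.info(
            "Failed to align supervision '%s' for cut '%s': %s. "
            "Writing it without alignment.",
            sup_id,
            cut_id,
            describe_failure(e),
        )
        return None
```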

pzelasko · 2023-10-23T00:54:25Z

Otherwise it seems to work well, great work @flyingleafe!

flyingleafe · 2023-10-25T11:04:47Z

@pzelasko I accounted for your comments, please check.

pzelasko

LGTM!

flyingleafe marked this pull request as ready for review October 16, 2023 07:37

flyingleafe force-pushed the forced-alignment-mms branch from e0fbc30 to 39f89c6 Compare October 16, 2023 07:39

flyingleafe added 6 commits October 20, 2023 12:15

MMS forced alignment backend

c5f1f1e

Word segmentation for spaceless languages (at least some of them)

6977695

Preserve original, unnormalized words in the alignment labels

2566820

Support for aligning in parallel

d23d7b7

Refactor parallel executor out

c0780a1

Support word tokenization for Myanmar and Khmer

8f86efe

flyingleafe force-pushed the forced-alignment-mms branch from 2676d4a to 8f86efe Compare October 20, 2023 12:15

pzelasko reviewed Oct 23, 2023

flyingleafe added 3 commits October 25, 2023 08:04

Fix the language normalization and include language_data import check

6928a24

Add docs and fix CLI parameters

cc1b2d7

More docs and check_language flag

b42a685

pzelasko approved these changes Oct 26, 2023

pzelasko enabled auto-merge (squash) October 26, 2023 03:11

Merge branch 'master' into forced-alignment-mms

7551af9

pzelasko added this to the v1.18 milestone Oct 26, 2023

pzelasko merged commit 2494d76 into lhotse-speech:master Oct 26, 2023
8 of 10 checks passed
