MMS forced alignment backend #1185

Merged
merged 10 commits into from
Oct 26, 2023

Conversation

flyingleafe
Contributor

@flyingleafe flyingleafe commented Oct 13, 2023

Closes #1120

Note: The MMS forced aligner takes romanized text as input and uses uroman for romanization. The latter is written in Perl. I had to make a wrapper package for it to avoid the cringy direct download of the original Perl scripts from somewhere. The wrapper package still calls perl in a subprocess, though, which adds significant overhead. I wonder if porting uroman fully to Python would be a worthwhile effort.

Note #2: I did not change the default bundle_name for backwards-compatibility.

TODO list:

  • Provide a second backend for forced alignment (MMS_FA)
  • Word segmentation for spaceless languages (Chinese, Japanese, Thai, etc.)
  • Make alignment items contain original words from the supervision text (not normalized/romanized words)
  • Support parallel execution

@flyingleafe flyingleafe marked this pull request as ready for review October 16, 2023 07:37
@flyingleafe
Contributor Author

@desh2608 @pzelasko seems to be ready for review.

Caveats:

  • @rilshok 's parallel processing logic in VAD is too good to not be reused, but copypasting is evil; it is possible to make the approach general and refactor it out, but not sure if that fits into the scope of this PR;
  • When dealing with word tokenization in spaceless languages, I spared zero thoughts and followed this guide; since it has no information on Southeast Asian spaceless languages other than Thai (namely Burmese, Lao, Khmer, and probably a ton of other small Southeast Asian languages), I just ignored those for the time being lol.

@pzelasko
Collaborator

Thanks! I'll review this more carefully later. I think it's a good idea to separate out the parallelization logic to lhotse/parallel.py if you can manage to do it.

@rilshok
Contributor

rilshok commented Oct 18, 2023

@pzelasko I can move my parallel-processing class into a separate module. Suggest where, and I'll set aside time this weekend to open a new PR. As part of that PR, let's discuss my proposed changes and add this functionality to lhotse.

@flyingleafe
Contributor Author

@pzelasko I did the refactor, please take a look.

@flyingleafe
Contributor Author

Also added word tokenization libraries for Myanmar and Khmer. To the best of my knowledge, Lao is the only remaining major language that does not use spaces to divide text into words. I could not find a ready-to-use library for Lao word tokenization, even though local researchers have done some work on automating it. I guess it would be up to some Lao contributors to add a Python library when the need arises.

Collaborator

@pzelasko pzelasko left a comment

In addition to the comments below, it looks like this workflow depends on the language_data package; we should check somewhere whether it's importable and prompt the user to install it if not.

Also, I'm not sure the language auto-detection works correctly: I tried to align mini LibriSpeech as a test and got the following:

lhotse workflows align-with-torchaudio -n MMS_FA -d mps libri-train-5.jsonl.gz aligned-mms.jsonl.gz
Aligning:   0%|          | 0/1519 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/bin/lhotse", line 33, in <module>
    sys.exit(load_entry_point('lhotse', 'console_scripts', 'lhotse')())
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/pzelasko/meaning/lhotse/lhotse/bin/modes/workflows.py", line 168, in align_with_torchaudio
    for cut in tqdm(
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/Users/pzelasko/meaning/lhotse/lhotse/parallel.py", line 115, in __call__
    yield runner(item, **kwargs)
  File "/Users/pzelasko/meaning/lhotse/lhotse/workflows/forced_alignment/base.py", line 54, in __call__
    self.normalize_text(sup.text, language=sup.language)
  File "/Users/pzelasko/meaning/lhotse/lhotse/workflows/forced_alignment/mms_aligner.py", line 48, in normalize_text
    romanized_words = self._uroman(sep.join(orig_words), language=language).split(
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/uroman/__init__.py", line 14, in uroman
    language = Language.get(language).to_alpha3()
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/langcodes/__init__.py", line 304, in get
    components = parse_tag(tag)
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/langcodes/tag_parser.py", line 212, in parse_tag
    subtag_error(subtags[0], 'a language code')
  File "/Users/pzelasko/miniconda3/envs/lhotse-py3.10/lib/python3.10/site-packages/langcodes/tag_parser.py", line 422, in subtag_error
    raise LanguageTagError(f"Expected {expected}, got {subtag!r}")
langcodes.tag_parser.LanguageTagError: Expected a language code, got 'english'
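
The traceback above shows the supervision's language field holding a plain name ('english') where a language code was expected. A minimal sketch of one way to normalize such values before romanization is shown below. The function name `normalize_language` and the alias table are hypothetical and purely illustrative; they are not part of lhotse, uroman, or langcodes, and a real fix might instead rely on a name lookup from a language-data library.

```python
from typing import Optional

# Illustrative alias table only (an assumption for this sketch): it maps a few
# lowercase language names to ISO 639-3 codes and is far from exhaustive.
_NAME_TO_ISO639_3 = {
    "english": "eng",
    "german": "deu",
    "french": "fra",
    "spanish": "spa",
    "polish": "pol",
}


def normalize_language(language: Optional[str]) -> Optional[str]:
    """Map a free-form language name to a code; pass other values through.

    Returns None when no language is given, so the caller can decide
    whether to warn, skip, or fall back to a default.
    """
    if language is None:
        return None
    stripped = language.strip()
    if not stripped:
        return None
    # Known names are translated; anything else is assumed to already be a tag.
    return _NAME_TO_ISO639_3.get(stripped.lower(), stripped)
```

With this sketch, a value such as 'English' would be turned into 'eng' before being handed to the romanizer, while values that already look like codes are left alone.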

"""
A class which uses ProcessPoolExecutor to parallelize the execution of a callable class.
The instances of the runner class are instantiated separately in each worker process.
"""
Collaborator

Could you add a usage example to this docstring? This API is not self-explanatory to me.



class MMSForcedAligner(ForcedAligner):
    def __init__(self, bundle_name: str, device: str = "cpu"):
Collaborator

I'd remove bundle_name from the param list and hardcode self.bundle_name = "MMS_FA"; this will also allow removing the assertion below.

Collaborator

I'd consider an option check_language: bool = True that would warn users about a missing language field in the supervisions when detected (the message should also mention how to disable the warning).
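
A rough sketch of what such a check might look like follows. The helper name `warn_if_language_missing` and the exact message are assumptions made for illustration; only the `check_language: bool = True` option itself comes from the suggestion above.

```python
import warnings
from typing import Iterable, Optional


def warn_if_language_missing(
    languages: Iterable[Optional[str]], check_language: bool = True
) -> int:
    """Warn once if any supervision lacks a language; return how many do.

    `languages` is the sequence of `language` values taken from the
    supervisions (None or an empty string counts as missing).
    """
    if not check_language:
        return 0
    missing = sum(1 for lang in languages if not lang)
    if missing:
        warnings.warn(
            f"{missing} supervision(s) have no `language` field set, so "
            "language-specific text normalization may be inaccurate. "
            "Pass check_language=False to silence this warning.",
            UserWarning,
        )
    return missing
```

Emitting a single aggregated warning, instead of one per supervision, keeps the output readable on large manifests.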

- https://pytorch.org/audio/stable/pipelines.html

:param cuts: input CutSet.
:param bundle_name: name of the selected pretrained model from torchaudio.
Collaborator

Can you add documentation about MMS here and in the CLI (lhotse/bin/modes/workflows.py under align_with_torchaudio)? Ideally a few words about how to enable it.

bundle_name: str = "WAV2VEC2_ASR_BASE_960H",
device: str = "cpu",
normalize_text: bool = True,
num_jobs: int = 1,
Collaborator

@pzelasko pzelasko Oct 23, 2023

This param needs to be exposed in the CLI (ideally also add a check that when num_jobs > 1, device == 'cpu')
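
The suggested guard could be sketched as a small validation step run before any workers are started. The function name `check_parallel_device` is hypothetical, and how it would be wired into the CLI is left open here.

```python
def check_parallel_device(num_jobs: int, device: str) -> None:
    """Reject multi-process runs on non-CPU devices (sketch of the check above)."""
    if num_jobs < 1:
        raise ValueError(f"num_jobs must be >= 1, got {num_jobs}")
    if num_jobs > 1 and device != "cpu":
        raise ValueError(
            f"num_jobs={num_jobs} requires device='cpu', got device={device!r}; "
            "use num_jobs=1 to run on an accelerator."
        )
```

Failing early with a clear message avoids spawning several processes that would all try to use the same accelerator.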

    pre_alignment = self.align(audio, transcript)
except FailedToAlign:
    logging.info(
        f"Failed to align supervision '{sup.id}' for cut '{cut.id}'. Writing it without alignment."
Collaborator

It would be great to write out the original exception details here as well; e.g., I tried to turn on MPS on macOS and got only a generic "failed to align", but it worked OK with CPU; I wouldn't have known why if I hadn't already suspected it.
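
One way to surface the underlying error is sketched below, assuming the alignment failure is re-raised as `FailedToAlign` with the original exception chained to it. `FailedToAlign` here is a local stand-in class, and `describe_exception` and `align_or_skip` are hypothetical helpers written for this example.

```python
import logging
from typing import Any, Callable, Optional


class FailedToAlign(RuntimeError):
    """Stand-in for the exception caught in the snippet above."""


def describe_exception(exc: BaseException) -> str:
    """Format an exception together with the error that triggered it, if any."""
    text = f"{type(exc).__name__}: {exc}"
    cause = exc.__cause__ or exc.__context__
    if cause is not None:
        text += f"; caused by {type(cause).__name__}: {cause}"
    return text


def align_or_skip(
    align_fn: Callable[[Any, str], Any],
    audio: Any,
    transcript: str,
    sup_id: str,
    cut_id: str,
) -> Optional[Any]:
    """Run the aligner; on failure, log the reason and return None."""
    try:
        return align_fn(audio, transcript)
    except FailedToAlign as e:
        logging.info(
            f"Failed to align supervision '{sup_id}' for cut '{cut_id}' "
            f"({describe_exception(e)}). Writing it without alignment."
        )
        return None
```

With the cause included in the log line, a device-specific failure would be distinguishable from a genuine alignment failure.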

@pzelasko
Collaborator

Otherwise it seems to work well, great work @flyingleafe!

@flyingleafe
Contributor Author

@pzelasko I've addressed your comments, please check.

Collaborator

@pzelasko pzelasko left a comment

LGTM!

@pzelasko pzelasko enabled auto-merge (squash) October 26, 2023 03:11
@pzelasko pzelasko added this to the v1.18 milestone Oct 26, 2023
@pzelasko pzelasko merged commit 2494d76 into lhotse-speech:master Oct 26, 2023
8 of 10 checks passed
Successfully merging this pull request may close these issues.

New forced alignment backend
3 participants