Speech to text translation utilizing 3-way data #1099

AmirHussein96 · 2023-07-17T01:03:41Z

This pull request adds a recipe for the 3-way Tunisian Arabic to English speech-to-text data from the IWSLT 2022 dialect shared task: https://iwslt.org/2022/dialect.
It also serves as an introduction to preparing 3-way data (speech, transcription, and translation) for model training, e.g., in a multi-task learning scenario.

The idea is to store the target-language translation in the supervision's custom field as additional information, since the source and target text both correspond to the same recording. See the example below:

cut[10]
MonoCut(id='994203_ta_eng_20170614_214611_15072_B_010266', start=102.662, duration=2.674, channel=0, supervisions=[SupervisionSegment(id='994203_ta_eng_20170614_214611_15072_B_010266', recording_id='20170614_214611_15072_B', start=0.0, duration=2.674, channel=0, text='اه وزيد يظهرلي الببا ما يخليهاش', language='ta', speaker='994203', gender=None, custom={'tgt_lang': 'eng', 'tgt_text': "ah in addition apparently my dad won't let her"}, alignment=None)], features=None, recording=Recording(id='20170614_214611_15072_B', sources=[AudioSource(type='file', channels=[0], source='/export/common/data/corpora/LDC/LDC2022E01/data/audio/ta/20170614_214611_15072_B.sph')], sampling_rate=8000, num_samples=5120320, duration=640.04, channel_ids=[0], transforms=None), custom=None)
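
A minimal sketch of how the translation can be attached to a supervision and read back. To stay runnable without lhotse installed, it uses a stand-in dataclass rather than lhotse's SupervisionSegment; `Supervision` and `translation_pair` are hypothetical names, and only the custom keys `tgt_lang` and `tgt_text` are taken from the example above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Supervision:
    """Stand-in for a supervision segment (hypothetical, not the lhotse class)."""

    id: str
    text: str
    language: str
    custom: Optional[Dict[str, Any]] = None


def translation_pair(sup: Supervision) -> Dict[str, str]:
    """Return the source text and its target translation for one supervision."""
    if not sup.custom or "tgt_text" not in sup.custom:
        raise ValueError(f"Supervision {sup.id} has no target translation")
    return {
        "text": sup.text,
        "tgt_text": sup.custom["tgt_text"],
        # The language tag is optional in this sketch.
        "tgt_lang": sup.custom.get("tgt_lang", ""),
    }


sup = Supervision(
    id="994203_ta_eng_20170614_214611_15072_B_010266",
    text="اه وزيد يظهرلي الببا ما يخليهاش",
    language="ta",
    custom={
        "tgt_lang": "eng",
        "tgt_text": "ah in addition apparently my dad won't let her",
    },
)
pair = translation_pair(sup)
```

Because the transcription and the translation live on the same supervision, both stay tied to the same start, duration, and recording without a second manifest.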

desh2608 · 2023-07-17T06:34:53Z

lhotse/bin/modes/recipes/.nfs0000000076da098000000013

@@ -0,0 +1,56 @@
+from typing import Optional, Sequence, Union


Perhaps this file was mistakenly added.

lhotse/bin/modes/recipes/iwslt22_ta.py

desh2608 · 2023-07-17T06:38:55Z

lhotse/dataset/speech_translation.py

+                [
+                    {
+                        "text": supervision.text,
+                        "tgt_text": supervision.custom["tgt_text"],


How would the dataset be prepared if I want to have multiple target translations in different languages?

That's an excellent question, and I hadn't considered that before. I believe a possible approach would be to concatenate the target languages together, including their language tags, and store that concatenated text in tgt_text. If you have a multilingual BPE (Byte Pair Encoding) model, you can tokenize tgt_text to include all languages and use it for training. If you have any better suggestions, please let me know. Additionally, I want to mention that I already have an Icefall recipe for a trained Zipformer speech translation model, and the results have been very promising. I plan to push that in the upcoming days.

How about a nested field in supervision.custom:

{ "translated_text": { "en": text_en, "fr": text_fr, ... } }

Perhaps the tgt_text and tgt_lang should be tuples or lists instead, where each item in the tuple is one language. But this is your choice. Users can also choose to extend this class for multi-task ST.

Edit: +1 for Piotr's suggestion

Done. Thank you @pzelasko and @desh2608 for the nice suggestions. I followed @pzelasko's suggestion.
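
For illustration, the nested-field idea discussed in this thread could be consumed roughly as follows. This is a sketch under assumptions, not the code merged in this PR: `collect_translations` is a hypothetical helper, the plain dicts stand in for supervision.custom, and only the `translated_text` key comes from the suggestion above.

```python
from typing import Any, Dict, List


def collect_translations(
    customs: List[Dict[str, Any]], langs: List[str]
) -> Dict[str, List[str]]:
    """Group target texts by language code, one list entry per supervision."""
    batch: Dict[str, List[str]] = {lang: [] for lang in langs}
    for custom in customs:
        translations = custom.get("translated_text", {})
        for lang in langs:
            # Missing languages become empty strings so the lists stay aligned
            # with the order of the supervisions.
            batch[lang].append(translations.get(lang, ""))
    return batch


# Stand-ins for the `custom` dicts of two supervisions.
customs = [
    {"translated_text": {"en": "hello", "fr": "bonjour"}},
    {"translated_text": {"en": "thank you"}},
]
batch = collect_translations(customs, ["en", "fr"])
```

Keying by language code avoids concatenating translations into one string and lets each target language be tokenized separately.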

desh2608 · 2023-07-17T06:39:42Z

lhotse/dataset/speech_translation.py

+        return batch
+
+
+def validate_for_asr(cuts: CutSet) -> None:


You can import this function from speech_recognition.py if it is unchanged.

lhotse/recipes/iwslt22_ta.py

desh2608 · 2023-07-17T06:41:08Z

lhotse/recipes/iwslt22_ta.py

+# limitations under the License.
+
+"""
+IWSLT Tunisian is a 3-way parallel data includes 160 hours and 200k lines worth of aligned Audio, 


Could you add more details about the dataset, including citation for the original paper (and links)?

lhotse/recipes/iwslt22_ta.py

desh2608 · 2023-07-26T14:28:45Z

@AmirHussein96 please resolve conflicts and fix the tests (also remove WIP when you think it's ready for another review).

AmirHussein96 · 2023-07-31T03:41:38Z

> @AmirHussein96 please resolve conflicts and fix the tests (also remove WIP when you think it's ready for another review).

@desh2608 ready for another review.

desh2608

Some minor suggestions.

desh2608 · 2023-07-31T13:15:43Z

lhotse/dataset/speech_translation.py

+from lhotse.workarounds import Hdf5MemoryIssueFix
+
+
+class K2Speech2textTranslationDataset(torch.utils.data.Dataset):


K2Speech2textTranslationDataset -> K2Speech2TextTranslationDataset

desh2608 · 2023-07-31T13:16:35Z

lhotse/dataset/speech_translation.py

+        input_strategy: BatchIO = PrecomputedFeatures(),
+    ):
+        """
+        k2 ASR IterableDataset constructor.


K2Speech2TextTranslationDataset constructor.

desh2608 · 2023-07-31T13:18:45Z

lhotse/recipes/iwslt22_ta.py

+    """
+    logging.info(
+        """
+        To obtaining this data your institution needs to have an LDC subscription.


desh2608 · 2023-07-31T13:18:58Z

lhotse/recipes/iwslt22_ta.py

+    logging.info(
+        """
+        To obtaining this data your institution needs to have an LDC subscription.
+        You also should download the predined splits with


*pre-defined

desh2608 · 2023-07-31T13:21:04Z

lhotse/recipes/iwslt22_ta.py

+    corpus_dir: Pathlike,
+    splits: Pathlike,
+    output_dir: Optional[Pathlike] = None,
+    clean: bool = False,


Ideally, keep the option name here the same as the one in the CLI (i.e. normalize_text). Also, "clean" has connotations beyond normalization, e.g., it can refer to resegmentation, data filtering, etc.

desh2608 · 2023-07-31T14:17:14Z

lhotse/recipes/iwslt22_ta.py

+# UO/ - uncertain + foreign
+
+
+arabic_filter = re.compile(r"[OUM]+/*|\u061F|\?|\!|\.")


(Tagging @pzelasko)

Putting the regex compilation in the global scope means that it would be run whenever users call import lhotse. Even if you are using the compiled regex several times, there is no real benefit in defining it globally since Python internally caches it anyway, so you might just compile it in the function where you are using it.

+1, cool info about regex caching, I wasn't aware of that (SO reference https://stackoverflow.com/questions/12514157/how-does-pythons-regex-pattern-caching-work)

Thanks for the catch. I fixed it.
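
A small sketch of the change suggested in this thread: the pattern is compiled inside the function that uses it instead of at module scope. The function name `strip_markers` is hypothetical; the pattern is the one from the diff above.

```python
import re


def strip_markers(text: str) -> str:
    """Remove annotation markers and selected punctuation from a transcript."""
    # Compiled at call time, so importing the module does no regex work.
    # Repeated calls stay cheap because `re` caches compiled patterns.
    # As written, the pattern removes any run of the capitals O, U, or M
    # (optionally followed by slashes), the Arabic question mark, and ? ! .
    arabic_filter = re.compile(r"[OUM]+/*|\u061F|\?|\!|\.")
    return arabic_filter.sub("", text)


cleaned = strip_markers("UO/ hello?")
```

Here `cleaned` keeps the leading space, since the pattern does not touch whitespace.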

desh2608

Just one minor change.

desh2608 · 2023-08-17T19:58:23Z

docs/corpus.rst

@@ -185,6 +185,8 @@ a CLI tool that create the manifests given a corpus directory.
    - :func:`lhotse.recipes.prepare_mgb2`
  * - XBMU-AMDO31
    - :func:`lhotse.recipes.xbmu_amdo31`
+  * - IWSLT22_Ta


This list is in alphabetic order.

I fixed it.

desh2608

LGTM!

System User added 7 commits July 16, 2023 20:27

iwslt_ta recipe and introduction to 3-way speech to text ST

f501fd6

remove testing

d1eb521

applying black

a98b8ac

remove pdb related

53132be

.

75b148a

sort the imports alphabetically

10aeca7

.

8cee861

desh2608 reviewed Jul 17, 2023

extending to multilingual ST and addressing other comments

a089d4e

System User and others added 2 commits July 30, 2023 22:33

conflicts resolved

2211218

Merge branch 'master' into st/IWSLT22TaDialect

6788c52

AmirHussein96 changed the title ~~[WIP]: Speech to text translation utilizing 3-way data~~ Speech to text translation utilizing 3-way data Jul 31, 2023

System User added 5 commits July 30, 2023 22:44

fixing tests

ff4f462

pull

3bd02d6

.

baaa994

.

283db39

langs abrvs as list of comma seperated strings

f63b9fc

desh2608 self-requested a review July 31, 2023 13:14

desh2608 reviewed Jul 31, 2023

AmirHussein96 and others added 5 commits August 16, 2023 10:45

Update speech_translation.py

1774719

Update speech_translation.py

d3cadcf

Update iwslt22_ta.py

eb271b4

Update iwslt22_ta.py

186148f

Merge branch 'master' into st/IWSLT22TaDialect

ff4f9cd

desh2608 reviewed Aug 17, 2023

Update corpus.rst

d05c147

desh2608 approved these changes Aug 17, 2023

desh2608 enabled auto-merge (squash) August 17, 2023 22:06

desh2608 merged commit c80fc07 into lhotse-speech:master Aug 17, 2023
9 of 10 checks passed

AmirHussein96 mentioned this pull request Mar 20, 2024

ASR,ST and CS recipies #1307

Open
