Add multidataset #1010

yfyeung · 2023-04-18T11:10:33Z

greedy_search

test-clean & test-other	sum	config
1.9 & 4.06	5.96	epoch 30 avg 3
1.9 & 4.06	5.96	epoch 30 avg 4
1.91 & 4.06	5.97	epoch 30 avg 7
1.92 & 4.06	5.98	epoch 30 avg 5
1.93 & 4.05	5.98	epoch 30 avg 6
1.91 & 4.08	5.99	epoch 30 avg 2
1.91 & 4.1	6.01	epoch 30 avg 10
1.91 & 4.11	6.02	epoch 30 avg 8
1.91 & 4.11	6.02	epoch 30 avg 13
1.9 & 4.13	6.03	epoch 30 avg 12
1.91 & 4.12	6.03	epoch 30 avg 11
1.9 & 4.14	6.04	epoch 30 avg 1
1.92 & 4.12	6.04	epoch 30 avg 9
1.92 & 4.14	6.06	epoch 30 avg 14
1.95 & 4.2	6.15	epoch 30 avg 15
1.98 & 4.2	6.18	epoch 30 avg 16
2.0 & 4.24	6.24	epoch 30 avg 17
2.02 & 4.3	6.32	epoch 30 avg 18
2.02 & 4.32	6.34	epoch 30 avg 19
2.06 & 4.39	6.45	epoch 30 avg 20

modified_beam_search

test-clean & test-other	sum	config
1.89 & 3.99	5.88	epoch 30 avg 8
1.9 & 3.99	5.89	epoch 30 avg 7
1.88 & 4.02	5.90	epoch 30 avg 4
1.91 & 3.99	5.90	epoch 30 avg 5
1.9 & 4.0	5.90	epoch 30 avg 6
1.9 & 4.01	5.91	epoch 30 avg 3
1.89 & 4.03	5.92	epoch 30 avg 2
1.89 & 4.03	5.92	epoch 30 avg 9
1.9 & 4.03	5.93	epoch 30 avg 10
1.91 & 4.12	6.03	epoch 30 avg 1

fast_beam_search

test-clean & test-other	sum	config
1.9 & 3.98	5.88	epoch 30 avg 7
1.9 & 4.01	5.91	epoch 30 avg 6
1.9 & 4.01	5.91	epoch 30 avg 8
1.87 & 4.04	5.91	epoch 30 avg 9
1.92 & 4.0	5.92	epoch 30 avg 5
1.93 & 4.01	5.94	epoch 30 avg 4
1.92 & 4.03	5.95	epoch 30 avg 3
1.9 & 4.06	5.96	epoch 30 avg 10
1.93 & 4.05	5.98	epoch 30 avg 2
1.92 & 4.07	5.99	epoch 30 avg 1

csukuangfj · 2023-04-21T06:56:19Z

egs/librispeech/ASR/local/compute_fbank_librispeech.py

+    parser.add_argument(
+        "--perturb-speed",
+        type=str,
+        default=True,


Please use str2bool.

csukuangfj · 2023-04-21T06:56:54Z

egs/librispeech/ASR/local/compute_fbank_librispeech.py

-                    cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)
-                )
+                if perturb_speed:
+                    cut_set = (


Please add a log saying it is doing speed perturb.
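
A minimal sketch of what the requested log could look like. `FakeCutSet` is a hypothetical stand-in for lhotse's `CutSet`, so the example runs without lhotse. The function name `maybe_perturb_speed` and the log message are also assumptions, not the code that was merged.

```python
import logging


class FakeCutSet:
    """Hypothetical stand-in for lhotse.CutSet (only what this sketch needs)."""

    def __init__(self, ids):
        self.ids = list(ids)

    def perturb_speed(self, factor):
        # The real CutSet.perturb_speed returns speed-perturbed cuts;
        # here we only tag the ids so the effect is visible.
        return FakeCutSet(f"{i}_sp{factor}" for i in self.ids)

    def __add__(self, other):
        return FakeCutSet(self.ids + other.ids)


def maybe_perturb_speed(cut_set, perturb_speed: bool):
    """Return cut_set, optionally extended with 0.9x and 1.1x copies."""
    if perturb_speed:
        # The log line requested in the review (wording is an assumption).
        logging.info("Doing speed perturb")
        cut_set = (
            cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)
        )
    return cut_set
```

With `perturb_speed=False` the input is returned unchanged and nothing is logged.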

csukuangfj · 2023-04-21T06:59:18Z

egs/librispeech/ASR/pruned_transducer_stateless7/multidataset.py

+
+class MultiDataset:
+    def __init__(self, manifest_dir: str):
+        self.manifest_dir = Path(manifest_dir)


Please document what manifest_dir contains.

csukuangfj · 2023-04-21T07:00:29Z

egs/librispeech/ASR/pruned_transducer_stateless7/multidataset.py

+        filenames = list(
+            glob.glob(
+                f"{self.manifest_dir}/multidataset_split_1998/multidataset/multidataset_cuts_train.*.jsonl.gz"
+            )
+        )


Suggested change

-        filenames = list(
-            glob.glob(
-                f"{self.manifest_dir}/multidataset_split_1998/multidataset/multidataset_cuts_train.*.jsonl.gz"
-            )
-        )
+        filenames = glob.glob(
+            f"{self.manifest_dir}/multidataset_split_1998/multidataset/multidataset_cuts_train.*.jsonl.gz"
+        )

csukuangfj · 2023-04-21T07:00:44Z

egs/librispeech/ASR/pruned_transducer_stateless7/multidataset.py

+        )
+
+        pattern = re.compile(r"multidataset_cuts_train.([0-9]+).jsonl.gz")
+        idx_filenames = [(int(pattern.search(f).group(1)), f) for f in filenames]


Suggested change

-        idx_filenames = [(int(pattern.search(f).group(1)), f) for f in filenames]
+        idx_filenames = ((int(pattern.search(f).group(1)), f) for f in filenames)

csukuangfj · 2023-04-21T07:01:50Z

egs/librispeech/ASR/pruned_transducer_stateless7/multidataset.py

+        idx_filenames = [(int(pattern.search(f).group(1)), f) for f in filenames]
+        idx_filenames = sorted(idx_filenames, key=lambda x: x[0])
+
+        sorted_filenames = [f[1] for f in idx_filenames]


Suggested change

-        sorted_filenames = [f[1] for f in idx_filenames]
+        sorted_filenames = (f[1] for f in idx_filenames)

Fix all
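
A self-contained sketch of the split-sorting logic discussed in the three comments above. It assumes the file naming shown in the diff (`multidataset_cuts_train.<index>.jsonl.gz`). The function names and the temporary directory are hypothetical. The intermediate `(index, filename)` pairs can be a generator because `sorted()` consumes them immediately, but this sketch keeps the final result as a list, since a generator has no `len()` and can only be iterated once.

```python
import glob
import os
import re
import tempfile


def sorted_split_filenames(split_dir: str) -> list:
    """Return split manifests ordered by their numeric index, not lexically."""
    filenames = glob.glob(f"{split_dir}/multidataset_cuts_train.*.jsonl.gz")
    # Dots are escaped so they match literally.
    pattern = re.compile(r"multidataset_cuts_train\.([0-9]+)\.jsonl\.gz")
    # Generator of (index, filename); sorted() materializes it.
    idx_filenames = ((int(pattern.search(f).group(1)), f) for f in filenames)
    idx_filenames = sorted(idx_filenames, key=lambda x: x[0])
    # A list, so callers can still take len() for logging.
    return [f for _, f in idx_filenames]


def demo() -> list:
    """Create empty placeholder files and return their sorted base names."""
    with tempfile.TemporaryDirectory() as d:
        for i in (10, 2, 1):
            open(os.path.join(d, f"multidataset_cuts_train.{i}.jsonl.gz"), "w").close()
        return [os.path.basename(f) for f in sorted_split_filenames(d)]
```

A plain lexical sort would place index 10 before index 2, which is why the numeric index is extracted first.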

csukuangfj · 2023-04-21T07:31:30Z

egs/librispeech/ASR/local/compute_fbank_librispeech.py

@@ -64,7 +64,7 @@ def get_args():
    parser.add_argument(
        "--perturb-speed",
        type=str,
-        default=True,
+        default=str2bool,


please refer to multidataset.py for how to use str2bool.

That's just a mistake by accident...
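
For reference, a minimal sketch of the pattern the review is asking for: the converter goes in `type=`, and `default=` stays a real boolean. The `str2bool` below is a local, hypothetical re-implementation so the example is self-contained; the recipe would use the helper the project already provides.

```python
import argparse


def str2bool(v):
    """Convert a command-line string such as "true" or "0" to a bool.

    Local helper written for this sketch (an assumption, not the project's code).
    """
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "y", "1"):
        return True
    if v.lower() in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError("Boolean value expected.")


def get_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--perturb-speed",
        type=str2bool,  # converter goes here, not in default=
        default=True,
        help="Whether to apply speed perturbation (help text is illustrative).",
    )
    return parser
```

With `type=str`, passing `--perturb-speed False` would yield the non-empty string `"False"`, which is truthy, so the flag could never be switched off from the command line.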

csukuangfj · 2023-04-21T14:10:25Z

egs/librispeech/ASR/pruned_transducer_stateless7/multidataset.py

+
+        logging.info(f"Loading {len(sorted_filenames)} splits")
+
+        return lhotse.combine(lhotse.load_manifest_lazy(p) for p in sorted_filenames)


Please use
lhotse-speech/lhotse#565

We only need to combine splits from the same dataset.

Yifan Yang and others added 15 commits April 18, 2023 12:41

Add Common Voice for multidataset

762b05f

Add prepare_multidataset.sh

e75985d

Add dataset mixing

dbf2e25

Fix for black

28e4f38

Fix for black

3fff5dc

Update prepare_multidataset.sh

954e4d7

Update prepare_multidataset.sh

f98e121

Update prepare_giga_speech.sh

ceef1cb

update comments

0330cab

Add split and shuffle mechanism

dddb9f7

Add split and shuffle mechanism

6646088

Merge branch 'k2-fsa:master' into multi

80dd778

Add multidataset train

a905dca

Fix for delete

5af68c1

Fix for modify

a8bca53

yfyeung requested a review from csukuangfj April 21, 2023 04:35

csukuangfj reviewed Apr 21, 2023

csukuangfj previously requested changes Apr 21, 2023

Yifan Yang added 3 commits April 21, 2023 15:16

Add comments

081d751

Change type for perturb_speed

69977a7

Fix for style check

f81d75f

yfyeung requested a review from csukuangfj April 21, 2023 07:23

csukuangfj reviewed Apr 21, 2023

Yifan Yang added 3 commits April 21, 2023 15:38

Small fix

307eaa6

Small fix

eb4533f

Add filter

ebc6b97

csukuangfj approved these changes Apr 21, 2023

Remove warning

02d5231

csukuangfj approved these changes Apr 21, 2023

yfyeung merged commit d67a49a into k2-fsa:master Apr 21, 2023
3 checks passed

yfyeung deleted the multi branch April 21, 2023 10:09

csukuangfj reviewed Apr 21, 2023
