making the kaldi import more robust #1129

KarelVesely84 · 2023-08-23T16:33:35Z

get_duration():

- recover if the audio file cannot be loaded in get_duration(), and drop such recordings
- use chunksize for ProcessPoolExecutor.map (avoids ProcessPoolExecutor hanging for large RecordingSets)
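
The two points above could look roughly like the sketch below. This is not the code from this PR: `safe_duration`, `compute_durations`, and the `executor_cls` parameter are hypothetical names, and the standard-library `wave` module stands in for whatever audio reader lhotse actually uses. It only illustrates returning `None` on a read error and passing `chunksize` to `Executor.map`.

```python
import logging
import wave
from concurrent.futures import ProcessPoolExecutor
from typing import Dict, Optional


def safe_duration(path: str) -> Optional[float]:
    """Return the duration in seconds, or None if the file cannot be read."""
    try:
        # Assumption: `wave` is a stand-in for the real audio loader.
        with wave.open(path, "rb") as f:
            return f.getnframes() / f.getframerate()
    except Exception as e:  # deliberately broad: any read error means "no duration"
        logging.warning("Could not read %s: %s", path, e)
        return None


def compute_durations(
    recordings: Dict[str, str],
    num_jobs: int = 4,
    chunksize: int = 100,
    executor_cls=ProcessPoolExecutor,  # hypothetical hook, handy for swapping executors
) -> Dict[str, Optional[float]]:
    """Map recording IDs to durations; unreadable files map to None."""
    with executor_cls(max_workers=num_jobs) as ex:
        # With a process pool, chunksize batches the items sent to each worker
        # instead of submitting them one at a time.
        dur_vals = list(
            ex.map(safe_duration, recordings.values(), chunksize=chunksize)
        )
    return dict(zip(recordings.keys(), dur_vals))
```

`ThreadPoolExecutor` accepts the same `chunksize` argument but ignores it, so it can be passed as `executor_cls` for a quick local check without spawning processes.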

KarelVesely84 · 2023-08-23T16:38:05Z

Hi Piotr, what do you think about this change?
Let's talk... :-D

(I found this issue while importing per-utterance flac files for Chime challenge)
Cheers,
Karel

pzelasko

Looks good to me, I think I found one bug though (see the other comment) -- can you test it?

pzelasko · 2023-08-23T16:59:00Z

lhotse/kaldi.py

        durations = dict(zip(recordings.keys(), dur_vals))

+    # remove recordings with 'None' duration (i.e. there was a read error)
+    for recording_id, duration in durations.items():
+        if durations == None:


Suggested change

- if durations == None:
+ if duration is None:
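
A small illustration of why this matters, using made-up data rather than the PR's code: comparing the whole `durations` dict to `None` is never true, so the buggy condition silently keeps every recording, while the suggested check on the loop variable finds the unreadable one.

```python
# Toy data: one recording failed to load and has no duration.
durations = {"rec1": 3.2, "rec2": None}

# Buggy condition: compares the dict itself to None, which is always False.
buggy_hits = [
    rec_id for rec_id, duration in durations.items() if durations == None  # noqa: E711
]

# Suggested condition: checks the per-recording value with an identity test.
fixed_hits = [
    rec_id for rec_id, duration in durations.items() if duration is None
]

print(buggy_hits)
print(fixed_hits)
```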

csukuangfj · 2023-08-23T20:42:55Z

lhotse/kaldi.py

+            logging.warning(
+                f"[{recording_id}] Could not get duration. "
+                f"Failed to read audio from `{recordings[recording_id]}`. "
+                f"Dropping the recording from manifest."


Suggested change

- f"Dropping the recording from manifest."
+ "Dropping the recording from manifest."
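
For context, adjacent string literals are concatenated by Python whether or not each carries the `f` prefix, so the last line needs no prefix because it has no placeholders. The values below are made up for illustration.

```python
recording_id = "rec1"  # hypothetical recording ID
path = "/data/rec1.flac"  # hypothetical path

msg = (
    f"[{recording_id}] Could not get duration. "
    f"Failed to read audio from `{path}`. "
    "Dropping the recording from manifest."  # plain literal: nothing to interpolate
)
print(msg)
```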

csukuangfj · 2023-08-23T20:43:32Z

lhotse/kaldi.py

        durations = dict(zip(recordings.keys(), dur_vals))

+    # remove recordings with 'None' duration (i.e. there was a read error)
+    for recording_id, dur_value in durations.items():
+        if dur_value == None:


To fix the style issue, we can use `if dur_value is None:` to replace `if dur_value == None:`.

KarelVesely84 · 2023-08-24T08:00:00Z

Ok, both suggested changes are done. I also added a new sanity check...
Cheers,
K.

- no more than 20% of utterances can be dropped on `kaldi import`
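
A sanity check of this kind might be sketched as follows. The function name, the exception type, and the exact comparison are assumptions for illustration. The PR's actual implementation is not shown in this thread.

```python
from typing import Dict, Optional


def drop_unreadable(
    durations: Dict[str, Optional[float]],
    max_dropped_fraction: float = 0.2,  # the 20% limit mentioned above
) -> Dict[str, float]:
    """Drop entries whose duration is None; fail if too many were dropped."""
    kept = {rec_id: dur for rec_id, dur in durations.items() if dur is not None}
    num_dropped = len(durations) - len(kept)
    # Guard against an empty input before dividing.
    if durations and num_dropped / len(durations) > max_dropped_fraction:
        raise RuntimeError(
            f"Dropped {num_dropped} of {len(durations)} recordings, "
            f"more than the allowed {max_dropped_fraction:.0%}."
        )
    return kept
```

Building a new dict avoids deleting keys from `durations` while iterating over it, which would raise a `RuntimeError` in Python.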

pzelasko

Thanks, LGTM

pzelasko reviewed Aug 23, 2023

pzelasko added this to the v1.17 milestone Aug 23, 2023

csukuangfj reviewed Aug 23, 2023

making the kaldi import more robust

3b94fc7

get_duration():
- recover if audio file cannot be loaded for get_duration(), drop such recordings...
- use chunksize for ProcessPoolExecutor::map (avoid hanging of ProcessPoolExecutor for large RecordingSets)

KarelVesely84 force-pushed the kaldi_import_get_duration branch from 087c5ef to 4c1f3cf · August 24, 2023 07:58

incorporating the PR comments, adding sanity check

c70b20b

- not more than 20% utterances can be dropped on `kaldi import`

KarelVesely84 force-pushed the kaldi_import_get_duration branch from 4c1f3cf to c70b20b · August 24, 2023 08:01

pzelasko approved these changes Aug 24, 2023

pzelasko merged commit c6fa990 into lhotse-speech:master Aug 24, 2023
8 of 10 checks passed
