Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

making the kaldi import more robust #1129

Merged

Conversation

KarelVesely84
Copy link
Contributor

get_duration():

  • recover if audio file cannot be loaded for get_duration(), drop such recordings...
  • use chunksize for ProcessPoolExecutor::map (avoid hanging of ProcessPoolExecutor for large RecordingSets)

@KarelVesely84
Copy link
Contributor Author

KarelVesely84 commented Aug 23, 2023

Hi Piotr, what would you think about this change ?
Let's talk... :-D

(I found this issue while importing per-utterance flac files for Chime challenge)
Cheers,
Karel

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I think I found one bug though (see the other comment) -- can you test it?

lhotse/kaldi.py Outdated
durations = dict(zip(recordings.keys(), dur_vals))

# remove recordings with 'None' duration (i.e. there was a read error)
for recording_id, duration in durations.items():
if durations == None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if durations == None:
if duration is None:

@pzelasko pzelasko added this to the v1.17 milestone Aug 23, 2023
lhotse/kaldi.py Outdated
logging.warning(
f"[{recording_id}] Could not get duration. "
f"Failed to read audio from `{recordings[recording_id]}`. "
f"Dropping the recording from manifest."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"Dropping the recording from manifest."
"Dropping the recording from manifest."

lhotse/kaldi.py Outdated
durations = dict(zip(recordings.keys(), dur_vals))

# remove recordings with 'None' duration (i.e. there was a read error)
for recording_id, dur_value in durations.items():
if dur_value == None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix the style issue, we can use

        if dur_value is None:

to replace

        if dur_value == None:

get_duration():
- recover if audio file cannot be loaded for get_duration(), drop such recordings...
- use chunksize for ProcessPoolExecutor::map (avoid hanging of ProcessPoolExecutor for large RecordingSets)
@KarelVesely84
Copy link
Contributor Author

Ok, both suggested changes are done. I also added a new sanity check...
Cheers,
K.

- not more than 20% utterances can be dropped on `kaldi import`
Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

@pzelasko pzelasko merged commit c6fa990 into lhotse-speech:master Aug 24, 2023
8 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants