Kaldi export of lhotse supervisions manifest is incorrect #811

Closed
ngoel17 opened this issue Sep 14, 2022 · 11 comments · May be fixed by #693
Labels
bug Something isn't working

Comments

@ngoel17

ngoel17 commented Sep 14, 2022

It appears to me that the Kaldi export of lhotse supervisions has a bug.
Consider the Fisher supervisions manifest entry below.

{"id": "fe_03_00001-000", "recording_id": "fe_03_00001", "start": 3.76, "duration": 1.78, "channel": 0, "text": "and i generally prefer", "language": "English", "speaker": "2602"}

wav.scp encodes the channel as
fe_03_00001_0

The channel 0 should be appended to the segments file so that the output is like
fe_03_00001-000 fe_03_00001_0 3.76 5.54

but the actual output is
fe_03_00001-000 fe_03_00001 3.76 5.54
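
The practical consequence is that the recording column in segments no longer matches any key in wav.scp. A minimal sketch of that consistency check, using plain dictionaries rather than lhotse objects (the helper name `missing_recordings` and the `audio.sph` path are assumptions made for illustration, not part of lhotse or Kaldi):

```python
def missing_recordings(wav_scp, segments):
    """Return segment IDs whose recording column has no entry in wav.scp."""
    return [
        seg_id
        for seg_id, line in segments.items()
        if line.split()[0] not in wav_scp
    ]


# wav.scp key as reported in the issue; the path is a placeholder.
wav_scp = {"fe_03_00001_0": "audio.sph"}

# segments content as currently exported vs. as expected.
actual = {"fe_03_00001-000": "fe_03_00001 3.76 5.54"}
expected = {"fe_03_00001-000": "fe_03_00001_0 3.76 5.54"}

print(missing_recordings(wav_scp, actual))    # the segment is orphaned
print(missing_recordings(wav_scp, expected))  # nothing is missing
```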

@pzelasko
Collaborator

Oh, good point, I don't think this particular case was tested against multi-channel exports. @jtrmal would you be interested to help with this one?

@ngoel17
Author

ngoel17 commented Sep 14, 2022

# segments
# Suffix the recording ID with the channel so that it matches the wav.scp key.
save_kaldi_text_mapping(
    data={
        sup.id: f"{sup.recording_id}_{sup.channel} {sup.start} {sup.end}"
        for sup in supervisions
    },
    path=output_dir / "segments",
)

Would something like this be OK in kaldi.py?

@ngoel17
Author

ngoel17 commented Sep 14, 2022

If one tries to use the Kaldi-format data, there turns out to be one more issue. The segment IDs are not in the format Kaldi expects (prefixed by the speaker ID), so fix_data_dir.sh keeps complaining that utt2spk is not sorted. I fixed this manually, but in general I don't know if it's worthwhile fixing.
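
To illustrate the sorting complaint, here is a minimal sketch, not lhotse code. It assumes the usual Kaldi convention that utt2spk, once sorted by utterance ID, must also be sorted by speaker, which holds when each utterance ID starts with its speaker ID. Only the first utterance and speaker "2602" come from the manifest above; the other two entries and the helper names are made up for the example.

```python
def speakers_sorted(utt2spk):
    """True if the speaker column is sorted once the lines are sorted by utterance ID."""
    speakers = [spk for _, spk in sorted(utt2spk.items())]
    return speakers == sorted(speakers)


def prefix_with_speaker(utt2spk):
    """Return a new mapping whose utterance IDs are prefixed by the speaker ID."""
    return {f"{spk}-{utt}": spk for utt, spk in utt2spk.items()}


raw = {
    "fe_03_00001-000": "2602",  # from the manifest in this issue
    "fe_03_00001-001": "2603",  # hypothetical
    "fe_03_00002-000": "2602",  # hypothetical
}
prefixed = prefix_with_speaker(raw)

print(speakers_sorted(raw))       # speaker order is broken by recording-based IDs
print(speakers_sorted(prefixed))  # speaker-prefixed IDs sort consistently
```

Python compares strings by code point, which for ASCII IDs like these agrees with the byte-wise (C locale) ordering that Kaldi scripts rely on.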

@jtrmal
Collaborator

jtrmal commented Sep 15, 2022 via email

@desh2608 desh2608 added the bug Something isn't working label Sep 22, 2022
@jtrmal
Collaborator

jtrmal commented Sep 26, 2022

Sorry for the delay, I had some family stuff happening. I'm on it.

@jtrmal
Collaborator

jtrmal commented Sep 26, 2022

I have a feeling this change was already reverted once in the past because, IIRC, Piotr said their infrastructure relies on the formatting as it was.

@pzelasko
Collaborator

Oh yeah, the comment was about being able to export and re-import with the same IDs. How about we make the new behavior optional but enabled by default? This way it does the right thing for everybody.
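
A minimal sketch of what "optional but enabled by default" could look like. The `Sup` dataclass, the `segments_mapping` function, and the `append_channel` flag are all hypothetical names chosen for this example; they are not the lhotse API and not necessarily what the eventual fix uses.

```python
from dataclasses import dataclass


@dataclass
class Sup:
    """Hypothetical stand-in for a supervision segment."""
    id: str
    recording_id: str
    start: float
    end: float
    channel: int


def segments_mapping(supervisions, append_channel=True):
    """Build segment ID -> 'recording start end' lines for a Kaldi segments file."""
    mapping = {}
    for sup in supervisions:
        # New default: suffix the channel so the recording matches the wav.scp key.
        # With append_channel=False the previous formatting is preserved.
        rec = f"{sup.recording_id}_{sup.channel}" if append_channel else sup.recording_id
        mapping[sup.id] = f"{rec} {sup.start} {sup.end}"
    return mapping


sups = [Sup("fe_03_00001-000", "fe_03_00001", 3.76, 5.54, 0)]
new_style = segments_mapping(sups)
old_style = segments_mapping(sups, append_channel=False)
```

Callers who depend on the previous formatting would pass `append_channel=False`, while everyone else gets segments that are consistent with wav.scp.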

@jtrmal
Collaborator

jtrmal commented Sep 26, 2022

One option might be to implement/reimplement the custom formatters I have a PR on.
Another option is to start adding options/switches, I guess.

@pzelasko
Collaborator

I am OK with either option; it's up to you.

@jtrmal
Collaborator

jtrmal commented Sep 30, 2022

This is being addressed in #831. I apologize, @ngoel17 and everyone else: I didn't read the report carefully enough. I discovered this on my own and realized only later that it's a dupe.

@jtrmal
Collaborator

jtrmal commented Sep 30, 2022

Resolved via #831.

@jtrmal jtrmal closed this as completed Sep 30, 2022

4 participants