Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The mismatch between split and split-lazy, specifically counting from 0 or 1 #1152

Closed
yfyeung opened this issue Sep 16, 2023 · 6 comments · Fixed by #1156
Closed

The mismatch between split and split-lazy, specifically counting from 0 or 1 #1152

yfyeung opened this issue Sep 16, 2023 · 6 comments · Fixed by #1156

Comments

@yfyeung
Copy link
Contributor

yfyeung commented Sep 16, 2023

I note that lhotse split will generate splits that counting from 1, while lhotse split-lazy will generate splits that counting from 0.
In some icefall recipes like gigaspeech, we initially use lhotse split and then change to use lhotse split-lazy. However, we didn't consider this mismatch, resulting in some bugs.

@yfyeung
Copy link
Contributor Author

yfyeung commented Sep 16, 2023

For lhotse split:

def split_sequence(

For lhotse split-lazy:
def split_manifest_lazy(

@pzelasko
Copy link
Collaborator

Sorry for that, I think initially I wanted to be compatible with Kaldi's convention of 1-based splits for data directories, and later I didn't remember that when introducing the lazy thing. We should fix that. I probably prefer to adopt 0-based counting everywhere, seems more consistent with the rest of the and library and Python in general. WDYT @yfyeung @csukuangfj @desh2608 @danpovey?

@yfyeung
Copy link
Contributor Author

yfyeung commented Sep 17, 2023

Thanks. I agree with adhering to a 0-based counting system.

Best regards.

@desh2608
Copy link
Collaborator

Submitting jobs on an SGE cluster requires indices starting at 1, which I suppose is why 1-based indexing is used in Kaldi. I would suggest adding a "start_index" option to those functions, which defaults to 0.

@desh2608
Copy link
Collaborator

The above is also why we added the num_digits option in the lazy version. In the original implementation, I think padding was by default, but this made it hard to use the split manifests in array job submissions.

pzelasko added a commit that referenced this issue Sep 18, 2023
pzelasko added a commit that referenced this issue Sep 18, 2023
* Tutorial materials in main readme page

* Fixes for #1152 #1153 and #1154

* Fix isinstance use in Python 3.7-3.9
@pzelasko
Copy link
Collaborator

Thanks, should be fixed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants