Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize save_audios() #1131

Merged
merged 2 commits into from
Sep 5, 2023

Conversation

KarelVesely84
Copy link
Contributor

Hi Piotr and Fangjun,
i have also some improvements for the CutSet::save_audios() function.
This is mainly to improve the cache hit ratio for long audio files with
many utterances in each audio file.

I have also a calling script for that, would you be interested in having that too ?

Cheres,
Karel

  • added sorted method : CutSet::sort_by_recording_id()
  • allow to disable shuffling of CutSet inside CutSet::save_audios()
    • both changes improve cache hit ratio
  • CutSet::save_audios() : show in log if caching was active
  • caching.py : replace Union[] by Optional[]

@KarelVesely84
Copy link
Contributor Author

Hi @pzelasko @csukuangfj , did you have a chance to "peek" into this PR ?
Thank you, Karel

@csukuangfj
Copy link
Contributor

It looks good to me

👍 💯

Copy link
Collaborator

@pzelasko pzelasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Karel! Somehow this slipped through. I left a couple of comments.

@@ -2374,6 +2387,11 @@ def save_audios(
)
executor = None

logging.info(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally don't emit logs in lhotse methods (for better or worse), can we remove it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, actually, by default the logging.info(.) level does not print anything.
The first "active" verbosity level by default is : logging.warn(.).

So, technically, it does not print anything, unless a debug mode is started by lhotse user...
Piotr, do you still want to remove it ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If lhotse is used in icefall and if icefall uses a logging level >= info, I think this message would still be printed out.

Copy link
Contributor Author

@KarelVesely84 KarelVesely84 Sep 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is also true, and info is the default log-level in icefall.
https://github.com/k2-fsa/icefall/blob/master/icefall/utils.py#L113

@@ -1413,6 +1414,17 @@ def combine_same_recording_channels(self) -> "CutSet":
groups = groupby(lambda cut: (cut.recording.id, cut.start, cut.end), self)
return CutSet.from_cuts(MultiCut.from_mono(*cuts) for cuts in groups.values())

def sort_by_recording_id(self, ascending: bool = True) -> "CutSet":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a unit test to cover this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to that.

I found a similar unit-test test_cut_set_sort_by_duration().
Could you point me to the place, where this test is called ?

I did not find any code calling test_cut_set_sort_by_duration() in the /test
folder and there are some arguments necessary to be filled...
Thx, Karel

@@ -2326,6 +2338,7 @@ def save_audios(
executor: Optional[Executor] = None,
augment_fn: Optional[AugmentFn] = None,
progress_bar: bool = True,
shuffle_on_split: bool = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This option is undocumented

@pzelasko pzelasko added this to the v1.17 milestone Aug 31, 2023
- added sorted method : CutSet::sort_by_recording_id()
- allow to disable shuffling of CutSet inside `CutSet::save_audios()`
   - both changes improve cache hit ratio
- `CutSet::save_audios()` : show in log if caching was active
- caching.py : replace Union[] by Optional[]
@KarelVesely84
Copy link
Contributor Author

KarelVesely84 commented Sep 4, 2023

The code is updated, but the unit-test is not yet complete (it is missing the calling code).

@pzelasko
Copy link
Collaborator

pzelasko commented Sep 4, 2023

We're using pytest which auto-detects all test functions (by default every function with prefix test_) and calls them for you. In order to make the tests run you'll need to decorate your test with a parametrize decorator that specifies the grid of parameters over which you want to test the function, see e.g. the function you based your test on test_cut_set_sort_by_duration -- every tuple corresponds to a set of keyword arguments and gets a separate function invocation. Also for clarity, the first argument that seems to be missing is a fixture (i.e. an object returned by another function decorated with @pytest.fixture) that is created once and re-used in all tests (to locate it search for def cut_set_with_mixed_cut).

- adding unit-test, removing logging.info(), documenting `shuffle_on_split`
@KarelVesely84
Copy link
Contributor Author

Hi @pzelasko @csukuangfj ,
all the remarks are integrated, and the unit-tests are passing.
It seems to be ready.
Karel

@pzelasko
Copy link
Collaborator

pzelasko commented Sep 5, 2023

Cool, thanks!

@pzelasko pzelasko merged commit fdf042b into lhotse-speech:master Sep 5, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants