Optimize `cut_into_windows` for long cuts #1150

flyingleafe · 2023-09-15T07:40:59Z

Even though the ability to optimize truncate calls using IntervalTree index on supervisions has been present for quite a while, it hasn't been used in the cut_into_windows method... I don't know why.

I used this optimization when the numbers of windows and supervisions were not very small and achieved dramatic (~50x) improvements for cutting sets of very long recordings into windows:

In [0]:
cs.describe()

Out[0]:
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 36       │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 35:29:42 │
├───────────────────────────┼──────────┤
│ mean                      │ 3549.5   │
├───────────────────────────┼──────────┤
│ std                       │ 136.8    │
├───────────────────────────┼──────────┤
│ min                       │ 3355.4   │
├───────────────────────────┼──────────┤
│ 25%                       │ 3477.2   │
├───────────────────────────┼──────────┤
│ 50%                       │ 3539.7   │
├───────────────────────────┼──────────┤
│ 75%                       │ 3558.1   │
├───────────────────────────┼──────────┤
│ 99%                       │ 4023.6   │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 4037.8   │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 4049.1   │
├───────────────────────────┼──────────┤
│ max                       │ 4052.0   │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 36       │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 9339     │
╘═══════════════════════════╧══════════╛
Speech duration statistics:
╒══════════════════════════════╤══════════╤═════════════════════╕
│ Total speech duration        │ 34:38:04 │ 97.58% of recording │
├──────────────────────────────┼──────────┼─────────────────────┤
│ Total speaking time duration │ 34:38:01 │ 97.57% of recording │
├──────────────────────────────┼──────────┼─────────────────────┤
│ Total silence duration       │ 00:51:38 │ 2.42% of recording  │
╘══════════════════════════════╧══════════╧═════════════════════╛

In [1]:
%%time
# without the change
windowed = cs.cut_into_windows(duration=30.0).to_eager()
len(windowed)

Out [1]:
CPU times: user 2min 51s, sys: 135 ms, total: 2min 52s
Wall time: 2min 52s

4278

In [2]:
%%time
# with the change
windowed = cs.cut_into_windows(duration=30.0).to_eager()
len(windowed)

Out [2]:
CPU times: user 3.7 s, sys: 44.2 ms, total: 3.74 s
Wall time: 3.7 s

4278

(Tested on Intel Xeon Gold 5315Y CPU @ 3.20GHz)

pzelasko

Great catch!! Perhaps we can simplify the code to always build the tree, I doubt the overhead with small cuts would be significant.

flyingleafe · 2023-09-16T06:08:09Z

@pzelasko I actually would ultimately opt for always having the supervisions (and tracks in MixCut, and alignments inside a supervision) stored in an interval tree, maybe a thin wrapper around one from the library which can fall back to linear splits for small tree sizes... But that would be a major codebase overhaul, which should be further discussed, though I anticipate that such an approach would make lots of operations more efficient while making lots of code much simpler.

In the meanwhile, I removed the condition, and the IntervalTree is built every time.

pzelasko · 2023-09-16T18:44:17Z

Interesting point. Do you see any immediate benefits of transitioning to interval tree to hold the supervisions? Except for arrays/tensors, I generally wanted to keep the APIs free of third-party classes (they're practically limited to built-in types and a few Lhotse basic types).

flyingleafe · 2023-09-18T07:09:59Z

@pzelasko I do not see particular benefits of keeping limited to Python built-in types. Tbh, I would rather migrate everything related to manifests and their sets into some kind of performant pure immutable persistent trees and sequences with laziness by default (i.e. rewrite everything in Haskell lol).

To keep it real, I think it would be quite nice to have a kind of Splittable interface with methods split_right and split_left, which are always <= O(log n). Both methods split the interval-based object (and the set of all its descendants) into two around the point and returns a pair of left and right objects. If the object is in fact atomic (i.e. a Supervision without alignments and with text or custom data), it is a no-op, and the original object is returned either as left or right result (depending on the method called).

Split is more simple and fundamental than truncate, and truncate and all other operations can be implemented using it. Some operations (cut_into_windows being the example) are implemented more efficiently using individual splits (and also in a potentially easily parallelizable way, if we cut dichotomically starting from the middle).

If any type of manifest supports Splittable efficiently, then doing all other operations efficiently seems almost trivial. We can define proper __repr__, __getitem__ and __iter__ for our generic IntervalTree based container which supports splits so that it can be treated completely similarly to a simple list, if a user wants. I do not see particular upsides in keeping strictly to the builtin data structures.

pzelasko · 2023-09-19T19:35:42Z

@pzelasko I do not see particular benefits of keeping limited to Python built-in types. Tbh, I would rather migrate everything related to manifests and their sets into some kind of performant pure immutable persistent trees and sequences with laziness by default (i.e. rewrite everything in Haskell lol).

At some point I contemplated moving basic data types to C++ structs for smaller memory footprint and faster (de)serialization, and maybe actual immutability, but it would have drastically complicated maintenance and further development of the library and limited the number of people who could help out. I like the idea in principle but it's likely not worth the trouble.

To keep it real, I think it would be quite nice to have a kind of Splittable interface with methods split_right and split_left, which are always <= O(log n). Both methods split the interval-based object (and the set of all its descendants) into two around the point and returns a pair of left and right objects. If the object is in fact atomic (i.e. a Supervision without alignments and with text or custom data), it is a no-op, and the original object is returned either as left or right result (depending on the method called).

Split is more simple and fundamental than truncate, and truncate and all other operations can be implemented using it. Some operations (cut_into_windows being the example) are implemented more efficiently using individual splits (and also in a potentially easily parallelizable way, if we cut dichotomically starting from the middle).

If any type of manifest supports Splittable efficiently, then doing all other operations efficiently seems almost trivial. We can define proper __repr__, __getitem__ and __iter__ for our generic IntervalTree based container which supports splits so that it can be treated completely similarly to a simple list, if a user wants. I do not see particular upsides in keeping strictly to the builtin data structures.

I understand the technical solution, but what would be the impact/motivation?

flyingleafe · 2023-09-20T09:12:34Z

@pzelasko The motivation is to unify and heavily simplify all logic around cutting stuff recursively, throwing out large amounts of repetitive code and stopping caring about performance optimizations on a case-by-case basis. I.e. imagine if MixedCut was like

class MixedCut(Cut):
     id: str
     tracks: OurIntervalTree[MixTrack]
     transforms: ...
     
     ...

and Cut is supposed to implement Splittable, and OurIntervalTree is Splittable, such that it continues to split a border element further if the element is splittable itself, and MixTrack is trivially also Splittable, being just a thin wrapper around Cut. A correct split method implementation (from which truncate is derived), would then be just:

def split(self, offset: Seconds, side: Side = LEFT, preserve_id: Optional[Side] = None) -> Tuple[Cut, Cut]:
    l_tracks, r_tracks = self.tracks.split(offset, side, preserve_id)
    l_id = self.id if preserve_id == LEFT else str(uuid4())
    r_id = self.id if preserve_id == RIGHT else str(uuid4())
    return (
        self._fastcopy(id=l_id, tracks=l_tracks, duration=offset),
        self._fastcopy(id=r_id, tracks=r_tracks, duration=self.duration - offset)
    )

where _fastcopy handles the edge cases of no tracks left and SNR fix:

def _fastcopy(self, id, tracks, duration=None):
   np_tracks = [t for t in tracks if not isinstance(t, PaddingCut)]
   duration = self.duration if duration is None else duration
   if len(np_tracks) == 0:
      return PaddingCut(id=id, sampling_rate=self.sampling_rate, duration=duration, ...)
   if all(t.snr is not None for t in np_tracks):
      tracks = ... //... do this snr resetting
   return fastcopy(self, tracks=tracks)

For the DataCut with supervisions inside an interval tree, split implementation would look almost exactly the same. Well, actually, it can be even exactly the same and be shared inside the Cut definition, if we:

abstract away general fastcopy as an overridable method;
abstract away a list of objects' properties which should be split further (e.g. supervisions for DataCut, tracks for MixedCut);
handle start and duration properly for DataCut, ignoring them in MixedCuts implementation of fastcopy.

Imagine how much code could be thrown out and how much smaller the codebase could become.
So basically what I say is that such an approach can be a starting point in a major (yet very doable) refactoring which would result in Lhotse's codebase becoming much smaller, more maintainable and more performant.

danpovey · 2023-09-20T09:52:25Z

I think splitting very long utterances into windows is kind of an unusual use-case that shouldn't drive the overall design.
My feeling is that in general if the design needs to change it should be in the direction of being simpler and more self-explanatory.
Also I think cutting something into windows is something that basically shouldn't take that long anyway, and shouldn't require any fancy implementation.

Optimize cut_into_windows for long cuts

46edcd2

pzelasko reviewed Sep 15, 2023

View reviewed changes

Build IntervalTree always

5971aba

desh2608 approved these changes Sep 16, 2023

View reviewed changes

desh2608 merged commit 567ba29 into lhotse-speech:master Sep 16, 2023
10 checks passed

pzelasko added this to the v1.17 milestone Oct 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize `cut_into_windows` for long cuts #1150

Optimize `cut_into_windows` for long cuts #1150

flyingleafe commented Sep 15, 2023

pzelasko left a comment

flyingleafe commented Sep 16, 2023

pzelasko commented Sep 16, 2023

flyingleafe commented Sep 18, 2023 •

edited

Loading

pzelasko commented Sep 19, 2023

flyingleafe commented Sep 20, 2023

danpovey commented Sep 20, 2023

Optimize cut_into_windows for long cuts #1150

Optimize cut_into_windows for long cuts #1150

Conversation

flyingleafe commented Sep 15, 2023

pzelasko left a comment

Choose a reason for hiding this comment

flyingleafe commented Sep 16, 2023

pzelasko commented Sep 16, 2023

flyingleafe commented Sep 18, 2023 • edited Loading

pzelasko commented Sep 19, 2023

flyingleafe commented Sep 20, 2023

danpovey commented Sep 20, 2023

Optimize `cut_into_windows` for long cuts #1150

Optimize `cut_into_windows` for long cuts #1150

flyingleafe commented Sep 18, 2023 •

edited

Loading