Initial draft of how the CutSet might look like #16
Conversation
lhotse/cut.py (outdated diff)

    supervisions: List[SupervisionSegment]

    # The features can span longer than the actual cut.
Maybe we can note here that the Features object "knows" its start and end time within the underlying recording, and that we expect the interval [begin, begin + duration] to be a subset of the interval represented in the features.
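To make the "Features knows its own interval" idea concrete, here is a minimal sketch. The field names mirror the PR, but the `end` property and `covers` helper are hypothetical additions for illustration, not the actual Lhotse API:

```python
from dataclasses import dataclass

Seconds = float

@dataclass
class Features:
    # Simplified: the real Features class carries more fields.
    start: Seconds     # where the features begin within the recording
    duration: Seconds  # how much of the recording they cover

    @property
    def end(self) -> Seconds:
        return self.start + self.duration

    def covers(self, begin: Seconds, duration: Seconds) -> bool:
        """True iff [begin, begin + duration] is a subset of [start, end]."""
        return self.start <= begin and begin + duration <= self.end

feats = Features(start=5.0, duration=10.0)
assert feats.covers(begin=6.0, duration=3.0)       # fully inside
assert not feats.covers(begin=12.0, duration=5.0)  # extends past the end
```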
        raise NotImplementedError()

    def overlay(self, other: 'Cut', offset: Seconds = 0.0) -> 'Cut':
        """
This might require some way to specify the SNR, in case the other cut is noise?
Done
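To illustrate the SNR idea discussed above, here is a rough NumPy sketch of scaling a noise signal to a target SNR before adding it to a speech signal. The helper name and shapes are assumptions for illustration, not the actual Lhotse implementation:

```python
import numpy as np

def mix_with_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mix has the requested signal-to-noise ratio, then add."""
    signal_energy = np.mean(signal ** 2)
    noise_energy = np.mean(noise ** 2)
    # SNR (dB) = 10 * log10(signal_energy / (gain^2 * noise_energy))
    gain = np.sqrt(signal_energy / (noise_energy * 10 ** (snr_db / 10)))
    return signal + gain * noise

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)
noise = rng.normal(size=16000)
mixed = mix_with_snr(speech, noise, snr_db=10.0)
assert mixed.shape == speech.shape
```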
    @dataclass
    class Cut:
        id: str
Let's document it to clarify what it is. I'm thinking something like:
""" A Cut is a single segment that we'll train on. It contains the features corresponding to
a piece of a recording, with zero or more SupervisionSegments.
"""
Done
I pushed some changes focusing on the source separation use-case, as Manuel was interested in testing Lhotse with Asteroid. It's not ready yet, but I'd appreciate any comments at this point.
@@ -122,20 +122,56 @@ class Features:

    channel_id: int
    start: Seconds
    duration: Seconds

    # The Features object must know its frame length and shift to trim the features matrix when loading.
Incidentally: I hope to use the 'snip_edges: false' option for Kaldi-like feature extraction; this makes it so that the num-frames does not depend on the frame-shift, which is a nice simplification. We could possibly make this a requirement.
... in any case I think we need to be clear about what exactly 'start' and 'duration' mean. E.g. is it the 1st and last+1 sample which is covered by the features? Is it the center of the 1st frame until the center of the last-plus-one frame? Perhaps it could start at n*frame_shift. [Note: the center of each frame would be n*frame_shift + 0.5.]
Also regarding snip_edges: for consistency when operating inside the file, we could possibly do it differently for internal segments? Maybe make snip_edges false, but whenever we actually extract features for a segment, we pad with zeros or with the appropriate audio context in calling code?
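For reference, the frame-count behavior under discussion can be sketched as follows. This is my paraphrase of Kaldi's NumFrames logic (worth double-checking against Kaldi's feature-window code): with snip_edges true the count depends on both frame length and shift, while with snip_edges false it depends on the shift only.

```python
def num_frames(num_samples: int, frame_length: int, frame_shift: int,
               snip_edges: bool) -> int:
    """Paraphrase of Kaldi's frame-count rules; treat as a sketch."""
    if snip_edges:
        # Only frames that fit entirely inside the signal are kept.
        if num_samples < frame_length:
            return 0
        return 1 + (num_samples - frame_length) // frame_shift
    # snip_edges == False: frames are centered on n * frame_shift,
    # so the count is independent of frame_length.
    return (num_samples + frame_shift // 2) // frame_shift

# 1 second at 16 kHz, 25 ms frames, 10 ms shift:
print(num_frames(16000, 400, 160, snip_edges=True))   # 98
print(num_frames(16000, 400, 160, snip_edges=False))  # 100
```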
> Incidentally: I hope to use the 'snip_edges: false' option for Kaldi-like feature extraction.. this makes it so that the num-frames does not depend on the frame-shift, which is a nice simplification. We could possibly make this requirement.

I think you meant frame-length, right? I had to double-check with the Kaldi docs. Could you elaborate on how it makes things easier?

> ... in any case I think we need to be clear about what exactly 'start' and 'duration' mean.

As I understand it currently, the 1st sample will start at `start`. It's less clear with `duration`: with `snip_edges = False` (the default), the last sample is either at the end or after it; with `snip_edges = True`, it is either at the end or before it. You lost me at the `n*frame_shift` idea.

> Also regarding snip_edges: for consistency when operating inside the file, we could possibly do it differently for internal segments? Maybe make snip_edges false but whenever we actually extract features for a segment, we pad with zeros or with the appropriate audio context in calling code?

Yeah, that's a good remark. When I add the option to extract features for specified segments, I'll take care to either pass extra context samples or pad with zeros.
frame-length, yes.
I think by `n*frame_shift`, I may have meant that `duration == n * frame_shift`, where n == the number of frames.
It would probably make sense to have `duration` always be `n * frame_shift`, but understand that this is kind of approximate, and actually it may use slightly more context than that.
I'd rather make it so that downstream, these cuts behave quite simply w.r.t. the relationship of duration and num-frames, and the user doesn't have to worry about the frame-length and how it may impact the num-frames. That was a rabbit-hole in Kaldi itself.
OK, so you're asking that based on the `SupervisionSegment`s and the `Features`, we create the `Cut`s so that their duration is a multiple of `frame_shift`. I'm thinking we can easily do that for datasets where the recording is a stream (e.g. a 15-minute phone conversation), as we have a lot of control there. For datasets where the recording is a segment already (like LibriMix and probably many others), it basically means removing a very small part of the end of the recording. Are we okay with the trade-off that we slightly modify the segment boundaries in `Cut` to have duration-num_frames consistency?
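The trade-off described here, trimming segment boundaries so that `duration == num_frames * frame_shift` holds exactly, could look like the following sketch (hypothetical helper, not Lhotse code):

```python
def snap_duration(duration: float, frame_shift: float = 0.01) -> float:
    """Trim a segment's duration *down* to a whole number of frames
    (with a small tolerance for floating-point error), so that
    duration == num_frames * frame_shift holds exactly."""
    num_frames = int(duration / frame_shift + 1e-8)
    return round(num_frames * frame_shift, 10)

print(snap_duration(3.456))  # 3.45 -> 345 frames at a 10 ms shift
print(snap_duration(2.0))    # 2.0 -> already a multiple of the shift
```

The tolerance matters: a naive `int(2.0 / 0.01)` can yield 199 due to floating-point rounding, silently dropping a frame.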
    mixture = torch.from_numpy(mixture_cut.load_features(root_dir=self.root_dir))
    sources = torch.stack(
        [torch.from_numpy(source_cut.load_features(root_dir=self.root_dir)) for source_cut in source_cuts],
        dim=0
    )
What features would you expect here? The sum of the sources' features will probably not sum to the mixture's features, right?
If we'd like another type of supervision, for example a mask or something like that, would we compute it on the fly or use the `supervisions` from the `CutSet`?
I think maybe it is related to `overlay_fbank`? You can just mix fbanks with no log, no?
I looked at one of the datasets in Asteroid, LibriMix, and saw that you were returning a list of sources as well as the mix, so I mimicked it for the first take. We can make this (or another Dataset variant) return a mask instead if you'd like. Can you point me to an example of how you obtain a mask as supervision?
I intended to stick to log-mel energies for now, unless it doesn't make sense for source separation?
> I intended to stick to log-mel energies for now, unless it doesn't make sense for source separation?

It's unusual to perform separation on log-mel indeed, but we can try.
In Asteroid, we only feed the waveforms at the output of the Dataset; features are computed on the fly, as well as training targets. That would be one place where we compute binary masks and ratio masks.
Granted, this is not the most efficient, as those computations could be done ahead of time, but in this case we chose to do it in the model rather than in the dataset.
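The binary and ratio masks mentioned here are usually the ideal binary mask (IBM) and ideal ratio mask (IRM). A rough NumPy sketch (hypothetical helper, not Asteroid's API), assuming magnitude spectrograms stacked as (num_sources, freq, time):

```python
import numpy as np

def ideal_masks(source_mags: np.ndarray):
    """Given magnitude spectrograms of shape (num_sources, freq, time),
    return the ideal binary masks (IBM) and ideal ratio masks (IRM)."""
    eps = 1e-12
    total = source_mags.sum(axis=0, keepdims=True) + eps
    irm = source_mags / total  # soft mask in [0, 1]; sums to ~1 over sources
    # IBM: 1 where a source dominates the time-frequency bin, else 0.
    ibm = (source_mags == source_mags.max(axis=0, keepdims=True)).astype(np.float32)
    return ibm, irm

mags = np.abs(np.random.default_rng(0).normal(size=(2, 257, 100)))
ibm, irm = ideal_masks(mags)
assert np.allclose(irm.sum(axis=0), 1.0, atol=1e-4)
```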
We can add masks and support spectrograms in a future PR
I have a prototype implementation for CutSet/Cut/MixedCut working, as well as two simple CLI modes (make-trivial-cuts, make-overlayed-cuts) to test it on mini_librispeech as I'm developing it. The next steps will be making a pytorch Dataset for speech separation from the cuts based on the draft I already checked in, and then probably making a recipe for some dataset that @mpariente or @popcornell recommend.
Also, just to verify that the method I used to overlay the log-mel energies makes sense, I've looked at some of the resulting spectrograms (you can also see what the "low-level" cut manipulation code usage looks like). Note that the actual length of the feature matrix changes depending on the offsets used. See attached:
[image] https://user-images.githubusercontent.com/15930688/83714474-e1622700-a5f8-11ea-809a-b41a0635cb9b.png
fantastic progress!!
I didn't read all the code, but this looks very nice!
I'd recommend using LibriMix (scripts available on GitHub), for which you can also find a mini version on Zenodo.
That is great!! Can't wait to see this integrated in Asteroid; it looks very promising. I second LibriMix, and let's see how a recipe will turn out.
I'm super happy: with today's commits we have a first "end-to-end" data processing pipeline working. I was able to download MiniLibriMix and create the data using the CLI.
It is still a prototype that's likely going to undergo some significant changes, but it's nice to have something working. I think I might merge this (after looking again through the comments to make sure they've been resolved) and then extend it a bit. In particular, as @popcornell mentioned, I could create an alternative LibriMix dataset version that uses the signals mixed in the time domain; it will be good to test both the flexibility of the library and the correctness of the feature overlaying. Feel free to review!
Oh yeah, and I think in the next PRs we can make a variant of this dataset that computes the mask for separation; I just don't want to make this PR larger than it already is.
        until: Optional[Seconds] = None,
        keep_excessive_supervisions: bool = True
    ) -> 'Cut':
        new_start = self.start + offset
I'm not sure how I'd find documentation?
... and I don't understand the '*'.
Hmm, yeah, in some places I might've forgotten about the docs. Let me revise the PR before you go further; I'll add the relevant docstrings.
The '*' was introduced in Python 3 and means only keyword arguments are allowed after it; i.e. f(10) won't work, you have to use f(until=10). The reason I wanted to use it is that there are multiple arguments of the same type and it's easy to mix them up accidentally. As you can see, I generally have a strong preference for using keyword args to avoid mix-ups (especially when refactoring) and increase readability.
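A small sketch of the '*' (keyword-only arguments) behavior described above; `truncate`, `offset`, and `until` are illustrative names here, not the exact PR signature:

```python
from typing import Optional

Seconds = float

def truncate(*, offset: Seconds = 0.0, until: Optional[Seconds] = None) -> str:
    # Everything after '*' must be passed by keyword.
    return f"offset={offset}, until={until}"

print(truncate(until=10.0))  # offset=0.0, until=10.0

try:
    truncate(10.0)  # positional call is rejected
except TypeError:
    print("positional call raised TypeError, as intended")
```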
@@ -0,0 +1,57 @@

    - id: mixed-cut-id
Note that the `CutSet` is currently a heterogeneous data structure, as `Cut` and `MixedCut` don't have 100% identical interfaces. I'm not sure at the moment whether we should have a separate `MixedCutSet` to avoid this or not.
The `Cut` and `MixedCut` semantics are actually a bit different: they do share the ability to load features, provide supervision, and the duration attribute, but for `MixedCut` it doesn't make sense to specify e.g. the start or the channel, like for `Cut`.
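One way to express the shared subset of the two interfaces, so a heterogeneous `CutSet` can still be typed, is a structural protocol. This is a hypothetical sketch using `typing.Protocol`, not what the PR actually does:

```python
from dataclasses import dataclass
from typing import List, Protocol

class CutLike(Protocol):
    """The subset of the interface shared by Cut and MixedCut,
    per the discussion above (hypothetical sketch)."""
    id: str
    duration: float

@dataclass
class Cut:
    id: str
    start: float      # meaningful only for a single cut
    duration: float

@dataclass
class MixedCut:
    id: str
    duration: float   # no 'start' or 'channel' here

def total_duration(cuts: List[CutLike]) -> float:
    # Operates on a heterogeneous CutSet through the shared subset only.
    return sum(c.duration for c in cuts)

print(total_duration([Cut("a", 0.0, 2.5), MixedCut("b", 3.0)]))  # 5.5
```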
@danpovey I documented most parts of the code and responded to all the comments here so far.
@danpovey Any chance you could have a look?
    and features supplied by FEATURE_MANIFEST. It first creates a trivial CutSet, splits it into two equal, randomized
    parts and overlays their features to create a mix.
    The parameters of the mix are controlled via SNR_RANGE and OFFSET_RANGE.
    """
So will the mix's duration always cover both entire original recordings? Are the random pairs selected without regard to length?
For some separation tasks, ragged edges may make it artificially easier.
These are good questions. When building the Cut/CutSet I wanted to have a starting point that performs the mix in *some* way to test it out, so I figured let's start with random.
You are correct that in this mode the mix duration always covers both recordings, as the underlying mix/overlay code pads both cuts with low-energy frames so that they can be added together. We could add another parameter like `max_length` (or `max_length_range`) to truncate (and that would likely create ragged edges).
The random pairs are indeed selected without regard to length. I didn't want to make any assumptions at this point; I'm happy to modify it whichever way you, @mpariente or @popcornell suggest. We can always add more parameters or modes to create mixes in different ways.
In predefined separation datasets (wsj0-2mix, LibriMix), the sources are summed without offset and mostly without considering the respective lengths of the sources.
The mixture can be used in the `min` mode, where the mixture is truncated to the shortest source, or the `max` mode, where the shortest source is padded with zeros.
For training separation, we usually use the `min` mode because it is more challenging.

> We can always add more parameters or modes to create mixes in different ways.

I agree with that; this is not critical and it looks fine for now.
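The `min`/`max` mixing modes described above can be sketched in a few lines of NumPy on raw waveforms (a hypothetical helper for illustration, not wsj0-2mix/LibriMix code):

```python
import numpy as np

def mix_sources(s1: np.ndarray, s2: np.ndarray, mode: str = "min") -> np.ndarray:
    """'min' truncates the mixture to the shortest source;
    'max' zero-pads the shortest source before summing."""
    if mode == "min":
        n = min(len(s1), len(s2))
        return s1[:n] + s2[:n]
    if mode == "max":
        n = max(len(s1), len(s2))
        pad = lambda s: np.pad(s, (0, n - len(s)))
        return pad(s1) + pad(s2)
    raise ValueError(f"unknown mode: {mode!r}")

a = np.ones(16000)   # 1 s source at 16 kHz
b = np.ones(24000)   # 1.5 s source
assert len(mix_sources(a, b, "min")) == 16000
assert len(mix_sources(a, b, "max")) == 24000
```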
Thanks, that's great feedback. In the next PR that I'm preparing, I will add the option to mix in min/max modes.
OK, no problem, thanks. I don't follow the separation literature so closely, so perhaps those guys could better say how it's normally done.
Let me think about that.. it may be OK. I guess for me there is a slight question of what this duration really is, i.e. does it relate to the features or the underlying wave data. Maybe there should be a distinction? Sorry, it's late, gotta go.
@danpovey If you're OK with that, I will open a separate issue to address the consistency of duration vs num_frames * frame_shift, so we can proceed with merging this one (unless there are other issues).
OK for me.
The fun part begins.
I've started thinking about the design of the Cuts part and I've written down some thoughts and sketches. I wouldn't say it's there yet; actually, I might want to sketch the Dataset too, to better understand how we'll use the CutSet first. Anyway, I'm pushing it out so that you can comment at this early stage if you like.