Initial draft of how the CutSet might look like #16

Merged · 18 commits merged into master from feature/cut · Jun 14, 2020

Conversation

pzelasko (Collaborator):

The fun part begins.

I've started thinking about the design of the Cuts part and written down some thoughts and sketches. I wouldn't say it's there yet; actually, I might want to sketch the Dataset too, to better understand how we'll use the CutSet first. Anyway, I'm pushing it out so that you can comment at this early stage if you like.

lhotse/cut.py Outdated

supervisions: List[SupervisionSegment]

# The features can span longer than the actual cut.
danpovey (Collaborator) commented May 23, 2020:

Maybe we can note here that the Features object "knows" its start and end time within
the underlying recording, and that we expect the interval [begin,begin+duration]
to be a subset of the interval represented in features.
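
(To make the invariant concrete, a tiny hedged sketch; the helper name and parameter names are made up for illustration:)

# Hypothetical check of the invariant described above: the cut's interval
# [start, start + duration] must lie within the interval covered by the features.
def check_cut_within_features(cut_start: float, cut_duration: float,
                              feat_start: float, feat_duration: float) -> bool:
    return (feat_start <= cut_start
            and cut_start + cut_duration <= feat_start + feat_duration)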

raise NotImplementedError()

def overlay(self, other: 'Cut', offset: Seconds = 0.0) -> 'Cut':
"""
Collaborator:

This might require some kind of way to specify the SNR, in case the other one is noise?

Collaborator Author (pzelasko):

Done
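
(For illustration, a sketch of what the updated signature might look like; the exact parameter name and the Decibels alias are assumptions, not necessarily what the PR settled on:)

def overlay(self, other: 'Cut', offset: Seconds = 0.0, snr: Optional[Decibels] = None) -> 'Cut':
    """Mix `other` onto this cut starting at `offset`; when `snr` is given,
    rescale `other` so that the mix attains that signal-to-noise ratio."""
    raise NotImplementedError()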


@dataclass
class Cut:
id: str
Collaborator:

Let's document it to clarify what it is. I'm thinking something like:
""" A Cut is a single segment that we'll train on. It contains the features corresponding to
a piece of a recording, with zero or more SupervisionSegments.
"""

Collaborator Author (pzelasko):

Done
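
(Putting the snippet and the suggested docstring together, the documented class might read roughly like this; fields beyond id are elided:)

@dataclass
class Cut:
    """A Cut is a single segment that we'll train on. It contains the features
    corresponding to a piece of a recording, with zero or more SupervisionSegments."""
    id: str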

pzelasko (Collaborator Author):

I pushed some changes focusing on the source separation use-case, as Manuel was interested in testing Lhotse with Asteroid. It's not ready yet, but I'd appreciate any comments at this point.

@@ -122,20 +122,56 @@ class Features:
channel_id: int
start: Seconds
duration: Seconds

# The Features object must know its frame length and shift to trim the features matrix when loading.
Collaborator:

Incidentally: I hope to use the 'snip_edges: false' option for Kaldi-like feature extraction.. this makes it so that the num-frames does not depend on the frame-shift, which is a nice simplification. We could possibly make this requirement.

danpovey (Collaborator) commented May 29, 2020:

... in any case I think we need to be clear about what exactly 'start' and 'duration' mean.
E.g. is it the 1st and last+1 sample which is covered by the features? Is it the center of the 1st frame until the center of the last-plus-one frame? Perhaps it could start at n*frame_shift. [Note: the center of each frame would be (n + 0.5) * frame_shift].

Also regarding snip_edges: for consistency when operating inside the file, we could possibly do it differently for internal segments? Maybe make snip_edges false but whenever we actually extract features for a segment, we pad with zeros or with the appropriate audio context in calling code?

Collaborator Author (pzelasko):

> Incidentally: I hope to use the 'snip_edges: false' option for Kaldi-like feature extraction.. this makes it so that the num-frames does not depend on the frame-shift, which is a nice simplification. We could possibly make this requirement.

I think you meant frame-length, right? I had to double-check with the Kaldi docs. Could you elaborate on how it makes things easier?

> ... in any case I think we need to be clear about what exactly 'start' and 'duration' mean.

As I understand it currently, the 1st sample will be at start. It's less clear with duration: with snip_edges = False (the default), the last sample is either at the end or after it; with snip_edges = True, it is either at the end or before it. You lost me at the n*frame_shift idea.

> Also regarding snip_edges: for consistency when operating inside the file, we could possibly do it differently for internal segments? Maybe make snip_edges false but whenever we actually extract features for a segment, we pad with zeros or with the appropriate audio context in calling code?

Yeah, that's a good remark. When I add the option to extract features for specified segments, I'll take care to either pass extra context samples or pad with zeros.

Collaborator:

Frame-length, yes.
I think by n*frame_shift I may have meant that duration == n * frame_shift, where n == number of frames.

It would probably make sense to have duration always be n * frame_shift, with the understanding that this is approximate and the features may actually use slightly more context than that?

I'd rather make it so that downstream, these cuts behave quite simply w.r.t. the relationship of duration and num-frames, and the user doesn't have to worry about the frame-length and how it may impact the num-frames. That was a rabbit-hole in Kaldi itself.

Collaborator Author (pzelasko):

OK, so you're asking that, based on the SupervisionSegments and the Features, we create the Cuts so that their duration is a multiple of frame_shift. I'm thinking we can easily do that for datasets where the recording is a stream (e.g. a 15-minute phone conversation), as we have a lot of control there. For datasets where the recording is already a segment (like LibriMix and probably many others), it basically means removing a very small part of the end of the recording. Are we okay with the trade-off that we slightly modify the segment boundaries in Cut to have duration/num_frames consistency?
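
(A minimal sketch of the snapping being discussed, assuming a typical 10 ms frame shift; purely illustrative:)

# Snap a segment's duration down to a whole number of frames.
frame_shift = 0.01                             # seconds
raw_duration = 3.774                           # e.g. taken from a SupervisionSegment
num_frames = int(raw_duration / frame_shift)   # 377
duration = num_frames * frame_shift            # 3.77 -- drops 4 ms at the end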

Comment on lines +23 to +27
mixture = torch.from_numpy(mixture_cut.load_features(root_dir=self.root_dir))
sources = torch.stack(
[torch.from_numpy(source_cut.load_features(root_dir=self.root_dir)) for source_cut in source_cuts],
dim=0
)


What features would you expect here?
The sum of the sources' features will probably not sum to the mixture's, right?
If we'd like another type of supervision, for example a mask or something like this, would we compute it on the fly or use the supervisions from the CutSet?

Contributor:

I think maybe it's related to overlay_fbank? You can just mix fbanks without the log, no?

Collaborator Author (pzelasko):

I looked at one of the datasets in Asteroid, LibriMix, and saw that you were returning a list of sources as well as the mix, so I mimicked it for the first take. We can make this (or another Dataset variant) return a mask instead if you'd like. Can you point me to an example of how you obtain a mask as supervision?

Collaborator Author (pzelasko):

I intended to stick to log-mel energies for now, unless it doesn't make sense for source separation?

mpariente commented May 29, 2020:

> I intended to stick to log-mel energies for now, unless it doesn't make sense for source separation?

It's unusual to perform separation on log-mel indeed, but we can try.

In Asteroid, we only output waveforms from the Dataset; features are computed on the fly, as are the training targets.
That would be one place where we compute binary masks and ratio masks.
Granted, this is not the most efficient, as those computations could be done in masked time, but in this case we chose to do it in the model rather than in the dataset.
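
(For readers unfamiliar with these targets, a minimal sketch of binary and ratio masks computed from magnitude spectrograms; the function names and the additivity assumption are mine, not Asteroid's actual API:)

import torch

def ratio_mask(source_mag: torch.Tensor, mixture_mag: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Per time-frequency bin: source energy as a fraction of the mixture's, clamped to [0, 1].
    return (source_mag / (mixture_mag + eps)).clamp(0.0, 1.0)

def binary_mask(source_mag: torch.Tensor, mixture_mag: torch.Tensor) -> torch.Tensor:
    # 1 where the source dominates the rest of the mixture (assuming magnitudes roughly add), else 0.
    return (source_mag > mixture_mag - source_mag).float()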

Collaborator Author (pzelasko):

We can add masks and support spectrograms in a future PR.

pzelasko (Collaborator Author) commented Jun 4, 2020:

I have a prototype implementation for CutSet/Cut/MixedCut working, as well as two simple CLI modes (make-trivial-cuts, make-overlayed-cuts) to test it on mini_librispeech as I'm developing it. The next steps will be making a pytorch Dataset for speech separation from the cuts based on the draft I already checked in, and then probably making a recipe for some dataset that @mpariente or @popcornell recommend.

Also, just to verify that the method I used to overlay the log-mel energies makes sense, I've looked at some of the resulting spectrograms (you can also see what the "low-level" cut manipulation code usage looks like). Note that the actual length of the feature matrix changes depending on the offsets used. See attached:
[screenshot: cut manipulation code and the resulting overlaid spectrograms]
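
(The overlaying of log-mel energies presumably has to happen in the linear domain; a hedged sketch of one way to do it, not necessarily the exact code in this PR:)

import numpy as np

def overlay_log_mels(feats_a: np.ndarray, feats_b: np.ndarray) -> np.ndarray:
    # Undo the log, add the energies, and take the log again.
    # Assumes both matrices are log-mel energies of identical shape.
    return np.log(np.exp(feats_a) + np.exp(feats_b))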

danpovey (Collaborator) commented Jun 4, 2020 via email

mpariente commented Jun 4, 2020:

I didn't read all the code, but this looks very nice!

> The next steps will be making a pytorch Dataset for speech separation from the cuts based on the draft I already checked in, and then probably making a recipe for some dataset that @mpariente or @popcornell recommend.

I'd recommend using LibriMix (scripts available on GitHub); you can also find a mini version of it on Zenodo.

popcornell (Contributor):

That is great!! Can't wait to see this integrated in Asteroid; it looks very promising. I second LibriMix; let's see how a recipe turns out.
In fact, I think you don't even need the overlay method for that one, as the mixture .wav files are already pre-computed. It is, however, super useful for on-the-fly data augmentation.
Also, about log-mels: it makes sense to go with them for now. It is unusual for source separation but could be more convenient here, especially if the downstream task is ASR. In fact, it could be interesting to see how easy it would be to set up a pipeline that performs separation and produces another CutSet suitable for ASR.

pzelasko (Collaborator Author) commented Jun 9, 2020:

I'm super happy: with today's commits, we have the first "end-to-end" data processing pipeline working. I was able to download MiniLibriMix using Asteroid, then create manifests and features using a few CLI tools, and finally create a SourceSeparationDataset from that. The current usage looks like this:

[screenshot: example SourceSeparationDataset usage]

The CLI commands I used to create the data:

# convert CSV to audio and supervision manifests
lhotse recipe librimix ../asteroid/wordkir/MiniLibriMix/metadata/mixture_train_mix_clean.csv librimix

# extract features
lhotse make-feats librimix/audio.yml librimix -j 8 --root-dir /Users/pzelasko/jhu/asteroid/wordkir 

# create mono mix "feature overlay" style
lhotse cut stereo-overlayed librimix/supervisions.yml librimix/feature_manifest.yml librimix

It is still a prototype that's likely to undergo some significant changes, but it's nice to have something working. I think I might merge this (after looking through the comments again to make sure they've been resolved) and then extend it a bit. In particular, as @popcornell mentioned, I could create an alternative LibriMix dataset version that uses the signals mixed in the time domain; it will be good to test both the flexibility of the library and the correctness of the feature overlaying.

Feel free to review!

pzelasko (Collaborator Author) commented Jun 9, 2020:

Oh yeah, and I think in the next PRs we can make a variant of this dataset that computes the mask for separation; I just don't want to make this PR larger than it already is.

until: Optional[Seconds] = None,
keep_excessive_supervisions: bool = True
) -> 'Cut':
new_start = self.start + offset
Collaborator:

I'm not sure how I'd find documentation?

Collaborator:

.. and I don't understand the '*'

Collaborator Author (pzelasko):

Hmm, yeah, in some places I might've forgotten about the docs. Let me revise the PR before you go further; I'll add the relevant docstrings.

The '*' was introduced in Python 3 and makes all arguments after it keyword-only; i.e. f(10) won't work, you have to use f(until=10). The reason I wanted to use it is that there are multiple arguments of the same type, and it's easy to mix them up accidentally. As you can see, I generally have a strong preference for using keyword args to avoid mix-ups (especially when refactoring) and to increase readability.
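
(A minimal illustration of the keyword-only marker, using the f/until example from above:)

def f(*, until: int = 10):
    ...

f(10)        # TypeError: f() takes 0 positional arguments but 1 was given
f(until=10)  # OK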

@@ -0,0 +1,57 @@
- id: mixed-cut-id
Collaborator Author (pzelasko):

Note that the CutSet is currently a heterogeneous data structure, as Cut and MixedCut don't have 100% identical interfaces. I'm not sure at the moment whether we should have a separate MixedCutSet to avoid this or not.

Collaborator Author (pzelasko):

The Cut and MixedCut semantics are actually a bit different: they do share the ability to load features, provide supervisions, and the duration attribute, but for MixedCut it doesn't make sense to specify e.g. the start or the channel, as it does for Cut.
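
(One hedged way to capture that shared surface is a typing Protocol; this is an illustration of the overlap described above, not the PR's actual design. Note that Protocol landed in the standard library in Python 3.8, via typing_extensions before that.)

from typing import List, Protocol
import numpy as np

class CutLike(Protocol):
    # What Cut and MixedCut have in common, per the comment above.
    duration: Seconds
    supervisions: List['SupervisionSegment']

    def load_features(self, root_dir: str) -> np.ndarray: ...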

pzelasko (Collaborator Author) commented Jun 9, 2020:

@danpovey I documented most parts of the code and responded to all the comments here so far.

pzelasko (Collaborator Author):

@danpovey any chance you could have a look?

and features supplied by FEATURE_MANIFEST. It first creates a trivial CutSet, splits it into two equal, randomized
parts and overlays their features to create a mix.
The parameters of the mix are controlled via SNR_RANGE and OFFSET_RANGE.
"""
Collaborator:

So will the mix's duration always cover both entire original recordings? Are the random pairs selected without regard to length?
For some separation tasks, ragged edges may make it artificially easier.

Collaborator Author (pzelasko):

These are good questions. When building the Cut/CutSet, I wanted a starting point that performs the mix in some way so I could test it out, so I figured let's start with random.

You are correct that in this mode the mix duration always covers both recordings, as the underlying mix/overlay code pads both cuts with low-energy frames so that they can be added together. We could add another parameter like max_length (or max_length_range) to truncate (and that would likely create ragged edges).

The random pairs are indeed selected without regard to length. I didn't want to make any assumptions at this point; I'm happy to modify it whichever way you, @mpariente or @popcornell suggest. We can always add more parameters or modes to create mixes in different ways.


In predefined separation datasets (wsj0-2mix, LibriMix), the sources are summed without offset and mostly without considering the respective lengths of the sources.
The mixture can be used in the min mode, where the mixture is truncated to the shortest source, or the max mode, where the shortest is padded with zeros.
For training separation, we usually use the min mode because it is more challenging.

> We can always add more parameters or modes to create mixes in different ways.

I agree with that; this is not critical and it looks fine for now.

Collaborator Author (pzelasko):

Thanks, that's great feedback. In the next PR that I'm preparing, I will add the option to mix in min/max modes.
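
(A minimal sketch of the min/max modes described above, operating on 1-D per-source waveform arrays; the function and parameter names are assumptions:)

from typing import List
import numpy as np

def mix_sources(sources: List[np.ndarray], mode: str = 'min') -> np.ndarray:
    # 'min': truncate every source to the shortest one; 'max': zero-pad to the longest.
    lengths = [s.shape[0] for s in sources]
    if mode == 'min':
        n = min(lengths)
        sources = [s[:n] for s in sources]
    else:
        n = max(lengths)
        sources = [np.pad(s, (0, n - s.shape[0])) for s in sources]
    return np.sum(sources, axis=0)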

danpovey (Collaborator) commented Jun 12, 2020 via email

danpovey (Collaborator) commented Jun 12, 2020 via email

pzelasko (Collaborator Author):

> Let me think about that.. may be OK. I guess for me there is a slight question of what this duration really is, i.e. does it relate to the features or the underlying wave data. Maybe there should be a distinction? Sorry, it's late, gotta go.

@danpovey If you're OK with that, I will open a separate issue to address the consistency of duration vs num_frames * frame_shift, so we can proceed with merging this one (unless there are other issues).

danpovey (Collaborator) left a review:

OK for me.

pzelasko merged commit aadec3c into master on Jun 14, 2020
pzelasko deleted the feature/cut branch on Jul 29, 2020