Initial draft of how the CutSet might look like #16
Conversation
lhotse/cut.py (outdated diff)

    supervisions: List[SupervisionSegment]

    # The features can span longer than the actual cut.
Maybe we can note here that the Features object "knows" its start and end time within the underlying recording, and that we expect the interval [begin, begin + duration] to be a subset of the interval represented in the features.
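To make the "Features knows its own interval" idea concrete, here is a minimal sketch. The field names mirror the PR, but the `end` property and `covers` helper are hypothetical additions for illustration, not the actual Lhotse API:

```python
from dataclasses import dataclass

Seconds = float

@dataclass
class Features:
    # Simplified: the real Features class carries more fields.
    start: Seconds     # where the features begin within the recording
    duration: Seconds  # how much of the recording they cover

    @property
    def end(self) -> Seconds:
        return self.start + self.duration

    def covers(self, begin: Seconds, duration: Seconds) -> bool:
        """True iff [begin, begin + duration] is a subset of [start, end]."""
        return self.start <= begin and begin + duration <= self.end

feats = Features(start=5.0, duration=10.0)
assert feats.covers(begin=6.0, duration=3.0)       # fully inside
assert not feats.covers(begin=12.0, duration=5.0)  # extends past the end
```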
        raise NotImplementedError()

    def overlay(self, other: 'Cut', offset: Seconds = 0.0) -> 'Cut':
        """
This might require some way to specify the SNR, in case the other cut is noise?
Done
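To illustrate the SNR idea discussed above, here is a rough NumPy sketch of scaling a noise signal to a target SNR before adding it to a speech signal. The helper name and shapes are assumptions for illustration, not the actual Lhotse implementation:

```python
import numpy as np

def mix_with_snr(signal: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mix has the requested signal-to-noise ratio, then add."""
    signal_energy = np.mean(signal ** 2)
    noise_energy = np.mean(noise ** 2)
    # SNR (dB) = 10 * log10(signal_energy / (gain^2 * noise_energy))
    gain = np.sqrt(signal_energy / (noise_energy * 10 ** (snr_db / 10)))
    return signal + gain * noise

rng = np.random.default_rng(0)
speech = rng.normal(size=16000)
noise = rng.normal(size=16000)
mixed = mix_with_snr(speech, noise, snr_db=10.0)
assert mixed.shape == speech.shape
```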
    @dataclass
    class Cut:
        id: str
Let's document it to clarify what it is. I'm thinking something like:
""" A Cut is a single segment that we'll train on. It contains the features corresponding to
a piece of a recording, with zero or more SupervisionSegments.
"""
Done
I pushed some changes focusing on the source separation use-case, as Manuel was interested in testing Lhotse with Asteroid. It's not ready yet, but I'd appreciate any comments at this point.
@@ -122,20 +122,56 @@ class Features:

    channel_id: int
    start: Seconds
    duration: Seconds

    # The Features object must know its frame length and shift to trim the features matrix when loading.
Incidentally: I hope to use the 'snip_edges: false' option for Kaldi-like feature extraction; this makes it so that the num-frames does not depend on the frame-shift, which is a nice simplification. We could possibly make this a requirement.
... in any case I think we need to be clear about what exactly 'start' and 'duration' mean. E.g. is it the 1st and last+1 sample which is covered by the features? Is it the center of the 1st frame until the center of the last-plus-one frame? Perhaps it could start at n*frame_shift. [Note: the center of each frame would be n*frame_shift + 0.5.]
Also regarding snip_edges: for consistency when operating inside the file, we could possibly do it differently for internal segments? Maybe make snip_edges false, but whenever we actually extract features for a segment, we pad with zeros or with the appropriate audio context in calling code?
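For reference, the frame-count behavior under discussion can be sketched as follows. This is my paraphrase of Kaldi's NumFrames logic (worth double-checking against Kaldi's feature-window code): with snip_edges true the count depends on both frame length and shift, while with snip_edges false it depends on the shift only.

```python
def num_frames(num_samples: int, frame_length: int, frame_shift: int,
               snip_edges: bool) -> int:
    """Paraphrase of Kaldi's frame-count rules; treat as a sketch."""
    if snip_edges:
        # Only frames that fit entirely inside the signal are kept.
        if num_samples < frame_length:
            return 0
        return 1 + (num_samples - frame_length) // frame_shift
    # snip_edges == False: frames are centered on n * frame_shift,
    # so the count is independent of frame_length.
    return (num_samples + frame_shift // 2) // frame_shift

# 1 second at 16 kHz, 25 ms frames, 10 ms shift:
print(num_frames(16000, 400, 160, snip_edges=True))   # 98
print(num_frames(16000, 400, 160, snip_edges=False))  # 100
```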
> Incidentally: I hope to use the 'snip_edges: false' option for Kaldi-like feature extraction.. this makes it so that the num-frames does not depend on the frame-shift, which is a nice simplification. We could possibly make this requirement.

I think you meant frame-length, right? I had to double-check with the Kaldi docs. Could you elaborate on how it makes things easier?

> ... in any case I think we need to be clear about what exactly 'start' and 'duration' mean.

As I understand it currently, the 1st sample will start at `start`. It's less clear with `duration`: with `snip_edges = False` (the default), the last sample is either at the end or after it; with `snip_edges = True`, it is either at the end or before it. You lost me at the `n*frame_shift` idea.

> Also regarding snip_edges: for consistency when operating inside the file, we could possibly do it differently for internal segments? Maybe make snip_edges false but whenever we actually extract features for a segment, we pad with zeros or with the appropriate audio context in calling code?

Yeah, that's a good remark. When I add the option to extract features for specified segments, I'll take care to either pass extra context samples or pad with zeros.
frame-length, yes.
I think by `n*frame_shift`, I may have meant that `duration == n * frame_shift`, where n == the number of frames.
It would probably make sense to have `duration` always be `n * frame_shift`, but understand that this is kind of approximate, and actually it may use slightly more context than that.
I'd rather make it so that downstream, these cuts behave quite simply w.r.t. the relationship of duration and num-frames, and the user doesn't have to worry about the frame-length and how it may impact the num-frames. That was a rabbit-hole in Kaldi itself.
OK, so you're asking that based on the `SupervisionSegment`s and the `Features`, we create the `Cut`s so that their duration is a multiple of `frame_shift`. I'm thinking we can easily do that for datasets where the recording is a stream (e.g. a 15-minute phone conversation), as we have a lot of control there. For datasets where the recording is a segment already (like LibriMix and probably many others), it basically means removing a very small part of the end of the recording. Are we okay with the trade-off that we slightly modify the segment boundaries in `Cut` to have duration-num_frames consistency?
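The trade-off described here, trimming segment boundaries so that `duration == num_frames * frame_shift` holds exactly, could look like the following sketch (hypothetical helper, not Lhotse code):

```python
def snap_duration(duration: float, frame_shift: float = 0.01) -> float:
    """Trim a segment's duration *down* to a whole number of frames
    (with a small tolerance for floating-point error), so that
    duration == num_frames * frame_shift holds exactly."""
    num_frames = int(duration / frame_shift + 1e-8)
    return round(num_frames * frame_shift, 10)

print(snap_duration(3.456))  # 3.45 -> 345 frames at a 10 ms shift
print(snap_duration(2.0))    # 2.0 -> already a multiple of the shift
```

The tolerance matters: a naive `int(2.0 / 0.01)` can yield 199 due to floating-point rounding, silently dropping a frame.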
    mixture = torch.from_numpy(mixture_cut.load_features(root_dir=self.root_dir))
    sources = torch.stack(
        [torch.from_numpy(source_cut.load_features(root_dir=self.root_dir)) for source_cut in source_cuts],
        dim=0
    )
What features would you expect here? The sum of the sources' features will probably not sum to the mixture's features, right?
If we'd like another type of supervision, for example a mask or something like that, would we compute it on the fly or use the `supervisions` from the `CutSet`?
I think maybe it is related to `overlay_fbank`? You can just mix fbanks with no log, no?
I looked at one of the datasets in Asteroid, LibriMix, and saw that you were returning a list of sources as well as the mix, so I mimicked it for the first take. We can make this (or another Dataset variant) return a mask instead if you'd like. Can you point me to an example of how you obtain a mask as supervision?
I intended to stick to log-mel energies for now, unless it doesn't make sense for source separation?
> I intended to stick to log-mel energies for now, unless it doesn't make sense for source separation?

It's unusual to perform separation on log-mel indeed, but we can try.
In Asteroid, we only feed the waveforms at the output of the Dataset; features are computed on the fly, as well as training targets. That would be one place where we compute binary masks and ratio masks.
Granted, this is not the most efficient, as those computations could be done ahead of time, but in this case we chose to do it in the model rather than in the dataset.
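The binary and ratio masks mentioned here are usually the ideal binary mask (IBM) and ideal ratio mask (IRM). A rough NumPy sketch (hypothetical helper, not Asteroid's API), assuming magnitude spectrograms stacked as (num_sources, freq, time):

```python
import numpy as np

def ideal_masks(source_mags: np.ndarray):
    """Given magnitude spectrograms of shape (num_sources, freq, time),
    return the ideal binary masks (IBM) and ideal ratio masks (IRM)."""
    eps = 1e-12
    total = source_mags.sum(axis=0, keepdims=True) + eps
    irm = source_mags / total  # soft mask in [0, 1]; sums to ~1 over sources
    # IBM: 1 where a source dominates the time-frequency bin, else 0.
    ibm = (source_mags == source_mags.max(axis=0, keepdims=True)).astype(np.float32)
    return ibm, irm

mags = np.abs(np.random.default_rng(0).normal(size=(2, 257, 100)))
ibm, irm = ideal_masks(mags)
assert np.allclose(irm.sum(axis=0), 1.0, atol=1e-4)
```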
We can add masks and support spectrograms in a future PR
I have a prototype implementation for CutSet/Cut/MixedCut working, as well as two simple CLI modes (make-trivial-cuts, make-overlayed-cuts) to test it on mini_librispeech as I'm developing it. The next steps will be making a pytorch Dataset for speech separation from the cuts based on the draft I already checked in, and then probably making a recipe for some dataset that @mpariente or @popcornell recommend.
Also, just to verify that the method I used to overlay the log-mel energies makes sense, I've looked at some of the resulting spectrograms (you can also see what the "low-level" cut manipulation code usage looks like). Note that the actual length of the feature matrix changes depending on the offsets used. See attached:
[image] https://user-images.githubusercontent.com/15930688/83714474-e1622700-a5f8-11ea-809a-b41a0635cb9b.png
fantastic progress!!
I didn't read all the code, but this looks very nice!
I'd recommend using LibriMix (scripts available on GitHub), for which you can also find a mini version on Zenodo.
That is great!! Can't wait to see this integrated in Asteroid; it looks very promising. I second LibriMix, and let's see how a recipe will turn out.
I'm super happy: with today's commits we have a first "end-to-end" data processing pipeline working. I was able to download MiniLibriMix and create the data using the CLI.
It is still a prototype that's likely going to undergo some significant changes, but it's nice to have something working. I think I might merge this (after looking again through the comments to make sure they've been resolved) and then extend it a bit. In particular, as @popcornell mentioned, I could create an alternative LibriMix dataset version that uses the signals mixed in the time domain; it will be good to test both the flexibility of the library and the correctness of the feature overlaying. Feel free to review!
Oh yeah, and I think in the next PRs we can make a variant of this dataset that computes the mask for separation; I just don't want to make this PR larger than it already is.
        until: Optional[Seconds] = None,
        keep_excessive_supervisions: bool = True
    ) -> 'Cut':
        new_start = self.start + offset
I'm not sure how I'd find documentation?
... and I don't understand the '*'.
Hmm, yeah, in some places I might've forgotten about the docs. Let me revise the PR before you go further; I'll add the relevant docstrings.
The '*' was introduced in Python 3 and means only keyword arguments are allowed after it; i.e. f(10) won't work, you have to use f(until=10). The reason I wanted to use it is that there are multiple arguments of the same type and it's easy to mix them up accidentally. As you can see, I generally have a strong preference for using keyword args to avoid mix-ups (especially when refactoring) and increase readability.
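A small sketch of the '*' (keyword-only arguments) behavior described above; `truncate`, `offset`, and `until` are illustrative names here, not the exact PR signature:

```python
from typing import Optional

Seconds = float

def truncate(*, offset: Seconds = 0.0, until: Optional[Seconds] = None) -> str:
    # Everything after '*' must be passed by keyword.
    return f"offset={offset}, until={until}"

print(truncate(until=10.0))  # offset=0.0, until=10.0

try:
    truncate(10.0)  # positional call is rejected
except TypeError:
    print("positional call raised TypeError, as intended")
```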
@@ -0,0 +1,57 @@

    - id: mixed-cut-id
Note that the `CutSet` is currently a heterogeneous data structure, as `Cut` and `MixedCut` don't have 100% identical interfaces. I'm not sure at the moment whether we should have a separate `MixedCutSet` to avoid this or not.
The `Cut` and `MixedCut` semantics are actually a bit different: they do share the ability to load features, provide supervision, and the duration attribute, but for `MixedCut` it doesn't make sense to specify e.g. the start or the channel, like for `Cut`.
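One way to express the shared subset of the two interfaces, so a heterogeneous `CutSet` can still be typed, is a structural protocol. This is a hypothetical sketch using `typing.Protocol`, not what the PR actually does:

```python
from dataclasses import dataclass
from typing import List, Protocol

class CutLike(Protocol):
    """The subset of the interface shared by Cut and MixedCut,
    per the discussion above (hypothetical sketch)."""
    id: str
    duration: float

@dataclass
class Cut:
    id: str
    start: float      # meaningful only for a single cut
    duration: float

@dataclass
class MixedCut:
    id: str
    duration: float   # no 'start' or 'channel' here

def total_duration(cuts: List[CutLike]) -> float:
    # Operates on a heterogeneous CutSet through the shared subset only.
    return sum(c.duration for c in cuts)

print(total_duration([Cut("a", 0.0, 2.5), MixedCut("b", 3.0)]))  # 5.5
```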
@danpovey I documented most parts of the code and responded to all the comments here so far.
@danpovey Any chance you could have a look?
    and features supplied by FEATURE_MANIFEST. It first creates a trivial CutSet, splits it into two equal, randomized
    parts and overlays their features to create a mix.
    The parameters of the mix are controlled via SNR_RANGE and OFFSET_RANGE.
    """
So will the mix's duration always cover both entire original recordings? Are the random pairs selected without regard to length?
For some separation tasks, ragged edges may make it artificially easier.
These are good questions. When building the Cut/CutSet I wanted to have a starting point that performs the mix in *some* way to test it out, so I figured let's start with random.
You are correct that in this mode the mix duration always covers both recordings, as the underlying mix/overlay code pads both cuts with low-energy frames so that they can be added together. We could add another parameter like `max_length` (or `max_length_range`) to truncate (and that would likely create ragged edges).
The random pairs are indeed selected without regard to length. I didn't want to make any assumptions at this point; I'm happy to modify it whichever way you, @mpariente or @popcornell suggest. We can always add more parameters or modes to create mixes in different ways.
In predefined separation datasets (wsj0-2mix, LibriMix), the sources are summed without offset and mostly without considering the respective lengths of the sources.
The mixture can be used in the `min` mode, where the mixture is truncated to the shortest source, or the `max` mode, where the shortest source is padded with zeros.
For training separation, we usually use the `min` mode because it is more challenging.

> We can always add more parameters or modes to create mixes in different ways.

I agree with that; this is not critical and it looks fine for now.
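The `min`/`max` mixing modes described above can be sketched in a few lines of NumPy on raw waveforms (a hypothetical helper for illustration, not wsj0-2mix/LibriMix code):

```python
import numpy as np

def mix_sources(s1: np.ndarray, s2: np.ndarray, mode: str = "min") -> np.ndarray:
    """'min' truncates the mixture to the shortest source;
    'max' zero-pads the shortest source before summing."""
    if mode == "min":
        n = min(len(s1), len(s2))
        return s1[:n] + s2[:n]
    if mode == "max":
        n = max(len(s1), len(s2))
        pad = lambda s: np.pad(s, (0, n - len(s)))
        return pad(s1) + pad(s2)
    raise ValueError(f"unknown mode: {mode!r}")

a = np.ones(16000)   # 1 s source at 16 kHz
b = np.ones(24000)   # 1.5 s source
assert len(mix_sources(a, b, "min")) == 16000
assert len(mix_sources(a, b, "max")) == 24000
```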
Thanks, that's great feedback. In the next PR that I'm preparing, I will add the option to mix in min/max modes.
OK, no problem, thanks. I don't follow the separation literature so closely, so perhaps those guys could better say how it's normally done.
Let me think about that.. it may be OK. I guess for me there is a slight question of what this duration really is, i.e. does it relate to the features or the underlying wave data. Maybe there should be a distinction? Sorry, it's late, gotta go.
@danpovey If you're OK with that, I will open a separate issue to address the consistency of duration vs num_frames * frame_shift, so we can proceed with merging this one (unless there are other issues).
OK for me.
The fun part begins.
I've started thinking about the design of the Cuts part and I've written down some thoughts and sketches. I wouldn't say it's there yet; actually, I might want to sketch the Dataset too, to better understand how we'll use the CutSet first. Anyway, I'm pushing it out so that you can comment at this early stage if you like.