VadDataset example #726

Open
entn-at opened this issue May 29, 2022 · 8 comments
@entn-at
Contributor

entn-at commented May 29, 2022

Hi,
I saw the VadDataset class, which is mentioned in the README and elsewhere. Do you know of an example setup/recipe (perhaps in other repos?) that uses it to train a VAD/segmentation model?
Thanks!

@pzelasko
Collaborator

CC @desh2608, you might have some recipes using those.

I wonder if I should replace VadDataset with some other one as the flagship example. At the time, we had no ASR recipes and VAD seemed both pretty well defined and conceptually simple to showcase.

@armusc
Contributor

armusc commented Apr 25, 2023

Hi,
I'm bumping this old thread to ask if there is an obvious way in lhotse to apply a VAD mask to the features of a cut (in the sense of removing the unvoiced frames according to a per-frame 0/1 VAD mask).
Basically, this would be like Kaldi's select-voiced-frames, which applies a "vad.scp" to a "feats.scp" so that the filtered features can then be used (possibly as input to an i/x-vector nnet).

Thank you

@pzelasko
Collaborator

Would something like this work, assuming you have a VAD model in Python? Or are you looking for something different?

features = cut.load_features()
mask = compute_vad_mask(vad_model, features)
features = features[mask]

@pzelasko
Collaborator

In terms of which VAD to apply, you can use e.g. SileroVAD: https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies#examples

Actually, a workflow/integration with Lhotse would be nice if somebody is willing to contribute it.
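
For example, here is a rough sketch of building the compute_vad_mask helper from the earlier snippet on top of SileroVAD. Note that it works from the cut's audio rather than its features, since SileroVAD consumes waveforms; the 16 kHz rate and the frame-shift bookkeeping are assumptions you'd adapt to your feature config:

import numpy as np
import torch

# SileroVAD via torch.hub; utils[0] is get_speech_timestamps.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

def compute_vad_mask(cut, sampling_rate: int = 16000) -> np.ndarray:
    """Boolean mask with one entry per feature frame of `cut` (sketch)."""
    wav = torch.from_numpy(cut.load_audio()[0])  # mono waveform
    speech = get_speech_timestamps(wav, model, sampling_rate=sampling_rate)
    mask = np.zeros(cut.num_frames, dtype=bool)
    for seg in speech:  # 'start'/'end' are given in samples
        start = int(seg["start"] / sampling_rate / cut.frame_shift)
        end = int(np.ceil(seg["end"] / sampling_rate / cut.frame_shift))
        mask[start : min(end, cut.num_frames)] = True
    return mask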

@armusc
Contributor

armusc commented Apr 26, 2023

Would something like this work, assuming you have a VAD model in Python? Or are you looking for something different?

features = cut.load_features()
mask = compute_vad_mask(vad_model, features)
features = features[mask]

Thanks for your answer.
I would like to be able to batch the features of Cuts and apply the VAD mask on the fly, as if it were a transform; the transformed (i.e. masked) features would be the input of a nnet, like a batch formed with a K2SpeechRecognitionDataset.
Now, applying a VAD to a Cut would modify the start and duration of the Cut, like perturb_speed does, but the features cannot simply be extracted from [start, start + duration], since there are holes (unvoiced frames) within that range.
It does not seem to me that an input_transform would be feasible either, because the number of frames would change after batching.
I'm not sure what I wrote above makes sense, though; maybe I don't have a proper understanding of the framework.

I was also thinking: since I would like to have a concatenation of voiced-frame features for a recording, could I just use a Cut with multiple supervisions, each representing the start and duration of one voiced segment? Would this do the job when batches are formed, or do I strictly need one supervision per cut? Keep in mind that I don't have to train anything; I don't need a target, just to extract an embedding from a nnet.

@pzelasko
Collaborator

I think the simplest way to get that is to write your own dataset class like this:

import torch
from lhotse import CutSet
from lhotse.dataset.collation import collate_matrices

class EmbeddingWithVadDataset(torch.utils.data.Dataset):
  def __init__(self, ...):
    self.vad = load_vad()  # placeholder: load your VAD model here
  def __getitem__(self, cuts: CutSet) -> dict:
    batch_feats = []
    for cut in cuts:
      feats = cut.load_features()
      voiced_mask = self.vad(feats)  # boolean mask, one entry per frame
      batch_feats.append(feats[voiced_mask])
    batch_feats = collate_matrices(batch_feats)  # pad to a common length
    return {"features": batch_feats, "cuts": cuts}
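
If it helps, a possible way to drive such a dataset; the sampler choice, max_duration value, and embedding_nnet are placeholders:

from torch.utils.data import DataLoader
from lhotse.dataset import SimpleCutSampler

sampler = SimpleCutSampler(cuts, max_duration=100.0)
dloader = DataLoader(
    EmbeddingWithVadDataset(...),  # fill in your constructor args
    sampler=sampler,
    batch_size=None,  # the sampler already yields whole CutSet mini-batches
)
for batch in dloader:
    embeddings = embedding_nnet(batch["features"])  # hypothetical embedding nnet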

It's also possible to add supervisions to indicate voiced segments, but you'll still need to add some logic that does something like append_cuts(cut.trim_to_supervisions()); you could apply this transform either before creating the sampler (see the sketch below) or inside the dataset class.
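
A minimal sketch of that transform, applied to the whole CutSet before sampling (this assumes the supervisions already mark the voiced regions; the import path for append_cuts may vary by lhotse version):

from lhotse.cut import append_cuts

# Split each cut at its (voiced) supervisions and chain the pieces
# back into one contiguous cut before the sampler sees them.
cuts = cuts.map(lambda cut: append_cuts(cut.trim_to_supervisions()))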

@desh2608
Collaborator

It seems like you have some pre-computed VAD, and you want to apply it on the fly to the input features, possibly in the data-loader. I am assuming you have some kind of speaker ID system and you want to compute embeddings for full utterances (possibly containing silences) without the silence frames. Suppose you have a CutSet where each cut represents one utterance (or recording).

Here are two ways to do it:

Case 1: Frame-level VAD

If you have pre-computed features for the cuts, and frame-level VAD decisions on these features, you can store the VAD decisions as a TemporalArray as follows:

from lhotse import CutSet
from lhotse.features.io import LilcomChunkyWriter

with CutSet.open_writer(manifest_path) as cut_writer, LilcomChunkyWriter(
    storage_path
) as vad_writer:
    for cut in cuts:
        vad_decisions = vad.run(cut)  # an np.ndarray with one entry per frame
        # frame_shift + temporal_dim make this a TemporalArray, not a plain Array
        cut.vad = vad_writer.store_array(
            cut.id, vad_decisions, frame_shift=cut.frame_shift, temporal_dim=0
        )
        cut_writer.write(cut)

Then, in your data-loader, these can be loaded by calling cut.load_vad() (thanks to the magic of custom attributes) and applied to the feats loaded from cut.load_features(), using the indexing that Piotr described.
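
Concretely, inside the dataset that might look like the following (the defensive length trim is an assumption, in case the VAD and feature frame counts are off by one):

import numpy as np

feats = cut.load_features()
mask = cut.load_vad().astype(bool).reshape(-1)  # reads the array stored under cut.vad
n = min(len(mask), feats.shape[0])
feats = feats[:n][mask[:n]]  # keep voiced frames only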

Case 2: Segment-level VAD

It may happen that your VAD generates segments (in the form <start, end>) instead of frame-level decisions. You can create a SupervisionSegment from each such segment (for each recording) and put them in cut.supervisions. Note that each cut can contain multiple supervisions.
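
For instance, a sketch of attaching such segments, where vad_segments is a hypothetical list of (start, end) pairs in seconds, relative to the cut's start:

from lhotse import SupervisionSegment

cut.supervisions = [
    SupervisionSegment(
        id=f"{cut.id}-vad-{i}",
        recording_id=cut.recording_id,
        start=start,  # seconds, relative to the cut start
        duration=end - start,
    )
    for i, (start, end) in enumerate(vad_segments)
]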

Then, in your data-loader, you can do something like the following:

import numpy as np

cut_segments = cut.trim_to_supervisions(keep_overlapping=False)
feats = []
for c in cut_segments:
    feats.append(c.load_features())
feats = np.concatenate(feats, axis=0)  # voiced frames only, in order

Note that this assumes that the segments returned by your VAD model are non-overlapping (it doesn't really make sense to have overlapping VAD segments anyway).

@armusc
Contributor

armusc commented Apr 27, 2023

Great, thanks for the suggestions!
