VadDataset example #726

Open
entn-at opened this issue May 29, 2022 · 8 comments
@entn-at
Contributor

entn-at commented May 29, 2022

Hi,
I saw the VadDataset class, which is mentioned in the README and elsewhere. Do you know of an example setup/recipe (perhaps in other repos?) that uses it to train a VAD/segmentation model?
Thanks!

@pzelasko
Collaborator

CC @desh2608, you might have some recipes using those.

I wonder if I should replace VadDataset with some other one as the flagship example. At the time, we had no ASR recipes and VAD seemed both pretty well defined and conceptually simple to showcase.

@armusc
Contributor

armusc commented Apr 25, 2023

Hi,
I'm bumping this old thread to ask if there is an obvious way in lhotse to apply a VAD mask to the features of a cut (in the sense of removing the unvoiced frames according to a per-frame 0/1 VAD mask).
Basically, this would be like Kaldi's select-voiced-frames, which applies a "vad.scp" to a "feats.scp" so that the filtered features can then be used (possibly as input to an i/x-vector nnet).

Thank you

@pzelasko
Collaborator

Would something like this work, assuming you have a VAD model in Python? Or are you looking for something different?

features = cut.load_features()
mask = compute_vad_mask(vad_model, features)
features = features[mask]

@pzelasko
Collaborator

In terms of which VAD to apply, you can use e.g. SileroVAD: https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies#examples

Actually, a workflow/integration with Lhotse would be nice if somebody is willing to contribute it.
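
For example, here is a rough sketch of building the compute_vad_mask helper from the earlier snippet on top of SileroVAD. Note that it works from the cut's audio rather than its features, since SileroVAD consumes waveforms; the 16 kHz rate and the frame-shift bookkeeping are assumptions you'd adapt to your feature config:

import numpy as np
import torch

# SileroVAD via torch.hub; utils[0] is get_speech_timestamps.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]

def compute_vad_mask(cut, sampling_rate: int = 16000) -> np.ndarray:
    """Boolean mask with one entry per feature frame of `cut` (sketch)."""
    wav = torch.from_numpy(cut.load_audio()[0])  # mono waveform
    speech = get_speech_timestamps(wav, model, sampling_rate=sampling_rate)
    mask = np.zeros(cut.num_frames, dtype=bool)
    for seg in speech:  # 'start'/'end' are given in samples
        start = int(seg["start"] / sampling_rate / cut.frame_shift)
        end = int(np.ceil(seg["end"] / sampling_rate / cut.frame_shift))
        mask[start : min(end, cut.num_frames)] = True
    return mask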

@armusc
Contributor

armusc commented Apr 26, 2023

Would something like this work, assuming you have a VAD model in Python? Or are you looking for something different?

features = cut.load_features()
mask = compute_vad_mask(vad_model, features)
features = features[mask]

Thanks for your answer.
I would like to be able to batch the features of Cuts and apply the VAD mask on the fly, as if it were a transform; the transformed (i.e. masked) features would be the input of a nnet, like a batch formed with a K2SpeechRecognitionDataset.
Now, applying a VAD to a Cut would modify the start and duration of the Cut, like perturb_speed does, but the features cannot simply be extracted from [start, start + duration], since there are holes (unvoiced frames) within that range.
It does not seem to me that an input_transform would be feasible either, because the number of frames would change after batching.
I'm not sure what I wrote above makes sense, though; maybe I don't have a proper understanding of the framework.

I was also thinking: since I would like to have a concatenation of voiced-frame features for a recording, could I just use a Cut with multiple supervisions, each representing the start and duration of one voiced segment? Would this do the job when batches are formed, or do I strictly need one supervision per cut? Keep in mind that I don't have to train anything; I don't need a target, just to extract an embedding from a nnet.

@pzelasko
Collaborator

I think the simplest way to get that is to write your own dataset class like this:

import torch
from lhotse import CutSet
from lhotse.dataset.collation import collate_matrices

class EmbeddingWithVadDataset(torch.utils.data.Dataset):
  def __init__(self, ...):
    self.vad = load_vad()  # placeholder: load your VAD model here
  def __getitem__(self, cuts: CutSet) -> dict:
    batch_feats = []
    for cut in cuts:
      feats = cut.load_features()
      voiced_mask = self.vad(feats)  # boolean mask, one entry per frame
      batch_feats.append(feats[voiced_mask])
    batch_feats = collate_matrices(batch_feats)  # pad to a common length
    return {"features": batch_feats, "cuts": cuts}
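
If it helps, a possible way to drive such a dataset; the sampler choice, max_duration value, and embedding_nnet are placeholders:

from torch.utils.data import DataLoader
from lhotse.dataset import SimpleCutSampler

sampler = SimpleCutSampler(cuts, max_duration=100.0)
dloader = DataLoader(
    EmbeddingWithVadDataset(...),  # fill in your constructor args
    sampler=sampler,
    batch_size=None,  # the sampler already yields whole CutSet mini-batches
)
for batch in dloader:
    embeddings = embedding_nnet(batch["features"])  # hypothetical embedding nnet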

It's also possible to add supervisions to indicate voiced segments, but you'll still need to add some logic that does something like append_cuts(cut.trim_to_supervisions()); you could apply this transform either before creating the sampler (see the sketch below) or inside the dataset class.
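
A minimal sketch of that transform, applied to the whole CutSet before sampling (this assumes the supervisions already mark the voiced regions; the import path for append_cuts may vary by lhotse version):

from lhotse.cut import append_cuts

# Split each cut at its (voiced) supervisions and chain the pieces
# back into one contiguous cut before the sampler sees them.
cuts = cuts.map(lambda cut: append_cuts(cut.trim_to_supervisions()))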

@desh2608
Collaborator

It seems like you have some pre-computed VAD, and you want to apply it on the fly to the input features, possibly in the data-loader. I am assuming you have some kind of speaker ID system and you want to compute embeddings for full utterances (possibly containing silences) without the silence frames. Suppose you have a CutSet where each cut represents one utterance (or recording).

Here are two ways to do it:

Case 1: Frame-level VAD

If you have pre-computed features for the cuts, and frame-level VAD decisions on these features, you can store the VAD decisions as a TemporalArray as follows:

from lhotse import CutSet
from lhotse.features.io import LilcomChunkyWriter

with CutSet.open_writer(manifest_path) as cut_writer, LilcomChunkyWriter(
    storage_path
) as vad_writer:
    for cut in cuts:
        vad_decisions = vad.run(cut)  # an np.ndarray with one entry per frame
        # frame_shift + temporal_dim make this a TemporalArray, not a plain Array
        cut.vad = vad_writer.store_array(
            cut.id, vad_decisions, frame_shift=cut.frame_shift, temporal_dim=0
        )
        cut_writer.write(cut)

Then, in your data-loader, these can be loaded by calling cut.load_vad() (thanks to the magic of custom attributes) and applied to the feats loaded from cut.load_features(), using the indexing that Piotr described.
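
Concretely, inside the dataset that might look like the following (the defensive length trim is an assumption, in case the VAD and feature frame counts are off by one):

import numpy as np

feats = cut.load_features()
mask = cut.load_vad().astype(bool).reshape(-1)  # reads the array stored under cut.vad
n = min(len(mask), feats.shape[0])
feats = feats[:n][mask[:n]]  # keep voiced frames only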

Case 2: Segment-level VAD

It may happen that your VAD generates segments (in the form <start, end>) instead of frame-level decisions. You can create a SupervisionSegment from each such segment (for each recording) and put them in cut.supervisions. Note that each cut can contain multiple supervisions.
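
For instance, a sketch of attaching such segments, where vad_segments is a hypothetical list of (start, end) pairs in seconds, relative to the cut's start:

from lhotse import SupervisionSegment

cut.supervisions = [
    SupervisionSegment(
        id=f"{cut.id}-vad-{i}",
        recording_id=cut.recording_id,
        start=start,  # seconds, relative to the cut start
        duration=end - start,
    )
    for i, (start, end) in enumerate(vad_segments)
]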

Then, in your data-loader, you can do something like the following:

import numpy as np

cut_segments = cut.trim_to_supervisions(keep_overlapping=False)
feats = []
for c in cut_segments:
    feats.append(c.load_features())
feats = np.concatenate(feats, axis=0)  # voiced frames only, in order

Note that this assumes that the segments returned by your VAD model are non-overlapping (it doesn't really make sense to have overlapping VAD segments anyway).

@armusc
Contributor

armusc commented Apr 27, 2023

Great, thanks for the suggestions!
