VadDataset example #726
Hi,
I saw the VadDataset class, and it is mentioned in the README and elsewhere. Do you know of an example setup/recipe (perhaps in other repos?) that uses it to train a VAD/segmentation model?
Thanks!
Comments
CC @desh2608, you might have some recipes using those. I wonder if I should replace VadDataset with some other one as the flagship example. At the time, we had no ASR recipes, and VAD seemed both pretty well defined and conceptually simple to showcase.
Hi, thank you!
Would something like this work, assuming you have a VAD model in Python? Or are you looking for something different?

```python
features = cut.load_features()
mask = compute_vad_mask(vad_model, features)
features = features[mask]
```
In terms of which VAD to apply, you can use e.g. SileroVAD: https://github.com/snakers4/silero-vad/wiki/Examples-and-Dependencies#examples Actually, a workflow/integration into Lhotse would be nice if somebody is willing to contribute that.
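As an illustration of what such an integration might look like, here is a rough sketch of a `compute_vad_mask` built on Silero VAD. Unlike the snippet above, it runs the VAD on the cut's audio and maps the detected speech regions onto feature frames; the 16 kHz mono audio and 10 ms frame shift are assumptions, and none of this is an existing Lhotse workflow:

```python
import numpy as np
import torch

# Load Silero VAD and its helper utilities from torch.hub.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils


def compute_vad_mask(cut, frame_shift: float = 0.01, sampling_rate: int = 16000) -> np.ndarray:
    """Return a boolean mask over the cut's feature frames (True = voiced)."""
    audio = torch.from_numpy(cut.load_audio()[0])  # first channel of the waveform
    speech = get_speech_timestamps(audio, model, sampling_rate=sampling_rate)
    mask = np.zeros(cut.num_frames, dtype=bool)
    for seg in speech:  # 'start'/'end' are sample indices
        start = int(seg["start"] / sampling_rate / frame_shift)
        end = int(np.ceil(seg["end"] / sampling_rate / frame_shift))
        mask[start : min(end, cut.num_frames)] = True
    return mask
```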
Thanks for your answer. I was also thinking: since I would like to have a concatenation of voiced-frame features for a recording, could I just use a Cut with multiple supervisions, each representing the start and duration of a voiced segment? Would this do the work when batches are formed, or do I need strictly one supervision per cut? Keep in mind that I don't have to train anything; I don't need a target, just to extract an embedding from a nnet.
I think the simplest way to get that is to write your own dataset class, like this:

```python
import torch

from lhotse import CutSet
from lhotse.dataset.collation import collate_matrices


class EmbeddingWithVadDataset(torch.utils.data.Dataset):
    def __init__(self, ...):
        self.vad = load_vad()

    def __getitem__(self, cuts: CutSet) -> dict:
        batch_feats = []
        for cut in cuts:
            feats = cut.load_features()
            voiced_mask = self.vad(feats)
            batch_feats.append(feats[voiced_mask])
        # Pad the variable-length voiced-only feature matrices into one batch tensor.
        batch_feats = collate_matrices(batch_feats)
        return {"features": batch_feats, "cuts": cuts}
```

It's also possible to add supervisions to indicate voiced segments, but you'll still need to add some logic that does something like the above.
It seems like you have some pre-computed VAD, and you want to apply it on-the-fly on the input features, possibly in the data-loader. I am assuming you have some kind of speaker ID system and you want to compute embeddings for full utterances (possibly containing silences) without the silence frames. Suppose you have a CutSet where each cut represents 1 utterance (or recording). Here are 2 ways to do it:

**Case 1: frame-level VAD**

If you have pre-computed features for the cuts, and frame-level VAD decisions on these features, you can store the VAD decisions as a custom array on each cut:

```python
with CutSet.open_writer(manifest_path) as cut_writer, LilcomChunkyWriter(
    storage_path
) as vad_writer:
    for cut in cuts:
        vad_decisions = vad.run(cut)  # vad_decisions is an np.ndarray
        cut.vad = vad_writer.store_array(cut.id, vad_decisions)
        cut_writer.write(cut)
```

Then, in your data-loader, these can be loaded by calling `cut.load_vad()` (see the sketch after this comment).

**Case 2: segment-level VAD**

It may happen that your VAD generates segments (in the form <start, end>) instead of frame-level decisions. You can create a supervision for each of these segments and attach them to the cut (also sketched after this comment). Then, in your data-loader, you can do something like the following:

```python
cut_segments = cut.trim_to_supervisions(keep_overlapping=False)
feats = []
for c in cut_segments:
    feats.append(c.load_features())
feats = np.concatenate(feats, axis=0)
```

Note that this assumes that the segments returned by your VAD model are non-overlapping (it doesn't really make sense to have overlapping VAD segments anyway).
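To make the Case 1 data-loader step concrete, here is a minimal sketch of dropping silence frames with the stored decisions, assuming the VAD array holds one decision per feature frame and that values above 0.5 mean speech:

```python
import numpy as np


def load_voiced_features(cut) -> np.ndarray:
    """Load features and keep only the frames the stored VAD marked as voiced."""
    feats = cut.load_features()  # (num_frames, feat_dim)
    vad = cut.load_vad().reshape(-1)  # the array stored via vad_writer.store_array above
    voiced = vad > 0.5  # assumption: values above 0.5 mean speech
    n = min(len(feats), len(voiced))  # guard against off-by-one length mismatches
    return feats[:n][voiced[:n]]
```

And for Case 2, one way the VAD segments could be attached as supervisions, assuming each cut starts at the beginning of its recording so the VAD times can be used directly as supervision start times:

```python
from lhotse import SupervisionSegment
from lhotse.utils import fastcopy


def attach_vad_supervisions(cut, segments):
    """`segments` is a list of (start, end) pairs in seconds produced by the VAD."""
    sups = [
        SupervisionSegment(
            id=f"{cut.id}-vad-{i}",
            recording_id=cut.recording_id,
            start=start,
            duration=end - start,
            channel=cut.channel,
        )
        for i, (start, end) in enumerate(segments)
    ]
    return fastcopy(cut, supervisions=sups)
```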
Great, thanks for the suggestions!