SpeechBrain embedding compatibility #34

Closed
nefastosaturo opened this issue Apr 13, 2022 · 11 comments

@nefastosaturo

Hello there,

first, thank you for your great work!

I was trying to reproduce the result I get from https://huggingface.co/spaces/pyannote/pretrained-pipelines, but using the diart modules to achieve the same thing in a real-time/streaming fashion.

That pipeline uses the speechbrain embedding model. I tried to reverse engineer it a bit, but without success :)

I used the example in the README to set up a pipeline and created a custom embedding module:

import torch
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

class SpeechBrainEmbedding:
    def __init__(self, model):
        self.model = PretrainedSpeakerEmbedding(model)

    def __call__(self, waveform, weights):
        with torch.no_grad():
            chunk = torch.from_numpy(waveform.data.T).float()
            inputs = chunk.unsqueeze(0)
            # the problem is here!
            # weights has shape (num_local_speakers, num_frames)
            weights = torch.from_numpy(weights.data.T).float().to("cpu")
            inputs = inputs.repeat(weights.shape[0], 1, 1)
            # PretrainedSpeakerEmbedding returns a numpy array
            output = self.model(inputs, weights)

        return torch.from_numpy(output).float()

and then set up the other modules as in the example:

segmentation = fn.FrameWiseModel("pyannote/segmentation")
embedding = SpeechBrainEmbedding("speechbrain/spkrec-ecapa-voxceleb")
osp = fn.OverlappedSpeechPenalty(gamma=3, beta=10)
normalization = fn.EmbeddingNormalization(norm=1)

So basically I pass those weights to the model's __call__, which triggers line 173 here and sets my embeddings to NaN.

So clearly I didn't understand what the weights coming from the OverlappedSpeechPenalty module mean, and whether they are useful for the speechbrain embedding module described here.

Could you give me some hints? Maybe it's better to try to build something with Gradio as noted in #30?

juanmc2005 added the "question" label on Apr 15, 2022
@juanmc2005
Owner

Hi @nefastosaturo and thank you :)

From a quick look at SpeechBrainPretrainedSpeakerEmbedding in pyannote.audio, it looks like the __call__ method is expecting masks and not weights. It should have a similar effect but I wanted to bring this up just in case.

Overlapped speech penalty should still be useful because it happens before calling the embedding model and because the weights it computes can also work as a mask.
However, if I understand the code correctly, what may be happening is that your weights are not correctly scaled between 0 and 1. If they are too low (which is probably the case due to OverlappedSpeechPenalty's formula), this line makes them 0.
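
For instance, something like this (just a sketch, assuming the weights arrive as a numpy array of shape (num_local_speakers, num_frames)) could rescale them per speaker before using them as a mask:

import numpy as np

def rescale_weights(weights):
    # Min-max rescale each speaker's frame weights to [0, 1] so that
    # low OSP values are not zeroed out by that thresholding line.
    w_min = weights.min(axis=-1, keepdims=True)
    w_max = weights.max(axis=-1, keepdims=True)
    return (weights - w_min) / np.maximum(w_max - w_min, 1e-8)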

Are you sure you're not having NaN's because of this line instead?

In any case I would be extremely interested in adding a working SpeechBrainEmbedding module to diart. If you want to make a contribution and open a PR with this feature once it works that would be awesome :)

juanmc2005 added the "feature" label on Apr 15, 2022
@juanmc2005
Owner

I implemented a working version using speechbrain embeddings and it seems to work well with OSP weights as masks, even without normalization.

I don't know if this is the same error you were getting (can you post a stacktrace?) but I was also getting nans in some embeddings, which makes sense because of the mask being lower than 0.5 for some speakers.
If you replace line 163 in mapping.py with this: self.mapping_matrix = np.nan_to_num(mapping_matrix, nan=objective.invalid_value), any active speakers with a nan embedding will be ignored during clustering.
Maybe this fix can backfire in other situations but I can't think of any right now.
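
Just to illustrate the effect with a toy example (not the actual mapping.py code; here I assume the objective is a cost to minimize, so the invalid value is +inf):

import numpy as np

# one local speaker has a NaN embedding, so its whole row of distances is NaN
mapping_matrix = np.array([[0.3, 0.9],
                           [np.nan, np.nan]])
invalid_value = np.inf  # assumed invalid value for a cost to be minimized
clean = np.nan_to_num(mapping_matrix, nan=invalid_value)
# that speaker now has an infinite cost for every centroid,
# so it never wins an assignment during clustering
print(clean)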

However, on my side (CPU only) speechbrain is way slower than pyannote/embedding.
I got about 70ms for a forward pass of pyannote/embedding versus 2s for speechbrain.
I'm not super familiar with the speechbrain model but 2s is a huge latency for real-time applications.
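
If anyone wants to reproduce this kind of measurement, here's a rough sketch that times a single forward pass (swap the model name for "pyannote/embedding" to compare both numbers):

import time
import torch
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

model = PretrainedSpeakerEmbedding("speechbrain/spkrec-ecapa-voxceleb")
inputs = torch.randn(1, 1, 16000 * 5)  # one 5s mono chunk at 16 kHz

start = time.perf_counter()
_ = model(inputs)  # returns a numpy array of shape (batch, dimension)
print(f"forward pass: {time.perf_counter() - start:.3f}s")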

It could still be interesting to have compatibility with the speechbrain model anyway for other applications where real-time latency is not critical.

@hbredin
Collaborator

hbredin commented Apr 21, 2022

Despite this huge slowdown, did you estimate the gain in accuracy?
How much faster is it on GPU?

@juanmc2005
Owner

I didn't have time or access to a GPU, but I'll take a look at that when I have some free time.

@nefastosaturo
Author

Hello @juanmc2005, thank you for your answer.

@hbredin, using a single wav file as a test, the diart pipeline with the speechbrain embeddings runs quite fast on my GPU (audio file: ~24s, pipeline computation time on my laptop's Nvidia 1050: ~5s).

@juanmc2005,
about the weights: I see at this line that the AttentiveStatisticsPooling must be replaced with the pyannote one to replicate the workflow of your paper. I will try to figure it out in the coming days :)

The thing about masks is that, as I read here, they are used to ignore padding inside a batch of waveforms. So do you think that by using the weights as masks you kind of "penalize" one waveform against another in the batch?

@juanmc2005
Owner

@hbredin, using a single wav file as a test, the diart pipeline with the speechbrain embeddings runs quite fast on my GPU (audio file: ~24s, pipeline computation time on my laptop's Nvidia 1050: ~5s).

Could you clarify these numbers?
Am I right in assuming that 24s is the time it takes for the pipeline to process the entire audio file on your GPU? How long is that file?
What do you mean by "pipeline computation time"? Is it the time it takes to process a single chunk on your GPU?

About the weights: I see at this line that the AttentiveStatisticsPooling must be replaced with the pyannote one to replicate the workflow of your paper. I will try to figure it out in the coming days :)

I'm not familiar with how the AttentiveStatisticsPooling layer works, but I don't think you can replace that without retraining speaker embeddings because the following layers may be dependent on its behavior. The alternative would be to adapt this attentive pooling layer to take weights in the same manner it's done in pyannote.audio. I don't know if this is possible, but I'm very curious to find out.
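
To give an idea of what "taking weights" could mean, here's a sketch of a weighted statistics pooling in the spirit of pyannote.audio's StatsPool (not SpeechBrain's actual AttentiveStatisticsPooling):

import torch

def weighted_stats_pool(features, weights):
    # features: (batch, channels, frames), weights: (batch, frames)
    # frames with low weight contribute less to the pooled statistics
    w = weights.unsqueeze(1)
    w = w / (w.sum(dim=-1, keepdim=True) + 1e-8)
    mean = (features * w).sum(dim=-1)
    var = ((features - mean.unsqueeze(-1)) ** 2 * w).sum(dim=-1)
    return torch.cat([mean, var.clamp(min=1e-8).sqrt()], dim=-1)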

The thing about masks is that, as I read here, they are used to ignore padding inside a batch of waveforms. So do you think that by using the weights as masks you kind of "penalize" one waveform against another in the batch?

If the only use of the mask is to ignore padding then I agree that it may not be the smartest way to apply OSP. That said, "ignoring" frames with small weights is also kind of what we're aiming at, so I don't think it's the worst idea either. I guess we'll know for sure once we test it against pyannote/embedding.

Right now I'm working on a faster implementation of the pipeline that pre-calculates the segmentation and embeddings in batches before streaming. This should speed things up a lot for evaluation (issue #35). I also want to add an RxPY operator to benchmark real-time latency if I have the time.

@nefastosaturo
Author

Could you clarify these numbers?

Yep, sorry, I was typing in a hurry :) I ran your example code from the README, adapted to use the RegularAudioFileReader with a 5s window and a 0.5s step. Here are some code snippets.
The embeddings are calculated by the SpeechBrainEmbedding class:

class SpeechBrainEmbedding:
    def __init__(self, model):
        self.model = PretrainedSpeakerEmbedding(model, device=torch.device("cuda:0"))

    def __call__(self, waveform, weights):
        with torch.no_grad():
            chunk = torch.from_numpy(waveform.data.T)
            inputs = chunk.unsqueeze(0)
            inputs = inputs.repeat(4, 1, 1)  # no weights, 4 possible speakers
            output = self.model(inputs)

        return torch.from_numpy(output)

The clustering step, taken from here:

clustering = fn.OnlineSpeakerClustering(0.555, 0.422, 1.517)

aggregation = fn.DelayedAggregation(
    0.5, 0.5, strategy="hamming", stream_end=None
)

pipeline = rx.zip(segmentation_stream, embedding_stream).pipe(
    ops.starmap(clustering),
    # Buffer 'num_overlapping' sliding chunks with a step of 1 chunk
    myops.buffer_slide(aggregation.num_overlapping_windows),
    # Aggregate overlapping output windows
    ops.map(aggregation),
    # Binarize output
    ops.map(fn.Binarize(uri, 0.5)),
)

My audio file contains 3 non-overlapping speakers and is 24.473s long.

The diarization pipeline above processes the audio file in 5.550771630001691 seconds. As I said, I ran that code on my laptop with a very basic Nvidia 1050; the nvidia-smi command returns:
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4

torch was installed with conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch on Ubuntu 18.04.6

The alternative would be to adapt this attentive pooling layer to take weights in the same manner it's done in pyannote.audio.

Yes, that could be a nice move. I will try :)

But right now I will focus on just obtaining the same result I got from the offline Hugging Face pipeline on the test audio I'm using. I think the crucial step is the OnlineSpeakerClustering.

May I ask what ranges of values are valid for tau_active, rho_update and delta_new? I would like to try to optimize them for different audio files of mine, thanks!

@juanmc2005
Owner

The diarization pipeline above processes the audio file in 5.550771630001691 seconds.

Ok, thanks for the info. So if I'm not mistaken this is about 140ms per chunk for the whole pipeline (roughly 40 chunks of 5s with a 0.5s step over a 24.5s file, so 5.55s / 40 ≈ 140ms), which is still good but a bit high. Have you benchmarked it with pyannote/embedding?

Yes, that could be a nice move. I will try :)

Awesome, please don't forget to contribute with a PR when you get a working version :)

I think the crucial step is the OnlineSpeakerClustering.

It is indeed important to tune clustering hyper-parameters, but beware that since your implementation doesn't apply OSP, it's going to fail at tracking overlapping speakers because the 4 embeddings you extract from the chunk are identical. Right now that's not a problem for your 24s file from what I understand, but it's something to keep in mind.

May I ask what ranges of values are valid for tau_active, rho_update and delta_new?

Sure:

0 <= tau_active <= 1  # threshold on speaker probabilities
0 <= rho_update <= 1  # ratio of speech in a chunk
0 <= delta_new <= 2  # threshold on cosine distance
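
If you want to tune them quickly, here's a naive random-search sketch within those ranges (just an illustration, nothing diart-specific):

import random

def sample_clustering_hparams():
    # draw one candidate configuration within the valid ranges above
    return {
        "tau_active": random.uniform(0.0, 1.0),
        "rho_update": random.uniform(0.0, 1.0),
        "delta_new": random.uniform(0.0, 2.0),
    }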

juanmc2005 changed the title from "Reproduce pyannote huggingface pipeline result" to "SpeechBrain embedding compatibility" on May 8, 2022
@juanmc2005
Owner

@nefastosaturo any news on this?

Recently I've been working a lot on adding custom models (#43), optimizing thresholds (#53) and running faster batched inference (#35), which should all help integrate and tune speechbrain embeddings.

Custom models and optimization should be good to go for version 0.4 (next release).

@juanmc2005
Owner

@zaouk we talked about this some days ago

@juanmc2005
Owner

SpeechBrain embedding compatibility in progress as part of #188

juanmc2005 added this to the Version 0.9 milestone on Nov 9, 2023
juanmc2005 mentioned this issue on Nov 18, 2023