SpeechBrain embedding compatibility #34

Closed
nefastosaturo opened this issue Apr 13, 2022 · 11 comments

@nefastosaturo

Hello there,

first, thank you for your great work!

I was trying to reproduce the result I get from https://huggingface.co/spaces/pyannote/pretrained-pipelines, but using the diart modules to achieve the same thing in a real-time/streaming fashion.

That pipeline uses the speechbrain embedding model. I tried to reverse engineer it a bit, but without success :)

I used the example in the README to set up a pipeline and created a custom embedding module:

import torch
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

class SpeechBrainEmbedding:
    def __init__(self, model):
        self.model = PretrainedSpeakerEmbedding(model)

    def __call__(self, waveform, weights):
        with torch.no_grad():
            chunk = torch.from_numpy(waveform.data.T).float()
            inputs = chunk.unsqueeze(0)
            # the problem is here!
            # weights has shape (num_local_speakers, num_frames)
            weights = torch.from_numpy(weights.data.T).float().to("cpu")
            inputs = inputs.repeat(weights.shape[0], 1, 1)
            # PretrainedSpeakerEmbedding returns a numpy array
            output = self.model(inputs, weights)

        return torch.from_numpy(output).float()

and then set up the other modules as in the example:

segmentation = fn.FrameWiseModel("pyannote/segmentation")
embedding = SpeechBrainEmbedding("speechbrain/spkrec-ecapa-voxceleb")
osp = fn.OverlappedSpeechPenalty(gamma=3, beta=10)
normalization = fn.EmbeddingNormalization(norm=1)

So basically I pass those weights to the model's __call__, which triggers line 173 here and sets my embeddings to NaN.

So clearly I didn't understand what the weights coming from the OverlappedSpeechPenalty module mean, and whether they are useful for the speechbrain embedding module described here.

Could you give me some hints? Maybe it's better to try to build something with Gradio as noted in #30?

juanmc2005 added the "question" label on Apr 15, 2022
@juanmc2005
Owner

Hi @nefastosaturo and thank you :)

From a quick look at SpeechBrainPretrainedSpeakerEmbedding in pyannote.audio, it looks like the __call__ method is expecting masks and not weights. It should have a similar effect but I wanted to bring this up just in case.

Overlapped speech penalty should still be useful because it happens before calling the embedding model and because the weights it computes can also work as a mask.
However, if I understand the code correctly, what may be happening is that your weights are not correctly scaled between 0 and 1. If they are too low (which is probably the case due to OverlappedSpeechPenalty's formula), this line makes them 0.
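
For instance, something like this (just a sketch, assuming the weights arrive as a numpy array of shape (num_local_speakers, num_frames)) could rescale them per speaker before using them as a mask:

import numpy as np

def rescale_weights(weights):
    # Min-max rescale each speaker's frame weights to [0, 1] so that
    # low OSP values are not zeroed out by that thresholding line.
    w_min = weights.min(axis=-1, keepdims=True)
    w_max = weights.max(axis=-1, keepdims=True)
    return (weights - w_min) / np.maximum(w_max - w_min, 1e-8)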

Are you sure you're not having NaN's because of this line instead?

In any case I would be extremely interested in adding a working SpeechBrainEmbedding module to diart. If you want to make a contribution and open a PR with this feature once it works that would be awesome :)

juanmc2005 added the "feature" label on Apr 15, 2022
@juanmc2005
Owner

I implemented a working version using speechbrain embeddings and it seems to work well with OSP weights as masks, even without normalization.

I don't know if this is the same error you were getting (can you post a stacktrace?) but I was also getting nans in some embeddings, which makes sense because of the mask being lower than 0.5 for some speakers.
If you replace line 163 in mapping.py with this: self.mapping_matrix = np.nan_to_num(mapping_matrix, nan=objective.invalid_value), any active speakers with a nan embedding will be ignored during clustering.
Maybe this fix can backfire in other situations but I can't think of any right now.
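
Just to illustrate the effect with a toy example (not the actual mapping.py code; here I assume the objective is a cost to minimize, so the invalid value is +inf):

import numpy as np

# one local speaker has a NaN embedding, so its whole row of distances is NaN
mapping_matrix = np.array([[0.3, 0.9],
                           [np.nan, np.nan]])
invalid_value = np.inf  # assumed invalid value for a cost to be minimized
clean = np.nan_to_num(mapping_matrix, nan=invalid_value)
# that speaker now has an infinite cost for every centroid,
# so it never wins an assignment during clustering
print(clean)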

However, on my side (CPU only) speechbrain is way slower than pyannote/embedding.
I got about 70ms for a forward pass of pyannote/embedding versus 2s for speechbrain.
I'm not super familiar with the speechbrain model but 2s is a huge latency for real-time applications.
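
If anyone wants to reproduce this kind of measurement, here's a rough sketch that times a single forward pass (swap the model name for "pyannote/embedding" to compare both numbers):

import time
import torch
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding

model = PretrainedSpeakerEmbedding("speechbrain/spkrec-ecapa-voxceleb")
inputs = torch.randn(1, 1, 16000 * 5)  # one 5s mono chunk at 16 kHz

start = time.perf_counter()
_ = model(inputs)  # returns a numpy array of shape (batch, dimension)
print(f"forward pass: {time.perf_counter() - start:.3f}s")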

It could still be interesting to have compatibility with the speechbrain model anyway for other applications where real-time latency is not critical.

@hbredin
Collaborator

hbredin commented Apr 21, 2022

Despite this huge slowdown, did you estimate the gain in accuracy?
How much faster is it on GPU?

@juanmc2005
Owner

I didn't have time or access to a GPU, but I'll take a look at that when I have some free time.

@nefastosaturo
Author

Hello @juanmc2005, thank you for your answer.

@hbredin, using a single wav file as a test, the diart pipeline with the speechbrain embeddings runs quite fast on my GPU (audio file: ~24s, pipeline computation time on my laptop's Nvidia 1050: ~5s).

@juanmc2005,
about the weights: I see at this line that the AttentiveStatisticsPooling must be replaced with the pyannote one to replicate the workflow of your paper. I will try to figure it out in the coming days :)

The thing about masks is that, as I read here, they are used to ignore padding inside a batch of waveforms. So do you think that by using the weights as masks you kind of "penalize" one waveform against another in the batch?

@juanmc2005
Owner

@hbredin, using a single wav file as a test, the diart pipeline with the speechbrain embeddings runs quite fast on my GPU (audio file: ~24s, pipeline computation time on my laptop's Nvidia 1050: ~5s).

Could you clarify these numbers?
Am I right in assuming that 24s is the time it takes for the pipeline to process the entire audio file on your GPU? How long is that file?
What do you mean by "pipeline computation time"? Is it the time it takes to process a single chunk on your GPU?

About the weights: I see at this line that the AttentiveStatisticsPooling must be replaced with the pyannote one to replicate the workflow of your paper. I will try to figure it out in the coming days :)

I'm not familiar with how the AttentiveStatisticsPooling layer works, but I don't think you can replace that without retraining speaker embeddings because the following layers may be dependent on its behavior. The alternative would be to adapt this attentive pooling layer to take weights in the same manner it's done in pyannote.audio. I don't know if this is possible, but I'm very curious to find out.
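
To give an idea of what "taking weights" could mean, here's a sketch of a weighted statistics pooling in the spirit of pyannote.audio's StatsPool (not SpeechBrain's actual AttentiveStatisticsPooling):

import torch

def weighted_stats_pool(features, weights):
    # features: (batch, channels, frames), weights: (batch, frames)
    # frames with low weight contribute less to the pooled statistics
    w = weights.unsqueeze(1)
    w = w / (w.sum(dim=-1, keepdim=True) + 1e-8)
    mean = (features * w).sum(dim=-1)
    var = ((features - mean.unsqueeze(-1)) ** 2 * w).sum(dim=-1)
    return torch.cat([mean, var.clamp(min=1e-8).sqrt()], dim=-1)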

The thing about masks is that, as I read here, they are used to ignore padding inside a batch of waveforms. So do you think that by using the weights as masks you kind of "penalize" one waveform against another in the batch?

If the only use of the mask is to ignore padding then I agree that it may not be the smartest way to apply OSP. That said, "ignoring" frames with small weights is also kind of what we're aiming at, so I don't think it's the worst idea either. I guess we'll know for sure once we test it against pyannote/embedding.

Right now I'm working on a faster implementation of the pipeline that pre-calculates the segmentation and embeddings in batches before streaming. This should speed things up a lot for evaluation (issue #35). I also want to add an RxPY operator to benchmark real-time latency if I have the time.

@nefastosaturo
Author

Could you clarify these numbers?

Yep, sorry, I was typing in a hurry :) I ran your example code from the README, adapted to use the RegularAudioFileReader with a 5s window and a 0.5s step. Here are some code snippets.
The embeddings are calculated by the SpeechBrainEmbedding class:

class SpeechBrainEmbedding:
    def __init__(self, model):
        self.model = PretrainedSpeakerEmbedding(model, device=torch.device("cuda:0"))

    def __call__(self, waveform, weights):
        with torch.no_grad():
            chunk = torch.from_numpy(waveform.data.T)
            inputs = chunk.unsqueeze(0)
            inputs = inputs.repeat(4, 1, 1)  # no weights, 4 possible speakers
            output = self.model(inputs)

        return torch.from_numpy(output)

The clustering step, taken from here:

clustering = fn.OnlineSpeakerClustering(0.555, 0.422, 1.517)

aggregation = fn.DelayedAggregation(
    0.5, 0.5, strategy="hamming", stream_end=None
)

pipeline = rx.zip(segmentation_stream, embedding_stream).pipe(
    ops.starmap(clustering),
    # Buffer 'num_overlapping' sliding chunks with a step of 1 chunk
    myops.buffer_slide(aggregation.num_overlapping_windows),
    # Aggregate overlapping output windows
    ops.map(aggregation),
    # Binarize output
    ops.map(fn.Binarize(uri, 0.5)),
)

My audio file contains 3 non-overlapping speakers and is 24.473s long.

The diarization pipeline above processes the audio file in 5.550771630001691 seconds. As I said, I ran that code on my laptop with a very basic Nvidia 1050; the nvidia-smi command returns:
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4

torch was installed with conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch on Ubuntu 18.04.6

The alternative would be to adapt this attentive pooling layer to take weights in the same manner it's done in pyannote.audio.

Yes, that could be a nice move. I will try :)

But right now I will focus on just obtaining the same result I got from the offline Hugging Face pipeline on the test audio I'm using. I think the crucial step is the OnlineSpeakerClustering.

May I ask what ranges of values are valid for tau_active, rho_update and delta_new? I would like to try to optimize them for different audio files of mine, thanks!

@juanmc2005
Owner

The diarization pipeline above processes the audio file in 5.550771630001691 seconds.

Ok, thanks for the info. So if I'm not mistaken this is about 140ms per chunk for the whole pipeline (roughly 40 chunks of 5s with a 0.5s step over a 24.5s file, so 5.55s / 40 ≈ 140ms), which is still good but a bit high. Have you benchmarked it with pyannote/embedding?

Yes, that could be a nice move. I will try :)

Awesome, please don't forget to contribute with a PR when you get a working version :)

I think the crucial step is the OnlineSpeakerClustering.

It is indeed important to tune clustering hyper-parameters, but beware that since your implementation doesn't apply OSP, it's going to fail at tracking overlapping speakers because the 4 embeddings you extract from the chunk are identical. Right now that's not a problem for your 24s file from what I understand, but it's something to keep in mind.

May I ask what ranges of values are valid for tau_active, rho_update and delta_new?

Sure:

0 <= tau_active <= 1  # threshold on speaker probabilities
0 <= rho_update <= 1  # ratio of speech in a chunk
0 <= delta_new <= 2  # threshold on cosine distance
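
If you want to tune them quickly, here's a naive random-search sketch within those ranges (just an illustration, nothing diart-specific):

import random

def sample_clustering_hparams():
    # draw one candidate configuration within the valid ranges above
    return {
        "tau_active": random.uniform(0.0, 1.0),
        "rho_update": random.uniform(0.0, 1.0),
        "delta_new": random.uniform(0.0, 2.0),
    }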

juanmc2005 changed the title from "Reproduce pyannote huggingface pipeline result" to "SpeechBrain embedding compatibility" on May 8, 2022
@juanmc2005
Owner

@nefastosaturo any news on this?

Recently I've been working a lot on adding custom models (#43), optimizing thresholds (#53) and running faster batched inference (#35), which should all help integrate and tune speechbrain embeddings.

Custom models and optimization should be good to go for version 0.4 (next release).

@juanmc2005
Owner

@zaouk we talked about this some days ago

@juanmc2005
Owner

SpeechBrain embedding compatibility in progress as part of #188

juanmc2005 added this to the Version 0.9 milestone on Nov 9, 2023
juanmc2005 mentioned this issue on Nov 18, 2023