diart vs whisperx diarization accuracy #226

Closed
nurgel opened this issue Dec 25, 2023 · 4 comments


@nurgel

nurgel commented Dec 25, 2023

trying the whisper_diart example here (https://gist.github.com/juanmc2005/ed6413e697e176cb36a149d8c40a3a5b) on a remote WebsocketAudioSource on an A100 with whisper large. encountering the following issues in the process. diart:

  • appears to forget the old speakers and assigns a new label even if a speaker was present in the recording earlier
  • repeats the initial phrase

this did not happen in whisperx out of the box. however, the realtime capabilities of diart are very tempting for a realtime app. are there any parameters that could be tweaked to improve or match that performance?

@thaokimctu

thaokimctu commented Dec 25, 2023

I think the problem is within the identify_speakers function:

        # Assign a speaker to the segment based on diarization
        speakers = dia.labels()
        num_speakers = len(speakers)
        if num_speakers == 0:
            # No speakers were detected
            caption = (-1, segment["text"])
        elif num_speakers == 1:
            # Only one speaker is active in this segment
            spk_id = int(speakers[0].split("speaker")[1])
            caption = (spk_id, segment["text"])
        else:
            # Multiple speakers, select the one that speaks the most
            max_speaker = int(np.argmax([
                dia.label_duration(spk) for spk in speakers
            ]))
            caption = (max_speaker, segment["text"])
        speaker_captions.append(caption)

    return speaker_captions

The line max_speaker = int(np.argmax([dia.label_duration(spk) for spk in speakers])) returns the index, within the speakers list, of the speaker with the longest speaking duration. It does not return that speaker's ID. So I think caption = (max_speaker, segment["text"]) should be caption = (speakers[max_speaker], segment["text"])
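To make the suggested fix concrete, here is a minimal sketch of the speaker selection with the index-to-label lookup applied. It assumes dia exposes labels() and label_duration(label) as in the snippet above, and that labels look like "speaker3". The helper name dominant_speaker_id and the FakeAnnotation class are hypothetical and exist only for illustration; they are not part of the gist or of diart. Unlike the one-line fix above, it also parses the label into an integer so that all three branches return the same type.

```python
def dominant_speaker_id(dia) -> int:
    """Return the numeric ID of the speaker who talks the most in `dia`.

    Assumes `dia` has `labels()` and `label_duration(label)` (as used in
    the gist snippet) and that labels are shaped like "speaker3".
    Returns -1 when no speaker is active, mirroring the original code.
    """
    speakers = dia.labels()
    if not speakers:
        return -1
    # Pick the label itself, not its position in the list.
    best_label = max(speakers, key=dia.label_duration)
    return int(best_label.split("speaker")[1])


class FakeAnnotation:
    """Hypothetical stand-in for a diarization result (illustration only)."""

    def __init__(self, durations):
        self._durations = dict(durations)

    def labels(self):
        return sorted(self._durations)

    def label_duration(self, label):
        return self._durations[label]
```

With speakers "speaker2" (1.0 s) and "speaker5" (3.0 s) active in a segment, the original code would report speaker 1 (the list index), whereas this sketch reports speaker 5.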

About tweaking parameters you could check out this issue

@nurgel
Author

nurgel commented Dec 25, 2023

thank you for the response. tried your suggestion. however, the issue seems to be lower level.
overall, this library does not appear to be production-ready for now.

@nurgel nurgel closed this as completed Dec 25, 2023
@juanmc2005
Owner

> however, the issue seems to be lower level.

@nurgel could you explain what you mean by "lower level"?

Remember that offline diarization works with the entire context of a pre-recorded conversation, which is why most state-of-the-art systems nowadays will be way better at determining the number of speakers in a recording.

In streaming diarization, you need to discover speakers as you go, and with little context available (to fulfill real-time requirements). This makes the task considerably more complicated. Streaming diarization is unfortunately not at the level of offline diarization yet.

Moreover, as @thaokimctu correctly suggested, you should consider diart's hyper-parameters, in particular delta_new if you find it tends to create too many speakers (I suggest you try increasing it). These hyper-parameters should be tuned on conversations similar to what you expect to see in production, and you may need to collect some data to do this. As with anything in machine learning, there is no free lunch. Additionally, you may try the many new models that are now compatible as part of v0.9.
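To illustrate why increasing delta_new tends to produce fewer speakers, here is a toy sketch of threshold-based incremental clustering. It is a simplified illustration and not diart's actual implementation: the function names are hypothetical, centroids are never updated, and the embeddings are made-up 2-D vectors. The only idea carried over is that a new speaker is created when an embedding is farther than delta_new from every known speaker.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def assign_speakers(embeddings, delta_new):
    """Assign each embedding to a speaker index, in arrival order.

    An embedding joins the closest known centroid if its distance is
    below `delta_new`; otherwise it starts a new speaker.
    """
    centroids = []
    labels = []
    for emb in embeddings:
        distances = [cosine_distance(emb, c) for c in centroids]
        if distances and min(distances) < delta_new:
            labels.append(distances.index(min(distances)))
        else:
            centroids.append(list(emb))
            labels.append(len(centroids) - 1)
    return labels


# Made-up embeddings: three near [1, 0] (one voice) and one near [0, 1].
embeddings = [[1.0, 0.0], [0.9, 0.3], [0.0, 1.0], [0.95, 0.2]]

strict = assign_speakers(embeddings, delta_new=0.001)
lenient = assign_speakers(embeddings, delta_new=0.5)
print(len(set(strict)), len(set(lenient)))
```

With the very small threshold every chunk becomes its own speaker, while the larger threshold merges the three similar embeddings into one. In diart itself, delta_new is set through the pipeline configuration; check the README of your installed version for the exact call.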

On the other hand, the gist combining diart and whisper is meant as a demo of diart's composability, not a production-ready solution. In fact, the transcription feature is still a work in progress and hasn't been released officially. Many improvements can be made to the solution I shared, certainly more than my free time allows me to develop.

If you find something could be improved, I would gladly welcome ideas and contributions.

@nurgel
Author

nurgel commented Dec 25, 2023

thank you for an insightful response @juanmc2005

by ‘lower level’ i meant not related to the code given in the gist, but related to the modules or the model weights used.


the difficulty of realtime diarization is clear considering that there is no viable alternative to diart. i am rushing deadlines, so was mostly looking for a free lunch that is general enough that it works magically with minimal effort on my side (somewhat sounds like AGI) :) looking forward to SpeakerAwareTranscription if/when you decide to share it with the world. all the best!
