diart vs whisperx diarization accuracy #226
Comments
I think the problem is within the identify_speakers function:
About tweaking parameters, you could check out this issue.
thank you for the response. tried your suggestion. however, the issue seems to be lower level.
@nurgel could you explain what you mean by "lower level"?

Remember that offline diarization works with the entire context of a pre-recorded conversation, which is why most state-of-the-art systems nowadays will be way better at determining the number of speakers in a recording. In streaming diarization, you need to discover speakers as you go, and with little context available (to fulfill real-time requirements). This makes the task considerably more complicated. Streaming diarization is unfortunately not at the level of offline diarization yet.

Moreover, as @thaokimctu correctly suggested, you should consider diart's hyper-parameters, in particular

On the other hand, the gist combining diart and whisper is supposed to be a demo of the composability power of diart, not a production-ready solution. In fact, the transcription feature is still a work in progress and hasn't been released officially. Many improvements can be made to the solution I shared, certainly more than my free time allows me to develop. If you find something that could be improved, I would gladly welcome ideas and contributions.
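To illustrate why discovering speakers "as you go" is sensitive to thresholds, here is a minimal toy sketch of incremental speaker clustering. It is not diart's implementation: the class `OnlineSpeakerClusterer`, the helper `label_stream`, and the parameter `new_speaker_threshold` are hypothetical names chosen for illustration, and the 2-D vectors stand in for real speaker embeddings.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


class OnlineSpeakerClusterer:
    """Toy incremental clustering: each embedding is seen once, in order."""

    def __init__(self, new_speaker_threshold=0.5):
        # Hypothetical parameter: if the closest known speaker is farther
        # than this distance, a new speaker is created.
        self.new_speaker_threshold = new_speaker_threshold
        self.centroids = []  # one running-mean embedding per speaker
        self.counts = []     # number of embeddings merged into each centroid

    def assign(self, embedding):
        """Return a speaker index for `embedding`, creating one if needed."""
        if self.centroids:
            distances = [cosine_distance(embedding, c) for c in self.centroids]
            best = min(range(len(distances)), key=distances.__getitem__)
            if distances[best] <= self.new_speaker_threshold:
                # Update the running mean of the matched speaker.
                n = self.counts[best]
                self.centroids[best] = [
                    (c * n + e) / (n + 1)
                    for c, e in zip(self.centroids[best], embedding)
                ]
                self.counts[best] = n + 1
                return best
        # No known speaker is close enough: register a new one.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


def label_stream(embeddings, new_speaker_threshold):
    """Label a sequence of embeddings one at a time, without lookahead."""
    clusterer = OnlineSpeakerClusterer(new_speaker_threshold)
    return [clusterer.assign(e) for e in embeddings]


if __name__ == "__main__":
    stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(label_stream(stream, new_speaker_threshold=0.5))
    print(label_stream(stream, new_speaker_threshold=0.001))
```

On this toy stream, a threshold of 0.5 groups the four embeddings into two speakers, while an overly strict threshold of 0.001 splits them into four. An offline system can look at all embeddings before deciding how many clusters exist; the streaming version has to commit at each step, which is one reason tuning such thresholds matters.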
thank you for an insightful response @juanmc2005. by 'lower level' i meant not related to the code given in the gist, but related to the modules or the model weights used. the difficulty of realtime diarization is clear considering that there is no viable alternative to diart. i am rushing deadlines, so was mostly looking for a free lunch that is general enough that it works magically with minimal effort on my side (somewhat sounds like AGI) :) looking forward to SpeakerAwareTranscription if/when you decide to share it with the world. all the best!
trying the whisper_diart example here (https://gist.github.com/juanmc2005/ed6413e697e176cb36a149d8c40a3a5b) on a remote WebsocketAudioSource on an A100 with whisper large. encountering the following issues in the process. diart:
this did not happen in whisperx out of the box. however, the realtime capabilities of diart are very tempting for a realtime app. are there any parameters that could be tweaked to improve/match the performance?