Enhancement: Possible to determine different speakers? #9

holynuts · 2022-12-21T17:58:05Z

Is it possible to add the functions of determining the different speakers in a conversation and identifying them in the subtitles?

machinelearnear · 2022-12-22T14:38:19Z

You could run this https://huggingface.co/spaces/dwarkesh/whisper-speaker-recognition (see app.py under Files and versions) and merge both outputs to get the timestamps.

m-bain · 2022-12-22T15:06:47Z

I am currently working on a robust way to do this, and there are a few ways to do it, you can use pyannote-audio for diarization.
https://github.com/pyannote/pyannote-audio

Either you merge the outputs, assign words to speakers by looking up the diarization timestamps.
or you feed each diarization segment as input to whisper independently.

Infinitay · 2022-12-22T19:22:58Z

I'm very much looking forward to this addition. Thanks for all your work.

MahmoudAshraf97 · 2022-12-22T22:40:01Z

Nvidia NeMo has a tutorial that does speaker diarization and ASR, basically it produces tokens with timestamps which are then grouped according to the RTTM timestamps, diarization timestamps are more accurate than pyannote but their ASR is awful compared to whisper, I'm currently experimenting how to merge alligned word-level transcriptions from whisperX with diarization from NeMo

m-bain · 2022-12-23T00:43:56Z

@MahmoudAshraf97 ah yes I saw this tutorial -- i didnt know their diarization is better! I will test it on my data -- i thought pyannote was the current best, thank you for letting me know.

MahmoudAshraf97 · 2022-12-23T01:35:51Z

@m-bain I made a minimal example of my idea using my fork of this repo, it saves the words+timestamps using pickle and modifies it to fit NeMo format.
it runs well locally if you face problems with tcmalloc using colab

https://colab.research.google.com/drive/1fhjGIr_S_vERE9_F1UL033g5shVD-q5K

Fcabla · 2023-01-20T12:54:09Z

What do you think about using speech/source separation models to produce an audio track for each speaker. Then use whisperx to transcribe the result with precise timestamps. Finally put all the transcriptions together and sort them according to timestamp.
I understand that this approach would be an alternative to using pyannote or any other package to identify the speakers.
I have tried to implement this idea with some models from the espnet2 library without any satisfactory result. Do you think it is feasible?

As pointed out in here whisper generally transcribes only one speaker when there is overlap. I think this approach could be a solution to this problem.

m-bain · 2023-01-21T10:16:15Z

@Fcabla I am not sure if speech separation is needed unless you have a lot of overlapping speakers. I have good results so far using:

Run whisperX and diarization separately. For each word, look if its timestamp lies within a diarization segment, if so, assign speaker label to that word.

However this assumes the word timestamps are 100% accurate, which is not always the case due to the current whisperX assumption that whisper timestamps are correct +/- 2 seconds

Fcabla · 2023-01-22T11:41:24Z

@m-bain I agree, for most applications it is not necessary. I am currently using a practically identical pipeline with quite good results. However, when there is some overlapping I have encountered the following two problems.

Whisper fails to transcribe several dialogues occurring simultaneously, for a given timestamp there is only one token. However, diarization models can identify multiple speakers for a given timestamp.

For example, imagine that the diarization model says that SPEAKER_00 speaks from second 5:00 to second 13:46 and SPEAKER_01 speaks from second 7:35 to second 8:02 -> there is overlap. To whom is assigned the token of the transcript that whisper generates and that starts at second 7:50.

It would be amazing if whisper were able to transcribe several tokens occurring at the same time, but currently it is not. Hence the idea of using speech separation as an alternative to diarization.

m-bain · 2023-01-24T15:04:17Z

@Fcabla I see yes overlapping speech is a difficult problem, probably worth using speech separation only for overlapping segments

m-bain · 2023-01-24T15:04:39Z

@holynuts first attempt at including diarization in the recent commit d395c21

holynuts · 2023-01-25T19:35:35Z

Thank you and I am existed with it, I did a quick run and having the error "ModuleNotFoundError: No module named 'whisperx.diarize'" occurred, will do a few more test later on this week. thank you again.

ran another one, facing errors as below, I did use not english file, it might be the reason, will try again .

arcticlinux · 2023-01-26T04:42:16Z

@m-bain

@holynuts first attempt at including diarization in the recent commit d395c21

The README lists --diarization but the code uses --diarize

holynuts · 2023-01-26T22:00:42Z

Failed to align segment: no characters in this segment found in model dictionary, resorting to original...
Performing diarization...
Traceback (most recent call last):
File "/usr/local/bin/whisperx", line 8, in
sys.exit(cli())
File "/usr/local/lib/python3.8/dist-packages/whisperx/transcribe.py", line 469, in cli
diarize_segments = diarize_pipeline(audio_path, min_speakers=min_speakers, max_speakers=max_speakers)
TypeError: 'NoneType' object is not callable

alexgo84 · 2023-02-07T19:26:53Z

I've tested with Russian audio and everything worked extremely well, apart for Diarization. Even when exact number of speakers is provided, the text is often attributed to the wrong speaker. Let me know if I can support work on Diarization.

arcticlinux · 2023-02-07T20:02:21Z

I added the diarization to the Text output from utils.write_txt in a fork, as extracting/viewing it from the other formats wasn't very easy. I could add a pull request for a more basic version of that, I made my version only output the speaker's name when the speaker changes.

…

On Tue, Feb 7, 2023 at 12:27 PM Alex Gordon ***@***.***> wrote: I've tested with Russian audio and everything worked extremely well, apart for Diarization. Even when exact number of speakers is provided, the text is often attributed to the wrong speaker. Let me know if I can support work on Diarization. — Reply to this email directly, view it on GitHub <#9 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAAXOL2UZ5WB5A4V6DC6YS3WWKOYPANCNFSM6AAAAAATF4KZRA> . You are receiving this because you commented.Message ID: ***@***.***>

gabrilator · 2023-02-08T16:59:11Z

Hey folks!
First of all, thank you for the great work.
about the diarization, is there any example of the python implementation? I just see the cmd line example, and I would like to test it with python.

Thanks!

gabrilator · 2023-02-08T17:01:17Z

Also, how does one add the Hugging Face token in the python implementation?

Cheers

pdahiya · 2023-02-15T08:34:04Z

Hey folks! First of all, thank you for the great work. about the diarization, is there any example of the python implementation? I just see the cmd line example, and I would like to test it with python.

Thanks!

Hey ! you can refer transcribe.py file in whisperx git repo if you want to create your own python script for diarization.

pdahiya · 2023-02-15T08:35:20Z

Also, how does one add the Hugging Face token in the python implementation?

Cheers

you can add it in this way:
diarize_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
use_auth_token=<your_hf_token>)

hf token can be obtained from : https://huggingface.co/settings/tokens

pdahiya · 2023-02-15T09:08:35Z

i tried the diarization along with VAD-filter (which gives better results). Thanks @m-bain for adding this.
Adding the generated .ass here.
One issue i can see is that although there were 2 speakers in the audio file. model identified 4 speakers. The audio is downloaded from https://huggingface.co/spaces/ml6team/Speaker-Diarization page
hf copy.wav.ass.txt

mezaros · 2023-03-09T00:20:30Z

@m-bain I made a minimal example of my idea using my fork of this repo, it saves the words+timestamps using pickle and modifies it to fit NeMo format. it runs well locally if you face problems with tcmalloc using colab

https://colab.research.google.com/drive/1lb-mh2TP4iPb5AHSSYZXySxVcyJw4LZA

Thank you for your script!

I've had trouble running NeMo locally, and really want to investigate NeMo as an alternative to Pyannote. So I wanted to try your Collab—but also couldn't run this on Colab Pro. NeMo successfully installs but it chokes when it tries to import the method, can't find it. Have you seen any similar issues? Is this notebook still functioning? Thanks!!

ubanning · 2023-03-09T17:13:09Z

@m-bain I made a minimal example of my idea using my fork of this repo, it saves the words+timestamps using pickle and modifies it to fit NeMo format. it runs well locally if you face problems with tcmalloc using colab
https://colab.research.google.com/drive/1lb-mh2TP4iPb5AHSSYZXySxVcyJw4LZA

Thank you for your script!

I've had trouble running NeMo locally, and really want to investigate NeMo as an alternative to Pyannote. So I wanted to try your Collab—but also couldn't run this on Colab Pro. NeMo successfully installs but it chokes when it tries to import the method, can't find it. Have you seen any similar issues? Is this notebook still functioning? Thanks!!

I have exactly this problem, the error when I try to import. Could someone fix it? Thanks

MahmoudAshraf97 · 2023-03-09T17:32:07Z

@m-bain I made a minimal example of my idea using my fork of this repo, it saves the words+timestamps using pickle and modifies it to fit NeMo format. it runs well locally if you face problems with tcmalloc using colab
https://colab.research.google.com/drive/1lb-mh2TP4iPb5AHSSYZXySxVcyJw4LZA

Thank you for your script!
I've had trouble running NeMo locally, and really want to investigate NeMo as an alternative to Pyannote. So I wanted to try your Collab—but also couldn't run this on Colab Pro. NeMo successfully installs but it chokes when it tries to import the method, can't find it. Have you seen any similar issues? Is this notebook still functioning? Thanks!!

I have exactly this problem, the error when I try to import. Could someone fix it? Thanks

@ubanning Try this updated notebook, it's working as expected
https://colab.research.google.com/drive/1fhjGIr_S_vERE9_F1UL033g5shVD-q5K

ubanning · 2023-03-09T20:54:13Z

@m-bain I made a minimal example of my idea using my fork of this repo, it saves the words+timestamps using pickle and modifies it to fit NeMo format. it runs well locally if you face problems with tcmalloc using colab
https://colab.research.google.com/drive/1lb-mh2TP4iPb5AHSSYZXySxVcyJw4LZA

Thank you for your script!
I've had trouble running NeMo locally, and really want to investigate NeMo as an alternative to Pyannote. So I wanted to try your Collab—but also couldn't run this on Colab Pro. NeMo successfully installs but it chokes when it tries to import the method, can't find it. Have you seen any similar issues? Is this notebook still functioning? Thanks!!

I have exactly this problem, the error when I try to import. Could someone fix it? Thanks

@ubanning Try this updated notebook, it's working as expected https://colab.research.google.com/drive/1fhjGIr_S_vERE9_F1UL033g5shVD-q5K

@MahmoudAshraf97 Hello, thanks.
I was the one who opened a discussion in your repository, about limiting the size of subtitle lines and the problem when there are more than 2 speakers. MahmoudAshraf97/whisper-diarization#12
🙂

mezaros · 2023-03-09T23:32:21Z

@m-bain I made a minimal example of my idea using my fork of this repo, it saves the words+timestamps using pickle and modifies it to fit NeMo format. it runs well locally if you face problems with tcmalloc using colab
https://colab.research.google.com/drive/1lb-mh2TP4iPb5AHSSYZXySxVcyJw4LZA

Thank you for your script!
I've had trouble running NeMo locally, and really want to investigate NeMo as an alternative to Pyannote. So I wanted to try your Collab—but also couldn't run this on Colab Pro. NeMo successfully installs but it chokes when it tries to import the method, can't find it. Have you seen any similar issues? Is this notebook still functioning? Thanks!!

I have exactly this problem, the error when I try to import. Could someone fix it? Thanks

@ubanning Try this updated notebook, it's working as expected https://colab.research.google.com/drive/1fhjGIr_S_vERE9_F1UL033g5shVD-q5K

Fails at #Reading timestamps <> Speaker Labels mapping

NameError
Traceback (most recent
call last)
in

12
speaker_ts.append([s, e, int (line list[11].split(" ")[-11)1)
13
-> 14 wsm
= get words speaker mapping(result aligned[ "word segments"], speaker ts,
"start")
NameFrror: name
'result aligned' is not defined

MahmoudAshraf97 · 2023-03-10T04:29:40Z

@m-bain I made a minimal example of my idea using my fork of this repo, it saves the words+timestamps using pickle and modifies it to fit NeMo format. it runs well locally if you face problems with tcmalloc using colab
https://colab.research.google.com/drive/1lb-mh2TP4iPb5AHSSYZXySxVcyJw4LZA

Thank you for your script!
I've had trouble running NeMo locally, and really want to investigate NeMo as an alternative to Pyannote. So I wanted to try your Collab—but also couldn't run this on Colab Pro. NeMo successfully installs but it chokes when it tries to import the method, can't find it. Have you seen any similar issues? Is this notebook still functioning? Thanks!!

I have exactly this problem, the error when I try to import. Could someone fix it? Thanks

@ubanning Try this updated notebook, it's working as expected https://colab.research.google.com/drive/1fhjGIr_S_vERE9_F1UL033g5shVD-q5K

Fails at #Reading timestamps <> Speaker Labels mapping

NameError Traceback (most recent call last) in 12 speaker_ts.append([s, e, int (line list[11].split(" ")[-11)1) 13 -> 14 wsm = get words speaker mapping(result aligned[ "word segments"], speaker ts, "start") NameFrror: name 'result aligned' is not defined

result_aligned is defined in the whisperx cell, make sure that it ran successfully

mezaros · 2023-03-10T16:40:05Z

Thanks, it was loaded so not sure what was going on, but glitch now resolved. However, no matter what I throw at, even files very easily handled by Pyannote, it always shows one speaker. Can't believe Nemo is this bad, something else has to be going on.

m-bain · 2023-05-09T12:37:51Z

You might find this better than pyannote on your data:
https://github.com/JaesungHuh/SimpleDiarization

But depends, and ought to be constrained to whisperx sentences, i.e. Appendix Sec. A (page 13) of https://www.robots.ox.ac.uk/~vgg/publications/2023/Han23/han23.pdf

Co-authored-by: github-actions <github-actions@github.com>

m-bain added the enhancement New feature or request label Dec 22, 2022

m-bain added this to the MVP with pyannote milestone Jan 29, 2023

m-bain pinned this issue Jan 29, 2023

m-bain unpinned this issue Jun 26, 2023

C0RE1312 pushed a commit to RapidCode-Dev/whisperX that referenced this issue Apr 16, 2024

updating submodules (m-bain#9)

aaa76a1

Co-authored-by: github-actions <github-actions@github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancement: Possible to determine different speakers? #9

Enhancement: Possible to determine different speakers? #9

holynuts commented Dec 21, 2022

machinelearnear commented Dec 22, 2022

m-bain commented Dec 22, 2022

Infinitay commented Dec 22, 2022

MahmoudAshraf97 commented Dec 22, 2022

m-bain commented Dec 23, 2022 •

edited

Loading

MahmoudAshraf97 commented Dec 23, 2022 •

edited

Loading

Fcabla commented Jan 20, 2023 •

edited

Loading

m-bain commented Jan 21, 2023

Fcabla commented Jan 22, 2023 •

edited

Loading

m-bain commented Jan 24, 2023

m-bain commented Jan 24, 2023

holynuts commented Jan 25, 2023 •

edited

Loading

arcticlinux commented Jan 26, 2023

holynuts commented Jan 26, 2023

alexgo84 commented Feb 7, 2023

arcticlinux commented Feb 7, 2023 via email

gabrilator commented Feb 8, 2023

gabrilator commented Feb 8, 2023

pdahiya commented Feb 15, 2023

pdahiya commented Feb 15, 2023

pdahiya commented Feb 15, 2023

mezaros commented Mar 9, 2023

ubanning commented Mar 9, 2023

MahmoudAshraf97 commented Mar 9, 2023 •

edited

Loading

ubanning commented Mar 9, 2023

mezaros commented Mar 9, 2023

MahmoudAshraf97 commented Mar 10, 2023

mezaros commented Mar 10, 2023

m-bain commented May 9, 2023

Enhancement: Possible to determine different speakers? #9

Enhancement: Possible to determine different speakers? #9

Comments

holynuts commented Dec 21, 2022

machinelearnear commented Dec 22, 2022

m-bain commented Dec 22, 2022

Infinitay commented Dec 22, 2022

MahmoudAshraf97 commented Dec 22, 2022

m-bain commented Dec 23, 2022 • edited Loading

MahmoudAshraf97 commented Dec 23, 2022 • edited Loading

Fcabla commented Jan 20, 2023 • edited Loading

m-bain commented Jan 21, 2023

Fcabla commented Jan 22, 2023 • edited Loading

m-bain commented Jan 24, 2023

m-bain commented Jan 24, 2023

holynuts commented Jan 25, 2023 • edited Loading

arcticlinux commented Jan 26, 2023

holynuts commented Jan 26, 2023

alexgo84 commented Feb 7, 2023

arcticlinux commented Feb 7, 2023 via email

gabrilator commented Feb 8, 2023

gabrilator commented Feb 8, 2023

pdahiya commented Feb 15, 2023

pdahiya commented Feb 15, 2023

pdahiya commented Feb 15, 2023

mezaros commented Mar 9, 2023

ubanning commented Mar 9, 2023

MahmoudAshraf97 commented Mar 9, 2023 • edited Loading

ubanning commented Mar 9, 2023

mezaros commented Mar 9, 2023

MahmoudAshraf97 commented Mar 10, 2023

mezaros commented Mar 10, 2023

m-bain commented May 9, 2023

m-bain commented Dec 23, 2022 •

edited

Loading

MahmoudAshraf97 commented Dec 23, 2022 •

edited

Loading

Fcabla commented Jan 20, 2023 •

edited

Loading

Fcabla commented Jan 22, 2023 •

edited

Loading

holynuts commented Jan 25, 2023 •

edited

Loading

MahmoudAshraf97 commented Mar 9, 2023 •

edited

Loading