diart.stream microphone detects only 2 speakers #133

Closed
kaleaniket opened this issue Mar 23, 2023 · 6 comments
Labels
question Further information is requested

Comments

@kaleaniket

Hello,

I am using diart.stream microphone from the command line for inference, but it does not detect more than 2 speakers even when more are present.

For example, if I play a recording of 3 people speaking (1 female and 2 male), it treats the 2 male speakers as 1 speaker. If I play a recording of only 2 male speakers or only 2 female speakers, it works fine.

I've explored the files in https://github.com/juanmc2005/StreamingSpeakerDiarization/tree/main/src/diart/blocks to see if anything related to num_speakers is mentioned, and in most of the files I found max_speakers = 20.

Do I have to change any part of the code to support more speakers?

@kaleaniket kaleaniket changed the title diart.stream mictophone detects only 2 speakers diart.stream microphone detects only 2 speakers Mar 24, 2023
@juanmc2005 juanmc2005 added the question Further information is requested label Mar 24, 2023
@juanmc2005
Owner

Hi @kaleaniket,

This has been discussed briefly in issue #4:

New speaker detection is affected by the hyper-parameter delta, which is a threshold on the cosine distance between a speaker's embedding and its closest centroid. A distance lower than delta assigns the speaker to that centroid (reidentification of a known speaker), whereas a distance higher than delta assigns the speaker to a new centroid (new speaker detection).

It may be possible that the delta value you're using is not adapted to your recordings. If you find that new speakers are not being detected, my first suggestion would be to lower delta.
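The threshold rule described above can be sketched in a few lines. This is only an illustration, not diart's actual implementation: `cosine_distance` and `assign_speaker` are hypothetical helpers, and the 2-dimensional "embeddings" are made up so the numbers are easy to follow.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance in [0, 2]: 0 means same direction, 2 means opposite."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def assign_speaker(embedding, centroids, delta):
    """Return (speaker_index, is_new_speaker).

    Hypothetical helper, not part of diart. `centroids` is a list of vectors
    that is extended in place whenever a new speaker is detected.
    """
    if centroids:
        distances = [cosine_distance(embedding, c) for c in centroids]
        closest = int(np.argmin(distances))
        if distances[closest] < delta:
            # Close enough to a known centroid: re-identify that speaker.
            return closest, False
    # No centroid within delta: open a new centroid for a new speaker.
    centroids.append(np.asarray(embedding, dtype=float))
    return len(centroids) - 1, True


# Two made-up voices whose cosine distance is about 0.2 (similar voices).
first_voice = np.array([1.0, 0.0])
second_voice = np.array([0.8, 0.6])

for delta in (0.57, 0.1):
    centroids = [first_voice]
    print(delta, assign_speaker(second_voice, centroids, delta))
```

With these toy vectors, the second voice is merged into the first speaker at delta=0.57 but becomes a new speaker at delta=0.1, which is the kind of behaviour the question describes for two similar male voices.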

If you're using diart.stream you can change it with --delta=1.0, and if you're in Python you can set it in PipelineConfig:

from diart import PipelineConfig, OnlineSpeakerDiarization

config = PipelineConfig(delta_new=1.0)
diarization = OnlineSpeakerDiarization(config)

Note that there's a tradeoff here between recognizing too few and too many speakers.
Currently, this threshold strategy is a bit too simple; it is a key area of improvement that I'm working on.

@someonewating

someonewating commented Apr 20, 2023

Hi there. Thank you for your solution. However, I tried setting delta_new=0.1, delta_new=0.5, and delta_new=0.01, but the result didn't change. Would you mind letting me know if you have any other suggestions?

@juanmc2005
Owner

Hi @someonewating,

Could you post your code and results? Also, can you provide more information about your audio file, such as duration, number of speakers, expected RTTM, etc.?

@someonewating

Hi @juanmc2005,

Thank you for your reply. I checked my code again since I didn't want to ask a stupid question here, and I have now fixed the problem. It was caused by a mistake in my own code. Everything works well now. Thank you.😀

@someonewating

And by the way, would you mind letting me know what beta, gamma, rho_update, and tau_active mean in PipelineConfig? Thank you.

@juanmc2005
Owner

Glad it's working!

Sure, I actually wrote an intuitive explanation of tau_active, rho_update and delta_new in this post (see "Creating the speaker diarization module").

They essentially regulate the sensitivity of speaker recognition:

  • tau_active=0.5: Only recognize speakers whose probability of speech is higher than 50%.
  • rho_update=0.1: Diart automatically gathers information from speakers to improve itself. Here we only use speech longer than 100ms per speaker for self-improvement.
  • delta_new=0.57: This is an internal threshold between 0 and 2 that regulates new speaker detection. The lower the value, the more sensitive the system will be to differences in voices.
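As a rough illustration of how the first two thresholds could gate speakers within one audio chunk, here is a minimal sketch. It is not diart's code: `gate_speakers` is a hypothetical helper, `frame_duration` is a made-up value, and rho_update is interpreted as a minimum speech duration in seconds, following the description above (the library may measure this differently).

```python
import numpy as np


def gate_speakers(probs, tau_active=0.5, rho_update=0.1, frame_duration=0.02):
    """Decide which speakers are recognized and which may update their centroid.

    probs: array of shape (num_frames, num_speakers) holding per-frame speech
    probabilities for one chunk.
    Assumptions (not taken from diart): rho_update is a minimum speech
    duration in seconds and each frame lasts `frame_duration` seconds.
    Returns two boolean arrays of length num_speakers: (active, can_update).
    """
    probs = np.asarray(probs, dtype=float)
    speaking = probs > tau_active                 # frames where a speaker talks
    active = speaking.any(axis=0)                 # recognized at all in this chunk
    speech_seconds = speaking.sum(axis=0) * frame_duration
    can_update = active & (speech_seconds > rho_update)
    return active, can_update


# Toy chunk of 10 frames and 3 speakers.
probs = np.zeros((10, 3))
probs[:8, 0] = 0.9   # speaker 0 talks for 8 frames (0.16 s)
probs[:2, 1] = 0.8   # speaker 1 talks for 2 frames (0.04 s)
probs[:, 2] = 0.3    # speaker 2 never exceeds tau_active

active, can_update = gate_speakers(probs)
print(active, can_update)
```

In this toy chunk, speakers 0 and 1 are recognized, but only speaker 0 has enough speech to contribute to self-improvement.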

On the other hand, beta and gamma regulate overlap-aware speaker embedding extraction (see Equation 2 in the paper).

  • The higher the value of gamma, the more the embedding model ignores audio regions where the segmentation model is not confident.
  • beta acts as a temperature parameter on per-frame speaker probabilities to determine the predominant speaker (with softmax)
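The effect of the two parameters can be sketched with per-frame weights. This is one plausible reading of the description above and does not claim to reproduce Equation 2 of the paper exactly: `overlap_aware_weights` is a hypothetical helper, and the default values for gamma and beta are only illustrative.

```python
import numpy as np


def overlap_aware_weights(probs, gamma=3.0, beta=10.0):
    """Per-frame, per-speaker weights for embedding extraction (illustrative).

    probs: array of shape (num_frames, num_speakers) with segmentation
    probabilities in [0, 1].
    beta sharpens a softmax over speakers, so the predominant speaker of each
    frame keeps most of the weight. gamma is an exponent that pushes frames
    with low confidence or heavy overlap towards zero weight.
    """
    probs = np.asarray(probs, dtype=float)
    scaled = beta * probs
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
    soft = np.exp(scaled)
    soft = soft / soft.sum(axis=1, keepdims=True)        # softmax over speakers
    return (probs * soft) ** gamma


probs = np.array([
    [0.9, 0.1],  # speaker 0 talks alone, high confidence
    [0.9, 0.8],  # both speakers overlap
    [0.5, 0.0],  # speaker 0 alone, low confidence
])
print(overlap_aware_weights(probs).round(3))
```

In the printed weights, the first column should be largest for the clean, confident frame and noticeably smaller for the overlapped and low-confidence frames, so those regions contribute less to speaker 0's embedding.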

I know this sounds a bit obscure but I think it's better explained with the figures in the paper.
