
uis-rnn gives different result on broken audios and continuous audios #76

Closed
ashu170292 opened this issue Jul 23, 2020 · 5 comments
Labels: question (Further information is requested)

Comments

@ashu170292

Describe the question

I trained the uis-rnn model on embeddings obtained from TIMIT data. I am calculating embeddings over a 240 ms window with 50% overlap. I am using this uis-rnn model to obtain speaker IDs from real-time audio. For this I am using TOEFL test audios, which have 3-4 speakers per recording. Each recording is 50-60 seconds long.

When I use the model on continuous audio, I get only one or two speaker IDs. But if I break the audio down by speaker and concatenate the embeddings of the broken audios in a sequence, I get different and fairly accurate cluster IDs. This suggests the model is performing differently on continuous and broken audio. For the continuous audio, I tried different levels of overlap and different numbers of embeddings per second, but saw no improvement.

Attached are the TOEFL audio and the same audio broken into three parts that I am using to test the model. The three broken audio parts correspond to different speakers in the continuous audio.
toefl_continuous_recording.wav.zip
broken_toefl_3_spk_recording.zip
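
For context, a minimal sketch of the inference path described above. The `embed_window` stub, the 256-dim d-vector size, and the checkpoint path are placeholders and not part of uis-rnn; the `uisrnn` calls follow the library's demo usage.

```python
import numpy as np
import uisrnn

def embed_window(samples):
    # Placeholder for the real speaker encoder that produces one d-vector per
    # 240 ms window; uis-rnn itself does not compute embeddings.
    return np.random.randn(256)

def recording_to_sequence(samples, sr=16000, win_ms=240, overlap=0.5):
    """Slide a win_ms window with the given overlap and stack one embedding per window."""
    win = int(win_ms / 1000 * sr)
    hop = int(win * (1 - overlap))
    vecs = [embed_window(samples[s:s + win])
            for s in range(0, len(samples) - win + 1, hop)]
    return np.stack(vecs).astype(float)  # shape: (num_windows, 256)

# The same call is used whether the sequence comes from the continuous file or
# from concatenating the per-speaker pieces:
# model_args, _, inference_args = uisrnn.parse_arguments()
# model = uisrnn.UISRNN(model_args)
# model.load('saved_uisrnn.model')  # hypothetical checkpoint path
# predicted_ids = model.predict(recording_to_sequence(samples), inference_args)
```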

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes
ashu170292 added the question label on Jul 23, 2020
@wq2012
Member

wq2012 commented Jul 23, 2020

How did you train your uis-rnn network?

Did you also train it on continuous audio?

@ashu170292
Author

Q) How did you train your uis-rnn network?
A) To train the uis-rnn network, I built the train sequence as a single 2-dim numpy array. I used around 4000 TIMIT utterances. Each utterance has only one speaker and is 4 to 6 seconds long. For each utterance, embeddings were calculated and appended to the train sequence (roughly as in the sketch below).

Q) Did you also train it on continuous audio?
A) No.
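
A minimal sketch of that training setup, with toy stand-ins for the real embeddings. The dimensions, speaker labels, and iteration count are placeholders; the `uisrnn` calls follow the library's documented API.

```python
import numpy as np
import uisrnn

model_args, training_args, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 256     # hypothetical d-vector dimension
training_args.train_iteration = 10   # tiny value just to keep the sketch cheap to run

# Stand-ins for per-utterance embeddings: one (num_windows, 256) array per
# single-speaker TIMIT utterance, plus that utterance's speaker label.
rng = np.random.RandomState(0)
utterance_embeddings = [rng.randn(30, 256) for _ in range(4)]
utterance_speakers = ['spk0', 'spk1', 'spk2', 'spk3']

# Concatenating single-speaker utterances gives a cluster-id sequence that only
# changes at utterance boundaries, so it carries almost no speaker-turn statistics.
train_sequence = np.concatenate(utterance_embeddings, axis=0)
train_cluster_id = [spk
                    for spk, emb in zip(utterance_speakers, utterance_embeddings)
                    for _ in range(len(emb))]

model = uisrnn.UISRNN(model_args)
model.fit(train_sequence, train_cluster_id, training_args)
```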

@wq2012
Member

wq2012 commented Jul 23, 2020

The whole point of UIS-RNN is to learn conversational information from examples. If your UIS-RNN is trained on single-speaker utterances only, the trained model will be useless on multi-speaker audio.
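
A toy illustration of the difference (hypothetical labels, not taken from any real recording):

```python
# UIS-RNN learns speaker-turn behaviour from the cluster-id sequence itself.

# Concatenated single-speaker utterances: the label changes exactly once per
# utterance boundary, so there is no turn-taking structure to learn.
single_speaker_training_ids = ['A'] * 50 + ['B'] * 40 + ['C'] * 60

# A real conversation: labels interleave, and this turn-taking structure is
# exactly what UIS-RNN is meant to model and then exploit at test time.
conversation_training_ids = ['A'] * 12 + ['B'] * 8 + ['A'] * 5 + ['C'] * 20 + ['B'] * 15
```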

@ashu170292
Author

Thanks for your help, Quan.

The uis-rnn model trained on single-speaker utterances performs poorly on multi-speaker utterances (this is explained by the answer above). I am still finding it hard to build intuition around the following:

If I break the multi-speaker audio down by speaker and concatenate the embeddings of the broken audios in sequence, I get different and fairly accurate predicted IDs (around 91% accuracy).

Any idea why that would happen?

@wq2012
Member

wq2012 commented Jul 23, 2020

UIS-RNN is an algorithm for supervised learning. This means: if you train it on multi-speaker data, it will perform well on multi-speaker data; if you train it on single-speaker data only, it will only perform well on single-speaker data. It's not supposed to perform well in scenarios that never appeared during training.

When I use the model on continuous audios, I get only one or two speaker ids. But, if I break down the audio corresponding to different speakers and concatenate the embeddings corresponding to the broken audios in a sequence, I get different and fairly accurate cluster ids.

This seems unrelated to UIS-RNN. Sounds like a bug in your speaker embedding implementation. If you extract speaker embeddings from sliding windows, whether it is continuous audio or broken audio should not make much difference.
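
One way to sanity-check the front-end is to confirm that the continuous file and the concatenated pieces produce essentially the same set of analysis windows. A rough sketch, with made-up durations standing in for the attached files:

```python
def sliding_windows(num_samples, win, hop):
    """(start, end) sample indices of each full analysis window."""
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]

sr = 16000                                   # assumed sample rate
win, hop = int(0.240 * sr), int(0.120 * sr)  # 240 ms window, 50% overlap

# Made-up durations standing in for the continuous recording and its 3 pieces.
continuous_len = 55 * sr
part_lens = [20 * sr, 18 * sr, 17 * sr]

n_continuous = len(sliding_windows(continuous_len, win, hop))
n_parts = sum(len(sliding_windows(n, win, hop)) for n in part_lens)

# Apart from the one or two windows lost at each cut point, the counts should
# match. A large mismatch, or any per-file normalization / VAD inside the
# embedding extractor, would make continuous and broken inputs genuinely
# different before UIS-RNN ever sees them.
print(n_continuous, n_parts)
```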

wq2012 closed this as completed on Jul 25, 2020