uis-rnn gives different result on broken audios and continuous audios #76
Comments
How did you train your uis-rnn network? Did you also train it on continuous audio?
The whole point of UIS-RNN is to learn conversational information from examples. If your UIS-RNN is trained on single-speaker utterances only, the trained model will be useless on multi-speaker audio.
Thanks for your help, Quan. The uis-rnn model trained on single-speaker utterances performs poorly on multi-speaker utterances (this is explained by the answer above). I am still finding it hard to build intuition around the following: if I break down the multi-speaker audio by speaker and concatenate the embeddings corresponding to the broken audios in sequence, I get different and fairly accurate predicted IDs (around 91% accuracy). Any idea why that would happen?
UIS-RNN is an algorithm for supervised learning. This means, you train on multi-speaker data, it will perform well on multi-speaker data. You train it on single-speaker data only, it will only perform well on single-speaker data. It's not supposed to perform well on scenarios that never appeared during training.
This seems unrelated to UIS-RNN. Sounds like a bug in your speaker embedding implementation. If you extract speaker embeddings from sliding windows, whether it is continuous audio or broken audio should not make much difference.
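The point above can be sanity-checked with a minimal sketch of sliding-window framing. The window and overlap values below follow the thread (240 ms windows with 50% overlap, at an assumed 16 kHz sample rate, i.e. 3840-sample windows with a 1920-sample hop); `frame_starts` is a hypothetical helper, not part of the uis-rnn API. The only difference between framing one continuous signal and framing its pieces separately is the handful of windows that straddle a cut point, so the embedding sequences should be nearly identical.

```python
# Minimal sketch: sliding-window framing for speaker embeddings.
# Assumption: 16 kHz audio, 240 ms window (3840 samples), 50% overlap
# (1920-sample hop), matching the setup described in this thread.

def frame_starts(num_samples, window=3840, hop=1920):
    """Start indices of all full windows over a signal of num_samples samples."""
    return list(range(0, num_samples - window + 1, hop))

# A continuous 2-second signal vs. the same signal cut into two 1-second parts:
continuous = frame_starts(32000)
part_a = frame_starts(16000)
part_b = frame_starts(16000)

# The concatenated parts produce the same frames except at the boundary:
# the single window straddling the cut point is lost.
print(len(continuous), len(part_a) + len(part_b))  # 15 vs 14
```

So with correct windowed extraction, continuous vs. broken audio should change at most one embedding per cut point, far too little to flip the diarization result.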
Describe the question
I trained the uis-rnn model on embeddings obtained from TIMIT data. I am calculating each embedding over a 240 ms window with 50% overlap. I am using this uis-rnn model to obtain speaker IDs from real-world audio. For this I am using TOEFL test audios, which have 3-4 speakers per recording. Each recording is 50-60 seconds long.
When I use the model on continuous audios, I get only one or two speaker IDs. But if I break down the audio by speaker and concatenate the embeddings corresponding to the broken audios in a sequence, I get different and fairly accurate cluster IDs. This would mean that the model is performing differently on continuous and broken audios. For the continuous audios, I tried different levels of overlap and different numbers of embeddings per second, but saw no improvement.
Attached are the TOEFL audio and the same audio broken into three parts that I am using to test the model. The three broken parts correspond to the different speakers in the continuous audio.
toefl_continuous_recording.wav.zip
broken_toefl_3_spk_recording.zip
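The two test conditions described above can be sketched as follows. This is not the actual pipeline from the thread: `embed` is a hypothetical stand-in for the speaker encoder (one placeholder vector per sliding window), and the resulting sequence would be passed to the trained model's predict step in either condition. The sketch only shows that, apart from windows straddling the cut points, the two conditions should feed the model nearly the same input.

```python
# Sketch of the two test conditions, with the embedding extractor mocked out.
# `embed` is a hypothetical placeholder for a real speaker encoder that
# returns one d-vector per 240 ms window (3840 samples at 16 kHz, 50% overlap).

def embed(audio, window=3840, hop=1920):
    """Hypothetical per-window embedding: one placeholder vector per frame."""
    return [[float(start)] for start in range(0, len(audio) - window + 1, hop)]

# Three 1-second single-speaker segments (silence here, as a stand-in).
speaker_segments = [[0.0] * 16000, [0.0] * 16000, [0.0] * 16000]

# Condition 1: embed the continuous recording in one pass.
continuous_audio = [s for seg in speaker_segments for s in seg]
continuous_seq = embed(continuous_audio)

# Condition 2: embed each speaker's segment separately, then concatenate
# the embedding sequences before prediction (as in the experiment above).
concatenated_seq = [e for seg in speaker_segments for e in embed(seg)]

# Only the windows straddling the two cut points differ between conditions.
print(len(continuous_seq), len(concatenated_seq))  # 24 vs 21
```

If the two conditions give drastically different speaker IDs despite such similar inputs, the discrepancy most likely lies in how the embeddings are extracted or ordered before prediction, which is consistent with the answer above.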
My background
- Have I read the README.md file?
- Have I searched for similar questions from closed issues?
- Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?
- Have I tried to find the answers in the reference Speaker Diarization with LSTM?
- Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?