
uis-rnn gives different result on broken audios and continuous audios #76

Closed
ashu170292 opened this issue Jul 23, 2020 · 5 comments
Labels: question (Further information is requested)

Comments

@ashu170292

Describe the question

I trained the uis-rnn model on embeddings obtained from TIMIT data. I am calculating embeddings over a 240 ms window with 50% overlap. I am using this uis-rnn model to obtain speaker IDs from real-time audio. For this I am using TOEFL test audios, which have 3-4 speakers per recording. Each recording is 50-60 seconds long.

When I use the model on continuous audio, I get only one or two speaker IDs. But if I break the audio down by speaker and concatenate the embeddings of the broken audios in a sequence, I get different and fairly accurate cluster IDs. This suggests the model is performing differently on continuous and broken audio. For the continuous audio, I tried different levels of overlap and different numbers of embeddings per second, but saw no improvement.

Attached are the TOEFL audio and the same audio broken into three parts that I am using to test the model. The three broken audio parts correspond to different speakers in the continuous audio.
toefl_continuous_recording.wav.zip
broken_toefl_3_spk_recording.zip
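
For context, a minimal sketch of the inference path described above. The `embed_window` stub, the 256-dim d-vector size, and the checkpoint path are placeholders and not part of uis-rnn; the `uisrnn` calls follow the library's demo usage.

```python
import numpy as np
import uisrnn

def embed_window(samples):
    # Placeholder for the real speaker encoder that produces one d-vector per
    # 240 ms window; uis-rnn itself does not compute embeddings.
    return np.random.randn(256)

def recording_to_sequence(samples, sr=16000, win_ms=240, overlap=0.5):
    """Slide a win_ms window with the given overlap and stack one embedding per window."""
    win = int(win_ms / 1000 * sr)
    hop = int(win * (1 - overlap))
    vecs = [embed_window(samples[s:s + win])
            for s in range(0, len(samples) - win + 1, hop)]
    return np.stack(vecs).astype(float)  # shape: (num_windows, 256)

# The same call is used whether the sequence comes from the continuous file or
# from concatenating the per-speaker pieces:
# model_args, _, inference_args = uisrnn.parse_arguments()
# model = uisrnn.UISRNN(model_args)
# model.load('saved_uisrnn.model')  # hypothetical checkpoint path
# predicted_ids = model.predict(recording_to_sequence(samples), inference_args)
```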

My background

Have I read the README.md file?

  • yes

Have I searched for similar questions from closed issues?

  • yes

Have I tried to find the answers in the paper Fully Supervised Speaker Diarization?

  • yes

Have I tried to find the answers in the reference Speaker Diarization with LSTM?

  • yes

Have I tried to find the answers in the reference Generalized End-to-End Loss for Speaker Verification?

  • yes
ashu170292 added the question label on Jul 23, 2020
@wq2012
Member

wq2012 commented Jul 23, 2020

How did you train your uis-rnn network?

Did you also train it on continuous audio?

@ashu170292
Author

Q) How did you train your uis-rnn network?
A) To train the uis-rnn network, I built the train sequence as a single 2-dim numpy array. I used around 4000 TIMIT utterances. Each utterance has only one speaker and is 4 to 6 seconds long. For each utterance, embeddings were calculated and appended to the train sequence (roughly as in the sketch below).

Q) Did you also train it on continuous audio?
A) No.
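
A minimal sketch of that training setup, with toy stand-ins for the real embeddings. The dimensions, speaker labels, and iteration count are placeholders; the `uisrnn` calls follow the library's documented API.

```python
import numpy as np
import uisrnn

model_args, training_args, inference_args = uisrnn.parse_arguments()
model_args.observation_dim = 256     # hypothetical d-vector dimension
training_args.train_iteration = 10   # tiny value just to keep the sketch cheap to run

# Stand-ins for per-utterance embeddings: one (num_windows, 256) array per
# single-speaker TIMIT utterance, plus that utterance's speaker label.
rng = np.random.RandomState(0)
utterance_embeddings = [rng.randn(30, 256) for _ in range(4)]
utterance_speakers = ['spk0', 'spk1', 'spk2', 'spk3']

# Concatenating single-speaker utterances gives a cluster-id sequence that only
# changes at utterance boundaries, so it carries almost no speaker-turn statistics.
train_sequence = np.concatenate(utterance_embeddings, axis=0)
train_cluster_id = [spk
                    for spk, emb in zip(utterance_speakers, utterance_embeddings)
                    for _ in range(len(emb))]

model = uisrnn.UISRNN(model_args)
model.fit(train_sequence, train_cluster_id, training_args)
```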

@wq2012
Member

wq2012 commented Jul 23, 2020

The whole point of UIS-RNN is to learn conversational information from examples. If your UIS-RNN is trained on single-speaker utterances only, the trained model will be useless on multi-speaker audio.
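
A toy illustration of the difference (hypothetical labels, not taken from any real recording):

```python
# UIS-RNN learns speaker-turn behaviour from the cluster-id sequence itself.

# Concatenated single-speaker utterances: the label changes exactly once per
# utterance boundary, so there is no turn-taking structure to learn.
single_speaker_training_ids = ['A'] * 50 + ['B'] * 40 + ['C'] * 60

# A real conversation: labels interleave, and this turn-taking structure is
# exactly what UIS-RNN is meant to model and then exploit at test time.
conversation_training_ids = ['A'] * 12 + ['B'] * 8 + ['A'] * 5 + ['C'] * 20 + ['B'] * 15
```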

@ashu170292
Author

Thanks for your help, Quan.

The uis-rnn model trained on single-speaker utterances performs poorly on multi-speaker utterances (this is explained by the answer above). I am still finding it hard to build intuition around the following:

If I break the multi-speaker audio down by speaker and concatenate the embeddings of the broken audios in sequence, I get different and fairly accurate predicted IDs (around 91% accuracy).

Any idea why that would happen?

@wq2012
Member

wq2012 commented Jul 23, 2020

UIS-RNN is an algorithm for supervised learning. This means: if you train it on multi-speaker data, it will perform well on multi-speaker data; if you train it on single-speaker data only, it will only perform well on single-speaker data. It's not supposed to perform well in scenarios that never appeared during training.

When I use the model on continuous audios, I get only one or two speaker ids. But, if I break down the audio corresponding to different speakers and concatenate the embeddings corresponding to the broken audios in a sequence, I get different and fairly accurate cluster ids.

This seems unrelated to UIS-RNN. Sounds like a bug in your speaker embedding implementation. If you extract speaker embeddings from sliding windows, whether it is continuous audio or broken audio should not make much difference.
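
One way to sanity-check the front-end is to confirm that the continuous file and the concatenated pieces produce essentially the same set of analysis windows. A rough sketch, with made-up durations standing in for the attached files:

```python
def sliding_windows(num_samples, win, hop):
    """(start, end) sample indices of each full analysis window."""
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]

sr = 16000                                   # assumed sample rate
win, hop = int(0.240 * sr), int(0.120 * sr)  # 240 ms window, 50% overlap

# Made-up durations standing in for the continuous recording and its 3 pieces.
continuous_len = 55 * sr
part_lens = [20 * sr, 18 * sr, 17 * sr]

n_continuous = len(sliding_windows(continuous_len, win, hop))
n_parts = sum(len(sliding_windows(n, win, hop)) for n in part_lens)

# Apart from the one or two windows lost at each cut point, the counts should
# match. A large mismatch, or any per-file normalization / VAD inside the
# embedding extractor, would make continuous and broken inputs genuinely
# different before UIS-RNN ever sees them.
print(n_continuous, n_parts)
```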

wq2012 closed this as completed on Jul 25, 2020