
Question about reproducing the results #36

Closed
Shoawen0213 opened this issue Apr 23, 2022 · 8 comments
Labels: question (Further information is requested)
@Shoawen0213

Hi! It's me again, sorry for bothering you. I have several questions...
Q1.
I tried to reproduce the results of the paper using the following hyper-parameters.
[image: hyper-parameter settings]
I tested them on the AMI test set and the VoxConverse test set, but the results seem different.
The AMI test set has 24 wav files. I wrote a script that runs the testing for all of them.
For each wav file, I use the command below:
python -m diart.demo $wavpath --tau=0.507 --rho=0.006 --delta=1.057 --output ./output/AMI_dia_fintetune/
After obtaining each RTTM file, I calculate the DER for each wav file with "der = metric(reference, hypothesis)".
The reference RTTM comes from "AMI/MixHeadset.test.rttm".
Then I compute (sum of DERs) / (number of files) (24 files in this case) and get 0.3408511916216123, i.e. 34.085% DER.
Am I doing something wrong?
I can provide the RTTM files or the per-file DERs if that helps.
The VoxConverse dataset is still processing. I'm afraid I misunderstood something, so I'm asking about the problem first...
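
To be concrete, my scoring script does roughly the following (a simplified sketch; the per-file hypothesis paths are just illustrative, and load_rttm is the helper from pyannote.database):

from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

metric = DiarizationErrorRate()
references = load_rttm("AMI/MixHeadset.test.rttm")  # {uri: Annotation}
ders = []
for uri, reference in references.items():
    # illustrative path: one hypothesis RTTM per test file
    hypothesis = load_rttm(f"./output/AMI_dia_fintetune/{uri}.rttm")[uri]
    ders.append(metric(reference, hypothesis))  # per-file DER

print(sum(ders) / len(ders))  # average of the per-file DERs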

By the way, I did the same thing with pyannote v1.1, and I got 0.48516973769490174 as the final DER.
# v1.1
import torch
pipeline = torch.hub.load("pyannote/pyannote-audio", "dia")
diarization = pipeline({"audio": "xxx.wav"})

So I'm afraid that I did something wrong...

Q2
At the same time, I have another question.
[image: Figure 5 from the paper, showing results at different latencies]
This figure shows that you tried several configurations with different latencies.
Does python -m diart.demo use the 5.0s-latency configuration, which gets the best result in the paper?
If so, how do I switch to a different model/latency for inference?
And how do I train that part?

Again, thanks for your awesome project!!
Sorry for all those stupid questions...
Looking forward to your reply...

@Shoawen0213 (Author)

Hi, for the VoxConverse dataset, the DER result is 0.23988039048252469.
The inference procedure is the same as for the AMI test set described above.

@juanmc2005 (Owner)

Hi @Shoawen0213, I recommend you read issue #15 for some background on the problem of expected outputs.

First of all (and this relates to your second question), the default latency in the demo is 500ms (see here), so looking at the performance of the system in Figure 5 (the one you posted) for that latency, it looks like you got the expected performance.
You can always run python -m diart.demo -h for more info on the arguments.

Aside from that, note that the DER should be computed from the total false alarm, missed detection and confusion accumulated over the entire test set, not as the average of the per-file DERs. You can calculate this easily like this:

from pyannote.metrics.diarization import DiarizationErrorRate

metric = DiarizationErrorRate()
for ref, hyp in zip(all_references, all_hypothesis):
    metric(ref, hyp)  # accumulates false alarm, missed detection and confusion
final_der = abs(metric)  # aggregated DER over the whole test set

As mentioned in #15, this implementation is a bit different from the one used in the paper, but normally the performance should be very close and possibly slightly better. I haven't had the time to measure this properly though, which is why that issue is still open.

Could you please post the DER you obtain with the method I just described?

juanmc2005 added the question label on Apr 25, 2022
@juanmc2005 (Owner)

Btw I recommend you replace these lines with:

pipeline.from_source(audio_source).subscribe(RTTMWriter(path=output_dir / "output.rttm"))

That way you get rid of buffering and plotting, which will accelerate inference quite a bit. This would be a nice option to add to the demo in the future actually.

@Shoawen0213 (Author)

Hi! Thanks for your quick and useful reply! I will delve into #15 some more; I had actually already seen it before asking.
For Q1, as you say, it seems to match the expected output reported in the paper! So does that mean that if I want to test latencies of 1s, 2s, ..., I just need to change the default latency, without re-training? I'm asking because, according to "Training, fine-tuning, and transfer learning with pyannote.audio", vad_task = VoiceActivityDetection(ami, duration=2.0, batch_size=128) means the model will ingest batches of 128 two-second audio chunks. That part is independent, right? In other words, I can train the model on 2s chunks, 5s chunks, etc., but test with latencies anywhere from 0.5s to 5s? (The tutorial snippet I'm referring to is sketched below.)
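
For reference, the tutorial pattern I mean looks roughly like this (adapted from the pyannote.audio training tutorial; the protocol name and trainer settings are just its example, not something from this repo):

import pytorch_lightning as pl
from pyannote.database import FileFinder, get_protocol
from pyannote.audio.tasks import VoiceActivityDetection
from pyannote.audio.models.segmentation import PyanNet

# AMI protocol as set up in the tutorial
ami = get_protocol("AMI.SpeakerDiarization.MixHeadset", preprocessors={"audio": FileFinder()})

# Batches of 128 chunks, each 2 seconds long
vad_task = VoiceActivityDetection(ami, duration=2.0, batch_size=128)
model = PyanNet(task=vad_task)

trainer = pl.Trainer(max_epochs=1)
trainer.fit(model)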

Another question:
Following your recommendation, I now have the whole hypothesis in one RTTM file.
Forgive my confusion; I tried the code below (using the 0.5s version as the reference RTTM, for example):
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

metric = DiarizationErrorRate()
ref_path = "./expected_outputs/online/0.5s/AMI.rttm"
hyp_path = "./hypo_.txt"
ref_file = load_rttm(ref_path)  # {uri: Annotation}
hyp_file = load_rttm(hyp_path)

for i in range(len(file_name)):
    ref_file = ref_file[file_name[i]]  # overwrites the dict with a single Annotation
    hyp_file = hyp_file[file_name[i]]  # so the second iteration fails
    metric(ref_file, hyp_file)

final_der = abs(metric)
the "file_name" means the wav file in AMI test dataset, ['EN2002a', 'EN2002b', 'EN2002c', 'EN2002d', 'ES2004a', 'ES2004b', 'ES2004c', 'ES2004d', 'ES2014a', 'ES2014b', 'ES2014c', 'ES2014d', 'IS1009a', 'IS1009b', 'IS1009c', 'IS1009d', 'TS3003a', 'TS3003b', 'TS3003c', 'TS3003d', 'TS3007a', 'TS3007b', 'TS3007c', 'TS3007d']
But it's not working actually...

BTW, thanks for your recommendation!!!!!

@Shoawen0213 (Author)

for ref, hyp in zip(all_references, all_hypothesis):
    metric(ref, hyp)
final_der = abs(metric)

I don't quite get it: are all_references and all_hypothesis RTTM files?

@juanmc2005 (Owner)

You can of course change the latency without retraining; that's one of the advantages of diart. You can modify this easily when running the demo: just add --latency 3 (for a 3s latency, for example). Again, python -m diart.demo -h will give you more information about the accepted arguments.

The segmentation model can be trained for any chunk duration, just keep in mind that chunk_step <= latency <= chunk_duration.

Concerning the evaluation, you can take a look at the pyannote.metrics documentation to understand how it works.
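
To make it concrete, something along these lines should work (a rough sketch with placeholder paths; load_rttm is the helper from pyannote.database that parses an RTTM file into a {uri: Annotation} dict, so the metric receives Annotation objects rather than file paths):

from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

references = load_rttm("path/to/reference.rttm")   # {uri: Annotation}
hypotheses = load_rttm("path/to/hypothesis.rttm")

metric = DiarizationErrorRate()
for uri, reference in references.items():
    metric(reference, hypotheses[uri])  # accumulate errors file by file

final_der = abs(metric)  # aggregated DER over the whole test set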

@juanmc2005 (Owner)

Hi @Shoawen0213, I just merged PR #46 into develop.
It contains a diart.benchmark script that can run the pipeline and do the evaluation automatically for you.
Other features that may be interesting for your use case are a --no-plot argument for diart.demo (in case you don't want to use diart.benchmark) and a --gpu argument to run on GPU.

I'm aiming to release this as part of version 0.3.0. For now you can use these features by installing from the develop branch.
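As with diart.demo, running python -m diart.benchmark -h once you're on the develop branch should list the accepted arguments.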

juanmc2005 added this to the Version 0.3 milestone on May 8, 2022
@Shoawen0213 (Author)

Hi @juanmc2005, thanks for your reply!!!
I'm still thinking about and trying out some of these parts.
Thanks for your help!!!!!!
I really, really appreciate it!
