
Question about reproducing the results #36

Closed
Shoawen0213 opened this issue Apr 23, 2022 · 8 comments
Labels: question (Further information is requested)
@Shoawen0213

Hi! It's me again, sorry for bothering you. I have several questions...
Q1.
I tried to reproduce the results of the paper using the following hyper-parameters.
[image: hyper-parameter settings]
I tested them on the AMI test set and the VoxConverse test set, but the results seem different.
The AMI test set has 24 wav files. I wrote a script that runs the testing for all of them.
For each wav file, I use the command below:
python -m diart.demo $wavpath --tau=0.507 --rho=0.006 --delta=1.057 --output ./output/AMI_dia_fintetune/
After obtaining each RTTM file, I calculate the DER for each wav file with "der = metric(reference, hypothesis)".
The reference RTTM comes from "AMI/MixHeadset.test.rttm".
Then I compute (sum of DERs) / (number of files) (24 files in this case) and get 0.3408511916216123, i.e. 34.085% DER.
Am I doing something wrong?
I can provide the RTTM files or the per-file DERs if that helps.
The VoxConverse dataset is still processing. I'm afraid I misunderstood something, so I'm asking about the problem first...
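
To be concrete, my scoring script does roughly the following (a simplified sketch; the per-file hypothesis paths are just illustrative, and load_rttm is the helper from pyannote.database):

from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

metric = DiarizationErrorRate()
references = load_rttm("AMI/MixHeadset.test.rttm")  # {uri: Annotation}
ders = []
for uri, reference in references.items():
    # illustrative path: one hypothesis RTTM per test file
    hypothesis = load_rttm(f"./output/AMI_dia_fintetune/{uri}.rttm")[uri]
    ders.append(metric(reference, hypothesis))  # per-file DER

print(sum(ders) / len(ders))  # average of the per-file DERs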

By the way, I did the same thing with pyannote v1.1, and I got 0.48516973769490174 as the final DER.
# v1.1
import torch
pipeline = torch.hub.load("pyannote/pyannote-audio", "dia")
diarization = pipeline({"audio": "xxx.wav"})

So I'm afraid that I did something wrong...

Q2
At the same time, I have another question.
[image: Figure 5 from the paper, showing results at different latencies]
This figure shows that you tried several configurations with different latencies.
Does python -m diart.demo use the 5.0s-latency configuration, which gets the best result in the paper?
If so, how do I switch to a different model/latency for inference?
And how do I train that part?

Again, thanks for your awesome project!!
Sorry for all those stupid questions...
Looking forward to your reply...

@Shoawen0213 (Author)

Hi, for the VoxConverse dataset, the DER result is 0.23988039048252469.
The inference procedure is the same as for the AMI test set described above.

@juanmc2005 (Owner)

Hi @Shoawen0213, I recommend you read issue #15 for some background on the problem of expected outputs.

First of all (and this relates to your second question), the default latency in the demo is 500ms (see here), so looking at the performance of the system in Figure 5 (the one you posted) for that latency, it looks like you got the expected performance.
You can always run python -m diart.demo -h for more info on the arguments.

Aside from that, note that the DER should be computed from the total false alarm, missed detection and confusion accumulated over the entire test set, not as the average of the per-file DERs. You can calculate this easily like this:

from pyannote.metrics.diarization import DiarizationErrorRate

metric = DiarizationErrorRate()
for ref, hyp in zip(all_references, all_hypothesis):
    metric(ref, hyp)  # accumulates false alarm, missed detection and confusion
final_der = abs(metric)  # aggregated DER over the whole test set

As mentioned in #15, this implementation is a bit different from the one used in the paper, but normally the performance should be very close and possibly slightly better. I haven't had the time to measure this properly though, which is why that issue is still open.

Could you please post the DER you obtain with the method I just described?

juanmc2005 added the question label on Apr 25, 2022
@juanmc2005 (Owner)

Btw I recommend you replace these lines with:

pipeline.from_source(audio_source).subscribe(RTTMWriter(path=output_dir / "output.rttm"))

That way you get rid of buffering and plotting, which will accelerate inference quite a bit. This would be a nice option to add to the demo in the future actually.

@Shoawen0213 (Author)

Hi! Thanks for your quick and useful reply! I will delve into #15 some more; I had actually already seen it before asking.
For Q1, as you say, it seems to match the expected output reported in the paper! So does that mean that if I want to test latencies of 1s, 2s, ..., I just need to change the default latency, without re-training? I'm asking because, according to "Training, fine-tuning, and transfer learning with pyannote.audio", vad_task = VoiceActivityDetection(ami, duration=2.0, batch_size=128) means the model will ingest batches of 128 two-second audio chunks. That part is independent, right? In other words, I can train the model on 2s chunks, 5s chunks, etc., but test with latencies anywhere from 0.5s to 5s? (The tutorial snippet I'm referring to is sketched below.)
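
For reference, the tutorial pattern I mean looks roughly like this (adapted from the pyannote.audio training tutorial; the protocol name and trainer settings are just its example, not something from this repo):

import pytorch_lightning as pl
from pyannote.database import FileFinder, get_protocol
from pyannote.audio.tasks import VoiceActivityDetection
from pyannote.audio.models.segmentation import PyanNet

# AMI protocol as set up in the tutorial
ami = get_protocol("AMI.SpeakerDiarization.MixHeadset", preprocessors={"audio": FileFinder()})

# Batches of 128 chunks, each 2 seconds long
vad_task = VoiceActivityDetection(ami, duration=2.0, batch_size=128)
model = PyanNet(task=vad_task)

trainer = pl.Trainer(max_epochs=1)
trainer.fit(model)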

Another question:
Following your recommendation, I now have the whole hypothesis in one RTTM file.
Forgive my confusion; I tried the code below (using the 0.5s version as the reference RTTM, for example):
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

metric = DiarizationErrorRate()
ref_path = "./expected_outputs/online/0.5s/AMI.rttm"
hyp_path = "./hypo_.txt"
ref_file = load_rttm(ref_path)  # {uri: Annotation}
hyp_file = load_rttm(hyp_path)

for i in range(len(file_name)):
    ref_file = ref_file[file_name[i]]  # overwrites the dict with a single Annotation
    hyp_file = hyp_file[file_name[i]]  # so the second iteration fails
    metric(ref_file, hyp_file)

final_der = abs(metric)
the "file_name" means the wav file in AMI test dataset, ['EN2002a', 'EN2002b', 'EN2002c', 'EN2002d', 'ES2004a', 'ES2004b', 'ES2004c', 'ES2004d', 'ES2014a', 'ES2014b', 'ES2014c', 'ES2014d', 'IS1009a', 'IS1009b', 'IS1009c', 'IS1009d', 'TS3003a', 'TS3003b', 'TS3003c', 'TS3003d', 'TS3007a', 'TS3007b', 'TS3007c', 'TS3007d']
But it's not working actually...

BTW, thanks for your recommendation!!!!!

@Shoawen0213 (Author)

for ref, hyp in zip(all_references, all_hypothesis):
    metric(ref, hyp)
final_der = abs(metric)

I don't quite get it: are all_references and all_hypothesis RTTM files?

@juanmc2005 (Owner)

You can of course change the latency without retraining; that's one of the advantages of diart. You can modify this easily when running the demo: just add --latency 3 (for a 3s latency, for example). Again, python -m diart.demo -h will give you more information about the accepted arguments.

The segmentation model can be trained for any chunk duration, just keep in mind that chunk_step <= latency <= chunk_duration.

Concerning the evaluation, you can take a look at the pyannote.metrics documentation to understand how it works.
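
To make it concrete, something along these lines should work (a rough sketch with placeholder paths; load_rttm is the helper from pyannote.database that parses an RTTM file into a {uri: Annotation} dict, so the metric receives Annotation objects rather than file paths):

from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

references = load_rttm("path/to/reference.rttm")   # {uri: Annotation}
hypotheses = load_rttm("path/to/hypothesis.rttm")

metric = DiarizationErrorRate()
for uri, reference in references.items():
    metric(reference, hypotheses[uri])  # accumulate errors file by file

final_der = abs(metric)  # aggregated DER over the whole test set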

@juanmc2005 (Owner)

Hi @Shoawen0213, I just merged PR #46 into develop.
It contains a diart.benchmark script that can run the pipeline and do the evaluation automatically for you.
Other features that may be interesting for your use case are a --no-plot argument for diart.demo (in case you don't want to use diart.benchmark) and a --gpu argument to run on GPU.

I'm aiming to release this as part of version 0.3.0. For now you can use these features by installing from the develop branch.
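As with diart.demo, running python -m diart.benchmark -h once you're on the develop branch should list the accepted arguments.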

juanmc2005 added this to the Version 0.3 milestone on May 8, 2022
@Shoawen0213 (Author)

Hi @juanmc2005, thanks for your reply!!!
I'm still thinking about and trying out some of these parts.
Thanks for your help!!!!!!
I really, really appreciate it!
