My output is different from expected_outputs #15

Closed
sablea opened this issue Dec 12, 2021 · 9 comments
Labels
bug Something isn't working

Comments

@sablea

sablea commented Dec 12, 2021

I tried to use the following command to test the model:
python src/main.py path/to/voxconverse_test_wav/aepyx.wav --latency=0.5 --tau=0.576 --rho=0.915 --delta=0.648 --output results/vox_test_0.5s
Then I compared my output with 'expected_outputs/online/0.5s/VoxConverse.rttm' and found that my output ends early.
For example, the last line of my output is
SPEAKER aepyx 1 149.953 1.989 <NA> <NA> speaker0 <NA> <NA>
but the last line for aepyx.wav in 'expected_outputs/online/0.5s/VoxConverse.rttm' is
SPEAKER aepyx 1 168.063 0.507 <NA> <NA> F <NA> <NA>
They end at different times.
I don't know whether the command I entered is wrong or whether this is due to some other reason.
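(For reference, one way to compare where the two RTTM outputs end is a small script like the sketch below; the paths and output filename are placeholders, and standard RTTM columns are assumed: field 2 is the recording URI, field 4 the start time, field 5 the duration.)

def last_end_time(rttm_path, uri):
    # Return the latest "start + duration" for the given recording URI
    last_end = 0.0
    with open(rttm_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 5 and fields[0] == "SPEAKER" and fields[1] == uri:
                start, duration = float(fields[3]), float(fields[4])
                last_end = max(last_end, start + duration)
    return last_end

print("mine:", last_end_time("results/vox_test_0.5s/aepyx.rttm", "aepyx"))
print("expected:", last_end_time("expected_outputs/online/0.5s/VoxConverse.rttm", "aepyx"))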

juanmc2005 added the bug label on Dec 14, 2021
@juanmc2005
Owner

Hi, thanks for reporting this issue.

The code that's implemented here is not exactly the same as the one that obtained that performance. This is because we wanted to release an implementation that's compatible with pyannote.audio, but more lightweight and with real-time streaming capabilities (namely through the use of RxPY).

One key difference that I've already identified, and which might be causing the problem you report, is #13. This means that the aggregation strategy for overlapping windows between buffers (see Figure 4 in the paper) needs to be a Hamming-weighted average instead of a normal average.
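(To make the idea concrete, here is a minimal standalone sketch of a Hamming-weighted average over overlapping prediction windows compared to a plain average. It uses toy shapes and numpy's np.hamming; it is not the actual pipeline code.)

import numpy as np

# Toy setup: 3 buffers of 50 frames each, shifted by 10 frames, each
# predicting scores for 2 speakers. All 3 buffers overlap on a common
# region of 30 frames.
win_len, step, n_buffers, n_speakers = 50, 10, 3, 2
rng = np.random.default_rng(0)
buffers = [rng.random((win_len, n_speakers)) for _ in range(n_buffers)]

def common_region(i):
    # Local frame range of the common region inside buffer i
    start = (n_buffers - 1 - i) * step
    return slice(start, start + win_len - (n_buffers - 1) * step)

crops = [b[common_region(i)] for i, b in enumerate(buffers)]
weights = [np.hamming(win_len)[common_region(i), np.newaxis] for i in range(n_buffers)]

# Plain average: every buffer contributes equally at every frame
plain = np.mean(np.stack(crops), axis=0)

# Hamming-weighted average: a buffer contributes more where the frame lies
# close to its center, then the result is renormalized by the weight sum
weighted = sum(w * c for w, c in zip(weights, crops)) / sum(weights)

print(plain.shape, weighted.shape)  # both (30, 2)

(With equal weights, or a single overlapping window, the two strategies give the same result.)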

I don't have a lot of free time to implement this right now and the code that I have is not compatible with RxPY, but this is definitely something that will be fixed in the future.

If you happen to have the time, I would gladly merge a PR with this fix.

@sablea
Author

sablea commented Dec 15, 2021

I modified the code according to the content in #13, but it improved very little.

I'll gladly try to work on this fix, but I don't understand what the problem is.
Could you tell me exactly where the bug is in the code? Is it because of an internal problem with RxPY?

@juanmc2005
Owner

juanmc2005 commented Dec 15, 2021

The problem is in this line:

intersection = np.stack([
    buffer.crop(required, fixed=required.duration)
    for buffer in buffers
])

Off the top of my head, it would look something like this:

hamming_windows = ... # They should be temporally aligned to the buffers
# Multiply each buffer by its aligned Hamming window
if strategy == "hamming":
    buffers = [h * b for h, b in zip(hamming_windows, buffers)]
# Stack the aligned output windows over the required region
intersection = np.stack([
    buffer.crop(required, fixed=required.duration)
    for buffer in buffers
])
# Normalize by the sum of the Hamming weights, or by the number of
# buffers for a plain average
if strategy == "hamming":
    # keepdims so the (num_frames, 1) weight sum broadcasts against
    # the (num_frames, num_speakers) predictions
    counts = np.hstack([
        h.crop(required, fixed=required.duration)
        for h in hamming_windows
    ]).sum(axis=1, keepdims=True)
else:
    counts = len(buffers)
aggregation = np.sum(intersection, axis=0) / counts

It's not a problem with RxPY at all; what I meant is that the original code isn't structured the way it should be to work seamlessly alongside RxPY.

@juanmc2005
Owner

I'm also planning to move this aggregation feature to the functional module (see #12) so it can be used independently from RxPY

@juanmc2005
Owner

Hi @sablea, I've been working on issue #13 (see here) and I realized that the aggregation strategy shouldn't actually matter for latency=500ms (your case) because there's only 1 output window to aggregate. So I don't think this is causing the problem.

However, I also realized that the last chunks may not be correctly handled in the aggregate operator. In particular, it might be waiting to receive a chunk ending at file_duration + latency to output a prediction at time file_duration, which I think explains the behavior you report.

Could you please confirm that the rest of your RTTM output is identical to the expected output (i.e. from 0s-149s)?

@sablea
Author

sablea commented Dec 17, 2021

I'm glad to see your update today! I think the new code can solve this issue well, although my test program is still running.

They are not exactly the same; there is always a millisecond-level difference in each line.
I added a uem file, set the duration to 149s, and calculated the DER values as follows:
(only for aepyx.wav in the first 149s)
expected output: 0.33833992094861687
my RTTM output: 0.3342531135804318
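(For anyone who wants to reproduce this kind of check, a DER restricted to the first 149s can be computed with pyannote.metrics roughly as sketched below; the file paths are placeholders and the default metric settings may differ from the paper's evaluation setup.)

from pyannote.core import Segment, Timeline
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# Placeholder paths: reference annotation and the pipeline's RTTM output
reference = load_rttm("voxconverse_test_rttm/aepyx.rttm")["aepyx"]
hypothesis = load_rttm("results/vox_test_0.5s/aepyx.rttm")["aepyx"]

# Restrict evaluation to the first 149 seconds, like the UEM above
uem = Timeline([Segment(0, 149)])

metric = DiarizationErrorRate()  # default settings
print(metric(reference, hypothesis, uem=uem))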

If you want to check it, click this; I'm sharing it through Google Drive.

Nice work!

@juanmc2005
Owner

Thanks for the quick answer! The few-millisecond difference might be due to the bug I fixed in #16.
The Hamming-weighted average should allow the pipeline to match the expected outputs on latencies > 500ms.
I will be working on fixing the problem of the output ending too soon.

@juanmc2005
Owner

juanmc2005 commented Dec 17, 2021

OK, I just pushed a fix for the output truncation at the end. It turns out that UnreliableFileAudioSource was causing some strange behavior, so the demo now uses ReliableFileAudioSource and the pipeline doesn't necessarily regularize it, since the stream is already formatted as 5s chunks with a 500ms shift.
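(Just to illustrate that chunk format, here is a generic sliding-window sketch; the 16kHz sample rate and the function name are assumptions, and this is not the actual ReliableFileAudioSource code.)

import numpy as np

def sliding_chunks(waveform, sample_rate=16000, duration=5.0, shift=0.5):
    # Yield (onset_in_seconds, chunk) pairs: 5s windows every 500ms
    win, hop = int(duration * sample_rate), int(shift * sample_rate)
    for start in range(0, len(waveform) - win + 1, hop):
        yield start / sample_rate, waveform[start:start + win]

# Example with a dummy mono signal
for onset, chunk in sliding_chunks(np.zeros(170 * 16000)):
    pass  # each `chunk` is what the pipeline consumes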

The fix should work for latency=500ms, but unfortunately the output still ends too early for higher latencies.
I actually think the original implementation had this bug as well, so we might be able to squeeze some extra performance there.
I'll create a separate issue for that.

For now you can run it from the feat/hamming branch; I'll try to get it merged into develop next week.

Please let me know if you keep seeing discrepancies; I haven't been able to run tests on VoxConverse files.

@juanmc2005
Owner

I've recently been working on #35 and this allowed me to quickly benchmark the performance of this implementation on some of the datasets.
I ran the pipeline with latency=5s on both the AMI test set and the VoxConverse test set, and I got DERs of 27.3 and 17.1 respectively, which is very close to the results of the paper (even a bit better for AMI). I'm still trying to understand exactly where the remaining differences come from, apart from the known problems that I've addressed since this issue was opened.

In any case, this will make it easy to benchmark other "flavors" of the pipeline, as well as provide an easy-to-run baseline. I'm planning to release it as part of version 0.3.
