My output is different from expected_outputs #15

Closed
sablea opened this issue Dec 12, 2021 · 9 comments
Labels
bug Something isn't working

Comments

@sablea

sablea commented Dec 12, 2021

I tried to use the following command to test the model:
python src/main.py path/to/voxconverse_test_wav/aepyx.wav --latency=0.5 --tau=0.576 --rho=0.915 --delta=0.648 --output results/vox_test_0.5s
Then I compared my output with 'expected_outputs/online/0.5s/VoxConverse.rttm' and found that my output ends early.
For example, the last line of my output is
SPEAKER aepyx 1 149.953 1.989 <NA> <NA> speaker0 <NA> <NA>
but the last line for aepyx.wav in 'expected_outputs/online/0.5s/VoxConverse.rttm' is
SPEAKER aepyx 1 168.063 0.507 <NA> <NA> F <NA> <NA>
They end at different times.
I don't know whether the command I entered is wrong or whether this is due to some other reason.
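(For reference, one way to compare where the two RTTM outputs end is a small script like the sketch below; the paths and output filename are placeholders, and standard RTTM columns are assumed: field 2 is the recording URI, field 4 the start time, field 5 the duration.)

def last_end_time(rttm_path, uri):
    # Return the latest "start + duration" for the given recording URI
    last_end = 0.0
    with open(rttm_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 5 and fields[0] == "SPEAKER" and fields[1] == uri:
                start, duration = float(fields[3]), float(fields[4])
                last_end = max(last_end, start + duration)
    return last_end

print("mine:", last_end_time("results/vox_test_0.5s/aepyx.rttm", "aepyx"))
print("expected:", last_end_time("expected_outputs/online/0.5s/VoxConverse.rttm", "aepyx"))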

juanmc2005 added the bug label on Dec 14, 2021
@juanmc2005
Owner

Hi, thanks for reporting this issue.

The code that's implemented here is not exactly the same as the one that obtained that performance. This is because we wanted to release an implementation that's compatible with pyannote.audio, but more lightweight and with real-time streaming capabilities (namely through the use of RxPY).

One key difference that I've already identified, and which might be causing the problem you report, is #13. This means that the aggregation strategy for overlapping windows between buffers (see Figure 4 in the paper) needs to be a Hamming-weighted average instead of a normal average.
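(To make the idea concrete, here is a minimal standalone sketch of a Hamming-weighted average over overlapping prediction windows compared to a plain average. It uses toy shapes and numpy's np.hamming; it is not the actual pipeline code.)

import numpy as np

# Toy setup: 3 buffers of 50 frames each, shifted by 10 frames, each
# predicting scores for 2 speakers. All 3 buffers overlap on a common
# region of 30 frames.
win_len, step, n_buffers, n_speakers = 50, 10, 3, 2
rng = np.random.default_rng(0)
buffers = [rng.random((win_len, n_speakers)) for _ in range(n_buffers)]

def common_region(i):
    # Local frame range of the common region inside buffer i
    start = (n_buffers - 1 - i) * step
    return slice(start, start + win_len - (n_buffers - 1) * step)

crops = [b[common_region(i)] for i, b in enumerate(buffers)]
weights = [np.hamming(win_len)[common_region(i), np.newaxis] for i in range(n_buffers)]

# Plain average: every buffer contributes equally at every frame
plain = np.mean(np.stack(crops), axis=0)

# Hamming-weighted average: a buffer contributes more where the frame lies
# close to its center, then the result is renormalized by the weight sum
weighted = sum(w * c for w, c in zip(weights, crops)) / sum(weights)

print(plain.shape, weighted.shape)  # both (30, 2)

(With equal weights, or a single overlapping window, the two strategies give the same result.)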

I don't have a lot of free time to implement this right now and the code that I have is not compatible with RxPY, but this is definitely something that will be fixed in the future.

If you happen to have the time, I would gladly merge a PR with this fix.

@sablea
Author

sablea commented Dec 15, 2021

I modified the code according to the content in #13, but it improved very little.

I'll gladly try to work on this fix, but I don't understand what the problem is.
Could you tell me exactly where the bug is in the code? Is it because of an internal problem with RxPY?

@juanmc2005
Owner

juanmc2005 commented Dec 15, 2021

The problem is in this line:

intersection = np.stack([
    buffer.crop(required, fixed=required.duration)
    for buffer in buffers
])

Off the top of my head, it would look something like this:

hamming_windows = ... # They should be temporally aligned to the buffers
# Multiply each buffer by its aligned Hamming window
if strategy == "hamming":
    buffers = [h * b for h, b in zip(hamming_windows, buffers)]
# Stack the aligned output windows over the required region
intersection = np.stack([
    buffer.crop(required, fixed=required.duration)
    for buffer in buffers
])
# Normalize by the sum of the Hamming weights, or by the number of
# buffers for a plain average
if strategy == "hamming":
    # keepdims so the (num_frames, 1) weight sum broadcasts against
    # the (num_frames, num_speakers) predictions
    counts = np.hstack([
        h.crop(required, fixed=required.duration)
        for h in hamming_windows
    ]).sum(axis=1, keepdims=True)
else:
    counts = len(buffers)
aggregation = np.sum(intersection, axis=0) / counts

It's not a problem with RxPY at all; what I meant is that the original code isn't structured the way it should be to work seamlessly alongside RxPY.

@juanmc2005
Owner

I'm also planning to move this aggregation feature to the functional module (see #12) so it can be used independently from RxPY

@juanmc2005
Owner

Hi @sablea, I've been working on issue #13 (see here) and I realized that the aggregation strategy shouldn't actually matter for latency=500ms (your case) because there's only 1 output window to aggregate. So I don't think this is causing the problem.

However, I also realized that the last chunks may not be correctly handled in the aggregate operator. In particular, it might be waiting to receive a chunk ending at file_duration + latency to output a prediction at time file_duration, which I think explains the behavior you report.

Could you please confirm that the rest of your RTTM output is identical to the expected output (i.e. from 0s-149s)?

@sablea
Author

sablea commented Dec 17, 2021

I'm glad to see your update today! I think the new code can solve this issue well, although my test program is still running.

They are not exactly the same; there is always a millisecond-level difference in each line.
I added a uem file, set the duration to 149s, and calculated the DER values as follows:
(only for aepyx.wav in the first 149s)
expected output: 0.33833992094861687
my RTTM output: 0.3342531135804318
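(For anyone who wants to reproduce this kind of check, a DER restricted to the first 149s can be computed with pyannote.metrics roughly as sketched below; the file paths are placeholders and the default metric settings may differ from the paper's evaluation setup.)

from pyannote.core import Segment, Timeline
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# Placeholder paths: reference annotation and the pipeline's RTTM output
reference = load_rttm("voxconverse_test_rttm/aepyx.rttm")["aepyx"]
hypothesis = load_rttm("results/vox_test_0.5s/aepyx.rttm")["aepyx"]

# Restrict evaluation to the first 149 seconds, like the UEM above
uem = Timeline([Segment(0, 149)])

metric = DiarizationErrorRate()  # default settings
print(metric(reference, hypothesis, uem=uem))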

If you want to check it, click this; I'm sharing it through Google Drive.

Nice work!

@juanmc2005
Owner

Thanks for the quick answer! The few-millisecond difference might be due to the bug I fixed in #16.
The Hamming-weighted average should allow the pipeline to match the expected outputs on latencies > 500ms.
I will be working on fixing the problem of the output ending too soon.

@juanmc2005
Owner

juanmc2005 commented Dec 17, 2021

OK, I just pushed a fix for the output truncation at the end. It turns out that UnreliableFileAudioSource was causing some strange behavior, so the demo now uses ReliableFileAudioSource and the pipeline doesn't necessarily regularize it, since the stream is already formatted as 5s chunks with a 500ms shift.
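(Just to illustrate that chunk format, here is a generic sliding-window sketch; the 16kHz sample rate and the function name are assumptions, and this is not the actual ReliableFileAudioSource code.)

import numpy as np

def sliding_chunks(waveform, sample_rate=16000, duration=5.0, shift=0.5):
    # Yield (onset_in_seconds, chunk) pairs: 5s windows every 500ms
    win, hop = int(duration * sample_rate), int(shift * sample_rate)
    for start in range(0, len(waveform) - win + 1, hop):
        yield start / sample_rate, waveform[start:start + win]

# Example with a dummy mono signal
for onset, chunk in sliding_chunks(np.zeros(170 * 16000)):
    pass  # each `chunk` is what the pipeline consumes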

The fix should work for latency=500ms, but unfortunately the output still ends too early for higher latencies.
I actually think the original implementation had this bug as well, so we might be able to squeeze some extra performance there.
I'll create a separate issue for that.

For now you can run it from the feat/hamming branch; I'll try to get it merged into develop next week.

Please let me know if you keep seeing discrepancies; I haven't been able to run tests on VoxConverse files.

@juanmc2005
Owner

I've recently been working on #35 and this allowed me to quickly benchmark the performance of this implementation on some of the datasets.
I ran the pipeline with latency=5s on both the AMI test set and the VoxConverse test set, and I got DERs of 27.3 and 17.1 respectively, which is very close to the results of the paper (even a bit better for AMI). I'm still trying to understand exactly where the remaining differences come from, apart from the known problems that I've addressed since this issue was opened.

In any case, this will make it easy to benchmark other "flavors" of the pipeline, as well as provide an easy-to-run baseline. I'm planning to release it as part of version 0.3.
