Whisper chunking algorithm increases WER #37789

@asusdisciple

Description

System Info

So I ran a few experiments with Whisper and SeamlessM4T-v2 on the FLEURS dataset (files concatenated into 5-minute samples). I used the chunked batching functionality by setting chunk_length_s to 30 s, and as it turns out the WER increases by 20% across all languages compared to long-form transcription (going through each file sequentially). Do you see the same behaviour? Is this a bug, or is it expected because of the chunking? A 20% increase seems far too large from my point of view.
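For reference, this is what the reported metric measures. A minimal sketch of word error rate via word-level Levenshtein distance (libraries like jiwer do the same thing); note that a "20% increase" reads very differently as absolute WER points versus a relative increase, so it is worth stating which is meant:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```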

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Just use the default pipeline implementation of Whisper on files that are a few minutes long. It's far worse when chunking is enabled.
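To illustrate what chunking does (i.e. `pipeline("automatic-speech-recognition", model=..., chunk_length_s=30)`), here is a simplified sketch of the windowing. It assumes the documented default stride of `chunk_length_s / 6` on each side; the arithmetic is an approximation of the pipeline's internal `chunk_iter`, not the actual implementation:

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) times of the overlapping decode windows."""
    if stride_length_s is None:
        # assumed default per the pipeline docs: 1/6 of the chunk on each side
        stride_length_s = chunk_length_s / 6.0
    step = chunk_length_s - 2 * stride_length_s  # net advance per window
    windows, start = [], 0.0
    while start < duration_s:
        windows.append((start, min(start + chunk_length_s, duration_s)))
        start += step
    return windows

# A 5-minute (300 s) file becomes 30 s windows advancing 20 s at a time,
# so every region near a boundary is decoded more than once and the
# overlapping token sequences must be merged -- one plausible source of
# the extra errors compared to sequential long-form decoding.
```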

Expected behavior

I would expect the same transcription quality, or maybe a few percent worse, but a 20% increase is far from that.
