Whisper chunking algorithm increases WER #37789

@asusdisciple

Description

System Info

So I ran a few experiments with Whisper and SeamlessM4T-v2 on the FLEURS dataset (files concatenated into 5-minute samples). I used the chunked batching functionality by setting chunk_length_s to 30 s, and as it turns out the WER increases by 20% across all languages compared to long-form transcription (going through each file sequentially). Do you see the same behaviour? Is this a bug, or is it expected because of the chunking? A 20% increase seems far too large from my point of view.
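For reference, this is what the reported metric measures. A minimal sketch of word error rate via word-level Levenshtein distance (libraries like jiwer do the same thing); note that a "20% increase" reads very differently as absolute WER points versus a relative increase, so it is worth stating which is meant:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```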

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Just use the default pipeline implementation of Whisper on files that are a few minutes long. It's far worse when chunking is enabled.
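To illustrate what chunking does (i.e. `pipeline("automatic-speech-recognition", model=..., chunk_length_s=30)`), here is a simplified sketch of the windowing. It assumes the documented default stride of `chunk_length_s / 6` on each side; the arithmetic is an approximation of the pipeline's internal `chunk_iter`, not the actual implementation:

```python
def chunk_windows(duration_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end) times of the overlapping decode windows."""
    if stride_length_s is None:
        # assumed default per the pipeline docs: 1/6 of the chunk on each side
        stride_length_s = chunk_length_s / 6.0
    step = chunk_length_s - 2 * stride_length_s  # net advance per window
    windows, start = [], 0.0
    while start < duration_s:
        windows.append((start, min(start + chunk_length_s, duration_s)))
        start += step
    return windows

# A 5-minute (300 s) file becomes 30 s windows advancing 20 s at a time,
# so every region near a boundary is decoded more than once and the
# overlapping token sequences must be merged -- one plausible source of
# the extra errors compared to sequential long-form decoding.
```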

Expected behavior

I would expect the same transcription quality, or maybe a few percent worse, but a 20% increase is far from that.
