Whisper language is not returned when using word timestamps #29520

Open
robinderat opened this issue Mar 7, 2024 · 3 comments
Labels: Audio, bug, Feature request


@robinderat

System Info

  • transformers version: 4.38.2
  • Platform: macOS-13.6.1-x86_64-i386-64bit
  • Python version: 3.10.11
  • Huggingface_hub version: 0.21.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@sanchit-gandhi @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am using the automatic speech recognition pipeline with Whisper large-v3. I want to get both the language as detected by Whisper and the timestamps for the individual words. Both behaviors are supported individually (return_language=True, return_timestamps="word"); however, when combined, the language is no longer returned.

This is the code I am using:

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, use_safetensors=True, torch_dtype=torch.float16, low_cpu_mem_usage=True,
)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch.float16,
    return_language=True
)

result = pipe(audio_filepath, return_timestamps="word")
print(result)
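
For reference, with these settings the pipeline currently returns word-level chunks with no language field at all, roughly in this shape (illustrative, not actual output):

{
  "text": "FULL TEXT",
  "chunks": [
    {"text": "WORD", "timestamp": (0.0, 0.5)},
    ...
  ]
}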

I have done some digging in the source code, and I believe I have found the problem in tokenization_whisper.py.

Here, when using word timestamps, the existing chunks containing the language are ignored and only the words in each chunk are returned:

if return_timestamps or return_language:
    for chunk in chunks:
        if not return_timestamps:
            chunk.pop("timestamp")
        else:
            chunk["timestamp"] = tuple(chunk["timestamp"])
        if not return_language:
            chunk.pop("language")

    if return_timestamps == "word":
        # The chunk-level dicts (which carry "language") are dropped here;
        # only the word-level entries are returned.
        new_chunks = []
        for chunk in chunks:
            new_chunks.extend(chunk["words"])
        optional = {"chunks": new_chunks}
    else:
        optional = {"chunks": chunks}
else:
    optional = {}
return full_text, optional

Expected behavior

I expect to get both the language and the word timestamps when using return_language=True and return_timestamps="word".

Considering the source code mentioned above, I see two possible ways of achieving this.

  1. Add an extra list to the chunks when using return_timestamps='word'

This would result in the format

{
  "text": "FULL TEXT",
  "chunks": [
    {
      "language": "LANGUAGE",
      "timestamp": TIMESTAMP,
      "text": "CHUNK_TEXT",
      "words": [
        {"text": "WORD", "timestamp": WORD_TIMESTAMP}
      ]
    }
  ]
}
  2. Add the language information to the words

This would maintain the current structure and simply add the language of the chunk to its words, giving the following format

{
  "text": "FULL TEXT",
  "chunks": [
    {
      "language": "LANGUAGE",
      "timestamp": WORD_TIMESTAMP,
      "text": "WORD"
    }
  ]
}

I believe solution 1 could be achieved simply by removing the if return_timestamps == "word": branch completely.
I believe solution 2 could be achieved by replacing new_chunks.extend(chunk["words"]) with:

for word in chunk["words"]:
    word["language"] = chunk["language"]
    new_chunks.append(word)
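
Applied to the snippet from tokenization_whisper.py quoted above, solution 2 would make the word-timestamp branch look roughly like this (a sketch, not a tested patch; the return_language guard is my addition, since "language" is only present on a chunk when return_language=True):

if return_timestamps == "word":
    new_chunks = []
    for chunk in chunks:
        for word in chunk["words"]:
            # Propagate the chunk-level language down to each word entry.
            if return_language:
                word["language"] = chunk["language"]
            new_chunks.append(word)
    optional = {"chunks": new_chunks}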
@ArthurZucker
Collaborator

That is expected IMO: how do you choose which language to return? Do you return a language for each chunk? For each text segment between timestamps? Do you return the language that is used most? I think these were the considerations. Also, getting the language first should be pretty easy.

cc @ylacombe as well

@ArthurZucker added the Feature request label Mar 25, 2024
@robinderat
Author

I have implemented the second approach I outlined in a fork of the project (which is not ideal), and it's working just fine. In my opinion, if you can predict the language for a chunk, there is no harm in saying that all words in that chunk belong to that language.

If there is an easy way to get the language first, that would be fine too, but I don't see how, besides running the model twice, which is slow and wasteful.
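
For what it's worth, here is a rough, untested sketch of getting the language up front with one short generate call, reusing the model and processor from the example above. audio_array is assumed to be a 16 kHz mono waveform (not taken from the issue), and the token position assumes Whisper's <|startoftranscript|><|language|><|task|> prefix:

import torch

inputs = processor.feature_extractor(audio_array, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(model.device, dtype=model.dtype)

with torch.no_grad():
    # A few tokens are enough: the language token is emitted right after
    # <|startoftranscript|> when no language is forced.
    generated = model.generate(input_features, max_new_tokens=5)

# e.g. "<|en|>"
language_token = processor.tokenizer.decode(generated[0, 1:2])
print(language_token)

This only looks at the first 30-second window, though, so it is not a full substitute for per-chunk language detection.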

@kamilakesbi
Contributor

Hi @robinderat,

Thank you so much for this question!

This is indeed a bug, and fixing it would be a very nice contribution. The second approach you suggest looks good as it would keep the structure of the output and just add the missing attributes.

Would you like to open a PR for this fix, given that you have implemented it in a forked project?

This would be a really valuable fix for the Whisper community :)
