Assistant model not working for different-sized OpenAI models when using pipeline for ASR #30407

Closed
1 of 4 tasks
rdib-equinor opened this issue Apr 23, 2024 · 8 comments
Comments

rdib-equinor commented Apr 23, 2024

System Info

  • transformers version: 4.40.0
  • Platform: macOS-14.4-arm64-arm-64bit
  • Python version: 3.12.1
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@Narsil @sanchit

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, AutomaticSpeechRecognitionPipeline, AutoModelForCausalLM

# load data to test
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:1]")
sample = dataset[0]

# load base model
model_id = "openai/whisper-large-v2"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# load distil model equivalent of base model
assistant_distil_model_id = "distil-whisper/distil-large-v2"
assistant_direct_distil_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_distil_model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
assistant_pipe_distil_model = AutoModelForCausalLM.from_pretrained(
    assistant_distil_model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# load tiny version of model from same origin (openai)
assistant_tiny_model_id = "openai/whisper-tiny"
assistant_direct_tiny_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_tiny_model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
assistant_pipe_tiny_model = AutoModelForCausalLM.from_pretrained(
    assistant_tiny_model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# load pipeline for base model
pipe = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language":"en"},
)

# display target output
print(processor.tokenizer.normalize(sample["text"]))

# successfully use both assistant models without pipeline
inputs = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt")
output = model.generate(**inputs, assistant_model=assistant_direct_distil_model, language="en")
print(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])

inputs = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt")
output = model.generate(**inputs, assistant_model=assistant_direct_tiny_model, language="en")
print(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])

# fail to use same models within pipeline
inputs = {
    "sampling_rate": sample["audio"]["sampling_rate"],
    "raw": np.array(sample["audio"]["array"]),
}
output = pipe(inputs=inputs, generate_kwargs={"assistant_model":assistant_pipe_distil_model})["text"]
print(processor.tokenizer.normalize(output))

inputs = {
    "sampling_rate": sample["audio"]["sampling_rate"],
    "raw": np.array(sample["audio"]["array"]),
}
output = pipe(inputs=inputs, generate_kwargs={"assistant_model":assistant_pipe_tiny_model})["text"]
print(processor.tokenizer.normalize(output))

Expected behavior

Following the lovely blog post about speculative decoding, I have seen that it's possible to use it in a pipeline for ASR with a model and its distil variant. I have also seen that it's possible to use original Whisper models of different sizes, e.g. large-v2 and tiny. However, when I try to use original models of different sizes via a pipeline, I get an error that does not occur when using the same models outside of a pipeline.

@amyeroberts (Collaborator)

cc @sanchit-gandhi @gante

@sanchit-gandhi (Contributor)

cc @kamilakesbi could you take a look here? Related: #29869

MrRace commented May 20, 2024

same problem!

kamilakesbi (Contributor) commented May 20, 2024

Hi @MrRace,

The PR associated with this issue should be merged soon ;)

@kamilakesbi (Contributor)

#30637 has been merged, so I'm closing this issue.

MrRace commented Jun 15, 2024

@kamilakesbi
The new transformers version (4.42.0.dev0) still raises an error; the specific error message is as follows:

ValueError: The main model and the assistant don't have compatible encoder-dependent input shapes. Ensure you load the assistant with the correct encoder-decoder class, e.g. `AutoModelForSpeechSeq2Seq` for Whisper.

I use whisper-tiny as the assistant model and whisper-base as the main model.

MrRace commented Jun 15, 2024

@kamilakesbi
My test script is as follows:

import os
import time

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, AutoModelForCausalLM
from datasets import load_dataset
from tqdm import tqdm
from evaluate import load

# device and dtype used by the models and inputs below
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_dir = "/share_model_zoo/LLM"
model_id = "openai/whisper-base"
assistant_model_id = "openai/whisper-tiny"
model_path = os.path.join(model_dir, model_id)
assistant_model_path = os.path.join(model_dir, assistant_model_id)
print("model_path=", model_path)
print("assistant_model_path=", assistant_model_path)


model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)
print("Load model success")

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
    # pretraining_tp=1
)
assistant_model.config.pretraining_tp = 1
assistant_model.to(device)

processor = AutoProcessor.from_pretrained(model_path)

def assisted_generate_with_time(model, inputs, **kwargs):
    """Run assisted generation and return the outputs together with the elapsed generation time."""
    start_time = time.time()
    outputs = model.generate(**inputs, assistant_model=assistant_model, **kwargs)
    generation_time = time.time() - start_time
    return outputs, generation_time
    

all_time = 0
predictions = []
references = []

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation", cache_dir="./tmp/")
print("Load dataset success")

for sample in tqdm(dataset):
    audio = sample["audio"]
    inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
    inputs = inputs.to(device=device, dtype=torch_dtype)
    output, gen_time = assisted_generate_with_time(model, inputs)
    all_time += gen_time
    predictions.append(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])
    references.append(processor.tokenizer._normalize(sample["text"]))

print("all_time=", all_time)

@kamilakesbi (Contributor)

Hi @MrRace,

Thanks for this question!

When using Whisper models with different encoder sizes (such as whisper-tiny and whisper-base here), you should make sure that both models are loaded with AutoModelForSpeechSeq2Seq (an encoder-decoder model).

If you load the assistant with AutoModelForCausalLM (a decoder-only model), the assistant consists of a decoder only and takes the output of the main model's encoder as input. If the output sizes of the main and assistant encoders differ, you will get an error.

If instead you load your assistant with AutoModelForSpeechSeq2Seq, the assistant is an encoder-decoder model: it processes the input with its own encoder before passing it to its decoder, and you won't get an error.
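
For example, here is a minimal sketch of the working setup. It uses the public openai/whisper-base and openai/whisper-tiny checkpoints and the dummy LibriSpeech sample from the original report rather than your local paths; adjust the model paths, device, and dtype to your environment.

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

main_model_id = "openai/whisper-base"
assistant_model_id = "openai/whisper-tiny"

processor = AutoProcessor.from_pretrained(main_model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    main_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)

# Load the assistant as an encoder-decoder model so it runs its own encoder
# instead of reusing the main model's encoder output (whose size differs).
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:1]")
audio = dataset[0]["audio"]
inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
inputs = inputs.to(device=device, dtype=torch_dtype)

output = model.generate(**inputs, assistant_model=assistant_model)
print(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])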

Hope this helps!
