Assistant model not working for different-sized OpenAI models when using pipeline for ASR #30407

Closed
1 of 4 tasks
rdib-equinor opened this issue Apr 23, 2024 · 8 comments
Comments

rdib-equinor commented Apr 23, 2024

System Info

  • transformers version: 4.40.0
  • Platform: macOS-14.4-arm64-arm-64bit
  • Python version: 3.12.1
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@Narsil @sanchit

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, AutomaticSpeechRecognitionPipeline, AutoModelForCausalLM

# load data to test
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:1]")
sample = dataset[0]

# load base model
model_id = "openai/whisper-large-v2"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# load distil model equivalent of base model
assistant_distil_model_id = "distil-whisper/distil-large-v2"
assistant_direct_distil_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_distil_model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
assistant_pipe_distil_model = AutoModelForCausalLM.from_pretrained(
    assistant_distil_model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# load tiny version of model from same origin (openai)
assistant_tiny_model_id = "openai/whisper-tiny"
assistant_direct_tiny_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_tiny_model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
assistant_pipe_tiny_model = AutoModelForCausalLM.from_pretrained(
    assistant_tiny_model_id,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# load pipeline for base model
pipe = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    generate_kwargs={"language":"en"},
)

# display target output
print(processor.tokenizer.normalize(sample["text"]))

# successfully use both assistant models without pipeline
inputs = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt")
output = model.generate(**inputs, assistant_model=assistant_direct_distil_model, language="en")
print(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])

inputs = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt")
output = model.generate(**inputs, assistant_model=assistant_direct_tiny_model, language="en")
print(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])

# fail to use same models within pipeline
inputs = {
    "sampling_rate": sample["audio"]["sampling_rate"],
    "raw": np.array(sample["audio"]["array"]),
}
output = pipe(inputs=inputs, generate_kwargs={"assistant_model":assistant_pipe_distil_model})["text"]
print(processor.tokenizer.normalize(output))

inputs = {
    "sampling_rate": sample["audio"]["sampling_rate"],
    "raw": np.array(sample["audio"]["array"]),
}
output = pipe(inputs=inputs, generate_kwargs={"assistant_model":assistant_pipe_tiny_model})["text"]
print(processor.tokenizer.normalize(output))

Expected behavior

Following the lovely blog post about speculative decoding, I have seen that it's possible to use it in a pipeline for ASR with a model and its distil variant. I have also seen that it's possible to use original Whisper models of different sizes, e.g. large-v2 and tiny. However, when I try to use original models of different sizes via a pipeline, I get an error that does not occur when using the same models outside of a pipeline.

@amyeroberts (Collaborator)

cc @sanchit-gandhi @gante

@sanchit-gandhi (Contributor)

cc @kamilakesbi could you take a look here? Related: #29869

MrRace commented May 20, 2024

same problem!

kamilakesbi (Contributor) commented May 20, 2024

Hi @MrRace,

The PR associated with this issue should be merged soon ;)

@kamilakesbi (Contributor)

#30637 has been merged, so I'm closing this issue.

MrRace commented Jun 15, 2024

@kamilakesbi
The new transformers version (4.42.0.dev0) still raises an error; the specific error message is as follows:

ValueError: The main model and the assistant don't have compatible encoder-dependent input shapes. Ensure you load the assistant with the correct encoder-decoder class, e.g. `AutoModelForSpeechSeq2Seq` for Whisper.

I use whisper-tiny as the assistant model and whisper-base as the main model.

MrRace commented Jun 15, 2024

@kamilakesbi
My test script is as follows:

import os
import time

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, AutoModelForCausalLM
from datasets import load_dataset
from tqdm import tqdm
from evaluate import load

# device and dtype used by the models and inputs below
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_dir = "/share_model_zoo/LLM"
model_id = "openai/whisper-base"
assistant_model_id = "openai/whisper-tiny"
model_path = os.path.join(model_dir, model_id)
assistant_model_path = os.path.join(model_dir, assistant_model_id)
print("model_path=", model_path)
print("assistant_model_path=", assistant_model_path)


model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
model.to(device)
print("Load model success")

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
    # pretraining_tp=1
)
assistant_model.config.pretraining_tp = 1
assistant_model.to(device)

processor = AutoProcessor.from_pretrained(model_path)

def assisted_generate_with_time(model, inputs, **kwargs):
    """Run assisted generation and return the outputs together with the elapsed generation time."""
    start_time = time.time()
    outputs = model.generate(**inputs, assistant_model=assistant_model, **kwargs)
    generation_time = time.time() - start_time
    return outputs, generation_time
    

all_time = 0
predictions = []
references = []

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation", cache_dir="./tmp/")
print("Load dataset success")

for sample in tqdm(dataset):
    audio = sample["audio"]
    inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
    inputs = inputs.to(device=device, dtype=torch_dtype)
    output, gen_time = assisted_generate_with_time(model, inputs)
    all_time += gen_time
    predictions.append(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])
    references.append(processor.tokenizer._normalize(sample["text"]))

print("all_time=", all_time)

@kamilakesbi (Contributor)

Hi @MrRace,

Thanks for this question!

When using Whisper models with different encoder sizes (such as whisper-tiny and whisper-base here), you should make sure that both models are loaded with AutoModelForSpeechSeq2Seq (an encoder-decoder model).

If you load the assistant with AutoModelForCausalLM (a decoder-only model), the assistant consists of a decoder only and takes the output of the main model's encoder as input. If the output sizes of the main and assistant encoders differ, you will get an error.

If instead you load your assistant with AutoModelForSpeechSeq2Seq, the assistant is an encoder-decoder model: it processes the input with its own encoder before passing it to its decoder, and you won't get an error.
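
For example, here is a minimal sketch of the working setup. It uses the public openai/whisper-base and openai/whisper-tiny checkpoints and the dummy LibriSpeech sample from the original report rather than your local paths; adjust the model paths, device, and dtype to your environment.

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

main_model_id = "openai/whisper-base"
assistant_model_id = "openai/whisper-tiny"

processor = AutoProcessor.from_pretrained(main_model_id)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    main_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)

# Load the assistant as an encoder-decoder model so it runs its own encoder
# instead of reusing the main model's encoder output (whose size differs).
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation[:1]")
audio = dataset[0]["audio"]
inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
inputs = inputs.to(device=device, dtype=torch_dtype)

output = model.generate(**inputs, assistant_model=assistant_model)
print(processor.batch_decode(output, skip_special_tokens=True, normalize=True)[0])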

Hope this helps!
