
Problem with latency #113

Closed
PetroMaslov opened this issue Feb 20, 2023 · 5 comments

Comments

@PetroMaslov

Hi!

I trained t5-large using the following LoRA config:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)

I saved the model with the following line of code:
model.save_pretrained(path)

I have 2 questions:

  • It takes about 2000 ms (on an A100 40GB) to generate a response for one example. Could you please help me reduce the latency?
  • I tried to convert the model to ONNX using Optimum, but it doesn't work (not all weights are loaded):
from transformers import T5Tokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = T5Tokenizer.from_pretrained("model_path")
ort_model = ORTModelForSeq2SeqLM.from_pretrained("model_path", from_transformers=True, provider="CUDAExecutionProvider")
onnx_pipe = pipeline("text2text-generation", model=ort_model, tokenizer=tokenizer, device="cuda",
                     batch_size=8, max_length=512, truncation=True)

Maybe I need to save the model in another way?
Could you please help me understand what I am doing wrong?
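
A likely reason the Optimum export cannot find all the weights is that save_pretrained on a PEFT-wrapped model writes only the adapter files, not the full T5 checkpoint. A quick way to confirm this, as a minimal sketch ("model_path" is just the placeholder path from the snippet above):

import os

# Inspect what save_pretrained wrote for the PEFT-wrapped model.
# For a PeftModel this is typically only the adapter, not the full T5 weights,
# which would explain why ORTModelForSeq2SeqLM cannot load a complete model here.
print(os.listdir("model_path"))
# Expect something like: ['adapter_config.json', 'adapter_model.bin']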

@TRT-BradleyB

TRT-BradleyB commented Feb 22, 2023

I've yet to use the library myself - I've just looked at the documentation. It could be something along the lines of:

from peft import LoraConfig, TaskType, get_peft_model
from optimum.onnxruntime import ORTModelForSeq2SeqLM

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=True, r=8, lora_alpha=32, lora_dropout=0.1
)

model = ORTModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)

Note the inference_mode=True. If I read the paper right, I believe this should calculate W' = W0 + BA, updating the weights so that the model behaves like a normal transformer of size W0.
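
For intuition, that merge can be reproduced for a single linear layer in a few lines (a toy sketch only; the dimensions are made up and PEFT performs this fold internally):

import torch

# Toy illustration of the LoRA merge W' = W0 + (alpha / r) * B @ A.
d, r, alpha = 1024, 16, 32
W0 = torch.randn(d, d)        # frozen pretrained weight
A = torch.randn(r, d) * 0.01  # LoRA down-projection
B = torch.zeros(d, r)         # LoRA up-projection (initialised to zero)

W_merged = W0 + (alpha / r) * (B @ A)

# With B at its zero initialisation the merged weight equals W0,
# so inference behaves exactly like the base transformer.
assert torch.allclose(W_merged, W0)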

@pacman100
Contributor

Hello @PetroMaslov, for using ONNX with a PEFT LoRA model, please refer to the example in PR #118. Let us know if that solves the issue.
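
For reference, a minimal sketch of the merge-then-export flow (not necessarily identical to the PR #118 example; it assumes merge_and_unload() from recent PEFT releases, the export=True argument from recent Optimum versions, and placeholder paths):

from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Load the base model and attach the saved LoRA adapter.
base = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
model = PeftModel.from_pretrained(base, "model_path")  # directory written by save_pretrained

# Fold the LoRA deltas into the base weights so the checkpoint is a plain T5 model.
merged = model.merge_and_unload()
merged.save_pretrained("merged_model")
AutoTokenizer.from_pretrained("t5-large").save_pretrained("merged_model")

# Export the merged checkpoint to ONNX and run it on GPU.
ort_model = ORTModelForSeq2SeqLM.from_pretrained(
    "merged_model", export=True, provider="CUDAExecutionProvider"
)

The merged directory is then a standard Transformers checkpoint, so the existing tokenizer and pipeline code should work against the exported model unchanged.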

@PetroMaslov
Author

PetroMaslov commented Feb 24, 2023

@pacman100
Great, thank you very much.
I found that for me ONNX (flan-t5-large) is faster than DeepSpeed. Is that expected, or am I doing something wrong?

@github-actions

github-actions bot commented Apr 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@ingo-m

ingo-m commented Jul 6, 2023

@PetroMaslov after converting the LoRA finetuned T5-large model to ONNX, did you see degraded model outputs? Using a bigscience/bloom base model, I can perform inference after exporting to ONNX, but the model predictions become nonsensical 🤔 #670
