
Problem with latency #113

Closed
PetroMaslov opened this issue Feb 20, 2023 · 5 comments

Comments

@PetroMaslov

Hi!

I trained t5-large using the following LoRA config:

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM"
)

I saved the model with the following line of code:
model.save_pretrained(path)

I have 2 questions:

  • It takes about 2000 ms (on an A100 40GB) to generate a response for one example. Could you please help me reduce the latency?
  • I tried to convert the model to ONNX using Optimum, but it doesn't work (not all weights are loaded):
from transformers import T5Tokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = T5Tokenizer.from_pretrained("model_path")
ort_model = ORTModelForSeq2SeqLM.from_pretrained("model_path", from_transformers=True, provider="CUDAExecutionProvider")
onnx_pipe = pipeline("text2text-generation", model=ort_model, tokenizer=tokenizer, device="cuda",
                     batch_size=8, max_length=512, truncation=True)

Maybe I need to save the model in another way?
Could you please help me understand what I am doing wrong?
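
A likely reason the Optimum export cannot find all the weights is that save_pretrained on a PEFT-wrapped model writes only the adapter files, not the full T5 checkpoint. A quick way to confirm this, as a minimal sketch ("model_path" is just the placeholder path from the snippet above):

import os

# Inspect what save_pretrained wrote for the PEFT-wrapped model.
# For a PeftModel this is typically only the adapter, not the full T5 weights,
# which would explain why ORTModelForSeq2SeqLM cannot load a complete model here.
print(os.listdir("model_path"))
# Expect something like: ['adapter_config.json', 'adapter_model.bin']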

@TRT-BradleyB

TRT-BradleyB commented Feb 22, 2023

I've yet to use the library myself - I've just looked at the documentation. It could be something along the lines of:

from peft import LoraConfig, TaskType, get_peft_model
from optimum.onnxruntime import ORTModelForSeq2SeqLM

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=True, r=8, lora_alpha=32, lora_dropout=0.1
)

model = ORTModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)

Note the inference_mode=True. If I read the paper right, I believe this should calculate W' = W0 + BA, updating the weights so that the model behaves like a normal transformer of size W0.
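
For intuition, that merge can be reproduced for a single linear layer in a few lines (a toy sketch only; the dimensions are made up and PEFT performs this fold internally):

import torch

# Toy illustration of the LoRA merge W' = W0 + (alpha / r) * B @ A.
d, r, alpha = 1024, 16, 32
W0 = torch.randn(d, d)        # frozen pretrained weight
A = torch.randn(r, d) * 0.01  # LoRA down-projection
B = torch.zeros(d, r)         # LoRA up-projection (initialised to zero)

W_merged = W0 + (alpha / r) * (B @ A)

# With B at its zero initialisation the merged weight equals W0,
# so inference behaves exactly like the base transformer.
assert torch.allclose(W_merged, W0)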

@pacman100
Contributor

Hello @PetroMaslov, for using ONNX with a PEFT LoRA model, please refer to the example in PR #118. Let us know if that solves the issue.
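
For reference, a minimal sketch of the merge-then-export flow (not necessarily identical to the PR #118 example; it assumes merge_and_unload() from recent PEFT releases, the export=True argument from recent Optimum versions, and placeholder paths):

from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Load the base model and attach the saved LoRA adapter.
base = AutoModelForSeq2SeqLM.from_pretrained("t5-large")
model = PeftModel.from_pretrained(base, "model_path")  # directory written by save_pretrained

# Fold the LoRA deltas into the base weights so the checkpoint is a plain T5 model.
merged = model.merge_and_unload()
merged.save_pretrained("merged_model")
AutoTokenizer.from_pretrained("t5-large").save_pretrained("merged_model")

# Export the merged checkpoint to ONNX and run it on GPU.
ort_model = ORTModelForSeq2SeqLM.from_pretrained(
    "merged_model", export=True, provider="CUDAExecutionProvider"
)

The merged directory is then a standard Transformers checkpoint, so the existing tokenizer and pipeline code should work against the exported model unchanged.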

@PetroMaslov
Author

PetroMaslov commented Feb 24, 2023

@pacman100
Great, thank you very much.
I found that for me ONNX (flan-t5-large) is faster than DeepSpeed. Is that expected, or am I doing something wrong?

@github-actions

github-actions bot commented Apr 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@ingo-m

ingo-m commented Jul 6, 2023

@PetroMaslov after converting the LoRA finetuned T5-large model to ONNX, did you see degraded model outputs? Using a bigscience/bloom base model, I can perform inference after exporting to ONNX, but the model predictions become nonsensical 🤔 #670
