input_ids are not moved to GPU #62
Comments
Hi, try this updated example and note the memory limits: https://github.com/h2oai/h2ogpt/blob/main/pipeline_example.py
Thanks for the reply! So this is just loading a different and smaller model, and in 8-bit? I had tried running your models in 8-bit before, but unfortunately I can't get bitsandbytes to work on my current setup due to a "UserWarning: The installed version of bitsandbytes was compiled without GPU support", and I don't have time to troubleshoot that. But thanks anyway for publishing what is to my knowledge the first LLM with a permissive license that is coherent in languages other than English!
Understood, no problem. If you had a big enough GPU, you wouldn't need 8-bit. The main issue, I think, was device_map='auto', which spreads the model/generation across multiple devices (GPU/CPU), and torch/transformers don't handle that perfectly. With a 24GB board, 16-bit won't work for the 12B model; you'd need to drop down to 6B. The 20B model would require about 48GB and would sit at the edge of OOM, allowing only short generations (short new tokens and short inputs). If you have multiple GPUs, device_map='auto' will spread the model across them, and in many cases that works, but it can also fail due to torch/transformers bugs, with the same kind of device mismatch between (e.g.) cuda:0 and cuda:1. So GPU memory is the tight constraint. If you share more details about your GPU setup, we might still be able to help.
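As a rough illustration of the memory-limit idea referenced above, here is a minimal sketch of capping per-device memory when loading with device_map='auto'. The GiB values and the CPU allowance are illustrative assumptions, not values taken from pipeline_example.py:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "h2oai/h2ogpt-oig-oasst1-256-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
# Cap what each device may hold so accelerate plans the split explicitly
# instead of filling devices greedily; the values below are illustrative only.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "48GiB"},
)
```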
I am running a dual 3090 setup, which is why I expected it to work. But as you state, there clearly is something going wrong under the hood where the inputs are not moved to GPU for whatever reason. That's why I asked if I can manually move input_ids to GPU with your pipeline.
Yes, a 24GB board should work with the 6B model, and if you set load_in_8bit=False you avoid the need for bitsandbytes. However, a 12B model will not fit on 24GB with load_in_8bit=False.
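A minimal sketch of that single-GPU, 16-bit setup; the 6B checkpoint name below is a placeholder assumption, so substitute whichever h2ogpt 6B model applies:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical 6B checkpoint name -- replace with the actual h2ogpt 6B model.
model_name = "h2oai/h2ogpt-oig-oasst1-512-6_9b"
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # 16-bit weights, no bitsandbytes required
    load_in_8bit=False,          # explicit: skip 8-bit quantization
    device_map={"": 0},          # keep the whole model on a single 24GB GPU
)
```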
FYI, I just updated transformers (for entirely unrelated reasons), installing 4.29.0.dev0 directly from GitHub because PyPI was behind. Running oig-oasst1-256-20b, I am no longer getting the message about input_ids being on the wrong device, and during inference nvidia-smi shows both GPUs under load. For reference, this is the code I'm running:
Some unrelated notes/feedback:
results in
You're probably well aware of the limitations, but I thought I'd post my observations nonetheless.
Hi, thanks for the excellent feedback, really appreciated. Here's some help with those options. You need to set do_sample=True for any of those logits modifications to be applied; otherwise only greedy generation is performed: huggingface/transformers#22405 (comment). In case you are curious, check out https://huggingface.co/blog/how-to-generate. So if you'd like more variance (via top_p or top_k) or more creative output, you need to set do_sample=True; otherwise all those options are ignored. I'll try to make that clearer in the gradio UI.
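A minimal sketch of what enabling sampling could look like with the pipeline call quoted below, assuming H2OTextGenerationPipeline forwards the standard transformers generation kwargs; the temperature/top_p/top_k values are illustrative:

```python
# Sampling options only take effect when do_sample=True; otherwise generation
# falls back to greedy decoding and temperature/top_p/top_k are ignored.
res = generate_text(
    "Why is drinking water so healthy?",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    return_full_text=True,
)
print(res[0]["generated_text"])
```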
I'm running this locally with downloaded h2oai_pipeline:
```python
import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", torch_dtype=torch.bfloat16, device_map="auto")

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text("Why is drinking water so healthy?", return_full_text=True, max_new_tokens=100)
print(res[0]["generated_text"])
```
And while the generation works, I get this Warning:
```
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1359: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(
```
Question 1: How do I make your custom pipeline move the input_ids to GPU?
Question 2: How do I make your custom pipeline set the pad_token_id to suppress the info log?
Question 3: The response from your custom pipeline is just plain text, no history. How do I build a conversation?
Thanks!
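For reference, a sketch of the device-placement workaround the warning itself suggests, written against the plain transformers API rather than the h2oai pipeline, and reusing the model and tokenizer loaded above; the explicit pad_token_id is the standard transformers way to silence that log, not something specific to H2OTextGenerationPipeline:

```python
# Generic transformers workaround sketch -- not the h2oai pipeline's own
# mechanism: tokenize, move the tensors to the model's device, and set
# pad_token_id explicitly to suppress the "Setting pad_token_id..." message.
inputs = tokenizer("Why is drinking water so healthy?", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```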