input_ids are not moved to GPU #62

Closed
marco-ve opened this issue Apr 20, 2023 · 7 comments

@marco-ve

I'm running this locally with downloaded h2oai_pipeline:

```python
import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", torch_dtype=torch.bfloat16, device_map="auto")

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text("Why is drinking water so healthy?", return_full_text=True, max_new_tokens=100)
print(res[0]["generated_text"])
```

And while the generation works, I get this Warning:

```
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1359: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example `input_ids = input_ids.to('cuda')` before running `.generate()`.
  warnings.warn(
```
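
For reference, the manual workaround the warning itself suggests would look roughly like this outside your pipeline (just a sketch that skips your prompt formatting; I'd rather do this through the pipeline):

```python
# Sketch, continuing from the snippet above (tokenizer and model already loaded):
# tokenize manually, move the tensors to the model's device, and set pad_token_id
# explicitly to silence the open-end generation message.
inputs = tokenizer("Why is drinking water so healthy?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```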

Question 1: How do I make your custom pipeline move the input_ids to GPU?

Question 2: How do I make your custom pipeline set the pad_token_id to suppress the info log?

Question 3: The response from your custom pipeline is just plain text, no history. How do I build a conversation?

Thanks!

@pseudotensor
Collaborator

Hi, try this updated example and note the memory limits:

https://github.com/h2oai/h2ogpt/blob/main/pipeline_example.py
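
In short, that example loads a smaller model in 8-bit so it fits in less GPU memory. A rough sketch of the idea (not a verbatim copy of the file; the checkpoint name is only a placeholder):

```python
from transformers import AutoModelForCausalLM

# Sketch of 8-bit loading (requires a working bitsandbytes install); the checkpoint
# name is a placeholder for a smaller h2oGPT model that fits your GPU.
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-oig-oasst1-512-6_9b",  # placeholder
    load_in_8bit=True,
    device_map="auto",
)
```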

@marco-ve
Author

Thanks for the reply!

So this is just loading a different, smaller model, and in 8-bit?

I had tried running your models in 8bit before, but unfortunately I can't get bitsandbytes to work on my current setup due to a "UserWarning: The installed version of bitsandbytes was compiled without GPU support", and I don't have time to troubleshoot that.

But thanks anyway for publishing what is to my knowledge the first LLM with a permissive license that is coherent in languages other than English!

@pseudotensor
Collaborator

pseudotensor commented Apr 21, 2023

Understood, no problem. If you have a big enough GPU, you don't need 8-bit. The main issue, I think, was device_map='auto', which spreads the model/generation across multiple devices (GPU/CPU), and torch/transformers don't handle that perfectly.

But on a 24GB board, 16-bit won't work for a 12B model; you'd need to drop down to 6B. A 20B model would require about 48GB, and even then would be at the edge of OOM, allowing only short generations (short new tokens and short inputs).

If you have multiple GPUs, device_map='auto' will spread the model across them, and in many cases that works. But it can also fail due to torch/transformers bugs, with the same kind of device mismatch between (e.g.) cuda:0 and cuda:1.

So GPU memory is a tight constraint. If you share more info about your GPU, I might still be able to help.
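
If you do stay with device_map='auto' across two GPUs, one thing that sometimes helps (a sketch with illustrative numbers, not a tested recipe) is capping per-device memory so accelerate leaves headroom for generation:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: shard the 20B model across both GPUs but cap per-device memory so some room
# is left for activations and the KV cache during generation. The caps are illustrative.
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-oig-oasst1-256-20b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB"},
)
```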

@marco-ve
Author

I am running a dual 3090 setup, which is why I expected it to work. But as you say, something is clearly going wrong under the hood: the inputs are not being moved to the GPU for whatever reason. That's why I asked whether I can manually move input_ids to the GPU with your pipeline.

@pseudotensor
Collaborator

Yes, a 24GB board should work with a 6B model, and you can set load_in_8bit=False to avoid needing bitsandbytes. However, a 12B model will not fit on 24GB with load_in_8bit=False.
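
As a sketch (the checkpoint name below is only a placeholder for whichever ~6B model you pick), 16-bit on a single 24GB card without bitsandbytes would look roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from h2oai_pipeline import H2OTextGenerationPipeline

# Sketch: a ~6B checkpoint in 16-bit on one 24GB card, no bitsandbytes required.
# The model name is a placeholder; substitute the 6B checkpoint you actually use.
model_name = "h2oai/h2ogpt-oig-oasst1-512-6_9b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda:0")

# Depending on your transformers version, you may also need to tell the pipeline which
# device to put inputs on so they end up on the GPU alongside the model.
generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)
res = generate_text("Why is drinking water so healthy?", max_new_tokens=100)
print(res[0]["generated_text"])
```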

@marco-ve
Author

FYI, I just updated transformers (for entirely unrelated reasons), installing 4.29.0.dev0 directly from GitHub because PyPI was behind.

Running oig-oasst1-256-20b, I am no longer getting the message about input_ids being on the wrong device, and during inference nvidia-smi shows both GPUs being under load.

For reference, this is the code I'm running:

```python
import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", torch_dtype=torch.bfloat16, device_map="auto")

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer, temperature=0.8)

res = generate_text("Why is drinking water so healthy?", return_full_text=True, max_new_tokens=100)
print(res[0]["generated_text"])
```

Some unrelated notes/feedback:

  • The model's German language text generation seems more coherent than e.g. Dolly 2.0, although it still tends to be factually incorrect.
  • Providing a temperature parameter to the pipeline is accepted without error, but does not seem to have any effect. I may be misunderstanding the huggingface TextGenerationPipeline API here by assuming that the pipeline calls model.generate() with the provided temperature at some point down the stack.
  • Independent of this, completions seem to be deterministic, as running the same prompt ("drinking water") several times results in the exact same response.
  • Completions remain deterministic even when increasing max_new_tokens. I assume the model just tends to produce EOS within the default 100 tokens.
  • This is also suggested by the behaviour when forcing longer completions by asking for a specific word count:
res = generate_text("Explain to me the difference between nuclear fission and fusion in 300 words.", return_full_text=True, max_new_tokens=200)
print(res[0]["generated_text"])

results in

```
Nuclear fission and fusion are two different types of nuclear reactions that occur in the nucleus of an atom. Nuclear fission is the splitting of the nucleus of an atom into two smaller nuclei, while nuclear fusion is the joining of two nuclei to form a single larger nucleus.

Fission is a process that occurs naturally in the environment, and is the process by which a large nucleus splits into two smaller nuclei. Fusion is a process that occurs in the laboratory, and is the process by which two smaller nuclei are joined to form a single larger nucleus.

Fission is a process that occurs naturally in the environment, and is the process by which a large nucleus splits into two smaller nuclei. Fusion is a process that occurs in the laboratory, and is the process by which two smaller nuclei are joined to form a single larger nucleus.

Fission is a process that occurs naturally in the environment, and is the process by which a large nucleus splits into two smaller nuclei. Fusion is a process that
```

You're probably well aware of the limitations, but I thought I'd post my observations nonetheless.

@pseudotensor
Collaborator

Hi, thanks for the excellent feedback. Really appreciate it.

Here's some help with those options. One needs to set do_sample=True for any of those logits modifications to be applied; otherwise plain greedy generation is performed:

huggingface/transformers#22405 (comment)

In case you are curious, check out:

https://huggingface.co/blog/how-to-generate

So, if you'd like more variance (via top_p or top_k) or more creative output, you need to set do_sample=True; otherwise all those options are ignored. I'll try to make that clearer in the Gradio UI.
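
For example, through your generate_text pipeline from above it would look something like this (a sketch; the sampling values are only illustrative):

```python
# Sketch: with do_sample=True the temperature / top_p settings actually take effect;
# without it, generation is greedy and these knobs are silently ignored.
res = generate_text(
    "Why is drinking water so healthy?",
    return_full_text=True,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,  # illustrative
)
print(res[0]["generated_text"])
```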
