input_ids are not moved to GPU #62

Closed
marco-ve opened this issue Apr 20, 2023 · 7 comments

@marco-ve

I'm running this locally with downloaded h2oai_pipeline:

```python
import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", torch_dtype=torch.bfloat16, device_map="auto")

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text("Why is drinking water so healthy?", return_full_text=True, max_new_tokens=100)
print(res[0]["generated_text"])
```

And while the generation works, I get this Warning:

```
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
/opt/conda/lib/python3.10/site-packages/transformers/generation/utils.py:1359: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example `input_ids = input_ids.to('cuda')` before running `.generate()`.
  warnings.warn(
```
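
For reference, the manual workaround the warning itself suggests would look roughly like this outside your pipeline (just a sketch that skips your prompt formatting; I'd rather do this through the pipeline):

```python
# Sketch, continuing from the snippet above (tokenizer and model already loaded):
# tokenize manually, move the tensors to the model's device, and set pad_token_id
# explicitly to silence the open-end generation message.
inputs = tokenizer("Why is drinking water so healthy?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```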

Question 1: How do I make your custom pipeline move the input_ids to GPU?

Question 2: How do I make your custom pipeline set the pad_token_id to suppress the info log?

Question 3: The response from your custom pipeline is just plain text, no history. How do I build a conversation?

Thanks!

@pseudotensor
Collaborator

Hi, try this updated example and note the memory limits:

https://github.com/h2oai/h2ogpt/blob/main/pipeline_example.py
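
In short, that example loads a smaller model in 8-bit so it fits in less GPU memory. A rough sketch of the idea (not a verbatim copy of the file; the checkpoint name is only a placeholder):

```python
from transformers import AutoModelForCausalLM

# Sketch of 8-bit loading (requires a working bitsandbytes install); the checkpoint
# name is a placeholder for a smaller h2oGPT model that fits your GPU.
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-oig-oasst1-512-6_9b",  # placeholder
    load_in_8bit=True,
    device_map="auto",
)
```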

@marco-ve
Author

Thanks for the reply!

So this is just loading a different, smaller model, and in 8-bit?

I had tried running your models in 8bit before, but unfortunately I can't get bitsandbytes to work on my current setup due to a "UserWarning: The installed version of bitsandbytes was compiled without GPU support", and I don't have time to troubleshoot that.

But thanks anyway for publishing what is to my knowledge the first LLM with a permissive license that is coherent in languages other than English!

@pseudotensor
Collaborator

pseudotensor commented Apr 21, 2023

Understood, no problem. If you have a big enough GPU, you don't need 8-bit. The main issue, I think, was device_map='auto', which spreads the model/generation across multiple devices (GPU/CPU), and torch/transformers don't handle that perfectly.

But on a 24GB board, 16-bit won't work for a 12B model; you'd need to drop down to 6B. A 20B model would require about 48GB, and even then would be at the edge of OOM, allowing only short generations (short new tokens and short inputs).

If you have multiple GPUs, device_map='auto' will spread the model across them, and in many cases that works. But it can also fail due to torch/transformers bugs, with the same kind of device mismatch between (e.g.) cuda:0 and cuda:1.

So GPU memory is a tight constraint. If you share more info about your GPU, I might still be able to help.
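
If you do stay with device_map='auto' across two GPUs, one thing that sometimes helps (a sketch with illustrative numbers, not a tested recipe) is capping per-device memory so accelerate leaves headroom for generation:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch: shard the 20B model across both GPUs but cap per-device memory so some room
# is left for activations and the KV cache during generation. The caps are illustrative.
model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-oig-oasst1-256-20b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB"},
)
```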

@marco-ve
Author

I am running a dual 3090 setup, which is why I expected it to work. But as you say, something is clearly going wrong under the hood: the inputs are not being moved to the GPU for whatever reason. That's why I asked whether I can manually move input_ids to the GPU with your pipeline.

@pseudotensor
Collaborator

Yes, a 24GB board should work with a 6B model, and you can set load_in_8bit=False to avoid needing bitsandbytes. However, a 12B model will not fit on 24GB with load_in_8bit=False.
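
As a sketch (the checkpoint name below is only a placeholder for whichever ~6B model you pick), 16-bit on a single 24GB card without bitsandbytes would look roughly like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from h2oai_pipeline import H2OTextGenerationPipeline

# Sketch: a ~6B checkpoint in 16-bit on one 24GB card, no bitsandbytes required.
# The model name is a placeholder; substitute the 6B checkpoint you actually use.
model_name = "h2oai/h2ogpt-oig-oasst1-512-6_9b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda:0")

# Depending on your transformers version, you may also need to tell the pipeline which
# device to put inputs on so they end up on the GPU alongside the model.
generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)
res = generate_text("Why is drinking water so healthy?", max_new_tokens=100)
print(res[0]["generated_text"])
```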

@marco-ve
Author

FYI, I just updated transformers (for entirely unrelated reasons), installing 4.29.0.dev0 directly from GitHub because PyPI was behind.

Running oig-oasst1-256-20b, I am no longer getting the message about input_ids being on the wrong device, and during inference nvidia-smi shows both GPUs being under load.

For reference, this is the code I'm running:

```python
import torch
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained("h2oai/h2ogpt-oig-oasst1-256-20b", torch_dtype=torch.bfloat16, device_map="auto")

generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer, temperature=0.8)

res = generate_text("Why is drinking water so healthy?", return_full_text=True, max_new_tokens=100)
print(res[0]["generated_text"])
```

Some unrelated notes/feedback:

  • The model's German language text generation seems more coherent than e.g. Dolly 2.0, although it still tends to be factually incorrect.
  • Providing a temperature parameter to the pipeline is accepted without error, but does not seem to have any effect. I may be misunderstanding the huggingface TextGenerationPipeline API here by assuming that the pipeline calls model.generate() with the provided temperature at some point down the stack.
  • Independent of this, completions seem to be deterministic, as running the same prompt ("drinking water") several times results in the exact same response.
  • Completions remain deterministic even when increasing max_new_tokens. I assume the model just tends to produce EOS within the default 100 tokens.
  • This is also suggested by the behaviour when forcing longer completions by asking for a specific word count:
res = generate_text("Explain to me the difference between nuclear fission and fusion in 300 words.", return_full_text=True, max_new_tokens=200)
print(res[0]["generated_text"])

results in

```
Nuclear fission and fusion are two different types of nuclear reactions that occur in the nucleus of an atom. Nuclear fission is the splitting of the nucleus of an atom into two smaller nuclei, while nuclear fusion is the joining of two nuclei to form a single larger nucleus.

Fission is a process that occurs naturally in the environment, and is the process by which a large nucleus splits into two smaller nuclei. Fusion is a process that occurs in the laboratory, and is the process by which two smaller nuclei are joined to form a single larger nucleus.

Fission is a process that occurs naturally in the environment, and is the process by which a large nucleus splits into two smaller nuclei. Fusion is a process that occurs in the laboratory, and is the process by which two smaller nuclei are joined to form a single larger nucleus.

Fission is a process that occurs naturally in the environment, and is the process by which a large nucleus splits into two smaller nuclei. Fusion is a process that
```

You're probably well aware of the limitations, but I thought I'd post my observations nonetheless.

@pseudotensor
Collaborator

Hi, thanks for the excellent feedback. Really appreciate it.

Here's some help with those options. One needs to set do_sample=True for any of those logits modifications to be applied; otherwise plain greedy generation is performed:

huggingface/transformers#22405 (comment)

In case you are curious, check out:

https://huggingface.co/blog/how-to-generate

So, if you'd like more variance (via top_p or top_k) or more creative output, you need to set do_sample=True; otherwise all those options are ignored. I'll try to make that clearer in the Gradio UI.
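
For example, through your generate_text pipeline from above it would look something like this (a sketch; the sampling values are only illustrative):

```python
# Sketch: with do_sample=True the temperature / top_p settings actually take effect;
# without it, generation is greedy and these knobs are silently ignored.
res = generate_text(
    "Why is drinking water so healthy?",
    return_full_text=True,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,  # illustrative
)
print(res[0]["generated_text"])
```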
