When I load the model and run inference through the Hugging Face framework, the model is loaded into GPU memory, but GPU utilization stays at 0% while CPU usage sits at 100%. Here is the loading code:
```python
import torch
from transformers import LlamaForCausalLM

def load_openchat_model(model_path: str, device_map):
    # device_map is accepted but not used here
    model = LlamaForCausalLM.from_pretrained(
        model_path,
        load_in_8bit=False,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
    )
    model.to("cuda:0")  # move the weights to the first GPU
    model.eval()
    return model
```
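One quick way to confirm the weights actually ended up on the GPU is to inspect the parameter devices after loading. A minimal diagnostic sketch, assuming the `load_openchat_model` function above; the model path is a placeholder:

```python
model = load_openchat_model("path/to/openchat", device_map=None)  # placeholder path

# Every parameter should report cuda:0 after model.to("cuda:0")
devices = {p.device for p in model.parameters()}
print(devices)  # expected: {device(type='cuda', index=0)}
```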
Inference code:
```python
def infer_hf(input_text: str, model, tokenizer, device):
    generation_config = dict(
        temperature=0.8,
        top_k=40,
        top_p=0.9,
        do_sample=True,
        num_beams=1,
        repetition_penalty=1.3,
        max_new_tokens=400,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    with torch.inference_mode():
        inputs = tokenizer(input_text, return_tensors="pt")
        generation_output = model.generate(
            input_ids=inputs["input_ids"].to(device),        # move inputs to the GPU
            attention_mask=inputs["attention_mask"].to(device),
            **generation_config,
        )
    s = generation_output[0]
    output = tokenizer.decode(s)
    print(output)
```
I set `device` to `"cuda:0"`.
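For what it's worth, low utilization in `nvidia-smi` during single-sequence generation is not always a reliable sign the GPU is idle: decoding 400 tokens at batch size 1 launches many small kernels, so sampled utilization can read low even when the work runs on the GPU. One way to check where the computation actually happens is to synchronize and time the `generate` call; a minimal sketch, assuming `model`, `tokenizer`, and `device` from above, with a hypothetical prompt:

```python
import time
import torch

inputs = tokenizer("Hello, how are you?", return_tensors="pt")

torch.cuda.synchronize()            # wait for any prior GPU work to finish
start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(
        input_ids=inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device),
        max_new_tokens=50,
    )
torch.cuda.synchronize()            # wait for generate's kernels to finish
print(f"{time.perf_counter() - start:.2f}s, output tensor on {out.device}")
```

If the output tensor reports `cuda:0`, the forward passes ran on the GPU and the 100% CPU core is likely just the Python process driving the token-by-token decode loop.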